Machine Learning Lecture 14 "(Linear) Support Vector Machines" -Cornell CS4780 SP17

Video Statistics and Information

Captions
Welcome, everybody. A few logistics items: I handed out the handouts already, so please make sure you have one. Project 3 is still going strong; I see a lot of submissions, which is very nice. Project 4 is almost ready; we are still testing a few aspects with Vocareum to make sure everything works seamlessly, so it may be delayed by a few days, perhaps until Friday. You will still get your full two weeks, so this is not eating into your time. Also, the CEO of Vocareum emailed me; he has seen your complaints and suggestions. Thanks to those who posted on Piazza; they are taking it very seriously, they found some bugs in their software, and hopefully most of these issues will be addressed soon, so please continue to post there.

The other thing: the exam is just around the corner, less than two weeks away, so please get started. The exam is a huge part of the grade; the biggest failure mode in this class is to do really well on the projects but not well on the exam, so please take it seriously. Regarding the makeup exam, we are trying to find one date that works for everybody, possibly a Sunday; it is really hard to get everybody in the same room, so please follow the Piazza posts. If you miss the makeup exam, it ends up being an oral exam in my office.

Last time we talked about the squared loss. The squared loss is basically linear regression: we have our data and we assume that a line is a good model of it, i.e., the data set has a linear relationship between x and y. We then made a second assumption: of course the points won't lie exactly on that line, so if they are off, we assume Gaussian noise, a Gaussian centered at the value the line predicts. We derived that the resulting loss function is a parabola, and once we went through all the notation it turned out to be quite simple:

L(w) = sum_i (w^T x_i - y_i)^2.

This is the loss you minimize with gradient descent, and it is the MLE answer; for the MAP answer you add a regularizer, lambda * ||w||^2.

Because this is a parabola, it actually has a closed-form solution: you can find the minimum in closed form. (Equivalently, if you use Newton's method and take a single step with the Hessian, you land right at the minimum.) Let me spend three minutes deriving it. One question people ask is: if there is a closed-form solution, why don't we always use it instead of gradient descent? The answer, as you will see in a minute, is that the closed-form solution requires O(d^2) memory and O(d^3) computation, which for large enough dimension is prohibitive.

Here is the derivation. Define a matrix X with X^T = [x_1, ..., x_n], so every row of X is one data input (exactly the kind of matrix you use in your projects), and y^T = [y_1, ..., y_n]. By the way, please everybody close your laptops, thank you. Then I can write the loss function in matrix notation as

L(w) = ||Xw - y||^2 = (Xw - y)^T (Xw - y).

Multiplying this out (completing the square) gives

L(w) = w^T X^T X w - 2 y^T X w + y^T y.

Writing it this way makes it very obvious that there is a closed-form solution: X^T X is a positive definite matrix, so this is a parabola in w. To find the minimum we take the derivative with respect to w and set it to zero, which is something you have been doing your whole life. (If you are not familiar with taking derivatives with respect to vectors, please look at the Matrix Cookbook; it is very useful and can save you hours. You can always fall back to taking the derivative with respect to every single w_i; that always works, it is just a lot of notation and summation.) The derivative of the quadratic term is 2 X^T X w, because w appears on both sides, and the derivative of the linear term is -2 X^T y. Setting the derivative to zero,

2 X^T X w - 2 X^T y = 0,

we drop the twos on both sides, and since we now have a matrix times w equal to something, we multiply by the inverse of that matrix:

w = (X^T X)^{-1} X^T y.
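As a concrete sketch of the derivation above, here is a minimal NumPy implementation of the closed-form solution; the toy data and variable names are made up for illustration and are not from the lecture.

```python
import numpy as np

# Made-up toy data: n points in d dimensions.
n, d = 100, 5
rng = np.random.default_rng(0)
X = rng.normal(size=(n, d))                  # each row is one input x_i
w_true = rng.normal(size=d)
y = X @ w_true + 0.1 * rng.normal(size=n)    # linear relationship + Gaussian noise

# MLE / ordinary least squares:  w = (X^T X)^{-1} X^T y
# (solve the linear system rather than forming the inverse explicitly)
w_mle = np.linalg.solve(X.T @ X, X.T @ y)

# MAP / ridge regression: adding the regularizer lambda * ||w||^2
# gives  w = (X^T X + lambda * I)^{-1} X^T y
lam = 0.1
w_map = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
```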
And that is the closed-form solution to this loss function. If you can compute it, there is nothing else to be done: your function is a parabola, so you jump right to the minimum and you are done. (Question: why is it X^T y and not y^T X? Because when you take the derivative of a^T w with respect to w, you get a; again, see the Matrix Cookbook.)

Why don't people always do this? The answer is that this matrix can be gigantic, and inverting a gigantic matrix can take forever. There are two problems: (a) you have to store the d-by-d matrix in memory, which you may not be able to do, and (b) the inversion is very expensive. (In practice you don't compute the inverse; you solve the linear system instead, which is a little faster, but the asymptotic complexity is the same.) The complexity is O(d^3) in computation and O(d^2) in space, where d is the number of dimensions. Cubic complexity is always a problem: roughly, if you double the dimensionality, it becomes eight times slower. This is elementary computer science: whenever you have an algorithm, you argue how much slower it gets and how much more space it needs as you increase the input size. For example, when the dimensions are the words in the English language, you can easily have millions of dimensions, and then this becomes prohibitively expensive. In that regime you just use gradient descent instead, and that actually works really well.
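For comparison, here is a minimal sketch of the gradient-descent alternative mentioned above, which never forms the d-by-d matrix; the step size and iteration count are arbitrary choices for illustration.

```python
import numpy as np

def squared_loss_gradient_descent(X, y, lam=0.0, lr=1e-3, steps=1000):
    """Minimize sum_i (w^T x_i - y_i)^2 + lam * ||w||^2 by gradient descent.

    Each step costs O(n * d) time and O(d) extra memory, instead of the
    O(d^2) memory and O(d^3) time of the closed-form solution.
    """
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(steps):
        residual = X @ w - y                      # shape (n,)
        grad = 2 * X.T @ residual + 2 * lam * w   # gradient of the regularized loss
        w -= lr * grad
    return w
```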
Any more questions? Today we will do something even cooler than linear regression: support vector machines. Last lecture was about regression; let's take a step back and talk about classification again. Our labels y_i are now either +1 or -1, and just as before we assume the data is linearly separable: we have some positive points and some negative points, and we would like to find a hyperplane between the two classes. If you know such a hyperplane exists, we already have an algorithm to find one. Who remembers its name? The perceptron. The perceptron finds a separating hyperplane, but you have no guarantee which one it finds, and if one separating hyperplane exists, there are infinitely many: it could be this one, or this one, or this one. So ever since it was possible to find a hyperplane, people have asked the obvious question: which one should I find? That is where the perceptron is weak: you get no guarantee about what you find. The SVM finds a very specific one, and I would argue it is the one you want.

If you remember the perceptron proof, we talked about this notion of margin: the number of steps the perceptron needs to find the hyperplane grows as the margin shrinks, so you want the data to be far away from the hyperplane. To remind you, the margin of a hyperplane is its distance to the closest data point; I call it gamma. The intuition behind the SVM is that among all separating hyperplanes, the best one is the one with maximum margin; originally this was phrased as maximum margin classification. And it is not just good intuition: you can theoretically show that such a classifier is likely to generalize well. The idea is this: if my hyperplane is really far away from the positive points and really far away from the negative points, then my test points, drawn from the same distribution, won't be exactly the training points, but hopefully they will land somewhere in the same region. If I move my hyperplane as far as possible away from the training points, I give myself a large margin of error: a test point that lands a little off is still classified correctly. If instead I drew a hyperplane that barely separates the training data, which is perfectly fine on the training set, a test point that lands just on the wrong side would be misclassified, even though it is still very close to its own class. Intuitively that seems like the wrong thing to do.

So the maximum margin hyperplane is defined as the hyperplane with the maximum distance to its closest point, and that distance is always exactly the same on both sides. Can anyone tell me why? Right: the margin is defined as the shortest distance to any data point. If the closest point of one class were closer than the closest point of the other class, I could shift the hyperplane a little towards the farther class and thereby increase that shortest distance. So at the optimum, the distance to the closest positive point and the distance to the closest negative point must be exactly the same.

SVMs were invented, I think in 1994, by Corinna Cortes, Isabelle Guyon, and Vladimir Vapnik at AT&T. To this day they are among the most popular machine learning algorithms and are extremely successful; this is also one of my favorite algorithms. There is one little caveat: the algorithm is actually patented, so strictly speaking you are not allowed to use it; besides that, it's great. Or, put another way: if you use it, just don't admit it. There is a patent troll company that owns the patent, and if you were ever to admit that you use SVMs in any product that makes money, they will sue you quicker than you can say "support vector machine."
Oracle, for example, foolishly admitted it at some point: they advertised that their databases now have support vector machines, and bam, I think they got sued the next day. But you can always say you are just learning a linear classifier that maximizes the gap between the hyperplane and the closest point; if you avoid the language a little bit, it is very hard to prove that you are actually using it.

Okay, good. Let me formalize this notion of margin. We have talked about it a little already, but let's make it crystal clear. Our hyperplane is the following set:

H = { x : w^T x + b = 0 },

the set of all points that lie exactly on the plane indexed by w and b. What I want you to think about is: what is the distance from a point to this hyperplane? You may recognize this question; it was on the placement exam for exactly that reason, so you would get a head start thinking about the geometry.

Say this is my hyperplane, defined by the vector w, and I have some point x; I would like to know the distance from x to the hyperplane. The first thing I can do is project x onto the hyperplane; call that projection x_P, the closest point on the hyperplane to x. The distance from x to the hyperplane is exactly the distance between these two points, and I call the vector between them d, so that

x_P = x - d.

What do we know about x_P? It lies on the hyperplane, and by definition points on the hyperplane satisfy w^T x + b = 0, so

w^T x_P + b = 0.

Plugging the definition of x_P into this equation gives

w^T (x - d) + b = 0.

Now comes the question: what is this d? Here is an interesting realization: d must be parallel to w (I didn't draw it that way, but it is), because w is exactly the vector that points orthogonally away from the hyperplane. So d is just a rescaled version of w, and I can write d = alpha * w for some scalar alpha. Plug this in:

w^T (x - alpha * w) + b = 0.

That is all I need: b is given, w is given, x is given; the only thing I don't know is alpha, so I can solve for it:

alpha = (w^T x + b) / (w^T w).

I'll give you a minute to digest and verify this. Great; I hoped that if you all did it for the placement exam, we could go through this quickly. Now we know exactly what alpha is, and therefore exactly what d is: d = alpha * w. What I actually want is the length of d, so let's compute its norm:

||d|| = sqrt(d^T d) = sqrt(alpha^2 * w^T w) = |alpha| * sqrt(w^T w)
      = [ |w^T x + b| / (w^T w) ] * sqrt(w^T w)
      = |w^T x + b| / ||w||.

One more time: I take d, whose norm is the square root of d^T d; I plug in d = alpha * w and get alpha squared times w^T w; I pull the alpha out and get |alpha| times the square root of w^T w; and since alpha is (w^T x + b) / (w^T w), one factor of sqrt(w^T w) cancels. So the distance of a point x to the hyperplane is

|w^T x + b| / ||w||,

which, if you remember your high-school geometry, is exactly the formula you learned. (Question: does this only work in two dimensions? No, this is Euclidean space of any dimension, d-dimensional is totally fine, and it also works in more general spaces, Banach spaces and so on, which will become relevant in a few lectures. Good question.)

So that is the distance to one data point. What we want is the smallest such distance, so we define the margin of the hyperplane defined by (w, b) as

gamma(w, b) = min over x in D of |w^T x + b| / ||w||:

you go over all the data points in the data set, compute the distance of each one to the hyperplane, and take the smallest. Maybe take a second and digest everything.
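To make the distance and margin formulas concrete, here is a small NumPy sketch; the hyperplane parameters and the data points are made-up values for illustration.

```python
import numpy as np

def distance_to_hyperplane(X, w, b):
    """Distance |w^T x + b| / ||w|| of each row of X to the hyperplane w^T x + b = 0."""
    return np.abs(X @ w + b) / np.linalg.norm(w)

def margin(X, w, b):
    """Margin gamma(w, b): the smallest distance of any data point to the hyperplane."""
    return distance_to_hyperplane(X, w, b).min()

# Made-up 2-D example.
X = np.array([[2.0, 1.0], [3.0, 3.0], [-1.0, -2.0], [-2.0, 0.5]])
w = np.array([1.0, 1.0])
b = -0.5
print(distance_to_hyperplane(X, w, b))
print(margin(X, w, b))
```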
So what does the SVM do? It says: find the hyperplane with the largest margin. Why don't we just write exactly that as our optimization problem:

max over w, b of gamma(w, b)?

If we leave it at this, what do we get? I'll give you a minute to think about it with your neighbor: are we done if I just say I want any hyperplane (w, b) with maximum margin? Why is this going to blow up in my face? Right: if all you want is to maximize the distance to the closest data point, you can just push the hyperplane away to infinity, far from all the data. That is a hyperplane with a really large margin to both classes, but it is not what we want; we want a hyperplane that actually lies between the two classes and separates them. So we have to force the maximization to do something sensible, and we do that with a constraint. This becomes a constrained optimization problem: maximize this function subject to the constraint that for all data points i,

y_i (w^T x_i + b) >= 0.

We had this expression before, when we talked about the perceptron: w^T x_i + b tells you which side of the hyperplane you lie on, and multiplying by the label makes it positive exactly when the point is on its correct side (positive points on the positive side, negative points on the negative side). So if this constraint is satisfied for all i, the hyperplane lies between the classes, and then the maximization does the right thing. (Question: are you looking ahead? I'm not there yet, good point, we'll get there very soon.)

Now we want to solve this, and right now we don't know how. Just pay attention for five more minutes; it is only three steps, and you are going to learn something really cool. Being able to solve these things makes great conversation material at first dates and cocktail parties. Plug in the definition of gamma:

max over w, b of [ min over x in D of |w^T x + b| / ||w|| ]   subject to the constraint above.

Now we have something really nasty: a maximization of a minimization. Let's not be scared, just keep going. The first thing we can do is pull the 1/||w|| out of the inner minimization, because it is not a function of x:

max over w, b of [ (1 / ||w||) * min over x in D of |w^T x + b| ].

Still pretty scary, but now comes the cool trick. The hyperplane is defined up there as the set of points with w^T x + b = 0. If I rescale w and b by any positive constant, it makes no difference: it is exactly the same set, exactly the same hyperplane. There are infinitely many scalings of (w, b) that all describe the same hyperplane, so that is a degree of freedom we can exploit: currently there is no unique solution, and we don't care which scale we get, because we just want a separating hyperplane. So let's fix a scale that is convenient for us, and here is the one we are going to fix: we require that

min over x in D of |w^T x + b| = 1,

i.e., we rescale (w, b) exactly so that the closest point satisfies |w^T x + b| = 1. It's a good trick. (Why can we always do this? Given any w and b, whatever the value of min |w^T x + b| is, I can divide both w and b by exactly that value. That gives me the same hyperplane; it does not give the same objective value, but I don't care about the value, I care about the hyperplane, and for my purposes that is identical.)
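A quick numeric sanity check of this rescaling argument (the values are made up for illustration): scaling (w, b) by any positive constant leaves every point-to-hyperplane distance unchanged, and dividing by min |w^T x + b| normalizes that minimum to 1.

```python
import numpy as np

X = np.array([[2.0, 1.0], [3.0, 3.0], [-1.0, -2.0], [-2.0, 0.5]])
w, b = np.array([1.0, 1.0]), -0.5

def distances(X, w, b):
    return np.abs(X @ w + b) / np.linalg.norm(w)

# Rescaling (w, b) by any c > 0 describes the same hyperplane:
c = 7.3
assert np.allclose(distances(X, w, b), distances(X, c * w, c * b))

# Choose the scale so that the closest point satisfies |w^T x + b| = 1:
scale = np.abs(X @ w + b).min()
w_hat, b_hat = w / scale, b / scale
assert np.isclose(np.abs(X @ w_hat + b_hat).min(), 1.0)
```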
With that choice, the inner minimum is just 1, and we are left with maximizing 1/||w||, or equivalently 1/(w^T w), with the scale choice added as a constraint. We have committed a small sin here, because we still have to deal with that constraint, but for now we just push it down as a constraint:

maximize 1 / (w^T w)
subject to   y_i (w^T x_i + b) >= 0   for all i,
             min over i of |w^T x_i + b| = 1.

So the objective function is now much, much simpler. And here is a very important lesson you learned last time: maximization is for losers; we want to minimize. Maximizing 1/(w^T w) is the same thing as minimizing w^T w, no difference. So:

minimize w^T w, subject to the same two constraints.

We are almost there. Now I claim something, and I want you to verify it: instead of these two constraints, we can write a single set of constraints,

y_i (w^T x_i + b) >= 1   for all i,

and I claim that this is identical to the two constraints: if the two are satisfied, they imply this one, and if this one is satisfied, it implies the two (at the optimum). I'll give you a couple of minutes; please talk with your neighbor and see if you can prove it. Who can do one direction, left to right or right to left? Who is brave? Nobody? All right, Arthur.

That's right. Going from left to right: the first constraint says every point has the right sign, it is on the correct side, so I can replace the absolute value |w^T x_i + b| by y_i (w^T x_i + b); and since the minimum of these values is 1, they must all be greater than or equal to 1.

Now right to left, and here someone is raising a very good point: assume y_i (w^T x_i + b) >= 1 for every single data point. Then everything is certainly greater than or equal to zero, so the first constraint is satisfied; you are not worried about that one. But the second constraint says the smallest value is exactly 1, while the new constraint only says it is at least 1; it could be 5, it could be 100. Why is that not an issue? Because we are minimizing w^T w. If the smallest value were, say, 10, you could divide both w and b by 10: all the constraints would still be satisfied, but you would get a smaller objective value, so that could not have been the optimal solution. Therefore, at the optimum, the smallest value is exactly 1 and the two formulations agree. The key is that because we are minimizing the norm of w, we are fine.

The good news is that the problem on the right has a quadratic objective (a parabola) and linear constraints: that is a quadratic program. Quadratic programs are something our friends in the math department have been studying for fifty years; they have written all these packages, in obscure languages nobody uses, to minimize such functions, and we can use them or write our own.
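This final formulation can be handed directly to an off-the-shelf solver. Below is a minimal sketch using the cvxpy package (an assumption for illustration; it is not the solver used in the lecture demo, and the toy data is made up), solving min w^T w subject to y_i (w^T x_i + b) >= 1:

```python
import cvxpy as cp
import numpy as np

# Made-up linearly separable 2-D data with labels +1 / -1.
X = np.array([[2.0, 2.0], [3.0, 3.5], [2.5, 1.5],
              [-1.0, -1.0], [-2.0, -0.5], [-1.5, -2.5]])
y = np.array([1, 1, 1, -1, -1, -1])

n, d = X.shape
w = cp.Variable(d)
b = cp.Variable()

# Hard-margin SVM primal: minimize w^T w  s.t.  y_i (w^T x_i + b) >= 1 for all i.
constraints = [cp.multiply(y, X @ w + b) >= 1]
problem = cp.Problem(cp.Minimize(cp.sum_squares(w)), constraints)
problem.solve()

w_opt, b_opt = w.value, b.value
print("max margin:", 1.0 / np.linalg.norm(w_opt))  # gamma = 1 / ||w||
```

The points where the constraint is tight, y_i (w^T x_i + b) = 1, sit at distance exactly 1/||w|| from the hyperplane; these are the closest points that the parallel margin lines in the demo below pass through.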
All right, a demo? You called my bluff; I made that demo up, but no, I actually do have one, thanks for keeping me honest. Let me first do the demo for this class, and then we can still do the other one. (Why the banana? I don't know.) Here we go: we have a data set, or rather we can create one, so let me make some linearly separable data. Here are my negative points, and my positive points are the other circles. Now I run my SVM solver, and what it gives me is exactly the hyperplane with maximum margin. You can see the hyperplane, and the white lines are exactly the parallel lines that go through the closest points; they are exactly parallel and at equal distance from each other. The beautiful thing is that you just run this quadratic programming solver and bam, you get the answer; there is no iterative process on your side (ultimately there is one inside the package, but it is just such a beautiful formulation of a machine learning problem). When this came out, people hadn't thought of machine learning that way; everyone thought in terms of algorithms that iterate over something. Then Corinna Cortes, Isabelle Guyon, and their colleagues came along and said: you just write this as an optimization problem, attack it with a solver, and it gives you the answer. That put the entire field into a frenzy; for about ten years nobody did much of anything else but quadratic programming and convex programming.

Does anyone want to see another example? If the two classes are really close together, the margin is very small, so the perceptron would take a really long time to converge, but the SVM finds the separator essentially immediately. This is what made SVMs so popular: when they came out they blew everything else away, you got much lower error on most data sets, and it was such an elegant framework that everybody understood it. Any questions about the demo? (Question: do you have to enforce the equal distance to both classes explicitly? No, you get it automatically: the constraint puts the closest points at distance one in the rescaled units, and by minimizing the norm you end up at equal distance from both classes; otherwise you could always move towards the class that is further away and increase your distance to the closer one.) The other demo I will push to next lecture; please remind me at the beginning.
Info
Channel: Kilian Weinberger
Views: 20,721
Rating: 4.9725084 out of 5
Keywords: artificial intelligence, machine learning, kilian weinberger, cornell, cs4780, support vector machines, svm
Id: xpHQ6UhMlx4
Length: 49min 58sec (2998 seconds)
Published: Wed Jul 11 2018