Support Vector Machines (SVMs): A friendly introduction

Captions
hello, my name is Luis Serrano, and this is a friendly introduction to support vector machines, or SVMs for short. This is the third in a series of three videos on linear models; if you haven't seen them, take a look at the first one, called linear regression, and the second one, called logistic regression. This video builds a lot on the second one in particular. I'll start with the credits: this year I taught a machine learning class at Quest University in British Columbia, Canada, and had a wonderful group of students; we had an awesome time. Here's a picture of us with my friend Richard Hoshino on the right; he's also a professor, and my students were actually the ones who helped me figure out the key idea for this video.

SVMs are a very important classification algorithm. Basically, an SVM tries to separate points of two classes using a line, but it tries really hard to find the best line, and the best line is the one that stays as far from the points as possible, since that separates them best. Normally SVMs are explained in terms of either some kind of linear optimization or some kind of gradient descent. What I want to show you today is something I actually haven't seen in the literature (it may exist, but I haven't seen it): a small-step, iterative method. At each iteration you first try to find a better line that classifies the points, and then you take the two parallel lines and stretch them apart a little.

Let me be more explicit, starting with a very quick recap of the previous video, on logistic regression and the perceptron algorithm. We have data split into two classes, red points and blue points, and we want to find not just any line, but a line with a red side and a blue side that splits the points in the best possible way. The way we did this was to start with a random line and then ask the points what they can tell us to make the line better. For example, this point over here says, "I'm good, so don't do anything." A blue one says, "Hey, I'm on the wrong side, so you'd better move closer to me to classify me better," so we move a little closer; remember, in machine learning we want to take tiny steps, never big drastic ones. We ask another point, a red one in the red area, and it says, "I'm good, don't do anything." Then this one over here says, "Get over here," so we move toward it. A blue one in the blue area says, "I'm good." (By the way, we're picking points at random; there's no particular order.) A red point in the red area says it's good; a blue point in the red area says, "Get over here," so we move closer; another red point in the red area says it's good; and a red point that is misclassified in the blue area says, "Move over here," and we listen to it. Now it seems like all the points are good, and that, in a nutshell, is the perceptron algorithm. To remind you how it goes: step one, start with a random line with a red side and a blue side. Step two, pick a large number, the number of repetitions or epochs, which in this case is going to be a thousand; that's the number of times we repeat the iterative step. Step three, repeat a thousand times: pick a random point and ask it whether it's correctly classified; if it is, do nothing, and if it isn't, move the line a little bit toward the point.
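Here is a minimal sketch of that update step in Python (the variable names, the ±1 label convention, and the learning rate value are my own assumptions, not from the video):

```python
import random

def perceptron_step(a, b, c, points, labels, lr=0.01):
    """One iteration: pick a random point; if the line ax + by + c = 0
    misclassifies it, nudge the line a little toward that point."""
    i = random.randrange(len(points))
    x, y = points[i]
    label = labels[i]  # +1 for red, -1 for blue (assumed convention)
    prediction = 1 if a * x + b * y + c >= 0 else -1
    if prediction != label:
        # Move the line slightly toward the misclassified point.
        a += lr * label * x
        b += lr * label * y
        c += lr * label
    return a, b, c
```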
We do this repeatedly, and we end up with a line that separates the data pretty well. So anyway, that's a small recap of the perceptron algorithm, and the SVM algorithm is going to be very similar, but with a little extra step. Let me show you that extra step.

First, let's define what it is that the SVM does best. I'm going to give you an example of some data, copied twice: here is one line that separates the data, and here is another line that separates it too. Question for you: which line is better? Feel free to think about it for a minute. I think the best one is the one on the left, and the one on the right is not as good, even though both separate the data. Notice that the line on the left separates the points really well: it's really far away from the points. The one on the right is really, really close to two of the points, so if you were to wiggle it around, you might miss one of the points and misclassify it; the line on the left, on the other hand, you can wiggle freely and still get a good classifier.

So now the question is: how do we train the computer to pick the line on the left instead of the line on the right? Remember, the perceptron algorithm just finds a good line that separates the data, not necessarily the best one. Let's rephrase the question: what we want is not just to find one line, but to find two parallel lines that are as far apart from each other as possible. Here, for example, centered on the main line, we have two parallel, equidistant lines, and notice that in the case on the left we can push them pretty far away from each other, whereas if we do the same with the line on the right, the farthest we can get is two lines that are pretty close together. So we compare this green distance over here with this one over here: the one on the left is pretty wide, and the one on the right is pretty narrow, and we're going to go for wide. We tell the computer: when you find a wide one, you're good; when you find a narrow one, you're not.

Now the question is how to train an algorithm to find two parallel lines, as far apart from each other as possible, that still split our data. Here's what we're going to do, and it's very similar to what we did before. We start by dropping a random line, which doesn't necessarily do a very good job, and we draw two parallel lines around it at some small random distance. Then we do something very similar to the perceptron algorithm: we start listening to the points and asking them what to do. Say one point tells us to move in this direction, so we move in this direction, and then, at every step, we also separate the two lines just a little bit. We listen to another point that tells us to move in this direction, and again we separate the lines a little bit; another point tells us to move, and we separate the lines a little bit more. And that's pretty much it; that's what this SVM algorithm does. Of course, we need to go through some technicalities, and one of them is how to separate the lines, so let me show you how to separate lines using equations.
Let's say we have a line with equation 2x + 3y − 6 = 0, and recall that we're in the Cartesian plane, where the horizontal axis is the x-axis and the vertical axis is the y-axis. This line is the set of points for which two times the x-coordinate plus three times the y-coordinate minus six equals zero. What happens if I multiply the 2, the 3, and the −6 by some constant, for example by 2? I get 4x + 6y − 12 = 0. What line do you think that is? It's actually the exact same line, because any point that satisfies 2x + 3y − 6 = 0 also satisfies two times that expression equals zero, since two times zero is zero. And if I multiply the equation by any factor, for example by 10, I get 20x + 30y − 60 = 0: the exact same line again. So this line actually represents a whole family of equations. I can also multiply by numbers smaller than 1: for example, 0.2x + 0.3y − 0.6 = 0, which is the original equation divided by 10, also gives the same line, and I can even multiply by negative numbers and it still works.

But now let's see what changes. Here again we have 2x + 3y − 6 = 0 and the exact same line written as 4x + 6y − 12 = 0. Now let's draw the lines 2x + 3y − 6 = 1 and 2x + 3y − 6 = −1, because our two parallel lines around the original one are the ones with the same left-hand side that give not 0, but 1 and −1. What do you think happens if I do the same thing on the graph on the right? This is important, so feel free to pause the video and think about it for a minute. Here's what happens: we get two lines that are parallel but much closer. The lines 4x + 6y − 12 = 1 and −1 are actually a lot closer to the original line than the ones with equations 2x + 3y − 6 = 1 and −1. And if I multiply the equation by a smaller factor, for example dividing by 10 so that I get 0.2x + 0.3y − 0.6 = 1 and −1, I get lines that are much farther away from the original one; if instead I multiply by a huge number, say 10, getting 20x + 30y − 60 = 1 and −1, the lines get much, much closer. So when I multiply by a constant, the original line stays the same, but the two parallel lines move farther away or closer, depending on whether I multiply by a small number close to zero or by a large number. (This won't appear in this video, but if you multiply by a negative number, the two outer lines actually switch; that's not important for this algorithm.) So basically, we're going to be able to separate the lines by multiplying the equation by a small number; that's really what we'll do in this algorithm.

But first we need some justification: why does this phenomenon happen? Let's look at the line 2x + 3y − 6 = 0 and just one side of it, 2x + 3y − 6 = 1. Why is it that the line 4x + 6y − 12 = 1 sits in between them, exactly in the middle? Well, 4x + 6y − 12 = 1 is the same line I get if I divide the entire thing by 2, including the 1: namely 2x + 3y − 6 = 0.5. The reason is that any x and y that satisfy 4x + 6y − 12 = 1 also satisfy 2x + 3y − 6 = 0.5; it's the exact same equation. And when I bring back the original equation, you can see that the value 0.5 lies right in between the value 0 and the value 1; that's why this line is in between. You can check that this works for pretty much any constant you multiply the line by.
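Here's a small numeric check of this behavior (a sketch of mine, not from the video; it uses the standard fact that the line ax + by + c = 1 lies at distance 1/√(a² + b²) from the line ax + by + c = 0):

```python
import math

def offset_distance(a, b):
    """Distance from the line ax + by + c = 0 to the parallel
    line ax + by + c = 1 (standard point-to-line distance)."""
    return 1.0 / math.hypot(a, b)

# Three ways of writing the same central line:
print(offset_distance(2, 3))      # 2x + 3y - 6 = 0       -> ~0.277
print(offset_distance(4, 6))      # 4x + 6y - 12 = 0      -> ~0.139 (closer)
print(offset_distance(0.2, 0.3))  # 0.2x + 0.3y - 0.6 = 0 -> ~2.774 (farther)

# Sanity check: the point (0, 2) lies on all three versions of the line.
for a, b, c in [(2, 3, -6), (4, 6, -12), (0.2, 0.3, -0.6)]:
    print(a * 0 + b * 2 + c)  # 0 in each case
```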
So now we introduce something called the expanding rate, and it's very simple. Again we have the equation 2x + 3y − 6 = 0, which gives us this line, and our two neighboring equations: the one that gives 1, over here, and the one that gives −1, over here. The expanding rate is just going to be some number, and remember that in machine learning we always want to take tiny steps, so we want to separate these lines by a very, very small amount. We take a number very close to 1, for example 0.99 (let's say that's my favorite number close to 1), and we call it the expanding rate. Then we simply multiply all the coefficients by 0.99, and we get the equations 1.98x + 2.97y − 5.94 = 0, 1, and −1. These give us three lines: the one in the middle is still the same one, but the two on the sides are spread just a little apart. So we're going to add that step to the perceptron algorithm, and it will spread our lines apart a little bit every time we iterate.

Now we're ready to formulate the SVM algorithm, and it's going to be very similar to the perceptron algorithm. Step 1: start with a random line and two equidistant parallel lines around it; I'll color them red and blue to emphasize which side of the line is red and which is blue, so we can see which points are correctly or incorrectly classified. Step 2: pick a large number, the number of repetitions or epochs: the number of times we iterate the algorithm. Step 3: pick a number close to 1, the expanding factor; as we saw, it's going to be 0.99 (I could pick anything close to 1, but that's the one I pick). Step 4 is the loop: repeat a thousand times: pick a random point; if it's correctly classified (for example, this one that says "I'm good"), do nothing; if it's incorrectly classified (for example, this blue point in the red area that says "get over here"), move the line toward the point, the way we learned in the previous video; and then do the extra step, separating the lines a little bit using the expanding factor. We repeat this many, many times, a thousand times, until we get a pretty good result, and then we enjoy the lines that separate the data best. Notice that the two steps we've added are step 3, picking the expanding factor close to 1, and the step where we separate the lines using it; the rest is pretty much the same as the perceptron algorithm.
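Of the steps above, the expanding step is the only mechanically new piece, so here is a tiny numeric sketch of it alone (my own illustration, using the width formula 2/√(a² + b²) that comes up again later in the video):

```python
import math

a, b, c = 2.0, 3.0, -6.0
expanding_rate = 0.99

print(2.0 / math.hypot(a, b))  # width between the +1 and -1 lines: ~0.555

# The expanding step: shrink all coefficients toward zero, which keeps
# the central line ax + by + c = 0 fixed but pushes the +-1 lines apart.
a, b, c = a * expanding_rate, b * expanding_rate, c * expanding_rate
print(a, b, c)                 # 1.98, 2.97, -5.94
print(2.0 / math.hypot(a, b))  # slightly wider: ~0.560
```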
Now, for full disclosure, in case you want to code this: this is the perceptron algorithm we saw in the previous video, where step 4 is the mathematical step in which we check whether a point is in the blue or the red area by checking whether the equation applied to the point gives a value greater than 0 or less than 0, and we update the values of a, b, and c accordingly, by adding the learning rate times the coordinates of the point. The SVM algorithm is very similar. We start with a random line of equation ax + by + c = 0 and draw the parallel lines with equations ax + by + c = 1 and ax + by + c = −1. Then we pick a large number, the number of epochs, which is going to be a thousand; a learning rate, which is going to be 0.01 (we saw it in the logistic regression video); and an expanding rate, which is going to be 0.99, a number close to one. Then the loop step is repeated a thousand times: pick a random point; if the point is correctly classified, do nothing; if the point is blue in the red area, update the values of a, b, and c accordingly; if the point is red in the blue area, update the values in a different way; and then, as a final step, multiply the values of a, b, and c by 0.99, which is the expanding step. Again, the two new steps are step 3 and the expanding step. So that's it: that's the SVM training algorithm. I encourage you to code it and see how it does; try different values for the number of epochs, the learning rate, the expanding rate, etc., and let me know how it went in the comments.
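Here is a minimal, runnable sketch of that training loop, under assumptions of mine that the captions leave open: I encode red points as label +1 and blue points as −1, and I judge "correctly classified" against a point's own margin line, which matches the harsh classification error described below.

```python
import random

def train_svm(points, labels, epochs=1000, lr=0.01, expanding_rate=0.99):
    """Sketch of the SVM trick: perceptron-style updates plus the
    expanding step. points: list of (x, y); labels: +1 for red,
    -1 for blue (an assumed convention)."""
    a, b, c = [random.uniform(-1, 1) for _ in range(3)]
    for _ in range(epochs):
        i = random.randrange(len(points))
        x, y = points[i]
        label = labels[i]
        # A point is "good" only if it is on the correct side of its
        # margin line ax + by + c = label; otherwise, move toward it.
        if label * (a * x + b * y + c) < 1:
            a += lr * label * x
            b += lr * label * y
            c += lr * label
        # The expanding step: spread the two parallel lines apart.
        a *= expanding_rate
        b *= expanding_rate
        c *= expanding_rate
    return a, b, c

# Hypothetical toy dataset: two blue points and two red points.
points = [(1.0, 1.0), (2.0, 1.5), (4.0, 5.0), (5.0, 4.5)]
labels = [-1, -1, +1, +1]
print(train_svm(points, labels))
```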
So that's the SVM algorithm; as I said, I encourage you to code it, try it on some datasets, and see how it goes. However, this algorithm comes from somewhere: it comes from an error function, developed with gradient descent. So now I'm going to show you what the error function is. It's very similar to the perceptron case, where we had a classification error based on how far the misclassified points are from the boundary; but now there is another term that adds to the error, based on how far apart the two lines are. Let me show you.

To start with error functions, let me first ask you a question. Here we have the same dataset twice, and I'm going to show you two support vector machines that classify it: the first one is this one, and the second one is this one. Which one do you think is better? Feel free to pause the video and think about it. Notice that the model on the left has one problem: it misclassifies a point; but it's good in that its lines are pretty wide apart. The model on the right is great at classification, because it classifies every point correctly; however, its lines are very close together. So which one is better? The answer is: we don't really know. It depends on our data, it depends on our model, it depends on the scenario. But with error functions we have a way to analyze exactly what we want.

Let's recall the perceptron error. Here we have some points and a model, a perceptron, that separates them. This model makes some mistakes: these two, because they're blue points in the red area, and these two, because they're red points in the blue area. So how do we measure how bad this model is? The rationale is: if a point is on the correct side, its error is zero; if a point is on the wrong side, then the error varies: if the point is close to the boundary, the error is small, and if it's far from the boundary, the error is huge. For example, if you're a blue point close to the blue area but still in the red area, you have a small error, but if you're well into the red area, you generate a lot of error, because the model is very wrong about that point. So what we want is the distance, or not exactly the distance, but something proportional to it, and the same over here; we add numbers proportional to these distances, and that's the perceptron error.

For the SVM it's going to be similar: we have our lines, and now we have two classification errors coming from different places. The red area no longer starts at the middle line but at the bottom line, and every blue point above the bottom line is automatically misclassified; these three are misclassified, and the error is precisely the distance from the bottom line, it's that simple. Notice that this blue point close to the bottom line counts as misclassified, even though the perceptron algorithm would have classified it correctly; that's okay, it's a harsh error function. The blue error comes from the line at the top: every red point underneath this top line is misclassified, and its error, like the perceptron error, is proportional to this distance over here. We add up all those distances, and that's our error; those two errors form the classification error.

Now we have something called the margin error, and the margin error is simply something that tells us whether these two lines are close together or far apart. I'll be a little more specific later, but it's basically a number that is big if the lines are close together and small if they are far apart; it's an error, so the better the model, that is, the wider its lines, the smaller the error should be. Let's look a little more at the margin error. Here we have our dataset, and our dataset again, and two models. This one has the lines pretty far apart, so it has a large margin and therefore a small margin error; this one over here has the lines pretty close, so it has a small margin and therefore a large margin error. And to show the contrast, notice that the model on the right has a small classification error and the model on the left has a large classification error, because the one on the right classifies all the points correctly and the one on the left misclassifies one point.

But let's get back to our margin error. We have our three lines, and recall that the equations of the outer two are ax + by + c = 1 and ax + by + c = −1. What we're going to do now is calculate the distance between them, and I'm going to leave it as a challenge for you to do some math and show that this distance is actually 2 divided by the square root of a² + b². I challenge you to prove it; what you have to do is play with linear equations and the Pythagorean theorem. So now the question is: what can our margin error be? We need a number that is big if the distance is small, and small if the distance is big. Feel free to think about it; the hint is to look at the denominator: the bigger a² + b² is, the smaller this distance is, and vice versa. So what about just taking the margin error to be a² + b²? Notice that if this number is large, the distance is small, and vice versa. So we let our margin error be that sum of squares.
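(A spoiler for the challenge, if you want one: here is one way to get that distance, using the standard point-to-line distance formula instead of building everything from the Pythagorean theorem.)

```latex
% Distance from a point (x_0, y_0) to the line ax + by + c = 0:
d = \frac{|a x_0 + b y_0 + c|}{\sqrt{a^2 + b^2}}
% Take any point (x_0, y_0) on the line ax + by + c = 1, so that
% a x_0 + b y_0 + c = 1. Its distance to the line ax + by + c = -1,
% i.e. to the line ax + by + (c + 1) = 0, is then
d = \frac{|a x_0 + b y_0 + c + 1|}{\sqrt{a^2 + b^2}}
  = \frac{1 + 1}{\sqrt{a^2 + b^2}}
  = \frac{2}{\sqrt{a^2 + b^2}}
```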
That works: it measures how far apart the lines are, just in the opposite direction: if the lines are close, the error is big, and if the lines are far apart, the error is small. If that looks familiar, it shouldn't be a surprise: it's actually the regularization term, if you've seen L2 regularization.

So now we can summarize the SVM error. Here it is: we have our dataset and our model, and the error splits into three parts. First is the blue classification error, which basically measures all the red points that are on the blue side; then the red classification error, which measures all the blue points that are on the red side; and then the margin error, which measures how far apart the lines are. The red and the blue errors get together to form the total classification error, which tells us how many points are misclassified and how badly they are misclassified; the margin error tells us whether the lines are far apart or close together; and these two get together to form the total SVM error. That is the error in a support vector machine, the one we want to minimize, and the gradient descent step is very similar; in fact, it does the same thing as the SVM trick. Here we have our data, and here we have a model, and this model is pretty bad: notice that the lines are pretty narrow and it misclassifies a bunch of the points. So this is a bad SVM; it has a large error both in the classification sense and in the margin sense. What we do, using calculus, using gradient descent, is minimize this error in order to get to a good place: a good SVM with a good boundary, whose lines are far apart and which classifies most of the points correctly. In the same way as with the perceptron algorithm, this gradient descent process takes us from a large error to a small error, and it is actually the exact same thing as the SVM trick I showed you earlier: moving the line closer to the points, plus separating the lines a tiny little bit.

So now I have a challenge for you, and the challenge is simply to convince yourself that the expanding step actually comes out of gradient descent. Take a look at this: we have our lines with equations ax + by + c = 1 and ax + by + c = −1, we have the margin over here, and the margin error, which is a² + b². If you're familiar with gradient descent, what happens is that we take a step in the direction of the negative of the gradient, and the gradient is made of the derivatives of the margin error with respect to the two parameters a and b. This is a very simple gradient: the derivative of a² + b² with respect to a is simply 2a, and with respect to b it's 2b. Therefore the gradient descent step takes a and sends it to a minus the learning rate η times 2a, and the same with b: it turns it into b − η · 2b. We can factor the first as a(1 − 2η) and the second as b(1 − 2η). But notice this number over here: it's exactly the expanding factor, because what we're doing is multiplying a by a number close to one. Remember that we multiplied a and b by 0.99: this 1 − 2η here is the 0.99, because if we take η to be a small number, then 1 − 2η is very, very close to one. So that is exactly the expanding step: the expanding step comes out of gradient descent, applied to the regularization term.
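In symbols, this is just the calculation from the captions written out (η is the learning rate):

```latex
\frac{\partial}{\partial a}\left(a^2 + b^2\right) = 2a,
\qquad
\frac{\partial}{\partial b}\left(a^2 + b^2\right) = 2b
% One gradient descent step with learning rate \eta:
a \leftarrow a - \eta \,(2a) = a\,(1 - 2\eta),
\qquad
b \leftarrow b - \eta \,(2b) = b\,(1 - 2\eta)
% For \eta = 0.005 the multiplier is 1 - 2(0.005) = 0.99,
% exactly the expanding rate chosen earlier.
```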
Anyway, the challenge is to formalize this and really convince yourself that it is the case. Now let's go back a little bit and remember these two models, because we never really answered the question of which one is better. Remember that the one on the left misclassifies one blue point, and the one on the right has a very, very short distance between its lines, so they're both good and bad in some way. Let's really study them. The one on the left has a large classification error, because it makes one mistake, and a small margin error, because its lines are pretty far apart. The one on the right has a small classification error, because it classifies every point correctly, and a very large margin error, because its lines are too close together. So again, which one to pick depends on us; it depends on what we want from the algorithm. However, we need to pass this information to the computer: we need to tell it which one we care more about, the classification error or the margin error, and the way to pass this information to the computer is with a parameter, or rather a hyperparameter, which we'll call the C parameter. Recall that the error here is the classification error plus the margin error; we take a number C and attach it to the classification error, so that our error is no longer a plain sum but a weighted sum, with one term weighted by C: the error is C times the classification error, plus the margin error (there's a small sketch of this right after these captions). Now, what happens for a small value of C? The classification error gets multiplied by a very small number, so all of a sudden it's less important, and the margin error becomes the important one; we are really training the algorithm to focus much more on the margin error, so we end up with a good margin and maybe a bad classification: the model on the left. However, if we have a large value of C, then the classification error ends up being a lot more important and the margin error matters little; a model with a large C focuses more on classification, because it tries to minimize the classification error more than it tries to minimize the margin error, so we end up with a model like the one on the right, good for classification and bad for margin. And again, we decide this parameter ourselves; what we really do in real life is try a bunch of different values and see which one did better. But it's good to know that we have this control over the training. These are called hyperparameters, and every machine learning algorithm has a bunch of hyperparameters that one can tune to decide what we want.

So that's all, folks; thank you very much for your attention. I remind you that this is the last of a series of three videos on linear models: linear regression, logistic regression, and support vector machines. I hope you enjoyed this as much as I did. Remember to subscribe if you want to get notifications of more videos coming; if you liked it, please hit like, share it with your friends, or comment. I love reading the comments; I read them all. If you have suggestions for other videos to make, I'd love to hear them, and if you want to tweet at me, this is my Twitter handle: @luis_likes_math. Thank you very much, and see you in the next video.
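As referenced above, here is a small sketch of the C-weighted error (the function name and the error values are hypothetical, just to show the trade-off):

```python
def svm_error(C, classification_error, margin_error):
    """Weighted SVM error: C sets the trade-off between the two terms."""
    return C * classification_error + margin_error

# Hypothetical error values for the two models discussed above:
left  = {"classification_error": 1.0, "margin_error": 0.2}  # wide margin, one mistake
right = {"classification_error": 0.0, "margin_error": 5.0}  # narrow margin, no mistakes

for C in (0.01, 1.0, 100.0):
    print(C, svm_error(C, **left), svm_error(C, **right))
# A small C favors the wide-margin model on the left;
# a large C favors the perfectly classifying model on the right.
```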
Info
Channel: Serrano.Academy
Views: 80,111
Keywords: svm, support vector machines, machine learning, artificial intelligence, mathematics, math, linear algebra, calculus, technology
Id: Lpr__X8zuE8
Length: 30min 57sec (1857 seconds)
Published: Sun Jan 27 2019