6.034 Recitation 7: Support Vector Machines (SVMs)

Video Statistics and Information

Captions
Today we're going to talk about support vector machines, which are one of the two core topics on Quiz 3, a week from Wednesday: that's neural nets, which we talked about last week, and support vector machines. In next week's recitation we're just going to go over practice problems for both, so don't worry if you're still a little shaky on them.

Support vector machines are a type of supervised machine learning, just like the other methods we've talked about: k-nearest neighbors, ID trees, and neural nets. What all of these do is draw boundaries in some space. For example, if we had this data, how would ID trees draw boundaries? How many lines would you need with ID trees, if you can only use x and y tests? Would one line do? We'd need two lines, so we'd have something like that and that, though there are other possible solutions. Nearest neighbors would just draw the perpendicular bisectors between plus and minus points. What would neural nets do; how many lines would a neural net need to draw? Just one, because the data is linearly separable. So a neural net might draw, say, that line; it might also draw one closer to the minuses, or halfway in between. The neural net is just going to separate the data somehow. A support vector machine, in this case, will be very similar to a neural net, but it's going to actually separate the data with the largest possible margin. So in this case a support vector machine might be the best choice, because it gives us a wide margin between the pluses and minuses and it's not going to overfit. An SVM, or support vector machine, is a numerical classifier, meaning the data has numerical features, something like x and y, rather than symbolic features. Specifically, it's a classifier that will draw a single decision
boundary such that it maximizes the margin width, that is, the margin between the plus and minus points. One thing I sort of glossed over: the support vector machine is a binary classifier, classifying points as either plus or minus, whereas a neural net classifies them as 1 or 0. That was still binary; the classifications are just labeled differently.

Now let's try this on another data set. First, what if we had exactly one plus point and one minus point? There are a few different boundaries we could draw: something like this line A, or this line B. Which would be a better decision boundary? A, because it maximizes the margin: the distance from A to either training point is a lot bigger than the distance from B to a training point. Using that principle, what might we put as the decision boundary for this data? x = 1.5 would be good, because that's exactly halfway between the plus points and the minus point. So we'll say the line x = 1.5 is our decision boundary. Then, because we're talking about the margin between the positive points and the negative points, we put in these things called gutters. The gutters go through the negative points closest to the decision boundary and the closest positive points, and they're parallel to the decision boundary, so we'll have those two lines. One possibly confusing bit of terminology: the gutter is not actually a region, it's just the line itself, so you can't really be in the gutter, but you can have a point on a gutter. These blue dashed lines are gutters. Another thing we have are the points that determine the decision boundary, and those are called support vectors. In this case, which points are necessary to determine the boundary? A, B, and D. The reason we know that is that if we removed any one of those points
the boundary would be different. So here we have four training points and two classes of points, plus and minus, and in this case we have three support vectors, where a vector is just a point: those are the circled pluses and minuses. We could also add other extraneous points, like more minus points out here or plus points out there, and they wouldn't change the boundary at all, unless we put them inside this margin area. The margin is another term: it's this area between the two gutters, and the margin width is the distance from one gutter to the other. What would the margin width be in this case? One, because it's just the distance from x = 1 to x = 2, so the margin width equals one. Another thing we could do is add additional points on the gutters. Would those be support vectors? It depends on how we define support vectors. The definition is: any points that are necessary for determining the boundary. If you could remove the point and the boundary wouldn't change, then it's not a support vector. In this case we don't need any particular one of those points to be a support vector, so it's actually somewhat ambiguous: we could say A, B, and D are the support vectors, or we could say D, B, and this point are. You can't tell exactly which points are the support vectors, but we only need two or three of them. So the different cases we can have in two dimensions: either something like what we have right now, two pluses and a minus (or two minuses and a plus) with one of the points sort of between the others, and then a total of three support vectors, or a case where we only need two support vectors. If the points are directly across from each other, like if we had a point right here, then we could use just D and that point. Another thing
we can think about is what would happen in other numbers of dimensions. In two dimensions our boundary is a line and we need either two or three support vectors. What about in one dimension; what would the boundary look like? A point. In one dimension the boundary itself is a point, so our data could look like this on a number line, with some pluses and some minuses, and that would be our boundary. What about in three dimensions? A plane. You could also do higher numbers of dimensions; they're just harder to draw on the board and think about. In one dimension, how many support vectors would we need? Two, because we're going to need one plus and one minus. So we always need at least two support vectors, and in three dimensions we'll sometimes need more than three, but we're not going to worry about that today.

Now, one question is how we figure out where the decision boundary goes. In this case it was pretty easy to see that it was just this vertical line, but sometimes it's not quite so easy, so we have something called the convex hull method to help with that. A convex hull is basically a convex shape, a shape that doesn't have any concavities. We start with some data, maybe some pluses and some minuses, and we pretend we're stretching a rubber band around the plus points and another rubber band around the minus points, which gives us shapes that look like this. One thing that does is make it easy to see that some of the points definitely won't affect the boundary: this plus in the middle of the shape is definitely not going to be the plus that's closest to the minuses. Another thing is it makes it easier to see that this line is closest to that point. So in this case these three points would be support vectors, and that would be our
boundary. This is one of three cases. This is the case where we have basically a line and a pointy area, where one region is plus and the other is minus; you could also reverse them. The boundary will be parallel to the line, and the gutters will be one along the line and one parallel to it going through that point. Another case is two pointy regions facing each other: some pluses and some minuses, and again we stretch rubber bands around them to make convex hulls. Sometimes it's not entirely clear, so you might ask: is this point closest to that point, or is this point closest to that line? The way to figure that out is to imagine extending the line and finding the closest point on the line, by drawing a perpendicular from the point to the line. Here you'd see that the closest point on the line is way out here, which is not part of the convex hull, and that means this line does not determine the boundary. In this case all we care about is one plus and one minus, so our boundary will just be the perpendicular bisector of those two points. That's our second case: a pointy plus region and a pointy minus region, with a boundary that's the perpendicular bisector. And again, once we have the boundary, the gutters are pretty easy to find, because they just go through those points. What do you think the third case is? Two lines. We might have some pluses and some minuses; we put rubber bands around the two sets of points and see that these two lines happen to be parallel. That means our boundary will just be another parallel line halfway in between the two lines, and the gutters will go along those points. So this is the case where we have a plus region and a minus region whose facing edges are parallel lines, and then we
have gutters. How many support vectors will we have in the line-and-point case? Usually three, because the point will always be one, though I guess we could have only two if there were a point right there; so either two or three. What about in the point-and-point case? Only two, because we just have the two pointy areas. What about in this case? As we said up here, we'll only ever need two or three: it depends on whether there are two points directly across from each other, in which case we could say just those two; otherwise it could be these two and that plus, or two of the pluses and a minus. It's ambiguous, but we only need two or three of them. Any questions on that? Is there more than one answer? Yes, there's more than one: we could say, for example, that it's those three points, but you could also say it's the two minuses and this plus. Some answers would not be correct, though. If we said it's these two pluses and that minus, that wouldn't actually work, because then the boundary would be diagonal like this. You just have to make sure that the points you choose do determine this particular boundary. Other questions?

Okay, so this is sort of a human way of solving the problem, because it's harder to tell a computer to stretch rubber bands around points and figure out whether they look parallel. So we're going to formalize this a little using some parameters: w, b, and alpha. The w is a normal vector, specifically one perpendicular to our boundary, and b is the offset, which is just a number. These are the parameters the computer would use to figure out where the boundary is, but for now we're going to compute w and b using our boundary, and at the end we'll talk about how the computer would actually get that boundary. So here is how to calculate w and b. There are four steps, and overall we're going to end up with five equations, which I'll number as we go, just
to keep track. Our first step is just to draw the boundary line (this is still a human method), and we've done that: it's our pink line here. The next step is to write the equation of the boundary. What's the equation of the decision boundary in this case? It's right here: x = 1.5. Our third step is to rewrite this equation in a specific format: w · x + b = 0. One note about this: in lecture we used minus b instead of plus b. It's not really going to make a difference, since b is just a constant we're figuring out; only the sign of b would be different. Most notes use plus b, so we're going to do that just by convention. What we want is some vector w dotted with the vector x, where x is just our features, in this case x and y (if we were working in more or fewer dimensions, this vector would have more or fewer components), plus some value b, equal to zero. So we just reformat the equation, similarly to how we did with neural nets. This equation is the same thing as 1·x + 0·y - 1.5 = 0. So what would the vector w be? (1, 0), because 1 times x goes here and 0 times y goes there. Then what would the offset b be? -1.5. We can just read those numbers off the equation. But this equation has a degree of freedom in it, because we could multiply through by something like negative one or two hundred and it would still draw this exact same line. What we want is a rule for classifying points: the classification of some unknown point x should be equal to the sign of the expression w · x + b. That means we care about whether or not we multiply through by minus one, so we need some way of scaling the equation, and we're also going to care about the degree of freedom so
that we can use it to figure out how close we are to the margin. So as our fourth step we'll scale this equation: specifically, we'll scale it so that the margin width equals 2 over the norm of w, which is an expression we derived in lecture. The boundary equation was equation one up here, and this margin-width relation will be equation two. But putting a constraint on the length of w still doesn't tell us whether or not we should multiply through by negative one, so we're also going to say that we want w to point toward the positive points: w is going to be normal to this line, so it points either left or right in this case, and we want it pointing toward the pluses. That's one way we could scale the equation, but it turns out there's an equivalent method that's actually easier mathematically: the gutter constraint. The gutter constraint is, as you would expect, a constraint regarding the points on the gutter. Specifically, it says that for the support vectors, or for any other points on the gutter, the expression w · x + b should be equal to exactly plus or minus one. That means this expression will be plus one or minus one when you're on the gutter, between -1 and 1 when you're inside the gutters, and greater than 1 in absolute value when you're outside, and that way the expression can tell us how far from the decision boundary something is. So the gutter constraint says w · x_i + b = y_i for all the support vectors, or points on the gutter, where y_i is the classification of point i, which will be either plus one or minus one. What we're going to do in this case is first use the gutter constraint to scale our equation, and then check our work using the margin width and the
direction of w. The gutter constraint is equation three. In order to use the gutter constraint, we only need to choose one specific point that's on the gutter: we could choose A or B or D, or even one of the non-support-vectors on a gutter. Let's just say we use point D. Using point D, we write out w · x + b for it: we said w is (1, 0), x is the coordinates of point D, which are (2, 2), and b is -1.5. What do we want that to be equal to; what's the classification of point D? Negative one. So this is what the gutter constraint says should be true. Is it true? What's the left side equal to? Positive 0.5. So the equation currently doesn't hold, and that's why we need to scale it. We're scaling w and we're scaling b, which we could write as some constant times w and the same constant times b; equivalently, we can just put a single c in front of the whole w · x + b expression, which is the same thing. So we can say c times 0.5 equals -1, and use that to solve for the constant: c = -2. That means the gutter constraint tells us we should scale this equation by a factor of -2, which gives us a scaled w of (-2, 0) instead of (1, 0), and a scaled value of b of positive 3 instead of -1.5. So now if we want to classify points, we can just use this value of w and that value of b, and they'll give us the correct sign. (Yes, this is the classification y_i: if we had used A or B, we would have put +1, but because D is a negative point it was -1. Good question.) So now we could use this to classify points, and we would find that the expression w · x + b is between -1 and 1 if you're inside the gutters, and greater than 1 in absolute value
if you're outside the gutters. The next thing we want to do is calculate the supportiveness values; well, I guess we haven't said what supportiveness is yet. Those are the alpha values. Actually, first we should check our work here. Let's see what happens with the margin width. If we use equation two, which says that the margin width should equal 2 over the norm of w, this should now hold, now that we've scaled w. So let's see if it's true. What is the actual margin width in our problem? It's on the graph: we determined by inspection that the margin width is 1, because the margin is this distance between the gutters, and that was equal to 1. So this side should be 1. What's the length of the vector w? 2. So 1 = 2/2, and that's correct, so that's good. Then we just need to check that w points toward the positive points. If w is the vector (-2, 0), drawn from the origin it would go like that, or drawn from our boundary line it would go like that, so it does point toward the positive side. That checks out too, which means these values of w and b are consistent. (Is w defined relative to the line? Yes, w will always be a normal vector to the boundary. One slightly confusing thing is that it's inversely proportional to the margin width: as the margin gets smaller, w gets bigger. So it's a little dangerous to think of w as representing the margin width; just remember that it's the opposite.) Any other questions on this? Okay, so now let's go on to calculating the alpha values. These are the parameters that determine how we classify things, our classification rule. The way the computer does this, instead of finding the boundary by some magic, is with these alpha values, which are Lagrange multipliers; if you don't know what that means, don't worry about it. Basically they're values where each point has one, and the computer is going to use something called sequential
minimal optimization, where it goes through a bunch of iterations adjusting the alpha values until it maximizes the margin width. We're going to calculate what the alpha values will be at the end of the problem, after the SVM has been trained. Another name for these, we're going to call them supportiveness values, because another way of thinking about the alpha values is that they represent how big a role each point plays in supporting, or determining, the boundary. That means a point like E out here, which doesn't play any role in determining the boundary, will have a supportiveness of zero. So our first step here will be to find all of the points that aren't support vectors and just give them alpha = 0: for all non-support-vectors, alpha equals zero. Otherwise, the point is a support vector, it lies on the gutter, and it's going to have a positive supportiveness: alpha greater than zero. Note that the supportiveness values will never be negative, because no point is going to contribute negatively to determining the boundary. Then we'll have two more equations, which we derived in lecture, to figure out what these values actually are. The first one is that the sum, over all of the support vectors i, of their supportiveness values times their classifications equals zero. But because there are only two possible classifications, plus or minus, we can write this equivalently as: the sum over the positive support vectors, which we'll call P, of their alpha values equals the sum over the negative support vectors of their alpha values. So let's see what this equation is for our support vector machine. We have three points we're considering: A, B, and D. We start with alpha_A times the classification of point A, which is positive, so times plus one; B is also positive, so plus alpha_B; and D is negative, so minus alpha_D; equal to
zero. Equivalently, the sum of the positive points' alpha values, alpha_A + alpha_B, equals alpha_D, and you can see those are equivalent because we just moved the D term to the other side of the equation. This will be equation four. Now we just need one more equation so that we can solve; well, really we need another pair of equations, so that we can solve for these three unknowns. We're going to get that by doing another summation over the support vectors: this time we sum their supportiveness values times their classifications times their coordinates, and that comes out equal to w. This is a vector equation, and we can equivalently write it as: the vector w equals the sum over the positive support vectors P of their supportiveness values times their coordinates, minus the same sum over the negative ones. Our positive support vectors are A and B, so on this side of the equation we'll have alpha_A times A's coordinates, (1, 3), plus alpha_B times B's coordinates, (1, 1), minus alpha_D times D's coordinates, (2, 2), and that's equal to the vector w. What did we say w is? Do we know what w is? Yes, (-2, 0); we computed that earlier. We should make sure we're using the scaled w, because otherwise it's just some arbitrary scaled version of that vector, and we want the actual w. So now we have three equations, because we can expand this vector equation into two component equations: -2 = 1·alpha_A + 1·alpha_B - 2·alpha_D, and likewise for the second component. So now we have three equations and three unknowns, so take a minute and solve for alpha_A, alpha_B, and alpha_D, and we'll see if we get the same answers. Did anyone get an answer yet? One; they say one, or negative one? Okay, if we got a negative one, something's wrong, because they should all end up being positive; that means we made a mistake somewhere. What were the
other values you got? Okay, I think the other ones, for alpha_B and alpha_D, were correct: we have alpha_B is one. Oh, but what did you say alpha_D is? This should be two. Yes, and alpha_A equals one based on this equation. Did anyone actually get these answers; did we make a mistake somewhere, or are these answers correct? Okay, great. A couple of ways we could check these if we're not sure: the easiest thing is just to plug them into the equations and see if the equations hold. Another thing, to get more intuition, is that the sum of the alphas on the plus side is equal to the sum of the alphas on the negative side. If we write them up here: first, what's the alpha value for E? Zero, because it's not a support vector. Then we have alphas of one for A, one for B, and two for D. Sure enough, the alphas on the plus side add up to the ones on the minus side, and that's true just because of this equation. Another thing to notice is that alpha_A and alpha_B are the same, because those two points are symmetric with respect to point D: in terms of how much of a role they play in determining the boundary, they're each working just as hard, so there's no reason why one should have a different supportiveness value than the other. Also notice that D has a greater supportiveness, because it's the only point determining the negative gutter, whereas we have two points on the positive gutter. A couple of other things we can look at for intuition: what would happen if we made the margin a lot wider? Say we had the same sort of situation, but now with A way out here, B way out here, and D way out there. How do you think that would affect the supportiveness values? We can find out. You think they would increase? Okay, let's see if that's correct. One thing we notice is that the margin width is now a lot bigger: the margin width increased, and that means the expression 2 over the norm of w, which is equal to the margin width, also increased, so
that means that the norm of w is going to decrease. And equation five over here basically tells us that the supportiveness values are proportional to the vector w, so that means the alphas are going to decrease. The intuition behind that is that over here we had this really squished-together margin, so there wasn't very much wiggle room: if we moved a point even a little bit, it would change the boundary a lot. In comparison, here, because the margin is much wider, if we move one of the points it won't change the boundary significantly. So we're going to say that here the points are playing a lesser role in supporting the boundary than there, where there's a really small margin. An example of the alpha values here might be, say, 0.1, 0.1, and 0.2, so they'll still be in the same ratio as in this case. Any questions on why that is? Okay, then one other thing we can look at for intuition. Let's go back to having our small margin, but say that we have A up here, B down there, and D a lot lower, close to B. We could say that D still has a supportiveness of two, because the margin width is the same as in this case, but now the question is which one will have a bigger supportiveness, A or B? Your guess is B? The way to figure this out is to see what would happen if we removed either of them. If we removed point A, the boundary would change a little bit and look like that instead, whereas if we kept point A but removed point B, the boundary would swing around and look like that. That means B is playing a greater role in determining this boundary, so, as you guessed, it's going to have a larger supportiveness value. We might have something like alpha = 1.8 for B and 0.2 for A; the exact numbers are going to depend on how far apart they are with respect to D, but those are just an example of what we might get in that case. Any questions on calculating w, b, and supportiveness values before we move on to kernels? Okay. So far we've been assuming that our data is linearly
separable, but oftentimes that's not going to be the case, so we'll need some other way of handling it. If the data is not linearly separable, we have a couple of options. One thing we can do is apply a transformation, such as transforming to polar coordinates. Another option is to replace the dot product, wherever it appears (like in w · x + b), with something called a kernel function. This has basically the same effect as transforming the space, except that we don't have to actually apply a transformation. One way of thinking about this kernel function is as a similarity measure, and we're going to do a very short derivation to see why. A kernel function, some function K that takes in two vectors u and v, is a similarity measure, and the reason why comes from this equation for classifying points. The classification of an unknown point, we said, was equal to the sign of the expression w · x + b. We could replace w using equation five and it would be the same, so we're going to do that: we get the sign of the sum, over all of the support vectors, of their supportiveness values times their classifications times their coordinates, dotted with x, and then plus b. One more thing we can do is take this dot product, x_i · x, and replace it with a kernel function. Then we have the sum of the supportiveness values times the classifications, but with a kernel function in our sum: the kernel function of x_i and x, plus b. When the kernel function is x_i dotted with x, that's the exact same thing as the expression above. But the neat thing to notice here is what happens when we're classifying unknown points, say some mystery point over here that we're trying to classify as a plus or a minus: the
alpha values aren't changing, the classifications of the support vectors aren't changing, and the value b isn't changing. The only thing that changes is this kernel function. So what I mean when I say the kernel function is a similarity measure is that, basically, the way we figure out the classification of the unknown point is by using the kernel function to see how similar it is to each of the three support vectors, and we use that to determine whether it's more similar to the pluses or more similar to the minuses, which tells us whether it's a plus point or a minus point. So we're going to look at a few common kernel functions. So far we've just been using the dot product, which is a linear kernel, and it draws a single line in the space; we could also have other linear kernels that would similarly draw a line. We'll make a list of common kernel functions. A linear kernel: for example, the kernel of two vectors u and v could be their dot product, or their dot product plus a constant, and that will still just draw a line. Then we could do something more complicated, like a quadratic kernel: for example, the kernel function of u and v could be equal to (u · v + 1), all squared, or basically any quadratic function of the vectors u and v. If a linear function draws a line, what do you think a quadratic function does? Something like a parabola. It basically draws any conic section: it could draw a line, it could also draw a hyperbola, and if our data looked like this, it might draw a parabola. It could also draw a circle or an ellipse; it could do something like that to classify this data. Or it could even draw a pair of hyperbolas (that's a tongue twister), which is another conic section. That might look like two lines, but it's really only one boundary, because it's a single function that draws both of them. Then we could also do a higher-order polynomial, which is just
the same idea but with a more complicated polynomial. We could give it some more complicated data and it might do something like that, for example with a cubic function. Finally, the last kernel is something called the radial basis kernel. The main things to know about this kernel are that it's really powerful and can classify basically any data set, unless you have something like a plus and a minus point with the exact same features, in which case nothing will be able to classify it. The way it does that, you can sort of think of it as drawing fuzzy, nearest-neighbor-type boundaries around all the points. First, an example of a radial basis function: the general form is that the kernel of two vectors u and v equals e raised to minus the distance between them over some parameter sigma. This parameter sigma is a Gaussian parameter, and depending on how we tune it, this might draw different things. If it's tuned to prefer minus points, it might do something like drawing little circles around the pluses, so that almost everything is classified as a minus, whereas if it's tuned to treat them equally, it might do something more like that. Again, this might look like it's drawing multiple boundaries, but the way to think of it is that it's really just drawing something in another plane. If we turned this on its side, what's actually going on is there's a function coming out of the board, and it sort of has two little mountains coming out of the board that look like that. Where they intersect this plane, they form circles. So it's not actually two separate circles; it's really just a manifold, a floppy plane, that's coming through the board in various places. Basically, with that you can classify anything, because you're adding another dimension to your classifier. Any questions on that?
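Since the kernel discussion is the most formula-heavy part, here is a minimal sketch in numpy of the kernels listed above and the kernelized decision rule, sign of the sum over i of alpha_i times y_i times K(x_i, x), plus b. The support vectors, alphas, and b come from the worked example earlier in the recitation; the radial basis form follows what's written on the board, e to the minus distance over sigma (other references often square the distance), so treat the exact formula as an assumption.

```python
import numpy as np

# Common kernel functions from the list above
def linear(u, v):
    return u @ v

def quadratic(u, v):
    return (u @ v + 1) ** 2

def radial_basis(u, v, sigma=1.0):
    # As written on the board: e^(-||u - v|| / sigma)
    return np.exp(-np.linalg.norm(u - v) / sigma)

def classify(x, support_vectors, kernel, b):
    """Kernelized rule: sign( sum_i alpha_i * y_i * K(x_i, x) + b )."""
    total = sum(alpha * y * kernel(np.array(x_i), np.asarray(x, dtype=float))
                for x_i, y, alpha in support_vectors)
    return np.sign(total + b)

# Support vectors from the worked example: (coordinates, y_i, alpha_i)
svs = [((1.0, 3.0), +1.0, 1.0),   # A
       ((1.0, 1.0), +1.0, 1.0),   # B
       ((2.0, 2.0), -1.0, 2.0)]   # D

# With the linear kernel this reproduces sign(w . x + b) for w = (-2, 0), b = 3
print(classify((0.0, 2.0), svs, linear, b=3.0))   # 1.0, plus side
print(classify((4.0, 0.0), svs, linear, b=3.0))   # -1.0, minus side
```

Swapping `quadratic` or `radial_basis` in for `linear` changes the shape of the boundary without changing the classification code at all, which is the point of replacing the dot product with a kernel.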
So that's basically everything that you need to know about support vector machines.
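The whole worked example from the recitation (reading off w and b, scaling with the gutter constraint, checking the margin width, and solving equations four and five for the alphas) fits in a short script. This is a sketch using numpy, with the board's points A = (1, 3), B = (1, 1), D = (2, 2):

```python
import numpy as np

w, b = np.array([1.0, 0.0]), -1.5          # read off 1*x + 0*y - 1.5 = 0
x_d, y_d = np.array([2.0, 2.0]), -1.0      # support vector D and its class

# Step 4: scale so the gutter constraint c * (w . x_d + b) = y_d holds
c = y_d / (w @ x_d + b)                    # -> -2.0
w, b = c * w, c * b                        # -> w = (-2, 0), b = 3

# Check: margin width 2 / ||w|| should match the graph (distance 1 between gutters)
assert np.isclose(2 / np.linalg.norm(w), 1.0)

# Check: w . x_i + b = y_i on every support vector (gutter constraint)
for x_i, y_i in [((1, 3), 1), ((1, 1), 1), ((2, 2), -1)]:
    assert np.isclose(w @ np.array(x_i, dtype=float) + b, y_i)

# Equations 4 and 5 as a linear system in [alpha_A, alpha_B, alpha_D]
M = np.array([[1.0, 1.0, -1.0],    # eq 4: alpha_A + alpha_B - alpha_D = 0
              [1.0, 1.0, -2.0],    # eq 5, x-component = -2
              [3.0, 1.0, -2.0]])   # eq 5, y-component = 0
alphas = np.linalg.solve(M, np.array([0.0, -2.0, 0.0]))
# -> approximately [1, 1, 2], matching the answers found in class
```

Solving the three-by-three system numerically is just a stand-in for the by-hand algebra in the recitation; a real SVM trainer would instead run an optimizer such as SMO over the alphas.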
Info
Channel: Jessica Noss
Views: 31,504
Rating: 4.9414225 out of 5
Keywords: MIT, Massachusetts Institute of Technology, AI, Artificial Intelligence, CSAIL, 6.034, Patrick Winston, Patrick Henry Winston, phw, jmn, Machine Learning, Supervised Machine Learning, ML, Supervised ML, Support Vectors, SVM, Vlad Vapnik, Vladimir Vapnik, Vapnik, Kernel, Kernel function, radial basis kernel, Support Vector Machine
Id: ik7E7r2a1h8
Length: 52min 4sec (3124 seconds)
Published: Fri Nov 04 2016