Machine Learning Lecture 22 "More on Kernels" – Cornell CS4780 SP17

Captions
Welcome, everybody. Please put away your laptops. One quick reminder: the project on ERM is due today; you may want to use some of your late days. I looked at the top-scoring teams and I really liked that people did very principled things to get to the top. I will maybe set aside some time after spring break to give the winners a chance to explain what they did, because it seems like they really know what they're doing. Any questions about logistics? Any other questions? All right.

So, we've been talking about the bias-variance tradeoff. A few lectures ago we established that the error in machine learning decomposes into three terms: noise, bias, and variance. What we will do now is give you tools to address each one of these carefully. Last lecture we talked about how to identify whether your problem is a high-bias problem, a high-variance problem, or, if it is neither, a high-noise problem. Now we're getting into how to combat them, and we are focusing for now on high bias.

So, high bias. Can someone tell me how you diagnose high bias? What do you see when you have high bias? Right: high training error, and a small gap between training and testing error. That's the important part. Your training error is in some sense already too high, and your test error is not much worse. The problem is not the gap between training and testing — that would indicate high variance — but that your classifier cannot even do well on the training set. That is clearly high bias. (It could also be attributed to noise, but typically it's high bias.)

One thing we did last time was to look at a very simple problem: circles in the middle and crosses in a ring around them, and we really want to use a linear classifier because we just got a discount on SVMs. How can we apply a linear SVM to this data set? As given, it doesn't work: there is no way to separate the crosses from the circles with a line. But one thing you can do is add some features. If you take x1 and x2 and add, for example, x1 squared, x2 squared, or the sum of the two (one person suggested polar coordinates, which amounts to the same thing, since you just assign weights to these terms and add them up), then suddenly the data becomes linearly separable. It turns into a really easy problem.

That suggests that maybe one good way of addressing high bias is simply adding more features. There are two ways to do that. If you have domain knowledge, you can extract more features from the data itself — for house prices, maybe also get the school district, or the average grades in the school district. That's the data-acquisition side. What we're looking at here is the other side: you already have some features; can we somehow combine them to capture the nonlinear interactions between different features? In this toy case it's simple: we can look at the data and figure out the right features ourselves. It's more like a puzzle.
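To make that concrete, here is a minimal sketch (not from the lecture; the radii and the threshold are made up for illustration) showing that adding the single feature x1^2 + x2^2 makes the circle-versus-ring data separable by a linear rule:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: circles near the origin (label -1) and crosses on a ring around them (+1).
n = 200
theta = rng.uniform(0, 2 * np.pi, 2 * n)
r = np.concatenate([rng.uniform(0.0, 1.0, n),      # inner class
                    rng.uniform(2.0, 3.0, n)])     # outer class
X = np.c_[r * np.cos(theta), r * np.sin(theta)]
y = np.concatenate([-np.ones(n), np.ones(n)])

# Not linearly separable in (x1, x2) alone, but add the feature x1^2 + x2^2 ...
phi = np.c_[X, (X ** 2).sum(axis=1)]

# ... and a linear rule in the new space already separates the classes perfectly
# (weights (0, 0, 1), threshold 2.25 -- any threshold between 1 and 4 works here).
pred = np.sign(phi @ np.array([0.0, 0.0, 1.0]) - 2.25)
print("training accuracy:", (pred == y).mean())    # 1.0
```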
Usually, though, that's not the case. Usually you cannot just plot the data and futz around with it; say the data is a hundred-dimensional — you wouldn't really know how to visualize that. So one idea is to model all the possible interactions between features. We have the vector x = (x1, ..., xD), and we add a feature for every possible interaction among these features. The first one is just the constant 1, which means no interaction at all; then the features themselves; then all the pairwise products x1·x2, x1·x3, ..., x1·xD, x2·x3, and so on; then all the three-way products; and you keep going until at the very end you get x1·x2·⋯·xD. That probably captures a lot of the possible nonlinearities in your data set, so it is very effective at decreasing bias.

But there is one downside: how many dimensions is this? Last time you figured this out: the dimensionality of this vector is 2^D. Why 2^D? You always have x1 through xD, and each one of them can be either "on" or "off" in a given term. That gives exactly two options per feature: the term where everything is off is the constant 1, and the term where everything is on is the full product x1·⋯·xD. So there are exactly 2^D possible switch settings. That should scare the living daylights out of you, because you could easily have a thousand-dimensional data set, and 2^1000 is more than the number of electrons in the universe — there is no way you can write that vector down. So the expansion is great, but as stated it is only feasible for very small data sets. Let's not get scared by this; let's continue and pretend, for now, that we don't have this problem looming over our heads.
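As a sanity check on the 2^D count, here is a sketch of the full interaction expansion for a tiny d; the function name and example values are my own, and the point is only that the output has length 2^d:

```python
from itertools import combinations
import numpy as np

def all_interactions(x):
    """Map x in R^d to the 2^d-dimensional vector of all subset products:
    1, x1, ..., xd, x1*x2, ..., x1*x2*...*xd. Only feasible for tiny d."""
    d = len(x)
    feats = []
    for k in range(d + 1):                       # subsets of size k
        for idx in combinations(range(d), k):
            feats.append(np.prod([x[i] for i in idx]) if idx else 1.0)
    return np.array(feats)

x = np.array([2.0, 3.0, 5.0])
print(len(all_interactions(x)))   # 2^3 = 8
print(all_interactions(x))        # [1, 2, 3, 5, 6, 10, 15, 30]
```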
What I want to show you today is essentially two tricks for dealing with this. We would like to extend the dimensionality of our data by combining features with each other — that seems like a good approach — but the problem is that the resulting dimensionality (a) makes everything really slow and (b) needs a lot of memory.

The first trick is a simple, pragmatic one (there are also deep mathematical reasons for it, but here is the practical motivation): a linear classifier, it turns out, can be expressed entirely in terms of inner products. If we only ever access our data through inner products, then we can precompute them once, store them, and during training just look them up. So define a matrix K with K_ij = x_i^T x_j. What I'm trying to convince you of — and this was part of the homework, so I primed you a week ago — is that if I can express my entire classifier in terms of inner products, then I only have to look the inner products up; I never have to compute them during training. One reason you might want this is that you already have the feeling these inner products are going to be really expensive.

So let me show you that we really can express everything in terms of inner products. Let me use the squared loss, just because I like the squared loss, but the same argument holds for all the other losses we've talked about. The squared loss is

  ℓ(w) = Σ_{i=1..n} (w^T x_i − y_i)^2.

What is its gradient? Please correct me if I'm wrong:

  g = Σ_{i=1..n} 2 (w^T x_i − y_i) x_i.

So this is the loss function, and this is the gradient we use in gradient descent; call it g. So far so good. (The notes are organized a little differently; I realized on the walk over that I can explain this more compactly.)

Here is the trick. The first thing I want to convince you of — this was part of the homework, if you remember — is that at all times during the algorithm, w can be written as a linear combination of our inputs:

  w = Σ_{i=1..n} α_i x_i

for some assignment of the α's. That's the claim, and I will prove it to you in a few seconds. If it is true, then think about what happens: in the loss we only access our data through terms of the form w^T x_i (there is one small caveat, but don't worry about it for now). What does w^T x_j become? If w has the form above, then

  w^T x_j = Σ_{i=1..n} α_i x_i^T x_j = Σ_{i=1..n} α_i K_ij.

Just to be clear about what I'm doing: I want to express everything in terms of inner products, because then I can precompute the inner products, save them in memory, and just look them up every time I need one. The nice thing is that my vectors can then be very high-dimensional; it doesn't matter, as long as my inner products are precomputed and good to go. (Let's not worry yet about the cost of that precomputation.) So during training, w^T x_j for any x_j in my training set is just Σ_i α_i K_ij. Any questions at this point?
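A minimal sketch of that precomputation step, assuming plain NumPy; the key point is that the Gram matrix is n × n, so its size depends on the number of points, not on their dimensionality:

```python
import numpy as np

def linear_gram(X):
    """Precompute K[i, j] = x_i^T x_j once; training then only looks entries up."""
    return X @ X.T

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 3))            # n = 5 points in d = 3 dimensions
K = linear_gram(X)
print(K.shape)                              # (5, 5): the size depends on n, not on d
print(np.isclose(K[1, 4], X[1] @ X[4]))     # True
```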
Good. So let me briefly prove to you that w really is of this form. Who remembers the proof from the homework? It was on the homework, right? Okay, here you go. Here is the algorithm: we do gradient descent, and the squared loss is quadratic, hence convex, so the beautiful thing is that we can initialize w any way we want — whether you start here or over there, gradient descent on a convex function always reaches the minimum. So I get to choose the starting point, and I choose it to be all zeros. The proof is by induction.

Base case: w = 0, the all-zeros vector. That is obviously a linear combination of my x's. What are the values of the alphas in this case? All zero: α_i = 0 for all i. Check — at the beginning, w is a linear combination of all our inputs.

Inductive step: assume that at iteration t we have w = Σ_{i=1..n} α_i x_i; that's the induction hypothesis (everyone knows proof by induction? good). Now we do one more update. The update is

  w ← w − s·g,

where s is the step size and g is the gradient from above, g = Σ_{i=1..n} 2(w^T x_i − y_i) x_i. Call the scalar in front of each x_i

  γ_i = 2(w^T x_i − y_i),

so that g = Σ_i γ_i x_i. Then the update becomes

  w − s·g = Σ_{i=1..n} α_i x_i − s Σ_{i=1..n} γ_i x_i = Σ_{i=1..n} (α_i − s·γ_i) x_i,

and α_i − s·γ_i is just my new α_i. That completes the proof: the claim is true initially, because all the alphas are zero, and if it is true at the beginning of an iteration it is still true at the end — we simply changed α_i to α_i − s·γ_i.

Another way of looking at this is that we can rewrite the gradient descent algorithm itself: initialize α_i = 0 for all i; then iterate: compute the γ_i's (that's an inner loop over the data points), update each α_i ← α_i − s·γ_i, and loop around. Any questions?

Yes — γ_i is still just a number, a scalar, so at the end this is still a linear combination of my x's. Where the scalar comes from doesn't matter: you could plug in the age of my mother-in-law and it would still be a linear combination of the x's. Another question: is this only for linear classifiers? Yes — we are holding on to linear classifiers and trying to make them more powerful, because linear classifiers are beautiful. Any more questions?
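Here is a hedged sketch of that alpha-space gradient descent for the squared loss; the step size and iteration count are arbitrary choices for illustration, and the check at the end uses a plain linear kernel so the result is easy to verify:

```python
import numpy as np

def kernel_gd_squared_loss(K, y, step=0.01, iters=500):
    """Gradient descent on the squared loss, parameterized by alpha instead of w.
    w = sum_i alpha_i x_i is never formed; we only touch the precomputed matrix K."""
    alpha = np.zeros(len(y))               # base case of the induction: w = 0
    for _ in range(iters):
        preds = K @ alpha                  # preds[i] = w^T x_i = sum_j alpha_j K[i, j]
        gamma = 2.0 * (preds - y)          # gamma_i, the scalar in g = sum_i gamma_i x_i
        alpha = alpha - step * gamma       # the update w <- w - s*g, done in alpha-space
    return alpha

# Tiny check with the linear kernel, where the answer is easy to inspect.
rng = np.random.default_rng(0)
X = rng.standard_normal((20, 3))
y = X @ np.array([1.0, -2.0, 0.5])          # noiseless linear target
alpha = kernel_gd_squared_loss(X @ X.T, y)
print(np.abs(X @ X.T @ alpha - y).max())    # close to 0: the training error nearly vanishes
```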
So here comes the beautiful thing: instead of ever computing w, we just have to compute n alphas. Why is that such a big deal when you're dealing with 2^D dimensions? Think about it: we started the lecture by saying we take our feature vector and map it into a really high-dimensional space, and then learn a classifier there. 2^D is a ridiculously high dimension; if we had to store the w vector in that space, we would use up every disk on the entire planet — even the ones in North Korea, and there are only a couple of those. What I just showed you is that we don't have to: all you ever need to store is n numbers, the n alphas, and that is independent of the dimensionality of your data. The storage this algorithm needs does not depend on the dimensionality, which is what will allow us to go into really, really high-dimensional spaces — by the end of the lecture, infinite-dimensional spaces. So that is certainly reassuring: we never store the 2^D-dimensional w. During training we just store n alphas, one for each of the n data points; it scales only with n, and n is always finite — you only ever have a finite amount of data. Any questions about that?

So we can do training by just storing these alphas. How do you do testing? At test time, h(x) = w^T x, where x is a test point — and w is super high-dimensional, and we never computed it because we were far too scared of it. So instead we write w = Σ_{i=1..n} α_i x_i and get

  h(x) = Σ_{i=1..n} α_i x_i^T x.

Now we can do the testing without ever computing w. So during training and during testing, we never need w.
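A small sketch of test-time prediction under the same assumption, verifying numerically that h(x) = w^T x = Σ_i α_i x_i^T x; w is formed here only to check the identity, which the real algorithm never does:

```python
import numpy as np

def predict(alpha, X_train, x_test):
    """h(x) = w^T x = sum_i alpha_i * (x_i^T x); w itself is never materialized."""
    return alpha @ (X_train @ x_test)

rng = np.random.default_rng(0)
X = rng.standard_normal((20, 3))
alpha = rng.standard_normal(20)                # whatever training produced
x = rng.standard_normal(3)
w = X.T @ alpha                                # formed here only to verify the identity
print(np.isclose(predict(alpha, X, x), w @ x)) # True
```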
"But you need inner products with different x's, right?" Exactly — and you couldn't have asked a better question, because that is the last piece of the puzzle. The first magic trick, in some sense, is that we can train a linear classifier in this extremely high-dimensional space without ever computing the w that defines the hyperplane: we just store these alphas, one per data point, and we wrote everything in terms of inner products. Now you're saying: wait a second, you still have to compute these inner products. When you compute K_ij = x_i^T x_j — or, at test time, an inner product between every training point and every test point — have you actually gained anything? That still seems very expensive. So here comes the second magic trick. Are you ready?

Let's go back to the feature expansion from the very beginning of the lecture, x ↦ φ(x), just because it is very convenient and it would be awesome if we could use it. We want to compute the inner product between two such expanded vectors; call the original points x and z. So x^T z becomes φ(x)^T φ(z). Raise your hand if you follow what I'm doing: we map our data from the low-dimensional space into this ridiculously high-dimensional space — the one that should scare the pants off you — and in that space the only thing we ever need is inner products. So let's do it. We take the vector (1, x1, x2, ..., xD, x1·x2, ..., x1·x2·⋯·xD) and the same thing for z, and write out the inner product:

  φ(x)^T φ(z) = 1 + x1·z1 + x2·z2 + ... + xD·zD + (x1·x2)(z1·z2) + ... + (x1·⋯·xD)(z1·⋯·zD).

That is a massive sum: it has 2^D terms to add up. But I claim that, if you think about what we're doing here, it is exactly equal to the product

  Π_{k=1..D} (1 + x_k·z_k),

and that takes only D multiplications — a few milliseconds. So computing the sum directly takes from now until, say, the Sun collapses in five billion years, and computing the product takes ten milliseconds, and the answer is the same. You can choose which one you'd rather do. Take a few seconds with your neighbor and convince yourself that these two really are the same thing.

Who thinks it's crystal clear why the two are the same? Fair enough. Essentially, take the simple case with just x1 and x2. The expansion is (1, x1, x2, x1·x2), and the product is (1 + x1·z1)(1 + x2·z2). If you multiply that out you collect all the cross terms: 1·1, then 1·(x2·z2), then (x1·z1)·1, and finally the cross term (x1·z1)(x2·z2) — which is exactly 1 + x1·z1 + x2·z2 + x1·z1·x2·z2, all possible cross terms. The number of cross terms is exactly 2^D, and that is exactly the list of terms in the big sum. Any questions about this?

So what I just showed you is that we pull two crazy tricks. We take our data, we map it into this exponentially high-dimensional space, and we run our algorithm in that space — yet we never once compute a single vector in that space; we couldn't even afford to. Because the algorithm only needs inner products in that space, and those we can compute very cheaply, it doesn't matter. You have no idea what these vectors look like; you never have to compute one; all you ever compute is the inner product between two of them. You can do training in that high-dimensional space, and the solution is exactly the same as if you had mapped your data up there explicitly. That is extremely powerful: you are now capturing every possible multiplicative interaction between features, yet you never compute a single one of them. All you need to compute, once, is the kernel matrix

  K_ij = φ(x_i)^T φ(x_j),

and each entry is just an O(D) computation — a few milliseconds, no big deal at all. So suddenly I took a very simple classifier, one that can only draw linear decision boundaries, and
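Here is a quick numerical check of that identity, reusing the same kind of explicit 2^d map as before (again only feasible for tiny d); the fast path is just d multiplications:

```python
from itertools import combinations
import numpy as np

def phi(x):
    """Explicit 2^d-dimensional map: one coordinate per subset of features."""
    d = len(x)
    return np.array([np.prod([x[i] for i in s]) if s else 1.0
                     for k in range(d + 1)
                     for s in combinations(range(d), k)])

rng = np.random.default_rng(0)
x, z = rng.standard_normal(10), rng.standard_normal(10)

slow = phi(x) @ phi(z)               # 2^10 = 1024 terms, written out explicitly
fast = np.prod(1.0 + x * z)          # just d = 10 multiplications
print(np.isclose(slow, fast))        # True: same number, wildly different cost
```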
made it exceedingly powerful. Another way of looking at this is that all we ever needed from our vector space is an inner-product function. You can view what we did as simply redefining the inner product: the inner product between two vectors is no longer x_i^T x_j; instead it is k(x_i, x_j), where k is some kernel function — and here k(x, z) = Π_k (1 + x_k·z_k) = φ(x)^T φ(z). I define a new inner product and run my algorithm with it, and I know this is well defined because I know it corresponds to some extremely high-dimensional vector space in which it is just the good old inner product you remember from high school. Any questions?

Yes — you do have to be careful: the dimensions are no longer particularly interpretable, because there is an excessive number of them and they capture different combinations of the original features. But the notion of the margin is exactly the same; the classifier doesn't change at all.

What about the loss? Let me write it down again. The loss was

  ℓ(w) = Σ_{i=1..n} (w^T x_i − y_i)^2

— squared-loss regression. Now we know w can be expressed in terms of alphas, so I can write the whole loss in terms of alphas:

  ℓ(α) = Σ_{i=1..n} ( Σ_{j=1..n} α_j x_j^T x_i − y_i )^2.

Now everything is in terms of inner products, so I plug in my new inner-product function and say x_j^T x_i is really k(x_i, x_j), the function above. Now I can compute the loss in well under a second — no big deal at all — and if you take the gradient with respect to α you get exactly the gradient descent algorithm we wrote down earlier. The same works for all the other losses we've talked about.

Good question: do you use all of the features? Yes, just use all of them — why not? Because you can assign zero weight to them. Here is the crazy part of that. You might say: wait a second, there are 2^D possible features, and some of them capture interactions between features that I don't care about, so why blow the representation up? But it's not a problem, because there is a weight for each of them: you have this gigantic weight vector w, and you can simply assign zero weight to the interactions you don't care about. The cool thing is that you never actually compute that w vector — yet implicitly, by choosing your alphas, you are assigning those weights. It's pretty crazy that you can do this without ever computing the vector.

Another question: is the data now always linearly separable? Not always — let me come back to that in three minutes. What I just showed you is one very carefully constructed feature expansion — every possible multiplicative interaction between features — and I showed you that it can be written in this very compact form. The natural question is: are there other inner-product functions we could use? It turns out
there are tons and tons of them, and you can make your own — and if it catches on, you can name it after yourself, or after a loved one. Let me give you a few examples of inner-product functions that people use all the time. We call these kernel functions, for good reason: what we are really doing is mapping our data into the reproducing kernel Hilbert space that the kernel defines.

The linear kernel is very simple: k(x, z) = x^T z. If you use the linear kernel, the algorithm becomes the good old linear classifier you're all familiar with from before the midterm.

Another one is the polynomial kernel: k(x, z) = (1 + x^T z)^p for some constant p (call it p or d, it doesn't matter). If p = 2 you can model quadratic functions; if p = 1 it is just the linear kernel plus a constant that doesn't matter, so you're learning linear functions again; p = 3 gives cubic functions, and so on.

The most famous one — this is the Brad Pitt of kernels; people sometimes faint when they see it — is the radial basis function kernel, the RBF kernel:

  k(x, z) = exp( −‖x − z‖^2 / σ^2 ).

That should look familiar: it is the cousin of the Gaussian distribution. The RBF kernel is popular for several reasons. Number one, you can prove that it is a universal approximator, which means it can fit any function arbitrarily closely, given a few assumptions. Going back to the earlier question — is everything now linearly separable? — with the RBF kernel the answer is essentially yes: every problem becomes linearly separable, provided you don't have two identical data points with different labels (the one ridiculous exception; remove one of those points and everything works). So you can learn anything you want with just a simple linear classifier. The RBF kernel corresponds to a feature space that is infinite-dimensional: there exists some φ such that k(x, z) = φ(x)^T φ(z), but φ(x) is infinitely long, so you can never write it down completely — though you can approximate it pretty well. It is arguably the most popular kernel; on most problems, if you use one kernel out of the box, this is the one that usually works best.

And there is the connection with the Gaussian distribution: in some sense, what the RBF kernel does is take your data set and put many little Gaussians around every single data point — here is the Gaussian, here is my training point — and σ tells me how wide these Gaussians are.
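A sketch of the three kernels as plain functions; note that conventions differ (some texts put 2σ^2 in the RBF denominator, and the polynomial degree is written p or d), so treat the exact constants as illustrative:

```python
import numpy as np

def linear_kernel(X, Z):
    return X @ Z.T

def polynomial_kernel(X, Z, p=2, c=1.0):
    return (c + X @ Z.T) ** p

def rbf_kernel(X, Z, sigma=1.0):
    # k(x, z) = exp(-||x - z||^2 / sigma^2), matching the lecture's convention
    sq = np.sum(X**2, 1)[:, None] + np.sum(Z**2, 1)[None, :] - 2 * X @ Z.T
    return np.exp(-sq / sigma**2)

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 3))
print(np.allclose(np.diag(rbf_kernel(X, X)), 1.0))   # True: k(x, x) = 1 for the RBF kernel
```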
By doing this you get a space where the coordinates essentially say how similar you are to each given data point. The reason you can approximate anything you want — it is somewhat similar to the nearest-neighbor proof — is that if you have enough training data, there is always some nearby training point; you put a Gaussian around that point, and that Gaussian defines one dimension of your infinite-dimensional space. It is a little hard to picture these infinite-dimensional spaces, but you get used to it; it's a nice world to live in.

There are a few more kernels — I wrote them down, but few people use them; the most common one is the RBF kernel. Now comes the important part, which we'll go into more next lecture: what is a well-defined kernel? Can you just take any function of two inputs, call it your kernel, and crown yourself? The answer is no: it has to be a positive semi-definite function. That means that if you compute the matrix K for any set of vectors, that matrix has to be positive semi-definite. Who knows what positive semi-definite means? It was mentioned briefly on the homework. A matrix K is positive semi-definite — written K ⪰ 0 — if and only if for every vector q we have q^T K q ≥ 0; equivalently, if and only if K can be decomposed as K = Z^T Z; equivalently (since K here is real and symmetric), if and only if all its eigenvalues are non-negative.

The K = Z^T Z form is the one you really want, and it should make it obvious why we need this condition. Why does this decomposition tell you the kernel is well defined? Exactly: each column of Z is one particular φ(x_i) — these are the high-dimensional feature representations — and K is then just their inner products. And Z does not have to be ridiculously high-dimensional, because the matrix K is only over a finite number of points, and a finite number of points always lie in a finite-dimensional space (you could just do PCA on them). So even with an RBF kernel you can take the kernel matrix and compute these Z's. The important thing is that for any set of points, the kernel matrix K admits this decomposition. That is basically the only condition you have to satisfy.
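A minimal sketch of both checks: the eigenvalue test for positive semi-definiteness, and the decomposition K = Z^T Z recovered from the eigendecomposition, where the columns of Z play the role of the (finite-dimensional) φ(x_i):

```python
import numpy as np

def is_psd(K, tol=1e-10):
    """A valid kernel matrix is symmetric with no negative eigenvalues."""
    return np.allclose(K, K.T) and np.linalg.eigvalsh(K).min() >= -tol

rng = np.random.default_rng(0)
X = rng.standard_normal((6, 3))
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-sq)                            # RBF Gram matrix with sigma = 1
print(is_psd(K))                           # True

# The decomposition K = Z^T Z: columns of Z play the role of the phi(x_i).
lam, U = np.linalg.eigh(K)
Z = (U * np.sqrt(np.clip(lam, 0.0, None))).T
print(np.allclose(Z.T @ Z, K))             # True
```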
The beautiful thing about this is that it is extremely flexible. Once this came out, people went nuts: people in biology defined kernels over molecules, over DNA sequences — there is no vectorial representation, and you don't need one. You just have to define an inner product whose matrix is always positive semi-definite. That is extremely powerful: you can define inner products over sentences, over all sorts of weird data structures, and now all these linear classifiers, like SVMs, suddenly apply to data sets that you don't even know how to write down as vectors — yet you can define a linear classifier on them. All you need to know is that in principle there exists some mapping from this data into some (possibly infinite-dimensional) space — you don't know it, but it doesn't matter, because as long as you can compute the inner product you are fine. When this came out, the whole machine learning world went crazy. This was around 2000; it came in with SVMs, through Corinna Cortes and Vladimir Vapnik, who introduced it for their support vector machines, and it's why support vector machines became so incredibly popular. For something like five to ten years the whole field did little else but map everything into kernel spaces and analyze them very, very thoroughly.

Let me show you a little demo so you can see what this looks like in practice. Any questions first — including why you can't use a truly terrible kernel? (The zero kernel maps everything to zero, for example.) What you want is a kernel that says points are similar when they tend to have similar labels; the more of what you know about your data set you can encode, the less data you need to train. Ultimately, though, people typically just use the RBF kernel. Any more questions? Yes — are we still limited by the alphas? No: all these loss functions are convex, so you can optimize them to arbitrary precision; you can use second-order methods with the Hessian and make it as precise as your computer can represent.

Okay, so here is an RBF kernel doing regression. You just draw some points and it fits them with the RBF kernel. What I'm changing here are the regularization term and σ. Down here, if σ is very, very wide, it essentially becomes a linear fit — the Gaussians of the RBF kernel get really wide. If I make σ smaller (going upwards) and regularize less, then at the top left you get an extremely powerful classifier — so powerful that I have a variance problem: it over-specializes to every single point; each point gets its own spike, and I'm just memorizing my training set. That shows you the RBF kernel can learn anything, but this is too much and won't generalize well. Somewhere in the middle you see beautiful curves that capture the data set very nicely. So all you have to do now is find your λ and your σ: find the right hyperparameters and you can fit any data set you want. Let me show you another demo, this time for classification.
Let's make a really nasty data set. Any suggestions? A smiley face — all right. So what are the two classes in the smiley face? These are the positive points, and for the negative points let me just put the other ones all around it — so this is a smiley person who has freckles. Good, now let's run this. First a linear classifier — a few lectures ago you would, hopefully, have laughed at me for trying. The linear classifier does essentially nothing: it can't even draw a decision boundary and runs into precision errors. If I show you the prediction values, it classifies everything at around 0.2 because it has no idea what to do. Now I run the RBF kernel — ta-da. The blue region is the negative class, the crosses are the positive class, and the white lines are exactly the points that are one margin-width away from the decision boundary — those are my support vectors. And by the way, this decision boundary is a straight line, a hyperplane, in the high-dimensional space; it only looks curved because we are limited to viewing it in 2D. I can also show you, as a 3D landscape, how positive or negative each point is — you can see the smile: it dips negative into these valleys.

All right, let's leave it on a high note. See you next Friday.
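The demo itself isn't available here, but a rough stand-in, assuming scikit-learn, makes the same point with concentric circles instead of a smiley face: the linear kernel sits at chance level, while the RBF kernel separates the data:

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Stand-in for the in-class demo: a data set no linear classifier can handle.
X, y = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)

linear = SVC(kernel="linear", C=1.0).fit(X, y)
rbf = SVC(kernel="rbf", C=1.0, gamma=2.0).fit(X, y)   # gamma plays the role of 1/sigma^2

print("linear kernel accuracy:", linear.score(X, y))  # roughly chance level
print("RBF kernel accuracy:   ", rbf.score(X, y))     # essentially 1.0
print("support vectors per class:", rbf.n_support_)
```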
Info
Channel: Kilian Weinberger
Views: 12,491
Rating: 4.9652176 out of 5
Keywords: artificial intelligence, machine learning, kilian weinberger, cornell, cs4780, kernels
Id: FgTQG2IozlM
Length: 51min 54sec (3114 seconds)
Published: Mon Nov 12 2018