Lecture 4 "Curse of Dimensionality / Perceptron" -Cornell CS4780 SP17

Video Statistics and Information

Captions
All right, good. I hope you enjoyed the last two lectures. We actually do this every year, this little bit of math background review. This time I was out of town, so my TA was nice enough to sub for me. I know some people knew all the material already, but for every person who knows everything there are probably five people who forgot something, and it's certainly valuable, because we will dive into statistics very soon. Project 1 and Project 0 are both still out, and Project 2 will be out very soon, so if you haven't started Project 1 yet, please don't fall behind; the projects overlap, so don't let it all accumulate. Any questions about logistics? I know some people just joined today, and today is the last day of adding. If you don't yet have a course account, please let the TAs know on Piazza, and hand in your placement exam.

All right, so we were talking about the k-nearest-neighbor classifier. If you remember correctly, the assumption the k-nearest-neighbor classifier makes is that similar points have similar labels. The idea is very, very simple: you have a test point, you look for the nearest training point, and you just steal the label from that nearest neighbor. You assume that if it's really, really similar, then it likely has the same label as the test point. That's a fair enough assumption, and then we went through a little proof showing that if you have enough data — as n goes to infinity — this classifier can have at most twice the error rate of the Bayes-optimal classifier, which is the best you can possibly do. So that's all very promising.

That was the upside; then comes the downside, and that's the curse of dimensionality. Let me go over it one more time. What we said is: imagine data drawn uniformly at random from the unit hypercube, where every edge has length one. I'm drawing three dimensions because it's hard to draw d dimensions, but imagine d is really large. The question was: take any point in here, and ask what is the smallest little cube, with edge length l, that encapsulates the k nearest neighbors of this point. So you find the k nearest neighbors, draw a tiny box around them, and ask how big that little box has to be. The math was quite simple: the volume of the little box is just l^d, since we're in d-dimensional space and the cube has edge length l, and we know the box contains k of the n points. Because the points are uniformly distributed, the ratio of the volumes has to be roughly the ratio of the points, so l^d is roughly k/n. Raise your hand if that's clear so far. Okay, awesome. Given this relation we can solve for the size of l, and we get l roughly equal to (k/n)^(1/d). Let's say k is 10 and n, I think we said, is a thousand, so this is (1/100)^(1/d), and we can solve this for several values of d. Here comes the shocker: for d = 2 it's not a big deal, l is about 0.1, so roughly a tenth of the square. For d = 10 it's 0.63, so the box is already that large. For d = 100 it's 0.955, and for d = 1000 it becomes 0.9954. So basically, in a thousand-dimensional space, we have this box from which the data is sampled, and for any given point, the box that contains only its k nearest neighbors is essentially the same size.
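A minimal sketch, in plain Python (not from the lecture itself), that reproduces these numbers from the relation l^d ~ k/n:

```python
# Edge length l of the smallest cube expected to contain the k nearest
# neighbors of a point, when n points are drawn uniformly from the
# d-dimensional unit hypercube: from l^d ~ k/n we get l ~ (k/n)^(1/d).
k, n = 10, 1000
for d in (2, 10, 100, 1000):
    l = (k / n) ** (1.0 / d)
    print(f"d = {d:4d}   l ~ {l:.4f}")
```

This prints roughly 0.10, 0.63, 0.955, and 0.995, matching the values above.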
So what does that mean, and why is this so troubling? Why am I making such a big fuss? It basically means that the interior of this space must be empty. We have a thousand points, and only ten of them are in this whole interior; all the other 990 points are squeezed in between these two boxes, near the boundary. Why is that so problematic? The fact that this little box is so big can only mean that the k nearest neighbors are themselves out near the edges — otherwise you could draw a smaller box — and the points that are not the k nearest neighbors are squeezed right between these two boxes. So here is my point, the k nearest neighbors are here, and the points that are not the k nearest neighbors are right next to them. The k nearest neighbors are not close at all; there is no notion of closeness. They can't be close, they are really far away at the edges, all the points are really far away at the edges, and all points have roughly the same distance from each other. It's somewhat counterintuitive: our brain is made for three-dimensional spaces, and in three-dimensional spaces things behave very differently — drop points randomly and they land in the interior. In high-dimensional spaces that's never the case; the interior is empty, there's never anything there, and everything is far away from everything else. That means our k-nearest-neighbor assumption, that nearby points have the same label, is nonsense, because there is nothing nearby: everything is far away, and everything is at about the same distance from everything else. The k nearest neighbors are here, but many, many other points are right next to them, and the difference in distance is negligible. So it seems unreasonable to say, here is the nearest neighbor and here is another point, and clearly this point should have the same label as this one but not that one, even though they are not significantly further apart.

Let me give you one more intuition for why everything ends up at the edges. Think about it this way: if I draw a point uniformly at random in this cube, what am I doing? I can just draw every single coordinate independently. I draw this coordinate, it ends up here, then this coordinate ends up here, then this one, and I combine these directions and now I have a point. Raise your hand if that makes sense. Okay. Now I do this over and over: each edge has length one, and I draw one coordinate after the other, uniformly at random between 0 and 1. To be in the interior, I can't be at the edge. So what's the probability that I'm not at the edge? Let's define the edge as being within epsilon of 0 or of 1; then the interior has length 1 - 2*epsilon, and these two strips of width epsilon are my edges. If epsilon is small, it's very, very likely that I end up in the interior — that's why your intuition from low-dimensional spaces is that randomly drawn points land in the interior. But here comes the change when you do this in d dimensions and d becomes large. The probability that I end up in the interior in one dimension is exactly 1 - 2*epsilon. The probability that I end up in the interior in every single dimension is (1 - 2*epsilon)^d, because in every single dimension I must avoid the edge; if there is even one dimension in which I'm at the edge, I'm an edge point. Raise your hand if that makes sense. Awesome. Now, 1 - 2*epsilon is less than 1, and any number less than 1 raised to a large power goes to zero very quickly. That means the probability of hitting the interior is basically zero.
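As a quick sanity check on this argument, here is a small sketch (plain Python, with epsilon chosen arbitrarily as 0.05):

```python
# Probability that a uniformly drawn point lands in the "interior" of the
# unit hypercube, i.e. every coordinate is at least eps away from 0 and 1.
eps = 0.05
for d in (1, 2, 10, 100, 1000):
    p_interior = (1 - 2 * eps) ** d
    print(f"d = {d:4d}   P(interior) = {p_interior:.3g}")
```

Even with epsilon this small, by d = 100 the probability of being in the interior is already on the order of 10^-5, and by d = 1000 it is astronomically small.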
Any questions? Yeah — why did I tell you this? Right, I'm telling you that this doesn't work, that high dimensions don't work, but then I run k-nearest neighbors on pictures, and pictures have thousands of dimensions. So what's going on, how can this be possible? The key is that pictures are not uniformly distributed; they have a very, very different distribution. And here comes the key point: in general, if you make no assumptions about the space and you have high-dimensional data, k-nearest neighbors will not work. But it could be that you have high-dimensional data — your ambient space is high-dimensional — yet inside this high-dimensional space lies a much smaller subspace, and your data only lies in that subspace; you never actually see any points off that plane. You could, for example, imagine a two-dimensional plane embedded in a thousand-dimensional space. Then k-nearest neighbors still works, because your data is essentially just two-dimensional; you just have a very high-dimensional ambient space, and who cares — if that's the case, you're fine. So the assumption of k-nearest neighbors is that you have low intrinsic dimensionality, that's what it's called, and you have that either if your data lies in a low-dimensional subspace, or if it lies on a low-dimensional sub-manifold. What is that? You could be in a high-dimensional space, but your data could lie on a surface that is curled up. Let me draw this: here you have a surface like this.
The surface may be, say, ten-dimensional, but it sits in a thousand-dimensional space, and it can be curled up in any way, so you may really need the full thousand-dimensional space to represent your data. In the flat subspace case you could just project onto the two dimensions and you'd be fine — it seems silly to keep that data in a thousand-dimensional representation — but in the curled-up case you may need the whole ambient dimensionality, because the surface itself explores all thousand dimensions. The data, however, never leaves the underlying manifold, which is low-dimensional, and that's the key.

Let me just explain what a manifold is — who has heard of manifolds before? Raise your hand if you've heard of them. A manifold is basically a surface that is low-dimensional; there are Riemannian manifolds and many different definitions, but the one we use boils down to two properties: locally it is Euclidean, and globally it need not be. If you were a little ant living here — congratulations — and you moved around, you would have no idea that this is actually a curved space; locally it looks completely flat. You're laughing, but that's exactly what we do: we humans live on a big sphere, which is a manifold, and this is why for centuries, for millennia, people were convinced the Earth is flat. If you are a tiny human on this gigantic Earth, it looks flat locally. So locally, treating the space as Euclidean is totally appropriate, even though globally it is not Euclidean at all. That's what you have on a manifold: locally, Euclidean distances are valid; globally, they are not. For example — let me switch to a better example — the manifold could curl up like this. Then this point and this point are actually closer in Euclidean distance than this point and that one, but on the manifold, if you wanted to go from this point to that point, you would have to traverse all the way around, so the true distance between these two points is much, much larger. So globally, Euclidean distances don't work on manifold data. But that's okay, because for k-nearest neighbors we only use local distances: you only look at the k nearest neighbors, and that works. So what you basically have here is that all the points in this region have one label, all the points in that region have another label, and if you have a point here, you find its nearest neighbors and they really are close. Any questions about this? Yeah — that's exactly right. There are all sorts of tricks for estimating whether your data lies on a subspace or on a manifold. For a subspace, for example — who has heard of principal component analysis or the singular value decomposition? Raise your hand if you've heard of those. Okay. That's essentially what these algorithms do: they find a new coordinate system that captures most of the variance of the data. In this case, for example, you would find a coordinate system centered here that only has two dimensions, and you would see that in the remaining dimensions you have essentially no data anymore, so you can just drop them.
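Here is a hedged sketch of that idea in NumPy; the synthetic data, the true dimension of 2, and the ambient dimension of 1000 are all made up for illustration. You center the data, take an SVD, and look at how quickly the singular values drop off:

```python
import numpy as np

rng = np.random.default_rng(0)
d_ambient, d_true, n = 1000, 2, 500

# Synthetic data that truly lives on a 2-D subspace of a 1000-D space.
basis = rng.standard_normal((d_true, d_ambient))
X = rng.standard_normal((n, d_true)) @ basis

Xc = X - X.mean(axis=0)                    # center the data
s = np.linalg.svd(Xc, compute_uv=False)    # singular values, largest first
variance_ratio = s**2 / np.sum(s**2)
print(variance_ratio[:5])                  # first two carry ~all the variance
```

If all but the first few ratios are essentially zero, you can project onto those few directions and run k-nearest neighbors there.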
Manifolds are a little trickier. What people do is draw little spheres around a point and keep doubling the radius, and see how much bigger the data gets. Say you have a point here: you draw a little sphere around it and measure how many points fall inside, then you double the radius and count how many points you have now. If the data is a two-dimensional surface, doubling the radius should give you roughly four times as many points; if it's a three-dimensional surface, eight times as many; and so on. So you can actually use this to estimate the intrinsic dimensionality of your data. Exactly right — and if your k-nearest-neighbor algorithm doesn't work, that's a good test to run to figure out why. Any other questions?

One thing that also helps is to think about what the true dimensionality of your data is. Take faces, for example: images of faces may have ten thousand pixels, but clearly you don't need ten thousand attributes to describe a face. If you describe someone — say a new boyfriend or girlfriend to your mom — you wouldn't say, well, the top-left pixel is green. Or say someone steals your wallet and you go to the police, and they have a sketch artist: he or she could probably, with maybe twenty or thirty questions, produce a pretty good picture of the person you have in mind. That means faces maybe live in a thirty- or forty-dimensional space, not in the ten thousand dimensions of the pixel representation. Any more questions?
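A minimal sketch of the sphere-doubling heuristic described above; the function name, the radius, and the toy data (a 2-D plane embedded in 50 dimensions) are my own choices, and in practice you would average the estimate over many center points:

```python
import numpy as np

def doubling_dimension_estimate(X, center, r):
    """Estimate the intrinsic dimensionality around one point: if doubling
    the radius multiplies the neighbor count by roughly 2^d, then d is
    about log2(count(2r) / count(r))."""
    dists = np.linalg.norm(X - center, axis=1)
    c1 = np.sum(dists <= r)
    c2 = np.sum(dists <= 2 * r)
    if c1 == 0 or c2 <= c1:
        return None  # not enough points at this scale
    return np.log2(c2 / c1)

# Toy example: points on a 2-D plane embedded in a 50-D ambient space.
rng = np.random.default_rng(1)
Z = rng.uniform(size=(20000, 2))
X = np.zeros((20000, 50))
X[:, :2] = Z
print(doubling_dimension_estimate(X, X[0], r=0.05))   # roughly 2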
All right, let me show you a little demo. And by the way, this is a good preprocessing step in general: whenever you work with data, it's a good idea to first try to reduce the dimensionality. A lot of algorithms scale badly, or behave badly, when the data is very high-dimensional, and a lot of data sets don't need the high dimensionality they come in. So running PCA or some other dimensionality-reduction algorithm should be one of the first tools in your toolset.

Okay, does this work? Oh, I see — awesome. The first thing I want to show you is actually a demo that I owe you in some sense, that I wanted to do last time and didn't get around to: a k-nearest-neighbor demo. I just want to visualize the k-nearest-neighbor classifier; it's essentially the same as what you have to do in the homework. Let's say we just draw a data set — I don't know, maybe a banana, another banana, and how about an umbrella? Good, and then let's say we have some — see, that's why I'm not an artist. So there is my data set: I have circles and I have crosses. If I now run the k-nearest-neighbor classifier — what you see here is the k-nearest-neighbor classifier with different distance metrics, and this is the 1-nearest-neighbor classifier, hopefully large enough to see. What I draw here is basically: any point in the red region would be classified as a cross, any point in the blue region would be classified as a circle, and where it is white you are right at the decision boundary, so it can go either way. You can see what difference the distance metric makes; in this data set it's not so drastic, but for example here, with the city-block distance, you get this weird stripe that's still blue. Someone asked last week what happens when you increase k, and we can do that now: we can increase k to 3, and to 5, and so on. What you see in this case is that I drew more circles than crosses, so the circles become more dominant around where they are, but eventually the decision boundary should smooth out. Maybe let me draw a slightly suboptimal data set — and eventually, as k approaches n, the prediction just becomes the mean label. You see that here: the whole top is now blue, because the average is blue, and as we keep increasing k, eventually the majority just overrides everything. That's advantageous if some of these points are errors — if they're mislabeled — and you don't want to overreact to them: globally this still looks like a blue region, because there are a lot of circles here. Any questions about this demo? Any data set you want to see that I will attempt to draw?

All right, let me show you another demo. This is the curse of dimensionality, and this is also on your sheet, but let me explain what I do here. I draw data points within the hypercube, exactly like we just talked about: I take the hypercube, draw n points uniformly at random, and then compute the distance between every pair of points. What you see here is a histogram that shows how many pairs of points have a given distance from each other. The maximum distance you can have from each other in two dimensions is exactly the square root of two — that's when you sit in exactly opposite corners — and of course only very few pairs of points are right at that value. The y-axis is how many pairs of points have that distance, and in two dimensions the bulk of them have a distance of around 0.5. Now, as we increase the dimensionality, the distribution gets squished together, and in ten dimensions you already have hardly any pairs with distance less than 0.5. That's amazing: it basically means there are no neighbors anymore; no points are within 0.5 of each other, everybody is far away from everybody else. That's the fascinating part. K-nearest neighbors works great in the low-dimensional regime, where you can say, here is my point, some of its neighbors are really close, and therefore they probably have the same label. But in ten dimensions that disappears; you have no close neighbors anymore — it's like the suburbs, everyone is spread out. In a hundred dimensions it gets even more extreme, and in ten thousand dimensions you have this ridiculous distribution where there are basically no neighbors at all — it's like farms in Kansas, everything is really far apart and nobody has any neighbors. And now you can see that it's just ridiculous to say, oh, this one point is slightly closer than that other point: they all have essentially the same distance from each other, every point has the same distance from every other point, so there is no such thing as a nearest neighbor, and we can't run nearest neighbors in this regime.
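The demo itself isn't reproduced here, but a small sketch along the same lines (the numbers of points and dimensions are arbitrary, and SciPy's pdist is used for the pairwise distances) shows the same concentration effect:

```python
import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(2)
n = 500
for d in (2, 10, 100, 10000):
    X = rng.uniform(size=(n, d))
    # All pairwise Euclidean distances, rescaled by sqrt(d) so that
    # different d are comparable (the largest possible distance is sqrt(d)).
    dists = pdist(X) / np.sqrt(d)
    print(f"d = {d:5d}   min {dists.min():.3f}   "
          f"mean {dists.mean():.3f}   max {dists.max():.3f}")
```

In low dimensions the minimum is near 0 and the maximum near the diagonal; in very high dimensions the minimum, mean, and maximum all collapse to roughly the same value, so no point is meaningfully "nearer" than any other.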
Any questions? Yes — yes, it would be the same. I used the l2 distance here, but it's the same effect with other metrics. Sometimes you can make it work with special metrics — for images, for example, there are certain metrics that work better — but usually what those do is take advantage of the fact that your data truly lies in a low-dimensional space, maybe by first applying a transformation that brings that out. It's really a property of the space.

Sorry — jeez, you all ask questions — all right, let me think about this. With the max-norm you're basically saying you look at the maximum coordinate difference. No, it's still the same, because that maximum would still be very large: it's very unlikely that there isn't a single dimension along which you have a large difference. It's the same as the probability argument I made earlier — you draw two points coordinate by coordinate, and sooner or later they end up far apart. We can talk about it offline.

Any more questions? Is that a question, or are you waving? Oh, no, okay. One more time? Oh — I see, I see what you're saying. So what he's saying is the following: okay, all the points have roughly the same distance, but not exactly the same distance; they differ by a tiny little bit. So suppose I have a super precise computer — the latest, most awesome machine with 1000-bit-precision floating-point numbers — so I can still tell the difference between that point and that point. Can I then still use nearest neighbors? No, and the answer is that these two points, even if I can rank them, are just not close. That's the problem. The assumption is that similar points have similar labels, but these points are actually really far away from each other, so who knows whether they share the same label or not.

Yeah — that always depends; it's really always about the intrinsic dimensionality. Let me put it another way: if your data does not have a lower intrinsic dimensionality, then you probably shouldn't be analyzing it, because it's probably not interesting. Data that is uniformly sampled is just not very interesting: there is no pattern to find, it's uniform. So in some sense, any data in which you are trying to find something reasonable has some low-dimensional structure. And by low-dimensional I mean it can lie on a low-dimensional subspace, or on a low-dimensional manifold — or, the other possibility, it could have clusters.

Okay, maybe a last question. Yeah — does it have something to do with how the points are drawn, because of this tail here? Well, no, it's just because the points are confined to a hypercube — sorry, a cube — and in the hypercube there is always a maximum distance you can have.
And it just gets cut off — oh, you mean that it goes up at the end? I see. I think that's more a function of the mean distance approaching the maximum distance — well, no, actually, it is a property of the curse of dimensionality, because all the points are at the edges, and if everything is at the edges, pairs tend to be near the maximum distance from each other. In the very first case, in two dimensions, I said very few pairs have exactly square-root-of-two distance, because that only happens if you sit in exactly opposite corners of the square, which is very unlikely. But in a high-dimensional space you are essentially always at opposite corners along some of the edges. So it is a property of the dimensionality, in this particular setup. Okay, very last person.

No? All right, let me give you an example. Imagine the following classification problem: I pick a random American and I want to know which baseball team they cheer for. If I pick a person who lives in Chicago — well, I don't know which team they cheer for — but I look at their neighbor, and the neighbor is a huge Cubs fan, so I say this person is probably also a Cubs fan. But now, in the high-dimensional case, it's as if I look at the two nearest neighbors and one is in Cleveland and the other is in San Francisco. They are both really far away, and they just don't tell me anything about this person in Chicago anymore. That's the problem. Yes, the one in Cleveland is a little closer than the one in San Francisco, arguably, but the prediction is probably still wrong, because the labels change more quickly than your sample density. Okay, any more questions? I'm very proud of myself — that's my very first baseball reference, and I actually know nothing about baseball. My wife would be very proud of me.

All right. So k-nearest neighbors makes this assumption that nearby points have similar labels. Can anyone tell me some advantages and some disadvantages of k-nearest neighbors? What could be a problem with it? I showed you — besides the curse of dimensionality that we just talked about — that if n becomes really, really large, this truly becomes a very good classifier, assuming your data is intrinsically low-dimensional and you collect a lot of data: as n gets large, the k-nearest-neighbor classifier becomes really accurate. So what could be a problem? Yeah — I'm going to skip you because you always say something. Yes: if you have many, many classes, you need even more data. That's right, absolutely. But let's say we go all out: we collect as much data as we can to make it really accurate, we collect data until we get 99% accuracy. Why could k-nearest neighbors still be a bad algorithm in practice? Yeah — it may not do very well near the boundaries? Well, then you just have to collect even more data. Sorry — assume the labels are discrete? Fair enough, let's assume the labels are discrete; if they are not discrete, it's regression, and you can still do that, though.
Yes, let's assume the assumptions hold. I want to hear something practical — think about it, you had to code this up, those of you doing Project 1 already. Why could it be bad, now that you've made me say "collect even more data" five times? So we've collected a ton of data, a gazillion data points. Yeah — that's right: you have to sift through all of your data points. You have these gazillion data points, and for a test point you have to compute the distance to every single one of them, and that is going to take a really long time. For the computer scientists among you: the complexity during test time is O(n * d), because for every test point I go through every single point in my training set and compute a distance in d dimensions. Who knows Big-O notation — raise your hand if that makes sense. Okay, awesome. That basically means my algorithm scales linearly with n: how long it takes to run is a linear function of n, so if I double my data set, the algorithm takes twice as long. And that is prohibitive — think about Google or someone like that, they have classification problems on billions of data points; you can't compare against every single one every time you make a prediction, for example every time someone logs on to Google News. So that's a real limitation of k-nearest neighbors: it becomes really, really good as n becomes large, but it also becomes really slow. And in most applications the time you take during inference — inference is when you actually make a test prediction — really matters; someone has to pay for that computation.
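To make that O(n * d) concrete, here is a hedged sketch of a brute-force 1-nearest-neighbor prediction; the function and variable names are illustrative, not the project's code:

```python
import numpy as np

def nn_predict(Xtrain, ytrain, xtest):
    """Brute-force 1-nearest-neighbor prediction.
    Computes the distance from xtest to every training point, so the
    cost per test point is O(n * d)."""
    dists = np.linalg.norm(Xtrain - xtest, axis=1)   # n distances, each O(d)
    return ytrain[np.argmin(dists)]
```

Doubling the number of training points doubles the work done for every single prediction.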
Let me show you another algorithm that doesn't have that downside — an algorithm that was actually invented before k-nearest neighbors, invented here at Cornell University, and was actually the first machine learning algorithm: the perceptron, by Frank Rosenblatt, 1957. The perceptron makes a very different but equally easy-to-understand assumption. It says: assume you have some data, x's and o's, noughts and crosses; my assumption is that there must be a hyperplane — here, a line — that separates the one class from the other. For now we assume we just have two classes, a binary classification problem: there exists a hyperplane such that all the points of one class lie on one side and all the points of the other class lie on the other side. Now I know what you're thinking: that seems like a crazy coincidence — what if one circle were over here, the whole thing goes out the window, it doesn't work. But think about what the curse of dimensionality told us: in high-dimensional spaces, data points tend to be far away from each other. In low-dimensional spaces this assumption often doesn't hold; in high-dimensional spaces it almost always holds. In fact, later in the course we will take our data and map it into an infinite-dimensional space, and in infinite-dimensional spaces you can show that this essentially always holds — you can always find such a hyperplane. Granted, infinite-dimensional spaces are hard to visualize, but it becomes very powerful.

So in some sense the perceptron is kind of the opposite of k-nearest neighbors. K-nearest neighbors you want to use in low-dimensional spaces, for two reasons: a, the curse of dimensionality doesn't kick in, and b, it's a lot faster there, because computing all these distances is slow — it scales linearly with the number of dimensions. The perceptron you want to use in high-dimensional spaces, because that's where its assumption is satisfied. It's important to remember these things, because as a data scientist you should develop an intuition for which algorithm works on which data sets. Good. So, assuming such a hyperplane exists, the perceptron algorithm tries to find it. The assumption is just that such a hyperplane exists, and our goal is to find one — not a particular one, any of them will do. Mathematically, how do we define a hyperplane? A hyperplane is defined by a vector w and an offset b, as the following set: H is the set of all x such that w^T x + b = 0. That always defines a hyperplane, which has one dimension less than the ambient space; in this two-dimensional space the hyperplane is one-dimensional, a line. Any questions at this point? Raise your hand if that makes sense. Okay, good.

So we try to find a hyperplane such that all the points with one label are on one side and all the points with the other label are on the other side. And what do we do during test time? You get a new point — here is my mystery point, I don't know what it is — and I just look at which side of the hyperplane it lies on. The nice thing is that this always takes the same amount of time, no matter how many training points I have: you just compute w^T x + b and look at the sign. If the expression is greater than 0, you lie on one side; if it's less than 0, you lie on the other side; and if it's exactly 0, you lie on the hyperplane. Any questions? Yeah — this particular algorithm does not work for continuous labels; we make two assumptions: the labels are binary, and there is a linear hyperplane that separates the data, that is, the data is linearly separable. You can extend it to multi-class, and we will get to that later, and there are of course other algorithms, also linear, for regression, but those are different algorithms. Any more questions? Good question — if a point resides exactly on the hyperplane, then you get out a one-dollar coin and flip it: heads you say one thing, tails the other. It could well be that a test point lies on the hyperplane: during training we will find a hyperplane that separates the classes exactly, if one exists, but a test point could very well lie on it, and then you just flip a coin. Yeah — what if the data isn't linearly separable, say your x's are here and your circles are there? Well, in that case your assumption doesn't hold: there is no hyperplane that separates these two, not in 2D. But there is a very simple trick to make it linearly separable, and we will get to it in a few lectures: if you map the data into a higher-dimensional space, it immediately becomes separable, and that's what we will do. So you can still run the perceptron; you just have to apply a little trick first.
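A minimal sketch of this test-time rule; the function name, the inputs, and the tie-breaking coin flip are illustrative, and this is only the prediction step, not the perceptron training algorithm (which comes in the next lecture):

```python
import numpy as np

def hyperplane_predict(w, b, x):
    """Classify x by which side of the hyperplane {x : w^T x + b = 0} it lies on.
    Cost is O(d), independent of the number of training points."""
    s = np.dot(w, x) + b
    if s > 0:
        return +1
    if s < 0:
        return -1
    return np.random.choice([-1, +1])   # exactly on the hyperplane: coin flip
```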
Is the hyperplane unique? No, it's not — that would be a crazy coincidence. There are always many, in fact almost always infinitely many. All right, let me tell you how to find such a hyperplane. One more time, we assume our labels are binary; let me formalize this: our label set is Y = {-1, +1}. Say these are my x's, the +1 class, and these are my circles, the -1 class. So we have two different classes, and I would like to find a hyperplane such that all the positive points are on one side and all the negative points are on the other side, and for now assume someone told me that such a hyperplane exists.

The first issue is that we have to learn two things: we have to learn w and we have to learn b, and that's a pain — it would be much nicer if we only had to learn one thing. We can easily arrange that; there is a very simple trick to get rid of the b. Let's assume there is no offset, using the following trick: every data point x_i is mapped to the vector with all the previous dimensions of x_i and then a little 1 appended below it, and whatever vector w we are learning becomes the old w with a little b appended below it. And I claim that the inner product between these two new vectors is exactly the old w^T x + b. Raise your hand if that makes sense. Okay, awesome. So with this little transformation, the first thing we do is take all our data and add one more dimension that is always 1, and then we just find a weight vector with one more dimension, whose last entry is actually our b. And now we define the hyperplane as something slightly different: the set of all x such that w^T x = 0. We basically assume the b doesn't exist anymore; we absorbed it into the data, and we just learn this one vector w instead.

Geometrically, what's going on? Geometrically, our hyperplane now has to go through the origin — it always goes through the origin, because there is no offset anymore. The w gives you the orientation of the hyperplane, and b was the offset. So now we have no offset, but we added one dimension to our data, and in some sense what we did is move all our data one unit off in this additional dimension, because in that new coordinate we set everything to one. If beforehand we had data like this — x's here and o's here — we now take this whole plane and shift it over by one, so that the x's and o's live here; the entire data set lies in this shifted plane. Beforehand we could find a hyperplane like this; now, in the three-dimensional space, we find a two-dimensional hyperplane through the origin that looks like this — it goes through here — and if you intersect it with the plane the data lies on, you get exactly the same solution. Raise your hand if that makes sense. Awesome, good stuff. We have two more minutes, but maybe let me stop here, and on Monday we continue with the perceptron.
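To make the bias-absorbing trick at the end of the lecture concrete, here is a tiny sketch (the numbers are arbitrary) checking that appending a constant 1 to x and appending b to w reproduces w^T x + b:

```python
import numpy as np

x = np.array([2.0, -1.0, 0.5])      # original feature vector
w = np.array([0.3,  0.7, -0.2])     # original weight vector
b = 1.5                              # original offset

x_aug = np.append(x, 1.0)            # [x ; 1]
w_aug = np.append(w, b)              # [w ; b]

print(np.dot(w, x) + b)              # old formulation
print(np.dot(w_aug, x_aug))          # same value: w^T x + b
```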
Info
Channel: Kilian Weinberger
Views: 37,019
Rating: 4.9626865 out of 5
Keywords: machine learning, cornell, perceptron, curse of dimensionality, course
Id: BbYV8UfMJSA
Length: 47min 43sec (2863 seconds)
Published: Mon Jul 09 2018