Machine Learning Lecture 17 "Regularization / Review" - Cornell CS4780 SP17

Captions
Everybody, all right. A quick reminder: the midterm is on Tuesday, as you hopefully know by now. If you haven't started studying yet, please realize that you are late in the game and get going as soon as possible.

Last time we talked about regularization, and I showed you that the L2 regularizer basically says: there is a ball around the origin, and the solution has to lie within that ball. That is how you control complexity. What you want to avoid is a really complicated solution where some weights are very large and others very small; if you shrink everything, the optimizer has to come up with a simpler solution. That's the intuition. You are restricting the space: instead of finding a solution anywhere, you insist it lies inside this ball. So you are minimizing some function; here is the function, here is its minimum. Ideally, with no constraint, I would end up at that minimum, but my worry is that it could overfit the training data. So I say: find a solution to this function, but stay close to zero. That typically helps with generalization and gives you a better estimate on your test data, and it is also exactly what the MAP estimate gives you if you put a Gaussian prior on your weight vector.

Another regularizer that is just as popular is the L1 norm: you penalize lambda times the sum of the absolute values of the weights. If you draw the same kind of picture, what you get is not a ball. It is called the L1 ball, but it is not a ball at all; it has corners and looks like a diamond. Again you try to find the optimal solution while restricting yourself to lie inside this L1 ball.

Can anyone guess one advantage of the L1 ball? Why would you use this instead of a circle? [Student: it increases sparsity.] Exactly right, it increases sparsity. Let me explain that for everybody else. You are minimizing some function, and say its unconstrained minimum is up here: ideally you would want w2 pretty large and w1 some small value. That is the optimal solution if no constraint is enforced. Now I enforce the L1 constraint, and if the objective looks something like this, the constrained optimum lands exactly on this corner, because the pointy corners can push further into the level sets of the objective. And what does that corner mean? I crank w2 up as much as I can and I set w1 to exactly zero, not to a small value.
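One way to see the exact-zero behavior (my addition, not derived in the lecture) is the one-dimensional case, where both penalized problems have closed-form solutions:

```latex
% L2 shrinkage: the minimizer moves toward zero but never reaches it
\min_{w}\; \tfrac{1}{2}(w - z)^2 + \lambda w^2
  \quad\Longrightarrow\quad w^{*} = \frac{z}{1 + 2\lambda}
% L1 (soft thresholding): the minimizer is exactly zero whenever |z| \le \lambda
\min_{w}\; \tfrac{1}{2}(w - z)^2 + \lambda\,|w|
  \quad\Longrightarrow\quad w^{*} = \operatorname{sign}(z)\,\max\bigl(|z| - \lambda,\, 0\bigr)
```

So the L2 penalty only shrinks a coordinate, while the L1 penalty snaps small coordinates all the way to zero.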
That's what ends up happening here. With L2 you don't really see this: you would end up nearby, with a very small value of w1, but not zero. The reason people really like setting weights literally to zero is that you can then rule out that whatever feature w1 corresponds to contributed anything to the prediction. Say you make predictions in biology: here is someone's DNA, and I want to predict whether this person is likely to get cancer, yes or no. What I would love is for most of the weights to be zero. If I can predict well and most of my weights are zero, then I know all those features did not contribute to the prediction, and that makes it much easier to understand how the algorithm arrived at its answer. Especially in the sciences, people often care much more about understanding the prediction than about the prediction itself, so they love these sparse vectors with most entries set to zero. In high dimensions (in three dimensions the picture kind of runs out of room here) the L1 ball consists of many, many edges and corners, and on almost all of them the majority of the weights are zero. Your solution will almost always lie on one of these, so you get a very sparse answer: the weight vector is almost entirely zeros.

[Student question.] Right, L2 will assign something nonzero to every single dimension, so I guess you are asking: why not train with L2 and then just look at the largest weights? The problem is that you don't know where to cut; a small weight can still contribute a lot, and only when a weight is exactly zero is that feature really switched off. People do this occasionally, looking at the top weights, but why not just use L1 regularization? To be honest, the optimization is more annoying with L1, but ultimately the predictive power is usually no worse. Does that answer your question?

The nice thing, and we won't get into this now, is that with L1 regularization you can actually solve in closed form for the values of lambda at which the first feature becomes nonzero, then the second, and so on. Just to make this clear: if lambda is very large, this little diamond is really tiny, and almost certainly the solution sits at exactly one corner, so only a single feature is nonzero. That also makes sense intuitively: it will be the single feature most important for predicting the label; if you can only spend very little weight, the optimizer puts all of it on that one feature. So that's what the L1 regularizer does, and one thing people do is grow this ball slowly, essentially lowering lambda, and watch which feature gets picked first, which gets picked second, and so on.
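To make the zero-versus-small contrast concrete, here is a minimal sketch, not from the lecture, comparing the lasso (squared loss plus L1) with ridge regression (squared loss plus L2) on synthetic data where only five features matter. The data, the alpha values, and the variable names are illustrative choices of mine; it assumes numpy and scikit-learn are installed.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.RandomState(0)
n, d = 100, 50
X = rng.randn(n, d)
w_true = np.zeros(d)
w_true[:5] = [3.0, -2.0, 1.5, -1.0, 0.5]   # only 5 informative features
y = X @ w_true + 0.1 * rng.randn(n)

lasso = Lasso(alpha=0.1).fit(X, y)          # squared loss + L1
ridge = Ridge(alpha=0.1).fit(X, y)          # squared loss + L2

print("exact zeros, lasso:", np.sum(lasso.coef_ == 0))  # most weights exactly 0
print("exact zeros, ridge:", np.sum(ridge.coef_ == 0))  # typically none exactly 0
```

With a larger alpha the lasso zeroes out even more features; ridge only shrinks them.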
Any questions? One downside of the L1 regularizer is that it is not strictly convex. For example, imagine you have two features that are identical: it doesn't matter how you split the weight between them. You could put all the weight on one, all on the other, half on each, or any combination, and all of these are exactly equally good solutions. That is annoying. So what people do in practice, and this is called the elastic net, is a little bit of both: lambda times the L1 norm plus mu times the L2 norm, with mu very small. If you do this, then because the L2 term is strictly convex, there is a unique solution; there cannot be multiple solutions.

[Student: how do you express this as a constraint?] You say: minimize my loss subject to the norm of w being at most some budget B, and that budget corresponds to lambda. Last lecture I said this is equivalent to just minimizing l(w) plus lambda times the norm of w. Those are the same optimization problem; the penalized version is just the Lagrangian formulation, so I can go back and forth between them. The constrained version is nicer for pictures, because you can draw the set and say the solution has to lie inside it, but the two are exactly the same thing. In fact, if you solve the penalized problem and take the norm of the resulting w, that is exactly B.

And think about what the L1 norm is: the sum over alpha of the absolute values |w_alpha|, so every coordinate is penalized the same way, by the same amount. I am saying: don't just minimize the loss, minimize the loss and find a simple solution; that is why I regularize. The L1 norm implicitly says the sum of the absolute weights must not exceed B, but it is perfectly fine to put all of that weight on one dimension; that is not considered more complex than spreading a little weight over many dimensions. The L2 norm does something different: it is the sum of the squares w_alpha^2, so putting all the weight on one dimension is worse than spreading a little weight across every dimension, because you are squaring. One very large weight, squared, costs much more than many small squares. It is a different definition of complexity. The L1 definition leads the classifier to put all its weight on a few features, which is very informative, but it is also more brittle: imagine a self-driving car that puts all its weight on a few sensors; if those sensors drop out, the prediction is off. The L2 solution can be more robust because it spreads the weight out.

Any questions? [Student: does the elastic net fix the differentiability problem?] No, it does not solve the differentiability problem. What it does solve is this: if you have two features that are highly correlated, then under L1 alone you could put the weight on one feature, on the other, or on both, and they are all the same, whereas the elastic net has a unique solution.
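To collect the formulas from this discussion in one place (notation mine, but the content follows the lecture):

```latex
% Budget (constrained) form and penalized (Lagrangian) form are equivalent:
\min_{\mathbf{w}}\; \ell(\mathbf{w}) \;\;\text{s.t.}\;\; \|\mathbf{w}\| \le B
\qquad\Longleftrightarrow\qquad
\min_{\mathbf{w}}\; \ell(\mathbf{w}) + \lambda\,\|\mathbf{w}\|
% The two norms:
\|\mathbf{w}\|_1 = \sum_{\alpha} |w_\alpha|,
\qquad
\|\mathbf{w}\|_2^2 = \sum_{\alpha} w_\alpha^2
% Elastic net: mostly L1, plus a small strictly convex L2 term (\mu small)
\min_{\mathbf{w}}\; \ell(\mathbf{w}) + \lambda\,\|\mathbf{w}\|_1 + \mu\,\|\mathbf{w}\|_2^2
```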
Imagine you have two identical features: the optimal solution in terms of the L2 norm is to put the same amount of weight on both, because the moment one of them gets larger the penalty grows quadratically and you pay more. So the elastic net combines the robustness of the L2 solution with the L1 norm's ability to identify features. [Student.] Yes, same thing. Any more questions about this?

[Student.] Right, the point is this: imagine I have two identical features. Wouldn't the best answer be to set one to zero and the other to a large value? But that is not what the L1 norm alone gives you. With L1 you have no idea which one you will get, because all the splits are exactly equally good, so it depends on your implementation. And that is what scares people: two people solve the same optimization problem and get different results, and you don't know which one is right. People would like deterministic answers, and with the small L2 term added, if you run this and I run this we must get the same solution, no matter which solver you use; MATLAB or Python, it doesn't matter.

[Student.] Yes: you fit your model by minimizing the loss, but you add this complexity penalty, and it serves two purposes. First, it avoids overfitting; that is the bias-variance trade-off, which is the first lecture after the midterm. Second, you discover for free which features are responsible for the prediction. [Student: but it shrinks the weights.] Yes. That is a very good question: wait a second, if I am a biologist, the first thing I care about is which features are useful, but I also want a good predictor, and these goals are in conflict. If I make the budget B tight, forcing my classifier into this small region, that is good because it tells me which features matter; it sets most weights to zero. But the features it does select also get shrunk: in this picture it puts all the weight on w2, but ideally w2 should be this larger value up here. I want to learn that w1 or w3 are not relevant, but I still don't want to cramp w2 too much. So the suggestion is: run this first to identify which features are useful, and then re-run without regularization, or with L2 regularization, on just those features. And that is exactly what people do in practice. If you had invented that twenty years ago, you would be famous now.

Any more questions? That brings us to the end of ERM. Please look at the notes; I wrote down two or three famous special cases. One is called the lasso: the squared loss with an L1 regularizer. The other is the elastic net: the same thing with an L2 regularizer added. So when people refer to the lasso or the elastic net, that is what they mean. And then you also have the SVM and, of course, logistic regression, which you already know.
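Written out, the named combinations look like this. The lasso and elastic net objectives are the ones just described; the SVM and regularized logistic regression lines are my own summary in their standard forms, and the exact scaling conventions (where C or lambda sits) may differ from the course notes.

```latex
% Lasso: squared loss + L1
\min_{\mathbf{w}}\;\sum_{i=1}^{n} (\mathbf{w}^\top \mathbf{x}_i - y_i)^2 + \lambda\,\|\mathbf{w}\|_1
% Elastic net: squared loss + L1 + L2
\min_{\mathbf{w}}\;\sum_{i=1}^{n} (\mathbf{w}^\top \mathbf{x}_i - y_i)^2 + \lambda\,\|\mathbf{w}\|_1 + \mu\,\|\mathbf{w}\|_2^2
% Linear SVM: hinge loss + L2
\min_{\mathbf{w},b}\; C\sum_{i=1}^{n} \max\bigl(1 - y_i(\mathbf{w}^\top \mathbf{x}_i + b),\,0\bigr) + \|\mathbf{w}\|_2^2
% Regularized logistic regression: logistic loss + L2
\min_{\mathbf{w},b}\;\sum_{i=1}^{n} \log\bigl(1 + e^{-y_i(\mathbf{w}^\top \mathbf{x}_i + b)}\bigr) + \lambda\,\|\mathbf{w}\|_2^2
```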
Maybe I'll show you a very quick demo, and then I want to do a little bit of review for the exam, just to get you ready. What I want to show you is a small data set from the life sciences; I don't think you've seen it yet. Let me explain what you are looking at. I run a classifier with the squared loss and L1 regularization, so that's the lasso, and I gradually lower lambda. If lambda is really large, this here is the first solution. The x-axis is: each step, I lower lambda just enough that one more feature becomes unlocked. With lambda at infinity every weight is zero, so everything starts at zero, and these lines are the weights associated with the different features. When I lower lambda a little, the first weight shoots up; that is the first feature the lasso picks. At the beginning I am allowing only a tiny amount of total weight to be distributed over the weight vector, and it picks one feature and puts all the weight on it: the most important feature. As I lower lambda further, it keeps increasing that feature but also brings in another one, the second feature, and so on. These are not necessarily dimensions one and two; they are spread over the weight vector.

Over here you see the same thing as a heat map of the weight vector: the first row is all zeros, and each subsequent row is the weight vector with the weights encoded as colors. The first entry to become nonzero is the first feature that gets weight assigned to it, and as you move right, more and more weight is distributed across the different features. It is the same information as the line plot; at the end, for instance, this line tells you that this particular feature has a value of about minus 0.2. Does the graph make sense? Good.

At the top right you see the error. As I lower lambda, the training loss goes down very nicely. That makes sense: I am no longer confined to a tiny ball, the ball keeps growing, so eventually I get closer and closer to the unconstrained minimum. The red line is the training error, and it just keeps going down. But the yellow line is the test error, and it goes down only up to this point and then goes back up again. The lowest value is at five features, so you only need five features to do really well on the test set. The remaining features help you on the training set, but they don't help you on the test set anymore; they are just fitting noise, something specific to that particular training set.
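Here is a rough sketch of that kind of experiment; it is my own code, not the lecture's demo, and it uses synthetic data in place of the life-sciences data set. It sweeps lambda (called alpha in scikit-learn) from large to small and records how many weights are nonzero and how the training and test errors evolve; assumes numpy, scikit-learn, and matplotlib.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(0)
X = rng.randn(200, 30)
w_true = np.zeros(30)
w_true[:5] = [2.0, -1.5, 1.0, -0.5, 0.25]             # only 5 useful features
y = X @ w_true + 0.5 * rng.randn(200)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

alphas = np.logspace(0.5, -2, 40)                     # large lambda -> small lambda
nnz, tr_err, te_err = [], [], []
for a in alphas:
    m = Lasso(alpha=a).fit(X_tr, y_tr)
    nnz.append(int(np.sum(m.coef_ != 0)))             # how many features are "unlocked"
    tr_err.append(np.mean((m.predict(X_tr) - y_tr) ** 2))
    te_err.append(np.mean((m.predict(X_te) - y_te) ** 2))

plt.plot(nnz, tr_err, label="training error")         # keeps decreasing
plt.plot(nnz, te_err, label="test error")             # typically U-shaped
plt.xlabel("number of nonzero weights")
plt.ylabel("squared error")
plt.legend()
plt.show()
```

The training error keeps dropping as more features enter, while the test error typically bottoms out at a handful of features, which is exactly the cutoff point discussed next.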
So when you work with biologists in the life sciences, make these kinds of graphs and find that cutoff point. They get really excited when you can then tell them that these five genes, or this protein, explain essentially everything about the accuracy on the test data. Any questions?

[Student: what is the data set?] Oh, you got me there. I can't remember; it was something with iron, maybe in the blood. Sorry, I just think of it as abstract data. [Student: is the x-axis lambda?] It is related: the x-axis is the number of features that are nonzero. I keep decreasing lambda until one more feature becomes nonzero. It turns out there is an algorithm called least angle regression, which we won't cover in this class, that tells you exactly what the next value of lambda should be; you can solve for it in closed form. The x-axis is correlated with B, but with uneven gaps: B may have to increase a lot before the first feature enters and only a little more for the second. So the x-axis is really the number of nonzero features, but yes, as you move right, B increases as well; it is just not linear.

[Student: what is the difference between the loss and the training error?] The loss I am actually minimizing is the squared error, but this is a classification problem, so there is also a training error, and those are not the same thing. If you see the loss keep going down while the training error goes up, you are overfitting to the loss function: putting too much energy into getting the squared loss right when what you really care about is the classification error. Any more questions?

[Student: what if you only care about prediction?] Then looking at which features enter is less interesting, although it may help you debug things. But you still want to use lambda, because the best test error here is at five features. You still want regularization; otherwise you don't know when you are fitting noise and memorizing the training data versus when you still generalize.

Any more questions about regularization? One funny thing: the elastic net, which is what people actually use, can be solved as an SVM. There is a very interesting relationship that a student of mine discovered two years ago; we call it SVEN. You can reduce the elastic net exactly to the support vector machine: for any elastic net problem you can construct a new artificial problem, one with no real meaning, and if you feed it into an SVM, the weight vector you get back is exactly the elastic net solution. The nice thing is that SVMs can be solved much faster.

All right, what I want to do for the rest of today is a little review. I'll make two teams. How many undergrads, how many grad students? Who is an undergrad, raise your hand. Grad students? That's rough. Anyone who is neither gets added to the grad students. Grad students, are you willing to take the challenge? Yes?
Okay, good. All right, let's do this: undergrads versus grad students, for street bragging rights. Does this work? Yes? Good. Do people know Jeopardy? I don't, but I'm told how it works: you pick one of these fields, the number says how many points you get if you answer correctly, and the questions get harder as the point values go up. Someone says, for example, "SVMs for 40," I read the question, and the moment I finish reading, anyone can answer: raise your hand and make a loud buzzer noise. Let's practice the buzzer for a second. One, two, three. Good, don't be shy. Each individual can only answer a limited number of questions: undergrads two each, grad students four each. That's fair; let's put some regularization on it.

Grad students go first. Who wants to pick? You don't have to answer the question, you just say which field I should read; it's pretty safe. [Student: kNN for 10.] Okay, he's playing it safe. Here we go: as the number of training points goes to infinity, the error of the 1-nearest-neighbor classifier is at most twice the error of the Bayes-optimal classifier. [Buzz.] Who was first? Right, and the "as n goes to infinity" part matters; otherwise it wouldn't be a useful bound. Ten points. (You can barely see the board; it's supposed to be dark in here.) Okay, you can pick the next one.

[Student: naive Bayes for 50.] All right, ready, go: when is the naive Bayes decision boundary identical to logistic regression? [Buzz.] You almost got it. Can someone help? Am I allowing other grad students to help? Are you an undergrad or a grad? Grad, okay. Yes, that's exactly right. And it's not just something we assume; it's actually true. I will post the slides.

Okay, the person who just answered can pick the next one. [Student: SVM for 50.] All right, ready, go: why shouldn't you incorporate the bias as a constant feature when working with SVMs? Just to make this clear: in the perceptron algorithm we did exactly that, we added a constant feature of one. Why can't you do that with the SVM? Wait, I didn't hear a buzzer sound. Come on. Okay, good, an undergrad; the undergrads get a chance to answer for 50 points. Why can't you just absorb the bias? [Student answers.] Exactly, that's it. So the answer is: we are minimizing w-transpose-w, which is what maximizes the margin, and a crucial part of that is that we are not minimizing b squared. If you absorbed b into w and minimized w-transpose-w, that would be the same as minimizing w-transpose-w plus b squared in your objective, and that is no longer maximizing the margin. It says: maximize the margin, plus keep the decision boundary close to the origin, which is not what SVMs do. Does that make sense? So b has to be free; b is not minimized.
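Spelled out (this derivation is mine, paraphrasing the answer above):

```latex
% Absorbing b as an extra constant feature turns w into (w, b), so minimizing ||w||^2
% silently becomes
\min_{\mathbf{w},\,b}\; \|\mathbf{w}\|_2^2 + b^2
% which maximizes the margin AND pulls the hyperplane toward the origin, whereas
% the actual SVM objective leaves b unpenalized:
\min_{\mathbf{w},\,b}\; \|\mathbf{w}\|_2^2
\quad\text{s.t.}\quad y_i\,(\mathbf{w}^\top \mathbf{x}_i + b) \ge 1 \;\;\forall i
```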
Very nice. Okay, next question. Wait, sorry, who was talking? Oh yes, that's right: that is part of the constraint for naive Bayes to actually become a linear classifier, the Gaussians have to have the same variance for both classes. That's right. Okay, you can pick the next one.

[Student: kNN for 50.] I like it, you are going for the big ones. All right, ready, set, go: how would you modify the k-nearest-neighbor algorithm for regression? [Buzz.] An undergrad, okay; you have scared the grad students so much. Yes, exactly right. One thing you can do is simply take the average of the k nearest neighbors' values. Even better, take a weighted average: assign a weight to every neighbor, so the prediction is the sum over the neighbors i of w_i times y_i, normalized by the sum of the w_i, since the weights have to sum to one, otherwise it is not a weighted average. For the weights, w_i could be one over the distance between x and x_i, or, as people often pick, an exponentiated negative distance; the idea is to give less weight to points that are further away. You can define your own weighting function, and typically this improves results.
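Here is a minimal sketch of that idea, my own implementation with arbitrary names, using exponentiated negative distances as the weights:

```python
import numpy as np

def knn_regress(X_train, y_train, x_query, k=5):
    """Predict a real value as a weighted average of the k nearest neighbors."""
    dists = np.linalg.norm(X_train - x_query, axis=1)
    nn = np.argsort(dists)[:k]          # indices of the k closest training points
    w = np.exp(-dists[nn])              # closer neighbors get larger weights
    w = w / w.sum()                     # weights must sum to one
    return np.dot(w, y_train[nn])

# usage on toy data
rng = np.random.RandomState(0)
X = rng.rand(100, 2)
y = np.sin(3 * X[:, 0]) + X[:, 1]
print(knn_regress(X, y, np.array([0.5, 0.5]), k=5))
```

Replacing `np.exp(-dists[nn])` with `1.0 / (dists[nn] + 1e-12)` gives the inverse-distance variant mentioned above.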
All right, next question. [Student: general for 50.] Ready, go: for each algorithm, name one assumption it makes about the data: k-nearest neighbors, naive Bayes, logistic regression, and SVM. [Student answers.] Keep going. Features are independent given the class, yes, conditionally independent, very important. You're very close, so you know what, I'll give you half the points, 25, partly to motivate the grad students; we can't just beat them up like this. Here we go: k-nearest neighbors assumes that points that are close, similar points, have similar labels; that's a good one. Naive Bayes assumes the features are conditionally independent given the label. For logistic regression you said the right thing, although it is actually a little less strict: it does not have to be a Gaussian distribution; it also works for any distribution in the exponential family, and the Gaussian was just one such example. Since we didn't talk about that in detail, in practice I would count it. And the SVM: linearly separable, yes, except that with slack variables it doesn't have to be; it still gives you a reasonable answer. The assumption you are really making is that a linear classifier is a good classifier, that you can fit a hyperplane and get reasonable results, which is a big assumption and could be totally wrong; the data set may simply not be fit well by a linear hyperplane. So in that sense, I'll count it. Another assumption is that a larger margin actually generalizes better. So you were basically correct.

Okay, good, one more; you can pick. [Student: ERM for 50.] I like it. Which loss function was used in each of these three examples: the squared hinge loss with C very small, the squared hinge loss with C large, or the exponential loss? (That may have been taken from an exam, actually; I'm not sure.) Okay, good, the undergrads, awesome. Go for it, we're all cheering for you. The first one: that's correct; that's correct; exactly, perfect. It's the comeback. Just to give you some indication of why this is obviously the case: the first plot gets basically everything wrong, it's horrible, the decision boundary is not between the two classes at all. It doesn't care about getting the points right; it only cares about having a tiny w, which is exactly what happens when C is very small, because you put almost zero weight in front of your slack variables. So the other two plots must be the other two losses, and you can see that the hinge loss tries to enforce a margin between the classes: one point is on the wrong side, and it essentially gives up on that one. That's much harder for the exponential loss; it also has to give up on that point, but it gets pulled much closer to it, because it pays an exponential penalty.

All right: SVM for 40. Why is the SVM margin exactly 1 over the norm of w? You have to do the buzzer sound. No, that's just saying we maximize the margin and it's equal on both sides; the question is why it is exactly that value. [Buzz.] Yes, that's exactly right. Grad student or undergrad? Grad. All right, you guys are coming back, it's getting exciting. The answer is simple: the margin is w-transpose-x plus b, divided by the norm of w. In the derivation we said: let's fix the scale of w and b, and we fixed it exactly so that the numerator is 1 for the closest points; that was just to make the optimization problem easier. So the answer is either (a) because we fixed it that way, or (b) it's enforced by the constraints; the constraints ensure this must be the case. And by the way, once you see that this is the size of the margin, it becomes clear why minimizing w-transpose-w maximizes the margin: the margin is exactly one over the norm of w, so if you minimize the norm of w, the margin must blow up; you are exactly maximizing the margin.
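Written out (a standard derivation, not verbatim from the lecture):

```latex
% Distance of a training point x_i to the hyperplane w^T x + b = 0:
\frac{|\mathbf{w}^\top \mathbf{x}_i + b|}{\|\mathbf{w}\|_2}
% The SVM fixes the scale of (w, b) so that the closest points satisfy
\min_i\; |\mathbf{w}^\top \mathbf{x}_i + b| = 1,
% hence the margin is exactly
\gamma = \frac{1}{\|\mathbf{w}\|_2},
% and minimizing ||w||^2 subject to the constraints maximizes the margin.
```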
All right, which one do you want to pick? [Student: ERM for 400.] That's the Daily Double sound; my grad students taught me to make it, I don't have to know what it means. Shoot, I shouldn't have shown you the question already. Since a grad student picked it, you can now bet: you bet a certain number of points, and if you get it right you double them; if you get it wrong, you lose them. How many points do you want to bet? Everything? You've got to go all in, that's the optimal strategy, you've got to win. [Laughter] 115 it is. Okay, good, so now only the grad students are allowed to answer, and if you get it wrong, you lose everything. Name two distinct reasons why you might not be able to use Newton's method on a particular data set, independent of convergence; I'm taking away the one that it may not converge. Give me two reasons. Only grad students. Okay, teamwork is allowed. First one: the loss function is not twice differentiable. That's correct. Second one? I can think of at least two more. Yes, exactly right: if the data set is really high dimensional, it may not be feasible to compute the Hessian. I'll give you the points. Also, the Hessian may not be invertible, and you actually have to invert it.

Okay, now it's getting really tight; I guess they have 230. Don't take it too personally; this is like the election. All right, who picks? You were first. [Student: naive Bayes for 40.] Ready, set, go: MAP estimation with plus-one smoothing in naive Bayes, with C classes and a categorical distribution with K categories per feature, is equivalent to adding how many hallucinated examples to the training set? [Answers.] Yes, it's C times K, because for every single class and every single category you have to hallucinate that you saw it exactly once. That was a grad student; how many points? Forty. Oh my gosh, the grad students are winning.

By the way, that Daily Double is actually how IBM Watson won at Jeopardy. They knew the question answering alone might not be enough against the best human players, but from watching old episodes they observed that humans are terrible at betting correctly and terrible at finding the Daily Double. So the team worked on two strategies: betting optimally on the Daily Double, which you can solve for exactly, and finding the Daily Double, which is always hidden somewhere. When they analyzed the past episodes, they realized its position is not uniformly distributed at all, so they incorporated that into their strategy, got a much higher chance of hitting the Daily Double, and then capitalized on it.

Okay, who picks? [Student: naive Bayes for 40.] We just had that one, sorry. [Student: kNN for 40.] All right: why does kNN still perform well on high-dimensional data such as faces and handwritten digits? Who was first? That's right. The space is high dimensional, but the data itself is intrinsically not high dimensional, and that is what matters, not the ambient dimension. You could even take low-dimensional data, embed it in a really high-dimensional space by padding with zeros, and rotate the space arbitrarily: it will look like the data is high dimensional when it really isn't, and k-nearest neighbors won't change at all.
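A quick sketch of that construction (mine, with arbitrary sizes): take genuinely five-dimensional data, pad with zeros to 500 dimensions, apply a random rotation, and check that k-NN accuracy is essentially unchanged, since rotations preserve distances. Assumes numpy and scikit-learn.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.RandomState(0)
X, y = make_classification(n_samples=300, n_features=5, n_informative=5,
                           n_redundant=0, random_state=0)   # intrinsically 5-dimensional

D = 500
Q, _ = np.linalg.qr(rng.randn(D, D))                         # random rotation matrix
X_high = np.hstack([X, np.zeros((300, D - 5))]) @ Q.T        # pad with zeros, then rotate

knn = KNeighborsClassifier(n_neighbors=5)
print(cross_val_score(knn, X, y, cv=5).mean())        # accuracy in 5 dimensions
print(cross_val_score(knn, X_high, y, cv=5).mean())   # essentially the same in 500 dimensions
```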
Okay, general for 40. Which algorithm would you expect to result in lower test-time classification error, k-nearest neighbors or a linear SVM: (a) text classification by topic; (b) predicting a patient's health status from fifteen features of their vital signs; (c) classifying sixty thousand handwritten digits? For each one, which do you think does better? [Student: SVM, SVM, kNN.] Okay, tell me why; if you have a good reason I'll count it. I'm with you, I'm with you; the question is (b). The point is that (b) is very low-dimensional data, and in low dimensions linear classifiers typically do not work well, simply because there aren't many options for where to place a hyperplane. The first one is definitely the linear classifier: text documents are really high dimensional, and a linear classifier is amazing at that. The other two are typically not very high dimensional and tend to have nonlinear decision boundaries, so kNN tends to work better. How about this: I give you 30 points and give 110 to the grad students. No? Sorry, are you a grad student? I want it to be a tight race, so I'll just say 10 points. [Applause]

Okay, we are about out of time, so one last question. [Student: kNN for 30.] All right, true or false: n-nearest neighbors on a data set with n training points and d dimensions takes order n times d computation. [Buzz.] It's a trick question; this one was a setup. Yes, exactly right: false. Usually the computation is order n times d to find the k nearest neighbors, plus a little bookkeeping for which points are currently the k nearest (some small heap or tree). But if you take all n points as your neighbors, then you always predict the most common label, so you don't need to look at the data at all; it's constant time. Was that a grad student or an undergrad? Okay, good. Well, it was a tie. [Music]
Info
Channel: Kilian Weinberger
Views: 9,053
Rating: 4.972414 out of 5
Keywords: artificial intelligence, machine learning, kilian weinberger, regularization, cs4780, cornell
Id: NLBwFyT9Qks
Length: 52min 30sec (3150 seconds)
Published: Wed Jul 11 2018