Machine Learning Lecture 32 "Boosting" -Cornell CS4780 SP17

welcome, please everybody close your laptops. Everybody close your laptops. Thank you. All right, good stuff. A few quick administrative notes: the support vector machine project is done. It turns out that competition was a little hard, not too much variance on that one. The Kaggle competition, however, turned out a little too easy; some people are doing really, really well already, which is really cool. Please remember you have to submit a write-up describing your solution, and your code, at the end. Okay, good.

So we are talking about boosting. Last lecture we talked about bagging. Bagging, if you remember, is for classifiers that have a variance problem. You can identify a variance problem because your training error is typically much, much lower than your test error. In bagging you sample a whole bunch of training data sets with replacement, you train a classifier on each one of them, and you average. We saw very nicely that this reduces the variance and does not increase the bias, so your test error goes down. Bagging is a very good solution if you have a variance problem.

Today we're doing boosting, and boosting is for the setting where you have a bias problem. It started out with a question, which I mentioned last time. In the late 80s people distinguished between weak learners and strong learners. Weak learners are algorithms that cannot bring your training error down to zero; strong learners can. That was an important distinction back then, and the question was: if you have many weak learners, each of which cannot get your training error down to zero, can you combine them in a way that achieves zero training error, and can you do it efficiently? This was a question posed by Michael Kearns, and very famously Robert Schapire came up with the answer. He first proved theoretically that yes, it is
always possible, and then he came up with an algorithm, essentially part of his proof, that actually showed how to do it, and that became a huge instant hit. So what I want to give you today is first a little intuition for how this works.

Once again, our classifier is the following: H(x) is an ensemble of many different classifiers, H(x) = sum over t of alpha * h_t(x), where each of these little h_t's is a weak learner, an algorithm that is not very good. The way we define a weak learner in the classification setting is: it does better than a random coin toss. That's a pretty low bar, right? Just better than guessing.

Okay, so now I'm going to give you some intuition, so please pay attention, because if you wrap your head around this, everything becomes a lot easier. We will use the following coordinate system, which we've seen before in the context of Gaussian processes: if we have n training points, I can define an n-dimensional space where each axis corresponds to one training point, and the coordinate along that axis is the prediction, or the label, of that training point. So we have one axis per point x1 through xn. This is important: these are not the feature values, these are the labels.

So let's say we have a three-dimensional case. Say I have a data set, I'll just make this up: x1 has the label 1, x2 has the label -1, x3 has the label 1 again. These are my y's, y1, y2, y3, so I can make a vector y = (y1, y2, y3), and that is a point in this three-dimensional space. In this case y1 is 1, so I go here; y2 is negative, so that's down here; and y3 is positive, so that's up there. I go here, then down, and this here is my vector y. Raise your hand if you're still with me. Okay, awesome.
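This label-space picture can be written out concretely. A minimal sketch, assuming numpy, with the made-up labels from the example:

```python
import numpy as np

# Label vector for the made-up example: y1 = +1, y2 = -1, y3 = +1.
# Each axis of this 3-dimensional space is one training point, and the
# coordinate along it is that point's LABEL, not its feature value.
y = np.array([1.0, -1.0, 1.0])

# A strong learner (full-depth tree, RBF SVM) drives training error to
# zero, so its prediction vector lands exactly on y.
h_strong = np.array([1.0, -1.0, 1.0])

# A weak learner gets some points wrong and lands somewhere else.
h_weak = np.array([1.0, 1.0, 1.0])   # wrong on x2

print(bool(np.array_equal(h_strong, y)))   # True
print(float(np.sum((h_weak - y) ** 2)))    # squared distance to y: 4.0
```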
Now I can define my loss function and map everything into this space. If I run a classifier, capital H, on my data, what I get is a prediction for every single training point. So H gives me a vector (H(x1), H(x2), H(x3)): I stick in my data set x1, x2, x3 and out comes a vector like this. We can call it H; this is my prediction vector.

Now, if I have a strong learner, for example an RBF SVM, or a decision tree that I grow all the way to the bottom without stopping, this vector will be exactly the same as y, because those algorithms can drive the training error down to zero. You train the algorithm on the training data set and you land exactly on that same point. But a weak learner, one that's not very good, for example a decision tree that is only allowed to split twice, a tree of limited depth, will not be able to get every single training point right. What it gives you, and this is my little h now, is again a vector (h(x1), h(x2), h(x3)), but it won't match y. It's pretty crappy; let's say it's over here.

And so the question is: if I have an algorithm that gives me these lousy predictions, can I somehow combine many of them to actually get to the true answer? That is the question of boosting. And the idea is quite intuitive, actually quite simple. Think of it in terms of a regression problem on these labels. We start out and train our classifier h on our training data, and because our classifier h is not very good, what it gives us is this point here, and we know this is far off
because we have the labels, so we can compute how far off we are. This is not great, right? We wanted this vector here, and we got that one instead.

So here's what we do. It's a little bit like when you're drunk. You can do this experiment tonight, it's Friday: go out to Collegetown, get, you know, not too drunk, that's dangerous, but drunk enough that you can't walk properly anymore, and then try to walk home. What happens is that overall you're probably going to be pretty bad at pinpointing which direction your home is, but you're still going to make it home. Why? Because on average, if you sum up all the steps, each one of which is pretty bad, eventually you get there.

That's what we're doing. We have this drunk little h guy. It's pretty bad, but it's not horrible, in the sense that it roughly knows which direction it has to go, you know, roughly south. So we trust it a little bit: instead of going all the way, we just take a little step in that direction, and we stop here. Now we re-evaluate, and in some sense we have exactly the same problem we had before. You can imagine a new coordinate system that starts here, with the origin here, and our y vector is now slightly different. We solve the problem again, this drunk algorithm gives us another h, which points, say, over here, we take a tiny step in that direction, stop, and solve again.

Essentially what you're doing is very, very similar to gradient descent. In gradient descent you always get a direction that doesn't really
point you at the minimum, but it points you in the general direction of the minimum, and you take many, many such steps and add them up. That's exactly what boosting is: boosting is gradient descent in function space. You have a function, and you are performing gradient descent on functions.

The condition you need on the weak learner is really, really mild. At every single step you want to get somehow closer to your goal, and that's the case as long as the inner product between the weak learner's direction and the direction to the label is positive. Basically, as long as you're not going completely in the wrong direction. If I point this way, the goal must be somewhere forward of me; it can be over there, or over there, that's okay. There just can't be more than a 90 degree angle between the direction my weak learner points me and the direction to the true label.

Now, it turns out that's very easy to satisfy. Imagine you have a weak learner that's really, really terrible: here is the true label, and instead the learner tells you to go the opposite way. That's not a problem. Why is that not a problem? Any idea? Yeah: you can just flip it. Right, you can detect this: if the inner product is negative, you flip the sign, and then you're pointing in the right direction again. The only case where it really doesn't work at all is if you're completely orthogonal; then there's nothing you can do anymore. That's when you hit your limit, and that's when the algorithm stops, in some sense.

And that's exactly what boosting does. You iteratively find a new weak learner, you add it to your ensemble, which means you move; then you find a new weak learner, you add it, and so on. And the
smaller your steps, the longer it takes, but the more accurate you are at the end. Any questions about this intuition?

[Student question.] No, no, they're just weak learners that do a little better than coin tossing. So in terms of classification, on a balanced binary data set, you get better than 50% accuracy; in regression, essentially you need the inner product with the residual to be positive, not just orthogonal. Yes, that's exactly right: each of these is one of my little steps. I take alpha to be a small value; each weak learner tries to predict the label all the way, which gives me a vector that it thinks points to the label, I multiply it by the small constant alpha, and I just add them up.

[Student question.] Good question: how do we know that adding up all these classifiers will give us the correct answer? The answer is very simple. Say I'm here, and this here is the y vector I want to get to. I have a weak learner, it's not very good, and it points in some direction. That direction defines a hyperplane, and because the label is on the positive side of that hyperplane, if I take a step in this direction I must get closer to y. This is Pythagoras again: the squared distance to y decomposes as a^2 + b^2, where a^2, the part orthogonal to my step, stays constant, and b^2, the part along my step, just got smaller. So provided y actually lies on the positive side of the hyperplane, and your step size is small enough (if you overshoot, it doesn't work anymore), you must get closer to y.
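The flip trick and the Pythagoras argument can both be checked numerically. A sketch, assuming numpy, where the "weak learner" is just a hypothetical random sign vector:

```python
import numpy as np

rng = np.random.default_rng(0)
y = np.array([1.0, -1.0, 1.0, 1.0, -1.0])   # made-up labels
H = np.zeros_like(y)                        # current ensemble prediction
residual = y - H                            # direction we still need to move

# A hypothetical weak learner: random +/-1 predictions.
h = rng.choice([-1.0, 1.0], size=y.shape)

# Progress condition: <h, residual> must be positive. If it is negative
# we simply flip h; only an exactly orthogonal h is hopeless.
if h @ residual < 0:
    h = -h

# Pythagoras: a small enough step along h must shrink the distance to y.
alpha = 0.1
before = np.sum((H - y) ** 2)
after = np.sum((H + alpha * h - y) ** 2)
print(bool(after < before))                 # True
```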
So every single step you are getting closer to y, and eventually you get very close to it. That's basically how the convergence works. The failure case, from this point of view, is a weak learner so weak that it can only point orthogonally to y: if the only thing it can return is a direction perpendicular to where I need to go, then I'm done, my weak learner is just not good enough anymore.

[Student question.] Are we increasing variance? That's actually a funny question. AdaBoost seems surprisingly resistant to increasing the variance of your classifier: it reduces bias and is actually amazing at not increasing variance much. I will get to this a little at the end. Okay, last question, maybe. [Student question.] Yes, you will see that exactly in a minute. That's right: we have to make a different call to the weak learner each time, because after our step the target has changed; in some sense we have a new coordinate system, and this is now our y. Very good question, and we will address exactly that now.

This is actually where I stopped last time. If you formalize this, we have our ensemble, let me write it down one more time: H(x) = sum from t = 1 to T of alpha * h_t(x). For now let's keep alpha constant; later on we will actually optimize alpha as well, and that is when you get to AdaBoost. AdaBoost stands for adaptive boosting, where alpha is adaptive, and then we get an amazing classifier. For now alpha is just a constant.

So imagine we have already run for a few steps. It could be zero steps: initially, with T = 0, we just have the all-zero prediction, we are at the origin, and the
question is: which classifier should we add next? To formalize this, we say we would like to find the h, where script-H is the set of all weak learners we could possibly get, that minimizes the loss of H + alpha*h:

  argmin over h in script-H of L(H + alpha*h)

So far so good; raise your hand if you're with me. Okay, good. The loss function is defined over my ensembles H. One example could be the squared loss, L(H) = sum from i = 1 to n of (H(x_i) - y_i)^2, but it really doesn't have to be the squared loss; it can be any loss, the exponential loss or something. It just has to be differentiable, and it should be convex.

So we take this loss, and now we want to take one more step; we want to find which h, which direction, that step should be. We do exactly the same thing we did with gradient descent. Does anyone remember how we solved this for gradient descent? We did an approximation; it begins with "T" and ends with "aylor". That's right, the Taylor expansion:

  L(H + alpha*h) ≈ L(H) + alpha * <dL/dH, h>

So we can plug this in. I know some people are scared right now: that is an inner product over two functions, which you may not have seen before. Don't worry, it won't be as bad as you think. Now we take the argmin over these two terms. The first term, L(H), is a constant: it's not affected by h, so just drop it. What we are left with is the inner product of dL/dH with alpha*h, and we want to minimize it.

Now, what is the inner product between a function and the derivative of a loss function with respect to another function? Well, we did this in middle school... no, I'm kidding. Actually it's not very complicated. You can define this in multiple ways, but one way you can do it is to say we
just evaluate both of these functions at many different points and multiply the results. Because in our coordinate system there are just n points, it is simply the following sum:

  <dL/dH, h> = sum from i = 1 to n of (dL/dH)(x_i) * h(x_i)

Why is this correct, or why does it make sense? Because we're only interested in these n training points, we express everything in terms of them: a function is represented by the values it gives on the n training points, so the function, in some sense, is a point in this space. Two functions could be very different outside these n points, but if they agree on them, we consider them the same for now. And then this is just the inner product between two vectors; that's exactly the definition. Any questions about this? Okay, good. I wrote a little something about this in the notes, but really, the idea at the end of the day is that we just think of these functions as vectors: for each of the coordinates x_1, x_2, x_3 we get a value, that gives us a point in this n-dimensional space, and we take the inner product between these two points. That's totally doable.

Okay, awesome. So let's take a step back. The intuition is that we add up many of these little weak learners, which are not very good, to get a strong learner. At any given time, if you already have a bunch of them, you want to find the next one, and all we need for that is an algorithm that says: given the derivative of the loss function with respect to H at every single training point, find the little h that gives the minimum inner product with this vector. I'll get to a few examples in a minute.
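This restricted inner product between two functions is easy to implement: evaluate both on the n training points and take an ordinary dot product. A minimal sketch, assuming numpy; the two functions and the training inputs are made up:

```python
import numpy as np

def function_inner_product(f, g, X):
    """<f, g> restricted to the training points: sum_i f(x_i) * g(x_i).

    Two functions that agree on all n training points are treated as
    identical here, even if they differ everywhere else.
    """
    fx = np.array([f(x) for x in X])
    gx = np.array([g(x) for x in X])
    return float(fx @ gx)

X = [0.0, 1.0, 2.0]                      # toy 1-d training inputs
f = lambda x: x                          # one function
g = lambda x: 1.0                        # another function
print(function_inner_product(f, g, X))   # 0*1 + 1*1 + 2*1 = 3.0
```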
This may still seem scary, so let me un-scare it. Why don't you just do this: I'll give you a minute, you and your neighbor figure out what this term, dL/dH(x_i), is for the squared loss up here. One minute, please feel free to discuss with your neighbor.

[Pause.] All right, who wants another minute? Who thinks he or she has the answer? Okay, good. Yeah, that's right, exactly. Actually, one small cheat: I put a one-half in front of the loss to get rid of the two, and then it's just

  dL/dH(x_i) = H(x_i) - y_i

That's easy to compute: you evaluate your function H at each one of your training points and subtract the label. In the figure we just had: here's our true vector y, and let's say after a bunch of iterations we ended up here; this is our H. Then dL/dH is exactly this vector from y to H. And what do we want? We want to find an h with minimum inner product with this vector. What has minimum inner product? Exactly the opposite direction, the vector pointing from H toward y. Does that make sense? The derivative of L with respect to H is the vector from y to H; given our current classifier H, we want the little h with the most negative inner product, so if the arrow goes this way, we go exactly the opposite way and point toward y.

It's like a compass: it always points toward y, the labels of our training data points, and that's exactly the direction we want to add. If the weak learner could actually do it exactly, we would just have to add one more of them and we'd be done. But the weak learner is pretty lousy; that's the problem with drunk classifiers,
right? You can only trust it for a small step: go a little this way, a little that way. But it will always roughly point toward y, no matter where you are, so you will eventually get there.

Just so you believe me, I made a little demo. I thought: what is the crappiest classifier on the planet? I really had to think about the stupidest thing I could possibly do, and I came up with something. The stupidest algorithm in the world takes a data set of feature vectors x and labels y, totally ignores the x's, and returns random labels. That's pretty stupid. Now, we have to be a little better than that, because it has to be a weak learner. So here's what we do: the predictions are just labels +1 or -1, and I try the two versions, either flipped or not, and see which one gives me lower loss. That's it. I point in a random direction, notice "oh, that's actually backwards", and if so, never mind, go the other way. That's all I'm doing. It's like asking a random drunk guy for directions: "How do I get to the end of this street?" "I don't think it can be that way." "Well, then it must be this way."

All right, good. Let me explain the image. This here is my origin; the red point is where I am right now, which is the all-zero vector, since initially I'm predicting all zeros. This is a three-dimensional case, so three training points: this axis is the label of the first data point, then the second, then the third. Here the labels are -1, 1, and -1. I start here and run my first classifier, and of course it gives me random garbage, but the random garbage is such that it is positively aligned with the true label vector; that's the only thing it guarantees, because otherwise I would have flipped it. And now I keep
adding them up. Here's what's happening: you see the red point moving, and down here you see the error, the root mean squared error to the y vector. At the beginning it's really high, and now I keep moving and adding. These are just the dumbest classifiers on the planet, and it kind of looks like a drunk guy, doesn't it? Like, "ooh, that was a good night out". You see it slowly getting there, sneaking up on it, and boom. At some point, though, and this is the stage where you get home but can't fit the key in the keyhole, it will never be that exact; that's where it gets stuck.

This is actually how you could win Kaggle competitions if the prize money were given out based on the training data: you just always upload random noise, and if the error is above chance you keep it, otherwise you flip the bits, and you keep adding them up, and you would win the competition eventually. That's why people have secret test sets; otherwise you could always cheat this way. This is also related to why, in science, many results are not reproducible: we make small decisions that cumulatively lead us to the outcome we want. Okay, any questions about this demo? No? Okay, good, awesome.

So the resulting algorithm is called AnyBoost, and by the way, it was invented ten years after the first boosting algorithm. It's amazing in hindsight, but it took ten years for people to realize that boosting is actually gradient descent, exactly gradient descent in function space: you take the gradient and you add it. It wasn't clear at the time, in part because the notation was very different and people thought
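The demo is easy to reproduce in a few lines. A sketch, assuming numpy, where n, alpha, and the number of rounds are made-up choices: each round draws random +/-1 predictions, flips them if they are negatively aligned with the residual, and takes a small step.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 50
y = rng.choice([-1.0, 1.0], size=n)   # arbitrary +/-1 label vector

H = np.zeros(n)                       # start at the all-zero prediction
alpha = 0.05
for _ in range(5000):
    h = rng.choice([-1.0, 1.0], size=n)   # "stupidest classifier": random signs
    if h @ (y - H) < 0:
        h = -h                             # flip it if it points backwards
    H += alpha * h                         # take a small step

rmse = float(np.sqrt(np.mean((H - y) ** 2)))
print(rmse < 0.5)   # far below the initial RMSE of 1.0
```

Like the drunk walker, it never fits the keyhole exactly: the RMSE plateaus at a small but nonzero value set by the step size.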
about it very, very differently. Now it's very obvious. If you look at the notes, the AnyBoost algorithm summarizes this: you compute the gradient of the loss for every single data point, and then you call your weak learning algorithm, which gives you an h with minimum alignment with that gradient. One thing I have to stress is that this doesn't have to be very good. It doesn't have to be the exact minimum; it just has to be something pointing somewhere in the right direction. It even works in the absolute worst case where you just have this random classifier. Then you add them up, and that's it. Okay, good. Any questions about AnyBoost?

Then let me move on. There are two very famous special cases with special names, and I would like to go over them: the first one is gradient boosted trees, and the second one is AdaBoost. Gradient boosted trees have become extremely popular because that's essentially what search engines use: if you use Google or Yahoo or Microsoft Bing or whatever, the ranking of what you're looking at is the output of a gradient boosted tree, and there are very good reasons for this. It's actually something I used to work on when I was at Yahoo Research. In some sense, gradient boosted trees are just very specific choices within this boosting framework.

The first choice: it can be regression or classification, so the y_i's can be anything; let's say the y_i's are real values and we are doing regression. The weak learners h are trees of limited depth, so each h is a decision tree, a CART tree, of limited depth, typically depth four to six, but not more than that. These are highly biased classifiers; each one is only allowed to split four or five times.
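The generic AnyBoost loop can be sketched in a few lines. This is an illustrative sketch, assuming numpy; the `weak_learner` and `grad_loss` interfaces, and the deliberately lousy sign-of-target learner, are made up here, not part of the original formulation:

```python
import numpy as np

def anyboost(X, y, weak_learner, grad_loss, T=100, alpha=0.1):
    """Generic AnyBoost sketch: gradient descent in function space.

    weak_learner(X, targets) must return a function h whose predictions
    h(X) are positively aligned with `targets`; grad_loss(H, y) returns
    dL/dH at every training point.
    """
    ensemble = []
    H = np.zeros_like(y, dtype=float)
    for _ in range(T):
        g = grad_loss(H, y)        # gradient of the loss at each point
        h = weak_learner(X, -g)    # h should align with the NEGATIVE gradient
        ensemble.append(h)
        H = H + alpha * h(X)       # take a small step in function space
    return ensemble, H

# Squared loss (with the one-half): dL/dH = H(x_i) - y_i.
grad_square = lambda H, y: H - y

# A deliberately lousy weak learner: just the sign of its targets.
sign_learner = lambda X, t: (lambda Xq: np.sign(t) + 0.0)

X = np.arange(6.0)
y = np.array([1.0, -1.0, 1.0, 1.0, -1.0, 1.0])
_, H = anyboost(X, y, sign_learner, grad_square, T=200, alpha=0.05)
print(bool(np.max(np.abs(H - y)) < 0.1))   # the ensemble crept up on y
```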
Each one of them is pretty lousy on its own. They're very nice because they're very fast, they can be evaluated very quickly, and they don't make any assumptions about the scaling of the data set, etc. The last choice is that alpha is a small constant. Okay, good.

So far there's nothing special here: we never required anything of y, so a regression target is fine (we just assumed y is a point in this n-dimensional space), and alpha being a small constant is what we assumed anyway. Here is where it gets tricky: h is a CART tree, and CART trees, as we learned, can minimize either an impurity measure like the Gini impurity or entropy for classification, or the squared loss for regression. But what we want here is the alignment between two vectors: we have this vector of gradients, and we would like to find a tree that aligns with it. We don't obviously know how to make CART trees do that, and that's the one thing we have to quickly fix. Any questions? Okay, good.

So, one more time, what we want to find is a CART tree that solves this equation:

  argmin over h in script-H of sum over i of r_i * h(x_i)

where script-H is now the set of CART trees of depth four, or whatever, and r_i is the gradient of the loss with respect to H(x_i), that is r_i = dL/dH(x_i). You just derived correctly that for the squared loss this is the residual: the direction from the label to my current predictor. So I want to find a tree that minimizes this function. The question is: do I have to derive a totally new CART algorithm to minimize such a function? So far we haven't had anything like this; it's a weird objective. Turns out
actually we can shoehorn it very easily into our existing regression tree framework. First, let me just define the negative gradient, t_i = -r_i. In my figure, this is my y, this is my current H, and the negative gradient points exactly from H toward y, which is more intuitive: I want to go toward y. Now, if I plug in t_i, then because it's the negative gradient, I don't want to minimize anymore, I would have to maximize; I'm maximizing the inner product. So what we do instead is minimize the negative:

  argmin over h of - sum over i of t_i * h(x_i)

All I did is substitute r_i = -t_i, so a minus sign appears. I also dropped the alpha, because it's a constant and makes no difference. And because minimizing something once is only half as good as minimizing it twice, I put a factor of two in front as well; why not, it doesn't make any difference either. Raise your hand if you're still with me. Okay, any questions?

One more time: we're trying to minimize this inner product. The h here is the function we are learning, a limited-depth tree; it's not very good, but it's a function. The t_i's are given to us; they are the negative gradients of the loss function at every single training point.

Now come two small steps where you have to believe me. The first one: the sum over all t_i^2 is a constant, independent of my h. That's true: the t_i's are given to me, so the sum of the squared t_i's has nothing to do with h, and I can just add it to the objective, giving sum of t_i^2 minus 2 * sum of t_i * h(x_i). Raise your hand if you're still with me. Okay,
good. Now comes the last bit: I claim that the sum over i of h(x_i)^2 is also a constant. Now, this is not true, but it's very convenient, and here's why we should make it true. If you want to minimize the objective, you could just take any tree and multiply its predictions by a large number, and because the objective is linear in h (it's an inner product), you would get a smaller value. Essentially we're saying: I want a tree that points toward y, but a really long arrow would look "better" than a shorter one, and that shouldn't matter, because we're only interested in the direction. So we require that the sum of h(x_i)^2 is a constant; we're essentially saying h should lie on a sphere of radius c, and whatever c is doesn't matter. This decouples the length of the vector from its direction, and we can easily achieve it: whatever tree we have, we evaluate it on all the training points and rescale the predictions so the sum of squares is a constant, say one. We just normalize the outputs; we change the CART tree algorithm a tiny bit.

If that's the case, then I can add this constant to the objective too. So I get:

  argmin over h of sum over i of [ h(x_i)^2 - 2*t_i*h(x_i) + t_i^2 ]

where the first and last terms are constants and the middle one is the thing we want to optimize. And what's going on, who can see it? This is exactly

  argmin over h of sum over i of ( h(x_i) - t_i )^2

which is exactly what the regression tree minimizes! So we're done; we don't have to do anything new. All we need is this: here is our label vector y, which is what we initially have, one value, one
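The normalization argument can be sanity-checked numerically: once every candidate's predictions are rescaled to the same squared norm, minimizing -2 * sum t_i h(x_i) and minimizing sum (h(x_i) - t_i)^2 pick the same candidate. A sketch, assuming numpy; the candidate "trees" are just made-up random prediction vectors:

```python
import numpy as np

rng = np.random.default_rng(1)
t = rng.normal(size=8)   # the negative gradients t_i (made-up values)

# Candidate "trees", each represented only by its prediction vector on
# the training points, rescaled so sum_i h(x_i)^2 == 1 for all of them.
candidates = [rng.normal(size=8) for _ in range(20)]
candidates = [h / np.linalg.norm(h) for h in candidates]

# Objective 1: the alignment objective, -2 * sum_i t_i h(x_i).
best_align = min(range(20), key=lambda k: -2.0 * (t @ candidates[k]))
# Objective 2: the ordinary squared loss, sum_i (h(x_i) - t_i)^2.
best_square = min(range(20), key=lambda k: float(np.sum((candidates[k] - t) ** 2)))

print(best_align == best_square)   # True: both objectives pick the same tree
```

This is just completing the square: with ||h||^2 and ||t||^2 fixed, the two objectives differ by a constant.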
It holds one label for every single training point. We stick that into our tree; it sends us here, actually it sends us there, but we just take a small step. And now what do we do? We just take the remainder, right, and stick that into our tree again. Okay, that was a good start, one more time. We stick in this vector, now we end up here, right; we stick in this vector, let's say we go here, and so on. We keep moving around, but every time the target is just the residual, which is exactly what's left: the direction between our current prediction and the labels. And that's all there is to gradient boosted trees; it's a beautiful algorithm. Let me write it down in pseudocode. How much time do I have? Okay, good. I just need a small blackboard here; the entire algorithm will fit on this blackboard, and then you can go out and build your own search engine. So here's the idea. Initially, H equals zero. By the way, the algorithm here, yeah, please ignore that; I rewrote it in the notes that are online. So here's what we do: we take T steps, and each step is the following. We set t i equals y i minus H(x i): we take our labels and subtract from them what we already have in the bank, right, how close we already are. Then we say the next h we're trying to find is just the CART tree that minimizes the squared loss against what's left, my t i. And then my H becomes H plus alpha times h, and that's it; then you repeat this. So it's four lines of code in a loop, and it's one of the most powerful algorithms. The only hyperparameters you really have are two: one is how deep your trees are, and the second one is your step size, and that's something you have to tweak a little.
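As a concrete illustration, here is a minimal sketch of that four-line loop in Python, using hand-rolled depth-1 regression stumps on 1-D data as the weak learner. The names `fit_stump` and `gradient_boost` are made up for this sketch; in practice you would use deeper CART trees from a library rather than stumps.

```python
def fit_stump(x, t):
    """Weak learner: a depth-1 regression tree (stump) on 1-D inputs.

    Picks the threshold s that minimizes the squared loss against the
    targets t, predicting the mean of t on each side of the split.
    """
    best = None
    for s in sorted(set(x)):
        left = [ti for xi, ti in zip(x, t) if xi <= s]
        right = [ti for xi, ti in zip(x, t) if xi > s]
        lv = sum(left) / len(left) if left else 0.0
        rv = sum(right) / len(right) if right else 0.0
        loss = sum((ti - lv) ** 2 for ti in left) + sum((ti - rv) ** 2 for ti in right)
        if best is None or loss < best[0]:
            best = (loss, s, lv, rv)
    _, s, lv, rv = best
    return lambda xi, s=s, lv=lv, rv=rv: lv if xi <= s else rv


def gradient_boost(x, y, T=100, alpha=0.1):
    """The four-line boosting loop from the lecture: H = 0; then T times,
    fit a weak learner h to the residuals t_i = y_i - H(x_i) and
    update H <- H + alpha * h."""
    weak_learners = []
    H = [0.0] * len(x)                     # current ensemble prediction on the training set
    for _ in range(T):
        t = [yi - Hi for yi, Hi in zip(y, H)]             # residuals: what is left to explain
        h = fit_stump(x, t)                               # weak learner fit to the residuals
        weak_learners.append(h)
        H = [Hi + alpha * h(xi) for Hi, xi in zip(H, x)]  # small step in the direction of h
    return lambda xi: sum(alpha * h(xi) for h in weak_learners)


# Toy usage: the residuals shrink geometrically with each step, so the
# ensemble fits this tiny 1-D regression problem almost exactly.
x = [1.0, 2.0, 3.0, 4.0]
y = [1.0, 1.0, 3.0, 3.0]
model = gradient_boost(x, y, T=200, alpha=0.1)
predictions = [model(xi) for xi in x]
```

The two knobs from the lecture show up directly: the depth of the weak learner (here fixed at one) and the step size `alpha`; make `alpha` smaller and you need a larger `T`.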
Usually, deeper trees give you more exact steps, but you also have more variance; and if you take smaller trees, well, you have to find something in between. Usually a depth of four to six is good. And the step size here: if you make alpha smaller, you have to boost for more iterations. Well, that's really all there is to it. Any questions about gradient boosted trees? Yeah? [student question] So, certainly with gradient boosting, yes, but it's four lines of code, what are you talking about? I can put this inside of here, then it's, yeah, okay. So, questions like, why don't we just... Here's actually, there was actually a big competition. Yahoo, a couple of years ago, actually right when I left, had a big competition where they basically took all their search-engine data, anonymized it, and made it public, so all the features, and they made a big competition: they offered several thousand dollars of prize money for the person who could build the best search engine. And what the winner did is they boosted for a thousand iterations, and they did that several hundred times, and then they bagged those, and the result was, I think it was even more, the result was something like ten million trees that you had to evaluate for each search query, for each web page. So it takes minutes and minutes for each search query to actually get any results, but it was the most accurate by far. So you're right: in some sense they did observe a variance problem, and then they reduced it again with bagging and so on. Typically, however, the variance doesn't go up too quickly, and so if you keep your alpha and your T in check, right, you don't make alpha too small and you don't boost for too long, you should actually be fine. Yeah? [student question] I mean, the difference to that question is that we are not combining completely different weak
classifiers. But these are very different trees, though, right? Every single tree will be very, very different; it's just that we're not combining trees and SVMs or something, right? Yeah, that's right: you really have one type of algorithm that generates your weak classifiers, and you basically want to make a strong classifier out of it. All right, thank you everybody, see you next week.
Info
Channel: Kilian Weinberger
Views: 17,818
Rating: 5 out of 5
Keywords: artificial intelligence, machine learning, cornell, cs4780, kilian weinberger, course, boosting, gradientBoosting, xgboost
Id: dosOtgSdbnY
Length: 48min 27sec (2907 seconds)
Published: Wed Jul 11 2018