6.034 Recitation 10: Boosting (Adaboost)

Captions
so this is the last technical topic in 6.034 — you're almost there. We're going to talk about boosting, and specifically AdaBoost, which is a type of boosting that we saw in lecture this morning. The general idea is that we're combining together all the different machine learning methods we saw earlier in the semester to make a strong classifier that uses all the information. So what are some machine learning methods we've looked at? Nearest neighbors, okay. So we can make a list of machine learning methods. We have different flavors of nearest neighbors: one nearest neighbor, three nearest neighbors, five nearest neighbors, and we could use different distance metrics. What else have we looked at? SVMs, okay. What else? Neural nets. I think we have two more. Yeah, ID trees — and today in lecture we used just ID tree stumps, the first level of an ID tree, which we'll also do today. And what did we see last week? We looked at naive Bayes classification, which is yet another type of machine learning.

We're going to say that all of these are weak classifiers. In our classroom examples they generally classify the training data perfectly, because we only have like five to ten data points, but in a real data set, where you have dozens of features and hundreds of training points, it's going to be difficult to classify everything correctly. So we'll say these are weak, or imperfect, classifiers, and we'll call them little h sub i. What we're going to do is combine them all together, or combine some subset of them, to form a strong overall classifier, big H of x. This is sometimes called an ensemble classifier or aggregate classifier, because it's just combining together all of the weak classifiers. It's just like any other machine learning rule in that it takes in a vector of features and returns a classification; in this case it's a boolean classification, plus one or minus one. So our rule will be that H(x) is equal to the sign of a vote of all of the weak classifiers. We'll give each weak classifier some voting power that we'll call alpha. These alphas are completely different from the alphas in support vector machines, which are supportiveness values; these ones are voting power — we're just reusing the same letter. So we might have some voting power alpha 1 times the first weak classifier, plus some voting power alpha 2 times the second weak classifier, and we'll keep doing that until we've added enough weak classifiers to get a good classification out of our overall classifier, going up to alpha n times h n:

H(x) = sign(alpha_1 h_1(x) + alpha_2 h_2(x) + ... + alpha_n h_n(x))

So now the question is just how we build this classifier — how do we choose the classifiers and determine their voting powers? In the big picture, each of these weak classifiers will have an error rate: each weak classifier little h will have some error rate that we'll call epsilon. That error rate lies between zero and one, where zero means it classifies everything perfectly correctly and one means it classifies everything completely wrong. So if a classifier classifies more things correctly, would we want it to have higher or lower voting power? Higher, makes sense — so we'll give it more voting power in that direction. But if our classifier classifies everything wrong all the time, and there are only two classes, is there some way we could make use of that? Yeah — do the opposite. So we can also say that classifiers get more voting power as they become more wrong: we'll just give them more negative voting power, take their vote into account, and do the opposite of whatever they say. So we have more negative voting power as the error rate approaches one, more positive voting power as it approaches zero, and the least voting power in the middle, at an error rate of one half. Okay, so each classifier has an error rate.
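The sign-of-a-weighted-vote rule can be written as a minimal sketch in code. The two stumps and their alpha values here are hypothetical stand-ins, just to show the mechanics:

```python
import math

# H(x) = sign(alpha_1*h_1(x) + ... + alpha_n*h_n(x))
# Each weak classifier h_i maps a feature vector x to +1 or -1.
def ensemble_classify(alphas, weak_classifiers, x):
    vote = sum(alpha * h(x) for alpha, h in zip(alphas, weak_classifiers))
    return 1 if vote > 0 else -1   # a zero vote maps to -1 here

# Example with two hypothetical stumps: when the two disagree,
# the classifier with the larger voting power wins the vote.
h1 = lambda x: 1 if x[0] < 6 else -1
h2 = lambda x: 1 if x[0] < 2 else -1
alphas = [0.5 * math.log(4), 0.5 * math.log(3)]
print(ensemble_classify(alphas, [h1, h2], [3, 3]))  # h1 says +1, h2 says -1; h1 wins
```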
Now, the way we're going to assemble this overall classifier is in a series of rounds: our overall classifier big H of x is assembled round by round, and in each round we do a few things. We pick the "best" weak classifier to add to our overall classifier — I'm putting best in quotes because we haven't defined it yet — then we assign that weak classifier a voting power that we'll call alpha, and finally we append the term alpha·h(x) to our overall classifier big H of x. So now there's the question of what it means to be the best. We're going to say best means a couple of things. It will mean the classifier that makes the fewest errors — or, as we'll see later, possibly the most errors, but for now we'll keep it simple and just say the fewest errors. We're also going to weight that so we emphasize the points that were misclassified in previous rounds, and that's where the actual power of boosting comes in: if the weak classifiers we've chosen so far make certain mistakes, we want to correct for those in the next classifiers we choose. So "best" will be adjusted to emphasize points that were previously misclassified.

That's the general idea of boosting; now we're going to figure out how it actually works mathematically, using some of the formulas we derived in lecture. We'll build a flow chart and do an example problem. We can start with the example problem: we're going to have just five data points — this is a classroom example — with four plus points and one minus point. We'll label them A, B, C, D, and E; their x-coordinates are 1, 3, and 5, and their y-coordinates are also 1, 3, and 5. In this particular problem, which is from the 2012 quiz 4, we're allowed to use six classifiers. We're just going to use decision tree stumps, and in this case they'll all be vertical lines. Our classifiers will be x < 2, x < 4, x < 6, and then the opposites of those: x > 2, x > 4, and x > 6. Here's what these mean: x < 2, for example, means we draw the line x = 2, and everything on the left side of the line — everything with x less than 2 — is classified as a plus; everything else is classified as a minus.

So which points does x < 2 misclassify, if it says everything to the left is a plus? B and E, yeah, because it says B, C, and E are minuses, and it's wrong about B and E. So we'll make a list of misclassified points, and here we have B and E. What about x < 4? We draw the next line over, which says A, C, and D are pluses and B and E are minuses. What's wrong? Yeah: B, C, and E. Then x < 6 just says that everything is a plus, so it misclassifies point C. What about x > 2? That says everything to the right of the x = 2 line is a plus — yeah: A, C, and D. Notice that's exactly the opposite of what x < 2 misclassified, which makes sense, because we just inverted the classification. Likewise x > 4 is the opposite of x < 4, so it misclassifies A and D, and x > 6 is the opposite of x < 6, so it misclassifies A, B, D, and E. Okay, so now we get to the first step in our flowchart, which is assigning weights to each of the training points. We're going to initialize the weights to all be equal, because right now there's no reason to emphasize one training point over another. So we initialize the weights, setting the weight of each training point to one over the total number of training points, so that the weights add up to one. In other words, we're assigning each training point an equal weight.
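The setup above can be sketched in code. The point coordinates and labels here are a hypothetical reconstruction from the misclassification lists (C is the lone minus point; the exact y-coordinates don't affect vertical-line stumps, so they're omitted):

```python
# Hypothetical reconstruction of the five training points: (x, label).
points = {"A": (1, +1), "B": (5, +1), "C": (3, -1), "D": (1, +1), "E": (5, +1)}

def stump(threshold, direction):
    """direction=+1 means 'plus if x < threshold'; direction=-1 inverts that."""
    return lambda x: direction if x < threshold else -direction

classifiers = {"x<2": stump(2, +1), "x<4": stump(4, +1), "x<6": stump(6, +1),
               "x>2": stump(2, -1), "x>4": stump(4, -1), "x>6": stump(6, -1)}

def misclassified(h):
    """Names of the training points that classifier h gets wrong."""
    return sorted(name for name, (x, label) in points.items() if h(x) != label)

for name, h in classifiers.items():
    print(name, misclassified(h))
```

Running this reproduces the six misclassification lists from the discussion above.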
How many training points do we have? Five — so what should the weight of each point be? Yeah, one fifth. We'll turn this into a table with the weight of each training point — the weight of A, the weight of B, C, D, and E — and right now they're all 1/5. I said boosting happens in rounds, so this is round one. Okay, we have a weight for each training point; now we compute the error rate for each of our weak classifiers. The next step is: calculate the error rate for each weak classifier little h. The error rate is just the sum of the weights of all the points that that classifier misclassifies — the sum of the weights of everything it gets wrong. So what's the error rate of x < 2, which misclassifies B and E? Two fifths, because we're just adding the weight of B plus the weight of E. So if we fill in the error rate for each of these: the error rate of x < 2, we just said, is 2/5 — and in this first round the error rate is just the fraction of points misclassified, because all the points have equal weight. The error rate of x < 4 is 3/5, because it misclassifies three points; then we get 1/5 for x < 6, which misclassifies C; 3/5 for x > 2; 2/5 for x > 4; and 4/5 for x > 6. One thing to notice is that the error rate of each greater-than classifier is just one minus the error rate of the corresponding less-than classifier, which makes sense, because they do the opposite thing. Okay, so now we have error rates, and we need to choose which of these classifiers is the best. For now we'll ignore the fact that we could use classifiers that are mostly wrong, and just look at the one that makes the fewest errors. So: pick the best weak classifier, the one with the smallest error rate.
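The error-rate computation — summing the weights of the misclassified points — can be sketched as follows (again using a hypothetical reconstruction of the example's points, with uniform initial weights):

```python
# Error rate of a stump = total weight of the training points it misclassifies.
# Points are a hypothetical reconstruction (C is the minus point; y-coords omitted).
points = {"A": (1, +1), "B": (5, +1), "C": (3, -1), "D": (1, +1), "E": (5, +1)}
stumps = {"x<2": (2, +1), "x<4": (4, +1), "x<6": (6, +1),
          "x>2": (2, -1), "x>4": (4, -1), "x>6": (6, -1)}

def predict(threshold, direction, x):
    return direction if x < threshold else -direction

def error_rate(threshold, direction, weights):
    return sum(w for name, w in weights.items()
               if predict(threshold, direction, points[name][0]) != points[name][1])

weights = {name: 1 / 5 for name in points}   # equal initial weights in round 1
for name, (t, d) in stumps.items():
    print(name, error_rate(t, d, weights))   # x<6 comes out lowest, at 1/5
```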
So which one is the best right now, by that definition? Yeah, x < 6, because it has an error rate of 1/5, which is less than any of the others. So we'll pick that one. We'll extend our table a little and keep track of the best weak classifier in each round, its error rate, and its voting power. Our best weak classifier is x < 6, and we just said its error rate is 1/5; now we just need a formula for the voting power. This comes out of a formula we saw in lecture — we won't derive it right now, but you can derive it at home if you want. We'll calculate the voting power for our best weak classifier, and it's defined as one half the natural log of one minus the error rate, over the error rate: alpha = 1/2 ln((1 − ε)/ε). So if our error rate is 1/5, this is 1/2 ln((1 − 1/5)/(1/5)) = 1/2 ln((4/5)/(1/5)) = 1/2 ln 4. For now we don't actually care what the natural log of 4 is numerically — it's just some number, which is our voting power. So in the table we put 1/2 ln 4, and then we append that term to our overall classifier. If we build the overall classifier over here, big H of x is the sign of a bunch of terms. What's the first voting power? Yeah, the thing we just calculated, 1/2 ln 4. And what's the weak classifier? x < 6. As a shorthand I'll just write "x < 6" here; really this needs to be a function that returns +1 or −1 depending on the features you give it, but this is shorthand for whatever decision that classifier returns, so the term is plus or minus 1. So now we need to know whether we're done, and we ask: are we finished? There are a few ways we can be finished. The first is if our overall classifier big H is good enough. What counts as good enough depends on your goal, and in general it will be given. It could be something like, for a large
data set, "I want 90% classification accuracy." In this case, because we only have 5 points, let's say it's only good enough if it perfectly classifies all of our training data. So is our big H good enough? No, because it misclassifies point C, since it's currently only using x < 6. So it's not good enough. There are two more criteria, though. One is if we've done enough rounds. This is another criterion that has to be given to you; in this case let's say we'll do at most four rounds, and if we haven't reached one of the other termination criteria by then, we'll stop. Have we done enough rounds? No, we've only done one. The last criterion is if there are no good classifiers left. This one is a little more well-defined: we'll say there are no good classifiers left if our best weak classifier has an error rate of 1/2. We don't know that one until we look at the error rates of our weak classifiers, so each time we recalculate the error rates we'll have to ask the question again. In our first round we still had good classifiers left, so for now we're not done. If we were done, we would just return our overall classifier; since we're not, we update the weights of the points and go on to the next round. So: update the weight of each training point, and specifically do it to emphasize the points that were misclassified in the previous round. We'll have a weight-update formula for this, and it depends on whether the point was classified correctly or incorrectly. If the point was classified correctly, the new weight is 1/2 × 1/(1 − ε) × (old weight) — that's if it was classified right. Otherwise the new weight is 1/2 × 1/ε × (old weight) — that's if it was classified wrong.
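The two formulas just introduced — voting power and the weight update — can be sketched in code, with round 1 of the example as a check (a sketch; ε is the error rate of the weak classifier chosen in the current round):

```python
import math

def voting_power(epsilon):
    # alpha = 1/2 * ln((1 - eps)/eps); undefined at eps = 0 or eps = 1.
    return 0.5 * math.log((1 - epsilon) / epsilon)

def update_weights(weights, wrong_points, epsilon):
    # Misclassified points get w * 1/(2*eps); correct ones get w * 1/(2*(1-eps)).
    return {name: w / (2 * epsilon) if name in wrong_points
            else w / (2 * (1 - epsilon))
            for name, w in weights.items()}

# Round 1 of the example: x<6 has error rate 1/5 and misclassifies only C.
weights = {name: 1 / 5 for name in "ABCDE"}
weights = update_weights(weights, {"C"}, 1 / 5)
print(weights)              # C becomes 1/2, the rest become 1/8
print(voting_power(1 / 5))  # 0.5 * ln(4)
```

Note that the updated weights still sum to 1, as the discussion below requires.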
If a point was classified wrong, do we want its weight to go up or down? Up, because we want to emphasize that point. And because we want all of our weights to always add up to 1, that means the weights of the points that were classified right need to go down. So we can add that to our diagram: the new weight decreases if the point was classified correctly, and increases if it was classified wrong, to emphasize that point.

Okay, so let's try recalculating our weights for round two. If we recalculate the weight of A, we have 1/2 × 1/(1 − ε) × (old weight). Which error rate are we using? Yeah, 1/5 — the error rate of the weak classifier we just chose. And what's the old weight of point A? 1/5, because all of our old weights were 1/5. So that's 1/2 × 5/4 × 1/5; things cancel and we get 1/8. What's the new weight of point B? Yeah, it's the same: the 1/2 doesn't change, the error rate is the same, and it had the same old weight as point A, so it also gets a new weight of 1/8. Similarly, D and E were classified correctly, so they'll have weights of 1/8. But point C was classified wrong in the previous round — mark that — so the new weight of point C is 1/2 × 1/(1/5) × 1/5; the 1/5's cancel and we get just 1/2. There's another way we could have figured that out, though: we learned in lecture today that all of the weights should add up to 1 in every round, so since C was the only point left, we could have just looked at A, B, D, and E, seen that they add to 1/2, and concluded that the weight of C must be 1/2. That's actually a general fact: the weights of all the points that were classified correctly in the last round will always add up to 1/2, and accordingly the weights of the points that were
classified wrong in the last round will also add up to 1/2. So the sum of the weights of points classified right equals the sum of the weights of points classified wrong, which equals one half — and that also gives us the fact that the sum of all the weights is one. That means we can actually update our weights intuitively instead of using the weight-update formulas: we can just look at the weights and see that A, B, D, and E were in a ratio of 1:1:1:1; we know their new weights will be in the same ratio, because the old weight is the only thing in the formula that differs between them; and we know their new weights have to add up to one half — so they must all be 1/8. We'll try that in the next round and see how it works without calculating weights from the formulas.

Okay, so we've updated our weights; now which step do we go back to — what do we need to do next? Yeah, we need to calculate our new error rates. There isn't much space for this, but this goes all the way back to the "calculate error rates" step. So what's the new error rate of x < 2, which misclassifies B and E? How do we calculate that? We just add up the weights of those points: it misclassifies B and E, which each have a weight of 1/8, so we get 2/8. I'm going to rewrite C's weight as 4/8 to make these fractions easier to add. Then x < 4 misclassifies B, C, and E, so its error rate is now 1/8 + 4/8 + 1/8 = 6/8. x < 6 just misclassifies point C, so its error rate is 4/8, or 1/2. That's a useful fact: whichever classifier you just used will always have an error rate of 1/2 in the next round. The reason is that its error rate is the sum of the weights of the points it misclassifies — and those are exactly the points that were
misclassified by it in the previous round, whose weights, we just said, now add up to 1/2. So by definition its new error rate is 1/2 — does that make sense? Okay. What about x > 2? We can use the fact that x < 2 and x > 2 are opposites, so we just get 1 − 2/8 = 6/8; then x > 4 is 2/8, and x > 6 is 4/8, or 1/2. So which classifier is the best? Yeah, we have a tie, so we need a tie-breaking rule. Let's say our tie-breaking rule is that we choose classifiers that appear earlier in the list first. The lowest error rate is 2/8, it's a tie between x < 2 and x > 4, and so we choose the first one: x < 2, with an error rate of 1/4. Then if we calculate the voting power with an error rate of 1/4, we get 1/2 ln((1 − 1/4)/(1/4)) = 1/2 ln((3/4)/(1/4)) = 1/2 ln 3. There's actually a pattern here: when the error rate is 1 over some number n, the voting power is just 1/2 ln(n − 1). We can use that whenever we get error rates of the form 1 over something, so we don't have to keep calculating. Okay, so we add that term onto our overall classifier: we now also have 1/2 ln 3 times the decision made by x < 2. Are we done — is this classifier good enough? Well, what happens if exactly two weak classifiers are voting and one has higher voting power: if they disagree on something, who wins that vote? Yeah, the one with higher voting power. That means whenever they disagree — and they're going to disagree on point C — the first one wins, because ln 4 is bigger than ln 3. So this is basically the same as the classifier we had before, since the first one always wins. Is H good enough? No, not yet, because it still misclassifies point C. Okay, have we done enough rounds? No — we said we want to do four
rounds, and we did still have a good classifier, because our best classifier had an error rate that wasn't one half. So we keep going, and we update the weights. Which points were misclassified in round two? This is a little tricky: our overall classifier only misclassifies C, but when we're updating the weights, the points we consider are the ones misclassified by the most recent weak classifier. So we say that x < 2 misclassified B and E, and those are the points whose weights will increase for the next round. The reason we're not just looking at C is that even though C still isn't classified correctly by our overall classifier, the classifier we just chose did classify it correctly — and we've already taken our first classifier's mistake on C into account, so we're not going to double-count it. Given that B and E were misclassified in the previous round, their weights are currently in a ratio of 1:1 and their new weights have to add up to 1/2 — so what are their new weights? 1/4, yeah: 1/4 and 1/4. A, C, and D are a little trickier: they're in a ratio of 1:4:1, and their new weights have to add up to 1/2. For some people it might be easier to just use the weight-update formula to get one of them and then scale it to get the others, but if you want to do it intuitively, the way I like to think about it is: a ratio of 1:4:1 adds up to 6, so the denominator has to be some multiple of 6, and we want the sum of the three to be one half, so 6 × 2 gives a denominator of 12. If we use 1/12, 4/12, and 1/12, those add up to 6/12, which is one half, so we're good there. Okay, and then we can rewrite B's and E's weights in twelfths — those are 3/12 each — and then
it'll be easier to add them up. So now, what's the new error rate of the classifier we just used, x < 2? Its new error rate is the sum of the weights of the points it misclassifies, and we know the weights of B and E had to add up to 1/2 — so its error rate is 1/2. That's the same property again: the classifier you just used always has an error rate of 1/2 in the next round. Then for x < 4 we add up B, C, and E: 3/12 + 4/12 + 3/12 gives us 10/12. x < 6 misclassifies just point C, which is 4/12 — so this error rate has stopped being 1/2. Then for the greater-than classifiers we can take 1 minus the corresponding less-than error rates: 1/2 for x > 2, 2/12 for x > 4, and 8/12 for x > 6. So what's the best classifier? It's x > 4, yeah — 2/12 is the lowest error rate, so x > 4 is our best classifier, with an error rate of 1/6. Following the pattern, its voting power is 1/2 ln(6 − 1) = 1/2 ln 5, and if you calculate it from the formula you get the same value. So we add that to our overall classifier: plus 1/2 ln 5 times the decision made by x > 4. Now, is our overall classifier good enough? Yeah, it is, and here's how we can tell. If we look at which points each of these misclassifies: x < 6 misclassifies C, x < 2 misclassifies B and E, and x > 4 misclassifies A and D. That means for each training point, only one of the chosen classifiers is wrong about it. So as long as the two that are right can add up their voting powers and overpower the one that's wrong, the ensemble classifies everything correctly. That just means these three voting powers need to satisfy the triangle inequality: any two of them added together must be greater than the third.
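The whole three-round run can be sketched end to end. This uses the same hypothetical reconstruction of the five points; `Fraction` keeps the error rates exact, and breaking ties by list index matches the "earliest in the list wins" rule:

```python
import math
from fractions import Fraction

# Hypothetical reconstruction of the five points: (x-coordinate, label).
points = {"A": (1, +1), "B": (5, +1), "C": (3, -1), "D": (1, +1), "E": (5, +1)}
stumps = [("x<2", 2, +1), ("x<4", 4, +1), ("x<6", 6, +1),
          ("x>2", 2, -1), ("x>4", 4, -1), ("x>6", 6, -1)]

def predict(threshold, direction, x):
    return direction if x < threshold else -direction

def wrong(threshold, direction):
    return {n for n, (x, label) in points.items()
            if predict(threshold, direction, x) != label}

weights = {name: Fraction(1, 5) for name in points}
chosen = []
for _ in range(3):
    # Error rate = total weight of misclassified points; on ties, the
    # smaller index wins, i.e. the classifier earlier in the list.
    eps, i = min((sum(weights[n] for n in wrong(t, d)), i)
                 for i, (_, t, d) in enumerate(stumps))
    name, t, d = stumps[i]
    chosen.append((name, eps, 0.5 * math.log(float((1 - eps) / eps))))
    # Reweight: wrong points get w/(2*eps), right points get w/(2*(1-eps)).
    weights = {n: w / (2 * eps) if n in wrong(t, d) else w / (2 * (1 - eps))
               for n, w in weights.items()}

for name, eps, alpha in chosen:
    print(name, eps, round(alpha, 3))
```

This reproduces the run above: x < 6 with error rate 1/5, then x < 2 with 1/4 (winning the tie against x > 4), then x > 4 with 1/6.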
The smallest two are 1/2 ln 3 and 1/2 ln 4 — is their sum greater than 1/2 ln 5? How do you add logs? Yeah, you multiply the insides: 1/2 ln 3 + 1/2 ln 4 = 1/2 ln(3 × 4) = 1/2 ln 12. Is the natural log of 12 greater than the natural log of 5? Yeah, because the log of a bigger number is a bigger number. So is H perfect now? Who thinks we're done? Who thinks we're not done? You think we're done; no one else wanted to vote — this is human boosting — so yeah, we're done. The reason we're done is that each point is misclassified by only one of the chosen classifiers, and the other two can always overpower the one that's wrong. So H is good enough, and we don't need to keep going, because we only need one of the termination criteria to be true.

Now let's see what would have happened if we had allowed classifiers that make lots of mistakes and given them negative voting power. We could amend the step that says "smallest error rate": even better than the smallest error rate is the error rate that's farthest from 1/2, because a classifier that's almost always wrong is also useful. Mathematically, that means picking whichever classifier has the largest absolute value of the error rate minus 1/2. If we use that criterion, which classifier would be the best in our first round — is it still x < 6, or is there anything better or just as good? The last one, yeah: 4/5 is just as far from 1/2 as 1/5, just in the opposite direction. In this case our tie-breaking rule would pick the 1/5 over the 4/5, but they would be equally good. What about the next round — is there anything better than, or just as good as, 2/8? Yeah, 6/8 is just as good, so in that round we would actually have four classifiers that are tied
for being equally good, because they're equally far from 1/2 — but our tie-breaking rule would again pick the first one. What about the third round? Yeah, 10/12 and 2/12 are equally far from 1/2, but x < 4 comes first in the list, so we would pick x < 4 with 10/12. So that round's pick would become x < 4, with an error rate of 5/6, and if we do the math we end up with a voting power of 1/2 ln(1/5). What's another way to write that? Yes — it's the same as −1/2 ln 5. And that makes sense: instead of x > 4 we chose the opposite classifier, x < 4, so it makes sense that we give it the same voting power, just negated. So over here we would have plus (−1/2 ln 5) times the decision made by x < 4, and this makes the same overall decision as using the positive voting power with x > 4. As for which criterion to use: farthest-from-1/2 is better, but on a quiz you would be told whether to use farthest from 1/2 or just the smallest error rate.

A couple of other things we can see here. What would happen if we had an error rate of exactly 1/2 — what would the voting power be? We can calculate it: with an error rate of 1/2, we get 1/2 ln((1 − 1/2)/(1/2)) = 1/2 ln 1. What's the natural log of 1? Zero. So this isn't just less voting power: if we did choose a classifier with an error rate of 1/2, its voting power would be 0. That's one reason it's silly to choose a classifier with an error rate of 1/2. But then another question is: maybe we would then update the weights and things would get better? Let's see what would happen if we updated the weights with an error rate of 1/2. We wouldn't actually do this, but let's keep going to show why you wouldn't want to.
If we had 1/2 × 1/(1/2) × (old weight), those factors would cancel out and the new weight would just equal the old weight — and you'd get the same thing for the correctly classified points. So not only would you end up adding a classifier with zero voting power, which does nothing, but the weights would also stay exactly the same: you would just be stuck in an infinite loop of adding useless classifiers. That's why we stop if our best classifier has an error rate of 1/2. So we can note, in the middle of the diagram, that at an error rate of 1/2 the new weight would equal the old weight, even though we'll never actually choose that case. And as we said, going one way: if a point was classified correctly its new weight decreases, and if it was classified wrong its weight increases. What about as we get more negative voting power, with an error rate closer to 1 — do we get the same behavior? Yeah, it's exactly the opposite, and the reason is that x < 4, say, misclassifies B, C, and E, but because we're negating its decision, it's effectively misclassifying A and D. So it would say B, C, and E are classified wrong, but effectively they're classified right, and we want their new weights to decrease.

Okay, let's look at a couple of other neat facts about boosting — any questions first? Okay. One fact is that the weight of each training point is always greater than 0 and less than or equal to 1/2. That's because in the first round we start off with equal weights for all the points, and when we update them, each point was either classified right or wrong, and each of those groups sums to one half, so no single weight can ever exceed 1/2. The only exceptions are weird, trivial, pathological cases — like having exactly one training point, or having a perfect weak classifier, in which case there's no reason to be doing boosting — but in general these bounds hold.
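The infinite-loop point can be checked directly: with ε = 1/2, the update formulas leave every weight unchanged and the voting power is zero. A sketch, reusing the two formulas from earlier (the example weights here are made up, with the "wrong" set carrying total weight 1/2):

```python
import math

# With an error rate of exactly 1/2, both weight-update factors,
# 1/(2*eps) and 1/(2*(1-eps)), equal 1, and the voting power is 0.
def voting_power(epsilon):
    return 0.5 * math.log((1 - epsilon) / epsilon)

def update_weights(weights, wrong_points, epsilon):
    return {name: w / (2 * epsilon) if name in wrong_points
            else w / (2 * (1 - epsilon))
            for name, w in weights.items()}

weights = {"A": 0.125, "B": 0.25, "C": 0.5, "D": 0.125}
print(voting_power(0.5))                        # 0.0
print(update_weights(weights, {"C"}, 0.5))      # identical to weights
```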
A consequence of those up/down arrows is that the new weight of a point will never equal its most recent weight — a point's weight always changes from one round to the next, because the only way for it not to change is an error rate of 1/2, which we'll never choose. Another consequence: think about what happens to an outlier or noise point. Most of the classifiers will classify it wrong, so its weight will keep increasing and increasing until eventually it gets classified correctly, or we stop before then. One way to think about it is that the boosting system becomes more and more anxious about this outlier point that never gets classified correctly, until eventually it does. In terms of overfitting, we saw this in lecture today in one of the examples — I think it was the rings of Saturn — where there was some noise point and eventually the classifier ended up with a very small area dedicated to it. You could say that's overfitting, but boosting overfits a lot less than most classifiers, because it ignores that point for a long time, until it's the only thing left to fix. So in general boosting helps a lot with overfitting, and you could also do something like choosing a good-enough threshold of "I want to classify 95% of my training data," and then you would ignore that outlier.

Two other neat things. One: if you have three or more disjoint classifiers — classifiers that make non-overlapping errors, which we had in this example — then you're guaranteed to be able to make a perfect classifier out of them. For example, if we had three classifiers H1, H2, and H3, and H1 misclassified A, B, and C; H2 misclassified D and E; and H3 misclassified F and G, then it's definitely possible to assign each of them equal voting power and get a perfect classifier. One thing to note, though, is that it's not guaranteed that boosting will choose those three classifiers in the first three rounds — just that, as a human, you could construct that classifier. Another thing to note is that the converse is not true: just because classifiers make overlapping errors doesn't necessarily mean you can't make a perfect classifier out of them. And one last thing: if a classifier makes a superset of the errors that another classifier makes — here, x < 4 misclassifies B, C, and E, whereas x < 2 misclassifies just B and E — then that classifier is guaranteed never to have the smallest error rate. The reason: the error rate of x < 2 is the weight of B plus the weight of E, while the error rate of x < 4 is those weights plus the weight of C, and every weight is always greater than zero. So if it misclassifies all the points that x < 2 misclassifies and then some, it will always have a bigger error rate. It will never have the smallest error rate — though you might still use it if you're using the farthest-from-1/2 criterion. I think that's everything — that's basically everything you need to know about boosting, and that's basically everything you need to know in 6.034. Have a great Thanksgiving.
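The disjoint-classifiers fact can be sketched as well: with three classifiers making non-overlapping errors and equal voting power, every point is voted on correctly two-to-one. The classifiers, points, and labels below are hypothetical, matching the H1/H2/H3 example above:

```python
# Three hypothetical classifiers with non-overlapping error sets over
# points A..G: H1 is wrong on {A,B,C}, H2 on {D,E}, H3 on {F,G}.
# With equal voting power, the two correct voters always outvote the
# one wrong voter, so the majority vote classifies every point correctly.
true_labels = {"A": 1, "B": 1, "C": -1, "D": 1, "E": -1, "F": -1, "G": 1}
error_sets = [{"A", "B", "C"}, {"D", "E"}, {"F", "G"}]

def vote(point):
    # Each classifier outputs the true label unless the point is in its
    # error set; all three classifiers get voting power 1.
    total = sum(-true_labels[point] if point in errs else true_labels[point]
                for errs in error_sets)
    return 1 if total > 0 else -1

print(all(vote(p) == label for p, label in true_labels.items()))  # True
```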
Info
Channel: Jessica Noss
Views: 26,251
Rating: 4.9410524 out of 5
Keywords: MIT, Massachusetts Institute of Technology, AI, Artificial Intelligence, CSAIL, 6.034, Patrick Winston, Patrick Henry Winston, phw, jmn, Machine Learning, Supervised Machine Learning, ML, Supervised ML, boosting, Adaboost, weak classifiers, strong classifier, overall classifier, aggregate classifier, amalgam classifier, ensemble classifier, voting power
Id: gmok1h8wG-Q
Length: 51min 56sec (3116 seconds)
Published: Fri Dec 02 2016