Bagging and Boosting

Captions
Good morning. Today we will have the second part of the lecture on ensemble learning. In the last class, we gave a broad introduction to the different ensemble learning methods. Today I will talk about two specific ensemble learning methods: bagging and boosting.

Bagging stands for bootstrap aggregation. We saw that, in order to use an ensemble of learners and be able to combine them to get better accuracy, what is needed is a set of learners which make independent errors. We discussed in the last class that, to force the learners to make independent errors, you can make them use different data samples, different parameters of the algorithm, or different algorithms altogether, and the outputs can be combined by some form of voting.

In bootstrap aggregation, the training samples are generated so that they differ from each other. Suppose you have a training set D and you are going to train k learners. You could split D into k partitions and use one partition for each learner. However, it is often the case that the training set is not large, so if you make it smaller by splitting it into k disjoint sets, the individual training sets will be small. As we observed while discussing the foundations of machine learning, a learner trained on a small training set can overfit: it will have high variance and will not generalize well. We cannot afford fully independent data, but we can sample randomly from the data and allow some overlap. So instead of dividing D into k disjoint groups, we treat D as a pool and sample from it to obtain subsets D1, D2, D3, ..., Dk. These subsets may have some instances in common, but they will also contain different values.

To obtain each Di, we draw random samples from D with replacement. For example, if we want each Di to have m examples, we randomly draw m samples from D with replacement; the same instance can be repeated several times, and some instances may not appear in a particular Di at all, so the Di's can differ from each other.

We discussed in the last class that, for ensembles to work, the base learners may have high variance; by combining k such learners we can reduce the variance, in certain cases by a factor of 1/k. So we can start with learners that have high variance. We also discussed that, in order for the different learners to produce different hypotheses, we want the learners to be unstable. Stable learners produce similar outputs irrespective of the data they are given; that is, they have low variance. Learners with high variance are unstable. Among the algorithms we have studied, decision trees and neural networks are unstable: given different data, they may give quite different models. An algorithm like k-nearest neighbour is more stable. So when we want to use ensembles, or specifically bagging, we wish to use unstable learners.
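To make the bagging procedure concrete, here is a minimal sketch in Python. It assumes numpy arrays and uses scikit-learn's DecisionTreeClassifier as the unstable base learner, since decision trees were mentioned above as unstable; the helper names and details are illustrative, not taken from the lecture.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagging_fit(X, y, k=10, rng=np.random.default_rng(0)):
    """Train k trees, each on a bootstrap sample drawn with replacement from (X, y)."""
    n = len(X)
    learners = []
    for _ in range(k):
        idx = rng.integers(0, n, size=n)               # n draws with replacement
        learners.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return learners

def bagging_predict(learners, X):
    """Combine the k models by unweighted voting; assumes class labels are -1/+1."""
    votes = np.stack([m.predict(X) for m in learners])  # shape (k, n_samples)
    return np.sign(votes.sum(axis=0))                   # a tie (sum 0) would need a tie-break rule
```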
We use bootstrapping to generate these k training sets, and we train a learner on each. Suppose D contains the instances 1, 2, 3, 4, 5, 6, 7, 8. By sampling we may get D1 = {1, 5, 1, 2, 3}, D2 = {6, 1, 6, 5, 4}, and D3 = {4, 1, 5, 1, 1}. On these different data sets we train learners L1, L2, ..., Lk and obtain a model from each, and these models are combined by voting. So in bagging we use bootstrapping: we sample with replacement to get the individual data sets, train a base learner on each bootstrap sample, and combine the outputs by voting.

Now, if you look at any particular instance in the original data set, what is the probability that it is included in a bootstrap sample? Suppose the data set is of size n and each bootstrap sample is also of size n. In a single random draw, the probability that a particular instance is not chosen is 1 - 1/n. Over n independent draws, the probability that it is never chosen is (1 - 1/n)^n, which can be approximated by e^-1. So the probability that a particular instance appears in a given sample is about 1 - 1/e, which is roughly 0.63: about 63 percent of the unique instances occur in each bootstrap sample, and the remaining slots are filled by repeats. You can verify this yourself. So this is about bagging.

Next we come to a second ensemble algorithm, which we call boosting. In bagging, we generate the data samples D1, D2, ... from the original data set, and every instance has the same probability of being included in any of the subsets. In boosting, we again sample Di from D, but now we assign a probability to each instance. Suppose there are n instances; initially we give equal probability to all of them, so each has probability 1/n. In the first iteration you choose D1 by sampling according to this uniform distribution, and on D1 you train a learner L1. You then apply L1 to all the instances in D. If this learner had 100 percent accuracy on these instances it would be a very good learner and there would be no need to proceed further, but in most cases the learner will label some of the instances in D correctly and some wrongly. Suppose L1 labels instances 1, 3, 4, 7 and 8 correctly and labels instances 2, 5 and 6 wrongly. What we want is for the next learner to make independent errors, and in particular to do well on those instances where the current learner has not done well. So we will now change the probability distribution.
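As a quick aside before continuing with boosting: the roughly 63 percent figure mentioned above for bootstrap samples is easy to check numerically. A small sketch, assuming numpy (not part of the lecture):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
sample = rng.integers(0, n, size=n)            # n draws with replacement from n instances
frac_unique = np.unique(sample).size / n       # fraction of distinct instances that appear
print(frac_unique)                             # close to 0.632
print(1 - (1 - 1 / n) ** n, 1 - np.exp(-1))    # the exact expression and its limit, 1 - 1/e
```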
Coming back to boosting: more specifically, we identify those instances which were wrongly labeled by learner L1 and increase their probability, and we proportionately reduce the probability of those which were correctly labeled. In other words, we raise the probability of the instances that L1 did not classify correctly. We keep the same data set, but the probability distribution over it changes. We now sample D2 according to this new distribution, and the resulting subsample is more likely to contain 2, 5 and 6 than 1, 3, 4, 7 and 8. On D2 we apply the learning algorithm and get the learner L2. Again we apply L2 to the training instances and find out which ones it makes errors on. Suppose L2 wrongly classifies 2, 3 and 7 and classifies the rest correctly. We then increase the probability of 2, 3 and 7 beyond their current values and reduce the probability of the rest, keeping the sum of the probabilities equal to 1. Based on this distribution we sample D3, on which we train learner L3, and so on: we obtain different data sets and different learners. Finally we combine the outputs of the different learners by voting, and while voting we assign a weight to each learner, where the weight is related to how good that learner was. This is the basic boosting algorithm.

Boosting is thus an iterative procedure. We start with a uniform probability distribution over the given training instances and adaptively change the distribution of the training data: initially all training instances have equal weight, and after each round of boosting the weights change. In each round we learn a hypothesis and assign a strength to that learner; the strength is used as the weight in the voting, and the final classifier is a linear combination of the different hypotheses weighted by these strengths. As we said earlier, the individual learners may be weak learners, and they should be unstable; in fact it is fine if they are weak, but they must have more than 50 percent accuracy for a 2-class problem.

There are several boosting algorithms. The most common one is AdaBoost, which I will talk about in today's class. We have seen that boosting can turn a weak learning algorithm into a strong learner. AdaBoost takes a sample as input; let us call this sample S, containing (x1, y1), (x2, y2), ..., (xm, ym), where each x is a vector of input attributes and y is the output attribute. We keep a weight for each training example: D_0(i) is the initial weight of the i-th training example, and D_0(i) = 1/m for all i between 1 and m. So at time 0 all the training examples have equal weight 1/m, and we use some base learning algorithm. We then do T iterations. In the first iteration we construct a sample S1 by drawing from S according to the distribution D_0, and from S1 we train the first learner L1.
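The sampling step just described, drawing the next training sample from S according to the current distribution over the instances, might look like the following sketch; this is an assumed illustration using numpy, not code from the lecture.

```python
import numpy as np

def sample_by_distribution(X, y, weights, rng=np.random.default_rng(0)):
    """Draw a sample of size n with replacement, picking each instance with
    probability equal to its current weight (the weights sum to 1)."""
    n = len(X)
    idx = rng.choice(n, size=n, replace=True, p=weights)  # heavier instances appear more often
    return X[idx], y[idx]
```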
We then find out which training examples L1 classifies correctly, and we update the distribution: for those examples on which L1 is wrong, we increase the weight D_1(i). Having updated D_1, we sample S2 from it, and from S2, by running our algorithm, we produce the learner L2. Let us say the error of L1 is ε_1, the error of L2 is ε_2, and similarly S3 produces L3 with error ε_3. Here ε_t is the error over the current sample, not over the original data; for example, ε_3 is the error over S3.

With this, please look at the slide for the boosting algorithm. We are given S = {(x1, y1), (x2, y2), ..., (xm, ym)}, where x is the set of input attributes and y is the output attribute, and we initialize the distribution to 1/m. Then we run the iteration for t = 1, ..., T. We train the weak learner using D_t and get a weak classifier, call it h_t. Then we choose α_t as the weight associated with h_t; we will see later how α_t is chosen. And we update the distribution: D_{t+1}(i) = D_t(i) · exp(-α_t y_i h_t(x_i)) / Z_t. Here y_i is the target output of training example i and h_t(x_i) is the output of the learner h_t, so the product y_i h_t(x_i) is +1 if they agree and -1 if they do not. Z_t is a normalization factor chosen so that the sum of the updated probabilities over all i is equal to 1. We do this T times, and the final classifier we output is H(x) = sign(Σ_t α_t h_t(x)). So h_1, h_2, ..., h_T are the different classifiers that are produced, each weighted by its α_t; the total is computed, and if the total is positive the class is positive, and if it is negative the class is negative. This is the basic AdaBoost algorithm.

You can see the slide here, where the algorithm is given: this is our sample, each y_i is either -1 or +1, and this is the initialization of the probability distribution. We run the algorithm for T iterations. In each iteration we use the current probability distribution to generate the data, train a classifier h_t on that data, compute α_t (I will show you how α_t is computed), and based on that compute the new probability distribution D_{t+1}(i); Z_t is the normalization factor that makes the probabilities sum to 1. The final classifier is taken as a vote of the different h_t's weighted by α_t.

Now, let us see how α_t is computed. First we need to compute ε_t, the error of the hypothesis h_t on the current sample: ε_t = Σ_i D_t(i) · δ(h_t(x_i) ≠ y_i), where D_t(i) is the probability of the i-th training example and δ(h_t(x_i) ≠ y_i) is 1 if example i is misclassified and 0 otherwise. In other words, we take the sum of the misclassifications weighted by the probabilities; that is ε_t.
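A minimal sketch of the distribution update and the final weighted vote just described, assuming numpy arrays with labels and predictions in {-1, +1}; α_t is taken here as a given number, and how it is computed is described just below.

```python
import numpy as np

def update_distribution(D_t, y, h_t_pred, alpha_t):
    """D_{t+1}(i) = D_t(i) * exp(-alpha_t * y_i * h_t(x_i)) / Z_t.
    The product y_i * h_t(x_i) is +1 when they agree and -1 when they disagree."""
    D_next = D_t * np.exp(-alpha_t * y * h_t_pred)
    return D_next / D_next.sum()        # dividing by Z_t makes the weights sum to 1

def final_classifier(alphas, predictions_per_learner):
    """H(x) = sign(sum_t alpha_t * h_t(x)), given each learner's -1/+1 predictions."""
    scores = sum(a * p for a, p in zip(alphas, predictions_per_learner))
    return np.sign(scores)
```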
The weight α_t is computed as α_t = (1/2) log((1 - ε_t) / ε_t). So ε_t is a measure of the error and α_t is a measure of the accuracy, or strength, of the particular hypothesis h_t, and in the ensemble, when we combine the votes of the different hypotheses, we combine them based on α_t (a numerical sketch of this computation is included after the worked example below). This is the mechanism followed by the AdaBoost algorithm, which is a very popular ensemble learning algorithm.

Now let us talk a little more about why this ensembling works. If you are given a number of weak classifiers, they can be combined to get a strong classifier. The weak classifiers must have error better than random, that is, ε_t must be less than 0.5 for a 2-class problem. It can be shown that AdaBoost will then achieve zero training error exponentially fast, under the assumption that the errors of the different hypotheses are independent. Specifically, the average training error of the ensemble produced by AdaBoost satisfies

(1/m) Σ_{i=1..m} δ(H(x_i) ≠ y_i) ≤ Π_{t=1..T} Z_t ≤ exp(-2 Σ_{t=1..T} (1/2 - ε_t)^2),

where the product runs over all the members of the ensemble and T is the total number of AdaBoost iterations. As we said, ε_t has to be less than 0.5; even if 1/2 - ε_t is only a small number, each term (1/2 - ε_t)^2 is positive, so as T increases the exponent grows and this bound drops to zero exponentially fast. Therefore, if the learners make independent errors, AdaBoost can become very good as we increase the number of iterations. We are not going into the theory, but this is the result which has been proved.

Now, please look at this slide; we will illustrate how AdaBoost works. Suppose this is the initial data: we have 10 data points, 5 of them positive (shown in blue) and 5 negative (shown in green), and each of them has probability 0.1, so the probabilities sum to 1. For boosting you select the first set of training examples, and the points shown within squares get selected; these are the data points used for training. Based on this you get a classifier, B1, which, let us say, predicts these 3 points as positive and these 7 as negative. The three are correctly labeled as positive, but two of the points labeled negative here are actually positive, so they are wrongly labeled. The probability distribution is now updated: the two points on which B1 makes an error have their probability boosted, using the boosting formula, to 0.4623, and to balance this the probabilities of the other 8 examples are reduced to 0.0094. Based on these values, ε and α can be calculated, and α_1 happens to be 1.9459. Now the next round takes place: the data points are sampled from this new distribution, and let us say the 6 points shown within squares are sampled. The learner we come up with is B2, and B2 again makes errors, this time on three of the examples.
Based on this, the probabilities of those examples are increased and the others are reduced, and α_2 in this case is computed to be 2.9323. Then the third round takes place on this new distribution: some samples are chosen, B3 is the resulting classifier, and it makes errors on the values in the middle; based on that the probabilities are recomputed, and α_3 is now computed to be 3.8744. Finally, these three classifiers are combined with weights α_1, α_2 and α_3, and the overall classifier is now able to label these points as positive, these as negative, and these as positive. So in this particular case boosting is able to combine the three classifiers into a classifier which is fully accurate; it may not always happen that you get something fully accurate. But what is interesting is that the individual classifiers were linear classifiers, each just making a single division of the interval, whereas the ensemble classifier does not belong to the same class of classifiers: it is able to identify that there are positive examples here and negative examples there. So this is an illustration of the AdaBoost algorithm. With this, I stop today's lecture on boosting. Thank you.
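As a supplementary sketch tying together the pieces above, here is a compact AdaBoost-style training loop in Python. It is an assumed, simplified illustration rather than code from the lecture: it uses depth-1 decision trees (stumps) from scikit-learn as the weak learners, and it fits them with instance weights directly instead of resampling according to the distribution, which is a common variant of the resampling procedure described in the lecture. Labels are assumed to be -1/+1.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, T=10):
    """Run T rounds of AdaBoost; y must be a numpy array with values in {-1, +1}."""
    m = len(X)
    D = np.full(m, 1.0 / m)                      # uniform initial distribution over examples
    learners, alphas = [], []
    for t in range(T):
        h = DecisionTreeClassifier(max_depth=1)  # a weak learner (decision stump)
        h.fit(X, y, sample_weight=D)             # weighted fit instead of resampling from D
        pred = h.predict(X)
        eps = max(D[pred != y].sum(), 1e-12)     # weighted training error epsilon_t
        if eps >= 0.5:                           # the weak learner must beat random guessing
            break
        alpha = 0.5 * np.log((1 - eps) / eps)    # strength of this learner
        D = D * np.exp(-alpha * y * pred)        # raise weights of misclassified examples
        D = D / D.sum()                          # normalize (the Z_t factor)
        learners.append(h)
        alphas.append(alpha)
    return learners, alphas

def adaboost_predict(learners, alphas, X):
    """Final classifier: the sign of the alpha-weighted vote of the weak learners."""
    return np.sign(sum(a * h.predict(X) for a, h in zip(alphas, learners)))
```

On easy data such as the 10-point example on the slide, a few rounds of simple stumps are typically enough for the weighted vote to fit the training set, which mirrors the exponential decrease of the training-error bound discussed earlier.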
Info
Channel: Machine Learning- Sudeshna Sarkar
Views: 61,208
Rating: 4.8264151 out of 5
Id: MRD67WgWonA
Length: 31min 45sec (1905 seconds)
Published: Thu Sep 01 2016