Boosting - EXPLAINED!

Video Statistics and Information

Captions
Hello everyone, my name is Ajay Halthor, and in this video we're going to talk about boosting.

So, you're a programmer working a dead-end job, not making enough money, and you decide that you need a side hobby that's somewhat more stable, so you go into gambling on horse races. But clearly you're no expert, and you barely have an idea of what to look into while placing your bets. Realizing your ineptitude, you consult an expert gambler and ask him, "What are the most important factors in determining the winners of a horse race?" And he says, "Well, it's complicated. There's no grand set of rules, you see." To which you say, "OK then, but what if we take a look at just these ten races only?" And he says, "Well, there are some rules of thumb we can establish. For example, for these ten races, it looks like the horse with the most previous wins won the race, so just bet on the horse with the most previous wins." If we were to present another collection of ten different races, the expert may say, "Well, for these ten races it looks like the lighter horses performed better, so you should bet on the lighter horses."

Now, these rules are very general, and while these rules of thumb are still better than a random guess, they are quite inaccurate. The lighter horses may have had an edge in these last ten races, but we can come up with probably a hundred other cases where the heavier horses won. And so we give our expert a different set of races, and he gives us back some rules of thumb. But here's what we need to figure out: what is the best way to group these horse races into sets in order to get the best rules of thumb from our expert? And even if we do that, and we have these rules of thumb ready, how do we actually make decisions on which horse to bet on in a race? Boosting solves both of these questions: it's a technique where we combine multiple rules of thumb to make an accurate and informed decision.

But how is this possible? I want to take a step back and segue into some theory. Neural networks, logistic regression, support vector machines: all of these models answer the question "how do we learn to solve this problem?" But a question we should actually ask before this is: is this problem learnable, is this problem solvable at all? To answer this question, a machine learning framework called the PAC learning model was introduced back in 1984. This framework quantitatively defined how you determine whether a problem is learnable.

So what is the PAC model exactly? Well, it's a "probably approximately correct" model. OK, there are a lot of words here, so let's break it down. We want a correct model: a model that always makes correct predictions on an unseen dataset. You can only get this by seeing every possible input combination, but if you do that, at that point it's just memorization and not generalization, which is the point of machine learning, so there's no learning. OK, so if we can't get a correct model, can we get an approximately correct model? Well, we can gather some training dataset and potentially train a model that has a very small nonzero error. However, we cannot always guarantee that the error is something tolerable. What if we accidentally sample some bad training data? Then we're going to get a bad model with a pretty high error. So we have to make sure that the probability of getting this bad model as a result of bad training data is low, lower than some threshold. So we cannot be approximately correct always, but we can probably be approximately correct, and hence the name PAC.

Putting this all together: a problem is PAC-learnable if a learning algorithm can find a solution that has an error less than some threshold, with a probability that is greater than some other threshold.
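In symbols (the video states this only in words; epsilon and delta below are the two thresholds it refers to), the condition on the learned hypothesis h can be written as

\[
\Pr\big[\,\mathrm{error}(h) \le \epsilon\,\big] \;\ge\; 1 - \delta
\]

that is, h is approximately correct (error at most epsilon) with high probability (at least 1 - delta): probably approximately correct.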
This is what I meant by quantitatively defining what is learnable: we have actual numbers here. Now, this defines PAC learnability, but if we look at this same definition in terms of the learning algorithm, we can say that a learning algorithm that satisfies these threshold conditions is a strong learner.

OK, that's cool, but we're talking about boosting here, so where does that fit in? Keep listening. Consider the simple problem of classifying a flower into its correct category based on four numeric features; this, you may know, is the basis of the Iris dataset. This simple problem can be solved with reasonably low error using something like logistic regression. So for this particular problem, logistic regression is a strong learner: it fits our definition with the thresholds, that is, it is able to learn to model the Iris dataset with an error less than some epsilon, with a probability greater than some other threshold like 1 minus delta.

But for more complex problems, a strong learner would need to be more complex to make sure it satisfies the threshold conditions. For these problems we require a lot more learnable parameters, a lot more samples for training, and we may also have a very high hardware requirement. Solving complex problems with stronger algorithms is actually the direction in which we are moving in the field of deep learning research specifically. But for general use, if we want to solve complex problems like these and don't have the most sophisticated hardware, what do we do? Or what if we don't have enough training examples? Can we not solve these complex problems at all?

This is where weak learners come in. In a 1988 paper, researchers Michael Kearns and Leslie Valiant defined weak learners as algorithms that perform just slightly better than random guessing. Shortly after, in 1990, another researcher, Robert Schapire, illustrated the power of these weak learners in his paper "The Strength of Weak Learnability". The main takeaway of this paper was that if a problem can be solved by a strong learner, then a weak learner should be able to do it too, in some way. He showed this by introducing a technique called the hypothesis boosting mechanism, so let's talk about that.

A hypothesis represents what a model has learned after training; in the supervised learning case, think of it as the final equations with the weights. Instead of constructing one hypothesis, though, why not construct three hypotheses, have them all make predictions, and then go by majority vote? We can get different hypotheses from the same algorithm just by training it on different training data. So to construct the first hypothesis, feed it all the training data we have. For the second hypothesis, construct a dataset where half the points were classified correctly by the first hypothesis and the other half were classified incorrectly; this way, the second hypothesis can somewhat compensate for the shortcomings of the first. The third hypothesis is basically a tiebreaker, so we use the points the first two didn't agree on to construct training samples for it; the third hypothesis overcomes the shortcomings of the first and the second. When we want to make predictions on unseen data, we feed it to all three hypotheses, determine their outputs, and the majority vote wins. With this version of boosting, algorithms actually performed better than without boosting.
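Here is a minimal Python sketch of this three-hypothesis scheme (the video gives no code, so the choice of scikit-learn decision stumps as the weak learner, 0/1 labels, and the exact way the second training set is balanced are assumptions for illustration):

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def fit_three_hypotheses(X, y):
    # Hypothesis 1: train on all the available data.
    h1 = DecisionTreeClassifier(max_depth=1).fit(X, y)

    # Hypothesis 2: train on a set that is half correctly and half incorrectly
    # classified by h1, so it compensates for h1's shortcomings.
    pred1 = h1.predict(X)
    right, wrong = np.where(pred1 == y)[0], np.where(pred1 != y)[0]
    n = min(len(right), len(wrong))
    if n == 0:                          # h1 already perfect (or hopeless) here
        return h1, h1, h1
    idx2 = np.concatenate([right[:n], wrong[:n]])
    h2 = DecisionTreeClassifier(max_depth=1).fit(X[idx2], y[idx2])

    # Hypothesis 3: the tiebreaker, trained on points where h1 and h2 disagree.
    disagree = np.where(h1.predict(X) != h2.predict(X))[0]
    h3 = (DecisionTreeClassifier(max_depth=1).fit(X[disagree], y[disagree])
          if len(disagree) else h1)
    return h1, h2, h3

def predict_majority(h1, h2, h3, X):
    # Majority vote over the three hypotheses (binary 0/1 labels assumed).
    votes = h1.predict(X) + h2.predict(X) + h3.predict(X)
    return (votes >= 2).astype(int)

Each hypothesis here is deliberately weak (a depth-one stump), so any extra accuracy comes from the vote rather than from any single model.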
Now, think about this for a second: what if the algorithm in question was a weak learner? That means we can construct three weak hypotheses by boosting, and by combining these three weak hypotheses we may be able to make strong predictions and solve more complex problems. That is, weak learners can be used to make strong predictions, which was the big takeaway of the paper.

But what if the problems were even more complex? Three weak hypotheses may still not be enough to get adequate performance, so we can use multiple such three-hypothesis units to get a hierarchical structure where the final output would be a majority of majorities, if you will. But this does not scale well. In 1995, researchers from AT&T Labs introduced a modification to this algorithm: collapse the entire hierarchical structure and take a single majority vote over multiple hypotheses. This slightly improved performance, and for more complex problems all we need to do is add more hypotheses until our training error goes down, so it's quite scalable.

Here's another thought: in this case, every hypothesis has an equal say in the final decision, but what if they were weighted instead? What if, after adding a hypothesis, their importances would adapt or change depending on the errors of the newly added hypotheses? This is the idea behind adaptive boosting, or AdaBoost, one of the well-known variants of boosting we see today. The main difference between AdaBoost and the previous variants is twofold: every weak hypothesis has a weight while constructing the final decision, and every sample has a weight associated with it while constructing a weak hypothesis. This is as opposed to the filtering of data samples that we discussed before in the hypothesis boosting mechanism.

Here's the algorithm. Step 1: initialize the weights of each sample to be the same. Step 2: construct the first hypothesis by training on these data points. Step 3: determine the training error of this hypothesis and then make the weight updates. Now, there are two sets of weights that need to be updated. The first is the sample weights for the next hypothesis: we increase the weights of misclassified samples and decrease the weights of the correctly classified samples, to ensure that the next hypothesis focuses more on what the current one missed. We also update the weights of the hypotheses themselves: in the case of the first iteration, the weight of the first hypothesis will be 1, because there's only one hypothesis, and as we add more and more hypotheses over the iterations, you will see these weights adapt or change. The final classifier can thus be used to make stronger decisions.
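A from-scratch Python sketch of these three steps might look like this (hedged: the video doesn't spell out the update formulas, so the standard discrete AdaBoost rules are assumed, with labels encoded as +1/-1 and decision stumps as the weak hypotheses):

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, n_rounds=50):
    n = len(y)
    sample_weights = np.full(n, 1.0 / n)        # Step 1: equal sample weights
    hypotheses, alphas = [], []
    for _ in range(n_rounds):
        # Step 2: train a weak hypothesis on the currently weighted samples.
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=sample_weights)
        pred = stump.predict(X)

        # Step 3: weighted training error, then the two weight updates.
        err = np.clip(np.sum(sample_weights[pred != y]), 1e-10, 1 - 1e-10)
        alpha = 0.5 * np.log((1 - err) / err)   # this hypothesis's "say"

        # Misclassified samples get heavier, correct ones lighter.
        sample_weights *= np.exp(-alpha * y * pred)
        sample_weights /= sample_weights.sum()

        hypotheses.append(stump)
        alphas.append(alpha)
    return hypotheses, alphas

def adaboost_predict(hypotheses, alphas, X):
    # Weighted majority vote: sign of the alpha-weighted sum of predictions.
    score = sum(a * h.predict(X) for h, a in zip(hypotheses, alphas))
    return np.sign(score)

scikit-learn's AdaBoostClassifier packages the same idea if you'd rather not write the loop yourself.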
Now, here's a case study on overfitting and underfitting and how AdaBoost deals with them. If we don't add enough hypotheses, our partially boosted model will still not be sophisticated enough to capture patterns in the training data, so boosted weak learning algorithms may be prone to underfitting. On the other hand, boosting of weak algorithms is relatively robust to overfitting, but only under the assumption that we're using a weak learning algorithm as the learner, like a single decision stump. If the hypotheses are weak, it's going to take a considerable number of iterations to add enough complexity to the final boosted model.

Now, another boosting technique that is even more popular than AdaBoost is gradient boosting. It's easier to understand by drawing parallels with AdaBoost. Boosting techniques like AdaBoost and gradient boosting are additive in nature: after every iteration we add a new hypothesis that strives to overcome the shortcomings of the previous hypotheses. AdaBoost constructs a new hypothesis by increasing the weights of the misclassified data samples, so the next hypothesis gets them right. Gradient boosting, on the other hand, uses gradients for everything. To find the data points to focus on, it computes the gradients of the overall hypothesis at each training data sample. This is done because gradients also measure how poorly a particular data point is classified: the worse a data point is classified, the higher its gradient, and the next hypothesis is going to focus more on that sample to make sure it gets it right. For AdaBoost, the shortcomings are indicated by highly weighted data points, while in gradient boosting the shortcomings are identified by high-gradient data points.

In AdaBoost, the weighting of data points is done with an exponential curve. This just means that if a sample is classified incorrectly by the overall hypothesis, then we weight it really heavily so that the next hypothesis really focuses on getting it right. But instead of a fixed exponential weighting scheme, what if we let the model determine the weighting in its own, more fluid way? After all, in gradient boosting everything seems to be done with gradients, so why not use those to drive the weighting? This allows for more flexibility, and AdaBoost is often seen as a specific case of gradient boosting with an exponential weighting scheme.
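To make the "fit the next hypothesis to the gradients" idea concrete, here is a minimal gradient boosting sketch for regression with squared error, where the negative gradient at each point is simply its residual (the regression setting and the squared-error loss are assumptions chosen to keep the example short; the same additive scheme applies to classification losses):

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boost_fit(X, y, n_rounds=100, learning_rate=0.1):
    base = float(np.mean(y))                  # initial hypothesis: a constant
    pred = np.full(len(y), base)
    trees = []
    for _ in range(n_rounds):
        # Negative gradient of 0.5 * (y - pred)^2 w.r.t. pred is the residual:
        # the worse a point is currently predicted, the larger its gradient,
        # so the next weak hypothesis focuses on exactly those points.
        residual = y - pred
        tree = DecisionTreeRegressor(max_depth=2).fit(X, residual)
        pred = pred + learning_rate * tree.predict(X)     # additive update
        trees.append(tree)
    return base, trees

def gradient_boost_predict(base, trees, X, learning_rate=0.1):
    return base + learning_rate * sum(t.predict(X) for t in trees)

Swapping in an exponential loss (and its gradients) recovers AdaBoost-like behavior, which is why AdaBoost is described above as a specific case of gradient boosting.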
Gradient boosting seems to be a clear winner, but as we use machine learning in real life, we have complex problems that require millions if not billions of training examples. Some caveats of gradient boosting are that computing the gradients can become quite slow, and, because of the many moving parts and learnable parameters, these models can also overfit. Both of these are, however, taken care of in the programmatic implementation XGBoost.

XGBoost is gradient boosting with some performance enhancements. Usually a dataset is in table format, where the rows are data samples and the columns are features. This takes too much space, so XGBoost stores data in a compressed sparse column (CSC) format instead. If we have a lot of data, which is usually the case, we divide the data into blocks; these blocks can run on separate machines for parallel compute. XGBoost also operates quickly on sorted columns: with the compressed sparse column format, we can sort every column in every block by value to optimize even further. Decision tree nodes are constructed in a breadth-first fashion, also in parallel. All of these optimizations together make XGBoost up to 10x faster than normal gradient boosting, and it's because of this that you see XGBoost everywhere, from product recommender systems to house price prediction to fraud detection, you name it (a minimal usage sketch follows at the end of these captions). A fun fact I read in the XGBoost paper was that 17 of the 29 winning solutions in the 2015 Kaggle competitions used either some form of gradient boosting combined with other methods, or gradient boosting alone. I encourage you to check out the Kaggle solutions and their blog posts; they are very well written and explain how and why these winners chose some variant of gradient boosting for their solutions.

Now that you have some context on boosting and its evolution, you can better appreciate these programming implementations, and for your future projects you'll know whether to apply these techniques or similar ones. In the description, I lay out all the research resources that I have referenced in this video as a timeline of events, from the inception of PAC learning that started the discussion of defining what problems are learnable, to the gradient boosting that we see in production systems today for high-scale and very complex problem solving. Hopefully, now you're better equipped to take on your newfound passion of horse race gambling to make ends meet in your plebeian household. Bye bye!
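As referenced above, here is a minimal usage sketch of the XGBoost library's scikit-learn-style interface (the dataset and parameter values are placeholders for illustration, not recommendations):

# pip install xgboost scikit-learn
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Gradient-boosted trees with a few common knobs:
#   n_estimators  - number of boosting rounds (weak hypotheses added)
#   max_depth     - depth of each weak tree
#   learning_rate - how much each new tree contributes
model = XGBClassifier(n_estimators=200, max_depth=3, learning_rate=0.1)
model.fit(X_train, y_train)
print("test accuracy:", model.score(X_test, y_test))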
Info
Channel: CodeEmporium
Views: 48,829
Keywords: Machine Learning, Deep Learning, Data Science, Artificial Intelligence, Neural Network, boosting, xgboost, gradient boost, adaboost, random forest, best machine learning algorithm
Id: MIPkK5ZAsms
Length: 17min 31sec (1051 seconds)
Published: Thu Dec 19 2019