XGBoost Part 1 (of 4): Regression

Video Statistics and Information

Reddit Comments

Josh Starmer is the best thing that's happened to communicating statistics in an understandable way.

15 points · u/[deleted] · Dec 16 2019 · replies

BAAAAAAAM !!!

9 points · u/[deleted] · Dec 17 2019 · replies

The way Josh Starmer brings statistic terms to us is BAMMM !!!

6 points · u/hoangHEDSPi · Dec 17 2019 · replies

StatQuest rocks!

1 point · u/ppasanen · Dec 17 2019 · replies
Captions
XGBoost! It's eXtreme, and it's Gradient Boost! StatQuest! Hello, I'm Josh Starmer, and welcome to StatQuest. Today we're going to talk about XGBoost, Part 1: we're going to talk about XGBoost trees and how they're used for regression. Note: this StatQuest assumes that you are already familiar with at least the main ideas of how Gradient Boost does regression, and you should be familiar with at least the main ideas behind regularization. If not, check out the Quests; the links are in the description below.

XGBoost is eXtreme, and that means it's a big machine learning algorithm with lots of parts. The good news is that each part is pretty simple and easy to understand, and we'll go through them one step at a time. Actually, I'm assuming that you are already familiar with Gradient Boost and regularization, so we'll start by learning about XGBoost's unique regression trees. Because this is a big topic, we'll spend three whole StatQuests on it. In this StatQuest, Part 1, we'll build our intuition about how XGBoost does regression with its unique trees. In Part 2, we'll build our intuition about how XGBoost does classification, and in Part 3 we'll dive into the mathematical details and show how regression and classification are related and why creating unique trees makes so much sense.

Note: XGBoost was designed to be used with large, complicated datasets. However, to keep the examples from getting out of hand, we'll use this super simple training data. On the x-axis we have different drug dosages, and on the y-axis we've measured drug effectiveness. These two observations have relatively large positive values for drug effectiveness, and that means the drug was helpful. These two observations have relatively large negative values for drug effectiveness, and that means the drug did more harm than good.

The very first step in fitting XGBoost to the training data is to make an initial prediction. This prediction can be anything, but by default it is 0.5, regardless of whether you're using XGBoost for regression or classification. The prediction 0.5 corresponds to this thick black horizontal line, and the residuals, the differences between the observed and predicted values, show us how good the initial prediction is.

Now, just like plain, un-extreme Gradient Boost, XGBoost fits a regression tree to the residuals. However, unlike un-extreme Gradient Boost, which typically uses regular, off-the-shelf regression trees, XGBoost uses a unique regression tree that I call an XGBoost tree. So let's talk about how to build an XGBoost tree for regression. Note: there are many ways to build XGBoost trees; this video focuses on the most common way to build them for regression.

Each tree starts out as a single leaf, and all of the residuals go to the leaf. Now we calculate a quality score, or Similarity Score, for the residuals: Similarity Score = (sum of residuals)² / (number of residuals + lambda). Note: lambda is a regularization parameter, and we'll talk more about that later; for now, let lambda equal zero. Now we plug the four residuals into the numerator, and since there are four residuals in the leaf, we put a 4 in the denominator. Note: because we do not square the residuals before we add them together in the numerator, 7.5 and -7.5 cancel each other out. In other words, when we add this residual to this residual, they cancel each other out. Likewise, 6.5 cancels out most of -10.5, leaving (-4)² in the numerator. Thus, the Similarity Score for the residuals in the root equals 4, so let's put Similarity = 4 up here so we can keep track of it.
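To make the arithmetic concrete, here is a minimal Python sketch of the Similarity Score calculation, using the four residuals from this example (the residual values are read off the video's plot, so treat them as illustrative):

    def similarity_score(residuals, lam=0.0):
        # Similarity Score = (sum of residuals)^2 / (number of residuals + lambda)
        return sum(residuals) ** 2 / (len(residuals) + lam)

    # Residuals of the four observations relative to the initial prediction of 0.5
    residuals = [-10.5, 6.5, 7.5, -7.5]

    # The positives and negatives largely cancel, so the root's score is small
    print(similarity_score(residuals, lam=0.0))  # 4.0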
Now the question is whether or not we can do a better job clustering similar residuals if we split them into two groups. To answer this, we first focus on the two observations with the lowest dosages. Their average dosage is 15, and that corresponds to this dotted red line, so we split the observations into two groups based on whether or not the Dosage < 15. The observation on the far left is the only one with a dosage less than 15, so its residual goes to the leaf on the left; all of the other residuals go to the leaf on the right.

Now we calculate the Similarity Score for the leaf on the left by plugging the one residual into the numerator, and since only one residual went to the leaf on the left, the number of residuals equals 1. Like before, we set lambda equal to zero, and the Similarity Score for the leaf on the left equals 110.25. So let's put Similarity = 110.25 under the leaf so we can keep track of it. To calculate the Similarity Score for the residuals that go to the leaf on the right, we plug the sum of the residuals, squared, into the numerator, and since there are three residuals in the leaf on the right, we plug 3 into the denominator. Like before, let's let lambda equal zero. Note, like we saw earlier: because we do not square the residuals before we add them together, 7.5 and -7.5 cancel each other out, leaving only one residual, 6.5, in the numerator. Thus, the Similarity Score for the residuals in the leaf on the right equals 14.08, so let's put Similarity = 14.08 under the leaf so we can keep track of it.

Now that we have calculated Similarity Scores for each node, we see that when the residuals in a node are very different, they cancel each other out and the Similarity Score is relatively small. In contrast, when the residuals are similar, or there is just one of them, they do not cancel out and the Similarity Score is relatively large.

Now we need to quantify how much better the leaves cluster similar residuals than the root. We do this by calculating the Gain of splitting the residuals into two groups: Gain = Similarity(left leaf) + Similarity(right leaf) - Similarity(root). Plugging in the numbers gives us 120.33. Small BAM!

Now that we have calculated the Gain for the threshold Dosage < 15, we can compare it to the Gain calculated for other thresholds. So we shift the threshold over so that it is the average of the next two observations, and build a simple tree that divides the observations using the new threshold, Dosage < 22.5. Now we calculate the Similarity Scores for the leaves and calculate the Gain: the Gain for Dosage < 22.5 is 4. Since the Gain for Dosage < 22.5 is less than the Gain for Dosage < 15, Dosage < 15 is better at splitting the residuals into clusters of similar values. Now we shift the threshold over so that it is the average of the last two observations, and build a simple tree that divides the observations using the new threshold, Dosage < 30. Then we calculate the Similarity Scores for the leaves and the Gain: the Gain for Dosage < 30 equals 56.33. Again, since the Gain for Dosage < 30 is less than the Gain for Dosage < 15, Dosage < 15 is better at splitting the observations. And since we can't shift the threshold over any further to the right, we are done comparing different thresholds, and we will use the threshold that gave us the largest Gain, Dosage < 15, for the first branch in the tree. BAM!
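Here is a short Python sketch of this threshold search. The similarity_score helper is repeated so the snippet stands alone, and the dosage values 10, 20, 25, and 35 are my reading of the example plot, so treat them as illustrative:

    def similarity_score(residuals, lam=0.0):
        # Similarity Score = (sum of residuals)^2 / (number of residuals + lambda)
        return sum(residuals) ** 2 / (len(residuals) + lam)

    def gain(left, right, lam=0.0):
        # Gain = Similarity(left leaf) + Similarity(right leaf) - Similarity(parent)
        parent = left + right
        return (similarity_score(left, lam) + similarity_score(right, lam)
                - similarity_score(parent, lam))

    dosages   = [10.0, 20.0, 25.0, 35.0]   # illustrative x-values
    residuals = [-10.5, 6.5, 7.5, -7.5]    # residuals against the initial prediction of 0.5

    # Candidate thresholds are the midpoints between adjacent dosages: 15, 22.5, 30
    for i in range(1, len(dosages)):
        threshold = (dosages[i - 1] + dosages[i]) / 2
        left  = [r for d, r in zip(dosages, residuals) if d < threshold]
        right = [r for d, r in zip(dosages, residuals) if d >= threshold]
        print(threshold, round(gain(left, right), 2))

    # Prints 15.0 120.33, 22.5 4.0, 30.0 56.33 -> Dosage < 15 gives the largest Gain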
Now, since there is only one residual in the leaf on the left, we can't split it any further. However, we can split the three residuals in the leaf on the right. So we start with these two observations; their average dosage is 22.5, which corresponds to this dotted green line, so the first threshold that we try is Dosage < 22.5. Now, just like before, we calculate the Similarity Scores for the leaves. Note: we already calculated the Similarity Score for this node when we figured out how to split the root. So now we calculate the Gain, and we get Gain = 28.17 when the threshold is Dosage < 22.5. Now we shift the threshold over so that it is the average of the last two observations, calculate the Similarity Scores for the leaves and the Gain, and we get Gain = 140.17, which is much larger than the 28.17 we got when the threshold was Dosage < 22.5. So we will use Dosage < 30 as the threshold for this branch.

Note: to keep this example from getting out of hand, I've limited the tree depth to two levels, and this means we will not split this leaf any further, and we are done building this tree. However, the default is to allow up to six levels. Small BAM!

Now we need to talk about how to prune this tree. We prune an XGBoost tree based on its Gain values. We start by picking a number, for example 130. Oh no, it's the dreaded Terminology Alert! XGBoost calls this number gamma. We then calculate the difference between the Gain associated with the lowest branch in the tree and the value for gamma. If the difference between the Gain and gamma is negative, we will remove the branch, and if the difference between the Gain and gamma is positive, we will not remove the branch. In this case, when we plug in the Gain and the value for gamma, 130, we get a positive number, so we will not remove this branch, and we are done pruning. Note: the Gain for the root, 120.33, is less than 130, the value for gamma, so the difference would be negative. However, because we did not remove the first branch, we will not remove the root.

In contrast, if we set gamma equal to 150, then we would remove this branch, because 140.17 - 150 equals a negative number. So let's remove this branch. Now we subtract gamma from the Gain for the root. Since 120.33 - 150 equals a negative number, we will remove the root, and all we would be left with is the original prediction, which is pretty extreme pruning. So while this wasn't the most nuanced example of how an XGBoost tree is pruned, I hope you get the idea.
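A tiny Python sketch of this bottom-up pruning rule, using the Gain values from this example (the function name and the flat list of gains are just for illustration):

    def prune_bottom_up(branch_gains, gamma):
        # branch_gains is ordered from the lowest branch up to the root.
        # Prune while gain - gamma is negative; the first branch we keep also
        # protects every branch above it, so we stop there.
        kept = list(branch_gains)
        for g in branch_gains:
            if g - gamma < 0:
                kept.remove(g)   # prune this branch
            else:
                break            # keep this branch and everything above it
        return kept

    gains = [140.17, 120.33]     # lowest branch first, then the root

    print(prune_bottom_up(gains, gamma=130))  # [140.17, 120.33] -> nothing is pruned
    print(prune_bottom_up(gains, gamma=150))  # [] -> the whole tree is pruned away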
Now let's go back to the original residuals and build a tree just like before, only this time, when we calculate Similarity Scores, we will set lambda equal to 1. Remember, lambda is a regularization parameter, which means that it is intended to reduce the prediction's sensitivity to individual observations. Now the Similarity Score for the root is 3.2, which is 8/10 of what we got when lambda equaled 0. When we calculate the Similarity Score for the leaf on the left, we get 55.12, which is half of what we got when lambda equaled zero, and when we calculate the Similarity Score for the leaf on the right, we get 10.56, which is three-quarters of what we got when lambda equaled zero. So one thing we see is that when lambda is greater than zero, the Similarity Scores are smaller, and the amount of decrease is inversely proportional to the number of residuals in the node. In other words, the leaf on the left had only one residual, and it had the largest decrease in Similarity Score, 50%. In contrast, the root had all four residuals and the smallest decrease, 20%.

Now when we calculate the Gain, we get 62.48, which is a lot less than 120.33, the value we got when lambda equaled 0. Similarly, when lambda equals 1, the Gain for the next branch is smaller than before. Now, just for comparison, these were the Gain values when lambda equaled zero. When we first talked about pruning trees, we set gamma equal to 130, and because, for the lowest branch in the first tree, Gain minus gamma equaled a positive number, we did not prune at all. Now, with lambda equal to 1, the values for Gain are both less than 130, so we would prune the whole tree away. So when lambda is greater than zero, it is easier to prune leaves, because the values for Gain are smaller.

Note: before we move on, I want to illustrate one last feature of lambda. For this example, imagine we split this node into two leaves. Now let's calculate the Similarity Scores with lambda equal to 1. For the branch we get 65.33, for the left leaf we get 21.12, and for the right leaf we get 28.12. That means the Gain is -16.08. Now, when we decide if we should prune this branch, we plug in the Gain and we plug in a value for gamma. Note: if we set gamma equal to zero, then we will get a negative number, and we will prune this branch even though gamma equals zero. In other words, setting gamma equal to zero does not turn off pruning entirely. On the other hand, by setting lambda equal to 1, lambda did what it was supposed to do: it prevented overfitting the training data. Awesome!

For now, regardless of lambda and gamma, let's assume that this is the tree we are working with and determine the Output Values for the leaves: Output Value = (sum of residuals) / (number of residuals + lambda). Note: the Output Value equation is like the Similarity Score, except we do not square the sum of the residuals. So for this leaf we plug in the residual, -10.5, the number of residuals in the leaf, 1, and the value for the regularization parameter, lambda. If lambda equals zero, then there is no regularization, and the Output Value equals -10.5. On the other hand, if lambda equals 1, the Output Value equals -5.25. In other words, when lambda is greater than zero, it will reduce the amount that this individual observation adds to the overall prediction. Thus lambda, the regularization parameter, will reduce the prediction's sensitivity to this individual observation.

For now we'll keep things simple and let lambda equal zero, because this is the default value, and put -10.5 under the leaf so we will remember it. Now let's calculate the Output Value for this leaf. When lambda equals zero, the Output Value is 7. In other words, when lambda equals zero, the Output Value for a leaf is simply the average of the residuals in that leaf, so we'll put the Output Value under the leaf so we will remember it. Lastly, when lambda equals zero, the Output Value for this leaf is -7.5. Now, at long last, the first tree is complete. DOUBLE BAM!
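The Output Value calculation is a one-liner. Here is a minimal Python sketch, using the leaves from this tree (the residual values are again taken from the example, so treat them as illustrative):

    def output_value(residuals, lam=0.0):
        # Output Value = sum of residuals / (number of residuals + lambda)
        # (like the Similarity Score, but the sum of residuals is not squared)
        return sum(residuals) / (len(residuals) + lam)

    print(output_value([-10.5], lam=0.0))    # -10.5  (no regularization)
    print(output_value([-10.5], lam=1.0))    # -5.25  (lambda shrinks this leaf's contribution)

    # With lambda = 0, the Output Value is just the average of the residuals in the leaf
    print(output_value([6.5, 7.5], lam=0.0))  # 7.0
    print(output_value([-7.5], lam=0.0))      # -7.5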
Since we have built our first tree, we can make new predictions. And, just like plain, un-extreme Gradient Boost, XGBoost makes new predictions by starting with the initial prediction and adding the output of the tree, scaled by a learning rate. Oh no, it's another dreaded Terminology Alert! XGBoost calls the learning rate eta, and the default value is 0.3, so that's what we'll use.

Thus, the new predicted value for this observation, with Dosage = 10, is the original prediction, 0.5, plus the learning rate, eta, 0.3, times the Output Value, -10.5, and that gives us -2.65. So if the original prediction was 0.5, then this was the original residual; the new prediction is -2.65, and we see that the new residual is smaller than before, so we've taken a small step in the right direction. Similarly, the new prediction for this observation, with Dosage = 20, is 2.6, and the new residual is smaller than before, so we've taken another small step in the right direction. Likewise, the new predictions for the remaining observations have smaller residuals than before, suggesting each small step was in the right direction. BAM!

Now we build another tree based on the new residuals and make new predictions that give us even smaller residuals, and then build another tree based on the newest residuals, and we keep building trees until the residuals are super small or we have reached the maximum number of trees. TRIPLE BAM!

In summary, when building XGBoost trees for regression, we calculate Similarity Scores and Gain to determine how to split the data, and we prune the tree by calculating the differences between the Gain values and a user-defined tree complexity parameter, gamma. If the difference is positive, then we do not prune; if it is negative, then we prune. For example, if we subtract gamma from this Gain and get a negative value, we will prune; otherwise we're done. If we prune, then we subtract gamma from the next Gain value and work our way up the tree. Then we calculate the Output Values for the remaining leaves. And lastly, lambda is a regularization parameter, and when lambda is greater than zero, it results in more pruning, by shrinking the Similarity Scores, and it results in smaller Output Values for the leaves. BAM!

Tune in next time for XGBoost Part 2, when we give an overview of how XGBoost trees are built for classification. It's going to be totally awesome! Hooray, we've made it to the end of another exciting StatQuest. If you liked this StatQuest and want to see more, please subscribe. And if you want to support StatQuest, consider contributing to my Patreon campaign, becoming a channel member, buying one or two of my original songs, or a t-shirt or a hoodie, or just donate; the links are in the description below. All right, until next time: Quest on!
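For readers who want to connect the ideas in this Quest to the xgboost library itself, here is a hedged sketch of how the knobs discussed in the video map onto named parameters of the Python API. The toy dosage and effectiveness values only mimic the video's example, and the parameter values are chosen for illustration; check the xgboost documentation for the exact defaults in your version:

    import numpy as np
    import xgboost as xgb   # pip install xgboost

    # Toy data in the spirit of the video's example: dosage vs. drug effectiveness
    X = np.array([[10.0], [20.0], [25.0], [35.0]])
    y = np.array([-10.0, 7.0, 8.0, -7.0])

    model = xgb.XGBRegressor(
        n_estimators=10,     # keep building trees, up to this maximum number
        learning_rate=0.3,   # eta, the learning rate that scales each tree's output
        reg_lambda=1.0,      # lambda, the regularization parameter
        gamma=0.0,           # the tree complexity parameter used for pruning
        max_depth=6,         # trees are allowed up to six levels by default
        base_score=0.5,      # the initial prediction
    )
    model.fit(X, y)
    print(model.predict(X))  # predictions creep toward the observed values, tree by tree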
Info
Channel: StatQuest with Josh Starmer
Views: 230,434
Rating: 4.9311585 out of 5
Keywords: StatQuest, Josh Starmer, XGBoost, Regression, Machine Learning, Statistics, Data Science, Regression Tree
Id: OtD8wVaFm6E
Length: 25min 46sec (1546 seconds)
Published: Mon Dec 16 2019