XGBoost Part 3 (of 4): Mathematical Details

Video Statistics and Information

Captions
XGBoost math details. There's a lot of them, watch out! StatQuest. Hello, I'm Josh Starmer, and welcome to StatQuest. Today we're going to talk about XGBoost Part 3: Mathematical Details.

This StatQuest assumes that you already have a general idea of how XGBoost builds trees; if not, check out the Quests, the links are in the description below. This StatQuest also assumes that you know the details of how Gradient Boost works; if not, check out the Quests. Lastly, it assumes that you are familiar with Ridge Regression; if not, the link is in the description below.

In XGBoost Part 1 we saw how XGBoost builds XGBoost Trees for regression, and in XGBoost Part 2 we saw how XGBoost builds XGBoost Trees for classification. In both cases we built the trees using similarity scores and then calculated the output values for the leaves. Now we will derive the equations for the similarity scores and the output values, and show how the only difference between regression and classification is the loss function.

To keep the examples manageable, we'll start with a simple training dataset for regression and a simple training dataset for classification. For regression, we are using drug dosage on the x-axis to predict drug effectiveness on the y-axis. For classification, we are using drug dosage on the x-axis to predict the probability the drug will be effective. For both regression and classification, we already know that XGBoost starts with an initial prediction that is usually 0.5, and in both cases we can represent this prediction with a thick black line at 0.5. The residuals, the differences between the observed and predicted values, show us how good the initial prediction is.

Just like in regular, un-extreme Gradient Boost, we can quantify how good the prediction is with a loss function. In Gradient Boost Part 2, Regression Details, we learned how to use this loss function for regression: L(y_i, p_i) = 1/2 (y_i - p_i)^2, one half times the squared residual. To review, y_i stands for the y-axis value from one of the observed values, y_1, y_2, and y_3, and p_i stands for a prediction, p_1, p_2, or p_3, that corresponds to one of the observations y_1, y_2, and y_3. For example, if we applied the loss function to the initial prediction, we would add one term for each observation, so n = 3: the first term in the summation corresponds to the first observation, y_1, the second term corresponds to the second observation, y_2, and the third term corresponds to the third observation, y_3. So we just plug in the numbers, do the math, and get 104.4. Later we can apply the loss function to new predictions and compare the results to this one to determine whether our predictions are improving. Note: if we had n observations, we would add up n terms. No big deal.

In Gradient Boost Part 4, Classification Details, we learned how to use this loss function for classification, the negative log-likelihood: L(y_i, p_i) = -[y_i * log(p_i) + (1 - y_i) * log(1 - p_i)]. Just like before, y_i refers to a y-axis value for one of the observed values, which is either 0 or 1, and p_i refers to a predicted value between and including 0 and 1. Note: if you want more details and examples using this loss function, check out the StatQuest Gradient Boost Part 4, Classification Details.

Now that we have one loss function for regression and another loss function for classification, XGBoost uses those loss functions to build trees by minimizing this equation:

sum over i of L(y_i, p_i) + 1/2 * lambda * (Output Value)^2
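To make the two loss functions concrete, here is a minimal Python sketch of each one, plus the total loss for a single constant prediction. The three observed values are hypothetical stand-ins, since the actual drug-effectiveness numbers from the video (which total 104.4) are not listed in this transcript.

import math

def squared_error_loss(y, p):
    # Regression loss: one half times the squared residual.
    return 0.5 * (y - p) ** 2

def negative_log_likelihood(y, p):
    # Classification loss for an observed y in {0, 1} and a
    # predicted probability p strictly between 0 and 1.
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def total_loss(observed, prediction, loss):
    # Add up one loss term per observation, all sharing the same prediction.
    return sum(loss(y, prediction) for y in observed)

regression_data = [-10.5, 6.5, 7.5]   # hypothetical effectiveness values
classification_data = [0, 1, 1]       # hypothetical 0/1 outcomes
print(total_loss(regression_data, 0.5, squared_error_loss))
print(total_loss(classification_data, 0.5, negative_log_likelihood))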
Note: the equation in the original XGBoost manuscript contains an extra term that I'm omitting, gamma times T, where T is the number of terminal nodes, or leaves, in a tree, and gamma is a user-definable penalty meant to encourage pruning. I say that it encourages pruning because, as we saw in XGBoost Part 1, XGBoost can prune even when gamma equals 0. I'm omitting this term because, as we saw in Parts 1 and 2, pruning takes place after the full tree is built, and it plays no role in deriving the optimal output values or similarity scores.

So let's talk about this equation. The first part is the loss function, which we just talked about. The second part is a regularization term. The goal is to find an output value for the leaf that minimizes the whole equation, and, in a way that is very similar to Ridge Regression, we square the output value from the new tree and scale it with lambda. Later on I will show you that, just like Ridge Regression, if lambda is greater than 0, then we will shrink the output value. The 1/2 just makes the math easier. Note: because we are optimizing the output value from the first tree, we can replace the prediction p_i with the initial prediction p_0 plus the output value from the new tree.

Now that we understand all of the terms in this equation, let's use it to build the first tree. We start by putting all of the residuals into a single leaf, and now we need to find an output value for this leaf that will minimize this equation. But before we dive into the math, let's simplify things by setting lambda equal to zero, which means removing the regularization term. OK, now let's plug in different output values for the leaf and see what happens.

For example, if we set the output value equal to zero, then we are left with the loss function for the initial prediction, 0.5. Note: we already calculated the loss function for the initial prediction when we demonstrated the regression loss function, and we got 104.4. So let's move the equation up here and plot the result on this graph. The x-axis on this graph represents different values for the output value, and the y-axis represents the sum of the loss functions for each observed value.

Now let's see what happens if we set the output value equal to negative one. When the output value for the leaf is -1, the new prediction is 0.5 + (-1) = -0.5, and -0.5 corresponds to this new thick black line. That shrinks the residual for y_1, but it makes the residuals for both y_2 and y_3 larger. When we do the math, we get 109.4. Now we plot the point on the graph, and we see that -1 is a worse choice for the output value than 0, because it has a larger total loss.

In contrast, if we set the output value to positive one, then the new prediction is 0.5 + 1 = 1.5, and that makes the residual for y_1 larger, but the residuals for y_2 and y_3 smaller. Now when we do the math, we get 102.4, and we see that we have the lowest total so far, so +1 is the best output value we have picked so far. Note: we can keep plugging in numbers for the output value to see what we get, or we can just plot the function as a curve. The curve shows us that when lambda is zero, the optimal output value is at the bottom of the parabola, where the derivative is zero.

Now let's see what happens when we increase lambda to 4. When lambda = 4, the lowest point in the parabola shifts closer to 0, and if we increase lambda to 40, then the lowest point in the parabola shifts even closer to 0.
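Here is a rough sketch of the curve the video plots: the total regression loss plus the ridge-like penalty, evaluated over a grid of candidate output values, so you can watch the minimum slide toward zero as lambda grows. The observed values are, again, hypothetical placeholders rather than the video's data.

def objective(output_value, observed, p0=0.5, lam=0.0):
    # Sum of 1/2 * (y - (p0 + output_value))^2 over all observations,
    # plus the regularization term 1/2 * lambda * output_value^2.
    new_prediction = p0 + output_value
    loss = sum(0.5 * (y - new_prediction) ** 2 for y in observed)
    return loss + 0.5 * lam * output_value ** 2

observed = [-10.5, 6.5, 7.5]                 # hypothetical stand-in data
grid = [x / 10 for x in range(-50, 51)]      # candidate output values from -5.0 to 5.0

for lam in (0, 4, 40):
    best = min(grid, key=lambda o: objective(o, observed, lam=lam))
    print(f"lambda = {lam:2d} -> best output value on the grid: {best:+.1f}")

With those stand-in numbers, the grid minimum moves from about 0.7 toward 0 as lambda goes from 0 to 40, which matches the parabola shift described above.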
In other words, the more emphasis we give the regularization penalty by increasing lambda, the closer the optimal output value gets to zero, and this is exactly what regularization is supposed to do. So that's super cool. BAM!

Now, one last thing before we solve for the optimal output value. You may remember that when regular, un-extreme Gradient Boost found the optimal output value for a leaf, it solved an equation very similar to the first part, without regularization. Un-extreme Gradient Boost used two techniques to solve this equation: one for regression, because the math was easy, and a different one for classification, because the math was not easy. Specifically, for classification, un-extreme Gradient Boost used a second-order Taylor approximation to simplify the math when solving for the optimal output value. In contrast, XGBoost uses the second-order Taylor approximation for both regression and classification. Unfortunately, explaining Taylor series approximations is out of the scope of this StatQuest, so you'll just have to take my word for it that the loss function that includes the output value can be approximated by this mess of sums and derivatives:

L(y_i, p_0 + Output Value) ≈ L(y_i, p_0) + g_i * Output Value + 1/2 * h_i * (Output Value)^2

The genius of a Taylor approximation is that it's made of relatively simple parts: the first part is just the loss function for the previous prediction, g_i is the first derivative of that loss function, and h_i is the second derivative of that loss function. Note: since the derivative of a function is related to something called a gradient, XGBoost uses g to represent the first derivative of the loss function, and since the second derivative of a function is related to something called a Hessian, XGBoost uses h to represent the second derivative of the loss function.

Now let's expand the summation, add the regularization term, and plug in the second-order Taylor approximation for each loss function. Before we move on, let's remember that we want to find an output value that minimizes the loss function plus the regularization, and all we have done so far is approximate the equation we want to minimize with a second-order Taylor polynomial. Now, it's worth noting that the L(y_i, p_0) terms do not contain the output value, which means they have no effect on the optimal output value, so we can omit them from the optimization. All that remains are terms associated with the output value, so let's combine all of the unsquared output-value terms into a single term, combine all of the squared output-value terms into a single term, and move the formula up to give us some space to work:

(g_1 + g_2 + ... + g_n) * Output Value + 1/2 * (h_1 + h_2 + ... + h_n + lambda) * (Output Value)^2

Now let's do what we usually do when we want a value that minimizes a function: 1) take the derivative with respect to the output value, 2) set the derivative equal to zero, and 3) solve for the output value. The derivative of the first term with respect to the output value is just the sum of the g's. When we take the derivative of the second term with respect to the output value, the exponent 2 comes down and cancels out the 1/2, leaving us with the sum of the h's, plus lambda, times the output value. Now we set the derivative equal to zero and solve for the output value: we subtract the sum of the g's from both sides and divide both sides by the sum of the h's plus lambda:

Output Value = -(g_1 + g_2 + ... + g_n) / (h_1 + h_2 + ... + h_n + lambda)

Hooray! We have finally solved for the optimal output value for the leaf.
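As a quick sanity check on that closed-form result, this sketch computes -(sum of g's) / (sum of h's + lambda) for some made-up gradients and Hessians and compares it with a brute-force grid search over the part of the approximated objective that depends on the output value.

def optimal_output_value(gradients, hessians, lam):
    # Closed-form minimizer: -(g_1 + ... + g_n) / (h_1 + ... + h_n + lambda)
    return -sum(gradients) / (sum(hessians) + lam)

def quadratic_part(o, gradients, hessians, lam):
    # The terms of the approximated objective that depend on the output value:
    # (sum of g's) * o + 1/2 * (sum of h's + lambda) * o^2
    return sum(gradients) * o + 0.5 * (sum(hessians) + lam) * o ** 2

g = [11.0, -6.0, -7.0]   # made-up gradients (negative residuals)
h = [1.0, 1.0, 1.0]      # made-up Hessians
lam = 4.0

closed_form = optimal_output_value(g, h, lam)
grid = [x / 1000 for x in range(-5000, 5001)]
brute_force = min(grid, key=lambda o: quadratic_part(o, g, h, lam))
print(round(closed_form, 3), brute_force)   # the two should agree closely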
Now we need to plug in the gradients, the g's, and the Hessians, the h's, for the loss function, so let's move the equation out of the way so we have some room.

If we're using XGBoost for regression, then the most commonly used loss function is 1/2 (y_i - p_i)^2. Shameless self-promotion: this is the exact same loss function that we described in detail in Gradient Boost Part 2, Regression Details, so if the next few steps move too quickly, just check out the Quest; the link is in the description below. The first derivative, aka the gradient g_i, with respect to the predicted value p_i is -1 times the difference between the observed value y_i and the predicted value p_i. In other words, g_i is the negative residual. That means we plug in the negative residual for each g_i in the numerator, and the negative sign out front cancels out all of those negative signs, so the whole numerator is just the sum of the residuals. Now we need to figure out what the h's are in the denominator. The second derivative, aka the Hessian h_i, with respect to the predicted value p_i is the number 1, so we replace all of the h's in the denominator with the number 1. In other words, the denominator is the number of residuals plus lambda. So when we are using XGBoost for regression, this is the specific formula for the output value for a leaf:

Output Value = (sum of residuals) / (number of residuals + lambda)

To summarize what we've done so far: we started out with this data, then we made an initial prediction, 0.5, then we put all of the residuals in this leaf, and then we asked ourselves what the output value of this leaf should be, given a loss function and regularization. We then plotted the equation as a function of the output value and solved for the lowest point, where the derivative is zero, and this is what we got: the equation for the output value gives us the x-axis coordinate for the lowest point in the parabola. BAM!

Now, if we are using XGBoost for classification, then the negative log-likelihood is the most commonly used loss function. Shameless self-promotion: this is the exact same loss function that we worked with in Gradient Boost Part 4, Classification Details. In that StatQuest we spent a long time deriving the first and second derivatives of this equation. Calculating the derivatives took a long time because the output values are in terms of the log(odds), so we converted the probabilities to log(odds) one step at a time, rather than skipping the fun parts like we are now; then we took the derivatives without skipping the fun parts like we're doing here; and lastly, we converted the log(odds) back to probabilities without skipping any steps in the math. Note: if you feel like you just missed out on a lot of fun stuff, fear not, the link to Gradient Boost Part 4, Classification Details is in the description below.

Now that we have the first derivative of the loss function, aka the gradient g_i, and the second derivative of the loss function, aka the Hessian h_i, we can plug them into the equation for the optimal output value. Just like for regression, g_i is the negative residual, and we know that the negative signs cancel out in the numerator, so we can replace the numerator for the output value with the sum of the residuals. In the denominator, we can replace the sum of the h_i's with the sum of p_i * (1 - p_i). Note: in the denominator we're using the previously predicted probability rather than the previously predicted log(odds). So when we are using XGBoost for classification, this is the specific formula for the output value for a leaf:

Output Value = (sum of residuals) / (sum of [previous probability_i * (1 - previous probability_i)] + lambda)

BAM! Now, regardless of whether we are using XGBoost for regression or classification, we can calculate the optimal output value for this leaf, and we do it by plugging the derivatives of the loss function into the equation for the output value. BAM!
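Putting those two special cases into code, here is a small sketch of the leaf output-value formulas, assuming you already have the residuals (and, for classification, the previously predicted probabilities) for the rows in the leaf. The example leaf values are hypothetical.

def regression_output_value(residuals, lam):
    # Output value = (sum of residuals) / (number of residuals + lambda)
    return sum(residuals) / (len(residuals) + lam)

def classification_output_value(residuals, prev_probs, lam):
    # Output value = (sum of residuals) /
    #                (sum of p_i * (1 - p_i) over the previous probabilities + lambda)
    return sum(residuals) / (sum(p * (1 - p) for p in prev_probs) + lam)

# Hypothetical leaf with three residuals and, for classification,
# three previously predicted probabilities of 0.5 each.
print(regression_output_value([-11.0, 6.0, 7.0], lam=1.0))
print(classification_output_value([-0.5, 0.5, 0.5], [0.5, 0.5, 0.5], lam=1.0))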
Now we need to derive the equations for the similarity score so we can grow the tree. However, before we do that, remember that we derived the equation for the output value by minimizing the sum of the loss functions plus the regularization, and let's also remember that, depending on the loss function, optimizing that first part can be hard, so we approximated it with a second-order Taylor polynomial. So we expanded the summation, added the regularization term, swapped in the second-order Taylor approximation of the loss function, then removed the constant terms, and lastly did a little algebra to simplify everything. Now, because we removed constants when deriving this equation, it's not equal to what we started with. However, if we plotted both equations on a graph, we'd see that the same x-axis coordinate, the optimal output value, tells us the location of the lowest point in both parabolas. I mention this because XGBoost uses the simplified equation to determine the similarity score.

So the first thing XGBoost does is multiply everything by -1, which makes each term negative and flips the parabola over the horizontal line y = 0. Now the optimal output value represents the x-axis coordinate for the highest point on the parabola, and the y-axis coordinate for the highest point on the parabola is the similarity score. At least, it's the similarity score described in the original XGBoost manuscript; the similarity score used in the implementations is actually two times that number. The reason for this difference will become clear once we do the algebra, so let's do the algebra to convert this into the similarity scores we saw in XGBoost Parts 1 and 2.

First, let's plug in the solution for the output value. Now multiply together the sums of the gradients, the g's, on the left; note that the negative signs cancel out and we get the square of the sum. Now we square the term on the right; that sum cancels out the square. Now we add the two terms together and we end up with this fraction:

Similarity Score = 1/2 * (g_1 + g_2 + ... + g_n)^2 / (h_1 + h_2 + ... + h_n + lambda)

This is the equation for the similarity score as described in the original XGBoost manuscript. However, in the XGBoost implementations, the 1/2 is omitted, because the similarity score is only a relative measure, and as long as every similarity score is scaled by the same amount, the results of the comparisons will be the same. This is an example of how extreme Extreme Gradient Boost is: it will do anything to reduce the amount of computation.

Now, if we're using XGBoost for regression with this loss function, we plug the first derivative, g_i, into the numerator, and since g_i is the negative residual, the numerator is simply the sum of the residuals, squared. Then we plug the second derivative, h_i, into the denominator, and since h_i equals 1, the denominator is just the number of residuals plus lambda. Thus, this is the similarity score that we use for regression:

Similarity Score = (sum of residuals)^2 / (number of residuals + lambda)

Double BAM! Now we need to derive the similarity score for classification, and that just means plugging in the first derivative of the loss function, g_i, and the second derivative, h_i. Again, since g_i is the negative residual, the numerator is simply the sum of the residuals, squared, and since h_i is the previously predicted probability times 1 minus the previously predicted probability, the denominator is just the sum of the h_i's plus lambda. Thus, this is the similarity score that we use for classification:

Similarity Score = (sum of residuals)^2 / (sum of [previous probability_i * (1 - previous probability_i)] + lambda)

Triple BAM!
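Mirroring the output-value sketch above, here are the two similarity-score formulas as small Python helpers; the leaf data are hypothetical.

def regression_similarity(residuals, lam):
    # Similarity score = (sum of residuals)^2 / (number of residuals + lambda)
    return sum(residuals) ** 2 / (len(residuals) + lam)

def classification_similarity(residuals, prev_probs, lam):
    # Similarity score = (sum of residuals)^2 /
    #                    (sum of p_i * (1 - p_i) + lambda)
    return sum(residuals) ** 2 / (sum(p * (1 - p) for p in prev_probs) + lam)

# Hypothetical leaves, using lambda = 0 so the scores match the un-regularized case.
print(regression_similarity([-11.0, 6.0, 7.0], lam=0.0))
print(classification_similarity([-0.5, 0.5, 0.5], [0.5, 0.5, 0.5], lam=0.0))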
Now, for one little annoying detail. In Part 2 of this series we talked about cover. Cover is related to the minimum number of residuals in a leaf, and we said that cover was the denominator of the similarity score, minus lambda. In other words, cover is the sum of the Hessians, the h_i's. For regression, the Hessian, aka the second derivative of the loss function, is 1, and since there is one Hessian per residual in a leaf, cover for regression is simply the number of residuals in the leaf. For classification, the Hessian is p * (1 - p), so cover is equal to the sum of the previously predicted probability times 1 minus the previously predicted probability (see the short sketch after this transcript). Small BAM!

In summary, XGBoost builds trees by finding the output value that minimizes an equation made up of a loss function and a regularization term that is just like Ridge Regression. We then solve for the optimal output value, and once we have the output value, we plug it into the simplified equation to get the similarity score. We then configure the output-value and similarity-score equations for regression or classification by plugging in the first derivative, g_i, and the second derivative, h_i, of the loss function. BAM!

[Music] Hooray! We've made it to the end of another exciting StatQuest. If you like this StatQuest and want to see more, please subscribe. And if you want to support StatQuest, consider contributing to my Patreon campaign, becoming a channel member, buying one or two of my original songs or a t-shirt or a hoodie, or just donate; the links are in the description below. Alright, until next time, Quest on!
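And for that last detail about cover, a minimal sketch: cover is just the sum of the Hessians in a leaf, which collapses to a count for regression and to a sum of p * (1 - p) terms for classification. The example leaf values are hypothetical.

def cover(hessians):
    # Cover = denominator of the similarity score minus lambda = sum of the Hessians.
    return sum(hessians)

# Regression: every Hessian is 1, so cover is just the number of residuals in the leaf.
print(cover([1, 1, 1]))                          # -> 3

# Classification: each Hessian is p * (1 - p) for the previously predicted probability.
prev_probs = [0.5, 0.5, 0.5]                     # hypothetical previous probabilities
print(cover([p * (1 - p) for p in prev_probs]))  # -> 0.75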
Info
Channel: StatQuest with Josh Starmer
Views: 48,034
Rating: 4.962461 out of 5
Keywords: Josh Starmer, StatQuest, Machine Learning, Statistics, Data Science, XGBoost, Regression, Classification
Id: ZVFeW798-2I
Length: 27min 24sec (1644 seconds)
Published: Sun Feb 09 2020