Gradient Boost Part 3 (of 4): Classification

Video Statistics and Information

Captions
Last night I had a dream about Gradient Boost, and it was crazy. I was using it to classify things, and my memory is clear and not hazy. StatQuest!

Hello, I'm Josh Starmer, and welcome to StatQuest. Today we're going to do Part 3 in our series on Gradient Boost. This time we'll focus on the main ideas of how Gradient Boost can be used for classification.

Note: this StatQuest assumes you have already watched Gradient Boost Part 1: Regression Main Ideas. If not, check out the Quest. In addition, when Gradient Boost is used for classification, it has a lot in common with logistic regression, so if you're not already familiar with logistic regression, check out the Quests.

In this StatQuest we will use this training data, where we have collected popcorn preference, age, and favorite color from six people, along with whether or not they loved the movie Troll 2, and walk through, step by step, the most common way that Gradient Boost fits a model to this training data.

Just like in Part 1 of this series, we start with a leaf that represents an initial prediction for every individual. When we use Gradient Boost for classification, the initial prediction for every individual is the log of the odds. I like to think of the log of the odds as the logistic regression equivalent of the average. So let's calculate the overall log of the odds that someone loves Troll 2. Since four people in the training dataset love Troll 2 and two people do not, the log of the odds that someone loves Troll 2 is the log of 4 divided by 2, which equals 0.7, which we will put into our initial leaf.

So this is the initial prediction. How do we use it for classification? Like with logistic regression, the easiest way to use the log of the odds for classification is to convert it to a probability, and we do that with the logistic function: the probability of loving Troll 2 equals e^(log(odds)) divided by (1 + e^(log(odds))). So we plug the log of the odds into the logistic function, do the math, and we get 0.7 as the probability of loving Troll 2, and let's save that up here for now.

Note: these two numbers, the log of 4 divided by 2 and the probability, are the same only because I'm rounding. If I allowed four digits past the decimal place, then the log of 4 divided by 2 would equal 0.6931 and the probability would equal 0.6667.

Since the probability of loving Troll 2 is greater than 0.5, we can classify everyone in the training dataset as someone who loves Troll 2. Note: while 0.5 is a very common threshold for making classification decisions based on probability, we could have just as easily used a different value. For more details, check out the StatQuest ROC and AUC, Clearly Explained.
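As a quick check of those two numbers, here is a minimal Python sketch. It is not code from the video; the only thing taken from the example is that four of the six people love Troll 2 and two do not (the order of the labels is arbitrary).

```python
import numpy as np

# 1 = loves Troll 2, 0 = does not (four lovers, two non-lovers, as in the example)
y = np.array([1, 1, 0, 0, 1, 1])

# Initial prediction: the log of the odds, log(4 / 2) ~= 0.6931
log_odds = np.log(y.sum() / (len(y) - y.sum()))

# Convert the log(odds) to a probability with the logistic function
probability = np.exp(log_odds) / (1 + np.exp(log_odds))   # ~= 0.6667, or 0.7 rounded

print(round(log_odds, 4), round(probability, 4))
```

Since both numbers depend only on the counts of lovers and non-lovers, the ordering of the labels does not change anything here.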
Now, classifying everyone in the training dataset as someone who loves Troll 2 is pretty lame, because two of the people do not love the movie. We can measure how bad the initial prediction is by calculating pseudo residuals, the difference between the observed and the predicted values. Although the math is easy, I think it's easier to grasp what's going on if we draw the residuals on a graph. The y-axis is the probability of loving Troll 2, and the predicted probability of loving Troll 2 is 0.7. The red dots, with the probability of loving Troll 2 equal to 0, represent the two people that do not love Troll 2, and the blue dots, with the probability of loving Troll 2 equal to 1, represent the four people that love Troll 2. In other words, the red and blue dots are the observed values, and the dotted line is the predicted value.

So for this sample we plug in 1 for the observed value and 0.7 for the predicted value, and we get 0.3, and we save the residual in a new column. Then we calculate the rest of the residuals. Hooray! We've calculated the residuals for the leaf's initial prediction.

Now we build a tree, using Likes Popcorn, Age, and Favorite Color to predict the residuals, and here's the tree. Note: just like when we used Gradient Boost for regression, we are limiting the number of leaves that we will allow in the tree. In this simple example we are limiting the number of leaves to 3. In practice, people often set the maximum number of leaves to be between 8 and 32.

Now let's calculate the output values for the leaves. Note: these three rows of data go to the same leaf, these two rows of data go to the same leaf, and lastly, this row of data goes to its own leaf. When we used Gradient Boost for regression, a leaf with a single residual had an output value equal to that residual. In contrast, when we use Gradient Boost for classification, the situation is a little more complex. This is because the predictions are in terms of the log of the odds, and this leaf is derived from a probability, so we can't just add them together and get a new log of the odds prediction without some sort of transformation. When we use Gradient Boost for classification, the most common transformation is the following formula: the numerator is the sum of all the residuals in the leaf, and the denominator is the sum, over each residual in the leaf, of the previously predicted probability times (1 minus that same predicted probability). Note: the derivation of this formula is quite technical, so I'm saving it for Part 4 of this series, when we get into the nitty-gritty details of Gradient Boost for classification. For now, let's just use the formula to calculate the output value for this leaf.

Since there is only one residual in this leaf, we can ignore the summation signs for now. So we plug in the residual from the leaf, and, since we are building the first tree, the previous probability refers to the probability from the initial leaf, so we plug that in, do the math, and we end up with -3.3 as the new output value for this leaf.

Now we need to calculate the output value for this leaf. Since we have two residuals in the leaf, we add them together in the numerator. In the denominator, we add up the previous probability times (1 minus the previous probability) for each residual. So we plug in the previous probability for each residual. Note: for now, the previous probabilities are the same for all of the residuals, but this will change when we build the next tree. Now do the math, and the output value for this leaf is -1.

Now let's determine the output value for this leaf. We plug the residuals into the formula, along with the previous probabilities, do the math, and the output value for this leaf is 1.4. Hooray! We've calculated the output values for all three leaves in the tree.
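Here is a minimal sketch of that leaf output formula, plugging in the rounded numbers used above (previous probability 0.7, residuals of 0.3 for lovers and -0.7 for non-lovers) to reproduce the three output values. The helper function name is just for illustration.

```python
import numpy as np

def leaf_output(residuals, prev_probs):
    """Output value for one leaf: sum of residuals / sum of p * (1 - p)."""
    residuals, prev_probs = np.asarray(residuals), np.asarray(prev_probs)
    return residuals.sum() / (prev_probs * (1 - prev_probs)).sum()

p = 0.7  # previous predicted probability, from the initial leaf (rounded)

print(leaf_output([-0.7], [p]))                 # ~ -3.3
print(leaf_output([0.3, -0.7], [p, p]))         # ~ -1   (about -0.95 before rounding)
print(leaf_output([0.3, 0.3, 0.3], [p, p, p]))  # ~ 1.4  (about 1.43 before rounding)
```

Any small differences from -3.3, -1, and 1.4 are just rounding.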
Now we are ready to update our predictions by combining the initial leaf with the new tree. Note: just like before, the new tree is scaled by a learning rate. This example uses a relatively large learning rate, 0.8, for illustrative purposes; however, 0.1 is more common.

Now let's calculate the log of the odds prediction for this person. The log of the odds prediction is the previous prediction, 0.7, plus the output value from the tree scaled by the learning rate, 0.8 times 1.4, and the new log of the odds prediction equals 1.8. Now we can convert the new log of the odds prediction into a probability, and the new predicted probability equals 0.9. So we are taking a small step in the right direction, since this person loves Troll 2. We save the new predicted probability here.

Now we calculate the new log of the odds prediction for the second person. The log of the odds prediction is the previous prediction, 0.7, plus the output value from the tree scaled by the learning rate, 0.8 times -1, which gives us -0.1 for the new prediction. We can convert the log of the odds prediction into a probability and save the new predicted probability, 0.5, here. Note: this new predicted probability is worse than before, and this is one reason why we build a lot of trees, and not just one. Then we calculate the predicted probabilities for the remaining people.

And now, just like before, we calculate the new residuals, and just like before, the residuals are the difference between the observed and predicted probabilities. And just like before, we can plot the observed probabilities on a graph; however, now everyone has a different predicted probability. So, to calculate the residual for the first person, we plot the predicted probability, and the residual is the difference between the observed and predicted probabilities, and we save that value here. Now we calculate the residual for the second person: we plot the predicted probability, the residual is the difference, and we save that value here. We just do the same thing for all the remaining people. BAM!

Now that we have the residuals, we can build a new tree, and then we need to calculate the output values for each leaf. Let's start with this leaf. Note: only the second person goes to this leaf, so we plug the residual into the formula for the output values, then we plug in the last predicted probability, do the math, and the output value for this leaf is 2. Now let's calculate the output value for this leaf. Note: only the third person goes to this leaf, so we plug the residual into the formula for the output values, then we plug in the last predicted probability, do the math, and the output value for this leaf is -2. Lastly, let's calculate the output value for this leaf. Note: a bunch of people go to this leaf, so we plug the residuals into the formula for the output values, and we plug in the predicted probability for each individual in the leaf. Now do the math, and the output value for this leaf is 0.6. BAM!

Now that we've calculated all of the output values for this tree, we can combine it with everything else we've done so far. We started with just a leaf, which made one prediction for every individual. Then we built a tree based on the residuals, the difference between the observed values and the single value predicted by the leaf. Then we calculated the output values for each leaf, and we scaled them with a learning rate. Then we built another tree based on the new residuals, the difference between the observed values and the values predicted by the leaf and the first tree. Then we calculated the output values for each leaf, and we scaled this new tree with the learning rate as well. This process repeats until we have made the maximum number of trees specified, or the residuals get super small. BAM!
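The whole fitting loop can be summarized in a short sketch. This is only an illustration, not the code behind the video: the feature values below are hypothetical stand-ins, and only the Troll 2 labels (four yes, two no), the 3-leaf limit, the learning rate of 0.8, and the two-tree limit come from the example.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Hypothetical stand-ins for the six people:
# columns = likes popcorn (0/1), age, favorite color one-hot (blue, green, red)
X = np.array([[1, 12, 1, 0, 0],
              [1, 87, 0, 1, 0],
              [0, 44, 1, 0, 0],
              [1, 19, 0, 0, 1],
              [0, 32, 0, 1, 0],
              [0, 14, 1, 0, 0]])
y = np.array([1, 1, 0, 0, 1, 1])      # loves Troll 2?

learning_rate = 0.8                   # large, for illustration (0.1 is more common)
n_trees = 2

# Initial leaf: the same log(odds) prediction for everyone
log_odds = np.full(len(y), np.log(y.sum() / (len(y) - y.sum())))
trees, leaf_outputs = [], []

for _ in range(n_trees):
    prob = np.exp(log_odds) / (1 + np.exp(log_odds))
    residuals = y - prob                                   # pseudo residuals

    # Fit a small regression tree to the residuals (at most 3 leaves)
    tree = DecisionTreeRegressor(max_leaf_nodes=3).fit(X, residuals)
    leaf_ids = tree.apply(X)                               # which leaf each person lands in

    # Replace each leaf's value with sum(residuals) / sum(p * (1 - p))
    outputs = {leaf: residuals[leaf_ids == leaf].sum()
                     / (prob[leaf_ids == leaf] * (1 - prob[leaf_ids == leaf])).sum()
               for leaf in np.unique(leaf_ids)}

    # Nudge everyone's log(odds) prediction by the scaled output value
    log_odds = log_odds + learning_rate * np.array([outputs[l] for l in leaf_ids])

    trees.append(tree)
    leaf_outputs.append(outputs)
```

Each pass computes the pseudo residuals, fits a small regression tree to them, replaces each leaf's value using the formula above, and moves the log(odds) predictions by the scaled output values.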
Now, for the sake of keeping the example relatively simple, imagine that we configured Gradient Boost to make just these two trees, and we need to classify a new person as someone who loves Troll 2 or does not love Troll 2. Prediction starts with the leaf. Then we run the data down the first tree and add the scaled output value, and we run the data down the second tree and add the scaled output value. Now we just do the math and get 2.3 as the log of the odds prediction that this person loves Troll 2. Now we need to convert this log of the odds into a probability, so we plug the log of the odds into the logistic function, do the math, and the probability that this individual will love Troll 2 is 0.9. Since we are using 0.5 as our threshold for deciding how to classify people, and 0.9 is greater than 0.5, we will classify this person as someone who loves Troll 2. TRIPLE BAM!

Note: before we go, I want to remind you that Gradient Boost usually uses trees with between 8 and 32 leaves. We used small trees in this StatQuest because our training dataset was super small. Also, be sure to watch Part 4 of this exciting series on Gradient Boost. Next time we'll dive deep into the math of how Gradient Boost is used for classification, and we'll derive the equation used to update the leaves, and that will make you feel totally awesome. MEGA BAM!

Hooray! We've made it to the end of another exciting StatQuest. If you liked this StatQuest and want to see more, please subscribe. And if you want to support StatQuest, well, consider buying one or two of my original songs, or buying a t-shirt or a hoodie. The links to do this are in the description below. Alright, until next time, Quest on!
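To tie the worked example back to the training sketch above, here is how the final classification for a new person could be computed, continuing from that sketch (the new person's feature values are made up).

```python
# Continuing the training sketch above: classify one new person
x_new = np.array([[1, 25, 0, 1, 0]])                   # hypothetical popcorn, age, color

log_odds_new = np.log(y.sum() / (len(y) - y.sum()))    # start with the initial leaf
for tree, outputs in zip(trees, leaf_outputs):
    leaf = tree.apply(x_new)[0]                        # run the data down the tree
    log_odds_new += learning_rate * outputs[leaf]      # add the scaled output value

# Convert the log(odds) to a probability and apply the 0.5 threshold
prob_new = np.exp(log_odds_new) / (1 + np.exp(log_odds_new))
print("loves Troll 2" if prob_new > 0.5 else "does not love Troll 2")
```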
Info
Channel: StatQuest with Josh Starmer
Views: 212,206
Keywords: StatQuest, Josh Starmer, Gradient Boost, Machine Learning, Statistics, Classification, Binomial, Data Science
Id: jxuNLH5dXCs
Length: 17min 2sec (1022 seconds)
Published: Mon Apr 08 2019