Gradient Boosting: Data Science's Silver Bullet

Captions
Hey everyone, welcome back. Today we're going to be talking about gradient boosting. I know I say I'm excited about every single video, but I'm very, very excited about this one, and the reason is that gradient boosting is an extremely flexible method in more ways than one; we'll talk about all of those ways. The other big thing is that gradient boosting is something you often use right out of the box on the job. Many of the other techniques you've learned, like linear regression or logistic regression, are of course important to learn; they're fundamental to data science and form the foundation of many things, but you're probably not going to use a linear regression on the job by itself. You're probably going to use it in conjunction with other things. Gradient boosting, in my experience at least, is one of those models (or rather a family of models, as we'll see) that has industry-level applications right away. That's why it's pretty important to learn; once you've got this down, I think you become a lot stronger as a statistician or data scientist. So let's go ahead and get started.

We're going to start with a real-world situation, as we always do. Let's say you have the good fortune of owning an ice cream shop, and you're trying to build a model that predicts the number of ice cream cones you're going to sell on any given day. There are some features out there, say temperature and the day of the week: some list of features you can use in your model. So, in a little picture, you're trying to take those features x and learn some kind of prediction function f(x) which will hopefully tell you how many ice cream cones you sell on any given day.

Let me pause here and talk about the first huge pro of gradient boosting. Although we're clearly trying to solve a regression problem today (it's a continuous variable we're trying to predict), gradient boosting can solve regression problems, classification problems, and more exotic things like ranking problems as well. It's a general-purpose framework that isn't just good for one type of problem; it can really solve many, many types of problems. That's one of the big flexibilities it has going for it.

Now let's talk about the first word in gradient boosting, which is boosting. If you haven't learned about boosting before, it can be a somewhat weird concept, so let me try to break it down before moving forward. Boosting says that the final prediction function f(x) we learn is going to be the sum of M weak learners. Mathematically, whatever function we learn at the end of the day can be expressed as the sum over i = 1 to M of these f_i(x)'s, each of which is what we call a weak learner. Since this is a regression problem, you can think of a weak learner as, for example, an underpowered linear regression, an underpowered support vector regression, or any kind of regression model you've learned so far. The only condition is that it should be underpowered: it should not be too complicated. That phrase is more of an art than a science, but it basically means that you're not trying to capture all the dynamics in your data with any single weak learner; you're only trying to capture some of them.
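To make that additive structure concrete, here is a minimal Python sketch of the idea; the names weak_learners and predict_boosted are just illustrative, not from the video, and each weak learner is assumed to be any fitted model with a .predict method:

```python
import numpy as np

def predict_boosted(weak_learners, X):
    """Boosted prediction: the final model is just the sum of the weak learners.

    weak_learners: list of fitted models, each with a .predict(X) method
    X: feature matrix of shape (n_samples, n_features)
    """
    prediction = np.zeros(X.shape[0])
    for f_i in weak_learners:
        prediction += f_i.predict(X)
    return prediction
```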
The idea of boosting is that the next weak learner you train learns from the mistakes of all the weak learners that came before it. You train some really bad weak learner first; the next one learns from the mistakes of that one; the one after that learns from the mistakes of the first two, and so on. At the end of the day, although each of them is weak by itself, when you combine them together you end up with something that's actually extremely powerful. Okay, so that is the idea of boosting.

Now let's go into the step-by-step process of gradient boosting, and I'll make sure to explain the intuition and the interpretation along the way.

Let's start with step zero. It's called step zero because it's not really part of the process, but it is something crucial you have to decide on before starting: define your loss function. You need to pick some loss function that takes two inputs: the true label y, in this case the true number of ice cream cones you sell on any given day, and the prediction y hat, which is the output of your gradient boosted model so far. The higher the loss, the worse a job you're doing; the lower the loss, the better. This loss function needs to satisfy one other very important condition: it needs to be differentiable, and hopefully differentiable in some kind of efficient, quick way. This goes hand in hand with the fact that this is called gradient boosting; if we can't easily take gradients of our loss function, none of this is as easy as it could be. So we need to pick some kind of differentiable loss function, and now we can start with the process.

Step one of the process is that we start with some extremely weak learner F1(x). For example, we can just take the mean of all the numbers of ice cream cones sold in our training data, and that can be our first model. Clearly a terrible idea to predict the mean for every single day, but it's starting somewhere. To throw a picture in here: if you pick one observation i in your training data, you can plot the predicted value for that observation on the x axis against the loss, that is, the loss function evaluated at the true value and that predicted value, on the y axis. Let's say the current prediction we have, which is the average of all the y's, sits at some point on that curve. Clearly we're not doing as good a job as we could for this observation; there are many other predicted values that would give us a lower loss. That's just saying we're not done yet, but this picture is going to help us understand what comes next.

Speaking of what comes next, step two is the key step, so it's going to be helpful to understand and I'll spend some time on it. We are going to compute these quantities r hat_{1,i}. The 1 is just saying that we're currently looking at the first weak learner; it will get updated as we move on. The i is the data point we're currently interested in, so i goes from 1 to n if we have n data points. For each data point, we compute a gradient: the derivative of the loss function with respect to the prediction we currently have, evaluated at the current prediction F1(x_i), which right now is the mean of all the y's.
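The video leaves the loss function as a free choice, so as a concrete sketch, here are steps zero through two assuming squared error loss; the variable names and the toy label values are made up for illustration:

```python
import numpy as np

def squared_error_loss(y_true, y_pred):
    # L(y, y_hat) = (y - y_hat)^2; higher means a worse prediction
    return (y_true - y_pred) ** 2

def loss_gradient(y_true, y_pred):
    # dL/dy_hat = -2 * (y - y_hat), the per-example gradient from step two
    return -2.0 * (y_true - y_pred)

# Step one: the first "weak learner" is just the mean of the training labels
y_train = np.array([120.0, 95.0, 180.0, 60.0, 150.0])  # cones sold per day (made-up numbers)
F1 = np.full_like(y_train, y_train.mean())

# Step two: one gradient per training example, the r_hat_{1,i} values
r1 = loss_gradient(y_train, F1)
```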
That's been very mathematical so far, so let's look back at the picture to make sure we understand what this means graphically. If I take the gradient of the loss function for the current observation i at the place where the current prediction sits, I'm basically drawing the slope of the loss curve right at that point. Another key thing is that we do the exact same thing for all data points i = 1 to n; this is just the picture for one possible data point, but we have similar loss-function pictures for all of our data points, and we compute all n of these gradients, which get stored in these r variables.

Now, why did we just do this? We did it because we would like the next weak learner we train to move each prediction in the direction of decreasing loss. For example, if the gradient I just computed is negative, I know to move the prediction one way; if my prediction were on the other side of the minimum, the gradient would have been positive and I'd know to move in the other direction. The gradient is telling me which way to go to reduce the loss on each of these different examples. So if I take a little step in that direction, I'm on my way to decreasing the loss on this particular example, and on every example once I generalize this process. That's the idea.

Step three is that we fit the next weak learner; here's where the new weak learner comes in, learning from the mistakes of the old one. The way we fit this new weak learner is that we train a weak model whose target variable is r hat_1, all of these gradients, and whose inputs are the same features we've always been using. The one thing that stays constant in this entire process is that each weak learner is trained on the original set of features we've been looking at the whole time. For example, let's say our weak learner is a linear regression; then this f2(x) is going to be that weak-learner linear regression.

Now what we do is figure out "how much", in quotation marks, of f2 to add to the current model, which is just F1(x), and we do that by solving a little optimization problem. It looks complex, but let's break down what's going on. The inner part sums over i = 1 to n, every single data point; we're just summing the loss for every data point, the loss from plugging in the true value y_i and the proposed new prediction, which is the old model F1(x_i), the old weak learner, plus (and I just snuck a little variable in there, as you probably noticed) gamma times the new weak learner f2(x_i). You can think of this as a recipe: I know I'm going to add some quantity of this new weak learner to my model so far, and this is just asking how much to add, because if I add too much I'll overshoot and if I add too little I'll undershoot. This is officially called a line search: it looks for the correct gamma, the correct amount of the new weak learner to add to the existing model, the amount that minimizes the sum of the losses across all n observations.
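Continuing the squared-error sketch from above, here is one way steps three and four might look in Python; the decision-tree weak learner, the made-up temperature feature, and the coarse grid standing in for a proper line search are all illustrative choices, not prescriptions from the video (note that with the video's sign convention, fitting to the raw gradient, the best gamma comes out negative, which the edit note just below touches on):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Step three: fit the next weak learner to the gradients, using the
# same original features we have been using all along.
X_train = np.array([[30.0], [22.0], [35.0], [15.0], [33.0]])  # e.g. temperature (made-up)
f2 = DecisionTreeRegressor(max_depth=1).fit(X_train, r1)

# Step four: line search for gamma, "how much" of f2 to add.
# A coarse grid stands in for a proper one-dimensional optimizer.
candidate_gammas = np.linspace(-2.0, 2.0, 401)
total_losses = [
    squared_error_loss(y_train, F1 + gamma * f2.predict(X_train)).sum()
    for gamma in candidate_gammas
]
gamma2 = candidate_gammas[np.argmin(total_losses)]

# The updated model: F2(x) = F1(x) + gamma2 * f2(x)
F2 = F1 + gamma2 * f2.predict(X_train)
```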
The answer to that search is going to be gamma hat 2. (Hey, sorry to interrupt you, past Ritvik. I just realized while editing this video that, although all of the math here is correct, it might have been a better idea to put a negative sign in front of the gradient so that this matches gradient descent more closely; that way our gammas would always be positive, the amount by which we travel in that direction.)

And that's it: we've done one iteration and worked through all the math and intuition we need. Now we can officially say that our new model F2(x) is equal to the old weak learner F1(x) plus that amount gamma hat 2 of the new weak learner f2(x). This is going to do better than the old weak learner alone, because we have hopefully moved the predictions for all n observations in a direction where their loss functions are a little lower than they were before. So we're getting better performance, but we're probably not done yet, because this is still built from just a couple of weak learners.

The last step is simply to return to step two and do this process all over again. Let's talk through the high-level steps just to make sure you understand. The next thing we do is compute these gradients again, but now our prediction is the sum of two weak learners. We fit a third weak learner on those gradients using our features, we figure out how much of that weak learner to add to the existing model, and then our updated model is F2(x) plus gamma hat 3 times f3(x). We just go on for as many iterations as we want, adding these weak learners step by step by step. And ladies and gentlemen, that is gradient boosting.

Now, if you're like me, you hopefully understand the process here, but you might not be satisfied with why it works. It seems like everything we did was reasonable, but why is this framework actually going to work at the end of the day? I think I owe you that explanation, and that's, let's say, the second-to-last thing we'll talk about today.

Why does this work? Well, think about being at step m + 1 in the process. You've trained m weak learners so far, so your current model looks like F_m(x), the sum of the m weak learners trained so far, and now you've proposed some new weak learner f_{m+1}(x), some quantity of which you're going to add. Why does the result do a better job than F_m(x) by itself? Let's trace through everything we know. The new model is approximately equal to the old model plus that amount gamma times r hat_m. Why? Because the new weak learner f_{m+1}(x) was specifically trained to do a good job of predicting r hat_m, so we can substitute it in; it's only an approximate equality because, being a weak learner, it's just trying to do as good a job as it can. And what is r hat_m? Looking back at its definition, r hat_m is the derivative of the loss function with respect to the current prediction F_m(x). So, in words, what you're looking at is that the new model F_{m+1}(x) is the previous model plus a small step in the direction of decreasing loss.
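Putting the whole loop together, here is a compact, self-contained sketch of the full procedure under the same assumptions as before (squared-error loss, shallow decision trees as weak learners, a grid line search); none of these specific choices come from the video beyond what it describes in words:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boost(X, y, n_rounds=50, max_depth=2):
    """Return a boosted model as a constant start plus (gamma, weak_learner) pairs."""
    F = np.full(len(y), y.mean())            # step one: start from the mean
    model = [(1.0, y.mean())]                # keep the constant starting prediction
    for _ in range(n_rounds):
        r = -2.0 * (y - F)                   # step two: gradients of squared error
        f = DecisionTreeRegressor(max_depth=max_depth).fit(X, r)   # step three
        # step four: line search for gamma over a coarse grid
        gammas = np.linspace(-2.0, 2.0, 401)
        gamma = min(gammas, key=lambda g: ((y - (F + g * f.predict(X))) ** 2).sum())
        F = F + gamma * f.predict(X)         # update the running prediction
        model.append((gamma, f))
    return model

def predict(model, X):
    (_, base), *rest = model
    out = np.full(X.shape[0], base)
    for gamma, f in rest:
        out += gamma * f.predict(X)
    return out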
That is literally just gradient descent, folks. We take the previous model and walk a little bit in the direction where the loss is minimized for all these different data points, and that hopefully gives you an intuitive idea of why this works and why it's so powerful.

Now I just want to recap how amazing this is. I know this is a bit of a longer video than usual, but I think it deserves to be. Look at all the flexibility we have in this process. The problem you can solve: regression, classification, ranking, whatever else you want; that's one point of flexibility. The loss function: totally up to you, as long as it's easily differentiable. The weak learners: totally up to you; if you're doing a classification problem, use logistic regression, use decision trees, use whatever you want. There are so many points of flexibility; it's such a catch-all framework that does really well in the real world. This is something you have to add to your tool belt, in my opinion.

One note I'll make is that you'll typically hear about the weak learners being decision trees. Gradient boosted decision trees have become a ubiquitous term in industry and academia, because we've seen across many, many data sets and examples that if your weak learners are decision trees, whether the problem is classification, regression, ranking, or anything else, this does really, really well.

The final thing I'll say before I let you all go: we've talked about all the amazing things, but I have to talk about the drawbacks; we have to cover both sides. The main two drawbacks are, first, interpretability. We hopefully understand how the method works, but when your final model is the sum of all these weak learners, it can be pretty tough to understand or interpret how it actually behaves. You can use generic tools like partial dependence plots and Shapley values, but understanding such a model intuitively is not the easiest thing in the world; we sacrifice some interpretability in exchange for a lot of performance. The other consideration is that it can be pretty computationally expensive. There are a lot of moving parts, so you might need serious computing power, especially if you have big data.

Oh yes, one last thing: we want to be careful that it doesn't overfit. For example, if we're doing gradient boosted decision trees, you can take various steps to make sure it's not overfitting: keep each decision-tree weak learner small, maybe just a couple of layers deep at most, and limit the number of weak learners, for example by keeping M to something that's not massive, so the model doesn't overfit the training data. It is definitely something to watch out for, but it's not even close to disqualifying; this framework is really cool.

So hopefully I've convinced you of the power of gradient boosting, or at least shown you how it works. If you liked this video, please like and subscribe for more videos just like this, and I'll catch you next time.
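As a practical footnote to the overfitting discussion in the transcript, scikit-learn's GradientBoostingRegressor exposes exactly those knobs: the number of weak learners, the depth of each tree, and a learning rate that globally shrinks each step. The specific values below are illustrative defaults, not recommendations from the video:

```python
from sklearn.ensemble import GradientBoostingRegressor

model = GradientBoostingRegressor(
    n_estimators=200,   # M: cap the number of weak learners
    max_depth=2,        # keep each decision-tree weak learner small
    learning_rate=0.1,  # shrink each step to help avoid overfitting
)
# Typical usage (X_train, y_train, X_new are placeholders):
# model.fit(X_train, y_train); model.predict(X_new)
```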
Info
Channel: ritvikmath
Views: 3,885
Id: en2bmeB4QUo
Length: 15min 47sec (947 seconds)
Published: Wed Sep 29 2021