23. Gradient Boosting

Video Statistics and Information

Captions
Today we'll continue our discussion of ensemble methods. Last week we spoke about trees and then moved on to parallel ensembles of trees: random forests and bagging. In those scenarios we built lots of prediction functions, lots of trees in that case, to solve the same problem, and at the end we combined them in a rather simple way: just average them, or do some simple kind of majority voting if they were hard classifiers.

Today we're going to talk about a different type of ensemble approach called sequential ensembles. In a sequential ensemble, the idea is that you build a model, a prediction function, to fit your data; then you see where your prediction function did poorly on your training data, and you fit a new model to correct the mistakes of the first one. The final prediction is then some combination of the two, typically a linear combination of the two prediction functions. Then you can go another step: see where the combination does poorly on the training data, fix those errors with a third prediction function, and combine them. It's a sequential ensemble because you sequentially add one more prediction function to the collection at a time, each one depending on what happened before, because it tries to make things better at each step. So it's not parallelizable in the same way that parallel ensembles are.

The traditional starting point for sequential ensembles is a method called AdaBoost, but we're not going to talk about AdaBoost directly; we'll consider it briefly as a special case of the methods we're covering today: forward stagewise additive modeling and the more broadly applicable gradient boosting. I'm excited that we're getting to cover gradient boosting, because it's one of the state-of-the-art techniques used in practice today. In the competitive machine learning world of Kaggle and that sort of thing, gradient boosting and neural networks are generally the leading methods, and we're going to get a good understanding of it today.

Let's start with an example from the nonlinear regression setting. Here is an illustration of some training data: the input space is just one-dimensional x, the output space is a quantitative response variable y, and you can see the data is not going to be fit well by a line. So what are our options? At this point we have several. One is polynomial regression: this amounts to creating new features, like squares, and then fitting a polynomial function using our linear methods; you just add a square term as a new feature. To generalize this, you make a lot of nonlinear transformations of x, those are your new features, and then you fit with a linear method. We could also use a random forest, like we discussed last week. There's one other related option we've discussed: a kernel method, with some kernel like the RBF kernel. (When I said SVM with the square loss, that doesn't quite make sense, since the SVM is the hinge loss; so, more generally, any kernel method for regression.) And we could use a regression tree directly, which we spoke about last week.
So those are some basic methods; eventually we'll talk about neural networks, which could fit this directly as well.

Let's focus in on the approach of creating new basis functions. We could imagine creating some collection of new functions of x, for instance the square function, and coming up with a linear combination of these transformations of x. What we've written here, and what we're calling basis functions, is exactly a feature function, a feature, which we've been discussing this entire course; but when you take linear combinations of these things and they look like functions, it sometimes feels more natural to call them basis functions. In any case, you'll see different terms in different treatments.

Once we've chosen these feature functions, the goal is to find the best linear combination of them, and for that we can use any of our linear approaches: we can do regularization like lasso or ridge regression, square loss, L1 loss. The key point is that this collection of prediction functions, once we've fixed the features, the basis functions, is just a linear hypothesis space.

If we restrict the hypothesis space to these linear combinations of fixed feature functions, then the prediction functions we get are continuous and real-valued. For regression, that's fine; we're all set. But if we want to solve a problem that's not regression, like classification, we've spoken about that as well: we can transform our real-valued score function into a class by thresholding. We also spoke a few weeks ago about probabilistic modeling, where a real-valued score function is transformed into the parameter of a distribution; for Bernoulli, for classification, the parameter is just the probability of one, and the transformations we talked about were the logistic transformation and the inverse-CDF-of-a-normal transformation, different ways to map a score that could take any real value to a number between 0 and 1. We could also use it for ranking: if you have a lot of candidate results for a search query, you can evaluate the score of each candidate and sort the candidates by their scores, and that gives a ranking for search. So even though it seems like this score function would only be useful for regression, via various transformations you can use it for classification, probabilistic modeling, multiclass classification with multinomial logistic regression, different things. This is a pretty widely used hypothesis space, even though it's just real-valued.
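To make the fixed-basis-function setup concrete, here is a minimal sketch in Python (NumPy assumed). The particular basis functions, the toy data, and the helper names (`basis`, `design_matrix`) are illustrative choices of mine, not from the lecture; the point is just that once the basis is fixed, fitting the coefficients is an ordinary linear least-squares problem.

```python
import numpy as np

# Hand-picked (fixed) basis functions g_1, ..., g_M of a scalar input x.
basis = [
    lambda x: np.ones_like(x),   # constant term
    lambda x: x,                 # identity
    lambda x: x ** 2,            # square
    lambda x: np.sin(x),         # some nonlinear transform
]

def design_matrix(x):
    """Stack the basis-function evaluations into an n-by-M feature matrix."""
    return np.column_stack([g(x) for g in basis])

# Toy 1-d regression data that a straight line would fit poorly.
rng = np.random.default_rng(0)
x = np.linspace(0, 6, 100)
y = np.sin(x) + 0.3 * x + rng.normal(scale=0.2, size=x.shape)

# Fit the coefficients v_1, ..., v_M by least squares; ridge or lasso
# would work just as well, since the model is linear in v.
v, *_ = np.linalg.lstsq(design_matrix(x), y, rcond=None)

f = lambda x_new: design_matrix(x_new) @ v   # the fitted score function
```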
So far we've been speaking about manually coming up with these feature functions, our basis functions. The way we're going to frame today's topics is that we find these feature functions automatically, as part of the learning process. This is sometimes called adaptive basis function modeling, in the sense that the basis functions, the feature functions, are chosen adaptively for the data and the problem, rather than fixed in advance manually.

In this setting we introduce what we'll call a base hypothesis space H, which consists of our potential feature functions. It could be a huge class of functions, even uncountably infinite, and from it we will choose a finite, relatively small subset of functions to be our basis. We still end up with something of the same form, a linear combination of feature functions h_m, but now both the coefficients v_m and the feature functions h_m will be chosen adaptively using the training data. So we can think of our hypothesis space as the collection of linear combinations of feature functions, where each feature function is itself chosen from the base hypothesis space. We'll call the set of functions we eventually predict with the combined hypothesis space, or just the hypothesis space: the collection of linear combinations of some finite choice of functions from our base hypothesis space. Clear?

All right, so given some data, learning is to choose both the coefficients and the functions. How can we do that? Let's start, as we usually do, by thinking about empirical risk minimization: we have some loss function, and we look at the average loss on our training data for a function f chosen from our combined hypothesis space. We can explicitly call out the parameters of this objective: the coefficients and the functions. We would love to optimize over both of these. Optimizing over the functions h_m directly, just raw, sounds difficult, so let's assume we can parametrize the base hypothesis space: we take Theta to be some subset of R^b that indexes the set of all, possibly infinitely many, feature functions in our base hypothesis space, so that each theta_m corresponds to the choice of one feature function.
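Written out (my notation, following the lecture's v_m, h_m, and theta), the combined hypothesis space and the ERM objective being described are:

```latex
% Combined hypothesis space built from the base hypothesis space H:
\mathcal{F} \;=\; \Big\{\, f(x) = \sum_{m=1}^{M} v_m\, h_m(x) \;:\; v_m \in \mathbb{R},\; h_m \in \mathcal{H} \,\Big\}

% ERM objective over coefficients and basis functions; with the
% parametrization h_m = h_{\theta_m}, \theta_m \in \Theta \subseteq \mathbb{R}^b:
J(v_1,\dots,v_M,\,\theta_1,\dots,\theta_M)
  \;=\; \frac{1}{n}\sum_{i=1}^{n} \ell\!\Big( y_i,\; \sum_{m=1}^{M} v_m\, h_{\theta_m}(x_i) \Big)
```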
Now this is something a little more comfortable, because all of our parameters are either scalars or vectors, and we could imagine solving this with our traditional techniques, gradient descent or something like that, at least potentially. We can in fact do it for certain classes of models; one example we haven't talked about yet, but will soon, is neural networks, which fit into this framework. It's important to note that this is more general than anything we've talked about so far, because we have a linear combination of functions where the functions themselves are being optimized over.

But what if we want our base hypothesis space to be something like trees, something more complicated, something we can't obviously differentiate? That's the key piece: we can use our usual techniques like SGD or gradient descent if we can differentiate the objective function with respect to theta. It's easily differentiable with respect to the v's; that's not the issue. Can we differentiate it with respect to theta? If we can, we're good to go with gradient descent. Gradient boosting, what we're talking about today, gives us a technique to use when we cannot differentiate the objective with respect to the thetas. That's the problem gradient boosting solves for us.

So take the specific case of trees and think about it for a moment: can we parametrize the space of trees with some finite-dimensional vector in R^b? What issues do we run into? For one thing, what should b be? Maybe we fix the depth or the size of the tree, say a fixed number of splits, and have a dimension for the split point of each of the b splits. But then we also have to select which variable we're splitting on at each node, which is a little awkward. Maybe we could get around that somehow, or simplify the set of trees to get some traction: what if we fixed in advance the variables we split on at every node, so we only have to record where we split? Even then we have an issue: even if we could parametrize a set of trees, we still couldn't do a gradient method, because the output of a tree does not change continuously with any of these parameters. As you change a split point, the prediction is not affected until it is, and when it is affected it has a jump change, because trees are piecewise constant. So this is a somewhat long-winded argument for why, first of all, it's not easy to find a fixed-dimensional parametrization of trees, and second, even if you could on some restricted set of trees, the predictions are not continuous, much less differentiable, with respect to the parameters.

The gradient boosting technique we'll talk about today applies whenever the loss function is differentiable with respect to the predictions, and whenever we have a base hypothesis space, a set of potential feature functions, on which we can do regression. Regression trees are a fine example of such a base hypothesis space: you can do regression with them even though the predictions are not continuous, or differentiable, with respect to any parametrization. That works just fine for gradient boosting.

A little overview of what's coming. We'll first talk about forward stagewise additive modeling, which, when it works, is very nice; but in forward stagewise additive modeling there's an optimization problem that we only sometimes know how to solve, and when we can't, we're stuck. Gradient boosting gets us past that sticking point by taking an iterative approach to solving that optimization problem. Then we'll discuss various variations on gradient boosting.

So first, forward stagewise additive modeling. The idea is to take the description of sequential ensemble methods I gave fairly literally. We're dealing only with continuous-valued prediction functions, so we start with a function that predicts identically zero. Then we have a kind of inductive description: in every round, we have a prediction function that is some linear combination of the m-1 previously chosen basis functions, and in the m-th round the objective is to choose the next function h_m, and the corresponding coefficient on that function, to add to the ensemble.
We're going to call this function h_m a step direction, and the coefficient we'll call a step size; this terminology is meant, initially, to be suggestive of where we're going. You can think of what we're doing as iterative optimization in function space. Remember gradient descent: we picked a direction, the negative gradient, took a step in that direction, and added that to what we had so far. Here, in each round, we pick a function, which we'll think of as a step direction, and a step size multiplying that step direction, and we add it to the function we had up to that point.

So we can write down an objective function very naturally: we're looking for the minimizer, over step size nu and step direction h, of the average loss on the full training set of the potential new function, the previous prediction function plus the new piece, f_{m-1} + nu h. The only question is whether we can solve this optimization problem in every round; if we can, that solves the problem. It turns out we can for certain loss functions and certain function classes.

Let's take one particularly intuitive example: L2 boosting. This is the regression setting, and we use the square loss, so we just plug the square loss into that objective. We can simplify a bit if we make a simple assumption about our base hypothesis space H. Remember, when we optimize over h, h comes from some base hypothesis space, a set of possible step directions. If we assume that hypothesis space is closed under rescaling, that is, if a particular function is in the space then any rescaling of it is also in the space, then we don't actually need the nu, because it just gets absorbed into the h. So let's drop it for simplicity. We've taken two steps here: we've rearranged a bit so that y_i and f_{m-1}(x_i) are grouped together, and we've dropped the nu.

Now study this objective for a second and see how you might interpret it. First, what is f_{m-1}(x_i)? It's the current prediction function's prediction on x_i. And y_i minus that? It's the error, or residual, on the i-th point: given the prediction function we have so far, this term is the residual on the i-th data point. So how might you interpret the full objective function? You're fitting the residuals: you're finding an h that fits the residuals as well as possible with square loss. That's L2 boosting: we find the function in the hypothesis space that fits best in terms of L2 loss, we look at the residuals, we fit the residuals with the same hypothesis space, and we repeat.

Your concern is that this is potentially heavily overfitting noise? Yes, there is a potential for heavy overfitting; let's come back to that. Actually, to sneak ahead: one thing we can do is go back to this framework, choose the h that fits best, but then not take the full step; take nu to be, say, 0.1 instead of 1. Then we don't go all the way toward fitting the residuals perfectly; we just take a little step in that direction, and that's the type of regularization we'll mostly end up using. Good point.
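Here is a minimal sketch of that L2 boosting loop, assuming scikit-learn is available; `DecisionTreeRegressor(max_depth=1)` plays the role of the base regression algorithm (regression stumps, which the next example also uses), `nu` is the shrinkage just mentioned, and the function name `l2_boost` is mine:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def l2_boost(X, y, n_rounds=50, nu=0.1):
    """Forward stagewise additive modeling with square loss:
    each round fits a stump to the current residuals."""
    stumps = []
    f = np.zeros_like(y, dtype=float)   # start with the zero function
    for _ in range(n_rounds):
        residuals = y - f               # where the current fit does poorly
        h = DecisionTreeRegressor(max_depth=1).fit(X, residuals)
        f += nu * h.predict(X)          # small step toward the residuals
        stumps.append(h)
    # The final prediction function: f(x) = nu * sum_m h_m(x)
    return lambda X_new: nu * sum(h.predict(X_new) for h in stumps)
```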
Let's look at a quick example. Consider the base hypothesis space of regression stumps: trees with only a single split. That's another way we regularize, by using a very simple base hypothesis space. Here the input space is one-dimensional, and we have a single split. This picture represents a tree, the prediction function corresponding to a tree with a single split: the split is at 5, to the left we predict -2, to the right we predict 7. That's an example of what we'd call a regression stump, a tree with one split.

So let's consider forward stagewise additive modeling with L2 loss over the base hypothesis space of regression stumps on some data, and see what happens. Here's some data; let's run this for a few rounds. In the first round we start with the constant prediction function at zero, fit a stump, and get this. Then look at a few subsequent rounds: each picture compares the previous round to the next. This is rounds two and three; here are three and four, four and five; and by rounds 49 and 50 we're certainly overfitting. Along the lines of capacity control to prevent overfitting, we could, for instance, assess performance on a validation set after each stage, something like that; we'll talk about it more later.

Any questions? Yes: we start off with the prediction function constant at zero. So first of all, when the prediction function is zero, the residual is just the original data: we take our original data, and in this setting the residuals are just the original y's. We have this objective function, which amounts to taking our data, feeding it to the regression stump algorithm, and getting back a regression stump, and we claim it looks like this. That's right: we're choosing from a nonlinear hypothesis space, of which our main example is trees.

Another question: so we're sort of shopping for stumps, like how tall the vertical pieces are? That's right. There are exactly three parameters to a regression stump when you have one input dimension: where you split, the value to the left of the split, and the value to the right. Only three things.

Next question: obviously for a linear hypothesis space this doesn't work at all, but how would you characterize the set of base hypothesis spaces for which you get this kind of behavior? Let's take the first comment first: why won't this work for a linear base hypothesis space? One reason: when you add linear models up, you just get another linear model, so you may as well fit a linear model in the first place. But more to the point, you've already fit the data as well as you can with linear models, so the residuals have, in some sense, zero correlation with the inputs. Then the question is: given that it doesn't work for linear models, is there some way to characterize when it will work?
We typically use nonparametric base hypothesis spaces, but I think a simple way to say it is this: if the final combined hypothesis space is expressive enough to fit your data, then you're in good shape. In the linear case you get no additional expressivity in the combined hypothesis space. So that's a good way to think about it: can you, with linear combinations of your base prediction functions, come up with a function that will fit your data?

OK, so L2 boosting worked out very nicely. There's another example where forward stagewise additive modeling works out nicely, and it turns out to give us AdaBoost, the classic sequential ensemble method. This addresses the classification problem: the outcome space is {-1, 1}, the action space is the reals, our score function, and we're in the margin setting. Very quickly, recall that when you make a prediction on (x, y), the margin is your predicted score times the label y, where y is 1 or -1. If f(x) and y have the same sign, that's good: the margin is a negative times a negative, which is positive, or a positive times a positive. So a correct prediction has positive margin, and we want large margins in these models.

We've described some margin-based loss functions: the hinge loss, the logistic loss, and the zero-one loss, all based on the margin. To get AdaBoost, we introduce a new loss function called the exponential loss. You can see it has the right characteristics: small loss over here where we're doing well, classifying correctly with positive margin, and aggressive loss over here. Do you see any potential issues with using the exponential loss? Yes: it increases really quickly as the margin goes negative. Visually, at this scale at least, that's the main distinction between these loss functions; we'll revisit it in a minute.

It turns out, with some algebra that we won't work through here but that is sketched in a note online, that if we take a base hypothesis space of classifiers that output the class labels as the numbers -1 and 1, we can treat each classifier also as a regression function; it's just that its values are only ever -1 and 1. Then forward stagewise additive modeling with this exponential loss reduces to a version of AdaBoost, and you can derive that fairly straightforwardly.

So back to the loss function: what's happening over here with large negative margins? These are the examples that are predicted incorrectly, since the margin is less than 0, but moreover they're confidently incorrect: a large margin in absolute value means a large score, which is a confident prediction, and here we're both confident and wrong. The exponential loss puts a really heavy loss on those examples. When might this be an issue? Mislabeled training data, yes, that's a great example. If you have a mislabeled training example, it might be an easy example, so the classifier correctly gives it a big positive score, but the actual label, which is mislabeled, is negative; then the margin is large and negative, the prediction was confident, and it's going to get a really big loss. Your objective function ends up heavily weighted by this error, because a large negative margin is penalized exponentially.
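For reference, here are the margin-based losses being compared, written as functions of the margin m = y f(x) (these are the standard forms matching the lecture's description):

```latex
\ell_{0\text{-}1}(m) = \mathbf{1}[m \le 0], \qquad
\ell_{\mathrm{hinge}}(m) = \max(0,\, 1 - m), \qquad
\ell_{\mathrm{logistic}}(m) = \log\!\left(1 + e^{-m}\right), \qquad
\ell_{\mathrm{exp}}(m) = e^{-m}
```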
What happens with the exponential loss is that it spends all this effort trying to correct these really bad errors, which might be mislabeled, or sometimes there's just noise in the world: for two identical x's, 70% of the time the label is 1 and 30% of the time it's -1, and there's nothing you can do to resolve that. You just do the best you can and predict the label that happens more frequently. In that case you don't really want to spend a lot of effort trying to fix the incorrectly predicted examples. This is why something like the exponential loss can cause issues, and in fact it's observed in practice that AdaBoost doesn't have great performance in scenarios with what's called a high Bayes error rate, situations where even the best possible prediction function can't predict with perfect accuracy, or with label noise; those are the scenarios where AdaBoost is empirically found to not do great.

All right, so each of these was an instance of forward stagewise additive modeling, which we wrote down very naturally from the intuition of adding one new function at a time while always trying to minimize the same empirical risk, the average loss. FSAM works great for certain loss functions, when we can solve the minimization problem, that optimization problem, at each step. We happened to find two examples where that works, but more generally it's not easy to solve the optimization problem of how to take your next step. For instance with the logistic loss, or cross-entropy, it's not clear, if you write it down, that you can solve it over the space of trees or something like that. The next topic is going to be a resolution of this problem for us.

Oh, sorry, I didn't define cross-entropy; it's a bit of a generalization of the logistic loss, and I'll briefly mention it so you know what's there. With the logistic loss, remember, we're evaluating a probabilistic prediction: I say the probability of class 1 is 0.8, the actual label is 1 or 0, and that corresponds to a certain log-likelihood loss, which is what we're averaging. Cross-entropy allows the ground truth itself to be a probability: instead of hard class labels we can have soft class labels. Imagine we had a crowdsourced set of labels for a classification problem that's difficult even to do by hand, with five labelers per example who don't agree. We could take a majority vote, or something like that, and train to that hard label; or we could try to fit the uncertainty. The five labelers give you a distribution, maybe a 0.8 probability of this example being class 1, and you can try to fit that probability: the target is to predict 0.8 every time you see that given input x. So cross-entropy is a loss between a predicted probability distribution and an actual probability distribution. That use is not common: people say cross-entropy but usually just mean something like the logistic loss or multinomial logistic loss. In the neural network world people like to speak about cross-entropy, even though they almost always have hard labels rather than distributions as labels.
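Written out, the cross-entropy between a ground-truth distribution p over classes and a predicted distribution q is (standard form; with a hard, one-hot label it reduces to the negative log-likelihood just described):

```latex
H(p, q) \;=\; -\sum_{k} p_k \log q_k
% With a hard label y (p_y = 1, all other p_k = 0):  H(p, q) = -\log q_y
```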
All right, so now we're going to get into gradient boosting, sometimes called AnyBoost, which is going to help us with this FSAM objective in the case where we don't know how to solve the inner optimization problem. This is our FSAM step, and the issue is that sometimes we do not know how to solve the optimization problem. So what to do? What if we try to solve it locally, in much the same way that we've solved all the optimization problems we've seen so far? We start at a point, and instead of trying to figure out where the minimum is in one step, we take a gradient, look locally for a good direction to step in, go a little bit in that direction, and incrementally optimize our function. That's the idea of how we're going to approach this minimization problem, because as written it's not local: it says find the best h and the best real-valued coefficient that minimize this objective function. And the hard part isn't the step size; it's the step direction.

So let's think about this local approach a little. Here we have an objective function written back in terms of just the prediction function f; this is the total loss on the training data. If we wanted a local method, it's almost like we want to do gradient descent on this as a function of f, gradient descent in function space, which feels complicated but is essentially the idea. What saves us from getting really deep into functional analysis is the observation that, even though f is a function, we only ever use the values of f at n specific points: we only ever examine the function values at the training points.

So let's reformulate this a bit. Write a vector f that pulls out the evaluations of the function f at each of the n training points. Now we have a column vector with the relevant evaluations of f, and we can write the objective function almost exactly as before, except now we're indexing into a vector instead of evaluating a function. And this is something we can do gradient descent on: f is just a vector, and if the loss is differentiable, we can differentiate this objective with respect to the vector f, find the gradient, and optimize over f. Let's push this a few steps: we'll write down the negative gradient direction, the negative gradient with respect to this vector f, in terms of the partial derivatives, and we can calculate that. If you're a little confused, you're right to be confused, because eventually we don't just want an optimal prediction function as a vector on the training points; we need a prediction function that generalizes to new points, not just the training data points. We'll come to that.

For now, this picture captures what we're doing; I pulled it from another source that uses slightly different notation, but I should be able to explain it just fine. The domain we're now optimizing over is the predictions of our function on two training points: this dimension is the prediction we're going to produce on training point x1, and this dimension is the prediction we're going to produce on training point x2.
So the vector f is a point in this bottom plane, and our objective function, our performance on these two training points, is a value above that plane: as a function of the predictions we make at those two points, we have the objective function. For any function f we get two predictions, a prediction on x1 and a prediction on x2, our two training examples in this image; given f, we get a particular objective function value, and that corresponds to one point on this graph. What the graph shows is the objective function value for all possible f's: you tell me you want to predict, say, 10 on x1 and -7 on x2, that corresponds to a point down here, and the graph tells us the value of the objective function, our total or average loss on those training points with those predictions. Is that clear?

That's correct: in this image, the dimension of the space is the training set size. We're optimizing over a space whose dimension is the training set size. We did that in one other situation earlier, remember? Kernel methods, exactly: via the representer theorem, we could express the optimum as a linear combination of things in terms of the individual training points. But for now we are potentially optimizing directly over the predictions on the n training points, so the space we're optimizing over is R^n, where n is our training set size. Is the picture clear? For any set of n predictions, characterized by a vector f in R^n, we get some objective value, and this is the graph of that objective value as a function of our predictions.

Given that, we can choose some initial set of predictions on the training points and do gradient descent, assuming this surface is differentiable, that is, assuming the objective function is differentiable with respect to the predictions. This is our objective function on the n-dimensional prediction vector; if it's differentiable with respect to f, we can do this type of gradient descent. So here we write -g for the step direction, which is the negative gradient, and notice there's one partial derivative for every training example. The negative derivative, for reasons that will be clear soon, is called the pseudo-residual in this context.

One thing to note is that if the loss is the square loss, this derivative, if you work it out, is exactly the residual. Let's work it out quickly: if the true label is y1, our prediction is f1, and the loss is the square loss, then the recipe is to differentiate with respect to the prediction f1. There's a chain rule: there's a minus sign from the -f1 inside, so taking the negative for the step direction gives minus minus, and up to a scale factor this is exactly the residual of the prediction, y1 - f1. But that's just for the square loss; we can apply this machinery to any loss.
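Collecting that into formulas: the pseudo-residuals are the components of the negative gradient of the training objective with respect to the prediction vector, and for the square loss (written with the conventional 1/2) they are exactly the residuals:

```latex
% Objective as a function of the prediction vector f = (f_1, ..., f_n):
J(\mathbf{f}) = \sum_{i=1}^{n} \ell(y_i, f_i),
\qquad
-g_i = -\frac{\partial}{\partial f_i}\, \ell(y_i, f_i) \quad \text{(the pseudo-residual)}

% Square loss \ell(y, f) = \tfrac{1}{2}(y - f)^2:
-g_i = -\frac{\partial}{\partial f_i}\, \tfrac{1}{2}(y_i - f_i)^2 = y_i - f_i
\quad \text{(the ordinary residual)}
```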
So here's the key idea, and the strategy is quite cute. It's all well and good to minimize over the predictions just on the training points, but we want a function that is defined everywhere. So every time we take a step, we get a new set of predictions on just the training points, and what we do is find the function in our base hypothesis space that approximates those predictions as closely as possible. You can think of that as a projection step: you're projecting this set of predictions onto the set of predictions on the training set that you can actually attain with your hypothesis space.

But how would we approximate a set of predictions as closely as possible with a hypothesis space? This is a lot like a regression problem, and that's the next step: find the hypothesis in the base hypothesis space that's closest in the L2 sense, say square loss, to the step direction you want to go in. If we could change the function values directly at each of the training points, the optimal local change would be in the direction -g; that's how we would change the predictions at all the training points to optimally improve the objective. But when we go in that direction, we're going to end up at a set of predictions which may not be attainable by a function in our base hypothesis space: there may be no stump that achieves exactly that step direction. So we find the prediction function in the base hypothesis space that comes as close as we can to the optimal step direction -g.

Why is that so? OK, let's make it concrete. Someone proposes a function f, say f(x) = x squared. Our training points are (0, 1) and (1, 2), so our training x's are 0 and 1. Define, for convenience, f1 = f(0) for training point one and f2 = f(1) for training point two. What is f(0)? Zero. And f(1)? One. Then J(f) is the performance of this function on the two training points: the loss of (1, f(0)) plus the loss of (2, f(1)), where 1 is what we're supposed to predict on the first example and 2 is what we're supposed to predict on the second. We're doing square loss, so that's (1 - 0)^2 + (2 - 1)^2 = 1 + 1 = 2. So that's J(f); is that clear? That's J of a particular function, but we can also think of it as a function of the vector f, the evaluations of f at the two points x1 and x2. Does that help?

Question: so f_i is the output of some given function f on training point x_i? That's correct: for a given function f we create a vector f whose components, indexed by the subscripts, are its evaluations at each of the training points. And when we want to optimize over the function f, that's equivalent to optimizing over the vector f, because this objective function only ever looks at the function at the training points; that's the equivalence. And yes, we are absolutely forgetting the hypothesis space here; let me respond to that first. When we're optimizing here, there's no mention of a hypothesis space: we're just pretending that our hypothesis space is the set of all vectors in R^n, where the vector characterizes the set of predictions we make on the training data. There's no reference to the original base hypothesis space H at this point.
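To summarize where H re-enters: the projection step just described, written as a formula, is a least-squares regression of the pseudo-residuals onto the base hypothesis space, followed by a (shrunken) step:

```latex
h_m \;=\; \operatorname*{arg\,min}_{h \in \mathcal{H}} \;\sum_{i=1}^{n} \big( (-g_i) - h(x_i) \big)^2,
\qquad
f_m \;=\; f_{m-1} \;+\; \nu\, h_m
```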
Shouldn't it be obvious what to plug in for f_i, so why do you need gradients? Well, for certain loss functions, yes: with the square loss you could just take f_i = y_i and you're done. But that's not the case for all loss functions. Remember the logistic loss: it's a margin-based loss and it never reaches zero, so if your class is 1, the best loss you can do is arbitrarily close to zero, but your score would have to be infinity. If you did that method for the logistic loss or something like it, the optimum would have f at plus or minus infinity, which is not so great. So far there's no constraint on f in this method, but what's going to happen is that after every step we project back into a function that's actually in our hypothesis space.

Let's repeat that piece and then see if there are more questions. In every round, we take a gradient step in R^n: we adjust our function's predictions at each of the training points, locally optimally, using the negative gradient direction. Then we say: we actually need a function that's defined everywhere, which will come from our base hypothesis space, so let's find the function in the base hypothesis space that approximates the locally optimal step direction -g as closely as possible. We search for a function h in the base hypothesis space, which gives us a legitimate function defined everywhere on X, and we ask that h(x_i) approximate -g_i at each training point i as closely as possible in the square sense. That is the step we actually take: we add h, or some multiple of h, to the function we're building up. Let's go a few slides more and then ponder some more. This is it for gradient boosting; there are variants, but this is the key idea, so we can stay on it as long as we need.

Here's an additional picture. Again we're in the space of predictions on the training points. There may be an optimal step direction, but it takes us to a set of predictions that's not necessarily achievable by adding something from our base hypothesis space to what we have so far. So we project that step: we find the closest approximation to that step in our base hypothesis space, which may land us over there, and that's the step direction we actually add.

Now let's look at the full algorithm. The last piece is how big a step to take. One version is, once you've chosen your step direction h_m, to search for the step size that minimizes your empirical risk. But more commonly we think of the step size as a shrinkage parameter, a regularization parameter. Taking nu = 1 would be the full gradient step, but often we tune that step size to something like 0.1; it's a tuning parameter, and as we'll see in a little while, the smaller you make it, the better things tend to generalize (not the better they tend to fit, thank you); it just takes more steps. So at each step we add a function to our collection, and the function we add at each step is the best L2 approximation, the best approximation in square error, to the unconstrained gradient step on the values of the predictions that our function makes on the training points.
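Putting the whole loop together, here is a minimal, loss-agnostic sketch of the algorithm just described, again assuming scikit-learn trees as the base regression algorithm; the names `gradient_boost` and `neg_gradient` are mine, and `neg_gradient` is any function returning the pseudo-residuals for your loss (the logistic-loss case is worked out below):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boost(X, y, neg_gradient, n_rounds=100, nu=0.1, max_depth=3):
    """Generic gradient boosting: at each round, compute pseudo-residuals
    (the unconstrained step direction) and project onto the base hypothesis
    space by least-squares regression."""
    trees = []
    f = np.zeros(len(y))                    # f_0 = 0 on the training points
    for _ in range(n_rounds):
        r = neg_gradient(y, f)              # pseudo-residuals, -g
        h = DecisionTreeRegressor(max_depth=max_depth).fit(X, r)
        f += nu * h.predict(X)              # shrunken step in direction h
        trees.append(h)
    return lambda X_new: nu * sum(h.predict(X_new) for h in trees)

# Square loss: the pseudo-residuals are the ordinary residuals.
square_loss_neg_gradient = lambda y, f: y - f
```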
We can go over some examples by hand in our lab session, or how about right now. First, a quick recap of the ingredients for gradient boosting. We have a loss function, and we need to be able to differentiate it with respect to the predictions; usually we differentiate with respect to the parameters of our hypothesis space, but here all we need is derivatives with respect to the predictions. We need a base hypothesis space on which we can do regression, because that's how we fit the gradient as best we can. We choose a number of steps, or some stopping criterion, and we need a way to choose the step size, usually just some fraction of the fitted function added in each step. Then we're good to go.

So let's do an example: BinomialBoost, say. Someone invented this at some point, but it's really just an application of gradient boosting with a particular loss function. We do classification with the logistic loss. Here's the loss function; it's a margin loss, and you can see the margin sitting there in the exponent, y times f(x). So what do we need to do to run gradient boosting on this? Mechanically, what's the step? Exactly: differentiate with respect to the prediction. For any training point (x_i, y_i), we need the partial derivative of this loss with respect to the prediction itself; I could have written that as f_i, like we were doing before.

So let's write it up; you can help me with my differentiation. The loss is log(1 + e^{-y_i f(x_i)}). Differentiating gives 1 over that quantity, times the derivative of the inside. Writing m_i for the margin y_i f(x_i), the inside is 1 + e^{-m_i}; the 1 goes away, we get the exponential again, e^{-m_i}, times the derivative of -m_i with respect to the prediction f(x_i), which is -y_i, since we're differentiating with respect to the prediction. Does that look right? Great. Now we can simplify this a bit to make it a little less busy: multiply numerator and denominator by e^{m_i}; then this piece becomes 1 and the denominator becomes e^{m_i} + 1, giving -y_i / (e^{m_i} + 1). That's the partial derivative with respect to f(x_i). We do that for each i to get the gradient g, and we want to go in the negative gradient direction: the negative gradient is y_i / (e^{y_i f(x_i)} + 1), as a vector over i = 1 to n; that's my shorthand for the gradient. That's our step direction, the unconstrained step direction.
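Here is that derivation written out cleanly, with m_i = y_i f(x_i):

```latex
\ell(y_i, f(x_i)) = \log\!\left(1 + e^{-y_i f(x_i)}\right)

\frac{\partial \ell}{\partial f(x_i)}
  = \frac{-y_i\, e^{-m_i}}{1 + e^{-m_i}}
  = \frac{-y_i}{e^{m_i} + 1}
\quad\Longrightarrow\quad
-g_i = \frac{y_i}{e^{\,y_i f(x_i)} + 1}
```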
Now what? A new ingredient enters the picture: the hypothesis space. Let's just write it down, H. And what do we need to do with that hypothesis space? Fantastic: select the h in H that best fits this negative gradient. Let's minimize the square difference between the predictions of h and the negative gradient: the argmin over h in the base hypothesis space of the sum over i of (-g_i - h(x_i))^2. If we put a 1/n in front, that's like every regression problem we've done, over the space H. So if we know how to do regression over the space H, this is what we do. That gives us our step direction, and then we take some fixed multiple of it, a tuning-parameter multiple, and on we go.

Question: instead of taking just one step in direction -g, why not continue to optimize in the unconstrained space, get something much closer to a minimizer of the unconstrained problem, and only then do the projection? Great question; I've also thought about it. One issue is where you would stop: for the logistic loss, as we said, the unconstrained optimizer has the scores going to plus or minus infinity, so it's not entirely clear what the right target is. Each data point pulls the function in its own direction. Let me draw it: suppose we have a function that looks like this, and some training points x1, x2, x3, x4, where the true labels are positive, negative, positive, positive. Then the gradient asks that this value go up, this one go down, this one go up, and this one go up. It doesn't care what happens to the function overall: this example wants the function value at its point to go up, while x2 wants the value of the function at its point to go down, and for the logistic loss the pointwise optimum would be plus or minus infinity at each of these training points. Is this picture clarifying?

And related: with the square loss, at every step we need to find the element of the base hypothesis space that's closest to the step direction; at every step we do a projection to stay in the space. So the proposal was basically a projected-gradient-type step: take a big step to a point that's very good in the unconstrained space, and then project back. I think the issue is that we don't really know the properties of that. There's some manifold, some surface, of predictions that are achievable by our hypothesis space; if we deviate from it by a little, because we've taken a small step, then the projection makes sense, it takes us to a point we think will still be pretty good, whereas if we go really far away, it's hard to know whether the projection back will make any sense.
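As a usage note, plugging that logistic-loss pseudo-residual into the generic `gradient_boost` sketch from earlier would look like the following (labels assumed to be coded as plus or minus 1; the name `logistic_neg_gradient` is mine):

```python
import numpy as np

# Pseudo-residuals for the logistic loss: -g_i = y_i / (exp(y_i * f(x_i)) + 1).
# Pass this as neg_gradient to the gradient_boost sketch above.
logistic_neg_gradient = lambda y, f: y / (np.exp(y * f) + 1.0)
```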
OK, so almost all successful instances of gradient boosting that I know of are actually gradient tree boosting, where the base hypothesis space is a set of trees constrained in some way or another, by depth, by number of leaf nodes, or something like that. One common form of constraint is the number of terminal (leaf) nodes; stumps would be two terminal nodes. Things have changed a little over the years. The Hastie, Tibshirani, and Friedman book recommends keeping the number of terminal nodes between about four and eight. I remember, earlier than that, people often saying things like: just boost with stumps, that's good enough, that's all you'll ever need. That's definitely not the case; several years later the recommendation was four to eight terminal nodes, trees somewhat less trivial than stumps, and these days, with XGBoost and these sorts of packages, people may build trees that are much bigger than that, 30 or 40 leaf nodes, or correspondingly deep, depending on the shape of the tree. As for software packages: GBM is a very popular one in R, scikit-learn has its versions, GradientBoostingClassifier and GradientBoostingRegressor, and the state-of-the-art packages these days are XGBoost and LightGBM; LightGBM is from Microsoft, and both are open source.
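For reference, here is a minimal usage sketch of the scikit-learn version just mentioned; the particular parameter values are illustrative choices of mine, not recommendations (max_leaf_nodes=8 echoes the four-to-eight guideline, learning_rate is the shrinkage nu, and subsample is the "bag fraction" discussed shortly):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor

X, y = make_regression(n_samples=400, n_features=10, noise=10.0, random_state=0)

model = GradientBoostingRegressor(
    n_estimators=500,     # number of boosting rounds
    learning_rate=0.1,    # shrinkage (step size nu)
    max_leaf_nodes=8,     # tree size: four-to-eight terminal nodes
    subsample=0.5,        # stochastic gradient boosting "bag fraction"
).fit(X, y)
```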
Let's look at another example, to talk about how big the step sizes should be. Here's some data, basically the same example and the same model we had before, where we were adding regression stumps together, and various rounds of the fit up to a hundred steps, in this case taking a step size of one: whatever we got back from the regression was the step we took. Is the connection clear between the steps and these regression functions? In every round we pull one decision stump out of the base hypothesis space H and add it to the fit.

Here's what I'd like to take a look at. For this problem, which was run for step size one, we've now run it for four different step sizes: 1, 0.5, 0.1, and 0.05. On the left is training set performance. As you might expect, the bigger step size fits more quickly; this is error, so lower is better, and that's what we see and what we expect. On the right is out-of-sample performance, on a validation set, and here we do see significant differences between the curves. For step size 1 we go down, not even as far as some of the others, and then start to overfit: performance gets worse. As we use smaller step sizes, things get better. There are two things to note. One is the value of the minimum: for a smaller step size we of course have to run more steps, but eventually we get to better minima, as demonstrated here. Maybe more importantly in practice, the curve stays at the minimum longer, so the minimum is easier to find: here it hits the minimum and bounces back up, but here it stays there a bit longer. So the rule of thumb people have found is that a smaller step size is better, but you have to take more steps; make the step size as small as you have the patience to wait for convergence.

Questions? In each case it was bouncing back up on the test set; intuitively, is that just because we're adding more of the function, and that's what causes it to bounce back up? Well, generically speaking, if I showed you a graph of performance over iterations and it bounced back up, what would you suspect is happening? Let me back up one step: in every step we're adding a new function to our linear combination, so I think it's fair to say that the function we're getting is getting more complex; it's fitting the training data better. The quick thing to say is that we are overfitting, and just as we saw at the very beginning with the decision stumps, if you run long enough it will fit the data, in that example perfectly. You just need to know the right point to stop, and you can use out-of-sample performance for that. How many training points in this example? I'm not sure; I guess you could count them; I think there are four hundred. And yes, I'm sure there's a relationship between training set size and overfitting, certainly.

All right, let me mention a few variations on gradient boosting, and then I think I'll switch topics. First, stochastic gradient boosting, which is actually very useful and fits in well with other things we've discussed. The idea is that in every round, instead of using your full training set, you use a random subset of it. If you look at the original paper on stochastic gradient boosting, by Friedman, he says: use half your data; the term is "bag fraction". The question is why you might want to do that. The original explanations were things like preventing overfitting: by using less data, it's harder to fit the data, since you're not looking at all of it in each of your steps, so it regularizes by making fitting more difficult. Another reason is speed: you have less data to deal with, so using half the data, a linear-time algorithm should go twice as fast, maybe faster than that depending on the algorithm. The way I like to think about it in our context is that this is basically a minibatch method: the true step direction, the true g, would use all your training data, and you're estimating it with a subset. Although if that's how we're viewing it, guidelines like "50% of the dataset size" no longer seem so sensible: the way we argued about minibatch sizes, the size of the minibatch should have nothing to do with how much training data you have; it should have to do with, maybe, how difficult it is to estimate the right step direction. So I'll leave that as a direction for investigation: if you have a very large training set, do you really want to use 50 percent of your training data to choose each step? I'd argue that, at the end of the day, it shouldn't have anything to do with the size of your training set; it should have to do with your base hypothesis space (maybe you need a certain number of training points to fit a tree of a certain size) and with the function you're trying to fit. That's my hypothesis; it could be fun to check out.
So that was randomly selecting a subset of the training data at every step. Then, just as we did for random forests, people have introduced the idea of randomly sampling features, columns. This wasn't part of the original proposal, but why not mix the idea in; XGBoost supports this feature, for instance, and they claim a lot of people find it even more useful than stochastically sampling the rows, the examples. I know a Kaggle competitor, and I asked him what he uses; he said it sometimes helps, sometimes not, and that a feature subsample rate somewhere between 20 and 100 percent is what he finds most useful. So it's an extra tuning parameter you can play with. Any questions? All right, let's take a short break; I want to change topics.

A question: can you make a base hypothesis space that mixes various types of things, like trees and other sets of prediction functions, and in every round find the best fit among each set and take the best of the best? I've never seen that; it's an interesting idea. I guess you might be concerned that it's a pain for a tree-based method to fit what's essentially a linear trend, so you might want to get rid of that first, perhaps in a preprocessing step. I haven't seen this kind of heterogeneous base hypothesis space; I've more often seen people fit models with different base hypothesis spaces and then ensemble them together with some combination. An interesting idea.

You're asking: why not jump straight to the minimizer of the unconstrained problem? I don't know of anyone who does that, but one thing people do is take a Newton step instead of a gradient step, a Newton step direction. A Newton step direction comes from finding the best second-order fit to your function, the quadratic fit tangent at your current point, and jumping straight to the minimizer of that quadratic, or stepping in the direction of its minimizer. That's the step direction used by XGBoost and a few other gradient boosting algorithms; it's one difference from what we've presented. XGBoost also introduces some different regularization on the size of the tree, and the way they build the tree is more integrated with the objective function. One thing to note about gradient boosting as we've presented it is that the regression step, choosing the h in the base hypothesis space that fits -g as well as possible, is a black-box regression step: you can take any regression algorithm you want and do gradient boosting with it. XGBoost opens up that black box: the way they build the tree depends more directly on the objective function. You should be able to read their slides and get the hang of it.
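A quick illustrative sketch of those knobs in the XGBoost package; the parameter values are arbitrary examples of mine, colsample_bytree is the feature subsample rate discussed above, and XGBoost's second-order, Newton-style objective is what the library uses internally:

```python
import xgboost as xgb

model = xgb.XGBRegressor(
    n_estimators=500,
    learning_rate=0.1,     # shrinkage
    max_depth=6,           # larger trees, as modern packages allow
    subsample=0.5,         # row subsampling ("bag fraction") per round
    colsample_bytree=0.8,  # column/feature subsampling per tree
)
# model.fit(X, y) with your training data
```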
Info
Channel: Inside Bloomberg
Views: 16,931
Rating: 4.7869821 out of 5
Keywords: machine learning, machine learning jobs, applications of machine learning, David Rosenberg, machine learning engineering jobs, bloomberg careers, mathematics, python, Gradient Boosting
Id: fz1H03ZKvLM
Length: 84min 35sec (5075 seconds)
Published: Wed Jul 11 2018