Gradient Boosting In Depth Intuition- Part 1 Machine Learning

Video Statistics and Information

Captions
Hello, my name is Krish and welcome to my YouTube channel. Today we'll be discussing gradient boosting. This algorithm is one of the most requested by my subscribers, so I will definitely explain it, and the topic will be divided into four parts, because we need to understand it through equations and mathematical formulas. In the first part we'll understand intuitively how gradient boosting works; in the second part we'll go through the pseudo-algorithm behind it, which will definitely be very helpful when you go for interviews. The first and second parts are with respect to a regression problem statement; after that I will go ahead with the classification problem statement. You should understand this properly because, among the ensemble techniques (random forest, AdaBoost, gradient boosting, extreme gradient boosting), these are some of the best algorithms that I use to solve a lot of problem statements, and this is also one of the interviewers' favourite topics.

Gradient boosting is a boosting technique. If you have seen my complete machine learning playlist, I have discussed both bagging and boosting there: under bagging I covered random forest, and under boosting I covered AdaBoost. Before going ahead with this video you need a proper understanding of decision trees and how they work; without that it will be very difficult to follow.

Now let us see how gradient boosting works. I have a dataset whose independent features are experience and degree, and whose dependent feature is salary. For example, the first record has experience of two years and degree B, and there are four records in total, with salaries 50, 70, 80 and 100.

Step 1 is to create the base model, which gives a single output: the average of all the salary values, (50 + 70 + 80 + 100) / 4 = 75. Remember, whichever record I pass through this base model, the output will always be 75; so the predicted value y-hat is 75 for every record.

Step 2 is to compute the residuals, which basically means the errors. I will also call them pseudo-residuals, because in general they are computed through a loss function: in regression you have loss functions such as mean squared error and root mean squared error, and in classification you have log loss, hinge loss and many more. In this video I will keep the loss as simple as possible and just subtract the predicted value from the actual value; in the next video, on the pseudo-algorithm, I will use the exact loss functions and formulas. Subtracting gives me a new column R1, my first set of residuals: 50 - 75 = -25, then -5, +5 and +25.
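A minimal sketch of these first two steps in Python (the video shows a table, not code; the numeric encodings of experience and degree below are made up for illustration, only the four salaries come from the example):

```python
import numpy as np

# Toy dataset from the example. The feature encodings are hypothetical;
# only the four salary values come from the video.
X = np.array([[2, 0], [3, 1], [4, 0], [5, 1]])  # experience, degree
y = np.array([50.0, 70.0, 80.0, 100.0])         # salary

# Step 1: the base model predicts a single constant, the mean of the target.
base_prediction = y.mean()        # (50 + 70 + 80 + 100) / 4 = 75.0

# Step 2: pseudo-residuals, here simply actual minus predicted.
r1 = y - base_prediction          # [-25., -5., 5., 25.]
```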
Step 3 is the important one. After the base model I add one decision tree, sequentially. The inputs to this decision tree are the same independent features, experience and degree, but its target is not the salary; it is the residual column R1. So I now have my base model followed by a decision tree trained on the same inputs with the residuals as the output.

Now, one important point: since the decision tree was trained on residuals, it predicts residuals. If I pass the independent features through it, I get predicted residuals; suppose for the four records they come out as -23, -3, +3 and +20.

I still have not said how to compute the salary from this, so let us see. Suppose I pass the first record through both models, the base model and decision tree 1, which are attached sequentially. The base model gives 75, as it does for every record, and decision tree 1 gives a predicted residual of -23. Adding them, 75 + (-23) = 52, which is very, very near the actual salary of 50. So do we think this model is performing well, that it is doing a wonderful job?
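Step 3 can be sketched the same way, assuming scikit-learn's DecisionTreeRegressor as the tree (the video does not name a library):

```python
from sklearn.tree import DecisionTreeRegressor

# Step 3: fit a decision tree on the same inputs, but with the residuals
# r1 as the target instead of the salary.
tree1 = DecisionTreeRegressor(max_depth=2, random_state=0)
tree1.fit(X, r1)

# The tree predicts residuals, so adding its raw output to the base value
# gives the unshrunk prediction: the overfitting-prone version discussed next.
naive_prediction = base_prediction + tree1.predict(X)
```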
The answer is no. This is an overfitting problem: with a single decision tree I am already reproducing values very, very near the training salaries. If the tree had predicted the residual exactly, -25, I would have got 75 - 25 = 50, a perfect fit on the training data, which is actually bad, because for any model we create we want a generalized model with low bias and low variance. Here I have low bias but high variance: when new test data comes in, the errors will be much larger.

To prevent this, when we add the tree's predicted residual we multiply it by a learning rate α, which ranges between 0 and 1. Suppose α = 0.1: the prediction becomes 75 + 0.1 × (-23) = 75 - 2.3 = 72.7. So I get about 72.7, which still has a large difference from the actual salary of 50. To close that gap, we add one more decision tree, trained on the new residuals R2 (the actual salary minus the current prediction) with the same independent features, and that tree computes my next residuals, and so on, one tree after another.

So I can write a general formula. Call the base model h0(x). After adding the first decision tree the model becomes F(x) = h0(x) + α1·h1(x), where h1(x) is the output of decision tree 1. Adding one more tree gives F(x) = h0(x) + α1·h1(x) + α2·h2(x), and so on up to αn·hn(x). Finally I can write it as

F(x) = h0(x) + Σ (i = 1 to n) αi·hi(x)

This is my final output, and I'll discuss it in more depth in the pseudo-algorithm video. Just understand what we are doing: after one base model we keep creating sequential decision trees from the residuals, and you will see that the residuals keep decreasing; for a record they might go from -25 to -23, then -20, -19, -18, and so on towards -15, -10, as trees are added. Our main aim is to reduce this residual error. When new test data comes, we pass it through the base model and through all of those decision trees and add everything up in this way: the base value, plus α1 times the first tree's output, plus α2 times the next, and so on; each term may be negative or positive. The α values are usually decided with the help of hyperparameter tuning, and again they lie between 0 and 1.

I hope you have got the idea of how gradient boosting works, and why we call it a sequential, boosting technique: after the base model we add decision trees sequentially, boosting the base model with the help of the residual values that get computed as we go.
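Putting it together, a minimal sketch of the whole sequential loop, assuming a single fixed learning rate for every tree (the video treats α as a hyperparameter to be tuned):

```python
# Sequential boosting loop: F_i(x) = F_{i-1}(x) + alpha * h_i(x).
alpha = 0.1
trees = []
F = np.full_like(y, base_prediction)   # F_0(x): the base model's constant output

for i in range(20):
    residuals = y - F                  # pseudo-residuals of the current model
    tree = DecisionTreeRegressor(max_depth=2, random_state=i)
    tree.fit(X, residuals)
    F = F + alpha * tree.predict(X)    # shrink each tree's contribution
    trees.append(tree)

print(y - F)   # residuals shrink toward zero as trees are added

# New test data passes through the base model and every tree:
def predict(X_new):
    return base_prediction + sum(alpha * t.predict(X_new) for t in trees)
```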
I hope you understood this particular video. In my next video I will discuss the pseudo-algorithm, and if you want a head start, I have given the link to the Wikipedia article in the description of this video; go over there and have a look. This is how the whole of gradient boosting works, and next I will walk through it mathematically: how we come up with the base model, how we calculate the average value, how we compute the residual values, everything. These steps are simply repeated until we have some n number of trees. I hope you liked this video; please do subscribe to the channel if you have not already, and I'll see you all in the next video. Have a great day, thank you one and all, bye-bye.
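For reference, the sequential scheme sketched above is what scikit-learn packages as GradientBoostingRegressor; a short usage sketch, shown only for comparison (the video does not cover this API):

```python
from sklearn.ensemble import GradientBoostingRegressor

# n_estimators trees added sequentially, each scaled by learning_rate,
# on top of a constant initial prediction (the mean, for squared-error loss).
model = GradientBoostingRegressor(n_estimators=100, learning_rate=0.1, max_depth=2)
model.fit(X, y)
print(model.predict(X))
```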
Info
Channel: Krish Naik
Views: 82,050
Rating: 4.9286776 out of 5
Keywords: data science tutorial javatpoint, data science tutorial python, data science tutorial online free, python data science tutorial pdf, python data science tutorial point pdf, what is data science, data science tutorial tutorials point, data science course, ensemble technique, gradient boosting, extreme gradient boosting
Id: Nol1hVtLOSg
Length: 11min 20sec (680 seconds)
Published: Mon May 11 2020