Gradient Boosting Complete Maths In-Depth Intuition Explained | Machine Learning - Part 2

Video Statistics and Information

Captions
Hello all, my name is Krish Naik, welcome to my YouTube channel. Guys, we will continue our discussion on the gradient boosting algorithm, and now we will try to understand the whole maths behind it and how it actually works. This is the whole pseudo-algorithm that I have written. I know this algorithm looks complicated, so what I will do is break it down step by step and show you what each step actually means. If you want the full pseudo-algorithm, I have given the link to the Wikipedia page in the description of this video, so you can check it out there. In my previous video on gradient boosting we discussed how it works intuitively; now we will follow the mathematics.

First of all, I have a dataset here: experience, degree, and salary, with three records. My independent features are experience and degree, and my output feature is salary.

Now, what are the inputs required for the gradient boosting algorithm? The first input is the data itself, the independent and dependent features: experience and degree are my independent features, salary is my dependent feature. The second input is a loss function. The loss function can differ between regression and classification problems: for regression I can use mean squared error or root mean squared error, and the one condition is that the loss function must be differentiable, i.e. you should be able to compute its derivative. For classification we have different loss functions; the first one I would point to is log loss, and there are others such as hinge loss. You can provide any suitable loss function for your problem statement. The third input is the number of trees M you want in the gradient boosting ensemble. So these three things are provided as inputs.

Now let us go through the pseudo-algorithm. The first step is to initialize the model with a constant value; we need to create the base model, the first model. I also told you in my previous explanation of gradient boosting that we take the average of all the salary values and that becomes the output of the base model, but here we need to apply a formula, so let us understand what this equation actually says. The initialization is written as

    F_0(x) = argmin over gamma of sum(i = 1 to n) L(y_i, gamma)

where gamma is the predicted value; remember this, guys, gamma is nothing but the predicted value, which we usually write as y_hat. Let me define the loss function: I am using a regression loss function, L = 1/2 * (y - y_hat)^2. I hope everybody is familiar with this loss function. So in the first step we need to find a y_hat value such that this summed loss becomes as small as possible. What does y - y_hat indicate? It is the residual, the error, so we want a minimal error: we need to find the y_hat that minimizes this whole loss.
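As a rough Python sketch of these inputs (the experience and degree values below are invented purely for illustration; only the three salaries 50, 70, 60 come from the video's example):

```python
import numpy as np

# Toy dataset: experience and degree are the independent features,
# salary is the dependent feature. Feature values are placeholders;
# only the salaries 50, 70, 60 come from the worked example.
X = np.array([[2, 0],
              [5, 1],
              [3, 1]])            # [experience, degree]
y = np.array([50.0, 70.0, 60.0])  # salary

def squared_error_loss(y_true, y_pred):
    # The differentiable regression loss used here: 1/2 * (y - y_hat)^2
    return 0.5 * (y_true - y_pred) ** 2
```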
Now, how can we find it? Let me expand the summation for our three records. The actual salary values are 50, 70, and 60, so the total loss is

    1/2 * (50 - y_hat)^2 + 1/2 * (70 - y_hat)^2 + 1/2 * (60 - y_hat)^2

To find the y_hat that minimizes this, I take the first-order derivative. Recall what a first-order derivative means: if I differentiate x^n with respect to x, the result is n * x^(n-1). I hope you learned this in your college days. Similarly, we take the first-order derivative of this loss with respect to y_hat, because we need to find the y_hat (or gamma) value that minimizes the loss; this is exactly what we do in gradient descent, finding the slope by taking the derivative of the loss.

Differentiating each term with respect to y_hat: the 2 coming down from the power cancels the 1/2, and since 50 is a constant the derivative of (50 - y_hat) is -1, so the first term gives (50 - y_hat) * (-1); similarly the second gives (70 - y_hat) * (-1) and the third gives (60 - y_hat) * (-1). Multiplying the -1 through:

    -50 + y_hat - 70 + y_hat - 60 + y_hat = -180 + 3 * y_hat

Setting this to zero gives 3 * y_hat = 180, so y_hat = 180 / 3 = 60. So we found a very easy way: given this loss function, we minimize it by choosing a y_hat (or gamma) through a first-order derivative, and the output is 60, which is exactly the mean of 50, 70, and 60. This 60 is the y_hat of the base model. You can see it very clearly: we have initialized the model with a constant value, and step one is done.
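A quick numeric check of this derivation, continuing from the snippet above (with squared-error loss the minimizing constant is simply the mean of the targets):

```python
# The closed-form answer from the derivation: the mean of the targets.
gamma = y.mean()
print(gamma)  # 60.0

# Sanity check: 60 gives a lower total loss than nearby candidates.
for candidate in (55.0, 60.0, 65.0):
    print(candidate, squared_error_loss(y, candidate).sum())
# 55.0 -> 137.5, 60.0 -> 100.0, 65.0 -> 137.5
```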
So we have initialized the model with a constant value, and we saw that the output of this base model F_0 is 60; I have written y_hat = 60 here. If you have not understood, just pause and revise before continuing. Now let's go to the second step: iterate m = 1 to M. What is M? It is the number of trees; note we iterate from 1 to M, not 1 to n. And remember, this pseudocode is taken from the same Wikipedia page, guys; I have just written the same thing, so you can definitely refer there.

For each m, the first sub-step, 2(a), is to compute the pseudo-residuals, which basically means the pseudo-errors:

    r_im = - [ dL(y_i, F(x_i)) / dF(x_i) ], evaluated at F = F_(m-1)

Again this may look complicated, but it is very simple: it is the derivative of the loss L(y, F(x_i)) with respect to the previous model's output. In the first iteration, that previous model F(x) is nothing but the base model F_0(x). Let us compute it with our loss function. Differentiating L = 1/2 * (y - y_hat)^2 with respect to y_hat gives 2/2 * (y - y_hat) * (-1); the 2s cancel, so dL/d(y_hat) = -(y - y_hat). Moving the minus sign to the other side, -dL/d(y_hat) = y - y_hat. And that minus sign is exactly the one in the pseudo-residual formula: the pseudo-residual is the negative derivative of the loss with respect to the previous model's prediction, which for this loss is simply y - y_hat. So what are we really doing here? We are just calculating the error by subtracting the previous model's prediction from the actual salary: F_(m-1)(x_i) is the previous model's output, which is where the y_hat comes from, and the residual is y minus that y_hat.
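Continuing the sketch, sub-step 2(a) in code (for squared-error loss the negative gradient is the plain residual):

```python
# Step 2(a): the pseudo-residual is the negative gradient of the loss
# with respect to the current prediction; for 1/2 * (y - y_hat)^2 it is
# simply y - y_hat.
F0 = np.full_like(y, gamma)   # the base model predicts 60 for every record
residuals = y - F0
print(residuals)              # [-10.  10.   0.]
```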
So for the first tree (m = 1), writing the residuals for the three records as r_11, r_21, r_31: r_11 = 50 - 60 = -10; r_21 = 70 - 60 = 10; r_31 = 60 - 60 = 0. These are my residuals; always remember that. I hope you have understood this.

Once we get the residuals, we go to the next sub-step, 2(b), and create a decision tree where my dependent feature is now the residual value and my independent features are still experience and degree. I will train a decision tree regressor with the residuals as the target and the features as the inputs. That is exactly what the step says: fit a base learner h_m(x) where the input is x_i and the output, the dependent feature, is r_im, which is nothing but these three residual values for the three records. Pretty simple, pretty easy, guys, right? So we have come this far: we initialized the base model, found the residuals, and fit the base learner. Understand that step two is one main step containing several sub-steps, and these are its sub-steps; again, if you have not understood, revise before going on.

Now let us understand what the next equation, sub-step 2(c), says. Again it looks complicated, but just compare it with the initialization step: the equation is almost the same, except that instead of plain gamma we now have F_(m-1)(x_i) + gamma inside the loss:

    gamma_m = argmin over gamma of sum(i = 1 to n) L(y_i, F_(m-1)(x_i) + gamma)

F_(m-1)(x_i) is the previous model's output; remember that. Suppose after my base model I have created the first decision tree on the residuals. This function says we have to find the gamma (which again plays the role of y_hat) that minimizes this loss. I can write the loss as sum(i = 1 to n) of 1/2 * (y_i - (F_(m-1)(x_i) + gamma))^2 (sorry, I had forgotten to write the square earlier). Since my previous model is the base model, F_(m-1)(x_i) is nothing but 60, so I can write 60 there, plus gamma. So this becomes my equation, and I need to minimize it by following the same first step: take a first-order derivative and solve for the value. Whatever I did in step one gets repeated here in the same way; the only thing added is the previous model's value, and for later iterations the outputs of all the previous trees get added as well. And then you will see that this gamma value varies from record to record; it will not be the same (a code sketch of these two sub-steps follows below).
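A sketch of sub-steps 2(b) and 2(c), assuming scikit-learn's DecisionTreeRegressor as the weak learner (the video only says "a decision tree regressor"; the depth setting here is an illustrative choice):

```python
from sklearn.tree import DecisionTreeRegressor

# Step 2(b): fit a weak learner on the independent features with the
# pseudo-residuals (not the raw salaries) as the target.
h1 = DecisionTreeRegressor(max_depth=2, random_state=0)
h1.fit(X, residuals)

# Step 2(c): for squared-error loss, the optimal gamma in each leaf is the
# mean of the residuals in that leaf -- exactly the value a regression tree
# stores, so h1's predictions already include the gamma step.
print(h1.predict(X))  # on this tiny dataset the tree recovers [-10., 10., 0.]
```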
Now the last sub-step, 2(d), is updating the model, and it is pretty simple:

    F_m(x) = F_(m-1)(x) + alpha * h_m(x)

Just understand it this way. Initially I have my base model, and based on its residuals and the input features I created a decision tree. If I pass the first record to the base model, the output is 60; then I add the decision tree, which was trained on the independent features and the residual errors, and it also gives me an output, which for the first record is -10. But in this formula there is also a value alpha, the learning rate, which is usually selected between 0 and 1. Suppose I select 0.1: then the updated prediction is 60 + 0.1 * (-10) = 60 - 1.0 = 59.0. So when I pass my first record through the base model plus the first additive decision tree, I get 59.0, but there is still a difference of around 9 from the actual value 50, so we add more additive models; as we repeat with multiple decision trees, this iteration keeps going and the prediction keeps improving. For the second record, the base model again gives 60 and the additive tree gives 10, so with the learning rate I get 60 + 0.1 * 10 = 61. For the third record the residual is 0, so my output stays 60.

So this was all about the pseudo-algorithm, guys. I know it looks complex, but if you have followed my previous lecture step by step, I think you will be able to follow everything. So yes, this was all about this particular video; I hope you liked it. Please do subscribe to the channel if you have not already. Have a great day, thank you one and all, bye-bye.
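Putting the whole walkthrough together, a minimal from-scratch sketch of the pseudo-algorithm for squared-error loss, reusing the X and y from the earlier snippets (the sklearn tree as weak learner is an assumption; this is a teaching sketch, not a library implementation):

```python
from sklearn.tree import DecisionTreeRegressor

def gradient_boost_fit(X, y, n_trees=3, learning_rate=0.1):
    F = np.full(len(y), y.mean())              # step 1: constant base model
    trees = []
    for m in range(n_trees):                   # step 2: iterate m = 1..M
        residuals = y - F                      # 2(a): pseudo-residuals
        h = DecisionTreeRegressor(max_depth=2, random_state=0)
        h.fit(X, residuals)                    # 2(b)+2(c): tree on residuals
        F = F + learning_rate * h.predict(X)   # 2(d): additive update
        trees.append(h)
    return F, trees

F, _ = gradient_boost_fit(X, y, n_trees=1)
print(F)  # [59. 61. 60.] -- the hand-computed values after one tree
```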
Info
Channel: Krish Naik
Views: 52,766
Keywords: data science tutorial javatpoint, data science tutorial python, data science tutorial online free, python data science tutorial pdf, python data science tutorial point pdf, what is data science, data science tutorial tutorials point, data science course, nlp tutorial python, natural language processing python
Id: Oo9q6YtGzvc
Length: 17min 47sec (1067 seconds)
Published: Wed May 13 2020