Why multicollinearity is a problem | Why is multicollinearity bad | What is multicollinearity

Video Statistics and Information

Captions
Welcome to Unfold Data Science, friends. My name is Aman and I am a data scientist. One of my subscribers, Mr. Sanjeev, has heard that highly correlated features should be removed before fitting a machine learning model, and he is asking me to explain why. Thanks, Mr. Sanjeev, for the question; today we will cover the answer in detail. This is a favourite interview question for many interviewers. When you explain your data science project, you might say "I took the data, did some feature engineering, removed some correlated features", and that is where the interviewer might stop you and ask: can you tell me why multicollinearity is a problem? So we will understand in detail what multicollinearity is, why it is a problem, and what the ways to remove it are. Let us start one by one.

First of all, what is multicollinearity? Multicollinearity is a scenario in which two of your independent variables are highly correlated. And what is correlation? Two variables are correlated if they are strongly related to each other. For example, say you capture employee data of an organization: in one column you put the age of the employee, and in another column you put the number of years of experience. It is very likely that as age increases, the number of years of experience also increases; these two variables are said to be highly positively correlated. There can be negative correlation as well. For example, take the age of a person and an imaginary variable, the number of years left to retire. As age increases, the number of years left to retire decreases; that is a negative correlation. If two variables in your data have either kind of correlation, negative or positive, you have a multicollinearity problem.

Why is multicollinearity a problem? Before that, let us understand why multicollinearity appears in the data in the first place. Let me write a simple equation. Assume we are selling this marker:

sales = 10 + 0.8 * ad_budget + 0.3 * prod_quantity

Just a simple linear regression equation. How many independent variables do we have? Two: ad budget and production quantity. How many target variables? One: sales of the marker. Now imagine a scenario where, while capturing the data, we record the ad budget and we also record the TV ad budget, so one more variable appears:

sales = 10 + 0.8 * ad_budget + 0.3 * prod_quantity + 0.1 * tv_ad_budget

The TV ad budget is a component of the total ad budget, hence these two variables are highly correlated. This situation in a regression model is called a multicollinearity scenario. (It is really a problem for regression models specifically; I will tell you how and why in a moment, but first understand why it occurs.) This is one way it can occur: during data capture you have collected a duplicate variable, or duplicate information. This is called data-related multicollinearity. Another kind of multicollinearity is structure-related. For example, say I am doing some feature engineering on this data and I add another variable, production quantity squared, a new variable prod_sq. This new variable is derived from the prod_quantity variable, so these two variables are again highly correlated, and this time the correlation has come from the structure of the data.
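To make the two flavours concrete, here is a minimal sketch (my own illustration, not code from the video) that builds a small synthetic dataset along the lines of the sales example above and prints its correlation matrix. The column names, numbers, and random data are all assumptions for demonstration.

```python
# Sketch only: synthetic data mimicking the two sources of multicollinearity
# described above (duplicate information and an engineered feature).
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 100

tv_ad_budget = rng.uniform(10, 50, n)        # spend on TV ads
other_ad_budget = rng.uniform(5, 20, n)      # spend on everything else
ad_budget = tv_ad_budget + other_ad_budget   # total budget *contains* the TV budget
prod_quantity = rng.uniform(100, 500, n)     # production quantity
prod_sq = prod_quantity ** 2                 # engineered feature derived from prod_quantity

df = pd.DataFrame({
    "ad_budget": ad_budget,          # data-related multicollinearity with tv_ad_budget
    "tv_ad_budget": tv_ad_budget,
    "prod_quantity": prod_quantity,
    "prod_sq": prod_sq,              # structure-related multicollinearity with prod_quantity
})

# Pairwise Pearson correlations: ad_budget vs tv_ad_budget and
# prod_quantity vs prod_sq both come out close to 1.
print(df.corr().round(2))
```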
So it is quite possible that while doing feature engineering or data preprocessing we create variables like this ourselves. Those are the ways in which you end up with a multicollinear scenario in your data: data-related and structure-related.

Now, why is it a problem? Let us go back to the basic purpose of a regression model, be it logistic regression or linear regression, any regression: we want to understand how each of these variables impacts the target variable individually. This is very important, so I will repeat it. When you write y = m1*x1 + m2*x2 + c, what you are saying is that for every unit increase in x1, y shifts by m1, keeping all other variables constant. That is the interpretation of m1. If I ask you the interpretation of m2, you will tell me that for every unit shift in x2, y shifts by m2, keeping the other variable constant. These coefficients come out of the linear regression model. Now think about it practically, with common sense: suppose I want to know how my ad budget is impacting sales, and the TV ad budget is also in the data. Do you think that if I change the TV ad budget, the total ad budget will remain constant? Not possible, right? Hence the estimation of these coefficients gets impacted badly, and the coefficient values will not be reliable.

I want you to do a small exercise here. Create a dataset with x1 and x2: in x1 put the values 1, 2, 3, and in x2 put the values 2, 4, 6. Put something in y, any values you want, say 10, 12, 14. Create this simple data in Python and run a regression model with both variables; see what your coefficients and p-values look like. In a second run, drop one variable, say x1, and run the model with only y and x2; see how your p-value and coefficient look now. You will get the answer to how multicollinearity screws up your coefficients. If, for whatever reason, the coefficients are screwed up in a regression model, then that model is of no use. You and I both know that the reason we go for a regression model is that we want to understand the impact of each variable on the target individually; if we cannot read that impact off the coefficients, the purpose of the regression model is not fulfilled. That is why a regression model should not have multicollinearity in the data, and this applies to both positive and negative correlation.
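Here is a rough sketch of that exercise in Python (my own code, not from the video). One caveat: with only three rows and x2 exactly twice x1, the design matrix is singular and there are no residual degrees of freedom, so this sketch assumes a few more rows and makes x2 only nearly twice x1; the effect on coefficients and p-values is the same in spirit.

```python
# Sketch of the exercise: fit a regression with two nearly collinear features,
# then refit after dropping one of them, and compare coefficients and p-values.
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 30

x1 = np.arange(1, n + 1, dtype=float)          # 1, 2, 3, ...
x2 = 2 * x1 + rng.normal(0, 0.01, n)           # almost exactly 2 * x1
y = 10 + 2 * x1 + rng.normal(0, 1.0, n)        # some target values

# Run 1: both correlated predictors in the model.
X_both = sm.add_constant(pd.DataFrame({"x1": x1, "x2": x2}))
res_both = sm.OLS(y, X_both).fit()
print(res_both.params.round(3))    # x1 and x2 coefficients take odd, unstable values
print(res_both.pvalues.round(3))   # and their p-values tend to look insignificant

# Run 2: drop x1 and keep only x2.
X_single = sm.add_constant(pd.DataFrame({"x2": x2}))
res_single = sm.OLS(y, X_single).fit()
print(res_single.params.round(3))  # the x2 coefficient is now stable and interpretable
print(res_single.pvalues.round(3)) # and clearly significant
```

If you run something like this, you should see the standard errors (and hence the p-values) of x1 and x2 blow up in the first model and settle down once the duplicate column is dropped.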
Now, what do we do if we have this kind of data? There are two or three ways. One way, if you have a limited number of features in your data, say 20 or 30, is to create a correlation matrix. A correlation matrix gives you every variable's correlation value with every other variable: for example x1, x2, x3 against x1, x2, x3, with values like 0.8 and 0.9. Then you can put a threshold, saying that for any pair correlated above 0.9, remove one of the two variables. For example, in the sales case I can drop one of the ad budget variables completely from my analysis, because their correlation value is high. That is one way, when you have a limited number of variables. And one question I have for you here, which you have to answer in the comments: say there are two variables x1 and x2 in the data and they are highly correlated; we have discussed that one of them should be taken out, but which one? That is my question to you.

Moving ahead, that was the first way. The second way is a slightly more advanced family of regression techniques known as lasso and ridge regression. In these regressions the model penalizes you for the duplicate information and shrinks the coefficients (there is a small ridge sketch at the end of these captions). I can explain in detail how these regressions work; let me know in the comments if you want me to do that.

So let me just reiterate what we learned in this video. What is multicollinearity? A phenomenon where two independent variables in the data are highly correlated. In what different ways can multicollinearity occur? Data-related and structure-related. How do we tackle it? If you have a limited number of features, put a threshold on the correlation matrix and remove one variable from each correlated pair (which one, you have to answer me in the comments); otherwise go for advanced regression techniques like lasso and ridge regression. And remember, the whole idea of a regression model is to get these coefficients; they are the most important output of a regression model, so we cannot let them get spoiled by anything. I hope you understood what multicollinearity is, why it is a problem, and how to overcome it. I also gave you one small assignment; it will take just five minutes, so do the hands-on and see how your coefficients change. That will give you an idea of what happens. Let me know what doubts or comments you have, guys. I will see you all in the next video. Till then, wherever you are, stay safe and take care.
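As promised above, here is a minimal ridge sketch (my own illustration, not code from the video, with made-up data) showing how the penalty shrinks and stabilizes coefficients when two features carry duplicate information.

```python
# Sketch only: compare plain OLS and ridge coefficients on two near-duplicate features.
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(1)
n = 200

x1 = rng.normal(0, 1, n)
x2 = x1 + rng.normal(0, 0.01, n)      # near-duplicate of x1
y = 3 * x1 + rng.normal(0, 1, n)      # only x1 truly drives y

X = np.column_stack([x1, x2])

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)    # alpha controls the penalty strength

# Plain OLS splits the shared effect erratically between the two correlated columns.
print("OLS coefficients:  ", ols.coef_.round(2))
# Ridge penalizes large coefficients, so the duplicated effect is shared
# as a smaller, more stable pair of values (roughly half of 3 each here).
print("Ridge coefficients:", ridge.coef_.round(2))
```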
Info
Channel: Unfold Data Science
Views: 34,534
Keywords: Why multicollinearity is a problem, Why is multicollinearity bad, What is multicollinearity, Problems with multicollinearity, Multicollinearity in regression model, multicollinearity in multiple regression, multicollinearity in spss, multicollinearity in excel, regression multicollinearity, regression multicollinearity problem, regression multicollinearity vif, unfold data science, problem with multicollinearity, multicollinearity
Id: ekuD8JUdL6M
Length: 10min 45sec (645 seconds)
Published: Wed Mar 10 2021