Decision Tree Algorithm in Machine Learning Python – Predicting Churn Example

Video Statistics and Information

Captions
What's up friends, it's Janice here, and welcome to my channel. This video is part one of the machine learning decision trees series, where we're going to show how to run decision tree, random forest and extreme gradient boosting (XGBoost) machine learning models in Python. A quick overview of what we're going to cover: first we go through the libraries we need, then through the machine learning process; then I'll show you how to load data from SQL or from an Excel/CSV file into Python, and then we go through the data pre-processing phase. After that I'll show you how to run decision trees and the different ways of evaluating them. Then we cover how to run random forest models and how to tune their hyperparameters in order to improve our results, and we do the same for the extreme gradient boosting model: how to run it and how to tune its hyperparameters. After we finish with our models, we'll pretend it's a business scenario where we receive new, unseen raw data and want to make predictions on it: we'll predict whether each customer is going to churn or not, and the probability of that customer churning. At the end, I'll show you how to export our model predictions and deploy them in a Power BI dashboard, which I'll build from scratch. In it the business can see how many new customers they have in that unseen raw data, the potential churns (the number of people we estimate are going to churn), the variables affecting the people who are going to churn, and a customer details table showing everyone predicted to churn and their details. For example, we predict that this person here is going to churn with a probability of 74 percent, so the business can open this dashboard, filter all the people we predict are going to churn, and then work on those people to retain them and not lose them. Before we start this tutorial: if you're new to my channel and you're passionate about data science, please consider liking this video, subscribing to my channel and enabling notifications for future videos.

Right, the first thing we need to do is load the packages we need. For this tutorial I'm using os, numpy, pandas, matplotlib and seaborn as the main ones, plus a few libraries from sklearn. Make sure you have all of these; you can copy and paste them from my GitHub repository, and if you're missing any of them, all you have to do is install the library. For example, if I copy, paste and run an install command here, it tells me the requirement is already satisfied, because I've already installed that library.

Moving on, we have the machine learning process, which is the process we're going to follow for this tutorial: we start with the problem formulation, load our raw data, do some data pre-processing, split our data, run our model, evaluate it, apply some hyperparameter tuning, and then repeat this loop until we're happy with our final model.
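For reference, here is a minimal sketch of the imports the rest of the tutorial relies on; the exact list lives in the GitHub repository and may differ slightly.

import os

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree                      # tree.export_graphviz, used later for visualization
from sklearn.metrics import confusion_matrix
import graphviz                               # renders the exported tree

If any of these are missing, a plain "pip install <package>" is enough, for example: pip install seaborn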
Right, on to the problem formulation: what are we trying to solve? Imagine that we work in a bank, and our manager has asked us, the data scientists, to predict whether a customer is going to churn or not, with the overall aim of identifying those customers early and trying to stop them from churning, so we retain as many customers and as much business as possible.

Next we need to load our raw data, and I have two ways of loading the same data. The first one is from SQL Server: if you have a SQL Server, you can put the data there and then load it into Python. Alternatively, you can download the CSV file from my GitHub repository and load it by calling pd.read_csv. Into pd.read_csv all you have to do is paste the path of the file: I copy the path of my file, paste it here, add the file name and the .csv extension at the end, and use the encoding 'latin-1'. Then I print the shape of the raw data, which is 10,000 rows (observations) and 14 columns, and display the first five rows. In our data we have the row number, the customer ID, the surname, the credit score, geography, gender, age, tenure, balance, number of products, has credit card, is active member, the estimated salary, and finally whether the customer churned or not.

Moving on, we have the data pre-processing phase, and in this code I investigate all the elements within each feature (by feature I mean column). The way to do this is to say "for column in raw_data", which loops over each column. For each one I create a unique_values list holding the unique elements within that column, and I calculate its length, nr_values. For example, the loop starts with row number, and because we have 10,000 rows it has 10,000 unique values, so nr_values is 10,000. The code then says: if nr_values is less than 36, print the number of values for the feature together with the unique values themselves; otherwise print only the column name and nr_values. So for row number, because nr_values is more than 36, it only prints the column name and the count of 10,000; but if we go down to geography, where there are only three unique values (France, Germany and Spain), it actually prints those values. This is basically a very quick piece of code to see the number of distinct values within each variable and, where there are few, what those values are: geography has three distinct values (France, Germany and Spain), gender has two (male and female), tenure has 11 and we can see them all, number of products has four, has credit card has two, and so on. If I change this 36 to 10, then tenure, which has 11 distinct values, will no longer print its unique values; let's test it, and indeed after rerunning we no longer get the distinct values for tenure. I'm going to put it back to 36 for now so we can see the tenure values too.
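A minimal sketch of the loading and exploration steps just described, assuming the standard bank-churn CSV; the file path below is a placeholder for wherever you saved the file from the repository.

# Load the raw data (placeholder path; point this at your own copy of the CSV)
raw_data = pd.read_csv(r"C:\data\Churn_Modelling.csv", encoding="latin-1")
print(raw_data.shape)   # expect (10000, 14)
raw_data.head()

# For every column, count the distinct values; print the values themselves
# only when there are few enough to read
for column in raw_data:
    unique_values = raw_data[column].unique()
    nr_values = len(unique_values)
    if nr_values < 36:
        print(f"Number of values for feature {column}: {nr_values} -- {unique_values}")
    else:
        print(f"{column}: {nr_values} distinct values")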
Right, next I want to check whether we have any null values, so I call .isnull().sum() on my data frame to sum the null values per column. As you can see, no column has any null values, so I can move on to visualizations.

The first thing I always do is run a pair plot from sns, the seaborn library. I limit the raw data to only the numeric columns, so geography and gender are missing here because they are not numeric, and I also exclude the row number and the customer ID because we don't need them in our visualizations. To create a pair plot, all you have to do is call sns.pairplot, feed it the numeric data frame, and set the hue to 'Exited', so the color distinction is whether the customer churned or not, since that's what we're going to predict later on; the diag_kws argument is just there to constrain the diagonal plots so I don't get any errors. Now let's investigate the pair plots to see whether there are relationships that will help later when we're predicting. For credit score against age there is a very clear pattern: people younger than about 20 or older than about 60 are more likely not to churn, while the people in between are the ones who churn (orange is churn, blue is no churn). Checking the other graphs, there are good relationships in tenure against age, and again in age against number of products; for everything else I don't think there is much of a relationship.

Moving on, I want to investigate my non-numeric features, since we've just done the numeric ones. I'm going to create bar plots for geography and gender, and I've also included age, tenure, number of products, has credit card and is active member, just because they have far fewer unique values, so I can visualize them in a bar plot as well. What I do here is say "for feature in features", then plot a figure with sns.countplot, where x is the feature, data is our data frame, hue is whether the customer churned or not, and the palette is 'Set1'. Investigating the results: from geography we don't get much, and from gender again I can't get much information. Age is where I do get some information: the people who churn follow more of a normal distribution, while the people who don't churn (in red) follow more of a right-skewed distribution, which is good to know. For tenure, again, not much; maybe I'd need to turn the counts into percentages to understand it better. From number of products we actually get very useful information: people with three or four products almost always churn; the bars are all blue apart from one small red bar at three. From has credit card, again I'd need percentages, but I don't think there is much there, and the same goes for is active member.
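A minimal sketch of the two visualizations, again assuming the standard column names ('Exited' is the churn label); the diag_kws value below is illustrative, it just keeps the diagonal density plots well-behaved, and may differ from what the video uses.

# Pair plot of the numeric columns, colored by churn
numeric_data = raw_data.drop(columns=["RowNumber", "CustomerId", "Surname",
                                      "Geography", "Gender"])
sns.pairplot(numeric_data, hue="Exited", diag_kws={"bw_method": 0.2})
plt.show()

# Count plots for the lower-cardinality features
features = ["Geography", "Gender", "Age", "Tenure", "NumOfProducts",
            "HasCrCard", "IsActiveMember"]
for f in features:
    plt.figure(figsize=(10, 4))
    sns.countplot(x=f, data=raw_data, hue="Exited", palette="Set1")
    plt.show()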
Right, the next thing we want to do is turn our categorical variables into a numeric representation. The reason is that the model, the mathematical equation later on, is not going to understand a character value such as 'Germany' or 'France'; we need to give it a number, zero or one, so it can understand it. The way to do this is the following: we create a new data frame, new_raw_data, by calling pd.get_dummies, feeding it the previous data frame, and selecting the columns we want to convert. I've selected geography, gender, has credit card and is active member. The last two are already numbers, but treating them as plain numbers would imply that the difference between the values means something, which is not how we want to represent a yes/no answer; we want to split each of them into separate columns so the difference doesn't imply anything. For number of products, on the other hand, the difference does matter: going from one product to two, three or four is a genuine ordering, so we leave it as it is. As you can see below, after I run it the geography column is gone and three extra columns have been created: Geography_France, Geography_Germany and Geography_Spain. The first row was France, so under France we have a one, and under Germany and Spain we have zeros. It's the same for has credit card, where the zero/one value has moved under the matching column heading, and likewise for is active member.

Moving on, the next step is to scale our data, and we want to do this for two reasons. The first is that it makes our machine learning models run faster; the second is that it can improve the model's accuracy, because we normalize the distribution of our inputs, although that's not always the case. The first thing I do is select the variables I want to scale: the credit score, the estimated salary, the balance and the age. Then I create my scaler, which is the MinMaxScaler, and fit it to those columns: I say new_raw_data[scale_vars] = scaler.fit_transform(new_raw_data[scale_vars]), so the min-max transformation is applied to exactly those variables. Here's how it works, using balance as an example: it takes the minimum and maximum values of the column. Let's calculate them quickly: selecting the balance column and calling .min() gives 0, and .max() gives about 250,000. The scaler maps 0 to 0 and 250,000 to 1, and forces all the numbers in between into a value between 0 and 1. So the first balance of 0 stays 0, but the second one, around 83,000, is converted to roughly 0.33. The same applies to every column we scale: take the minimum and the maximum, and convert everything in between into a number between 0 and 1.
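A minimal sketch of the dummy encoding and scaling, continuing from the snippets above; the names new_raw_data and scale_vars follow what the video describes.

# One-hot encode the categorical and yes/no columns so value differences
# carry no ordering; NumOfProducts stays numeric because its order matters
new_raw_data = pd.get_dummies(raw_data, columns=["Geography", "Gender",
                                                 "HasCrCard", "IsActiveMember"])

# Min-max scale the wide-range numeric columns into [0, 1]:
# x_scaled = (x - min) / (max - min), so a balance of ~83,000 with
# min 0 and max ~250,000 becomes roughly 0.33
scale_vars = ["CreditScore", "EstimatedSalary", "Balance", "Age"]
scaler = MinMaxScaler()
new_raw_data[scale_vars] = scaler.fit_transform(new_raw_data[scale_vars])
new_raw_data.head()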
Right, moving on, we have to split our data into X and y so we can fit it to our model. I set X to new_raw_data, the data frame from above, but with my y variable, 'Exited', dropped, and then take the values; y is just the values of the 'Exited' column. Then I print the shapes to make sure they are correct: as X we have 10,000 observations with 15 variables to fit into our model, and as y we have 10,000 observations of a single variable, which is zero or one. I also split the data into X_train, X_test, y_train and y_test, because I'm going to use the hold-out validation technique: 90 percent of the data to train the model and 10 percent to test it. The point is to test the model on unseen data, data the model has never seen before; and because I know the true answers, my y_test, I can compare them against the predictions and select the best model.
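A minimal sketch of the split, continuing from the snippets above. Dropping the identifier columns (row number, customer ID, surname) here is an assumption on my part, but it matches the 15 features reported in the video; the random_state is just fixed for reproducibility.

# X: every feature except the label and the identifier columns; y: the label
X = new_raw_data.drop(columns=["Exited", "RowNumber", "CustomerId", "Surname"]).values
y = new_raw_data["Exited"].values
print(X.shape, y.shape)   # expect (10000, 15) and (10000,)

# Hold-out validation: 90% of the rows to train on, 10% kept aside for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1,
                                                    random_state=1)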
The next step is to run our decision tree model, and a few notes before we do. Decision trees are supervised machine learning methods that are used for both classification and regression, so you can use them to predict both classes and continuous numbers. To see how they work, you really have to understand how information gain and entropy are calculated, so I suggest watching these two videos first before you move on with this tutorial, but I'm going to try my best to explain it in a very simple way. The algorithm begins with the original set S as the root node: it goes through every single column (attribute, variable) we have, calculates the information gain (equivalently, the entropy), and chooses the attribute with the highest information gain, which is the lowest entropy. Here, age is the variable with the highest information gain. On each iteration the algorithm then goes through every unused attribute, calculates its entropy or information gain again, and selects the attribute with the smallest entropy, or largest information gain. It repeats the same process for the variables that are left: our next picks are number of products and is active member, as the ones with the highest information gain. After selecting these, we produce further subsets of the data, and the algorithm continues to recurse on each subset, considering only attributes never selected before. It keeps growing the decision tree based on entropy or information gain, and based on how much depth you allow the tree; here I've stopped it at two levels, because if you let it run through all iterations you're going to have overfitting issues, which we'll discuss later. That's a very simple way of explaining how it works.

To translate it into numbers, the root node is saying: if the age of the customer is less than 0.318... by the way, that's scaled data, so to make it easier to read, let's rerun it on unscaled data: I go up, remove age from the scaling so age is the actual age, and rerun the decision tree. Now the root is saying that if the age of the customer is less than 41.8, we move down to the branch on one side; otherwise we move down to the other branch, and we continue with these if/else splits through the other variables until we reach a leaf telling us whether the customer is predicted to churn or to be retained. Because I've set the max depth to 2, here is what happens with the first observation: the first customer has an age of 42, so at the root the condition "age less than 41.8" is false and we go down the right branch. The next split asks whether is active member is less than 0.5; for this customer is active member is 1, which is more than 0.5, so we move down that branch, and the leaf says we predict this customer as a churn. Let's check whether they actually churned: the label is indeed one, so we made the prediction correctly.

Now, to explain how to run the code (after quickly going back and adding age back into the scaling, so we work with the scaled data again): I say dt = DecisionTreeClassifier, and I set the criterion to entropy; if you follow this link you can read all about this classifier in sklearn, and I want my criterion to be entropy because the default one is gini. I also set max_depth=2, which is the depth of the tree, the number of split levels you're going to have; I only want it to stop at the second one, and if you leave it as the default the tree keeps growing until it runs out of useful splits on your X_train. The last thing I do is set random_state=1 so I can replicate my results, since it controls the randomness of the estimator. Then I say dt.fit and fit it on my X_train and y_train.

The next thing I do, before evaluating the model, is visualize the tree, which is what this bit of code down here does. To visualize the tree we use the graphviz library: we call tree.export_graphviz, give it our fitted model, set feature_names to the columns of the data frame from before we split it into X and y (with 'Exited' dropped), set class_names to the unique values of 'Exited' as string values, and ask for the boxes to be filled and rounded with special characters enabled. To render it, we pass the resulting dot data into graphviz.Source, and then you can see the model you created above.

Another thing I want to show you before we evaluate the model is how to calculate feature importance. Remember how I told you the decision tree chooses the variables with the highest information gain; we can actually quantify this, and what the output means is that we can say to the business, "the most important variables affecting your customer churn are the following", and give them the ones with the highest feature importance.
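A minimal sketch of fitting and visualizing the tree, continuing from the snippets above; the feature-name handling assumes the same identifier columns were dropped as in the X/y split.

# Fit a depth-2 decision tree using entropy as the split criterion
dt = DecisionTreeClassifier(criterion="entropy", max_depth=2, random_state=1)
dt.fit(X_train, y_train)

# Export the fitted tree to dot format and render it with graphviz
feature_names = new_raw_data.drop(columns=["Exited", "RowNumber", "CustomerId",
                                           "Surname"]).columns
dot_data = tree.export_graphviz(dt, out_file=None,
                                feature_names=feature_names,
                                class_names=new_raw_data["Exited"].unique().astype(str),
                                filled=True, rounded=True,
                                special_characters=True)
graphviz.Source(dot_data)   # displays the tree in a notebook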
To calculate it now, here's what I've done (you can copy and paste this code). I say "for i, column in enumerate(...)" over all my variables apart from the y variable, so I drop 'Exited'. For each column I print its feature importance, and to get it all you have to do is index dt.feature_importances_ with i, so i=0 gives the first variable; you can see it prints the feature importances of all the variables. In the middle of the loop I build a small data frame with a 'Variable' column holding the column name and a 'Feature Importance Score' column holding that variable's importance. The first time the loop runs, the merge step fails because final_fi doesn't exist yet, so we fall through to the code below it, which simply assigns this first data frame to final_fi; on every following iteration we append the new feature importance to the previous ones. As the loop goes on we build up this data frame, inserting a new row for each column and its feature importance. The last bit orders the rows of the data frame by feature importance score in descending order, so the highest score comes first. And this is the output: when I display the data frame, the first variable is age, with the highest feature importance, which is why the tree starts with age at the root; the second is number of products, which is why number of products appears at the next level; the third is is_active_member_1, which we also see in the tree, and so on.
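The enumerate-and-merge loop just described can be condensed into a few lines; this is a sketch of an equivalent, more idiomatic version rather than the exact code from the video.

# Pair every feature with its importance score and sort, highest first
fi = pd.DataFrame({"Variable": feature_names,
                   "Feature Importance Score": dt.feature_importances_})
fi.sort_values("Feature Importance Score", ascending=False)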
Right, the next thing we want to do is evaluate our model, and for that I call the score on the training data set and the score on the testing data set. As you can see, we predicted 82 percent correctly when we evaluated the model on the same data it was trained on, and we also reached 82 percent accuracy when we evaluated it on data it had never seen before. Having 82 percent accuracy is actually very good; however, accuracy is not the most important, or at least not the only, evaluation metric you should check. The next one I'm going to show you is the confusion matrix, which is very important to inspect when you're predicting classes. To create it, first you run this plotting function, which draws the confusion matrix; I actually found it on the internet a couple of years ago, so you can just copy and paste it as it is. Then, to plot it, the first thing we do is predict our y: y_predict makes predictions on our X_train, saying for each customer whether they churn or not. Next we calculate the confusion matrix numbers to feed into the plot. Running it quickly, these numbers tell us how many people we classified correctly as non-churns, how many we classified correctly as churns, and the two remaining numbers are the people we misclassified, one cell as churns and the other as non-churns. The next step is to normalize the matrix, turning the counts into percentages, and then we plot the figure by calling the function from above, feeding it the normalized data we just calculated, passing the classes, which we can get from the fitted model via dt.classes_, and setting the title to 'Training Confusion Matrix'. What this tells us is that we predict the non-churns very well, 91 percent accuracy on the zeros, but on the ones we only have 46 percent accuracy, which tells a very different story from the 82 percent overall accuracy above. Yes, we get the zeros right, but the aim of the business, and our aim, is to predict the churns as accurately as possible in order to help the business.

A quick way to try to improve the results is to increase the max depth of our model above, so we get three, four or five levels. Just to show you how this looks: if I set it to three and rerun the decision tree visual, you can see there is now another split after number of products, which uses number of products again and then balance. I think the reason it reuses number of products is that it ran out of variables carrying information gain: if I run it again, balance still adds some information gain, but after balance there is zero added value from the rest of the variables, which is why the same variables repeat in the tree. Anyway, if we rerun the accuracy now, it has improved; but if we rerun the confusion matrix, the accuracy on the ones has dropped a lot, and this is because we are overfitting the data. So you need to be careful with the max depth of your trees: overfit the data and you are not going to get good results. Let's change it back to two so we're not overfitting, rerun everything, the score and then the confusion matrix, and we're back to the 46 percent.

Some additional metrics for evaluating how good your model is: sensitivity, also called hit rate, recall or true positive rate (different names for the same metric); precision, or positive predictive value; the false positive rate, or false alarm rate; the false negative rate, or miss rate; and the classification error. If you want a deep explanation of how these work and how they're calculated, I have the code over here, but I explain them very well in this other video on logistic regression; you can watch it to understand how they're calculated and when to use which metric, so I'm not going to go into them right now.
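A minimal sketch of the evaluation steps, continuing from the snippets above; it uses sklearn's confusion_matrix for the counts (the plotting helper from the video is not reproduced here) and spells out the additional metrics as formulas.

# Accuracy on the data the model has seen vs. the held-out data
print(dt.score(X_train, y_train))   # ~0.82 in the video
print(dt.score(X_test, y_test))     # ~0.82 in the video

# Confusion matrix on the training predictions
y_predict = dt.predict(X_train)
cm = confusion_matrix(y_train, y_predict)

# Normalize each true-class row into percentages before plotting
cm_norm = cm.astype(float) / cm.sum(axis=1)[:, np.newaxis]
print(cm_norm)

# The additional metrics, derived from the four cells of the 2x2 matrix
tn, fp, fn, tp = cm.ravel()
recall = tp / (tp + fn)                    # sensitivity / hit rate / true positive rate
precision = tp / (tp + fp)                 # positive predictive value
false_positive_rate = fp / (fp + tn)       # false alarm rate
false_negative_rate = fn / (fn + tp)       # miss rate
classification_error = (fp + fn) / (tn + fp + fn + tp)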
The next thing I would normally do after running decision trees is tune their hyperparameters to improve the results, but I'm not going to do that now, because I'll show you how when we run the random forest models. And because this tutorial is long as it is, I'm going to stop here and finish the rest in the next video. I hope you've enjoyed this tutorial and gained value out of it; if you feel like you did, please click that like button, subscribe to my channel and enable notifications for my future videos. If you have any questions, please let me know in the comments below. Otherwise, thank you very much for watching this video, and I'll see you in the next one.
Info
Channel: Data 360 YP
Views: 5,081
Keywords: Decision Tree Algorithm in Machine Learning Python, decision tree, decision tree machine learning, decision tree algorithm in machine learning python, decision tree classifier machine learning, decision tree example, decision tree algorithm, how to run decision tree in python, decision tree sklearn python, how to visualize decision tree python, decision tree examples with solutions, decision tree tutorial python sklearn, decision tree analysis, feature importance, python decision
Id: sFVxFCYiIQI
Length: 33min 38sec (2018 seconds)
Published: Wed Jan 13 2021