StatQuest: Decision Trees

Captions
[StatQuest theme song] Hello, I'm Josh Starmer, and welcome to StatQuest. Today we're going to be talking about decision trees.

Here's a simple decision tree: if a person loves StatQuest theme songs, then that person is awesome, and if a person does not love StatQuest theme songs, then that person is slightly less than awesome. In general, a decision tree asks a question and then classifies the person based on the answer. It's no big deal.

This decision tree is based on a yes/no question, but it is just as easy to build a tree from numeric data: if a person has a really high resting heart rate, then that person had better see a doctor, and if a person does not have a super high resting heart rate, then that person is doing okay.

Here's one more simple decision tree. This one is based on ranked data, where 1 is super hungry and 2 is moderately hungry. If a person is super hungry, they need to eat; if a person is moderately hungry, they just need a snack; and if they are not hungry at all, then there's no need to eat.

Note: the classifications can be categorical or numeric. In this case, we're using mouse weight to predict mouse size.

Here's a more complicated decision tree. It combines numeric data with yes/no data. Notice that the cutoff for resting heart rate isn't always the same: in this case it's 100 bpm on the left side and 120 bpm on the right side. And the order of the questions on the left side (first about resting heart rate, then about eating donuts) doesn't have to be the same on the right side; on the right side, the question about donuts is asked first. Lastly, the final classifications can be repeated.

For the most part, decision trees are pretty intuitive to work with. You start at the top and work your way down until you get to a point where you can't go any further, and that's how you classify a sample.

Oh no, jargon alert! The very top of the tree is called the root node, or just the root. These are called internal nodes, or just nodes; internal nodes have arrows pointing to them and arrows pointing away from them. Lastly, these are called leaf nodes, or just leaves; leaf nodes have arrows pointing to them, but no arrows pointing away from them.

Now we are ready to talk about how to go from a raw table of data to a decision tree. In this example, we want to create a tree that uses chest pain, good blood circulation, and blocked artery status to predict whether or not a patient has heart disease.

The first thing we want to know is whether chest pain, good blood circulation, or blocked arteries should be at the very top of our tree. We start by looking at how well chest pain alone predicts heart disease. Here's a little tree that only takes chest pain into account. The first patient does not have chest pain and does not have heart disease, and we keep track of that here. The second patient has chest pain and heart disease, and we keep track of that here. The third patient has chest pain but does not have heart disease. The fourth patient has chest pain and heart disease. Ultimately, we look at chest pain and heart disease for all 303 patients in this study. Now we do the exact same thing for good blood circulation. Lastly, we look at how blocked arteries separates the patients with and without heart disease. Since we don't know whether this patient had blocked arteries or not, we'll skip them; however, there are alternatives that I'll discuss in a follow-up video.
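An aside from this transcript: the "start at the root, answer questions, end in a leaf" procedure maps directly onto nested conditionals. Here is a minimal Python sketch; the feature names, thresholds, and labels are made-up illustrations loosely following the heart-rate and donut examples above, not the video's actual tree.

# A hand-built decision tree, mirroring the simple examples above.
# Thresholds and labels are illustrative assumptions, not from the video.
def classify(patient: dict) -> str:
    """Walk from the root node down to a leaf and return its label."""
    if patient["resting_bpm"] > 100:        # root node (yes/no question)
        if patient["eats_donuts"]:          # internal node
            return "better see a doctor"    # leaf node
        return "keep an eye on it"          # leaf node
    return "doing okay"                     # leaf node

print(classify({"resting_bpm": 130, "eats_donuts": True}))   # better see a doctor
print(classify({"resting_bpm": 72,  "eats_donuts": False}))  # doing okay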
Remember, the goal is to decide whether chest pain, good blood circulation, or blocked arteries should be the first thing in our decision tree, a.k.a. the root node. So we looked at how well chest pain separated patients with and without heart disease. It did okay, but it wasn't perfect: most of the patients with heart disease ended up in this leaf node, and most of the patients without heart disease ended up in this leaf node. Then we looked at how well good blood circulation separated patients with and without heart disease; it wasn't perfect either. Lastly, we looked at how well blocked arteries separated patients with and without heart disease.

Note: the total number of patients with heart disease is different for chest pain, good blood circulation, and blocked arteries, because some patients had measurements for chest pain but not for blocked arteries, etc.

Oh no, it's another one of those ghastly jargon alerts! Because none of the leaf nodes are 100% "yes heart disease" or 100% "no heart disease", they are all considered impure. To determine which separation is best, we need a way to measure and compare impurity. There are a bunch of ways to measure impurity, but I'm just going to focus on a very popular one called Gini. To be honest, I don't know why it's called Gini; I looked around on the internet and couldn't find anything. However, if you know, please put it in the comments below; I would love to know. Regardless, the good news is that calculating Gini impurity is easy.

Let's start by calculating the Gini impurity for chest pain. For this leaf, the Gini impurity equals 1, minus the probability of "yes" squared, minus the probability of "no" squared. Now let's plug in some numbers: the probability of "yes" equals 105 divided by the total number of people in this leaf node, and the probability of "no" equals 39 divided by the total number of people in this leaf node. After we've done the math, we get 0.395. That is to say, the Gini impurity for the leaf node on the left equals 0.395.

Now let's calculate the Gini impurity for the leaf node on the right. Just like before, it equals 1, minus the probability of "yes" squared, minus the probability of "no" squared. The probability of "yes" is 34 divided by the total number of people in this leaf node, and the probability of "no" equals 125 divided by the total number of people in this node. If we do the math, we get 0.336.

Now that we have measured the Gini impurity for both leaf nodes, we can calculate the total Gini impurity for using chest pain to separate patients with and without heart disease. Because this leaf node represents 144 patients and this leaf node represents 159 patients, the leaf nodes do not represent the same number of patients. Thus, the total Gini impurity for using chest pain to separate patients with and without heart disease is the weighted average of the leaf node impurities. To calculate the weighted average, we take the total number of people in the left leaf node, divide it by the total number of people in both leaf nodes, and multiply that fraction by the Gini impurity for the left leaf node. Then we take the total number of people in the right leaf node, divide it by the total number of people in both leaf nodes, and multiply that fraction by the Gini impurity for the right leaf node. After we do the math, we get 0.364. Thus, the total Gini impurity for chest pain equals 0.364. And since I'm such a nice guy, I'm going to cut to the chase and tell you that the Gini impurity for good blood circulation equals 0.360, and the Gini impurity for blocked arteries equals 0.381.
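The Gini arithmetic above is easy to verify in code. A minimal Python sketch (mine, not the video's) that reproduces the chest-pain numbers; the counts 105/39 and 34/125 come straight from the transcript:

def gini(yes: int, no: int) -> float:
    """Gini impurity of one leaf: 1 - P(yes)^2 - P(no)^2."""
    total = yes + no
    return 1 - (yes / total) ** 2 - (no / total) ** 2

left = gini(105, 39)    # leaf with chest pain    -> 0.395
right = gini(34, 125)   # leaf without chest pain -> 0.336

# The total impurity is the weighted average, because the two leaves
# hold different numbers of patients (144 vs. 159).
n_left, n_right = 144, 159
total = (n_left * left + n_right * right) / (n_left + n_right)
print(round(left, 3), round(right, 3), round(total, 3))  # 0.395 0.336 0.364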
Good blood circulation has the lowest impurity: it separates patients with and without heart disease the best. So we will use it at the root of the tree.

Note: when we divided all of the patients using good blood circulation, we ended up with impure leaf nodes; that is to say, each leaf contained a mixture of patients with and without heart disease. That means the 164 patients (with and without heart disease) that ended up in this leaf node are now in this node in the tree, and the 133 patients (with and without heart disease) that ended up in this leaf node are now in this node in the tree.

Now we need to figure out how well chest pain and blocked arteries separate these 164 patients (37 with heart disease and 127 without heart disease). Just like we did before, we separate these patients based on chest pain and then calculate the Gini impurity value; in this case, it's 0.3. Then we do the exact same thing for blocked arteries. Since blocked arteries has the lowest Gini impurity, we will use it at this node to separate the patients.

Here's the tree that we've worked out so far: we started at the top by separating patients using good blood circulation, then we used blocked arteries to separate patients on the left side of the tree. All we have left is chest pain, so first we'll see how well it separates these 49 patients (24 with heart disease and 25 without heart disease). Nice! Chest pain does a good job separating the patients, so these are the final leaf nodes on this branch of the tree.

Now let's see what happens when we use chest pain to divide these 115 patients (13 with heart disease and 102 without). Note that the vast majority of the patients in this node, 89%, don't have heart disease. Here's how chest pain divides these patients. Do these new leaves separate patients better than what we had before? Well, let's calculate the Gini impurity; in this case, it's 0.29. The Gini impurity for this node before using chest pain to separate patients is 0.2. The impurity is lower if we don't separate the patients using chest pain, so we will make it a leaf node.

Okay, at this point we've worked out the entire left side of the tree. Now we need to work out the right side. The good news is that we follow the exact same steps as we did on the left side: first, we calculate all of the Gini impurity scores; second, if the node itself has the lowest score, then there is no point in separating the patients any more, and it becomes a leaf node; third, if separating the data results in an improvement, then we pick the separation with the lowest impurity value. Hooray, we made a decision tree!

So far, we've seen how to build a tree with yes/no questions at each step, but what if we have numeric data, like patient weight? Imagine this were our data: how do we determine the best weight to use to divide the patients? Step 1: sort the patients by weight, lowest to highest. Step 2: calculate the average weight for all adjacent patients. Step 3: calculate the impurity values for each average weight. For example, we can calculate the impurity value for weights less than 167.5; ultimately, we get 0.3 as the impurity value for this weight. Then we calculate the impurity values for the other weights as well. The lowest impurity occurs when we separate using weight < 205, so this is the cutoff, and the impurity value, we will use when we compare weight to chest pain or blocked arteries.

Now we've seen how to build a tree with yes/no questions at each step, and with numeric data like patient weight.
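The three steps for numeric data are just as easy to sketch. In the Python below, the weights and heart-disease labels are made-up stand-ins (the video's table isn't reproduced here), chosen only so the candidate cutoffs include the 167.5 and 205 mentioned above; the midpoint-and-Gini machinery is the real point.

def gini(labels) -> float:
    """Gini impurity of a group of yes(1)/no(0) labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 1 - p ** 2 - (1 - p) ** 2

weights = [155, 180, 190, 220, 225, 310]   # step 1: sorted patient weights
disease = [0, 0, 0, 1, 1, 1]               # illustrative labels, not the video's

# Step 2: average each pair of adjacent weights to get candidate cutoffs.
cutoffs = [(a + b) / 2 for a, b in zip(weights, weights[1:])]

# Step 3: score each candidate cutoff by weighted Gini impurity.
for cut in cutoffs:
    left = [d for w, d in zip(weights, disease) if w < cut]
    right = [d for w, d in zip(weights, disease) if w >= cut]
    score = (len(left) * gini(left) + len(right) * gini(right)) / len(weights)
    print(f"weight < {cut}: Gini = {score:.3f}")

# The cutoff with the lowest score wins; with these toy labels that is
# weight < 205, matching the cutoff chosen in the video's example.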
Now let's talk about ranked data, like "rank my jokes on a scale of 1 to 4", and multiple-choice data, like "which color do you like: red, blue, or green?"

Ranked data is similar to numeric data, except now we calculate impurity scores for all of the possible ranks. So if people could rank my jokes from 1 to 4, with 4 being the funniest, we could calculate the following impurity scores: joke rank <= 1, joke rank <= 2, and joke rank <= 3. Note: we don't have to calculate an impurity score for joke rank <= 4, because that would include everyone.

When there are multiple choices, like a color choice that can be blue, green, or red, you calculate an impurity score for each option, as well as for each possible combination. For this example with three colors (blue, green, and red), we get the following options: color choice is blue, color choice is green, color choice is red, color choice is blue or green, color choice is blue or red, and lastly, color choice is green or red. Note: we don't have to calculate an impurity score for "color choice is blue, green, or red", since that includes everyone. (A short code sketch enumerating these candidate splits follows the captions.) BAM! Now we know how to make and use decision trees.

Tune in next time for random forests; that's when the fun really begins. Hooray, we've made it to the end of another exciting StatQuest! If you liked this StatQuest and want to see more, please subscribe. And if you have any suggestions for future StatQuests, well, put them in the comments below. Until next time, quest on!
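As promised above, here is a small Python sketch (mine, not the video's) of enumerating the candidate splits for ranked and multiple-choice data; each candidate would then be scored with Gini impurity exactly as before.

from itertools import combinations

# Ranked data (joke ratings 1-4): one candidate per "rank <= r" question.
# Rank <= 4 is skipped because that group would include everyone.
rank_splits = [f"joke rank <= {r}" for r in (1, 2, 3)]
print(rank_splits)  # ['joke rank <= 1', 'joke rank <= 2', 'joke rank <= 3']

# Multiple-choice data: one candidate per non-empty proper subset of the
# options; the full set {blue, green, red} is skipped for the same reason.
colors = ["blue", "green", "red"]
color_splits = [set(combo)
                for n in range(1, len(colors))
                for combo in combinations(colors, n)]
print(color_splits)  # [{'blue'}, {'green'}, {'red'}, {'blue', 'green'}, ...]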
Info
Channel: StatQuest with Josh Starmer
Views: 697,402
Rating: 4.9249024 out of 5
Keywords: StatQuest, Joshua Starmer, Statistics, Machine Learning, Decision Trees, Random Forests, Data Mining
Id: 7VeUPuFGJHk
Length: 17min 22sec (1042 seconds)
Published: Mon Jan 22 2018