Decision Trees, SVMs and Random Forest | Practical Machine Learning with Scikit-Learn #2

Captions
Alright guys, so we did some regression problems; now it's time to work with classification. In case you forgot, classification is when we don't have values on a sliding scale: something either belongs to a class or it doesn't. In this example we're going to look at a dataset of breast cancer tumors and determine whether each one is malignant or benign. You can imagine this problem is hugely important, because it can save lives and help doctors save time as well.

The first thing we're going to do is connect to our Colab runtime, press upload, and upload our cancer dataset, which is called cancer.csv. Now that we have that, let's take a look at the dataset. If we open it up, you see we have the ID, which you don't have to worry about because it has nothing to do with the machine learning we're going to be doing; then the diagnosis, which is M for malignant or B for benign; and then radius_mean, texture_mean, and all these other features of a given tumor. From those features we're going to try to predict whether the tumor is malignant or benign.

Let's import our data into Colab. First we'll say import pandas as pd, and then we'll set our dataset equal to pd.read_csv('cancer.csv'), because that's the name of our file. Remember, we also have to split our data into X and y. As you can see there are a lot of features and I don't feel like counting them all, so I'm just going to print the length of dataset.columns, which shows we have 32 columns. 32 columns means indices 0 through 31, and we don't need the ID (index 0) or the diagnosis (index 1), because the diagnosis is going to be our y variable. That means X is going to equal dataset.iloc with a colon for the rows and the columns running from index 2, which is radius_mean, through the last feature, with .values at the end. Then let's set up our y variable: y equals dataset.iloc with just column index 1, the diagnosis, plus .values. Run that; all is gucci.

Next it's time to do the train/test split, and it's really important to note that we need a training set and a testing set, even more so for classification problems, because something like misclassifying a tumor can be a matter of life and death. Making sure our model isn't just overfitting, that it doesn't merely memorize our data while failing on new data, is really, really important. So we're going to say from sklearn.model_selection import train_test_split, and then X_train, X_test, y_train, and y_test equal train_test_split of our X and y, with test_size=0.2, or 20% of our data, which for 569 entries means a little more than 100 entries saved for testing our model. Run that; looks like all is good.
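In code, those steps look roughly like this. This is a minimal sketch; it assumes the file really is named cancer.csv and that, as described above, column 0 is the id, column 1 is the diagnosis, and the remaining 30 columns are features:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Load the dataset (assumes cancer.csv was uploaded to the Colab runtime)
dataset = pd.read_csv('cancer.csv')
print(len(dataset.columns))  # 32: id, diagnosis, and 30 feature columns

# Column 0 is the id and column 1 is the diagnosis ('M' or 'B');
# everything from radius_mean onward is a feature
X = dataset.iloc[:, 2:32].values
y = dataset.iloc[:, 1].values

# Hold back 20% of the ~569 rows for testing the model
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
```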
Next we're going to scale our data. As you can see, some of these numbers are all over the place; for instance, you have numbers in the hundreds and numbers in the 0.01s. If you picture a graph where one variable, say x2, is much bigger than the others, it's going to be harder to see a correlation. You want all the variables to have a similar scale so that it's easier to find that correlation, that line of best fit; you basically want to accentuate the features and how they relate to each other, and to do that we use scaling. So we're going to say from sklearn.preprocessing import StandardScaler, set scaler equal to StandardScaler(), and then set X_train equal to scaler.fit_transform(X_train), and do the same with X_test. (Strictly speaking, you should fit the scaler on the training data only and just transform the test set with the already-fitted scaler, so that no information about the test set leaks into training; the sketch below does it that way.) This just ensures our model can understand the data and doesn't get confused by numbers that are all over the place; if we put all the numbers on one scale, it's easier for the model to draw conclusions. Run that; great.

Now that our data is scaled, it's time for logistic regression. You might say: hold up, regression was the last section of this course; shouldn't we be doing something with "classification" in the name? I understand that's confusing, but logistic regression is actually a classification algorithm. The reason is that, as you can see in the image on screen, we're trying to predict whether a tumor is malignant or benign, whether it's a zero or a one. You can actually use linear regression on this problem, but it doesn't work very well, because linear regression is meant for sliding values, and a tumor is either malignant or benign, never in between. So we use logistic regression, which does much the same thing as linear regression but accounts for the fact that we don't need to fit those in-between values: we don't need a straight line, we need a curve that separates 0 and 1.

With that said, it's time to import our logistic regression model. We'll say from sklearn.linear_model import LogisticRegression, set logistic_classifier equal to LogisticRegression() with empty parentheses, and then call logistic_classifier.fit on our X_train and y_train. Run that. Something important to note here: we've been leaving the parentheses blank when we make a model because we haven't been passing in any hyperparameters; the defaults you see here are assumed when you put nothing in. If you want to make your model more and more accurate, it's nice to play around with some of these parameters. Luckily, the scikit-learn documentation is really good at explaining what each parameter does, so I'm going to leave some links down below so you can work out what each of them means if your model just isn't as accurate as you want it to be.
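Here's a sketch of the scaling and fitting steps. Note that, unlike the narration above, it only transforms the test set rather than refitting the scaler on it, which is the safer convention:

```python
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Fit the scaler on the training data only, then reuse that same fitted
# scaler on the test data so no test-set information leaks into training
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Empty parentheses = default hyperparameters; see the scikit-learn docs
# for what each one does if you want to tune the model
logistic_classifier = LogisticRegression()
logistic_classifier.fit(X_train, y_train)
```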
With that said, it's now time to see how accurate our model actually is. We're going to set our predictions, y_preds, equal to logistic_classifier.predict on the X_test values. We also have the y_test values, so we can compare the two and see how accurate our model is. To do that, we'll import the confusion matrix from scikit-learn (it lives in sklearn.metrics). The confusion matrix is hugely important because it doesn't just show how accurate our model is; it shows how many examples our model misclassifies and what types of misclassification it's making. Just to show you what I mean, we'll print out confusion_matrix with y_test and y_preds as the parameters.

You can see right here you get two columns and two rows. The two numbers on the diagonal, the 69 and the 42, represent the number of correct predictions, and the other two represent the false predictions. So what a confusion matrix does is give you a little more info on exactly what your model is misclassifying and what it's getting right. Up here is the number of positive values that were predicted to be positive, and here the number of negative values predicted to be negative; since both of these numbers are high, our model is pretty accurate in that sense. Over here, this one means the model predicted a value was positive when it was actually negative, and this one means the model predicted a value was negative when it was actually positive. In something like cancer classification this distinction is really, really important: a model that predicts a bunch of tumors are benign when they're actually malignant is much worse than one that predicts a bunch of tumors are malignant when they're actually benign, because in the latter case the worst that could happen is maybe an extra checkup, versus having a tumor grow into something more cancerous and eventually kill you. That's why printing your confusion matrix is important; there are lives at stake, people.
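A minimal sketch of that evaluation step:

```python
from sklearn.metrics import confusion_matrix

# Predict on the held-out test data and compare against the true labels
y_preds = logistic_classifier.predict(X_test)

# In scikit-learn, rows are the true classes and columns are the predicted
# classes: the diagonal holds correct predictions, the off-diagonal cells
# hold the two kinds of misclassification
print(confusion_matrix(y_test, y_preds))
```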
Okay, moving on from that, we're now going to talk about support vector machines. Support vector machines are hugely popular for classification problems. You can also use them for regression; in fact, every algorithm we talk about from this point on can be used for regression as well, but we're going to use them for classification because that's generally what they're used for. In a support vector machine, the model tries to separate groups of data points such that the margin between them is as wide as possible. It's a really cool concept, and the math behind it is pretty advanced, so I'm not going to explain it now; as always, I'll have resources for you guys to learn from on your own if you wish. If not, you can just implement the algorithm in about three lines of Python, which is pretty awesome.

So let's import it: from sklearn.svm import SVC, and then we'll set svm, for support vector machine, equal to SVC. This time we're actually not going to leave the parentheses empty; we're going to pass a kernel, and that kernel is going to be 'rbf'. This goes along with the kernel trick, which of course I'll also leave resources for, but it's just a hyperparameter we're playing around with; you can remove the kernel argument if you like, but for this example we're using it just to show you that it works. And if you wanted a support vector machine for something other than classification, like a regression problem, instead of importing SVC for support vector classification you would import SVR, for support vector regression; that's important to note. Next we call svm.fit and fit our X_train to our y_train. Run that bad boy, and you can see right here under kernel it says 'rbf', because we passed kernel='rbf'. Sweet.

So now it's time to print the confusion matrix. We'll copy those values from above and paste them here, and instead of logistic_classifier.predict we'll say svm.predict, and that should be it. Run that, and you can see we get the exact same result. Now, it's important to note that because our test data is pretty small, only about 100 entries, you're not going to see too much difference between these algorithms; but as the datasets you work with become more complex, have more features and more intricate patterns, and, more importantly, get larger, the algorithms you choose will generate vastly different results. Overall, support vector machines do tend to work better than logistic regression, but obviously if the problem is really simple it's easier to just use the logistic regression classifier: it's very simple and doesn't take a lot of processing power, and you don't want to reinvent the wheel by building a super elaborate solution to a really simple problem.
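The whole support vector machine section really is only a few lines; here's a sketch:

```python
from sklearn.svm import SVC
from sklearn.metrics import confusion_matrix

# SVC is the classifier; for a regression problem you would import
# sklearn.svm.SVR instead. kernel='rbf' selects a radial basis function
# kernel, an application of the kernel trick mentioned above.
svm = SVC(kernel='rbf')
svm.fit(X_train, y_train)

# Same evaluation as before, just swapping in the new model's predictions
print(confusion_matrix(y_test, svm.predict(X_test)))
```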
With that said, it's on to decision trees, and decision trees are something you've probably heard of, used, or thought about outside of machine learning. The basic idea is that you split on actions and features, and at the bottom you have your different classifications. An example that's popping up on your screen: say you want to make a decision tree for deciding whether a person is fit or unfit. You'd ask, does that person eat a lot of fast food? If yes, they're unfit; if no, then the next question: do they exercise regularly? If they exercise regularly, they're fit; if not, they're unfit. That's a very basic example, but a lot of companies use decision trees to map out much more complex features and much more complex trees of what a dataset is showing when classifying data points. That was a somewhat loose explanation, but hopefully you guys get the general idea of what a decision tree is doing.

To implement our decision tree, all we're going to say is from sklearn.tree import DecisionTreeClassifier. As you've probably noticed by now, we could also do decision tree regression, which isn't very common but actually works surprisingly well, I've found; generally, though, at the bottom of a decision tree you want classifications ("it's this class, this class, this class"), which is why we're using decision trees for a classification example rather than a regression example.

Okay, so now let's make our model. We'll call it tree, and tree will equal DecisionTreeClassifier with criterion='entropy', which means the splitting criterion will be based on the entropy of the data; that's just a hyperparameter we're adjusting. Finally, we call tree.fit and fit the X_train to the y_train. Run that, and you can see right here that our criterion is entropy. Now it's time to print out the confusion matrix, so I'll copy and paste the same code from above, and instead of svm.predict it's going to be tree.predict. Run that, and you see here that this one decision tree is actually less accurate than our support vector machine and less accurate than our logistic regression example.

The reason for that is that one tree alone is not very strong. Think of those competitions where a jar holds a certain number of items and you have to guess how many are inside. Something really interesting about human intuition is that each individual guess will be pretty far off, but if you add up all the guesses and divide by the number of people who made them, you'll find you get a number very close to the true count. That same cool fact about human intuition also applies to machine learning: one tree is not going to be very strong, but if you have a forest of trees, you get a much stronger model. That is the idea behind random forest classification, and this is getting into what we call ensemble learning. The idea behind ensemble learning is that you create a bunch of weak models, and by combining them you can make a really, really strong model. We'll get to that a little more in the next video when we talk about optimization, but for now we're just going to give you a taste of what's to come by creating a random forest. A random forest is basically just a bunch of decision trees: you aggregate their predictions across all the trees and get a more accurate prediction.

So we'll say from sklearn.ensemble import RandomForestClassifier, and remember, you can use a RandomForestRegressor too if your problem is a regression problem. Then forest equals RandomForestClassifier with n_estimators, our number of trees, set to 100 to start, and criterion='entropy'. It's important to note that while decision trees are extremely powerful, unlike some of the other classification algorithms we've worked with they are prone to overfitting, which is why it's important that we have a test set: it lets us make sure our algorithm doesn't just understand our dataset really well while failing on new data points. Then forest.fit, fitting the X_train to the y_train; run that, and everything looks gucci. Now it's time to see if our random forest is any more accurate than our singular decision tree. Instead of tree.predict we'll say forest.predict; run that, and you can see that instead of 5 incorrect predictions we only have 4. So between each of these examples, the actual accuracy gained is rather marginal, and that has more to do with the dataset, its size, and the size of the testing set than with the actual algorithm.
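Here's what the two tree-based models look like side by side in code. This is a sketch that reuses the scaled X_train and X_test from earlier:

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix

# A single tree, splitting on entropy
tree = DecisionTreeClassifier(criterion='entropy')
tree.fit(X_train, y_train)
print(confusion_matrix(y_test, tree.predict(X_test)))

# An ensemble of 100 such trees; the aggregated vote of many weak trees
# is usually stronger, and less prone to overfitting, than one tree alone
forest = RandomForestClassifier(n_estimators=100, criterion='entropy')
forest.fit(X_train, y_train)
print(confusion_matrix(y_test, forest.predict(X_test)))
```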
But you can see that overall, our support vector machine performed the best, along with our logistic regression classifier, while our random forest did not perform as well on this particular problem. For some problems you'll find that the random forest works better, but generally you're going to want to reach for a support vector machine. There are some problems with support vector machines, but overall the support vector machine is probably the best classification algorithm we've looked at, at least until we get to the next video, which is on optimization. With optimization we're really going to dive into ensemble learning, see how powerful it can be, and see how quickly you can build a model with ensemble learning versus doing something like deep learning. So yeah guys, that's pretty much it for this video, and I will see you in the next one.

Sweet, thank you guys so much for making it to the end of this video. As always, if you enjoyed it, make sure to hit that like button, and if you have any questions, comments, or concerns about what we did in this video, please leave them down below. Now, if I could have just one more minute of your time, I'd like to tell you about a service I've been using for over a year now called Scribd. Just as a side note, Scribd did not sponsor me to make this video; I just wanted to tell you about it. Put simply, Scribd is a lot like Audible, except instead of being fifteen dollars a month it's only nine, and instead of only getting two audiobooks per month you get unlimited access to a plethora of audiobooks, ebooks, documents, and even sheet music and magazines. For me this was obviously a no-brainer, and right now, if you use the link in the description, you get 60 days of Scribd free, and I get one month if you sign up using my link. That's why Scribd didn't officially sponsor this video: I'm just telling you about it so I can get some free months and continue learning, and you can also continue learning with your 60-day free trial. So thank you guys so much for making it to the end of this video, and I'll see you in the next one. Peace.