Machine Learning in Python: Building a Classification Model

Video Statistics and Information

Captions
Welcome back to the Data Professor YouTube channel. If you're new here, my name is Chanin Nantasenamat and I'm an associate professor of bioinformatics. On this YouTube channel we cover data science concepts and practical tutorials, so if you're into this kind of content, please consider subscribing.

I have been mentioning for quite some time about making Python tutorial videos, and so today is going to be the first episode. We're going to look at how you can build a simple classification model using the random forest algorithm on the iris dataset. So without further ado, let's get started.

Why don't you go ahead and open up the GitHub page of the Data Professor and click on the code repository. Scroll down and find python/iris. Because the python directory contains only one subdirectory, it shows up as python/iris, but in the future, when there are more videos in the python subdirectory, the name iris will disappear and it will only appear as python. So you can just click on it, and then you are inside the iris subfolder of the python directory under the code repository.

In here there is one file, the .ipynb file. .ipynb is the file extension for the IPython notebook, which was the original name before it became the Jupyter notebook. What you want to do is click on this file. The advantage of having Python code in a Jupyter notebook is that the input code and its output can be shown on the same page, without any compilation or running the code on a local computer. You can even look at this on your tablet or your mobile phone when you're on the go, so you will comprehend what input leads to what output.

What you want to do now is right-click on the Raw link and then choose "Save link as" or "Save target as", find a suitable location on your computer, and save it there. I have already saved it, so I will just open up the file. I'm going to fire up the command prompt, go into the directory where the files are located, activate my conda environment, and then run jupyter notebook. Then I will see the Python notebook file here; click on it and it will load.

You can see that this Jupyter notebook looks like a document. It is a blend of Python code along with documentation, and the documentation, as you see here, can be written in Markdown or in HTML. The first cell provides the title of this Jupyter notebook file, and the second cell we see here is the header, and I number it for easy reading. The first section here is called Import libraries, and we will see that there are a total of 10 steps that you could do to build your first classification model in Python using the random forest algorithm on the iris dataset.

The first step is pretty easy: you're going to import the libraries, and the libraries that we're going to use today are based solely on the scikit-learn package. The scikit-learn package is a popular machine learning package for the Python environment. We're going to use scikit-learn to import the iris dataset, which will be taken directly from the datasets submodule of scikit-learn, and then we're going to use the train_test_split function as well as RandomForestClassifier and make_classification.

The second step is to load the iris dataset, and we're going to do that by assigning to the iris variable the result of the load_iris function. datasets is actually the submodule, and within it we're going to use the load_iris function. So this will put the iris dataset into this iris variable.

In order to run the cell, you want to click on the cell and, on the keyboard, press and hold the Shift key along with the Enter key. You will see that the number briefly turns into an asterisk, which means that the cell is currently running, and when it is completed it becomes a number. The numbers are assigned sequentially, so this is the second time the cell has been run; if I run it again it will become 3.

OK, so now we're going to go to our dataset, and you see that the number becomes 4, which continues on from the previous cell, which was 3. So let's enter iris and see what happens. We will see that underneath there is the data, which comprises 150 flowers, the target, which is the class labels, and the target names, so 0, 1, 2 correspond to setosa, versicolor and virginica. Then this is the description of the dataset, and then the feature names: the sepal width, petal length, petal width, and also the sepal length. So what we can do is call each part by name, for example by specifying iris.target, which we will do below.

Let's go ahead and do that. As we see that there are feature names here, we're going to print iris.feature_names, which will allow us to see the input features, which comprise the four characteristics of the iris flower: the sepal length, sepal width, petal length and petal width. So when we press Shift+Enter, we get the same result as what we saw previously. Then, to see the output feature, it is in iris.target_names. And actually, if you want to add a cell, you can click on the plus button here and it will add a cell. So let's type in iris.target, press Shift+Enter, and then we see the 0, 1, 2. So it's the same thing, but this one is the numerical or integer version and this one is the string version.

And of course we're going to take a glimpse at the data, so iris.data, and we see the data here as an array. OK, so this we have done previously above.

So go ahead and assign iris.data to the X variable and assign iris.target, which is the class label, to the Y variable. The X will contain four input features and the Y will contain one feature, the class label, so it adds up to a total of five features and 150 data samples representing 150 iris flowers, each of which belongs to one of three classes: iris setosa, iris versicolor and iris virginica. If we look at the data dimensions, we will see that there are 150 rows and 4 columns, corresponding to 150 flowers and 4 input features, and the Y shape will be 150, with nothing showing afterward because it is one column.

Now comes the fun part. We're going to define clf, which is the abbreviated form of classifier, and we're going to use the RandomForestClassifier function. Then we're going to call the random forest classifier by typing clf and then .fit, and as input arguments we're going to use X and Y. What this essentially does is call the classifier, which is the random forest classifier, and use its fit function, which will create a classification model. It takes two variables as input: the first one is the X variable, which holds the input features, and the second argument is the Y variable, which is the class label. So it's going to take X and create a model using Y as the class label. The model will be built in a supervised manner by learning from the class labels, finally outputting a classification using the random forest algorithm. We can see that under the hood it uses these default values, which you can modify later on to your own personal liking. We're going to cover the random forest algorithm in more depth in a future video.

So let's go ahead and look at the feature importance. We have four input features, the sepal length and width and the petal length and width, and each of them has a corresponding importance, to a different degree, for the classification model. The values are in the same order as the respective input features that we have here. The importance of each of these variables is shown by the number right here, so we can see that the most important feature is the third variable, followed by the fourth variable and then the first variable. So the third variable, the petal length, is the most important as suggested by the model, followed by the petal width, followed by the sepal length, and the least important feature is the sepal width; it still contributes, but to a lesser degree, to the classification model.

So let's make a prediction, and we're going to feed in the first data sample, which is the first flower. As we recall, there are 150 flowers and the input data is assigned to the X variable. Let's have a look. You can see that we can move cells up and down. X holds the input features and there are 150 flowers, and here we're going to use the first flower, so we can just call it by using the bracket and then 0, which is the first position, and we get that. If you change it to the second position you will get a different set of values. Python counts from zero, so that's the first position. Then we can cut that cell out.

We're going to print the results using the print function. The print function essentially just prints out whatever we put in as its argument: if I say 5, it prints 5, and if we put in a string, it outputs the string. So here we're going to print, taking as argument the predict function of the classifier, and as input to this we're going to put in the set of values for the first flower, and we see the prediction coming out. It prints out 0, and 0 means the first class; right here it's 0, so it is setosa. If it's 1 it is versicolor, and 2 is virginica. And we're going to use this one; it's essentially the same thing. You could put in the set of values directly or you could slice it directly from the X variable, so this will also give us a prediction of 0.

And what if we want to look at the probability of the prediction? We want to see how confident the model is before it arrives at its conclusion that it will predict the first class label. Let's say that it predicts the data sample to be class 1 with a probability of, say, 80%, and the probability of being class 2 is 20%. That's not the case for this one, so let's check it out. When we run it using the predict_proba function, we see that the probabilities for each of the three classes are shown below: 1, 0, 0 means that 100% of the probability goes to predicting the first class. So it is 100% confident that class one, or setosa, is the correct class for the input features shown here for the first flower. It is confident that the first flower, with this set of values here, corresponds to the first class label, which is setosa. So based on the input values of sepal length, sepal width, petal length and petal width of 5.1, 3.5, 1.4 and 0.2, it predicts it to be iris setosa.

But it's a bit difficult to read the class label by seeing 0, 1, 2, so when we run this function we're going to see the names of the class labels. There you go: instead of 0 we see the actual class label name by using this function here.

In the previous example we used the full dataset to build the classification model, but let's say that we want to use a subset of the dataset. Let's say we want to do an 80/20 split, where 80% goes to the training set and 20% goes to the testing set. So how exactly do we split the data? We're going to use the train_test_split function in scikit-learn and assign its output to four variables, called X_train, X_test, Y_train and Y_test. It takes the data directly from this function, where we specify the input arguments as follows: the X variable, the Y variable, and then the test size argument, the train and test set splitting ratio, of 0.2, meaning that 0.2 will be for the test set and 0.8 will be for the training set. So the training set gets 80% of the data samples and 20% of the data samples go to the test set. OK, so let's run that.

Let's have a look at X_train.shape and Y_train.shape, so we're going to see the dimensions of the variables. We see that there are 120 flowers and 4 features for X_train, and for Y_train there are 120 flowers and one column, the class label. So that line is about the training set, and the testing set is what follows now: X_test.shape and Y_test.shape give us 30 rows and 4 columns, corresponding to 30 flowers and four input features, and in the second one, Y_test.shape is 30 flowers and one column, which is the class label.

So we're going to rebuild the classification model on this new train/test split data. We're going to use the clf.fit function, where we defined earlier that clf refers to the random forest classifier, and we're using the default values for that. The input arguments to this function are X_train and Y_train, so it specifies the pairs of X and Y for the training set, and then we make a prediction afterward. So this runs the classification model, and now the model is trained on 80% of the data samples.

Then, similar to the previous step, we're going to print the results of the prediction, where we use the first flower as the input and predict it, and then the probability of the prediction. And so we get the same prediction, that the first flower is an iris setosa.

This is making a single prediction, meaning that we're feeding in the input values of one data sample. Let's say that we want to predict a set of samples containing more than one. You can do that: instead of using one set of features with four input values, we're going to use the whole X_test array, which comprises 30 flowers. So we simply use the print function and print the results of clf.predict, which takes X_test as the input argument, and that gives us the predictions. The predictions are shown here, and in order to see the test labels we're going to use the above lines of code here; we're going to copy that, add a cell, put it below here, and run it. We perform the prediction again, and now we see the prediction as the class labels.

So let's build the model again and perform the prediction on X_test again, and it gives us a series of values here. Let's compare them with the actual values: these are the predicted values for the labels and these are the actual class labels. We see that the first data sample is predicted to be 2 and it is also a 2 in reality; where it is 0 here, it is predicted to be 0 as well; where it's a 2, it's predicted to be a 2, too. The values here, 0, 1, 2, correspond to setosa, versicolor and virginica. So you can see the predicted values and the actual values in one go.

The last step in this Jupyter notebook file is to print the prediction score, which is the accuracy of the model. It takes X_test and Y_test as input, and it outputs a score of 0.96, so 96% accurate; it's here.

So congratulations, you have built your first classification model on the iris dataset using the random forest algorithm. In the meantime, before we release the next video, if you want to test this out with a new dataset, please feel free to do so. Let me tell you how you can do that. You're going to do it right here: instead of assigning X to be iris.data and Y to be iris.target, you're going to use a different dataset. So let's see which datasets are available for you to use. scikit-learn comes with a couple of toy datasets. The first one, which we used, was iris, and the other datasets are shown here: we have the Boston house prices dataset, the diabetes dataset and several others. So please check them out. In the next tutorial we're probably going to use some of these as examples, and if you would like to suggest some datasets as examples to use in future videos, please do so by commenting down below.

And please don't forget to try it out with a different dataset and save your progress by uploading it to your GitHub profile. As before, the best way to learn data science is to do data science. So until next time, I'll see you in the next video. Thank you for watching; please like, subscribe and share, and I'll see you in the next one. But in the meantime, please check out these videos.
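
The workflow described in the captions can be sketched roughly as follows. This is a minimal reconstruction from the spoken walkthrough, not the notebook itself: the variable names follow the captions where they are audible, while random_state=0, the lowercase y, and names such as first_flower, first_pred and accuracy are assumptions added here so the sketch is reproducible. The notebook appears to use default settings, so its exact numbers (feature importances, accuracy) may differ from what this sketch prints.

```python
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Load the iris dataset: 150 flowers, 4 input features, 3 classes.
iris = datasets.load_iris()
X, y = iris.data, iris.target
print(iris.feature_names)   # names of the four input features
print(iris.target_names)    # class names for labels 0, 1, 2
print(X.shape, y.shape)     # data dimensions

# Build a random forest on the full dataset.
# random_state is an assumption; the captions only mention default values.
clf = RandomForestClassifier(random_state=0)
clf.fit(X, y)
print(clf.feature_importances_)  # one importance value per input feature

# Predict the class of the first flower.
# X[[0]] keeps the 2-D shape (1 sample, 4 features) that predict expects.
first_flower = X[[0]]
first_pred = clf.predict(first_flower)
first_proba = clf.predict_proba(first_flower)
print(first_pred, first_proba)
print(iris.target_names[first_pred])  # class name instead of the integer label

# 80/20 train/test split, then refit on the training portion only.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0  # random_state is an assumption
)
print(X_train.shape, y_train.shape)
print(X_test.shape, y_test.shape)

clf.fit(X_train, y_train)

# Predict the whole test set and compare with the true labels.
y_pred = clf.predict(X_test)
print(y_pred)
print(y_test)

# Accuracy on the held-out test set.
accuracy = clf.score(X_test, y_test)
print(accuracy)
```

To try a different toy dataset, as suggested at the end of the video, only the loading lines should need to change, for example by swapping datasets.load_iris() for another classification loader such as datasets.load_wine().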
Info
Channel: Data Professor
Views: 100,389
Keywords: Data Professor, DataProfessor, data, bigdata, big data, data science, data mining, data science tutorial, data mining tutorial, data science project, AI, python, python programming, learn python, classification, iris, machine learning, machine learning model, classification model, random forest, decision tree, data set, dataset, iris data, scikit, scikit-learn, scikitlearn, python data science, data science python, python data science project
Id: XmSlFPDjKdc
Length: 19min 58sec (1198 seconds)
Published: Wed Mar 11 2020