Decision Trees and Hyperparameters | Solving a real-world problem from Kaggle

Captions
Hello and welcome to Machine Learning with Python: Zero to GBMs. This is an online certification course being organized by Jovian. Today we're in lesson three, Decision Trees and Hyperparameters, so let's get started. The first thing we do is go to the course page at zerotogbms.com. Just point your browser to zerotogbms.com and that will bring you to the course page, where you can find some information about the course, along with all the lessons and assignments that you need to complete to get a certificate for the course. So far, we've looked at linear regression and logistic regression, and you will also be working on assignment one, where you will train your first machine learning model.

Before we open lesson three, I just want to tell you about our Discord server. At the top of the page, you will find a link to it. In the Discord server, you can interact with thousands of members of the Jovian community who are taking this course with you right now. The first thing you should do once you enter the Discord server is introduce yourself in the introduce-yourself channel. Then check out the agenda channel for some recent discussions. You should also carefully review the community guidelines, just to keep the Jovian community a friendly environment; there's a list of small do's and don'ts that you need to follow. Then you have several channels which you can use to learn and get help while you're working on the project. The ask-questions channel is for questions about any concepts or errors that you're facing. The assignment-help channel is to get help with the assignment: if you're stuck at some point, or you're facing an error that you cannot resolve, post there. The study-hours channel gives you information about study hours that we conduct from time to time using voice channels; we're conducting a study hour every Wednesday and you can find the details there. The share-your-work channel is a place for you to share any interesting data science projects that you have worked on, on Jovian or elsewhere. And finally, the shared-resources channel is a place to find interesting machine learning and data science resources; you can also share any interesting blogs and articles that you find there. So that's the Discord server; do make use of it. Being part of a community, part of an online classroom, is the best way to take an online course, and we've seen in the past that being active in the community forums makes you four to five times as likely to complete the course. Even if you just check out the Discord server daily, that's good enough to keep you motivated.

With that, let's scroll down to lesson three, Decision Trees and Hyperparameters. On the lesson page you will find a recording of the lesson and a description of the topics that we're covering today. The code we will be executing today is present in this Jupyter notebook, which you will find in the lesson notebooks tab. We have just one Jupyter notebook for today. You can read through it and view the code, but if you want to run the code, you need to click Run. So I'm just going to click "Run on Colab" here. This takes the Jupyter notebook that you are viewing on the lesson page and puts it into your Google Drive account, and from your Google Drive account this Jupyter notebook is then opened up using Google Colab.
So you will be asked to connect your Google Drive account, if you haven't done that already, and once it's connected, that will bring you to this page. Now make sure to run the first cell of code on Google Colab. This is very important, because this cell connects your Jupyter notebook from Jovian to Google Colab, so that when you want to save a snapshot of this notebook, it will get saved to your Jovian profile. So just run this cell, and then you're all set to go. When you run the first cell, Google Colab will set up a server on the cloud for you, and all the code that you write from this point on will be executed on that cloud server.

The topic for today is decision trees and random forests. This is something different from what we've covered so far, which is linear models like linear and logistic regression. Decision trees, and random forests especially, are very powerful and widely used machine learning models. It's most likely that in your professional work you will be building decision trees and random forests most of the time, and one of the primary reasons for that is the interpretability of these models. We will also talk about why these models learn the things that they do and why they give the results or predictions that they do.

Here's what we're going to cover today. We will download a real-world dataset, just as we've been doing in the previous lessons. We will prepare the dataset for training a machine learning model. Then we will train and interpret some decision trees, and then we will move on to training and interpreting random forests. We will talk about overfitting, hyperparameter tuning, and regularization; these are some of the central problems in machine learning, and this is where you will spend a lot of your time when you are improving your models. And finally, we will talk about how to make predictions on single inputs as well. Now, I'm running this notebook on Colab, but you can also run it locally; things may be a little bit slower depending on your configuration, but if you have a good CPU and enough RAM, you should be able to run this locally as well.

Just as in the previous lessons, we will take a practical and coding-focused approach. We will learn how to use decision trees and random forests to solve a real-world problem from Kaggle, and we're going to use the same dataset that we used last time. This will also give you a chance to contrast decision trees with linear regression and logistic regression models. We will use the Rain in Australia dataset, which contains about 10 years of daily weather observations from numerous Australian weather stations. Here is a small sample from the dataset. On several dates, you have information captured from several locations, and this information includes minimum temperature, maximum temperature, rainfall, evaporation, etc. The last two columns are the most interesting ones: whether it rained on that day, and whether it rained on the next day. Of course, we have RainTomorrow only because we are looking at historical data. As a data scientist at the Bureau of Meteorology, you are tasked with creating a fully automated system that can use today's weather data for a given location to predict whether it will rain at that location tomorrow. So you want to create an automated system which can essentially predict the likelihood of rainfall all over Australia. Let's see how far we can get.
Before we begin, we'll install and import some of the required libraries that we've been using throughout: opendatasets for downloading the dataset, pandas for loading datasets and working with data frames, NumPy for mathematical computing, scikit-learn, which contains all the machine learning models that we'll train, and Jovian for saving snapshots of your notebook. So let's import all the libraries: opendatasets as od, matplotlib.pyplot as plt, seaborn as sns, pandas as pd, numpy as np, matplotlib, and jovian. We will also use the os module a bit. These are standard conventions that you should follow in all your Jupyter notebooks; if you don't follow them, you will find that people get confused. Of course, you can call pandas anything you want, but prefer calling it pd, because that's how you will see it all over the internet. Finally, we're also setting some display options here, so that the graphs are a little bit bigger and we see more information within our pandas data frames.

The first step is to download the dataset. As we did last time, we will download this dataset using the opendatasets library directly from Kaggle, within Jupyter. We just run od.download, and when we run od.download, we will be prompted for a Kaggle username and an API key. Here's what that looks like. Okay, let me just run this. Now, this is one way to provide the information, your Kaggle username and your API key. Another thing you can do is click on the file explorer, find the upload button, and upload your kaggle.json file. If you place your kaggle.json file next to your notebook, then opendatasets will automatically find the credentials and download the dataset. As you can see here, this was about a 3 MB dataset that was downloaded automatically. And of course, if you don't have your kaggle.json file, go to kaggle.com, which is where we're downloading the dataset from, click on Account, scroll down to "Create New API Token", and that will download the kaggle.json file for you. So you can either provide your Kaggle username and API key directly, or you can upload the file to Google Colab, or just place it next to the notebook if you're running it locally.

The dataset is now downloaded and extracted into this folder, weather-dataset-rattle-package, and we can check that using os.listdir. Now, I'm just going to click Edit > Clear all outputs here, so that we can run all the code fresh and we do not have any stale outputs in our notebook. The file weatherAUS.csv contains the data, so let's load it into a pandas data frame. I'm going to run pd.read_csv, and that loads the data frame up. Here's the data; we looked at it last time as well. We have date, location, minimum temperature, a bunch of weather parameters, and finally RainToday and RainTomorrow. Our objective is to take all of this information, maybe not the date, because everything is on a different date, but everything except the date, and use it to predict whether it will rain on the next day. And hopefully we can then use the model on some future data as well. Let's check the column types of the dataset. If we just do raw_df.info(), it tells us that there are a total of about 145,000 entries, and you can see the types of each column.
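Here's a minimal sketch of these setup steps; the exact Kaggle dataset URL is an assumption, and the folder and file names follow the description above.

    import os
    import opendatasets as od
    import pandas as pd

    # Download the Rain in Australia dataset from Kaggle (prompts for Kaggle credentials
    # unless a kaggle.json file is placed next to the notebook).
    dataset_url = 'https://www.kaggle.com/jsphyg/weather-dataset-rattle-package'  # assumed URL
    od.download(dataset_url)

    # The CSV is extracted into a folder named after the dataset.
    data_dir = 'weather-dataset-rattle-package'
    print(os.listdir(data_dir))

    # Load the weather observations into a pandas data frame and inspect the column types.
    raw_df = pd.read_csv(data_dir + '/weatherAUS.csv')
    raw_df.info()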
So you have object, which is mostly strings or categorical data, and then you have float64 and int64, which are numeric data. The others are mostly categorical, and sometimes these can be string data as well. You will notice that some of these columns have null values too, so we need to deal with them. Now, one of the things I'm going to do is remove any rows where the value of the target column is not available. We want to train a model that can predict whether or not it will rain tomorrow, so giving the model rows where we don't know whether it rained the next day will not be useful for training, right? So we will remove any rows where the target column is empty; I'm just going to drop rows using the subset RainTomorrow. Here's an exercise for you: perform some exploratory data analysis on this dataset, if you haven't already, and study the relationships of the other columns with the RainTomorrow column. See if you can figure out, before we build this model, which columns are the most important in determining whether it will rain tomorrow.

I'm also going to save my notebook, so I'm running jovian.commit here, and I am asked to enter my API key, which I can find on my Jovian profile by going to jovian.ai; I just click there, which copies the API key, and I can paste it here. This saves a snapshot of the notebook to my profile, so that I can come back and continue where I left off in my next session; the Colab notebook, of course, will shut down after some time.

All right, we've done most of this before, so let's go through it quickly. We will perform some steps to prepare the dataset for training. The first step is to create a training, validation, and test split. Remember, it's common practice to set aside only about 60% of the data for training the model, then use 10-20% of the data for validation, which is to evaluate different versions of the model as we try out different parameters, and finally, to report the final accuracy, we use the test set. Now, it's common practice to do a random split, but in this case, because the data is ordered by date, and because the model we create is going to be used in the future, we can simulate that, which is using a model trained on the past to predict values in the future, by picking the last couple of years for the test set. We can pick one year before that for the validation set, and then all of the earlier data can be used for the training set. This is the distribution of the number of rows per year, plotted using a simple count plot with seaborn.

So here's what we'll do. We will create a train data frame, which is the fraction of rows of the raw data frame where the year is less than 2015. Here is how we've computed the year: we have taken the date column, raw_df.Date, parsed each value as a datetime field, and extracted the year from that datetime. This is basic pandas data manipulation that you should check out if you're not familiar with it already. So data before the year 2015, which is up to 2014, is used for training, the validation data is the year 2015, and the test data frame is the years 2016 and 2017.
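Here's a rough sketch of dropping the empty targets and doing this chronological split, assuming raw_df was loaded as above (the Date and RainTomorrow column names follow the dataset description).

    import pandas as pd

    # Drop rows where the target column RainTomorrow is missing.
    raw_df = raw_df.dropna(subset=['RainTomorrow'])

    # Extract the year from the Date column.
    year = pd.to_datetime(raw_df.Date).dt.year

    # Chronological split: train on data before 2015, validate on 2015, test on 2016-2017.
    train_df = raw_df[year < 2015]
    val_df = raw_df[year == 2015]
    test_df = raw_df[year > 2015]

    print(train_df.shape, val_df.shape, test_df.shape)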
Again, we are doing this only because we have chronologically ordered data, and this is how our model will be used in the real world. If you do not have chronologically ordered data, then you use a random split, and there is a method in scikit-learn called train_test_split which you can use to do that. So now we have about 98,000 observations for training, and about 17,000 samples for validation. As we try different kinds of models, and we'll try quite a few today, we can use the validation data frame to evaluate how well those models are performing. And finally, we have a test data frame; this is where we will report the final accuracy of our model. Now here's an exercise for you: if you want to build on top of this, you can try to scrape the climate data for the recent years from this website, the official website of the Bureau of Meteorology in Australia. You can try to scrape the data from 2017 to 2021 and train a model with the enlarged dataset. In fact, this is how this dataset was created in the first place, by scraping data, so web scraping is a great way to create new datasets for machine learning.

All right, so we have created the training and validation splits, and the next step is to identify the input and target columns, because not all the columns will be useful for machine learning, and it's also very important to separate out the input and the target data. One common mistake people make initially is to accidentally use the target column to predict the target column, in which case your machine learning model isn't really doing anything: it is taking the value of the target column and simply returning it. So always make sure to carefully check the columns of your data frame and separate out the input and output columns. If I check the train data frame, which is just a subset of the rows of the raw data frame, you can see that we don't want to use the first column, Date, and we don't want to use the last column, RainTomorrow, as inputs. Why not Date? Because we are going to use the model in the future, so a specific date will not be a useful input. And RainTomorrow is not useful as an input because this is the value that we want to predict. So the input to the model should be the rest of the columns, and the prediction of the model should be compared with the target column, which is RainTomorrow.

Here's how we set that up: we take train_df.columns, convert it into a list, and exclude the first and the last value from that list. Now we can take just the input columns from the training data frame and create the training inputs. I'm creating a copy here because we are going to make some modifications in the next few steps. We can also separate out the target column. The target is just a single column, so when we select train_df[target_col], that is going to return a pandas Series, not a data frame, so just keep that in mind. Here's what that looks like: train_inputs contains Location through RainToday, and train_targets contains just the value of RainTomorrow. It's always a good idea to check what information you have within your data frames before you move forward. Similarly, here we are creating the validation inputs and validation targets, and the test inputs and test targets.
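A minimal sketch of separating inputs and targets, assuming the train_df, val_df, and test_df data frames from the previous step.

    # All columns except the first (Date) and the last (RainTomorrow) are inputs.
    input_cols = list(train_df.columns)[1:-1]
    target_col = 'RainTomorrow'

    # Copy the inputs so later imputation and scaling don't modify the original frames.
    train_inputs, train_targets = train_df[input_cols].copy(), train_df[target_col].copy()
    val_inputs, val_targets = val_df[input_cols].copy(), val_df[target_col].copy()
    test_inputs, test_targets = test_df[input_cols].copy(), test_df[target_col].copy()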
Next up, let's also identify the numeric and categorical columns within the data, because we will need to deal with them separately. One simple thing you could do is look through train_df manually and make a list: MinTemp is numeric, MaxTemp is numeric, Rainfall is numeric, etc. But what you should ideally be doing is detecting these automatically. Here's how: if you call .select_dtypes on train_inputs (or any data frame), it selects only the columns with matching dtypes, and if you provide np.number, which encompasses float, int, and all the other numeric data types, you get a data frame containing just the numeric columns. Then you can access .columns, which gives you the list of numeric columns, and finally convert that into a list using .tolist(). To get the list of categorical columns, all you need to do is change this to 'object'; when you pass 'object', you get back the list of categorical columns. Now, how did I find this out? I simply looked it up online: how do you find the numeric and categorical columns in a data frame? Once I found it, I kept it written in my notebook so that I can use it anytime. So these are the numeric and categorical columns.

Now, one thing you might want to do at this point is decide if you really need all the columns, because every column introduces new data, which may potentially slow down your training. In this case, we have a small enough dataset, so we do not need to worry about it, but you can do some analysis on how closely the columns are correlated with the target, maybe select a subset of the columns instead of all of them, and observe how that affects the results. Does it lead to a large decrease in accuracy, or is the decrease insignificant? If it's insignificant, then it's probably okay to drop a few columns and just use the ones that are most important. Try it out, observe it, and try to get a feel for when it makes sense to drop some columns. For now, we are going to move ahead with all the columns.

The next important step is to impute missing numeric values, which means we want to take all the missing values, because machine learning algorithms can't work with missing values (they will throw errors at you), and replace them with some other values. How do you check for missing values? You can take train_inputs, pick just the numeric columns, and call .isna(), which replaces each value with True or False depending on whether it is NaN, and then do a .sum(). Chaining pandas commands like this is a useful skill to learn: you always think about what you want to get to and what the incremental steps are that will take you there. I may also add a sort_values on the resulting series, with ascending=False. So it seems like Sunshine has the highest number of missing values, followed by Evaporation, then Cloud3pm, Cloud9am, Pressure9am, and so on.
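A short sketch of both steps, assuming the train_inputs data frame from above.

    import numpy as np

    # Automatically detect numeric and categorical (object) columns.
    numeric_cols = train_inputs.select_dtypes(include=np.number).columns.tolist()
    categorical_cols = train_inputs.select_dtypes(include='object').columns.tolist()

    # Count missing values per numeric column, largest first.
    missing_counts = train_inputs[numeric_cols].isna().sum().sort_values(ascending=False)
    print(missing_counts)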
So all these numeric columns have some missing values, and we are going to replace them using a simple strategy, which is basically replacing them with the average of that column. For this, we can import SimpleImputer from scikit-learn. We create an imputer object and specify the strategy that we want to use, strategy='mean'. After creating the imputer object, we call .fit and give it the numeric column data, which is the data from all the numeric columns in our data frame, and the imputer figures out the average for each of those columns. Once you've fitted the imputer, which means it has found the averages (or whatever statistic we want to use) for each column, we can actually fill the columns by calling imputer.transform. So we call imputer.transform on the numeric columns of train_inputs; that fills in all the empty data in the numeric columns of the training inputs and returns a new NumPy array. We can take that result and put it back into the original data frame, train_inputs, replacing the original numeric columns. The net effect of all of this is that there is no missing data in any of the numeric columns; we've filled it with the mean value. Now, mean is not the only imputation strategy; there are several others available in scikit-learn. So an exercise for you would be to try a different imputation strategy and see how it affects the final result. That's what practical machine learning is: you try different things, maybe different strategies for different columns, informed by some exploratory analysis, and figure out the strategies that work best for the problem.

Next up, we are going to scale the numeric features. Scaling simply means we want to take the range of each numeric feature, which is its min and max, and bring it down into a zero-to-one range. As you can see, in the training, validation, or test dataset, each numeric feature has a different range: MinTemp is about -8 to +31, wind speed is about 7 to 135, whereas values like pressure can range from about 988 to 1039. Because there are a lot of numerical computations that happen inside the machine learning algorithm, and ultimately a single loss value is optimized, we don't want any specific feature to dominate the training process; we want to give every feature a level playing field to participate in the training of the model. That is why we scale all of these feature values to the range zero to one, and we do that using MinMaxScaler. Here we are creating a MinMaxScaler, then we call .fit on it with the data from all the numeric columns, so it figures out, for each column, the minimum and the maximum value. Then we call scaler.transform with the data from the numeric columns, and it scales them into the zero-to-one range. We take that output and put it back into our training, validation, and test data frames. The net result is that the inputs change from a variety of different ranges to the zero-to-one range. Now, zero-to-one is not the only scaling strategy; there are several others, so you should try out a different one. Specifically, StandardScaler is something worth checking out; observe how that affects the results.
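Here's a compact sketch of imputation and scaling under these assumptions; train_inputs, val_inputs, test_inputs, and numeric_cols come from the earlier steps, and both transformers are fitted on the training inputs only.

    from sklearn.impute import SimpleImputer
    from sklearn.preprocessing import MinMaxScaler

    # Fill missing numeric values with the column mean (statistics learned from the training inputs).
    imputer = SimpleImputer(strategy='mean').fit(train_inputs[numeric_cols])
    for df in (train_inputs, val_inputs, test_inputs):
        df[numeric_cols] = imputer.transform(df[numeric_cols])

    # Scale every numeric column to the 0-1 range using the training set's min and max.
    scaler = MinMaxScaler().fit(train_inputs[numeric_cols])
    for df in (train_inputs, val_inputs, test_inputs):
        df[numeric_cols] = scaler.transform(df[numeric_cols])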
Next, we're going to encode the categorical data. Machine learning algorithms can only work with numbers, and in our data frames we have some categorical data. If I just check train_df, you can see that you have Location, which is categorical, then wind gust direction, which is also categorical, and then a bunch of other categorical columns, things like RainToday. In fact, that's what we've listed in categorical_cols: Location, the wind direction columns, and RainToday. What we're going to do is perform one-hot encoding for the categorical columns.

For the categorical columns, we do need to first fix the NaNs. So I'm going to fill NaNs wherever we have them in the categorical columns: I take train_df[categorical_cols].fillna and fill all NaNs with the value 'Unknown', and I'm going to do that for the validation and the test data frames as well. We did fill the missing values in the numeric columns, but we did not do that for the categorical columns, and you can see, if I just pick the categorical columns, that some of them (most of them, actually) have NaN values. So wherever we have NaN values, we fill them with the string 'Unknown', just so that the one-hot encoder doesn't complain, and let's do that in place. Let's try that again. All right, let me just fix this. I believe this is an issue caused by the version of scikit-learn: this was something that worked on my computer, but is not working on Google Colab. Whenever you face such issues, where something works in one place but not in another, that is probably because of version differences. Okay, let's do this one last time, and that fixes it. So watch out for version differences between libraries. If you ever want to check the version of a particular library, run pip list, which will show you a list of all the installed libraries and their versions; you can check the version on your computer and the version on Colab, or wherever you're running, and identify the discrepancy. The way to install a specific version is to run pip install scikit-learn, for example, followed by == and the version you want to install.

With that out of the way, we can now one-hot encode our columns. By one-hot encoding, what we want to do is take all the categorical columns, pick the values in those columns, and create a separate column for each category. Those category columns will simply contain ones or zeros, depending on whether a particular row belongs to that category; again, this is something we discussed in detail in the previous session. So I will just run this code here, which first creates a OneHotEncoder and fits it to the inputs we have, then creates a list of new feature names, one for each category, and you can see what these category names are.
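A sketch of these two steps, filling categorical NaNs and fitting the encoder. The handle_unknown option is an assumption to keep the encoder tolerant of unseen categories, and the keyword names shown match older scikit-learn releases; newer versions rename sparse to sparse_output and get_feature_names to get_feature_names_out.

    from sklearn.preprocessing import OneHotEncoder

    # Replace missing categorical values with the placeholder string 'Unknown'.
    for df in (train_inputs, val_inputs, test_inputs):
        df[categorical_cols] = df[categorical_cols].fillna('Unknown')

    # Fit a one-hot encoder on the training categories (dense output, ignore unseen categories).
    encoder = OneHotEncoder(sparse=False, handle_unknown='ignore')
    encoder.fit(train_inputs[categorical_cols])

    # One new column name per (column, category) combination.
    encoded_cols = list(encoder.get_feature_names(categorical_cols))
    print(encoded_cols[:10])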
So, for each categorical column and each category combination, we have one new column name, and then we can transform the data from the categorical columns into one-hot vectors and put them back into our data frame. The net effect of this is that for every categorical column, for example Location, we have a bunch of separate columns like Location_Adelaide, Location_Albany, Location_Albury, etc., containing zeros, with a one for the specific location that the row represents; for example, a one under Location_Albury, because this location is Albury, and zeros elsewhere. Now, one-hot encoding is not the only encoding strategy; there are other encoding strategies as well, so I encourage you to try them out and observe how they affect the results.

As a final step, let us drop the textual categorical columns from our inputs. I'm creating these new X_train, X_val, and X_test variables, which contain simply the numeric columns, which have been imputed and scaled to the zero-to-one range, and the encoded categorical columns. So we are removing the actual string categorical columns and keeping just the encoded ones, and this is the input that we will use to train and evaluate our model. Of course, we also have the targets: train_targets, val_targets, and test_targets. Here's what the input to our model looks like. Let's save our work before continuing. All of this is something that we did last time as well, so it should start to feel fairly standard, even boring by now, because these are the steps that you will take for pretty much every machine learning problem.

Now let's talk about training and visualizing decision trees. A decision tree, in general parlance, represents a hierarchical set of binary decisions. For example, here is the kind of decision tree that you might set up to decide whether or not to accept a job offer. If the salary is not between 50,000 and 80,000 dollars, you decline the offer. If it is in that range, then maybe you check whether the office is close to your home; if it is not, you decline the offer. Otherwise, you check if the company provides a cab facility, and if it doesn't, maybe you decline the offer; otherwise you accept it. This kind of strategy is how we make a lot of decisions in the real world. In fact, this is how a lot of processes are set up, and if you think carefully, this is how programs are also set up, where we write a lot of if-else statements to arrive at a certain decision.

A decision tree in machine learning works exactly the same way, except that we let the computer figure out the optimal structure and hierarchy of decisions, instead of coming up with the criteria manually. Applying it to our problem of whether or not it will rain tomorrow: we let the computer figure out the most important criterion to decide whether it will rain tomorrow, and after checking the value of that criterion (let's say it's whether it rained today or not), there is a different subtree based on whether it rained today and a different subtree based on whether it did not. So if it did rain today, maybe we simply look at the pressure.
And if it did not rain today, maybe we look at the wind speed at 3 PM, and so on. So you can have different subtrees on either side, and we will see how these trees come up. The important point is that we are not creating those trees ourselves; we are letting the machine learning model figure out what the right criteria and the right decision points are to best fit the data.

To train a decision tree, we simply use the DecisionTreeClassifier model from scikit-learn. Why a decision tree classifier? Because this is a classification problem. Remember, there are two types of problems, classification and regression. In regression, you're trying to predict a continuous value, for example the medical charges for an insurance applicant. In classification, you're trying to classify the input into one of two (or more) categories; here, we're trying to classify today's measurements based on whether or not it will rain tomorrow, yes or no. That's why we're using DecisionTreeClassifier; if it were a regression problem, we would use DecisionTreeRegressor.

So from sklearn.tree we import DecisionTreeClassifier, and then we create the decision tree model by simply creating an object of the class DecisionTreeClassifier. There is some randomness involved in how decision trees are built, so if you want to get the same output each time you run this code, just provide a value for random_state, for example random_state=42. This initializes the random number generator inside the decision tree, so each time you run the code you get the same randomization and hence the same outputs. If you do not want the same output each time, you can remove the random_state, but it is generally recommended to set one (you can pick any number you want) so that you can replicate your results; otherwise your results will not be reproducible.

All right, so now we've created the model, and the next step is to fit it. We give the model the training data, which is all the numeric columns (imputed and scaled to the range zero to one) plus the encoded categorical columns, and we give it the targets, which is simply the yes/no value for whether it will rain tomorrow for each row of the inputs. We run that, and it takes maybe a couple of seconds; it took 2.8 seconds, and our decision tree classifier has been trained.

Okay, so what just happened? Let's try to use this classifier, see how it works, and then we'll try to visualize it as well. The first thing to do after training any model is to make some predictions with it and evaluate how well it is doing. Here's how you make predictions: if we call model.predict and give it a set of inputs, it gives us predictions that we can look at. I'm going to call model.predict on X_train, and this is what X_train looks like. We are giving the model all this data; all of these are numbers, the missing values have been filled in, the categorical columns have been converted to one-hot vectors, and the model gives us some predictions. What do those predictions look like? Well, the predictions are either no or yes.
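Here's a rough sketch of assembling the final inputs and training the tree, assuming the imputed and scaled numeric columns, the fitted encoder, and the column lists from the earlier steps.

    from sklearn.tree import DecisionTreeClassifier

    # Transform categorical columns to one-hot vectors and keep only numeric + encoded columns.
    for df in (train_inputs, val_inputs, test_inputs):
        df[encoded_cols] = encoder.transform(df[categorical_cols])

    X_train = train_inputs[numeric_cols + encoded_cols]
    X_val = val_inputs[numeric_cols + encoded_cols]
    X_test = test_inputs[numeric_cols + encoded_cols]

    # Create and train the decision tree (random_state makes the results reproducible).
    model = DecisionTreeClassifier(random_state=42)
    model.fit(X_train, train_targets)

    # Predictions on the training set are strings: 'Yes' or 'No'.
    train_preds = model.predict(X_train)
    print(train_preds[:10])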
How does the model know that it needs to predict no or yes? Because we called model.fit with our targets, and our targets have these yes/no values, so while the model was learning from the data, it identified that it needs to predict a yes or no value. Internally, of course, the model represents this yes/no target as a zero or one, but to show us the output it returns the strings yes or no. So now we have some predictions from our model: we called model.predict on our input data, the training data itself, and we got some predictions. It seems like there are a lot of no's here, but just to make sure that we also have some yeses, I'm going to run pd.value_counts; value_counts takes a list and tells you the counts of the unique values. It seems like there are about 76,000 no's and about 22,000 yeses in the predictions. So, based on whatever logic it has learned, the model is actually predicting different things; it's not just predicting no every time, so it seems to have learned something.

Now, how well has it learned? That is something we can evaluate by computing an accuracy score. We have the training predictions, and we have the training targets, which are the actual values, and the simple thing we can do is compare each value: we compare the first values and they match, the second values and they match, the third values and they match, and we count the percentage of values that match. So I'm just going to run accuracy_score, which is imported from sklearn.metrics and which simply counts the fraction of matches. I'll run accuracy_score on train_preds and train_targets and see how well the model has done on the training set. It seems like the accuracy of the model on the training set, on which it has been trained, is 99.99%, so practically a hundred percent; the tiny difference is just floating point. The decision tree also returns probabilities for each prediction, so we can check those too; to get probabilities you simply call model.predict_proba and give it the same input. It looks like the model is very confident about its predictions as well: we have an accuracy close to a hundred percent and a probability of one for most of the predictions (you can verify whether this actually holds throughout).

So it seems like we've learned everything there is to learn from this data. Or have we? The training set accuracy is close to a hundred percent, but we can't rely solely on the training accuracy, because your model will not be used in the real world on the same training data; in the real world, your model will see data that it has not seen before. So far it hasn't seen the validation set, so we must evaluate the model on the validation set. We make predictions on the validation set by calling model.predict, and then we can compare the validation predictions, obtained from the validation inputs, with the validation targets using the accuracy_score function. But because this is such a common operation, scikit-learn models already have a .score method.
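A small sketch of these evaluation steps, using the model and data from above.

    from sklearn.metrics import accuracy_score

    # Fraction of training predictions that match the training targets (close to 100% here).
    print(accuracy_score(train_targets, model.predict(X_train)))

    # Predicted class probabilities for each training row.
    train_probs = model.predict_proba(X_train)
    print(train_probs[:5])

    # model.score makes predictions on the given inputs and compares them with the targets.
    print(model.score(X_val, val_targets))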
So, in the case of decision trees, if you call model.score, give it the inputs (in this case the validation inputs) and the targets, it will make predictions on the validation inputs, compare those predictions with the targets, and give you the accuracy. It turns out that the accuracy on the validation set is just 79.2%. So the accuracy on the training set was 99.99%, as we saw, and the accuracy on the validation set is just about 79%. And in fact, 79% is only marginally better than always predicting no. For example, if you look at the validation data and compute the percentage of values that are no, by getting the value counts and dividing them by the length of the validation dataset, it turns out that 78.8% of the rows have the target no and 21% have the target yes. Which means that a model that simply predicted no all the time would be 78.8% accurate, and our fancy decision tree, which is a hundred percent accurate on the training set, is only marginally better, less than one percent, only about half a percent better than a dumb model which always predicts no.

So what's going on here? How is the model a hundred percent accurate on the training data, but completely failing to learn anything useful about the validation data? Here's what has happened: it appears that the model has learned the training examples perfectly, which means it has basically memorized all the training examples. It's like memorizing the answers to all the questions in your textbook for an exam, and then going to the exam where none of the questions come up with exactly the same values: you are likely to score a very low mark. In the same way, the model has learned all the training examples but does not generalize well to previously unseen examples. This phenomenon is called overfitting, and reducing overfitting is one of the most important parts of any machine learning project, especially when you're dealing with tree-based models like decision trees. So let's see how to reduce overfitting.

The first step in understanding what's going on is to visualize the decision tree that has been learned from the training data. I told you in the beginning that a decision tree is a hierarchical tree of binary decisions, and our model actually builds a decision tree which is pretty close to what we saw above. We can visualize this tree using the plot_tree function from sklearn.tree. So I import plot_tree from sklearn.tree; plot_tree uses matplotlib under the hood, so I'm increasing the figure size here so that we get a big image to look at. We call plot_tree with the model, and we can also pass the names of the features (the names of the columns), so that it can tell us which columns the model is looking at. Then we provide a maximum depth, because this is a very deep tree, with a depth of 40 or 50, which cannot be printed easily, so we're just going to look at two levels of the tree. And the filled argument is just some information about color: we fill some nodes of the tree with background colors.
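A minimal sketch of the visualization call described above (the figure size is arbitrary).

    import matplotlib.pyplot as plt
    from sklearn.tree import plot_tree

    # Draw only the top two levels of the learned tree, labelled with column names.
    plt.figure(figsize=(80, 20))
    plot_tree(model, feature_names=X_train.columns, max_depth=2, filled=True)
    plt.show()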
So let's run this and see what it looks like. Here's what our model's decision process looks like. The model first checks the humidity at 3 PM: if Humidity3pm is less than 0.715, it goes in this direction, and then it checks whether the rainfall is less than 0.004; if that is so, it then checks whether sunshine is less than 0.0525, and then it has multiple further checks, and so on. This is how the model proceeds: each time it makes a decision based on a value, it either goes left or right. If it has gone right, then there is another check on humidity, and once again, based on that, it goes left or right, and it keeps going. Now, we've only plotted to a depth of two, but you can plot to any depth here. There seems to be a rendering problem here; typically in this image you would see lines connecting the nodes, but you can still see the tree that's building up: this is the first decision, and based on it, this may be the second decision, and based on that, the third decision, and so on, and that keeps going till it reaches a final leaf node where there are no more decisions to be made. That leaf node contains information about which class should be returned as the output.

So I hope you can see how the model classifies a given input as a series of decisions. Of course, the tree is truncated here, but following any path from the root node to a leaf will result in a yes or a no. And I hope now you can also start to see how a decision tree differs from a logistic regression model. One important difference is that instead of having a fixed weight for every column, the conditions and weights that are applied can change as you go left or right. For example, based on whether the humidity is less than 0.7 or more than 0.7, the conditions that are applied to wind gust speed may change. And that makes sense: if it has rained today, maybe the factors I should look at are different compared to when it has not rained today. That kind of non-linear relationship can be captured better by a decision tree, and it's a bit harder to capture in a linear model. So whenever you have these non-linear relationships, it's always worth trying out a decision tree and seeing if it performs better than a logistic regression model.

Now, you may wonder how this decision tree is created. How exactly does the model figure out what the first decision should be, and the second decision, and so on? This is where you should pay attention to the Gini value: in each box you will see a Gini score. Every machine learning model has something called a loss function or a cost function, and the objective of the model is to minimize that cost. The way the decision tree does this is by using the Gini score. The Gini score represents how good a certain split is: a lower Gini score means a lower cost, which means a better split.
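To make the Gini score concrete, here's a tiny illustrative computation of Gini impurity. This helper is not part of the lesson notebook, just a sketch of the formula 1 minus the sum of squared class proportions.

    import numpy as np

    def gini_impurity(labels):
        # Gini impurity = 1 - sum over classes of (proportion of that class)^2
        _, counts = np.unique(labels, return_counts=True)
        proportions = counts / counts.sum()
        return 1 - np.sum(proportions ** 2)

    print(gini_impurity(['No'] * 50 + ['Yes'] * 50))  # useless 50/50 split: 0.5
    print(gini_impurity(['No'] * 100))                # pure node: 0.0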
Let's say that just by looking at Humidity3pm you could perfectly classify rows into will-not-rain-tomorrow versus will-rain-tomorrow. In that case the Gini score would be zero: a perfect split has a Gini score of zero. As a split gets worse and worse, the score increases; if your split is completely useless, which means that even after splitting there are 50% yeses and 50% no's on one side and 50% yeses and 50% no's on the other side, then you have a high Gini score, somewhere around 0.5. So a low Gini score means a good split, and a high Gini score means a bad split.

So what does our decision tree model do? Conceptually speaking, while training, the model evaluates all possible splits across all possible columns. Right now we are looking at this one split on Humidity3pm, but conceptually what the model has done is look at all the different columns, and for each column it has looked at all the possible split points: it has sorted the values in that column in increasing order, taken each value as a split point, performed the split, and calculated the Gini score for it. Good splits have a low Gini score and bad splits have a high Gini score. Out of all the columns and all the splits, it has selected the best column with the best split, the one that leads to the lowest possible Gini score. Of course, with just one split you cannot really get to a Gini score of zero, because you can't look at one feature and one split and perfectly predict whether or not it will rain tomorrow. But among all the possible splits, it turns out that humidity at 3 PM, whether it's less than 0.715 or not, is the most important factor: it leads to the lowest Gini score. That's how the decision tree figures out what the top-level, root node should be.

Once it has figured out the root node, which is the best split among all the columns and all possible split points, it performs that split on the data: all the training data with humidity below the threshold falls into this region, and all the data with humidity above the threshold falls into that region. And this is where the process is repeated: for the data with humidity below the threshold, it again tries all the columns and all the possible splits and figures out the best one. It turns out that if humidity is already below 0.715, rainfall less than 0.004 is the best split, and if humidity is above 0.715, humidity less than or equal to 0.825 is the best split. So that's what is happening here.

So the iterative approach of machine learning, in the case of a decision tree, involves growing the tree layer by layer. First we input all the training data, we look at all possible splits across all possible columns, and we compute the Gini score for each. Based on the Gini score, we pick the best possible split, we split the data based on the chosen split, and then we repeat the process recursively for the left split and for the right split. So we are recursively growing the tree.
We make the level one decision, then we make level two decisions with the split data, then level three decisions, then level four, and so on. And for how long does this go on? It goes on until you end up with just a single value in a leaf. In fact, you can see the counts here: at the very top you have 98,988 rows of data, and this split sends 76,000 rows to the left and 22,000 rows to the right. Similarly, this particular node has 82,000 rows of data, and its split sends 70,000 this way and 11,000 that way. So that's roughly how it works: it keeps dividing the data into parts until it gets to leaf nodes where you just have a single row of data, and for that row, since you already have the target, the target for that row is used as the value of the leaf. So essentially, we follow the decision tree down to a specific example from our training set, look at the label of that training example, and return the same label. I hope you now understand why the training accuracy is a hundred percent: the decision tree has literally memorized the entire training set in the form of this tree-based structure.

You can verify how deep this tree is by checking its maximum depth: you call model.tree_.max_depth, and it turns out that this tree is 48 levels deep. So within 48 decisions you will get to a leaf node, and on that leaf node you will have a label corresponding to the specific training example which lies in that leaf. That is one way to visualize what the model learns; as I said, you would normally see arrows connecting the nodes, and I'm not sure why they're not showing up here.

Another way to display a decision tree, which can be easier to follow, is as text. You can call export_text and pass in the model; you can again specify a maximum depth up to which you want to show things, because this can get pretty large as well, and you provide the list of feature names here too. Here's what the textual representation looks like. We check if the humidity is less than 0.72: if it is, we go down this part of the tree, otherwise we go down the other part (which we've not shown here; we've just shown a few lines, because the tree itself, even with 10 levels of depth, is very large). So first you check humidity, then rainfall, then sunshine, then pressure, then wind gust speed, then humidity again, then wind direction, then the location. If the location is Watsonia, then we check the cloud cover at 9 AM, then the wind speed at 3 PM, and then the pressure. If all of these checks succeed, then we return yes, it will rain tomorrow. If the pressure check fails, if the pressure is not less than or equal to 0.47, then we return no. If the wind speed check fails, if the wind speed is not less than or equal to 0.07, then we check the minimum temperature, and there's another branch of decisions to be made there. Similarly, if the cloud cover is greater than 0.83, we check the cloud cover at 3 PM and then return yes; otherwise we check the temperature and return yes, or else there is another subtree of decisions here.
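A sketch of both checks, the tree depth and the textual dump, assuming the trained model from above.

    from sklearn.tree import export_text

    # Depth of the fully grown tree (48 levels in this case).
    print(model.tree_.max_depth)

    # Textual view of the first ten levels of the tree.
    tree_text = export_text(model, feature_names=list(X_train.columns), max_depth=10)
    print(tree_text[:2000])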
So the idea is that this is the same as the decision tree we saw above: we make these hierarchical decisions, and the model has learned which decision to make first by analyzing all possible decisions. One small note here: that's how you should think about it conceptually, but the model has not actually analyzed every possible decision, because that would be very inefficient. There are certain techniques, what are called heuristics, which are basically strategies to find good enough decisions if not the best ones, and there is some randomization mixed in there as well. So, as an optimization, there is some randomization and there are strategies to pick, if not the best, at least a good enough split. Those are the internals of decision trees, which we don't really need to worry about.

Based on this discussion, can you now explain why the training accuracy is a hundred percent, whereas the validation accuracy is lower? Think about it: it's because the model has literally learned every training example, and when it sees an example that does not exactly match a training example, it tries to categorize it into one of the existing training examples by following one path of the decision tree, and that may or may not end well, because it ultimately boils down to a specific training example. This is what is called overfitting, where your model has memorized specific training examples and does not generalize well to examples it has not seen before.

Okay, let's keep going. Based on the Gini index computation, a decision tree assigns an importance value to each feature. Again, there is a certain calculation involved in figuring out how the importance is assigned, but these values can be used to interpret the results given by a decision tree. If you check model.feature_importances_ on a trained decision tree, it gives you a list of numbers; here's what that list looks like, with one importance for every feature. Remember, the input to our model, X_train, had 119 columns, so you will see 119 values here; in fact, if I just check X_train.columns, you can see that there are 119 columns. So this is the importance for minimum temperature, this is the importance for maximum temperature, this is the importance for rainfall, this is the importance for evaporation, and so on. Let's create a data frame out of it: I'm creating a pandas data frame with one column called feature, the name of the column in the original data frame X_train, and one column called importance, which is the importance of that feature, and then we sort those values by importance in descending order. Let's look at the 10 most important columns. We have Humidity3pm, which seems to be the most important column at about 0.26, then Pressure3pm, which seems to be the next most important, then rainfall, and so on.
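A sketch of building that importance table and plotting it; the column names 'feature' and 'importance', the plot title, and the figure size follow the description above or are arbitrary.

    import pandas as pd
    import matplotlib.pyplot as plt
    import seaborn as sns

    # Table of (feature, importance) pairs, sorted so the most important features come first.
    importance_df = pd.DataFrame({
        'feature': X_train.columns,
        'importance': model.feature_importances_
    }).sort_values('importance', ascending=False)
    print(importance_df.head(10))

    # Horizontal bar chart of the 10 most important features.
    plt.figure(figsize=(10, 6))
    plt.title('Feature Importance')
    sns.barplot(data=importance_df.head(10), x='importance', y='feature')
    plt.show()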
You will find that these importances line up with the decision tree itself: you can see humidity, then rainfall, then wind gust speed; sunshine shows up too, and pressure doesn't show up yet, but if you went maybe one level deeper, you would also see pressure. So these are the importances: humidity, pressure, rainfall, WindGustSpeed, sunshine, etc. We can also plot these as a bar plot, as in the sketch above: sns.barplot creates a horizontal bar plot of the 10 most important features. It turns out that Humidity3pm has a feature importance higher than 0.25, whereas the next most important feature seems to be Pressure3pm, followed by Rainfall, WindGustSpeed, etc. These values should be interpreted in relative order: mostly you just want to use them to figure out which features are more important than others.

So that's how you interpret a decision tree. You can see the actual decision-making process of the tree: given an example, you can draw the tree and walk through it and see why it arrived at a certain answer, and you can also see the importance of the different factors. This is where you can check whether, say, humidity has a lot of missing values: maybe we filled a lot of missing values into humidity, and maybe we are misleading the model by filling all those values. Maybe we should remove the humidity column, or maybe we should fill those missing values differently, and so on. You need to go back and forth; you need to go back and check whether your data makes sense, given the feature importances you're seeing.

So that's how you train a decision tree: you import the DecisionTreeClassifier from sklearn.tree, you fit it to the input data, and then you can analyze it and evaluate it using the validation dataset. We saw that the decision tree classifier we trained memorized all the training examples, leading to a hundred percent training accuracy, while the validation accuracy was only marginally better than a dumb baseline model. At this point, our decision tree is basically useless, because it has just memorized all the training examples, and this phenomenon is called overfitting. In this section, we will look at some strategies for reducing overfitting. You will hear a few new terms now that come up often in machine learning. Overfitting simply means that you are doing very well on the training data but fairly poorly on the validation data; we'll define it a bit more concretely in a short while. The process of reducing overfitting is known as regularization. So whenever you see regularization, regularization techniques, regularization coefficient, regularization component, etc., all of that is concerned with reducing overfitting, which means trying to increase the validation accuracy, or get it closer to the training accuracy.
And sometimes we may be okay giving up some training accuracy to get a better validation accuracy, because the validation accuracy is what we ultimately care about. Now, how do we reduce overfitting in a DecisionTreeClassifier? When we created the classifier, we gave it just one argument, the random state. Apart from that, it also accepts several other arguments that can be used to reduce overfitting. If you check the help (by typing a question mark after DecisionTreeClassifier), you'll see that you can specify a criterion, which can be gini or entropy; this is simply the loss function, so there are two loss functions to choose from. You can specify a splitter, the strategy used to split at each node: by default it picks the best possible split (with some randomization), or you can ask for a completely random split without evaluating different splits. But here's something interesting: there's a max_depth parameter that lets you specify the maximum depth of the tree. Arguments like these, which you set when you create a machine learning model, are called hyperparameters, because the term parameter is generally reserved for the numbers inside the model. In logistic regression, the weights of the different features are parameters; in a decision tree, which column sits at the root node, what point we split at, and what the splits below look like are parameters. Anything the model learns or figures out on its own is a parameter, while things we specify up front, like max_depth, are hyperparameters. So what is maximum depth? Well, if we check the tree inside the previous model using model.tree_.max_depth, it turns out the decision tree we trained earlier went 48 levels deep, and that was one of the reasons for overfitting: it was learning every training example. So what if we didn't go 48 levels deep? What if we only went three levels deep? Let's try that. We put in a restriction that the decision tree cannot go more than three levels deep, and then we call model.fit with the same training input data and the same training targets. The model trains in a second or two, and then we compute its accuracy on the training and validation datasets by calling model.score on the training inputs and targets and on the validation inputs and targets. It turns out the model is now only about 82 or 83 percent accurate on the training set.
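A minimal sketch of that experiment, assuming X_train, train_targets, X_val and val_targets are the prepared inputs and targets from earlier in the notebook:

```python
from sklearn.tree import DecisionTreeClassifier

# X_train, train_targets, X_val, val_targets are assumed from earlier.
# The unrestricted tree trained earlier went very deep:
print(model.tree_.max_depth)  # 48 in the lecture

# Retrain with a maximum depth of 3
model = DecisionTreeClassifier(max_depth=3, random_state=42)
model.fit(X_train, train_targets)

print(model.score(X_train, train_targets))  # ~0.83 on the training set
print(model.score(X_val, val_targets))      # validation accuracy improves
```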
And this makes sense, because the model can no longer learn every training example; it can only go three levels deep, so it has to do the best it can with three levels. But this has the consequence that the model is no longer overfitting: it now performs better on the validation set than it did before. It may seem counterintuitive that a three-level-deep tree performs better on real-world data than a 48-level-deep tree, but that's because the 48-level tree was learning specific training examples, whereas the three-level tree is picking up general trends, and in machine learning you want models to pick up general trends, not memorize training examples. So the model's validation accuracy has gone up to about 83 percent from 79 percent; that's a good improvement, and even though the training accuracy has gone down, ultimately what we care about is the validation accuracy. Let's visualize the model now using plot_tree once again. Here's what the entire decision tree looks like. First we check humidity at 3 PM; if it's less than about 0.72, we go left. Here we check the rainfall; if it's less than 0.004, we go left. Then we check sunshine; if it's less than 0.0525, we go left, and finally we reach a leaf node and return its class. Whenever you reach a leaf node, you return the class of that leaf node. Similarly, along other paths you check humidity, rainfall, humidity again, and end up returning no. It seems like a lot of these leaves are no, so in many cases, as you go along this decision tree, you end up at no, but there are certain paths where you end up at yes. For example, if the humidity at 3 PM is between roughly 0.72 and 0.82 and the wind gust speed is less than about 0.28, the Gini score is 0.471 and the class is yes, so here the model returns that there will be rain tomorrow. And if the humidity is greater than about 0.82, it turns out that in all those cases you end up at yes as well. Of course, these subtrees got truncated; they could not be built beyond three levels. It's possible that with more levels allowed, some of these yes nodes would split again into noes, but because we're ending the tree at three levels, it returns no along all of these paths and yes along all of those. This is something you want to study carefully, because at this point we already know we can predict with about 83 percent accuracy simply by looking at humidity, rainfall, sunshine and wind gust speed. Just four of the 23-plus columns give us a prediction with 83 percent accuracy. Once again, we can also look at it as a textual tree: humidity less than 0.72 on one side and greater than 0.72 on the other, then you check the rainfall, and based on the rainfall value you either check sunshine or check humidity once again, and it turns into yeses and noes. Okay, so one thing you may wonder is: what is the right maximum depth to use? Should we use a maximum depth of zero?
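The visualizations themselves come from scikit-learn's plotting helpers. A small sketch of how they are typically called, using the same model and column names as above:

```python
import matplotlib.pyplot as plt
from sklearn.tree import plot_tree, export_text

# Graphical view of the (now only 3-level) tree
plt.figure(figsize=(20, 10))
plot_tree(model, feature_names=list(X_train.columns), filled=True)
plt.show()

# Textual view, often easier to navigate for deeper trees
print(export_text(model, feature_names=list(X_train.columns)))
```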
Obviously not, because with a maximum depth of zero the model would not learn anything; it would always just predict no. That would be about 79 percent accurate, and while it would be very regularized, it would not be very useful, because you haven't given your model enough power. On the other hand, if you allow the model to go 40 or 50 levels deep, it can memorize every single training example, and since it is trying to optimize for the lowest Gini score, it basically memorizes all the training data, which is bad because then the model will not generalize. So the best value for the maximum depth is going to be somewhere between zero and 40; let's try to figure out what it is. I'm defining a function called max_depth_error, which takes a max depth value as input. It creates a DecisionTreeClassifier for that particular max depth with random state 42, fits the model to the training data, calculates the accuracy on the training set and on the validation set, and defines the training error as one minus the training accuracy and the validation error as one minus the validation accuracy. If accuracy is the percentage we got right, error is the percentage we got wrong. Then the function returns a dictionary, and you'll see in a moment why. Now we take this max_depth_error function, which for a given max depth figures out the training error and the validation error, and run it through a list comprehension, trying every value of max depth from 1 to 20. This takes a while, because we're building a decision tree for every max depth value and computing the training and validation error for each of these models, and then we put all those results into a data frame. So let's give that a minute, and here you go. With a max depth of one, the training error is about 0.18, which means that just by making one decision you get a training accuracy of about 82 percent, and a validation accuracy of about 82 percent as well. As you increase the max depth, the training error goes down, which means the training accuracy improves: 0.18, 0.17, 0.16, 0.15, 0.14, 0.13 and so on, all the way down to about 0.03, which is roughly 97 percent training accuracy at a max depth of 20. And of course, if we increased the max depth further, the model would, so to speak, memorize even more training examples and get even better on the training data. But notice what's happening with the validation error: it starts around 0.17, goes down to about 0.15, and then starts to increase again, from 0.15 to 0.16, 0.17, 0.18, 0.19. And if you plot this, here's what it looks like.
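Here is what that helper and the list comprehension might look like, as a sketch built on the same assumed variable names:

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

def max_depth_error(md):
    # Train a tree of depth md and report error = 1 - accuracy
    model = DecisionTreeClassifier(max_depth=md, random_state=42)
    model.fit(X_train, train_targets)
    return {
        'Max Depth': md,
        'Training Error': 1 - model.score(X_train, train_targets),
        'Validation Error': 1 - model.score(X_val, val_targets),
    }

# One decision tree per depth from 1 to 20
errors_df = pd.DataFrame([max_depth_error(md) for md in range(1, 21)])
errors_df
```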
So here we've simply plotted the training error versus the validation error. The blue line is the training error, which is one minus accuracy, and the orange line is the validation error. In the region where both the training and validation errors are decreasing, what's happening is that you're making your model progressively more powerful: the model with max depth one is allowed to make one decision, the model with max depth two is allowed two layers of decisions, the model with max depth four is allowed four layers, and so on. Up to a certain point, it helps to add more complexity, more power, more capacity; it helps to make your model bigger. But after a certain point, once the model's capacity gets large enough, it starts to focus on memorizing the training data and stops generalizing. At that point it gets better and better on the training data and worse and worse on the validation data, and this scenario is known as overfitting. This is the graph you will see over and over again in pretty much every problem: as you increase the complexity of your model, or its size, or its power, or its capacity (many ways of looking at the same thing, ultimately a question of how many parameters are inside the model), both the training error and the test or validation error go down up to a certain point, because the model has more capacity to capture how the inputs and targets are related. After a certain point, it starts memorizing training examples, and that is where the test or validation error starts to increase, where the validation accuracy starts to drop. That scenario, where the training error is going down but the validation error is going up, is overfitting. If you train your model a little more, or increase its complexity a little more, or add one more layer to your decision tree, the training error goes down but the validation error actually gets worse. This is where you should stop: you want to pick the complexity of your model at the point where the validation error is just about to increase. So by plotting this graph we've figured out that at a max depth of seven we get as good as this decision tree can get on the validation error for the given dataset; a max depth of seven is the best depth here. This is how you regularize a decision tree, which means reducing overfitting by tuning hyperparameters. max_depth is a hyperparameter, changing its value is called tuning the hyperparameter, and by tuning this hyperparameter we have regularized the model a bit and reduced the amount of overfitting. So you can now check the validation score, and let me also print out the training score here.
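The plot itself can be produced with matplotlib, roughly like this, using the errors_df frame from the previous sketch:

```python
import matplotlib.pyplot as plt

# Training error (blue) vs validation error (orange) as depth grows
plt.plot(errors_df['Max Depth'], errors_df['Training Error'], label='Training')
plt.plot(errors_df['Max Depth'], errors_df['Validation Error'], label='Validation')
plt.xlabel('Max Depth')
plt.ylabel('Error (1 - accuracy)')
plt.legend()
plt.show()
```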
The training accuracy and validation accuracy are now both about 84.5 to 84.6 percent, so that seems like the best we can do by modifying the max depth of the decision tree. Okay, so we've just looked at one hyperparameter, max depth, and how it can be used to regularize the model. Let's look at another hyperparameter, called max leaf nodes. This is another way to control the complexity of a decision tree: by limiting the number of leaf nodes. Whenever you have a decision tree, there are, as you can see here, a certain number of decision nodes and a certain number of leaf nodes. The way we've limited the size, the complexity, the parameters of the decision tree so far is by specifying how deep it can get, but that may not be the best way. Maybe you want to allow it to go five levels deep along one branch and stay two levels deep along another. That's where you can instead specify the maximum number of leaf nodes the tree can have. Here I'm going to specify that the maximum number of leaf nodes is 128. Roughly speaking, since one node at the top splits into two nodes below it, which split into four nodes below that, we might think the decision tree is built layer by layer: build layer one, then layer two, then layer three. But what actually happens is that it always tries to make the best possible split. Once it has created the split at layer one, there are two candidate splits, left and right; it looks at both and sees which is the better one to split, meaning which split results in a lower Gini coefficient, and splits that node by creating a split condition there. Then it analyzes all the current leaf nodes, determines which one gives the best split, and makes that split next, and so on. So the decision tree doesn't really grow layer by layer; it looks at all the leaf nodes, figures out the best leaf node to split at the moment, and splits it. How does that tie back to max leaf nodes? Here we're saying we want at most 128 leaf nodes, and 128 is two to the power of seven, so a complete decision tree that went seven levels deep would have 128 leaf nodes at its lowest level. Let's set max leaf nodes to 128 and see whether the resulting tree actually has a depth of seven. So we create the decision tree classifier and call model.fit.
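A minimal sketch of that experiment, with the same assumed variable names as before:

```python
from sklearn.tree import DecisionTreeClassifier

# Limit the number of leaf nodes instead of the depth
model = DecisionTreeClassifier(max_leaf_nodes=128, random_state=42)
model.fit(X_train, train_targets)

print(model.score(X_train, train_targets))  # ~0.85: it can no longer memorize everything
print(model.score(X_val, val_targets))      # ~0.84 on the validation set
print(model.tree_.max_depth)                # 12: branches end up with varying depths
```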
So now we're training, and the only restriction we've specified is that the number of leaf nodes should not go higher than 128, which limits the number of splits. We fit the model on the training data, and it turns out that the training accuracy is only 84.8 percent, not a hundred percent, for the same reason as before: the tree cannot go down and memorize every training example, because there's only a limited number of nodes it can create. Checking the model's accuracy on the validation dataset, this time it's 84.4 percent, and the tree's depth turns out to be 12. Let's compare this with what we had previously: the model limited to a maximum depth of seven had 84.5 percent validation accuracy, while this one has 84.4 percent. Maybe if you change max leaf nodes a little bit, to 130 or 140, it might actually cross that. But the important thing is that the two accuracies are different, and the reason is that the strategy for limiting the size of the tree is different: in one case we said the max depth can be seven, in the other we said the maximum number of leaf nodes can be 128, and this tree actually goes down to a depth of 12 in certain places, while other parts are shorter, maybe just three or four levels deep. We can verify this by converting the model to its textual representation and looking at the first few lines; the entire thing gets pretty long, so I've just printed the first 3,000 characters or so. Here you can see that one path is fairly long, with about 12 checks, while other paths have fewer checks, maybe five levels deep in one place and nine levels deep in another. So sometimes a branch is five levels deep and sometimes nine, and that depends on the best splits the decision tree was able to find. So here's an exercise for you: find the combination of max depth and max leaf nodes that results in the highest validation accuracy (a small sketch of one way to search for it follows below). Another exercise is to experiment with the other arguments of the decision tree. Scikit-learn has excellent documentation, extensive, very helpful and easy to read, so check out the documentation for sklearn.tree.DecisionTreeClassifier and go through all of the parameters; in most cases it tells you exactly what each one does. Maybe try a different criterion, maybe try the random splitter and see if that helps, maybe try changing the max depth.
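That hyperparameter search could be as simple as two nested loops over candidate values. This is just an illustrative sketch, not the lesson's code, and the candidate values are arbitrary:

```python
from sklearn.tree import DecisionTreeClassifier

# Try combinations of max_depth and max_leaf_nodes and keep the best one
best_params, best_acc = None, 0.0
for depth in [5, 7, 9, 11, 13]:
    for leaves in [32, 64, 128, 256]:
        model = DecisionTreeClassifier(max_depth=depth,
                                       max_leaf_nodes=leaves,
                                       random_state=42)
        model.fit(X_train, train_targets)
        acc = model.score(X_val, val_targets)
        if acc > best_acc:
            best_params, best_acc = (depth, leaves), acc

print(best_params, best_acc)
```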
We did do some experiments, but how max depth matters when you're working with a random splitter, for example, is worth figuring out. There are other hyperparameters here too; try experimenting with all of them. There are detailed explanations for each, and in certain cases you'll find links to other resources. As I said, a lot of these are implementations of some of the best papers in machine learning, so a lot of the best practices and best techniques are handed to us; we just have to try them out with scikit-learn. Another exercise for you is to try a more advanced technique for reducing overfitting called cost complexity pruning. Just as we have limited the number of nodes by depth and by the number of leaf nodes, there is a way to limit the nodes by the kind of split a node performs: we perform a split only if it satisfies certain criteria, and this is called cost complexity pruning. It's not a very commonly used technique, because decision trees by themselves are almost never used in isolation, so I won't cover it here, but scikit-learn has good documentation on it, an example implementation, and a link to the paper, so you can follow the code from that tutorial, try to implement cost complexity pruning, and see if you can improve the validation accuracy further (a rough sketch also appears near the end of this recap). Machine learning is all about trying different hyperparameters and different techniques and getting that additional boost in the model's performance. So here's a quick recap of the topics we've covered today. We started out by looking at the problem statement: the Rain in Australia dataset, which contains about ten years of daily weather observations from numerous Australian weather stations, and we used all of this information, more than twenty observations per day, to predict whether it is going to rain tomorrow at a particular location. We did this by first downloading the data with the opendatasets library from the Kaggle competition page. Once the dataset was downloaded, we read it in using the pandas library, with pd.read_csv, to view the data. We looked at the different columns and data types within the dataset and at the number of NaN values in each column, and we dropped all the rows where the value of the target column, RainTomorrow, was missing. Then we prepared the dataset for training. The first step was to create training and validation sets, and because the data is ordered by year and this model is going to be used in the future, we decided to use the data for 2016 and 2017 as the test set, the data for 2015 as the validation set, and the data up to 2014 as the training set, giving us a training, validation and test split. Then we identified the input and target columns: the input columns are all the columns we use to make a prediction for the target column, which is RainTomorrow. We chose not to include the date in the inputs, because the dates we'll be working with in the real world will be in a completely different range, and we want to use just today's weather data to predict tomorrow's rain.
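For reference, the year-based split described above could look roughly like this. This is a sketch, assuming raw_df is the full data frame read from the CSV and that its Date column parses as dates:

```python
import pandas as pd

# raw_df is assumed to be the full weather data frame from earlier.
# Older years for training, 2015 for validation, 2016-2017 as the test set.
year = pd.to_datetime(raw_df.Date).dt.year
train_df = raw_df[year < 2015]
val_df = raw_df[year == 2015]
test_df = raw_df[year > 2015]
```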
Which specific date it is today is not very important for that. We also separated the numeric and categorical columns, because the two have to be preprocessed separately. Then we imputed the missing values in the numeric columns using a simple imputer with the mean strategy, which means we fill all the missing values in numeric columns with the average value of that column. Then we scaled the numeric features to the zero-to-one range; this helps ensure that all the columns have a similar range of values and that no single column dominates the loss or the optimization process. Then we encoded the categorical data using the one-hot encoding technique, where we take each category from a categorical column, introduce a new column for that category, and place ones and zeros in that column to indicate whether or not a row belongs to that category. This creates new encoded columns: if you had three categorical columns with four categories each, you would end up with twelve encoded columns. So keep in mind the difference between categorical columns and encoded columns; the encoded columns are the new one-hot encoded columns we created, and what we do with the encoder is transform the categorical columns into encoded columns. That's why you can see here that we've inserted into train_inputs, under the encoded column names, the transformed values of the categorical columns of train_inputs. Now, if you're ever unsure about what a particular line of code does, all you need to do is create a new code cell and run the code step by step: check what the value of categorical_cols is, then what train_inputs[categorical_cols] is, then what encoder.transform(train_inputs[categorical_cols]) is; if that's a NumPy array, check its shape, how many rows and columns it has; then run the entire statement, see what happens, and check the value of train_inputs again. Use the interactive nature of the Jupyter notebook to explore each line of code and dig deeper. After one-hot encoding the categorical columns, we created the X_train, X_val and X_test variables containing just the numeric and encoded data, so we're no longer looking at the actual categories; we just look at the imputed and scaled numeric columns and the encoded columns, which contain the one-hot encodings of the categorical columns. Then we decided to train a decision tree, and the way to do that is to use DecisionTreeClassifier, because this is a classification problem (decision trees can also be used for regression, in which case you would import DecisionTreeRegressor). You create the decision tree classifier model and train it using the training inputs and the training targets. Once the decision tree has been constructed, you can make predictions by calling model.predict. When you call fit, that is when the decision tree is set up, when all the parameters of the decision tree are created; and when you call predict, predictions can be made on any input data you give to the decision tree. In this case we were looking at X_train.
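Here is a compressed sketch of those preprocessing steps as they might appear in code. The variable names (raw_df, numeric_cols, categorical_cols, train_inputs) are assumed from earlier in the notebook, and note that newer scikit-learn versions use sparse_output and get_feature_names_out instead of the arguments shown here:

```python
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# raw_df, numeric_cols, categorical_cols, train_inputs are assumed from earlier.
# Fill missing numeric values with the column mean, then scale to 0-1.
imputer = SimpleImputer(strategy='mean').fit(raw_df[numeric_cols])
scaler = MinMaxScaler().fit(imputer.transform(raw_df[numeric_cols]))

# One-hot encode the categorical columns into new encoded columns
encoder = OneHotEncoder(sparse=False, handle_unknown='ignore').fit(
    raw_df[categorical_cols])
encoded_cols = list(encoder.get_feature_names(categorical_cols))

train_inputs[numeric_cols] = scaler.transform(
    imputer.transform(train_inputs[numeric_cols]))
train_inputs[encoded_cols] = encoder.transform(train_inputs[categorical_cols])
```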
Calling model.predict on X_train gave us predictions for the training set, and we compared those predictions with the targets using the accuracy score: we got 99.99 percent accuracy on the training set. However, when we did the same thing on the validation set, we got only 79 percent accuracy, and we realized that this is only mildly better than always predicting that it will not rain tomorrow, which means the model was heavily overfitted. The model has learned all the training examples but does not generalize well to data it has not seen before, such as the validation set. So we then looked inside the decision tree to understand it better and to identify how we could address the overfitting. First we learned that a decision tree can be visualized using the plot_tree and export_text functions. A decision tree is simply a series of binary decisions: you make a decision, and based on that you make another decision, and another, until you get to a point where there are no more decisions to be made; you reach a leaf node, and there the example is classified as rain or no rain. The way the tree is created is using the Gini index: the model tries to perform an optimal split at every stage, at the top level, then at the level below it, and so on, and that's how it comes up with the decision tree. So the iterative process of machine learning for decision trees is constructing the tree split by split. You can also view the decision tree as a textual tree, which is sometimes easier to navigate, especially with larger and deeper decision trees, and you can see which features the tree looks at to come to a particular conclusion. We can also check the feature importances; in this case it turns out that humidity at 3 PM, pressure at 3 PM and rainfall seem to be the most important features for this particular decision tree, and we can plot them as a bar graph to see the relative importance of different features. Finally, we talked about hyperparameter tuning to avoid overfitting. The decision tree classifier accepts various arguments which can be modified to reduce overfitting, and the two we looked at were max depth and max leaf nodes. By reducing the maximum depth of the decision tree, we can prevent the tree from memorizing all the training examples, which may lead to better generalization, and the way to specify this is the max_depth argument of the DecisionTreeClassifier class, which limits how deep the decision tree can go. Once you apply that max depth, the decision tree can no longer memorize the training data, so its score on the training data falls: it's only 82.9 percent accurate. But its score on the validation dataset increases, because it is now generalizing; it is picking up general trends within the data rather than specific training rows. That's why the validation accuracy rises to 83.3 percent, and this was with a decision tree that is just three levels deep. It also gives you a lot of insight into the data: just by looking at humidity, rainfall and maybe a couple of other parameters, you can predict with roughly 83 percent accuracy whether or not it will rain tomorrow. And here is that simplified decision tree, which is three levels deep.
Now, what you would want to do is experiment with different values of max depth, and this is what the graph looks like when you plot training error versus validation error. When your max depth is too small, your decision tree is not powerful enough, so it will not have a very high training accuracy; if you look at the error, which is one minus the accuracy, both the training error and the validation error will be high, because your model is not powerful enough to pick up the important relationships between the training data and the targets. Once you start allowing the tree to go deeper, maybe two levels, three levels, four levels deep, your training error starts to decrease because your model is becoming more powerful, and your validation error starts to decrease as well. At a certain point, though, you'll notice that the validation error either becomes flat or starts to increase; this is the point where overfitting starts. For this particular example that happens at about a max depth of seven: as soon as you go more than seven levels deep, the model starts to memorize specific training examples rather than picking up general trends about the weather. That's why the training error continues to go down while the validation error starts to increase, and at a certain point it becomes much worse than where it started. This is a general trend that you'll notice with all hyperparameters, which are anything you have to configure in advance, before training the model. The decisions and decision points in the tree that gets built are parameters, because that's what the model learns from the data, but the max depth is something we provide before we create and train the model, and such settings are called hyperparameters. With hyperparameters, what you'll always find is this model complexity axis: as you change the value of a hyperparameter, the model goes from less complex to more complex, from less powerful to more powerful. And more powerful isn't always a good thing, because if you keep increasing the power, the capacity, the complexity of the model, at some point it will have enough capacity to memorize the entire training dataset, which is what it is optimized for, reducing the training error, and then it will not generalize well to real-world data, test data or validation data. What you want to find is the best fit: the point at which the validation loss is as low as it can possibly be, and if you go any further, it starts increasing. Before this point you are in a regime called underfitting, where your model is not powerful enough and hasn't learned enough about the data; after this point you are in a regime called overfitting, where your model is just memorizing training data and getting worse on real-world data. Okay? So every hyperparameter is something you will have to vary to find the optimal position. Then we also looked at another hyperparameter called max leaf nodes, which is yet another way to control the complexity of a decision tree, and the benefit here is that it allows branches of the tree to have varying depths.
So instead of limiting the depth to a certain number, you can say that the number of leaf nodes should be at most a certain number, like 128, and then the tree can reach different depths along different paths, depending on how many decisions need to be taken along each path. This can sometimes work better than max depth, but typically you would use a combination of both. Here is an example where we used a maximum of 128 leaf nodes, which is what you would normally get with a seven-level decision tree filled out all the way to every leaf. It turns out that with 128 leaf nodes the tree's depth goes up to 12, but not every path is 12 steps long: some paths are quite short, three or four steps, and others are quite long, eight, nine, ten or twelve steps. So max leaf nodes allows the tree to have a different structure in each subtree; some branches can be short and some branches can be long, and typically you would use a combination of max depth and max leaf nodes. And just like max depth, as you increase max leaf nodes your model will start to overfit at some point, and if you keep max leaf nodes too low, your model will not be powerful enough, so there is a value somewhere in between that gives you the optimal validation loss, which is what you need to find. This is really the entire art of machine learning: once you have a model, you have to find the right hyperparameters that minimize the validation loss for that model. Now, there are several other arguments of the decision tree that you should explore; refer to the docs. And another, more advanced technique for reducing overfitting in decision trees is called cost complexity pruning, so you should check that out as well and try to implement it for this problem. All right, so that was a quick introduction to decision trees. Of course, we've skipped over some parts, especially the more mathematical parts about the Gini index and the importance calculation and things like that, but I hope you were able to get a basic intuition of how these splits are created: we look at all possible features and all possible splits, find the best split (with the evaluation done using the Gini index), divide the data into two portions, then identify the next best split to be made among the current leaf nodes, make that split, and keep going until we either hit pure leaf nodes, in which case we have essentially memorized the entire training dataset, or until we hit some limits that have been artificially imposed via hyperparameters for the purpose of regularization. In general, you don't want an unbounded decision tree; you want it to be somewhat generalized, so you want to limit its depth and maybe tune some of the other hyperparameters as well. Now, while tuning the hyperparameters of a single decision tree may lead to some improvements, a much more effective strategy is to combine the results of several decision trees trained with slightly different parameters. We will continue using this notebook and the same data, and we will see how we go from decision trees to an ensemble model called a random forest, why that is helpful, and how that affects the results of our modeling.
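As a pointer for the cost complexity pruning exercise mentioned above, here is a rough sketch of one way you might try it with scikit-learn's ccp_alpha parameter; the candidate alphas come from cost_complexity_pruning_path, and subsampling them is only to keep the loop quick:

```python
from sklearn.tree import DecisionTreeClassifier

# Candidate pruning strengths computed from an unrestricted tree
path = DecisionTreeClassifier(random_state=42).cost_complexity_pruning_path(
    X_train, train_targets)

best_alpha, best_acc = 0.0, 0.0
for alpha in path.ccp_alphas[::10]:  # subsample the alphas
    model = DecisionTreeClassifier(ccp_alpha=alpha, random_state=42)
    model.fit(X_train, train_targets)
    acc = model.score(X_val, val_targets)
    if acc > best_acc:
        best_alpha, best_acc = alpha, acc

print(best_alpha, best_acc)
```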
So what should you do next? Review the lecture videos and execute the Jupyter notebook, complete the lecture exercises, and start working on assignment one if you haven't already; a new assignment is coming soon. And discuss on the forum and on the Discord server and ask questions. This is a very important part of participating in an online course: it helps you stay motivated, it helps you improve your learning, and it helps you be part of a community, which can open up avenues for future collaboration where you can continue to learn long after the course has ended, and you may find friends you build associations with personally or professionally. So with that, I will see you in the forums. You can follow us on Twitter, and you can visit the course website any time at zerotogbms.com. Next week we will look at random forests. This was Machine Learning with Python: Zero to GBMs. Thank you, and have a good day or good night.
Info
Channel: Jovian
Views: 8,022
Keywords: decision tree, data science, machine learning, deep learning, python, python data science, jupyter notebook, machine learning course, gradient boosting, gradient boosting machine learning, certification course, decision tree machine learning, decision tree algorithm, decision tree analysis, decision tree tutorial, decision tree python, decision trees, hyperparameters, hyperparameter, hyperparameter tuning, artificial intelligence, neural network, jovian, overfitting, regularization
Id: d6xH6k7_Zv4
Length: 108min 20sec (6500 seconds)
Published: Sat Jul 03 2021