Hello and welcome to Machine Learning with Python: Zero to GBMs. This is an online certification course being organized by Jovian. Today we're in lesson three, decision trees and hyperparameters, so let's get started. The first thing we do is go to the course page, zerotogbms.com, so just point your browser to zerotogbms.com, and that will bring you to the course page, where you can find some information about the course, and you can find all the lessons and assignments that you need to complete to get a certificate for this course. So far, we've looked at linear regression and logistic regression, and you will also be working on assignment one, where you will train your first machine learning model. Now before we open lesson three, I just
want to tell you about our discord server. At the top of the page, you will
find a link to our discord server. In the discord server, you can
interact with thousands of members of the Jovian community who are taking
this course with you right now. So the first thing you should do once you enter the Discord server is introduce yourself in the introduce-yourself channel. Then check out the agenda channel for some recent discussions. You should also carefully review the community guidelines, just to keep the Jovian community a friendly environment. There's a list of small dos and
don'ts that you need to follow. And then you have several channels
which you can use to learn and get help while you're working on the project. The ask-questions channel is to ask questions about any concepts or assignment errors that you're facing. The assignment-help channel is to get help with the assignment: if you're stuck at some point and need help, or you're facing an error that you cannot resolve, post in the assignment-help channel. The study-hours channel is to give you information about study hours that we conduct from time to time using voice channels; we're conducting study hours every Wednesday, and you can find the information in the study-hours channel. The share-your-work channel is a place for you to share any interesting data science projects that
work channel is a place for you to share any interesting data science projects that
you have worked on Jovian or elsewhere. And finally, the shared resources
channel is a place for you to find interesting machine learning
resources and data science resources. You can also share any interesting
blogs and articles that you find here. So that's the Discord server; do make use of it. Being part of a community, being part of an online classroom, is the best way to take an online course, and we've seen in the past that being active in the community forums makes you four to five times as likely to complete the course. Even if you just check out the Discord server daily, that's good enough to keep you motivated. With that, let's scroll down to lesson three, decision trees and hyperparameters. On the lesson page, you will be able to find a recording of the lesson, and here is some description of the topics that we're covering today. The code we will be executing today is
present in this Jupyter notebook, which you will find in the lesson notebooks tab. We have just one Jupyter notebook for today. You can read through this Jupyter notebook and view the code, but if you want to run the code, you need to click Run. So I'm just going to click Run on Colab here, and this is going to take the Jupyter notebook that you are viewing on the lesson page and put it into your Google Drive account. And in your Google Drive account, this Jupyter notebook will then be opened up using Google Colab. So you will be asked to connect your Google Drive account, if you haven't done that already, and once it's connected, that will bring you to this page. Now make sure to run the first cell of code on Google Colab. This is very important, because this cell connects your Jupyter notebook from Jovian to Google Colab, so that when you want to save a snapshot of this Jupyter notebook, it will get saved to your Jovian profile. So just run this cell, and then you're all set to go. When you run the first cell, Google Colab will set up a server on the cloud for you, and all the code that you write from this point on will be executed on that cloud server. So the topic for today is
decision trees and random forests. This is something different from what we've covered so far, which is linear models like linear and logistic regression. Decision trees, and random forests especially, are very powerful and widely used machine learning models. It's most likely that in your professional work, you will be building decision trees and random forests most of the time, and one of the primary reasons for that is the interpretability of these models. So we will also talk about why these models learn the things that they do and why they give the results or predictions that they do. So here's what we're going to cover today. We will download a real world
dataset just as we've been doing for the previous lessons. We will prepare a dataset for
training a machine learning model. Then we will train and
interpret some decision trees. And then we will move on to training and interpreting random forests. We will talk about overfitting, hyperparameter tuning, and regularization. These are some of the central problems
in machine learning, and this is where you will spend a lot of your time
when you are improving your models. And finally, we will talk about how to
make predictions on single inputs as well. Now, I'm running this notebook on Colab, but you can also run it locally; things may be a little bit slower depending on what your configuration is. If you have a good CPU and enough RAM, you should be able to run this locally as well. Okay. Just as we've been doing in the
previous lessons, we will take a practical and coding focused approach. We will learn how to use decision
trees and random forest to solve a real world problem from Kaggle. And we're going to use the same
dataset that we used the last time. This will also give you a chance to contrast decision trees with the linear regression and logistic regression models. So we will use the Rain in Australia dataset, which contains about 10 years of daily weather observations from numerous Australian weather stations. And here is a small
sample from the dataset. So on several dates, you have
information captured from several locations and this information
includes minimum temperature, maximum temperature, rainfall, evaporation, etc. And the last two columns are the most interesting: one is whether it rained on that day, and the second is whether it rained on the next day. Now, of course, we have rain tomorrow
because we are looking at historical data. And as a data scientist at the bureau
of meteorology, you are tasked with creating a fully automated system
that can use today's weather data for a given location to predict whether
it will rain at a location tomorrow. So you want to create an automated
system, which can essentially predict the likelihood of rainfall all over Australia. So let's see how far we can get. Before we begin, we'll just install and import some of the required libraries that we've been using throughout: opendatasets for downloading the dataset, pandas for loading datasets and working with data frames, numpy for mathematical computing, scikit-learn, which contains all the machine learning models that we will train, and jovian for saving snapshots of your notebook. So let's import all the libraries: opendatasets, matplotlib.pyplot as plt, seaborn as sns, pandas as pd, numpy as np, matplotlib, and jovian. We will also use the os module a bit. These are standard conventions
that you should follow in all your jupyter notebooks. And if you don't follow these, you
will find that people get confused. Of course you can call pandas
anything you want, but prefer calling it PD because that's how you
will see it all over the internet. Finally, we're also setting some options for display here so that the graphs are a little bit bigger and we can see more information within our pandas data frames. All right.
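As a rough sketch, that setup cell looks something like this (the exact display options below are my own choices, not prescribed by the lesson):

```python
# Install the libraries used in the lesson (notebook-style install)
!pip install opendatasets scikit-learn pandas numpy matplotlib seaborn jovian --quiet

# Standard import conventions used throughout the course
import os
import opendatasets as od
import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns

# Make plots a bit bigger and show more data frame columns
sns.set_style('darkgrid')
matplotlib.rcParams['figure.figsize'] = (10, 6)
pd.set_option('display.max_columns', None)
```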
So the first step is to download the dataset. As we've done the last time, we will download this dataset using the opendatasets library, directly from Kaggle, within Jupyter. So we just run od.download, and when we run od.download, we will be prompted for a Kaggle username and an API key. Here's what that looks like. Okay, let me just run this. Now, this is one way to provide the information, your Kaggle username and your API key. But one other thing you can do is just click on this file explorer, find the upload button, and upload your kaggle.json file. If you place your kaggle.json file next to your notebook, then opendatasets will automatically find the credentials and download the dataset. As you can see here, this was a 3 MB dataset that was downloaded automatically. And of course, if you don't have your kaggle.json file, go to kaggle.com, which is where we are downloading the dataset from, click on Account, scroll down to Create New API Token, and that will download the kaggle.json file for you. Okay? So you can either provide your Kaggle username and key directly, or you can upload the file to Google Colab, or just place it next to the notebook if you're running it locally. So the dataset is now downloaded
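A minimal sketch of that download cell; the Kaggle URL here is the "Rain in Australia" dataset page (the owner slug is my best guess, so double-check it on Kaggle):

```python
import opendatasets as od

# Kaggle page for the 'Rain in Australia' dataset
dataset_url = 'https://www.kaggle.com/jsphyg/weather-dataset-rattle-package'

# Prompts for your Kaggle username/key unless a kaggle.json file is found
od.download(dataset_url)
```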
and extracted into this folder, weather-dataset-rattle-package, and we can check that using os.listdir. Now I'm just going to click Edit > Clear all outputs here so that we can run all the code fresh and we do not have any stale outputs in our notebook. All right. So the file weatherAUS.csv contains the data, you can see it here. So let's load it into a pandas data frame. I'm going to run pd.read_csv, and that loads the data frame up.
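Roughly, that cell looks like this (the folder and file names match what opendatasets extracts):

```python
import os
import pandas as pd

data_dir = 'weather-dataset-rattle-package'
print(os.listdir(data_dir))              # should show ['weatherAUS.csv']

raw_df = pd.read_csv(data_dir + '/weatherAUS.csv')
raw_df
```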
And here's the data frame. We looked at it the last time as well. We have date, location, minimum temperature, and a bunch of other weather parameters. And finally, we have rain today and rain tomorrow. Our objective is to take all of this information, maybe not the date, because everything is on a different date, but everything except the date, and use that to predict whether it will rain on the next day. And hopefully we can then use
it on some future data as well. So let's check the column
types of the dataset. If we just do raw_df.info(), it tells us that there are a total of about 145,000 entries, and you can see the types of each column. You have object, which is mostly strings or categorical data, and then you have float64 and int64; the floats and ints are numeric data, and the object columns are mostly categorical data, though sometimes these can be plain string data as well. And you will notice that some of
these columns have null values too. So we need to deal with them as well. Now, one of the things I'm going
to do is remove any rows where the value of the target column is missing. Because we want to train a model that can predict whether or not it will rain tomorrow, giving the model any rows where we don't know whether it rained tomorrow will not be useful for training, right? So we will remove any rows where the target column is empty. So I'm just going to call dropna with the subset RainTomorrow.
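In code, that step is roughly this (RainTomorrow is the target column name in this dataset):

```python
# Drop rows where the target column is missing; they can't be used for training or evaluation
raw_df = raw_df.dropna(subset=['RainTomorrow'])
```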
And here's an exercise for you: try to perform some exploratory data analysis on this dataset, if you haven't already, and study the relationships of the other columns with the rain tomorrow column. See if you can figure out, before we build this model, which columns are the most important in determining whether it will rain tomorrow. And I'm just going to save my notebook as well, so I'm running jovian.commit() here, and I am asked to enter my API key, which I can find from my Jovian profile by going to jovian.ai; I just click on this, it copies the API key, and I can paste it here. And this will save a snapshot of
the notebook to my profile so that I can come back and continue where
I've left off in my next session. The Colab notebook, of course, will shut down after some time. All right. Now, we've done most of this before, so let's go through it quickly. We will perform some steps to prepare the dataset for training. The first step is to create a training, validation, and test split. Remember, it's common practice to set aside only about 60% of the data for training the model; then we use about 20% of the data for validation, which is to evaluate different versions of the model as we try out different parameters, and finally, to report the final accuracy, we use the test set. Now, it's common practice to do a random split, but in this case, because the data is ordered by date, and because the model that we create using this data is going to be used in the future, we can simulate that, which is using a model trained on the past to predict values in the future, by picking the last couple of years for the test set. We can maybe pick the one year before that for the validation set, and then all of the remaining data can be used for the training set. So this is the distribution of the number
of rows per year, and we've plotted that using a simple count plot in seaborn. So here's what we'll do. We will create a train data frame, which is a subset of the rows of the raw data frame we just loaded up, where the year is less than 2015. And here is how we've computed the year: we have taken the date column, raw_df.Date, parsed each value in the column as a datetime field, and from that datetime field we have extracted the year. These are basic pandas operations that you should check out if you're not familiar with them already. So the data before the year 2015, which is up to 2014, is used for training. Then the validation data is the year 2015, and the test data frame is the years 2016 and 2017. Again, we are doing this only because we have chronologically ordered data, and this is how our model will be used in the real world. If you do not have chronologically ordered data, then you use a random split, and there is a function in scikit-learn called train_test_split which you can use to do that.
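A rough sketch of that chronological split (the column name Date matches the dataset; the year variable is just a helper for illustration):

```python
import pandas as pd

# Extract the year from the Date column
year = pd.to_datetime(raw_df.Date).dt.year

# Chronological split: past years for training, later years for validation/test
train_df = raw_df[year < 2015]
val_df   = raw_df[year == 2015]
test_df  = raw_df[year > 2015]

print(train_df.shape, val_df.shape, test_df.shape)
```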
98,000 observations or 98,007. For training, we have about
17,000 samples for validation. So as we try different kinds of models
and we'll try quite a few today, we can use the validation data frame to evaluate
how well those models are performing. And finally, we have a test data frame. This is where we will report
the final accuracy of our model. Now here's an exercise for you. If you want to build on top of this,
you can try and scrape the climate data for the recent years from this website. This is the official website of the
bureau of meteorology in Australia. So you can try and scrape the data
from 2017 to 2021 and try training a model with the enlarged dataset. In fact, this is how this
data set was created in the first place by scraping data. So web scraping is a great way to create
new datasets for machine learning. All right, so we have created the training
and validation dataset split, and then the next step is to identify the input and
target columns because not all the columns will be useful for machine learning. And it's also very important to separate
out the input and the target data. One common mistake people make initially
is to accidentally use the target column to predict the target column,
in which case your machine learning model, isn't really doing anything. It is taking the value of the target
column and simply returning it. So, always make sure to carefully check
the columns of your data frame and separate out the input and output columns. So if I check the raw data frame, or maybe if I just check the train data frame, which is just a subset of the rows, you can see that we don't want to use the first column, date, and we don't want to use the last column, rain tomorrow, as an input. Why not date? Because we are going to use the model in the future, so a date will not be a useful input. And rain tomorrow is not useful because this is the value that we want to predict. So the input to the model should be the rest of the columns, and the prediction of the model should be compared with the target column, which is rain tomorrow. So here's how we're setting that up: we take the data frame's columns, convert them into a list, and exclude the first and the last value from that list. And now we can take just the input
columns from the training data frame and create training inputs. So I'm just creating a copy here because
we are going to make some modifications in the next few steps. And we can also separate out the target column. Now, the target column is just a single column, so when we select train_df[target_col], that is going to return a pandas Series, not a data frame, so just keep that in mind. So here's what that looks like: train inputs contains location through rain today, and train targets contains just the value of rain tomorrow. Okay, it's always a good idea to just check what information you have within your data frames before you move forward. Similarly, here we are creating the validation inputs and validation targets, and we're creating the test inputs and test targets.
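Roughly, that setup looks like this (input_cols and target_col are the names used in the lesson's notebook):

```python
# All columns except the first (Date) and last (RainTomorrow) are inputs
input_cols = list(train_df.columns)[1:-1]
target_col = 'RainTomorrow'

# Inputs are data frames, targets are pandas Series
train_inputs, train_targets = train_df[input_cols].copy(), train_df[target_col].copy()
val_inputs, val_targets = val_df[input_cols].copy(), val_df[target_col].copy()
test_inputs, test_targets = test_df[input_cols].copy(), test_df[target_col].copy()
```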
Next up, let's also identify the numeric and categorical columns within the data, because we will need to deal with them separately. So here's one way. One simple thing you could do is just look through train_df manually and make a list: okay, min temp is numeric, max temp is numeric, rainfall is numeric, etc. But what you should ideally be doing is detecting these automatically. So here's how you can do that. If you take train inputs, or any data frame, and call .select_dtypes, it will only select the columns which have the matching dtypes, and if you provide np.number, which encompasses float and int and all the numeric data types, you will get back a data frame containing just the numeric columns. Then you can simply access .columns on it, which gives you the list of all the numeric columns, and finally, we can convert that into a list using .tolist(). So here we get back a list of all the numeric columns. Now, to get the list of categorical columns, all you need to do is change this to 'object', and when you change this to 'object', you get back the list of categorical columns. Now, how did I find this out? Well, I simply looked it up online: how do you find the numeric and categorical columns in a data frame? And once I found it, I just kept it written in my notebook so that I can use it anytime. So these are the numeric and categorical columns.
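As a quick sketch of that detection (np.number covers all the numeric dtypes, while 'object' picks up the string/categorical ones):

```python
import numpy as np

numeric_cols = train_inputs.select_dtypes(include=np.number).columns.tolist()
categorical_cols = train_inputs.select_dtypes(include='object').columns.tolist()

print(numeric_cols)
print(categorical_cols)
```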
Now, one thing you might want to do at this point is decide if you really need all the columns, because every column introduces new data, which may potentially slow down your training. In this case, we have a small enough dataset, so we do not need to worry about it, but you can do some analysis to check how closely the columns are correlated with the target, maybe select a subset of the columns instead of all of them, and observe how that affects the results. Does it lead to a large decrease in accuracy, or is the decrease insignificant? If it is insignificant, then it's probably okay to drop a few columns and just use the ones that are most important. Okay. So try it out, observe it, and try to
get a feel for when it makes sense to drop some columns. For now, we are going to move ahead with all the columns, and the next important step is to impute missing numeric values. Machine learning algorithms can't work with missing values, they will throw errors at you, so we want to take all the missing values and replace them with some other values. So how do you check the missing values? Well, you can take the train inputs and from them pick the numeric columns, so just the data from the numeric columns; this is what that looks like. And here we can check .isna(), which is going to replace each value with a true or false depending on whether it is NaN, so this will become a true and this will be a false, and then I'm going to do a .sum(). Chaining pandas commands like this is a useful skill to learn: you always think about what you want to get to and what incremental process will take you there. And maybe I might also do a sort_values here on the series and set ascending equals false. All right. So it seems like sunshine has the highest number of missing values, followed by evaporation, followed by cloud cover at 3:00 PM, cloud cover at 9:00 AM, and so on. So all these numeric columns have some missing values, and we are going to replace them using a simple strategy, which is basically replacing them with the average of that column. For this, we can import SimpleImputer from scikit-learn, create an imputer object, and specify the strategy that we want to use, which is mean. After creating the imputer object, we can call .fit and give it the data from all the numeric columns in our data frame, and the imputer is going to figure out the average for each of those columns. Now, once you've fitted the imputer, which means once it has found the averages, or whatever statistic we want to use to fill each column, we can actually fill the columns by calling imputer.transform. So we call imputer.transform on the numeric columns of the train inputs; that is going to fill in all the empty data in the numeric columns of the train inputs and return a new numpy array. We can take that result and put it back into the original data frame, train inputs, replacing the original numeric columns. Okay. So the net effect of all of this is that you have no missing data in any of the numeric columns; we filled it with the mean value. Now, the mean is not the only imputation strategy; there are several other imputation strategies available in scikit-learn. So an exercise for you would be to try a different imputation strategy and see how it affects the final result. And this is what practical machine learning is all about: you try different things, and maybe sometimes you try different strategies for different columns by doing some exploratory analysis, and figure out the strategies that work best for the problem.
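A minimal sketch of that imputation step (check the missing counts, fit the imputer on the training inputs, then transform train, validation and test):

```python
from sklearn.impute import SimpleImputer

# How many values are missing in each numeric column?
print(train_inputs[numeric_cols].isna().sum().sort_values(ascending=False))

# Learn the mean of each numeric column from the training data
imputer = SimpleImputer(strategy='mean')
imputer.fit(train_inputs[numeric_cols])

# Replace missing values in train/val/test with the learned means
train_inputs[numeric_cols] = imputer.transform(train_inputs[numeric_cols])
val_inputs[numeric_cols] = imputer.transform(val_inputs[numeric_cols])
test_inputs[numeric_cols] = imputer.transform(test_inputs[numeric_cols])
```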
Okay, next up, we are going to scale the numeric features. Scaling simply means we want to take the ranges of each of the numeric features, which is the min and the max, and bring them down into a zero-to-one range. As you can see here, in the validation, training, or test dataset, each numeric feature has a different range: min temp is minus 8 to plus 31, wind speed is 7 to 135, whereas certain values like pressure can be 988 to 1039. Because there are a lot of numerical computations that happen inside the machine learning algorithm, and ultimately a single loss value is optimized, we don't want any specific feature to dominate the training process. We want to give every feature a level playing field to participate in the training of the model, and that is why we scale all of these feature values to the range zero to one. And we do that using MinMaxScaler. So here we are creating a MinMaxScaler, and then we call fit on the MinMaxScaler and give it the data from all the numeric columns, so it is going to figure out, for each column, the minimum and the maximum value. Then we can call scaler.transform, give it all the data from the numeric columns, and it is going to scale them into the zero-to-one range, and then we can take that output and put it back into our training, validation, and test data frames. So the net result of all of this is that the inputs are going to change from a variety of different ranges to the zero-to-one range. Now, the zero-to-one range is not the only scaling strategy; there are several other scaling strategies as well, so you should try out a different scaling strategy. Specifically, StandardScaler is something worth checking out, and observe how that affects the results.
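A minimal sketch of that scaling step (the same fit-then-transform pattern as the imputer):

```python
from sklearn.preprocessing import MinMaxScaler

# Learn the min and max of each numeric column
scaler = MinMaxScaler()
scaler.fit(train_inputs[numeric_cols])

# Scale the train/val/test numeric columns into the 0-1 range
train_inputs[numeric_cols] = scaler.transform(train_inputs[numeric_cols])
val_inputs[numeric_cols] = scaler.transform(val_inputs[numeric_cols])
test_inputs[numeric_cols] = scaler.transform(test_inputs[numeric_cols])
```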
Next, we're going to encode the categorical data. Machine learning algorithms can only work with numbers, and in our data frames we have some categorical data. If I just check train_df, you can see here you have location, which is categorical, then you have wind gust direction, which is also categorical, and then you have a bunch of other categorical data as well, things like rain today. In fact, that's what we've listed in the categorical columns: location, wind gust direction, wind direction at 9:00 AM, wind direction at 3:00 PM, and rain today. So what we're going to do is perform one-hot encoding for the categorical columns. Okay, and for the categorical columns, we do need to first fix the NaNs. So I'm just going to call fillna on the categorical columns of the training data and fill all the NaNs with the value Unknown, and I'm going to do that for the validation and the test data as well. We did fill the missing values in the numerical columns, but we did not do that for the categorical columns, and you can see here, if I just pick the categorical columns, that some of them have NaN values, most of them actually. So wherever we have NaN values, we are going to fill in the string Unknown, just so that the one-hot encoder doesn't complain, and let's just do that in place. Let's try that again. All right, let me just fix this. I believe this is an issue because of the version of scikit-learn: this was something that worked on my computer but is not working on Google Colab, and whenever you face such issues, where something works in one place but does not work in another, that is probably because of version differences in scikit-learn. Okay, let's do this one last time, and that should fix it. So watch out for version differences between libraries. If you ever want to check the version of a particular library, the way to do it is to run pip list, and that will show you a list of all the libraries that are installed, and you can check their versions, so you can check the version on your computer, check the version on Colab or wherever you're running, and identify the discrepancy. And the way to install a specific version is to run pip install scikit-learn, for example, and then specify the version you want after a double equals sign. Okay. So with that out of the way, we can now one-hot encode our columns. By one-hot encoding, what we want to do is take all the categorical columns, pick the values in those categorical columns, and create a separate column for each category. Those category columns will simply contain ones or zeros, depending on whether a particular row belongs to that category. Again, this is something we've discussed in detail in the previous session. So I will just run this code here, which first creates a OneHotEncoder and then fits it to the inputs that we have, then creates a list of new feature names, or new category names, and you can see what these category names are. So for each categorical column and for each category combination, we have one new category name, and then we can transform the data from the categorical columns into one-hot vectors and
put them back into our data frame. So the net effect of this is that for every categorical column, for example location here, we have a bunch of separate columns, like Location_Adelaide, Location_Albany, Location_Albury, etc., where we have zeros, and a one for the specific location that the row represents; for example, a one for Albury, because this location is Albury, and zero elsewhere. Now, one-hot encoding is not the only encoding strategy; there are other encoding strategies as well, so I encourage you to try them out and observe how they affect the results. And as a final step, let us drop the textual categorical columns from our inputs. So I'm just creating these new X_train, X_val, and X_test variables, which contain simply the numeric columns, which have been imputed and scaled to the zero-to-one range, plus the encoded categorical columns. So we are removing the actual string categorical columns and just keeping the encoded ones here, and this is the input that we will use to train and evaluate our model. Of course, we have the targets as well: we have train targets, val targets, and test targets.
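A rough sketch of the encoding and the final input construction, assuming an older scikit-learn like the one used in the lesson (newer versions rename sparse to sparse_output and get_feature_names to get_feature_names_out):

```python
from sklearn.preprocessing import OneHotEncoder

# Fill missing categorical values with a placeholder string
for df in [train_inputs, val_inputs, test_inputs]:
    df[categorical_cols] = df[categorical_cols].fillna('Unknown')

# Fit a one-hot encoder and generate one new column per category
encoder = OneHotEncoder(sparse=False, handle_unknown='ignore')
encoder.fit(train_inputs[categorical_cols])
encoded_cols = list(encoder.get_feature_names(categorical_cols))

for df in [train_inputs, val_inputs, test_inputs]:
    df[encoded_cols] = encoder.transform(df[categorical_cols])

# Final model inputs: scaled numeric columns + encoded categorical columns
X_train = train_inputs[numeric_cols + encoded_cols]
X_val = val_inputs[numeric_cols + encoded_cols]
X_test = test_inputs[numeric_cols + encoded_cols]
```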
So here's what the input to our model looks like. Okay. Let's save our work before continuing. All of this is something that we did the last time as well, so it should start to feel fairly standard, and maybe even boring, by now, because these are the steps that you will take for pretty much every machine learning problem. Let's talk about training and
visualizing decision trees. A decision tree, in general parlance, represents a hierarchical set of binary decisions. For example, here is the kind of decision tree that you may set up to decide whether or not to accept a job offer. If the salary is between $50,000 and $80,000, then you consider the offer; if it is not, you decline it. If it is between $50,000 and $80,000, then maybe you check whether the office is close to your home; if it is not, you decline the offer. Otherwise, you check if the company provides a cab facility; if it doesn't, maybe you decline the offer, and otherwise you accept it. So this kind of strategy is how we make
a lot of decisions in the real world. In fact, this is how a lot
of processes are set up. And if you think carefully, this is
how programs are also set up or where, where we write a lot of if, else
statements to come to a certain decision. Now, a decision tree in machine learning works exactly the same way, except that we let the computer figure out the optimal structure and hierarchy of decisions, instead of coming up with the criteria manually. So applying it to our problem of whether or not it will rain tomorrow: we let the computer figure out the most important criterion to decide whether or not it will rain tomorrow, and after checking the value of that criterion, let's say whether it rained today or not, there is a different subtree for the case where it rained today and a different subtree for the case where it did not rain today. So if it did rain today, then
maybe we simply look at the pressure. And if it did not rain today,
maybe we look at the wind speed at 3:00 PM and so on, right? So you can have multiple trees
on either side and we will see how these trees come up. But the important point is we
are not creating those trees. We are letting the machine learning
model figure out what the right criteria and the right decision points are
going to be to best fit the model. Okay. And to train a decision tree, we simply use the DecisionTreeClassifier model from scikit-learn. Now, why DecisionTreeClassifier? Because this is a classification problem. Remember, there are two types of problems, classification and regression. In regression, you're trying to predict a continuous value, for example the medical charges for an insurance applicant, while in classification, you're trying to classify the input into one of several categories. For instance, here we're trying to classify the measurements taken today based on whether or not it will rain tomorrow, so yes or no. That's why we're using a DecisionTreeClassifier; if it was a regression problem, we could use a DecisionTreeRegressor. So from sklearn.tree we import DecisionTreeClassifier, and then we create the decision tree model by simply creating an object of the class DecisionTreeClassifier. There is some randomness involved in how decision trees work, so if you want to get the same output each time you run this code, just provide a value for the random state, for example random_state=42. This initializes the random number generator inside the decision tree, so each time you run the code, you will get the same kind of randomization and hence the same kind of outputs. Now, if you do not want the same output each time you run the code, then you can remove this random state, but it is generally recommended to set a random state for your decision tree classifier so that you can replicate your results. Otherwise your results
will not be replicable. All right, so now we've created the model, and the next step is to fit the model. So we give the model the training data, which is all the numeric columns, which have been imputed and scaled to the range zero to one, and we give it the targets, which are simply the yes/no values for whether it will rain tomorrow for each of the input rows. And we run that, and it takes maybe a second or two; it took 2.8 seconds, and our decision tree classifier has been trained.
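A minimal sketch of that training step, assuming the X_train and train_targets prepared earlier:

```python
from sklearn.tree import DecisionTreeClassifier

# random_state makes the (randomized) training reproducible
model = DecisionTreeClassifier(random_state=42)

# X_train: imputed, scaled and encoded inputs; train_targets: 'Yes'/'No' labels
model.fit(X_train, train_targets)
```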
Okay, so what just happened? Let's try to use this classifier, see how it works, and then we'll try to visualize it as well. The first thing to do after training any model is to make some predictions using the model and evaluate how well it is doing. So here's how you can make predictions using the model: if we call model.predict and give it a set of inputs to make predictions on, it will give us predictions that we can look at. So I'm going to call model.predict on X_train, and this is what X_train looks like: we are giving the model all this data, all of these are numbers, all the missing values have been filled in, the categorical columns have been converted to one-hot vectors, and the model gives me some predictions. What do those predictions look like? Well, the predictions are either no or yes. How does the model know that it needs to predict no or yes? Because we called model.fit with our targets, and our targets have these yes/no values, so when the model was training, when it was learning from the data, it identified that it needs to predict a yes or no value. Now, internally, of course, the model represents this yes/no target value as a zero or one, but to show us the output, it is going to return the strings yes or no. So now we have some predictions from our model: we called model.predict on our input data, our training data itself, and we got some predictions, and these are the predictions. It seems like there are a lot of no's here, but just to make sure that we also have some yeses, I'm going to run pd.value_counts, and pd.value_counts simply takes a list and tells you the counts of the unique values. So it seems like there are about 76,000 no's and about 22,000 yeses in the predictions. So, based on whatever logic it has learned, the model is actually predicting different things; it's not just predicting no every time. So the model seems to have learned something. Now, how well has it learned? Well, that is something we can evaluate by computing an accuracy score. We have the training predictions, and we have the training targets, which are the actual values, and the simple thing we can do is compare each value: we compare the first value and they match, we compare the second value and they match, we compare the third value and they match, and we count the percentage of values that match. So I'm just going to run accuracy_score, which is imported from sklearn.metrics, and that is simply going to count the fraction of matches. I'm going to run accuracy_score on train_preds and train_targets, so let's see how well the model has done on the training set. Okay. So it seems like the accuracy of the model on the training set, on which it has been trained, is 99.99%, so practically a hundred percent. The decision tree also returns probabilities for each prediction, so we can also check the probabilities: to get probabilities, you can simply call model.predict_proba and give it the same input. And it looks like the model is very confident about its predictions as well: we have an accuracy of over 99%, and we have a probability of one for most of the predictions.
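A short sketch of those checks (train_preds is just a name for the predictions on the training set; newer pandas prefers pd.Series(train_preds).value_counts()):

```python
import pandas as pd
from sklearn.metrics import accuracy_score

train_preds = model.predict(X_train)         # array of 'Yes'/'No' strings
print(pd.value_counts(train_preds))          # counts of predicted classes

print(accuracy_score(train_targets, train_preds))  # close to 1.0 on the training set
print(model.predict_proba(X_train))          # per-class probabilities
```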
You can verify whether this is actually true throughout or not. So it seems like we've learned everything there is to learn from this data. Or have we? The training set accuracy is close to a hundred percent, but we can't rely solely on the training accuracy, because your model will not be used in the real world on the same training data; in the real world, your model will see data that it has not seen before. And so far, it hasn't seen the validation set. So we must evaluate the model on the validation set: we make predictions on the validation set by calling model.predict, and then we can compare the validation set predictions, which are obtained from the validation inputs, with the validation targets, using the accuracy_score function. But because this is such a common operation, scikit-learn models already have a .score method. So in the case of decision trees, if you call model.score and give it the inputs, in this case the validation inputs, and give it the targets, then it will make predictions on the validation inputs, compare those predictions with the targets, and give you the accuracy.
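Roughly, the two equivalent ways to compute the validation accuracy:

```python
from sklearn.metrics import accuracy_score

val_preds = model.predict(X_val)
print(accuracy_score(val_targets, val_preds))   # explicit comparison

print(model.score(X_val, val_targets))          # same thing via the built-in .score method
```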
And it turns out that the accuracy on the validation set is just 79.2%. So you can see the accuracy on the training set was nearly a hundred percent, 99.99% as we saw here, and the accuracy on the validation set is just about 79%. And in fact, 79% is only marginally better than always predicting no. For example, if you look at the validation data and compute the percentage of values that are no, by getting the value counts and dividing them by the length of the validation dataset, it turns out that 78.8% of the data has the target no, and 21% has the target yes. Which means that if we had a model that simply predicted no all the time, it would be 78.8% accurate. And our fancy decision tree that we've trained, which is a hundred percent accurate on the training set, is only marginally better, less than 1% better, only about half a percent better, than our dumb model which always predicts no. So what's going on here? How is the model a hundred percent
accurate on the training data, but completely failing to learn anything important, anything useful, about the validation data? So here's what has happened: it appears that the model has learned the training examples perfectly, which means it has basically memorized all the training examples. It's like if you memorize the answers to all the questions in your textbook for an exam, and then you go to the exam and none of the questions come up with exactly the same values; you are likely to score a very low mark. In the same way, the model has learned all the training examples, but it does not generalize well to previously unseen examples. This phenomenon is called overfitting, and reducing overfitting is one of the most important parts of any machine learning project, especially when you're dealing with tree-based models like decision trees. So we'll see how to reduce overfitting. And the first step in understanding
what's going on is to visualize the decision tree that has been learned from the training data. Now, I told you in the beginning that a decision tree is a hierarchical tree of binary decisions, something like this. So our model actually builds a decision tree, which is pretty close to what we saw above, and we can visualize this tree using the plot_tree function from sklearn.tree. So I'm just going to import plot_tree from sklearn.tree; plot_tree uses matplotlib under the hood, so I'm just increasing the figure size here so that we get a big image we can look at, and then we call plot_tree with the model. We can also pass the names of the features, that is, the names of the columns, so that it can actually tell us which columns the model is looking at. And then we provide a maximum depth, because this is a very deep tree; it's going to have a depth of 40 or 50, which cannot be printed very easily, so we're just going to look at two levels of the tree. And this is just some information about color: we'll just fill some nodes of the tree with background colors. So let's run this and let's see what it looks like.
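A minimal sketch of that visualization call:

```python
from sklearn.tree import plot_tree
import matplotlib.pyplot as plt

plt.figure(figsize=(40, 20))  # make the figure large enough to read

# Show only the top two levels of the learned tree, with colored nodes
plot_tree(model, feature_names=list(X_train.columns), max_depth=2, filled=True);
```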
Here's what our model's decision process looks like. The model first checks the humidity at 3:00 PM: if the humidity at 3:00 PM is less than 0.715, then it goes in this direction, and after checking that, it checks whether the rainfall is below a small threshold; if that is so, it then checks the sunshine value, and then it has multiple further checks, and so on. So this is how the model proceeds: each time it makes a decision based on a value, such as checking the humidity, and it either goes left or right. Now, if it has gone right, then here there is another check on humidity, and once again, based on that, it goes left or right; then based on the next condition it goes left or right again, and it keeps going. Now, we've only plotted up to a depth of two, but you can plot to any depth here. Now, there seems to be a problem here: typically in this image, you
will see that it is connected. So in this image, you will see that there
are lines connecting these, but you can see the Tree that's building up here. So this is the first decision. And based on this decision, this may
be the second decision based on this. This may be the third decision and
so on, and that keeps going till it finds a final leaf node where there
are no more decisions to be made. And at that leaf node, it contains
information about which class should be returned as the output. Okay. So I hope you can see how a
model classifies a given input as a series of decisions. Now, of course, the tree is truncated here, but following any path from the root node to a leaf will result in a yes or a no. And I hope now you can also start
to see how a decision tree differs from a logistic regression model. Now, one important difference
that I can immediately tell you is that, instead of having a fixed weight for every column, as you go left and right, the kind of conditions and the kind of weights can change. For example, based on whether the humidity is less than 0.7 or more than 0.7, the conditions that are applied
to wind gust speed may change. And that makes sense, like if it has
rained today, maybe the factors I should look at are different compared
to whether it has not rained today. And that non-linear relationship can
be captured better in a decision tree. And it's a bit harder to
capture in a linear model. So whenever you have these non-linear
relationships, then it's always better to try out a decision tree
and see if that performs better than a logistic regression model. Okay. Now, you may wonder how this decision tree is created. How exactly does the model figure out what the first decision should be, what the second decision should be, and so on? And this is where you should pay attention to the Gini value: in each box you will see this Gini score. Now, every machine learning model has something called a loss function or a cost function, and the objective of the model is to minimize the cost. The way the decision tree does this is by using the Gini score. The Gini score represents how good a certain split is: a lower Gini score means a lower cost, which means a better split. So for a perfect split, let's say by just looking at humidity at 3:00 PM you could perfectly separate the cases where it will not rain tomorrow from the cases where it will rain tomorrow; in that case, the Gini score would be zero. So a perfect split has a Gini score of zero, and the score gets higher as a split gets worse and worse. If your split is completely useless, which means that even after splitting there are 50% yeses and 50% no's on one side and 50% yeses and 50% no's on the other side, then you will have a high Gini score, somewhere around 0.5. All right. So a low Gini score means a good split, and a high Gini score means a bad split. So what does our decision tree model do? Conceptually speaking, while training, the model evaluates all possible splits across all possible columns. So right now we are looking at
this one, split humidity, 3:00 PM. But conceptually speaking, what
the model has done is it has looked at all the different columns. And then for each column, it has looked
at all the possible split points. So it has basically sorted all the values
in those columns in increasing order. And then it has taken each
value as a split point. And then for each split point,
it has performed a split. And based on the split, it
has calculated the Gini score. Now, good splits will have a low Gini score and bad splits will have a high Gini score. Out of all the columns and all the splits, it has selected the best column with the best split, the one that leads to the lowest possible Gini score. Of course, with just one split you cannot really get to a Gini score of zero, because you can't just look at one feature in one split and perfectly predict whether or not it will rain tomorrow, but among all the possible splits, it turns out that humidity at 3:00 PM, whether it's less than 0.7 or more than 0.7, is the most important factor; it leads to the lowest Gini score. Okay? So that's how the
decision tree figures out what the top-level, root-level node should be. Now, once it has figured out what
the root level node is, which is the best split among all the columns
and all the possible splits, it performs a split using that data. So certain data points fall into
this region, all the training data, which has humidity less than 0.7
falls into this region, all the data, which has humidity greater than
0.7 falls into this region. And this is where the process is now repeated: for this entire subset of the data, which has humidity less than 0.7, it tries all the columns and all the possible splits and figures out the best split. It turns out that if humidity is already less than 0.7, rainfall less than 0.04 is the best split, and if humidity is greater than 0.7, another humidity check, whether humidity is less than or equal to 0.825, is the best split. Okay. So that's what is happening here. So the iterative approach of machine
learning in the case of a decision tree involves growing a tree layer by layer. So what we do first is input all the
training data, and we essentially look at all possible splits and
we take all those possible splits, and then we compute the Gini score for all the possible splits across all the possible columns. Based on the Gini score, we pick the best possible split. Then we split the data according to the split that was decided, and we repeat the process recursively for each side, for the left split and for the right split, right? So we are recursively growing the tree. We have the level one decision, then we make level two decisions with the split data, then level three decisions with the split data, then level four decisions with the split data, and so on. And for how long does this go on? Well, this goes on until the point where you end up with just a single data point in each leaf. And in fact, that is the number that
you can see here at the very top. You have 98,988 rows of data. And this split sent 76,000 rows into
the left and 22,000 rows into the right. Similarly, this particular split has 82,000 rows of data, and this split sends 70,000 this way and 11,000 this way. Right? So that's roughly how it works: it keeps dividing the data into multiple parts until it gets to single leaf nodes, where you just have a single row of data. And for that row of data, since you already have the target, the target for that row is used as the value of that leaf, right? So every leaf ultimately contains just one sample, and that sample already has a target of yes or no. So essentially what we're saying is that we want to follow the decision tree down to a specific example from our training set, look at the label of that training example, and return the same label. Okay. So I hope now you understand why the training accuracy is a hundred percent: because the decision tree has literally learned, literally memorized, the entire training set in the form of this tree-based structure. Okay. And you can verify how deep this tree is
by checking the maximum depth of the tree. You just call model.tree_.max_depth, and it turns out that this tree is 48 layers deep. So it's possible that within 48 decisions, you will get to a leaf node, and on that leaf node, you will have a label corresponding to the specific training example which lies in that leaf. Okay. So this is one way to visualize a decision tree and what the model learns. As I said, you would normally see arrows connecting these nodes; I'm not sure why they're not showing up here, but normally you do see them. Another way to display a decision tree, which can be easier to read, is as text. So you can call export_text
and you can pass in the model. You can again specify a maximum depth up to which you want to show things, because again, this can get pretty large, and you provide a list of feature names here too.
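A rough sketch of that call:

```python
from sklearn.tree import export_text

# Text representation of the first levels of the learned tree
tree_text = export_text(model, max_depth=10, feature_names=list(X_train.columns))
print(tree_text[:2000])   # print just the beginning; the full tree is huge
```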
So here's what the textual representation looks like. Here we are checking whether the humidity is less than or equal to 0.72: if it is, we go down this path, otherwise we go down the other path, which, well, we've not shown here; we've just shown a few lines, because the tree itself, even with 10 layers of depth, is very large. But yes: first you check humidity, then you check rainfall, then you check sunshine, then you check pressure, then you check wind gust speed, then you check humidity again, then you check wind direction, then you check the location. Now, if the location is Watsonia, then
we check the cloud cover at 9:00 AM and then we check the wind speed at 3:00 PM. And then we check the pressure. So if all of these checks
succeed, then we return yes, that it will rain tomorrow. If the pressure check fails, that is, if the pressure is not less than or equal to 0.47, then we return no. And if the wind speed check fails, that is, if the wind speed is greater than its threshold, then we check the minimum temperature, and then there's another branch of decisions that is made. And similarly, over here, if the cloud cover is greater than 0.83, then we check the cloud cover at 3:00 PM and return yes; otherwise we check the temperature and return yes, or otherwise there is another
decision tree here. Okay. So the idea is this is the same as
the decision tree that we saw above. And the idea is that we make these
hierarchical decisions and the model has learned which decision to make first
by analyzing all possible decisions. Now, one small note I want to give you. This is how you should think about it. Conceptually, the model actually
has not really analyzed all possible decisions because that
is going to be very inefficient. So there are certain techniques
or what are called heuristics that are applied, which are basically strategies to find good decisions, good enough decisions, if not the best decisions. And there is some randomization involved in there as well. Right? So just as an optimization, there
is some randomization and some strategies to pick if not the best,
but at least a good enough decision. All right. So that's the internals of decision trees,
which we don't really need to worry about. So based on this discussion now, can you
explain why the training accuracy is a hundred percent, whereas the validation
accuracy is lower and you can think about it it's because the model has
literally learned every training example. And when it sees an example which does not exactly match anything in the training set, it tries to categorize it into one of the existing training examples by following one path of the decision tree, and that may or may not end up well, because it's ultimately going to boil down to a specific training example. And this is what's called overfitting, where your model has learned or memorized specific training examples and does not generalize well to examples that it has not seen before. Okay, let's keep going. Now, based on the Gini index computation, a decision tree assigns an importance value to each feature. Again, there is a certain calculation involved in figuring out how the importance is assigned, but these values can be used to interpret the results given by a decision tree. So if you just check, inside any decision tree model, model.feature_importances_, that
will give you a list of numbers. So here's what that list looks like. And this is the importance
for every feature. Now, remember, the input to our model, X_train, had 119 columns, so you will see 119 values here. In fact, if I just check the columns, X_train.columns, you can see here that there are 119 columns. So this is the importance for minimum temperature, this is the importance for maximum temperature, this is the importance for rainfall, this is the importance for evaporation, and so on. So let's create a data frame out of it. I'm just creating a pandas data frame where we have one column called feature, the name of the column in the original data frame X_train, and one column called importance, which is the importance of that feature, and then we sort those values by importance in descending order. Let's look at the 10 most important columns. So we have humidity at 3:00 PM, which seems to be the most important column at 0.26; then we have pressure at 3:00 PM, which seems to be the next most important; then we have rainfall, and so on. And you will find that these importances line up with the decision tree itself: you can see here you have humidity, then rainfall, then wind gust speed; sunshine shows up too, and pressure doesn't show up yet, but if you went one level deeper, you would also see pressure. So these are the importances: humidity, pressure, rainfall, wind gust speed, sunshine, etc. We can also plot these as a bar plot; I'm just using sns.barplot
to create a horizontal bar plot. And we are looking at the 10
most important features here. So it turns out that humidity at 3:00 PM has a feature importance higher than 0.25, whereas the next most important feature seems to be pressure at 3:00 PM, followed by rainfall, wind gust speed, etc. These values should be interpreted in relative terms: mostly you just want to use them to figure out which columns or features are more important than others.
interpret a decision tree. You can see the actual decision
making process of a decision tree. And given an example, you can
actually just draw the tree and walk through it and see why a decision
tree arrived at a certain answer. And you can also see the importance
of the different factors. And this is where now you can check
if humidity has a lot of missing values, and maybe we filled a lot of missing values into humidity. Maybe we are misleading the model
by filling all those missing values. Maybe we should remove the humidity
column, or maybe we should try and fill those missing values and so on. Right? So you need to go back and forth. You need to go back and check if
your data makes sense, given that this is the feature importance
that you're working with. Yeah. So that's how you train a decision tree. You import the DecisionTreeClassifier from sklearn.tree, then you fit it to the input data, and then you can analyze it. You can evaluate it using
the validation dataset. And we saw that the decision tree classifier that we trained memorized all the training examples, leading to a hundred percent training accuracy, while the validation accuracy was only marginally better than a dumb baseline model. So at this point, our decision tree is basically useless, because it has just memorized all the training examples, and this phenomenon is called overfitting. In this section, we will look at some strategies for reducing overfitting. So you will hear a lot of these terms. There'll be four or five
terms that you'll hear right now. And often in machine learning,
overfitting simply means that you are doing very well on the training data, but you're doing very poorly on the validation data. And we'll define it a bit more concretely in a short while. And then the process of reducing overfitting is known as regularization. So whenever you see regularization, or regularization techniques, regularization coefficient, regularization component, etc., all of that is concerned with reducing overfitting, which means trying to increase the validation accuracy, or get it closer to the training accuracy. And sometimes we may be okay to
give up some training accuracy, to get a better validation accuracy,
because the validation accuracy is what we ultimately care about. now, how do we reduce overfitting
in a decision tree classifier. Now their decision tree classifier. When we created it, we gave
it a couple of arguments. We set some random state. We give it just one argument,
which was the random spit a state. And apart from that, it also accepts
several other arguments which can be used to reduce overfitting. So if you just check the help, which
is by typing a question, mark, for decision tree classifier, you will
see that you can specify a criterion, which can be Ginni or entropy. And this is simply the loss function. So there are two loss functions. One is Ginni, and one is entropy. You can specify a splitter. So this is the strategy that
is used to split at each node. And by default, it picks the best
strategy, which just picks the best possible split, of course, with some
randomization, or you can also specify a completely random split with, without
actually looking at the, without actually evaluating different splits. But here's something interesting. So you have a max depth parameter and you
can specify the maximum depth of the tree. So there is a max depth parameter, and typically these arguments in the context of machine learning, the ones that you set when you are creating your machine learning model, are called hyperparameters, because the term parameter is generally reserved for the parameters or the numbers inside the machine learning model. In logistic regression, the weights of the different features are known as parameters. In our decision tree, which column is the root node, at what point we are splitting, and what the splits look like, those are known as parameters. Anything that the model learns or figures out on its own is called a parameter. So just to separate the things the model figures out from the things that we have to set up front, we call the latter hyperparameters. So we call max depth, which is something that we specify at the very beginning when we are creating the classifier, a hyperparameter, because it's not something the model figures out; it's something that we are specifying. So what is maximum depth? Well, if you saw the tree, the tree went down quite a few levels deep, and we can check that. For the previous model, the model that we trained earlier, we can look inside the tree: we can call model.tree_ and then check max_depth.
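A minimal sketch of that check, assuming `model` is still the unrestricted decision tree we trained earlier:

```python
# The fitted estimator exposes the underlying tree object, which knows its depth.
model.tree_.max_depth  # 48 in the lecture's run
```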
the decision tree went 48 levels deep. And that was one of the reasons
for overfitting because it was learning every training example. So what if we did not go 48 level, Steve? What if we only went Tree levels deep,
Let's try and see what that will give us. So now we have put in a restriction that we do not want the decision tree to go more than three levels deep. And then we call model.fit with the same training input data and the same training targets. Now the model has been trained again; it just takes a second or two. And then we try to compute the accuracy of the model on the training and validation datasets. So we call model.score on X_train and train_targets, and we call model.score on X_val and val_targets.
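Here's a minimal sketch of that, assuming the prepared `X_train`, `train_targets`, `X_val`, and `val_targets` from earlier in the notebook:

```python
from sklearn.tree import DecisionTreeClassifier

# Restrict the tree to three levels of decisions.
model = DecisionTreeClassifier(max_depth=3, random_state=42)
model.fit(X_train, train_targets)

# Accuracy on the training and validation sets.
print(model.score(X_train, train_targets))  # about 0.83 in the lecture
print(model.score(X_val, val_targets))      # better than the unrestricted tree
```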
And now it turns out that the model is only 82% or 83% accurate on the training set. And this makes sense, because the model can no longer learn every training example. It can only go three layers deep, so it just has to make the best it can out of three layers. But this has the consequence that the model is no longer overfitting: the model now performs better on the validation set than it did before. So this may seem counterintuitive, that a three-level-deep tree performs better on real-world data compared to a 48-level-deep tree. And that's because the 48-level-deep tree is learning specific training examples, whereas the three-level-deep tree is picking up general trends, and in machine learning, you want models to pick up general trends and not memorize training examples. Okay. So that's the model. The model's validation accuracy has gone up to 83% from 79%. That's a good improvement. And even though the training accuracy has gone
down, ultimately what we care about is the validation accuracy. And let's visualize the model now. We can visualize the model using plot_tree once again.
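A minimal sketch of that visualization, assuming the same `model` (now the depth-3 tree) and `X_train`:

```python
import matplotlib.pyplot as plt
from sklearn.tree import plot_tree

plt.figure(figsize=(40, 20))
# Show feature names and color nodes by class to make the chart readable.
plot_tree(model, feature_names=list(X_train.columns), filled=True);
```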
So here's what our entire decision tree model looks like. First, we check humidity at 3:00 PM. If the humidity at 3:00 PM is less than 0.715, then we go left. Now, here we check the rainfall. If the rainfall is less than 0.004, then we go left. Then we check sunshine: if the sunshine is less than 0.0525, we go left. And finally, if we reach this point, we return the class of this node. So whenever you reach a leaf node, you return the class of that leaf node. Similarly, you check humidity, rainfall, humidity again, and you get to this point where you return no. So it seems like a lot of these leaves have no. So in a lot of cases, as you go along
this decision tree, you will end up at no. But there are certain cases where you will end up at yes. So you go to humidity at 3:00 PM, then rainfall. So if the humidity at 3:00 PM is less than 0.825, and then the wind gust speed is less than 0.279, the Gini score is 0.471. Alright, so if the humidity is greater than 0.7 but less than 0.8, and the wind gust speed is less than 0.27, then here the class is yes. So here you return that there will be rain tomorrow. And if the humidity is greater than 0.7 and also greater than 0.8, well, it turns out that in all these cases you end up at yes. Right? And of course these trees got truncated; these subtrees could not be built beyond three layers deep. So that's why you see a bunch of yeses here. It's possible that if you allowed more layers, then maybe some of these yeses would split once again into nos, but because we are ending the tree at three layers, it's going to return no in all these conditions, and it's going to return yes in all these conditions. Okay. So this is what you want to study
carefully, because at this point we already know that we can predict with 83% accuracy simply by looking at humidity, rainfall, sunshine, and wind gust speed. And that's it. Right? So just four out of the 23-plus columns can be used to get a prediction of 83% accuracy. And once again, we can also look at it as a textual tree. So here you can see the same thing: humidity less than 0.72, and humidity greater than 0.72. And then you check the rainfall, and based on the rainfall value, you either check sunshine or you check humidity once again, and it returns yes and no. Okay. So one thing you may wonder is
what is the right maximum depth to use? Should we use a maximum depth of zero? Obviously not because if you use
the maximum depth of zero, then your model would not learn anything. That means it would
always just predict no. And while that would be 79% accurate and very regularized, it would not be very useful, because you have not given enough power to your model. But on the other hand, if you allow your model to go 40 layers deep or 50 layers deep, then your model can
memorize every single training example. And since it is trying to optimize
for the lowest Gini score, it is basically memorizing all the training
data and that's bad because then your model will not generalize. So the best value for the maximum
depth of the tree is going to be somewhere between zero and 40. So let's try and explore that. Let's try and figure out what the best
value for maximum depth is going to be. So here's what I'm defining. I'm defining a function called max_depth_error, which takes a max depth value as an input. We then create a decision tree classifier for that particular max depth, with the random state 42, and we get this model. Then we fit this model to the training data. For the given max depth, we create the model and fit it to the training data. Then we calculate the accuracy on the training set and we calculate the accuracy on the validation set, and we define the training error as one minus the training accuracy, and the validation error as one minus the validation accuracy. So if accuracy is what percentage it got right, error is the percentage it got wrong. And then we simply return this dictionary, and you'll see in a moment why I'm doing this. Now we take this max_depth_error function, which takes a max depth and figures out, for that max depth, the training error and the validation error, and we run that through a list comprehension. So we try all possible values of max depth from 1 to 21. And that's why it's taking a while: because we are building a decision tree for every max depth value from 1 to 21, and we're computing the training error and the validation error for each of these models. And then we're putting all those results into a data frame. So let's give that a minute. And then here you go.
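Here's a minimal sketch of that experiment, assuming the same prepared data; `max_depth_error` and `errors_df` are the names used in this walkthrough:

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

def max_depth_error(md):
    # Train a tree limited to depth `md` and report 1 - accuracy
    # on both the training and validation sets.
    model = DecisionTreeClassifier(max_depth=md, random_state=42)
    model.fit(X_train, train_targets)
    return {
        'Max Depth': md,
        'Training Error': 1 - model.score(X_train, train_targets),
        'Validation Error': 1 - model.score(X_val, val_targets)
    }

# Build one tree per max depth value from 1 to 20 and collect the errors.
errors_df = pd.DataFrame([max_depth_error(md) for md in range(1, 21)])
errors_df
```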
So this is what we get. When you have a max depth of one, the training error is 0.18, which means that just by selecting a max depth of one, just by making one decision, you get a training accuracy of about 81% and a validation accuracy of about 82%. That's what it looks like. But as you increase the max depth, the error goes down, which means the accuracy improves. You can see here, the training error keeps decreasing and the accuracy keeps improving: 0.18, 0.17, 0.16, 0.15, 0.14, 0.13, and it goes on all the way down to 0.0903. So at this point we are at about 91% training accuracy at a max depth of 20. And of course, if we increase the max depth further, the model will, so to speak, learn or memorize more training examples, and it will get better on the training data. But notice what's happening with the validation error. The validation error is at 0.17, and it goes down and it goes down to 0.15, and then it starts to increase again. So you can see here, from 0.15 it starts to go to 0.16, 0.17, 0.18, 0.19. And if you plot this, here's
what it will look like if I plot it. So we've simply plotted the training error versus the validation error. The blue line is the training error, which is one minus accuracy, and the orange line is the validation error.
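A minimal sketch of that plot, assuming the `errors_df` data frame from the sketch above:

```python
import matplotlib.pyplot as plt

# Training error in blue, validation error in orange (matplotlib's defaults).
plt.plot(errors_df['Max Depth'], errors_df['Training Error'], label='Training')
plt.plot(errors_df['Max Depth'], errors_df['Validation Error'], label='Validation')
plt.title('Training vs. Validation Error')
plt.xticks(range(0, 21, 2))
plt.xlabel('Max Depth')
plt.ylabel('Prediction Error (1 - Accuracy)')
plt.legend()
plt.show()
```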
So what's happening in the region where you see both the training and the validation error decreasing? What's happening here is that you're making your model progressively more powerful. Here you're allowing it to make just one decision; here, for the model with max depth two, you allow it to make two layers of decisions; and for the model with max depth four, you allow it to make four layers of decisions, and so on. So up to a certain point, it helps to add more complexity or more power to your model, right? It helps to make your model bigger. But after a certain point, once your model's capacity gets large enough, it starts to just focus on memorizing the training data, and it stops generalizing. So at this point, you see it gets better and better at the training data, and it gets worse and worse at the validation data. And this is the scenario
that is known as overfitting. Okay. So here is the graph that you will see over and over again in pretty much every problem. As you increase the complexity or the size of your model (the size, the power, the capacity of your model are many different ways of looking at it; ultimately, it's a question of how many parameters there are inside the model), you will notice that both the training error and the test or validation error go down up to a certain point, because the model has more capacity, it can learn more, and it can actually capture some information about how the inputs and the targets are related. But after a certain point, it will start memorizing training examples, and that is the point where your test or validation error will start to increase, where your validation accuracy will start to drop. And this scenario is known as overfitting: your training error is going down, but your validation error is going up. If you train your model a little more, or if you increase the complexity of your model a little more, if you add one more layer to your decision tree, then the training error goes down, but the validation error actually gets worse. And this is where you should stop training your model. So you want to pick the complexity of your model at the point where the validation loss is just about to increase. So by plotting this graph, we have been able to figure out that at a max depth of seven, we get as good as this decision tree can get on the validation error for the given dataset. So a max depth of seven is actually
the best depth for this decision tree. So that's what we figured out. This is how you regularize a decision tree: you reduce overfitting by tuning some hyperparameters. So max depth is called a hyperparameter, and just changing its value is called tuning the hyperparameter. By tuning this hyperparameter, we have regularized the model a little bit; we have now reduced the amount of overfitting that it has. So you can now see the validation score, and let me also print out the training score here. The training accuracy and the validation accuracy are both about 84.5 to 84.6. So that seems like the best we can do by modifying the max depth of the decision tree. Okay. So we just looked at one hyperparameter, which is max depth, and we also looked at how that hyperparameter can be used to regularize the model. Let's look at another hyperparameter. This one is called max leaf nodes. So this is another way to control the complexity of a decision tree, which is to limit the number of leaf nodes. Now, whenever you have a decision tree,
there are, as you can see here, a certain number of decision nodes, and then there are a certain number of leaf nodes. Now, the way we have limited the size of the decision tree, or its complexity or its parameters, in this case is by specifying how deep it can get. But that may not be the best way. Maybe you want to allow it
to go a few layers deep here. Maybe you want to allow it to go five
layers deep here, and you want to allow it to just stay two layers deep here. So that's where you can actually
specify the maximum number of leaf nodes that your decision tree can have. So here's how I'm going to do that. I'm going to specify for this decision tree that the maximum number of leaf nodes it can have is 128. Now, roughly speaking, if you have one node at the top that splits into two nodes below it, which split into four nodes below them, we might think that the decision tree is actually built layer by layer, where it builds layer one and then builds layer two and then it builds layer three. But actually what happens is it always tries to make the best possible split.
Maybe let's look at it here. So if a decision tree has created layer one, it has created a split here, and then based on the split, you now have two possible splits, left and right. Now it looks at both of these and it sees which is the better one to split. If splitting this one is going to result in a lower Gini coefficient, then it splits this into two parts by creating a split condition here. And now it's going to analyze, among all of these leaf nodes, which is the best split. So if it determines that this is the best split to make, then it's going to make this split first. And maybe at this point, it's going to look at all these leaf nodes again and determine which is the best split to make. So maybe at this point, the next best split to make is this one, and maybe after this, the next best split to make is this one, and so on. So your decision tree doesn't really go layer by layer, where it first does this and then it does this; rather, it looks at all the leaf nodes and figures out which is the best leaf node to split at the moment, and it splits that. Now, how does that tie back to max leaf nodes? Here, what we're saying is we want max leaf nodes to be 128, and 128 is two to the power of seven. So if you had a decision tree which is seven layers deep, at its lowest level it would have 128 leaf nodes. Now let's try and give
it a max leaf nodes value of 128, and let's see if the decision tree actually has a depth of seven. So here we create this decision tree: we call DecisionTreeClassifier, and then we call model.fit. So now we're training, and the only restriction we've specified is that the number of leaf nodes, the nodes which haven't been split further, should not go higher than 128, and that limits the size of the tree. Now we fit the model on the training data, and it turns out that the training accuracy is only 84.8% and not a hundred percent, for the same reason: it cannot go down and memorize every training example. There's only a limited number of nodes it can create. And let's check the model's accuracy on the validation dataset. On the validation set, this time the accuracy is 84.4%. And let's check the tree's depth as well. So the depth of the tree is 12.
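Here's a minimal sketch of that, again assuming the prepared `X_train`, `train_targets`, `X_val`, and `val_targets`:

```python
from sklearn.tree import DecisionTreeClassifier

# Limit the tree by the number of leaf nodes instead of by depth.
model = DecisionTreeClassifier(max_leaf_nodes=128, random_state=42)
model.fit(X_train, train_targets)

print(model.score(X_train, train_targets))  # about 0.848 in the lecture
print(model.score(X_val, val_targets))      # about 0.844 in the lecture
print(model.tree_.max_depth)                # the tree still goes 12 levels deep
```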
So let's compare this 84.4% with what we had previously. Here, we had a model that could go to a maximum depth of seven, and that had 84.5% validation accuracy. In this case, we have 84.4%. Maybe if you change this a little bit, maybe if you change this to 130 or 140, you may find that it actually crosses that. But the important thing is that these two are different. And the reason these two accuracies are different is because the strategy by which we are limiting the size of the tree is different. In one case, we are saying that the max depth can be seven. In the other case, we're saying that the maximum number of leaf nodes can be 128. And the tree actually does go down to a depth of 12 in certain places. So that means certain parts go down to a depth of 12, but certain parts are maybe shorter, maybe just three or four levels deep. And we can try and verify this. We can convert this model into the textual
representation of the model and maybe look at just the first few lines of that textual representation. The entire thing would get pretty long, so I've just printed the first 3000 characters or so.
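A minimal sketch of that, assuming the `model` trained with max leaf nodes 128 above:

```python
from sklearn.tree import export_text

# Textual representation of the tree; show only the first 3000 characters.
tree_text = export_text(model, feature_names=list(X_train.columns))
print(tree_text[:3000])
```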
So here, you can see that this is a fairly long path, but this part is definitely shorter. You can see that this path is shorter than this one. This one is about 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, so there are 12 checks here. On the other hand, here there may be fewer than 10 checks, and here there are even fewer checks. And maybe once we go further, if I print more of this, you will see that there are shorter and shorter paths. Yeah, so this is definitely a shorter path: you can see that this is maybe 1, 2, 3, 4, 5 levels deep. On the other hand, this one is 6, 7, 8, 9 levels deep. So sometimes you have five levels deep, sometimes you have nine levels deep, and that depends on the best splits that the decision tree was able to find. So here's an exercise where you
find the combination of max depth and max leaf nodes that results in the highest validation accuracy; one possible way to approach it is sketched below. Okay.
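Here's one possible way to approach that exercise, a minimal sketch assuming the same prepared data; the ranges tried here are just illustrative:

```python
from sklearn.tree import DecisionTreeClassifier

best_params, best_acc = None, 0

# Try a small grid of max_depth and max_leaf_nodes combinations and
# keep the one with the highest validation accuracy.
for max_depth in range(2, 15):
    for max_leaf_nodes in [32, 64, 128, 256]:
        model = DecisionTreeClassifier(max_depth=max_depth,
                                       max_leaf_nodes=max_leaf_nodes,
                                       random_state=42)
        model.fit(X_train, train_targets)
        acc = model.score(X_val, val_targets)
        if acc > best_acc:
            best_params, best_acc = (max_depth, max_leaf_nodes), acc

print(best_params, best_acc)
```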
Then another exercise for you is to experiment with the other arguments of the decision tree classifier. Scikit-learn has excellent documentation, which is extensive, very helpful, and very easy to read as well. So just check out the documentation of sklearn.tree.DecisionTreeClassifier, look at all of these arguments, and go through all of them. In a lot of these cases, it will tell
you exactly what each parameter does. Maybe try a different criterion, maybe try the random splitter and see if that helps, and try changing the max depth. We did do some experiments, but how max depth matters if you're working with a random splitter, etc., is worth figuring out. There are some other parameters, hyperparameters, here that you can look at. So try and experiment with all of
these, and you can see that there are detailed explanations for each of these. And in fact, in certain cases
you will find that there are links to other resources. And as I said, a lot of these are
implementations of some of the best papers in machine learning. So a lot of the best
practices are given to us. A lot of the best
techniques are given to us. We just have to try them
out with scikit-learn. Another exercise for you is to
try out a more advanced technique for reducing overfitting. This is called cost complexity pruning. So just as we have limited the number of nodes by depth, and we have limited the number of nodes by the number of leaf nodes, there is a way to limit the number of nodes by the kind of split that a node performs. So we perform a split only if it satisfies certain criteria, and this is called cost complexity pruning. You can learn more about it here. It's not a very commonly used technique, because decision trees by themselves are almost never used in isolation. So I will not cover it here, but it's something that you can check out; scikit-learn has good documentation on it, and in fact it has an example implementation and also a link to the paper. So you can check this out, try and follow the code from that tutorial, try to implement cost complexity pruning (a brief sketch of the scikit-learn interface is shown below), and see if you can improve the validation accuracy further. Okay.
Machine learning is all about trying different hyperparameters, trying different techniques, and getting that additional boost in the model's performance. So just a quick recap of the
topics that we've covered today. We started out by looking at the problem statement, the Rain in Australia dataset, which contains about 10 years of daily weather observations from numerous Australian weather stations. We have used all of this information, more than 20 columns of observations, to predict whether it is going to rain tomorrow at a particular location. And we did this by first downloading the data using the opendatasets library from Kaggle. Once the dataset was downloaded, we read it in using the pandas library, using pd.read_csv, to view the data. We also looked at the different columns and the different data types within the dataset, and looked at the number of NaN values in each column. We dropped all the rows where the value of the target column RainTomorrow was missing. Then we prepared the dataset for training. The first step was to create
training and validation sets. And we decided that because the data is ordered by year, and because this model is going to be used in the future, we are going to use the data for 2016 and 2017 as the test dataset, the data for 2015 as the validation dataset, and the data from the earlier years, up to 2014, as the training data. And we created a training, validation, and test split; you can see this here.
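Here's a minimal sketch of that kind of year-based split, assuming the raw data is in a pandas DataFrame called `raw_df` with a `Date` column (the variable names are illustrative):

```python
import pandas as pd

# Extract the year from the Date column and split chronologically.
year = pd.to_datetime(raw_df.Date).dt.year

train_df = raw_df[year < 2015]   # earlier years, up to 2014
val_df   = raw_df[year == 2015]  # 2015 as the validation set
test_df  = raw_df[year > 2015]   # 2016 and 2017 as the test set
```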
Then we identified the input and target columns. So the input columns are all those columns which we will use to make a prediction for the target column, which is RainTomorrow. Now, in the input columns, we chose not to include the date, because the dates we will be working with in the real world will be in a completely different range, and we want to use just today's weather data to predict tomorrow's rain; which date it is today is not very important. We also separated out the numeric and categorical columns, because both have to be pre-processed separately. Then we imputed the missing values in numeric columns. So we just ran a SimpleImputer using the mean strategy, which means we fill all the missing values in numeric columns with the average value of that column. Then we scaled the numeric
features to the zero-to-one range. The zero-to-one range helps ensure that all the columns have a similar range of values, so that one column does not dominate the value of the loss or the process of optimization. Then we encoded the categorical data using a one-hot encoding technique, where we took each category from a categorical column and introduced a new column for that category, and we placed zeros and ones in that column to indicate whether or not a row belongs to that category. And we created these new encoded columns. So if you had three categorical columns, and for each of those three columns you had four categories each, you would now end up with 12 encoded columns. So just keep in mind the difference between categorical columns and encoded columns: encoded columns are the new one-hot encoded columns that we created. And what we want to do with the encoder is to transform the categorical columns into encoded columns. That's why you can see here that we've inserted into train_inputs[encoded_cols] the transformed values of train_inputs[categorical_cols].
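For reference, a minimal sketch of that preprocessing flow, assuming `train_inputs` plus the `numeric_cols` and `categorical_cols` lists from the notebook; the exact encoder options may differ slightly between scikit-learn versions:

```python
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Fill missing numeric values with the column mean, then scale to the (0, 1) range.
imputer = SimpleImputer(strategy='mean').fit(train_inputs[numeric_cols])
train_inputs[numeric_cols] = imputer.transform(train_inputs[numeric_cols])

scaler = MinMaxScaler().fit(train_inputs[numeric_cols])
train_inputs[numeric_cols] = scaler.transform(train_inputs[numeric_cols])

# One-hot encode the categorical columns into new encoded columns.
# (On scikit-learn versions older than 1.2, use sparse=False instead.)
encoder = OneHotEncoder(sparse_output=False, handle_unknown='ignore')
encoder.fit(train_inputs[categorical_cols])
encoded_cols = list(encoder.get_feature_names_out(categorical_cols))
train_inputs[encoded_cols] = encoder.transform(train_inputs[categorical_cols])
```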
Now, if you're ever unsure about what a particular line of code does, all you need to do is create a new code cell and just run the code step by step. For example, you can check what the value of categorical_cols is, then check what the value of train_inputs[categorical_cols] is, then check what the value of encoder.transform(train_inputs[categorical_cols]) is. And if this is a NumPy array, you can then check out the shape of the array, how many rows and how many columns it has, and then you should check what the value of this quantity is. And then you can run the entire statement, see what happens, and then check the value of train_inputs. So use the Jupyter notebook interface, the interactive nature of the platform, to explore each line of code and dig deeper. Now, after one-hot encoding the categorical columns, we just created these X_train, X_val, and X_test variables containing just the numeric and the encoded data. So we are no longer looking at the actual categories; we just look at the imputed and scaled numeric columns and the encoded columns, which contain the one-hot encodings of the categorical columns. Then we decided to train a decision
tree, and the way to train a decision tree here is to use the DecisionTreeClassifier, because this is a classification problem; but decision trees can also be used for regression, in which case you would import DecisionTreeRegressor. Then you create a decision tree classifier model, and then you train it using the training inputs and the training targets. Once a decision tree has been constructed, you can make predictions using the decision tree by calling model.predict. So when you call fit, that is when the decision tree is set up, that is when all the parameters in the decision tree are created. And when you call predict, predictions can be made on any input data that you give to the decision tree. In this case, we were looking at X_train, so the output we got was the predictions for the training set. And then we can compare the predictions from the training set with the targets, using the accuracy score, and we got a 99.99% accuracy on the training set.
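A minimal sketch of that training and evaluation flow, assuming the prepared `X_train`, `train_targets`, `X_val`, and `val_targets`:

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# An unrestricted tree, as in our first attempt.
model = DecisionTreeClassifier(random_state=42)
model.fit(X_train, train_targets)

train_preds = model.predict(X_train)
print(accuracy_score(train_targets, train_preds))         # close to 1.0
print(accuracy_score(val_targets, model.predict(X_val)))  # around 0.79
```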
However, when we tested the same thing on the validation set, we got only a 79% accuracy, and we realized that this is only mildly better than just always predicting that it will not rain tomorrow, which means the model was heavily overfitted. The model has learned all the training examples, but does not generalize well to data it has not seen before, such as the validation set. So we then looked into the decision tree to learn more and identify how we can solve the overfitting that the model currently faces. So first we learned that a decision
tree can be visualized. We used the plot_tree and export_text functions from sklearn.tree. A decision tree is simply a series of binary decisions, where you make a decision, and based on that you make another decision, and based on that you make another decision, till you get to a point where there are no more decisions to be made: you get to a leaf node, and there the example is then classified as rain or no rain. And the way this is created is using the Gini index. So the model tries to perform an optimal split at every stage, at the top level, and then the level below it, and then the level below that, and that's how it comes up with the optimal decision tree. So the iterative process of machine learning for decision trees is constructing the decision tree level by level. And you can also view the decision tree as a textual tree, which is sometimes easier to navigate, especially with larger and deeper decision trees. And you can see what parameters the decision tree looks at to come to a particular conclusion. We can also check the feature
importances, the importance of the different features. In this case, it turns out that humidity at 3:00 PM, pressure at 3:00 PM, and rainfall seem to be the most important features for
this particular decision tree. And we can also plot that as a
bar graph to see the relative importance of different features. Finally, we talked about hyper
parameter tuning to avoid overfitting. The decision tree classifier accepts various arguments which can be modified to reduce overfitting, and a couple of the things we looked at are max depth and max leaf nodes. By reducing the maximum depth of the decision tree, we can prevent the tree from memorizing all training examples, which may lead to better generalization. And the way to specify max depth is using the max_depth argument of the decision tree classifier class, and that limits the number of layers, or how deep the decision tree can go. And once you apply that max depth, the decision tree can no longer memorize the training data, so its score on the training data falls: it's only 82.9% accurate. But its score on the validation dataset increases, because now it is generalizing; it is picking up more general trends within the data and not specific training data rows. That's why the validation accuracy rises to 83.3%, and this was via a decision tree that is just three layers deep. It also gives you a lot of insight into the data: just by looking at humidity, rainfall, and maybe a couple of other parameters, you can predict with about 83.5% accuracy whether or not it will rain tomorrow. And here is a simplified decision tree
which is just three layers deep. Now, what you would want
to do is experiment with different values of max depth. And this is what the graph looks
like when you plot training error versus validation error. When your max depth is too small, then your decision tree is not powerful enough, and it will not have a very high training accuracy. If you look at the error, which is a hundred percent minus accuracy, or one minus accuracy, the training error and the validation error will both be high, because your model is not powerful enough to pick up important relationships between the training data and the targets. But once you start increasing the size of the decision tree, once you start allowing it to go deeper, maybe two levels deep, three levels deep, four levels deep, your training error starts to decrease because your model is becoming more powerful, and your validation error starts to decrease as well. At a certain point, you will notice that the validation error either becomes flat or starts to increase. This is the point where overfitting starts to happen. For this particular example, it happens at about a max depth of seven: as soon as you go more than seven layers deep, the model starts to memorize specific training examples rather than picking up general trends about the weather. So that is why you can see that the training error continues to go down, but the validation error starts to increase, and at a certain point it becomes much worse than what the validation error was initially. So this is something that
you should keep in mind. And this is a general trend that
you will notice with all the hyper parameters, which is anything
that you have to configure in advance before training the model. Right? So for the decision tree that is getting built out, all the decisions and all the decision points, those are parameters, because that's what the model is learning from the data. But the max depth is something that we provide before we actually train the model, before we create the model, and these are called hyperparameters. So with hyperparameters, what you will always find is that there is this model complexity axis: if you change the value of a hyperparameter, the model can go from less complex to more complex, or from less powerful to more and more powerful. More power isn't always a good thing, because if you continue increasing the power or the capacity or the complexity of the model, then at some point it will have enough capacity to memorize the entire training dataset, which is what it is optimized for, to reduce the training error. And then it will not generalize well to real-world data, test data, or validation data. So what you want to find is that best fit, which is the point at which the validation loss has gotten as low as it can possibly be; if you go any further, then it is going to start increasing. So this is the best fit that you're looking for. Before this, you are in a space called underfitting, where your
model is not powerful enough, or it hasn't learned enough about the data. After this point, you are in a space
called overfitting where your model is just memorizing training data, and it
is getting worse on real-world data. Okay? So every hyperparameter is something that you will have to vary to find the optimal value. Then we also looked at another hyper
parameter called max leaf nodes, which is yet another way to control
the complexity of a decision tree. And the benefit here is that
this allows branches of the tree to have varying depths. So instead of having to limit the depth to a certain number, you can say that the number of leaf nodes should be a certain number, like 128, and then the tree can take different depths along different parts, depending on how many decisions need to be taken along a certain path. So this can sometimes be somewhat better than max depth, but typically you would use a combination of both of these. So here is an example where we used a max leaf nodes value of 128, which is what you would normally get if you had a seven-layer decision tree going all the way to every leaf. But it turns out that when you use a max leaf nodes value of 128, the tree depth comes out to 12, but not every path is 12 steps long. You will find that some paths are quite short, like three or four steps long, and some paths are quite long, like 8, 9, 10, or 12 steps long. So max leaf nodes allows the tree to have a different structure in each subtree: some branches can be short, some branches can be long. And typically you would have to use a combination of max depth and max leaf nodes. And just like max depth, as you increase the number of max leaf nodes, your model will start to overfit at some point, and if you keep your max leaf nodes too low, then your model will not be powerful enough. So there is a value somewhere in between which gives you the optimal validation loss, which is what you need to find. And this is really the entire art of machine learning: once you have a model, you have to find the right hyperparameters which minimize the validation loss for that model. Okay. Now, there are several other arguments within the decision tree classifier that you should explore; refer to the documentation. And another more advanced technique for reducing overfitting in decision trees is called cost complexity pruning. So you should check that out as well, and try to implement cost complexity pruning for this problem. All right. So that was a quick
introduction to decision trees. Now, of course, we've skipped
over some parts, especially the more mathematical parts about
the Gini index and the importance calculation and things like that. But I hope you were able to
get a basic intuition of how these splits are created. We look at all possible features and all possible splits, and find the best split, where the evaluation is done using the Gini index. Then we divide the data into these two portions, and then we identify which is the next best split to be made, based on the leaf nodes that we have, and then we make that split. And we keep going till we either hit leaf nodes for every training example, in which case we essentially memorize the entire training dataset, or till we hit some limits that have been artificially imposed via some hyperparameters for the purpose of regularization. And in general, you don't want to
have an unbounded decision tree. You want it to be somewhat generalized. So that's where you
want to limit its depth. And you want to maybe limit some of
the other hyper parameters as well. Now while tuning the hyper parameters
of a single decision tree may lead to some improvements, a much more effective
strategy is to combine the results of several decision trees that are trained
with slightly different parameters. We will continue using this notebook. We will continue using the same data. And we will see how we go from decision
trees to an ensemble model called random forest, and why that is helpful, how
that affects the results of our modeling. So what should you do next? Review the lecture videos and
execute the Jupyter notebook, complete the lecture exercises, and
start working on assignment one, if you haven't already. A new assignment is coming soon. And discuss on the forum and on the
discord server and ask questions. This is a very important part of
participating in an online course, and it really helps you stay motivated. It helps you improve your learning. It helps you be part of a
community which can open up avenues for future collaboration,
where you can continue to learn. Long after the course has ended. And you will find some friends
that you may build associations with personally or professionally. So with that, I will
see you in the forums. You can follow us on Twitter, and you can visit the course website anytime at zerotogbms.com. Next week, we will look at random forests. This was Machine Learning with Python: Zero to GBMs. Thank you, and have a
good day or good night.