Hello, and welcome to this workshop
on how to build a machine learning project from scratch. Today, we are going to walk through the process of building a machine learning project, and we're going to write some code live. We're going to start by downloading a dataset, then processing it, then training a machine learning model. In fact, a bunch of different machine learning models, and evaluating those models to find the best one. We will also do some hyperparameter tuning and some feature engineering. Now, before we start, if you're looking to start a new machine learning project, a good place to find datasets is Kaggle. So I just wanted to show you this
before we get into the code for today. There are a couple of places on Kaggle, which is an online data science community and competition platform, where you can find good datasets for machine learning projects. The first is competitions. Kaggle is a competition platform that has been around for close to 10 years at this point, I believe, and they have hosted
hundreds of competitions. So you can go on kaggle.com/competitions,
and then go back all the way. You can go back to competitions
from 2010 or 11, I believe. And use those datasets to work on
your machine learning projects. For example, one of the datasets that we will be looking at today is from the New York City taxi fare prediction challenge. This competition was conducted three years ago by Google Cloud, and the objective was to predict a rider's taxi fare, given information like the pickup location, the pickup date, and the number of passengers. And you can learn a little
bit about the competition. You can look at the data for the
competition before downloading it. You can also look at a lot of public
notebooks that have been shared by other participants in the competition. Reading others' notebooks is a great way to learn, and in fact, a lot of the techniques that we are covering today come from public notebooks. You can also look at the discussions if you have any questions about how to go about doing a certain thing. Now, one of the best parts of
Kaggle is that you can actually make submissions to the leaderboard. You can go to My Submissions and click Late Submission, and although you will not rank on the leaderboard, your submission will still be scored and you can see where you land among the entire set of participants in this competition. In this competition, for example, there were over 1,400 teams that landed on the leaderboard, and getting anywhere in the top 30 to 40% on a Kaggle competition, even one that has ended already, is a sign that you're probably building really good machine learning models. So that's one place on Kaggle where
you can find datasets for machine learning, and you have at least a
hundred options here to choose from. So if you're building a first or
second project, I would just go here. But apart from this on
kaggle.com/datasets, you can find hundreds of other datasets. Now, one of the things that I like to
do when I'm searching for datasets on Kaggle, especially for machine learning
or classical machine learning, as we call it to differentiate it from deep learning, is to go to the filters here, select the file type CSV, and set a minimum limit on the file size. A minimum limit of 50 MB generally gives you a large enough dataset to work with. I apply those filters, then I sort by the most votes, and that leaves us with about 10,000 datasets to choose from. And finally, I put in a query or a keyword to filter datasets by a specific domain. So here, for example, are all the
datasets that are related to travel. Now, since these are sorted by the most votes, somebody has already done a lot of exploring for you. You can just look through the first five or ten datasets. Not all of them may be suitable for machine learning, but many are, and you can in fact open up datasets and read their descriptions. In many of these descriptions, several tasks will be mentioned that tell you how you can do machine learning with the data. Another thing you can do here is go into the Code tab, and on the Code tab you can search for machine learning terms like random forest, and you can see that people have used the dataset to build machine learning models. So that's another good place to find datasets. You have hundreds of real-world datasets to choose from, because most of these datasets, and most of the datasets in Kaggle competitions, come from real companies that are looking to build machine learning models to solve real business problems. So with that context,
let's get started. Today we are going to work on this project called New York City taxi fare prediction. This was a Kaggle competition, as I mentioned, from a few years ago, and you can learn all about it on the competition page. What you're looking at right now is the notebook hosted on the Jovian platform. This is a Jupyter notebook hosted on my profile, and in this notebook, there are some explanations and there is some space to write code, and we are going to start writing the code here. Now, of course, this is a read-only view of the notebook. So to run the notebook, you
click Run and select Run on Colab. We are going to use Google Colab to run this notebook, because this is a fairly large dataset and we may need some of the additional resources that Google Colab provides. Now, when you go to this link, and I'm going to post this link in the chat right now, you will be able to click Run on Colab, and you may be asked to connect your Google Drive so that we can put this notebook into your Google Drive and you can open it on Colab. Once you're able to run the notebook, you should see this view. This is the Colab platform, colab.research.google.com. It is a cloud-based Jupyter notebook where you can write code, and any code that you execute will be executed on Google servers in the cloud on some fairly powerful machines. In fact, you can go to Runtime and change the runtime type, and from here you can even enable a GPU and high-RAM machines, which I encourage doing if you need either of these. All right. So whenever you run a notebook hosted
on Jovian on Colab, you will see this additional cell of code at the top. This is just some code that you should always run at the beginning. So whenever you take a Jovian notebook and run it on Colab, definitely make sure to run this first line of code, because it is going to connect this Colab notebook to your Jovian notebook. Any time you want to save a version of your Colab notebook to your Jovian profile, you will be able to do that, but you need to run this single line of code first. All right. With that out of the
way, let's get started. We'll train a machine learning model to predict the fare for a taxi ride in New York City, given information like the pickup date and time, pickup location, drop-off location, and number of passengers. This dataset is taken from Kaggle, and we'll see that it contains a large amount of data. Now, because this is a short workshop and we're doing all this live, we'll attempt to achieve a respectable score in the competition using just a small fraction of the data. Along the way, we will also look at some practical tips for machine learning, things that you can apply to your projects to get better results faster. And I should mention that most of the ideas and techniques covered in this notebook are derived from other public notebooks and blog posts, so this is not all entirely original work; nothing ever is. Now, to run this notebook, as I said, just click Run, select Run on Colab, and connect to your Google Drive. You can also find a completed version of this notebook at this link, and I'm going to drop this link in the chat if you need to refer to the code later. Okay. So here's the first tip I have for you
before we even start writing any code: create an outline for your notebook. Whenever you create a new Jupyter notebook, especially for machine learning, fill out a bunch of sections and then try to create an outline for each section before you even start coding. The benefit of this is that it lets you structure the project, it lets you organize your thought process into specific sections, and it lets you focus on individual sections one at a time without having to worry about the rest. Okay, so you can see here, if you click on Table of Contents, I have already created an outline in the interest of time, where there are sections and subsections. So here's a section, and here it has some subsections, and so on. And then inside the subsections, there's also some explanation that mentions what that subsection covers. So here's what the outline
of this project looks like. First, we're going to
download the dataset. Then we will explore and analyze the dataset. Then we will prepare the dataset for training machine learning models. Then we are going to first train some hard-coded and baseline models, before we get to the fancy tree-based and gradient boosting kinds of models. Then we'll make predictions and submit predictions from our baseline models to Kaggle, and we'll talk about why that's important. Then we will perform some feature engineering, and then we will train and evaluate many different kinds of machine learning models. Then we will tune hyperparameters for the best models. And finally, we will briefly touch on how you can train on a GPU with the entire dataset. So we will not be using the entire dataset in this tutorial, but you can repeat the tutorial with the entire dataset later. And finally, we are going to talk a little bit about how to document and publish the project online. So let's dig into it. And if you have questions at any point, please post them in the Q&A, and we will stop periodically
if possible, to take questions. All right. So, as I said, for each section,
it's always a good idea to write down the steps before we actually try to
write the code and you can, Jupiter is great for this because it has
marked down cells where you can write things and modify them as required. So here are the steps. First we will install
the required libraries. Then we will download the data from Kaggle. Then we'll look at the dataset files. We will then load the training set with pandas, and then load the test set with pandas. So let's get started. I'm going to install the Jovian library. Well, that's already installed, but I'm going to put it in here anyway. We're going to use a library called opendatasets for downloading the dataset. We're going to use pandas, NumPy, scikit-learn, and XGBoost, and I believe that should be it. So I'm just going to install all of these libraries, and I've added --quiet to suppress any output from the installation.
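Here's a rough sketch of what that install cell might look like (the exact package list below is my assumption based on what we use in this workshop):

```python
# Install the libraries used in this workshop; --quiet suppresses pip's output.
!pip install jovian opendatasets pandas numpy scikit-learn xgboost --quiet
```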
Now, whenever you're working on
a notebook, it's important to save your work from time to time. The way to save your work is to import the Jovian library by running import jovian and then running jovian.commit(). Now, when you run jovian.commit(), you will be asked to provide an API key, and I'm going to go here on my Jovian profile, copy the API key, and come back and paste it here. What this does is it takes a snapshot of your notebook at this current moment and then publishes it to your chosen profile. As you can see here, NYC Taxi Fare Prediction (blank): this notebook was created just now, and you can see this is version one. Now, every time you run jovian.commit() in your notebook, a new version of this notebook will get recorded.
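For reference, that save cell is just these two lines (the library prompts for the API key the first time):

```python
import jovian

# Take a snapshot of the current notebook and record it as a new version on your profile.
jovian.commit()
```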
And the benefit of having a notebook on Jovian is that you can share it with anybody. I can take this link and I
can post this link in the chat. Of course you can also make your notebooks
private or secret if you would like to. And you can add topics to your notebooks
so that other people can find them. Okay. So coming back: now let's download the dataset. We are going to use the opendatasets library, which can connect to Kaggle using your Kaggle credentials and download the dataset from this link for you. So here's how it works. I first import opendatasets as od, and now I can run od.download, and I need to give it a URL. So let me just put the URL here. There is a URL for the competition, and I just provide dataset_url. Now, when I run this, opendatasets is going to try and connect to Kaggle, but to do that, it needs my Kaggle credentials, and the way to provide your Kaggle credentials to opendatasets is to go to kaggle.com, click on your avatar, go to your account, scroll down to API, and click Create New API Token. When you click Create New API Token, it downloads a file called kaggle.json to your computer. Now you need to take this file, kaggle.json, come back to Colab, go to the Files tab, and upload this kaggle.json file. Unfortunately, you will have to do this every time you run the notebook, so I suggest downloading the kaggle.json file once and keeping it handy, like I have here on my desktop, so you can upload it whenever you need it. Now, this kaggle.json file downloaded from my Kaggle account has my username and a secret key. You should never put the secret key into a Jupyter notebook, otherwise somebody else will be able to use your Kaggle account. But within your own Jupyter notebook, when you run od.download, it is going to read the credentials from the kaggle.json file and download the dataset for you.
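Here's roughly what that download cell looks like (the competition URL and the downloaded folder name are assumptions; copy the exact URL from the Kaggle competition page):

```python
import opendatasets as od

# URL of the Kaggle competition page (assumed; use the one from your browser)
dataset_url = 'https://www.kaggle.com/c/new-york-city-taxi-fare-prediction'

# Reads credentials from kaggle.json (or prompts for them) and downloads the files
od.download(dataset_url)

# Folder that opendatasets creates for this competition (assumed name)
data_dir = 'new-york-city-taxi-fare-prediction'
```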
You can see here, this dataset is pretty large: about 1.56 gigabytes, and of course it's a zip file, so after expanding, it's going to become even larger. And it's going to download
this dataset to a folder called new-york-city-taxi-fare-prediction. So I'm just going to put that folder name into a variable so that we have it handy when we want to look at the files. All right. On the Files tab here, you can see we have new-york-city-taxi-fare-prediction. Here's the folder, and inside the folder there are some files, specifically 1, 2, 3, 4, 5 files. Okay. Now, there was a question: should you try and follow along right now? I would say right now you should
probably watch, and you will have a recording of the session and you should
try and follow along with a different dataset later, but it's totally up to you. Alright, so now the data has been
downloaded and now we have this data directory, which points us to
the directory where the data lives. Now let's look at the size, the number of lines, and the first few lines of each file. So first I'm going to use
the ls -lh command. This is a shell command, not Python. Every time you have an exclamation mark at the beginning, the line is going to be passed directly to the system terminal. So I'm just going to run ls -lh, and I need to access this folder, whose name is stored in this variable. You can pass the value of a Python variable using curly brackets: when we put something inside these braces, Jupyter is going to replace the entire expression with the value of the variable, which is new-york-city-taxi-fare-prediction. So ls -lh on the data directory shows us that
this is a total of 5.4 gigabytes of data, and almost all of it is the training set at 5.4 gigabytes. That's a pretty large training set. The test set is just 960 kilobytes, and then finally there is a sample submission file. As I mentioned, you can submit some predictions on the test set to Kaggle. And there are some instructions, which we can ignore. Okay. So those are the sizes of the files. Let's look at the number of lines
in some of the important files. The way to get the number of lines is using the wc -l shell command. Once again, I'm going to use the data directory, and under the data directory I want to look at train.csv. So there you go: new-york-city-taxi-fare-prediction/train.csv contains 55,423,856 rows. That's a lot of rows. Then let's look at the test set. The test set contains 9,914 rows, which is a lot smaller than the training set. Let's also look at the submission.csv file. The submission.csv file contains 9,915 rows, so just one additional row compared to the test set. This could just be an empty line, so I wouldn't worry too much about it. And that's it. So let's also look at the
first few lines of each file. I am going to use the head shell command this time. So here we have the first 10 lines of train.csv.
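As a rough sketch, those inspection cells look something like this (I'm assuming the human-readable ls -lh flags and the file names from the competition download):

```python
# File sizes; the ! prefix sends the line to the shell, and {} injects Python values
!ls -lh {data_dir}

# Number of lines in each file
!wc -l {data_dir}/train.csv
!wc -l {data_dir}/test.csv
!wc -l {data_dir}/sample_submission.csv

# First few lines of the training file
!head {data_dir}/train.csv
```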
Remember, it has 55 million lines, and these are just the first 10. And it seems like we have
information like a key. So this is the first key. This is a CSV file, so you will find that the first row contains the names of the columns and the following rows contain the data. So this is the key: it seems like every row has a unique identifier called the key. Then there is a fare amount for the ride; for example, here the fare amount is 4.5. Then you have the pickup datetime, which is the date and time of the pickup. Then you have the pickup longitude and the pickup latitude; these are the geocoordinates of the pickup. Then we have the drop-off longitude and drop-off latitude, so here are the coordinates of the drop-off, around minus 73 and 40. And finally you have the passenger count, which is the number of people who took the ride. Okay. So that's the training data; looks simple enough. Let's look at the test data. All right. So the test data looks similar: we have
the key, every row has a unique key, and then we have a pickup datetime. There we go. We have a pickup longitude, pickup latitude, drop-off longitude, drop-off latitude, and passenger count. Great. Now, one thing that's missing from the test data is the fare amount, and this is what is typically called the target column in this machine learning problem, because remember, the project is called taxi fare prediction. So we need to build a model using the training data and then use that model to make predictions of the fare amount for the test data. And that's why the test data does not have the fare amount. Now, once you make predictions for the
test data, you need to put those predictions into a submission file. And here is what a sample submission file looks like. There is a key, and you will notice that these keys correspond exactly, row by row, to the test dataset. And here is supposed to be the prediction of your model. Now, this sample submission file just contains 11.35, the same answer for every test row, but this is supposed to be the prediction generated by your model for the test set. You need to create such a file and then download it. So I'm going to download it right here onto my desktop. Then you can come to the competition page, click on Late Submission, and upload this file containing the sample submission, which is the key for each row in the test set and your prediction. And then you can make a submission
once this is uploaded, of course. And once you make a submission, your
submission is going to be scored. The score for this submission is 9.4. Now, what does the score mean? You can check the Overview tab and go to the Evaluation section to understand what the score means. This score is the root mean squared error. The root mean squared error is simply a way of measuring how far away your predictions are from the actual values. You are not given the actual fare amounts for the test data, but Kaggle has them. When you submit the submission.csv file, the predictions are compared to the actual values, which are hidden from you, and the differences are calculated. Those differences are squared and added together, then you take the average of the squared differences, and then you take the square root of that average squared difference. That's called the root mean squared error. On average, it tells you how far away your predictions are from the actual values. So for example, in our case, our recent submission had a root mean squared error of 9.4, which means our predictions are, on average, off by about $9.40.
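In symbols, just restating the description above, if the $y_i$ are the actual fares, the $\hat{y}_i$ are your predictions, and $n$ is the number of rows:

$$\text{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\hat{y}_i - y_i\right)^2}$$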
good or bad in some time, but we definitely want to do more than that. And now one thing you can do is
check your submission against the leaderboard to see where you land. It seems like people have gotten
to a pretty good point where data, they are able to predict
the taxi fare within $2.80. And we're at $9.74, which is pretty high. If you asked me because most taxi
rides costs $10 or 10 to $15, maybe. So B if you're off by nine, your
prediction is practically used. Okay, but that makes sense, because right
now we've just put a fixed prediction. We've basically submitted the sample file. So that's all CalWORKs. And one tip I have for you here is that you should write
down these observations. So anytime you have an insight
about a dataset, you should write it down, document it so
that it's there for you later. If you need to come back to it. And again, Jupiter is
a great way to do that. Okay? So that's how we download it. So that's what we did. We downloaded the data from
Kaggle using opendatasets. We looked at the dataset files and we noted down the observations: the training data is 5.5 GB and has 55 million rows. The test set is much smaller, less than 10,000 rows. The training set has eight columns: the key, the fare amount, the pickup datetime, the pickup latitude and longitude, the drop-off latitude and longitude, and the passenger count. The test set has all columns except the target column, the fare amount, and the submission file should contain the key and the fare amount for each test sample. Okay. Now I'm going to just save my notebook at this point. I'm going to save regularly so that I don't lose any work, and you should do that too. Next up, we are going to load
the training set, right? So here you can check the table of contents if you're ever lost. We are now going to load the training set and then load the test set. So here's one tip: when you're working with large datasets, always start with a small sample to experiment and iterate faster. Loading the entire dataset into pandas is going to be fairly slow, and not just that, any operation that you do afterwards is going to be slow. So just to set up my notebook properly, I'm going to first work with a sample, and then maybe come back and work with the entire dataset. We're going to work with a one percent sample; that is, we are going to ignore 99% of the training data, but that still gives us about 550,000 rows. 1% of 55 million is 550,000 rows, and I think that should still allow us to create a pretty good model to make predictions for the 10,000 rows of data in the test set. So we're going to use a one percent
sample, and we're also going to ignore the key column, because we don't really need the unique identifier that is present in the training set, and just loading it into memory can slow things down. So we're going to ignore that. We are going to parse pickup_datetime while loading the data, so that pandas knows pickup_datetime is a datetime column; pandas has a special way of dealing with it, and we can just inform pandas, which makes things faster. And we're going to specify data types for the particular columns, so that pandas doesn't have to try and figure them out after looking at all the rows. That is going to, again, speed things up significantly. So with that, let's set up,
let's set up this data loading. So first, let's import pandas as pd. And we are going to use the pd.read_csv function. To the read_csv function, we need to provide the file name, so here I'm going to provide data_dir + '/train.csv'. Then here are some of the other
parameters that we can provide. We want to pick a certain set of columns, so I'm going to provide a value for usecols; that's one. Then we are also going to provide data types, so I'm going to use dtype. And finally, we are also going to want to pick a sample. There are two ways to pick a sample. We can either just pick the first 1%, roughly the first 500,000 rows, for which we can use nrows: if you just provide nrows as 500,000, that's going to pick the first 500,000 rows for you. Or there's another way to do it, which is using something called skiprows. Here we can provide a function which goes through each row, and based on the row index, it tells pandas whether or not to keep the row. So I'll show you both. Let's start by putting in usecols. Let me create a variable
called cols, or selected_cols, and I'm going to put in all the columns here except the key. So I'm just going to take this, put it into a string, and split it at the commas. This has a nice effect of giving us a list of columns. There you go. So we are going to use selected_cols. Then I am going to set the data types. So let's set up dtypes. Okay. I'm going to grab all of these and use float32 for the data type. Of course, not all of these are float32; we want to use uint8 for passenger count. And let's just indent that. So those are the data types; that's the value of dtype. Now, we could provide nrows equal to 500,000, and you can also write numbers with underscores, like 500_000, to make them easier to read. So we could do this and we
would get the first 500,000 rows, but I want a random 1%. For that, I am going to use the skiprows argument, and I'm going to pass in a function called skip_row. It gets the row index, or the row number, and here's what we're going to do. Of course, we want to keep the first row (the header), so if the row index is zero, we do not want to skip the row, so we return False. Otherwise, here's a quick trick we can apply. Let's say I want my sample fraction to be 1%, so 1% is just 0.01. Here's what I'm going to do: I'm going to first import the random module. Let me do that here. The random module can be used to generate random numbers between zero and one. Okay. So that's a random number, and these numbers are between zero and one. Now, if I write random.random() less than sample_frac, then because random numbers are picked uniformly, there is exactly a 1% chance that random.random() is going to be less than the sample fraction. So we should keep the row only if this expression returns False, and we should skip the row if this expression returns True. So if random.random() is greater than 0.01, which happens with 99% probability, we should skip the row; otherwise we should keep it. That's what our skip_row function does. All it does is, for 1% of the rows, return False, which means keep the row, and for 99% of the rows, return True, which means skip the row. Okay. Now, one last thing I'm going to do here is set random.seed to a particular value. I'm going to initialize the random number generator here in Python with the value 42, so that I get the same set of rows every time I run this notebook. Okay. So I encourage you to learn more about
seeds; this is going to take a while to run. Okay, sorry, this is not parsed as a datetime by default, of course. We also need to provide a list of datetime columns separately using parse_dates. So let me do that, and then we'll talk about this. Yep. So I encourage you to learn more about random number seeds. Let's see, where is parse_dates... there we go. Yup. And always fix the seeds for
your random number generator. That's the third tip, so that you get the same results every time you run the notebook; otherwise you're going to pick a different 1% each time you run the notebook, and you're not going to be able to iterate as effectively. So there was a question.
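Putting all of that together, the loading cell looks roughly like this (a sketch; the column names follow the Kaggle files, and the exact variable names are my assumptions):

```python
import random
import pandas as pd

random.seed(42)  # fix the seed so the same 1% sample is picked on every run
sample_frac = 0.01

selected_cols = ('fare_amount,pickup_datetime,pickup_longitude,pickup_latitude,'
                 'dropoff_longitude,dropoff_latitude,passenger_count').split(',')

dtypes = {
    'fare_amount': 'float32',
    'pickup_longitude': 'float32',
    'pickup_latitude': 'float32',
    'dropoff_longitude': 'float32',
    'dropoff_latitude': 'float32',
    'passenger_count': 'uint8',
}

def skip_row(row_idx):
    # Always keep the header row; keep roughly 1% of the remaining rows at random.
    if row_idx == 0:
        return False
    return random.random() > sample_frac

df = pd.read_csv(
    data_dir + '/train.csv',
    usecols=selected_cols,
    dtype=dtypes,
    parse_dates=['pickup_datetime'],
    skiprows=skip_row,
)
```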
of using shell commands instead of checking the dataset for checking
the dataset instead of Python? Yeah. So simple reason here is because
these files are so large loading them into Python can itself
slow down the process a lot. So normally I would recommend using
the OAS module from Python, but in this case I have recommended shell
commands because these notebooks are so large and shell commands are really
good at working with large files. Okay, there's another question. How do we know it's a regression problem? So here we're trying to predict for
fare amount, and the fare amount is a continuous number; the fare amount can be $2.50, $3.20, $5.70. That is what is called a regression problem. A classification problem is one where you're trying to classify every row into a particular category, for example, trying to classify an insurance application as low risk, medium risk, or high risk; that's called a classification problem. Okay. This is taking a while. In fact, it has been running for
a minute and a half, and this is happening while we are working with just one percent of the data. So you can imagine that when you're working with a hundred percent of the data, it's going to take a lot longer, and not just this step, but every single step after this. It took about one minute 36 seconds to complete. Okay. Now, here's an exercise for you: try loading 3%, 10%, 30% and a hundred
percent of the data and see how that goes. All right. So let's load up the test set as well. I'm just going to load the test set with pd.read_csv(data_dir + '/test.csv'), and I'm just going to provide the dtype here, and that's it. I don't think we really need to provide anything else, because the test set is pretty small.
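A quick sketch of that cell (I'm reusing the dtypes from above, minus the fare amount column, which the test set doesn't have, and also parsing the datetime here):

```python
test_dtypes = {k: v for k, v in dtypes.items() if k != 'fare_amount'}
test_df = pd.read_csv(data_dir + '/test.csv',
                      dtype=test_dtypes,
                      parse_dates=['pickup_datetime'])
```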
training data set as well. Maybe let's print it out here. Okay. It's just called DF at the moment. Yep. So there you go. We have fair amount pickup, daytime,
like long and all the expected values. Now the test TF has, so we're going to keep the key
for the test data frame, because we're going to use this one making
submissions, but we have the pickup date, time longitude, latitude drop
off and passenger count looks great. And I can just commit again. Alright. Okay. So we're done with the first step
downloading the dataset that took a while, but we are now well set and
let's explore the data set a little bit. Okay. We're just going to do some
quick and dirty exploration. We're not really going to look at a lot
of graphs and I'll talk about why, but the quickest way to get some information
about a data frame is to go to df.info. And this tells us that
these are the seven rows. And then these are the
number of entries here. This is the total space. It takes on memory. This is an important thing to watch as you
go with a hundred percent of the dataset. So you can imagine that it's
going to take a hundred times more or 1.5 GB of memory or Ram. That's why we are using
co-lab and what else? Yeah, these are the data types
seems like there are no, there are no null values or missing values. So that's great. Now, another thing you can do is DF
dot describe, and that's going to give you some statistics for each
column for each new medical column. So those are fair amount. The minimum value is minus $52
and the maximum value is $499. All right, then the mean, or
the average value is $11 and the 50% dollar value is $8.50. So we already know that 50% of REITs
costs less than, uh, less than $8. And in fact, 75% of rights
cost less than $12.50. Okay. Now that gives us a sense of how
good our model needs to be if you're trying to predict, right. Uh, the right, uh, if, if you're trying
to predict the taxi fare and seventy-five percent of taxi fares are under $12. So. I want my prediction to be in
the plus or minus $3 range. Otherwise I'm off by a lot and
that's what we'll try and infer. Okay. Um, you can also look at pickup
back at your longitude drop off and then passenger counts. Now there seem to be some issues
in this dataset as is the case with all real-world datasets. It seems like the minimum pickup
longer dude is minus 1, 1 8 3, which is just not valid at all. It doesn't make sense. There are no such long attitudes,
neither are there such latitude. So we may have to do some cleaning. Uh, this would just be wrong data. And there also seems to be a max
passenger count of 2, 0 8, which again seems quite unlikely to me. You can see 75% of the
values are under two. So again, this is something
that we may have to fix later. Take a look at that. Now one thing that is missing
here is the date time. So let me just grab the pickup
date time and just look at the minimum and maximum values here. So you can see here that our date
start from the 1st of January, 2009 and end on the 30th of June, 2015. So it's about six years worth of data. And once again, all these
observations are noted here. Five 50 K rows as expected,
no missing data, fair amount ranges, passenger count ranges. There seem to be something. And we may need to deal with
outliers and data entry errors. Let's look at the test data here. So nothing surprising here. 9, 9 1, 4 columns, uh, rows of
data across these seven columns. No fair amount. And here are the ranges of values
and these seem a lot more reasonable. The pickup seem to be
between minus 75 and 72. So minus 34.2 is the, is the lowest
and minus 72.98 is the highest. So that's good. Then passenger count also seems
to be between one and six. Now here's one thing we can do if
our model is going to be evaluated on the test set and which is supposed
to represent real world data, then we can limit the inputs in our
training set to these ranges, right? Anything that is outside the
range of the tests that can be removed from the training set. And because we have so much data, 55
million rows or one person of that, which is still a large amount of
data, we can later just drop the rows, which fall outside the test range. Okay. So keep that in mind. And finally, let's check this too. Pickup date, time and maximum and minimum. And you see here that these tests,
dataset values also range from the 1st of January, 2009 to the 30th of June,
2015, which is interesting because this is the same range as the training site. Now that's an important point
here, which we'll use while creating the validation set. Alright, so let's come at this. That was quick enough and that we already
have a lot of insight, but now what you should do at this point, or maybe later
when you've trained a few models is to create some graphs like histograms
line charts, bar charts, scatter plots box plots, geo maps, you have location
data here, or other kinds of maps to study the distribution of values in each
column and study the relationship of each input column to the target score. Be useful thing to do, not just right now,
but also once you've created new features, when we do some feature engineering. And another thing that you should
try and do is something like this. You should try and ask and answer
some questions about the data site. What was the busiest day of the week? What is the busiest time of the day
in which month are the first highest in which pickup locations have the
highest fare, which drop locations have the highest, where, what is the
average right distance and keep going. The more questions you can ask about
your dataset, the deeper understanding you will develop off the data, and
that will give you ideas for feature engineering, and that will make your
machine learning models a lot better. So having an understanding of the
data is very important to build good machine learning models. And if you're looking to learn exploratory
data analysis and visualization, you can check out a couple of
tutorials or a couple of resources. We have a video on how to
build an exploratory data analysis project from scratch. And we also have a full six week course
on data analysis with Python zero two pandas.com that you can check up. Now, one tip I would like to share here
is that you should take an iterative approach to building machine learning
models, which is first do some exploratory data analysis a little bit like we've
done without even plotting any charts. Then do some feature engineering, try
and create some interesting features, then train a model and then repeat to
improve your model can instead of trying to do all your EDA for maybe a week,
and then doing, doing a lot of feature engineering for a month, and then trying
to train your model and discovering that most of what you did was useless. Use an iterative approach, try and train
a model every day or every other day. Okay. So I'm going to skip ahead right
now and maybe I'll do some EDA after we're done with this tutorial. All right. So that was step two. We've made good progress. Now we've downloaded the data. We've looked at the data, let's
prepare the dataset for training. So the first thing we'll do is
split training and validation sets. Then we will deal with the missing values. There are no missing values
here, but in case there were we, this is how we deal with them. And then we extract out some inputs
and outputs for training as well. Okay. So we will set aside 20% of the
training data as the validation set. So we have 550,000 rows out of those. 20% will be set aside as a validation set. And this validation set will be
used to evaluate the models we train on the training data. So the models are trained on the
training data, which is the 80% and then the evaluation is done. So we calculate the root mean squared
error on the validation set, which is the 20% for which we know the
targets, unlike the test set for which we don't know the targets. And what that will do is the
validation set will allow us to. Estimate how the model is going
to perform on the test set and consequently in the real world. Okay. So here's the next tip. Your validation set should be
as similar to the test set or real world data as possible. And the way you know, that is when
you find the root mean squared error on the validation set. And you can do that because you can get
predictions from your model and you have the actual targets for the validation
set, and you can compare those and calculate the root mean squared error. So the way you know, that the validation
set is close enough to the test set is when the evaluation metric of the model on
the validation and test it is very close. Okay. And if the root mean squared error on
the validation set is like $2, but when you submit it to calculate the, the
root mean squared error is $9, then your validation set is completely useless. And you're basically shooting in the
dark because you're trying to train different models to do better on the
validation set, but the validation set has no relationship to the test set score. So make sure that your validation set
and test sets have similar or very close scores and an increase in the
score on the validation set, reflect as an increase on the test set. Otherwise you may need to reconsider
how your validation set is created. Now, one thing here is that we can,
because the test set and trainings that have the same date ranges, right? The test set lies between
Jan 2009 to June, 2015. And the training set also comes
from Jan 2009 to June, 2015. We can pick a random 20% fraction of
the training set as a validation set. I suppose the test said was in the
future, suppose the training said was data from 2009 to 2014, and
the test set was data for 2015. Then to make the validation set similar
to the test set, we should have picked maybe the data for 2014 as a validation
set, and then the data data for 2013 and before as the training set. Right? So keep those things in mind. It's very important to create
validation sets, carefully support, creating a validation set. I'm going to import from eschalon dot
model selection, import train, test split. And this is something
that you can look up. You don't have to remember this,
and I'm just going to do train DF while the F equals train test split. And I'm going to split
the original data frame. And I'm going to set the test. Um, let's see here, the test size
or in this case, which is going to be the validation size 2.2. Okay. And now I can check the
length of the train DF and the length of the well validation. To make sure that we have the right sizes. So we have 4,440 1000 rows in a
training set and a randomly chosen 11,000 rows in the validation set. I'm also going to set random
state equals 42, just so that I get the same validation set. Every time I run the notebook,
this is important because your scores may change slightly. If each time you're creating
a different validation set. And also if you're combining
models across validation sets that leads to data leakage, et cetera. So to fix the validation set, the
random set that is picked, I'm going to set the random state to 42. Okay. Now that's one piece. Now, the other thing that we need to
do is to fill or remove missing values. Now we've seen that there are no missing
values in the training data or the test data, but it's possible because we've
only looked at one person on the data. It's possible that there may
be missing values elsewhere. So here's one simple thing you
can do three in DF equals train, DF dot drop any, and where the
F equals two IDF not drop any. Okay. Now why, what does this do? This is going to drop all the empty
rows from the training or all the rows where any of the columns has
an empty value or missing value. From the training and validation
sites, you shouldn't always do this, but if, because we have so much data
and at least so far, I've not really seen a large number of missing values. I'm estimating that the number of missing
values is going to be less than one or 2%. So it should be okay to drop
them like number of missing values in the entire dataset. So it should be okay to drop them. Okay. So I'm not going to run this
right now, but you know what? This does next. We, before we clean our model, we need
to separate out the inputs and the outputs because the inputs and the
outputs have to be passed separately into machine learning models. So I'm going to create some something
called input calls here, and maybe let's just first look at trendy of dark
columns so that we can copy paste a bit. Now the input columns are these, but
actually we can't really pass a daytime column by itself into a machine learning
model because it's a, it's a timestamp. It's not a number. Um, so we'd have to convert the daytime
daytime column into sum or split the daytime column into multiple columns. So I'm just going to use these
for now, the latitudes and longitudes and the passenger count. And for the target column, I am
just going to use a fair amount. And then, so now we have the
important target columns. Now we can create train inputs. So from the training data frame,
we just pick the input columns. So this is all you just
out, just a certain set of columns from the training set. And then we have the train. Well, let's call that train inputs. And we have the train targets,
which is clean DF target call. We can view the train inputs here and
we can view the train targets here. Okay. So you can see now we no longer have
the column fair amount here, but we still have all the rules and here we
no longer, we just have the single fair amount column in front of us. Okay. Let's do the same for the validation set. So while inputs is YDF input calls, where
targets is where the F get called, and then let's look at value inputs and okay. So 110,000 rows, that's already a lot
larger than the test set by the way. So should be good. Yep. And here are the validation targets. Finally test DF. Now the test data frame, remember
it doesn't really have any target columns, but we still want to
pull out just the input columns that we can use for training. So let's just do test PF. Input calls test inputs. Okay. And there are no targets. There's no fair amount
in the test data frame. That is something that we have to predict. So there it is. Okay. Not bad. We're making good progress. In under an hour, we have downloaded the
dataset, explored it a little bit, at least prepare the data set for training. And now we are going to first train
some hard coded and baseline models. So here's the next step. Always, always create a simple hard-coded
model, which is basically like a single value or something, or a very simple
rule or some sort of a baseline model. Something that you can train very quickly
to establish the minimum score that any proper machine learning model should beat. I can't tell you how many times I've seen
people have trained models for hours or days, and then the model ends up producing
results that are worse than what you could have done with a simple average. And that could be for a couple of reasons. One, you've actually not
trained the model properly. Uh, like you, you print a really, you
created a really bad model or second, you made a mistake somewhere in the feature
engineering or somewhere in preparing the data or somewhere and making predictions. Right. So it serves as a. A good way to test whether
what you're doing is correct. And whether you're, and it
gives you a baseline to beat. So let's create a simple model. I'm going to create a more a class and mean regressor, we'll
have two functions fit. I'm going to make it very
similar to a scikit-learn model. So it's going to take some inputs. It's going to take some targets. And so fit is used to
train are a simple model. And then we're going to define
a function called predict. It takes a bunch of inputs
and then create some targets. So here's what I'm going to do here. I'm going to completely ignore the inputs
and I'm simply going to set self dot mean. I'm going to store a value self
taught me where I'm just going to do targets dot mean here. Okay. And that's just going to calculate
the average value of the targets. And here I'm just going to return. So let's say we have, let's take
the lens of inputs or another way to do this is inputs not shape zero. And I'm just going to do something
like this and paid out full. Okay. Let me import number five. Yep. And I'm going to do this and not
full where let's say you can give it a shape and then give it a value. So let's say I have 10 inputs and I
always want to return the value three. So I do NPDR full 10 cometry and that's
going to list a return 10 threes. So I'm always going to return. I'm always going to return
input, start shape zero. So again, if you have a , that's
a train inputs, right? Sustain inputs. This is a non-player of
binders, a data frame. If I do attain inputs, not shape,
that tells me the number of rows and columns dot shape, zero is
going to tell me the number of rows. So I get the number of roles. So I'm just going to get the
number of rows here from the inputs that were past year. And I am going to return self.me. Okay. So yeah, at some object oriented
programming, some fancy non stuff, but ultimately what it's doing is this let's
first create a mean regressor model. Let's call it mean. Is mean regressor. So now we've created this mean
model and let's call me model.fit. So we're now going to train, train this
model, uh, this so-called, uh, model that always predicts the average and let's give
it the train inputs and the train targets. Okay. Now once we pass or
sorry, let's call it fit. Okay. So now once we call main requester.fit,
it's going to completely input the ignoring inputs and it's going to take
the targets and simply calculate the average of the targets, a single value. And it's going to store
that in the dot mean attribute. So the average is 11.35. Okay. That's the average spread for the taxis. And then when we get, want to get
some predictions, so let's say we want to get some predictions
for the train training set. We can say mean model,
dark predict rain inputs, and that gives us a prediction. So it's simply predicted the value
11.35 for every row in the trading set. Okay. Similarly, we can get some
predictions for the validation set. So let's say main model
dot predict while inputs. And once again, it's going to
simply predict the value 11.35 for every row in the validation set. Now we may want to compare these
predictions with the targets. How often is this model by of course,
it's going to be way off because we are just predicting the average. So here are the train predictions
and here are the training targets. Six 3.7. You can ignore this. This is simply the row numbers from
the data frame, but yeah, six, 3.7. And we we're always predicting
11.35 now to tell how badly we are doing, we are going to need to
compare these two and come up with some sort of an evaluation metric. So that's where we are going to use
a root mean squared error evaluation metric, because that is a metric
that is used on the leaderboard. So I'm going to import
from eschalon dot metrics. Okay. And I'm going to define a function
called RMSE just to make my life easier. Each takes some inputs, some targets,
and it returns mean squared error, uh, with, between the, uh, not
input, sorry, it takes some targets and it takes some predictions. And it doesn't mean squared error between
the targets and the predictions and to return the mean squared to sorry,
to get the root mean squared error. We need to set square to
false in mean squared error. Okay. All that said and done. I'm now going to be able to get
the root mean squared error. So we have some training targets. These are the fair amounts
for the training set rows. We have some green breads, some
predictions let's call RMSE on this and let's call it train RMSE. And let's print that out. So this is the root mean squared error
for the training set, which means that on average, the predictions of our
model, which is always 11.3, five are off or are different from the target. The actual value that the model should
be predicting by nine, which is pretty bad because the values we're trying to
predict the average is about 1170 5%. Mark is about seven oh, sorry is about 12. So if you're, if they're trying to
predict values in the range of, let's say 10 to 20 and you're off by nine. So that's a pretty bad model. Right. And that's expected because it's just
a dumb model, but here's the thing. Any model that we trained. Should hopefully be better than this
kind of a model that should be, hopefully be, have a lower training, uh, lower,
uh RMSE let's get the validation RMSE as well while targets well breads. Yep. So the model is, are, are hard-coded
models off by 9.899 on average, which is pretty bad considering
that the average fare is 11.35. Okay, great. So that was our hard coded dump model. Alright, next let's train a very
quick linear regression model to see whether machine learning
is even useful at this point. So I'm going to from eschalon dot
linear model, I'm going to import the new regression and I'm going
to create a linear model here. That's it? I think that's pretty much it. You can set a random state, I
believe to avoid some randomization. Oh no, there is no random state. So this is it. Right? Linear regression model is just
this in scikit-learn and here's how you fit a model by the way. We are expecting here that you are
already familiar with machine learning. And if you're not, then I highly
recommend checking out zero two gbms.com. This is a practical and coding focused
introduction to practical, to machine learning with Python, where we cover
all of these topics, all of the models that we are looking at today. Okay. So we do linear model.fit,
green inputs and green targets. Great. And then once it is fit, we can now
make predictions so we can get trained Preds equals linear model, dark predict,
and that's going to take the training inputs, and it's going to come up
with some predictions for us, and you can look at the predictions here and
compare them with the targets here. You can see that the predictions
are all still close to 11, but there are different, at least they're
not the same prediction each time, but there's still a way off, right? There's still way off. Um, let's maybe also get the Brit let's
look at the RMSE on the train breads and green breads. So the ultimacy is 9.788. So that's not much better. 9.789 was our average model
and our linear regression just. Hardly better and still
completely useless. Right? Let's get wild predictions, ultimacy on the well
targets and well breads. So the root mean squared error here is
9.8, nine or 9.898, which is just 0.0, zero one less than a cent, less than
0.10 cent better than our average model. Okay. And now at this point, you might want
to think about why that is the case. And in this case, I would say that
this is mainly because the training data, which is just geocoordinates at
this point, which is a latitude and longitude, et cetera, is not in a format. That's very useful for the model, right? How does a model, how is the model
going to figure out that latitude and longitude are connected and there's a
pickup latitude and a pickup longitude. And then there is a sort of a,
some distance between them or all those relationships are very hard
for models to learn by themselves. And that is where feature engineering
is going to come into picture. And we are also not using one of
the most important columns, which is the pickup date and time before. Spares are very seasonal in terms
of months in terms of base in terms of hour of day in terms
of day of the week, et cetera. So that's why our, our data in the
current format is not very useful form of machine learning perspective. And we are able to establish that
using the hardcore in baseline model. However, we now have a baseline that all
our other models should ideally beat. Now, before we train any further models,
we are going to make some predictions and submit those predictions to Kaggle. Now here's the next step. Whenever you're working on Kaggle
competitions submit early and submit often, ideally you want to submit, make
your first submission on day one and make a new submission every day, because
the best way to improve your models is to try and breed your previous score. If you're not making a submission,
then you're not going to figure out if you're heading in the right direction. Or if you have a good validation
set, or if there's anything else you should be doing. But on the other hand, if you're
making submissions every day, you will have to try and beat
your previous submission, right? And that will force you to
move in the right direction. So how do you make predictions
and submit to Kaggle? First, you have to make some
predictions for the test set. So we have the test inputs
here, right in front of the. So, all we need to do is we need
to pass the test inputs into the, let's say, let's take a linear model. So into the linear model, we say
dot predict, it's trained already using the training set, and we
get some predictions from this. And of course we don't have any targets. So the way we have to evaluate
these predictions is by creating a submission file so way we can create
a submission file is this way we first read in the submission data frame,
which is the sample submission file. Right? And now, so this is a
sample submission file. Now, all we can do is we can
simply take the test predictions. Here are your test predictions, and we
can simply replace this column of data. With this, because remember the rules
in the submission file correspondence one to one on one, uh one-on-one to
the rows in the test file, right? So the first row of the
submission file point to the first of the test file and so on. So I'm just going to
do something like this: sub_df['fare_amount'] = test_preds. And now if you look at sub_df, it has the predictions for the test set; you can see these are all different values. Okay? Now you can save that to CSV. You can call sub_df.to_csv, give it a file name like 'linear_model_submission.csv', and one thing you need to do when saving a submission file, especially for Kaggle, is to specify index=None. Otherwise pandas is also going to add the 0, 1, 2, 3 row index as an additional column in your file, which you don't want. Okay. Now you will have this file, linear_model_submission.csv, and you can download this file. So let's save that, and you can submit this file.
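Putting those steps together, here is a minimal sketch (the sample submission file name is whatever Kaggle provides for this competition, and test_preds is assumed to come from the linear model's .predict call above; index=False does the same job as the index=None mentioned above):

```python
import pandas as pd

# Read the sample submission provided by Kaggle (it has the key and fare_amount columns)
sub_df = pd.read_csv('sample_submission.csv')

# Replace the fare_amount column with our predictions; the rows line up
# one-to-one with the rows of the test set.
sub_df['fare_amount'] = test_preds

# index=False stops pandas from writing the 0, 1, 2, ... row index as an extra column
sub_df.to_csv('linear_model_submission.csv', index=False)
```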
Our previous submission was giving us 9.409; that was the RMSE. Let's see what this one gives us. Let's click on late submission, go here, and let's just call this 'simple linear model'. Oops, let's give that a second. Yep, it's uploaded, and the submission scored about 9.4. So not very different. One thing we can now verify is that our test set metric is close to the validation set metric. Remember, the validation set RMSE was 9.89, and the test set metric is 9.4. That's not too different; they're in the same range. It's not as if the validation set RMSE was 2 and the test set RMSE was 10, so they're close enough. Of course, the validation set is a lot larger (it's about 110,000 rows, whereas the test set is just 10,000 rows), and it's always harder to make predictions on larger unseen data than on smaller unseen data. So that could have an effect, but at least they're close enough for us to work with. Okay. So that's how you make
predictions for the test set. Now, the next tip here is to create reusable functions for common tasks. Remember, we said that you should be making submissions every day, and if you need to make submissions every day, you should not be copy-pasting all this code around each time, because that just takes up a lot of mental energy and you make mistakes when you need to change values and so on. So it's always good to create functions like this. We create a function predict_and_submit, which takes a model and a file name, calls model.predict on the test inputs, and that gives you the test predictions. Then it calls pd.read_csv (you could even provide the test inputs as an argument), and that reads the sample submission into the submission data frame. You then put in the fare amount, save it to the given file name, and return the data frame.
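A minimal sketch of such a helper, assuming pandas is imported as pd and the same sample submission file as above:

```python
def predict_and_submit(model, test_inputs, fname):
    # Generate predictions for the test set with an already-trained model
    test_preds = model.predict(test_inputs)
    # Read the sample submission and overwrite the fare_amount column
    sub_df = pd.read_csv('sample_submission.csv')
    sub_df['fare_amount'] = test_preds
    # Save without the row index so Kaggle accepts the file
    sub_df.to_csv(fname, index=False)
    return sub_df
```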
So we can do the same thing by calling predict_and_submit: let's give it the linear model, give it the test inputs, and give it the file name 'linear_sub_2.csv'. You can see that it does exactly the same thing, but now it's just one line any time you want to generate predictions and create a submission file. So here's linear_sub_2.csv; you can see what this file contains. It shows you a nice preview here, but it's ultimately just a CSV file containing the key and the fare amount. Okay, great. So that brings us to "make predictions and submit to Kaggle": we're done with that. That was simple enough, and we now have this function that we can use any time to make predictions on the test set. Next, one thing you will want to do
here is to track your ideas and experiments systematically, to avoid becoming overwhelmed with dozens and dozens of models, because you're going to be working on a machine learning project for at least a couple of weeks, and probably a couple of months or longer. So you need to keep track of all the ideas you're trying. Here's a tracking sheet that we've set up for you; you can create a copy of it by going to File, Make a copy. Here you can put in ideas, like the kind of models you want to try or the sample sizes (maybe "try sample size 10%"), to keep a list of all the ideas you have whenever you have them. You don't have to try them right away (you probably can't), and you keep a list of what you expect the outcome to be whenever you have an idea, and then, once you try out an idea, you note down what you learned from it. So this is just idea tracking: you list all your ideas, the potential outcome you expect, and what you learned. And then here you have experiments: each time you train a model, you put in a title for the model, a date, some notes, whatever hyperparameters you want to record, the type of model, the training loss, the validation loss, the test score (this could be from the Kaggle leaderboard), and a link to the notebook. Every time you save a notebook with jovian.commit (let's say I go here and run jovian.commit), it gives me a link. Yep. So you can note down here that this is version 10 of the notebook, and you can refer back to it. Over time you have all these versions, dozens of versions, and for each version you know exactly what the parameters for that version were. Once you have maybe 30 or 40 models, you can look at the sheet and get a very clear idea of which models are working and which are not. Okay. Let's see if we have any questions
at this point before we move ahead. "Can you please explain..." Okay, I think we've already answered why we are using shell commands. "Can we directly fit the model to the remaining 99% of the data after training on the 1% sample?" I'm not sure if I understand the question, but what we have done right now is this: there are 55 million rows available for training, and I have simply taken 1% of that, or 500,000 rows, for training the model so that we can train things fast. What you would want to do later is, instead of using just 1% of the data, use maybe 10% or 20% of the data, or 100% of the data at the very end: train a model on the entire 100% of the data, and then make predictions with that trained model on the test set. That should definitely be better than training the model on just 1% of the data. So I hope that answers that. Next: "For regression problems, we created a model which gives the mean as an output, and then we tried linear regression. What should be our approach for classification problems?" Good question. For classification problems, you can maybe just predict the most common class, or you can predict a random class, and go with that. Yeah, most common class or random class is what I would suggest for classification problems. Okay. Let's see if there are any other
interesting questions. "Any specific reason we are using float32 and uint8?" Yep. I actually looked up how many decimal places float32 supports, and it supports roughly seven to eight digits of precision, and that is good enough for longitudes and latitudes. Now, if you just specify float, it might pick float64, which takes twice the amount of memory, which can be a problem for large datasets. Similarly with uint8: I looked at the number of passengers, and it seems to be in the range of maybe one to 200. If you just specify int, it's going to use int64, which takes 64 bits, but we can actually get away with one-eighth of that, just eight bits, because the numbers we're dealing with are fairly small. So that's why uint8. These are just techniques to reduce the memory footprint of the dataset. Okay, perfect.
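As a rough illustration (the file name and the exact dtype mapping are placeholders, not necessarily what the notebook uses), specifying these dtypes while reading the CSV might look like this:

```python
import pandas as pd

dtypes = {
    'fare_amount': 'float32',
    'pickup_longitude': 'float32',
    'pickup_latitude': 'float32',
    'dropoff_longitude': 'float32',
    'dropoff_latitude': 'float32',
    'passenger_count': 'uint8',   # small integers fit comfortably in 8 bits
}

# float32 keeps roughly 7 significant digits, which is plenty for coordinates,
# and uses half the memory of float64; uint8 is one-eighth the size of int64.
train_df = pd.read_csv('train.csv', dtype=dtypes,
                       parse_dates=['pickup_datetime'])
```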
There was also a question about the prerequisites for this workshop: zerotopandas.com is the prerequisite, but you can also watch this right now, start working on a machine learning project, and then learn machine learning along the way. Okay, let's move on to
feature engineering. Now we're halfway through. So hopefully we'll be able
to train a few models. Feature engineering means taking
the columns of data that you have and performing operations on them to create new columns, which might help you train better models, because machine learning models are fairly dumb. They assume a certain structure, a certain relationship between inputs and outputs: linear regression, for example, assumes that the output is a weighted sum of the inputs. That may not hold true in the current form of the data, which is latitudes and longitudes. But suppose we were able to somehow calculate the distance between the pickup and drop-off points; then there would definitely be some sort of linear relationship between the distance to be covered and the fare, right? So by creating good features, you are going to train much better models, because you have now applied human insight to provide features that are conducive to solving the problem within the structure that the model assumes. The tip here is to take an iterative approach to feature engineering. Don't go overboard; don't spend weeks creating features. Add one or two features, train a new model, evaluate it, keep the features if they help, otherwise drop them, and then repeat with new features. So here are some features that
we are going to create, and I am not taking an iterative approach here, in the interest of time; I'm just going to create a bunch of features right away. First is to extract parts of the date. We have totally ignored the date so far because we didn't know how to put it into a linear regression model, but we can extract things like year, month, day, weekday and hour. I think these are all useful: over the years, I would assume the fare increases; across months, there must be some sort of seasonal trend; across days of the month as well, because maybe there are deliveries or things people have to do at the start or end of the month, like going to the bank; across weekdays, of course, there should be a difference between weekdays and weekends; and there should be differences across hours of the day as well. So we are going to extract all these parts out of the date; that's one thing we'll do. We will also deal with outliers and invalid data; it's a form of feature engineering, in that we're cleaning up the data a little bit. We will add the distance between the pickup and drop locations, so we will see how to compute distance using latitudes and longitudes. And we will also add the distance from some popular landmarks, because a lot of people take taxis to get to places where they can't normally drive or park: crowded places, or things like airports. There are also tolls involved, and tolls are included in the fare, so it might be useful to track that as well. Okay. We're going to apply all of these together, but you should observe the effect of adding each feature individually. Okay. So let's extract some parts of
the date, and this is really easy. Once again, I'm just going to follow my previous advice and create a function that allows me to do that. So here we have a function add_dateparts that takes a data frame and a column name; it takes that column and creates a new column called <column name>_year, extracting the year from that datetime column. Just to give you a sense of how this works: we have train_df, which is a data frame, and suppose I set col to 'pickup_datetime'; then train_df[col] is just all the pickup datetimes. And if I call .dt.year on it, it gives me just the year for every row of data; you can see 2011, 2012, 2015, and so on. I can actually now save that to train_df; let's say I want to save it as pickup_datetime_year. Right. And another way to write the column name is to just say col + '_year'. So that's what we've done, not just for the year, but for the month, day, weekday and hour, and we've put that into a function.
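A minimal sketch of what that function might look like, assuming the column is already parsed as a datetime:

```python
def add_dateparts(df, col):
    # Extract useful parts of a datetime column into new columns,
    # e.g. pickup_datetime_year, pickup_datetime_month, and so on.
    df[col + '_year'] = df[col].dt.year
    df[col + '_month'] = df[col].dt.month
    df[col + '_day'] = df[col].dt.day
    df[col + '_weekday'] = df[col].dt.weekday
    df[col + '_hour'] = df[col].dt.hour
```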
So I'm just going to call add_dateparts on train_df, and of course we need to give it the column name. Yep. Let's ignore the warning for now. And let's do that for the validation set as well, add_dateparts on val_df, and let's do that for the test set as well. Hmm, it looks like there might be some issue in the test data frame; I'm just going to reload it. I see what happened. This is one of the things with doing things live: there are always some issues. So let's go back to creating the test set. We have a useful table of contents here; again, this is why it's useful to have a table of contents. You also want to specify parse_dates=['pickup_datetime'] here so that this column gets parsed as a date. And let's come back to "extract parts of date": let's add the date parts to the training data frame, let's add them to the validation data frame, and let's add them to the test data frame. Okay, I think we have it everywhere. So let's check train_df, and you should now see that train_df has not just the fare amount and pickup datetime, et cetera, but also pickup_datetime_year, month, day, weekday and hour. You can verify the same for val_df and test_df as well. There you go. So with that, we have added different parts of the date; we've already done some basic feature engineering. Dates are always a low-hanging fruit for feature engineering. You can add more: things like start of quarter, end of quarter, start of year, end of year, weekend versus weekday, et cetera. The next thing that we're going
to add is the distance between the pickup and the drop location, and to do that we're going to use something called the haversine distance. There are many formulas to do this; again, the way I found this was to just look up online "distance between two geographical or map coordinates", "longitude latitude distance", something like that. So there is this formula, and it looks something like this, essentially: it has an arcsin and a sin squared and a cos, et cetera. Then I looked up a way to calculate the haversine distance with pandas; I searched for a fast haversine approximation and how to calculate it, and somebody was very helpful: they created an entire function, which I've directly borrowed over here as haversine_distance. It takes the longitude and latitude of the pickup location and the longitude and latitude of the drop location, and it calculates the approximate distance in kilometers between the two points. This uses great-circle geometry, which accounts for the spherical nature of the earth and how latitudes and longitudes are defined, et cetera. We don't have to get into it, but there are these resources if you want to. And the interesting thing here is that
this works not just with one latitude and longitude, but also with an entire series or list of latitudes and longitudes. If you provide a list of longitudes and a list of latitudes for one point, and a list of longitudes and a list of latitudes for the other (basically a bunch of rows), it's going to perform the computation for each row, and it does that in a vectorized fashion because it uses NumPy, so it's going to be very efficient. So we can directly use it to add the trip distance to our data frames: from the data frame we take the pickup longitude and pickup latitude, then the drop-off longitude and drop-off latitude, and pass them into the haversine function; it computes the distance using the haversine formula for each individual row and adds a trip_distance column. So let's add trip_distance to train_df, let's add trip_distance to val_df, and let's add trip_distance to test_df. And now we can look at train_df.
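For reference, a vectorized haversine helper along these lines, together with the three trip_distance assignments, might look like this (a sketch of the borrowed function; the notebook's exact version may differ):

```python
import numpy as np

def haversine_distance(lon1, lat1, lon2, lat2):
    # Approximate distance in km between two points (or two series of points)
    # given their longitudes and latitudes, using the haversine formula.
    # Works element-wise on NumPy arrays and pandas Series.
    lon1, lat1, lon2, lat2 = map(np.radians, [lon1, lat1, lon2, lat2])
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 6371 * 2 * np.arcsin(np.sqrt(a))  # Earth's radius is about 6371 km

def add_trip_distance(df):
    df['trip_distance'] = haversine_distance(
        df['pickup_longitude'], df['pickup_latitude'],
        df['dropoff_longitude'], df['dropoff_latitude'])

add_trip_distance(train_df)
add_trip_distance(val_df)
add_trip_distance(test_df)
```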
Here you can now see there is a trip_distance column, and this is a distance in kilometers. This one seems like a fairly long trip, this one seems like a shorter trip, 1.3 kilometers, and then there are some trips like 7.1 kilometers. You can already probably tell that the fare for this trip is going to be maybe four or five times the fare for that one, I would imagine. Let's see: yeah, the fare here is 18 and the fare there is 3.7, so it's about five times. So there's already a very useful
feature for us to use. Next, and this is something I learned by looking at some discussions and notebooks on the competition page, we are going to do a slightly more creative piece of feature engineering. We are going to add the distance from popular landmarks: specifically, we're going to check whether the trip ends near one of these landmarks, and specifically airports, because airports have tolls. So we're going to add JFK airport, LaGuardia airport and Newark airport, and we're going to add the locations of Times Square, the Met Museum and the World Trade Center. There are many more: you could have the Statue of Liberty, you could have Central Park, or a bunch of other locations in New York. This is something you have to look up. We will add the distance from the drop location, but feel free to also add the distance from the pickup location; that is left as an exercise for you. Let's see if this helps. And
here's the next tip: creative feature engineering, which generally involves some human insight, because no machine learning model (at least not the simple models we are training) will be able to figure out by itself that a certain location is very important. But we can figure that out quite easily, given the context we have. So it involves human insight, or external data. Here we have picked up the latitude and longitude values for JFK, LGA, et cetera, and this is essentially what is called external data; again, this data will never automatically become available to the machine learning model. Creative feature engineering is often a lot more effective for training good models than excessive hyperparameter tuning, like doing lots of grid searches, or training a model for multiple hours, overnight, or over multiple days, with tens of gigabytes of data and hundreds of columns. So keep in mind that just one or two good features can improve the model's performance drastically. Focus on finding what those one or two good features are; they can improve the model's performance far more than any amount of hyperparameter tuning, and the way you get these features is by doing your exploratory data analysis, by understanding the problem statement better, by reading discussions and by discussing it with people. Also keep in mind that adding too many features is just going to slow down your training, and you won't be able to figure out which ones are actually useful. That's why iterating is very important. So here are the latitudes and
longitudes, the longitude and latitude, for JFK, LGA, Newark, the Met Museum and the World Trade Center. And once again, we can use the haversine_distance function. We are going to give it a data frame, a landmark name, and a landmark lonlat. From the lonlat we get the longitude and latitude of the landmark, and we create a <landmark name>_drop_distance column, where we pass into the haversine function the longitude and latitude of the landmark as the "pickup". This is the interesting thing about this function: it doesn't have to be all numbers or all series; a couple of the arguments can be numbers and a couple can be series, and it will still work fine. So here is the landmark's location, and here are the longitude and latitude of the drop location. We add the drop distance from the landmark in this fashion, and of course we need to do it for a bunch of landmarks, so we have created this add_landmarks function that's going to do it for each of the landmarks.
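A rough sketch of these two helpers; the landmark coordinates here are approximate values filled in for illustration, and the function reuses the haversine_distance helper from above:

```python
# (longitude, latitude) pairs; approximate coordinates for illustration
jfk_lonlat = -73.7781, 40.6413
lga_lonlat = -73.8740, 40.7769
ewr_lonlat = -74.1745, 40.6895
met_lonlat = -73.9632, 40.7794
wtc_lonlat = -74.0134, 40.7127

def add_landmark_dropoff_distance(df, landmark_name, landmark_lonlat):
    lon, lat = landmark_lonlat
    # Distance from the landmark (used as the "pickup" point) to the drop-off
    # location of each ride; haversine_distance is vectorized, so mixing
    # scalars and Series works fine.
    df[landmark_name + '_drop_distance'] = haversine_distance(
        lon, lat, df['dropoff_longitude'], df['dropoff_latitude'])

def add_landmarks(df):
    landmarks = [('jfk', jfk_lonlat), ('lga', lga_lonlat), ('ewr', ewr_lonlat),
                 ('met', met_lonlat), ('wtc', wtc_lonlat)]
    for name, lonlat in landmarks:
        add_landmark_dropoff_distance(df, name, lonlat)
```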
So this is why creating functions is really useful: now we can just do add_landmarks(train_df) (I guess this parameter should just be called df), add_landmarks(val_df), and add_landmarks(test_df). Okay. And now if you check the training data frame, it's looking nice. We have the pickup latitude and longitude, et cetera, but here is where it starts to get interesting: we have the trip distance, the JFK drop distance, the LGA drop distance, the EWR drop distance (this row seems to be near the Met), and then the WTC drop distance (this row seems to be at the WTC). Yep. So now we have a lot more interesting things here, and I think that's enough feature engineering for now,
enough new features, but let's also remove some outliers and invalid data. If you look at the data frame with df.info and df.describe, and we look at test_df.describe(), the test set seems fine; it has fairly reasonable values. You can see here that pickup longitudes are in the -75 to -72 range, and similarly for the drop-off longitudes; the pickup latitudes have a minimum of 40 and a maximum of about 41, so between 40 and 42; and passenger counts are between one and six. Since we are only going to make predictions on these ranges of data, it makes sense to eliminate any training data that falls outside them. And of course the training set does seem to have some incorrect values: -1183 is not a longitude, 208 people cannot fit in a taxi, and even this fare seems high. I wouldn't bet too much on that, a fare could possibly get there, but
there's definitely some issue here. So here's what we're going to do. We are going to limit the fare amount to between $1 and $500, which is mostly already the case, except that there are some negative fares, and we don't want our model dealing with negative fares; it just makes things harder. We may sacrifice predictions for one or two rows in the test set, but the overall gain in accuracy will be better. Then we're going to limit the longitudes to -75 to -72, we're going to limit the latitudes to 40 to 42, and we're going to limit the passenger count to one to six. So now we have this remove_outliers
function, which takes a data frame and picks out the rows that match all these conditions: the fare amount is greater than or equal to 1 (and this is how you combine conditions in pandas while querying or filtering rows) and the fare amount is less than or equal to 500; the pickup longitude is greater than -75 and less than -72, and the same for the drop-off; the pickup latitude is greater than 40 and less than 42, and the same for the drop-off; and the passenger count is between one and six. Okay. So this is how we remove outliers.
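A minimal sketch of that helper, using the ranges just described. Note that it returns a new data frame rather than modifying the one you pass in, which becomes important a little later:

```python
def remove_outliers(df):
    # Keep only rows whose values fall within the ranges seen in the test set
    return df[(df['fare_amount'] >= 1.) &
              (df['fare_amount'] <= 500.) &
              (df['pickup_longitude'] >= -75) &
              (df['pickup_longitude'] <= -72) &
              (df['dropoff_longitude'] >= -75) &
              (df['dropoff_longitude'] <= -72) &
              (df['pickup_latitude'] >= 40) &
              (df['pickup_latitude'] <= 42) &
              (df['dropoff_latitude'] >= 40) &
              (df['dropoff_latitude'] <= 42) &
              (df['passenger_count'] >= 1) &
              (df['passenger_count'] <= 6)]
```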
You don't always have to remove outliers. If your model has to deal with outliers in the real world, then you should keep them; but if your model doesn't have to deal with outliers, or you're going to train a different model for outliers, then it makes sense to remove the ranges of values that don't appear in the test data. So let's remove outliers from the training data frame, let's remove outliers from the validation data frame, and finally let's remove outliers from the test data frame as well. Okay, I wouldn't worry about that, but I would want to save my notebook again. There we go. Okay. All right. So now we have done a whole
bunch of feature engineering: we've extracted parts of the date, we've added the distance between pickup and drop locations, we've added the distance from popular landmarks, and we have removed outliers and invalid data. Okay, next up, here are a couple of exercises for you. You can try scaling numeric columns into the zero-to-one range: right now all of these numeric columns have different ranges, the month data, et cetera. This generally helps with linear models, or any models where the loss is computed using the actual values of the data. You can also try encoding categorical columns: things like month, year, even day of week, et cetera, can possibly be treated as categorical columns, and you can probably use a one-hot encoder there, which makes it a lot easier for decision trees to work with categorical columns. Now, we won't do this here, for a couple of reasons: one, we just want to keep it simple right now and get to a good model first (we can come back and try this later if we have time); and two, tree-based models are generally able to do a decent job even if you do not scale numeric columns or encode categorical columns, assuming you have trees that can go deep enough or you're training enough trees, which we will try to do. But that's an exercise for you: try scaling numeric columns and encoding categorical columns, and if you don't know what these terms mean, check out the Zero to GBM course.
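If you want to attempt those exercises, here is a hedged sketch using scikit-learn; the column choices are illustrative, not necessarily the ones you'd pick:

```python
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Scale a few numeric columns to the 0-1 range, fitting only on the training set
numeric_cols = ['trip_distance', 'jfk_drop_distance', 'passenger_count']
scaler = MinMaxScaler().fit(train_df[numeric_cols])
for df in (train_df, val_df, test_df):
    df[numeric_cols] = scaler.transform(df[numeric_cols])

# One-hot encode a categorical column such as the weekday
# (on scikit-learn versions older than 1.2, use sparse=False instead of sparse_output=False)
encoder = OneHotEncoder(sparse_output=False, handle_unknown='ignore')
encoder.fit(train_df[['pickup_datetime_weekday']])
weekday_cols = list(encoder.get_feature_names_out(['pickup_datetime_weekday']))
for df in (train_df, val_df, test_df):
    df[weekday_cols] = encoder.transform(df[['pickup_datetime_weekday']])
```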
Now, another tip that I have for you. We've spent close to an hour or more now preparing this data for training, and we still haven't trained our first machine learning model. But what you can do now is actually save these intermediate outputs and download those files, or even put them into Google Drive, and then load them back later when you're working in your next notebook. This way you save all the time of downloading the data, preparing the dataset and running through all the steps; you can start from the point where you have this pre-prepared data. You may definitely want to do this for the entire dataset: do the processing once, get those processed files, and save them for the entire dataset so that you don't have to
download and redo this processing on the entire dataset of 55 million rows. Okay. A good format that you can use is the Parquet format. You could just do train_df.to_parquet; the Parquet format is really fast to load and to write, and it also has a very small footprint on storage, so it's a good intermediate format to use when you know you're going to load it back using pandas. CSV, unfortunately, is a very heavy format, because everything has to be converted to a string. And similarly we can do val_df.to_parquet. You can see that these files are here, and you can download them, or you can even just push them to your Google Drive; all of those things are possible.
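For example (the file names are placeholders, and to_parquet/read_parquet need the pyarrow or fastparquet package installed):

```python
# Save the processed data frames in the Parquet format...
train_df.to_parquet('train_processed.parquet')
val_df.to_parquet('val_processed.parquet')

# ...and in a later notebook, load them back almost instantly.
train_df = pd.read_parquet('train_processed.parquet')
val_df = pd.read_parquet('val_processed.parquet')
```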
Now, another tip, or sort of a corollary: you may also want to create different notebooks for EDA, feature engineering and model training. Your initial EDA notebook is just somewhere you experiment with different graphs, et cetera; your feature engineering notebook is where you create new features and then output these Parquet files; and your model training notebooks can simply use the outputs of your feature engineering notebooks. And there is a way to connect Google Colab to Google Drive so that you can organize all your work well. Okay. So let's now train and evaluate some more models. We'll train three kinds of models. There are many more you can train, but because we have limited time we are going to train these three: linear regression (or a form of linear regression called Ridge), random forests, and gradient boosting models. Maybe I should just change this here; I'm going to change this to Ridge, and you can try things like Lasso, ElasticNet, et cetera. Okay. But before we train the models, once again
we have to create inputs and targets. We've added a bunch of new columns, so let's just look at train_df.columns. Now, for the input columns, we're going to skip the fare amount and the pickup datetime. I'm still going to keep the pickup latitudes and longitudes, because decision trees might still be able to use them. So those are all our inputs; that looks good. And then we have our target column, and the fare amount is the target column. So now we can create train_inputs, which is train_df with just the input columns, and train_targets, which is train_df with just the target column. Then val_inputs, which is val_df[input_cols], and val_targets, which is val_df[target_col]. And finally we have the test inputs, which is just test_df[input_cols]. Okay, perfect.
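Putting that together, a sketch might look like this (the exact column list depends on the feature engineering steps above; the 'key' column is excluded only if it is present):

```python
# Everything except the target and the raw datetime (and id) columns is an input
input_cols = [c for c in train_df.columns
              if c not in ('fare_amount', 'pickup_datetime', 'key')]
target_col = 'fare_amount'

train_inputs, train_targets = train_df[input_cols], train_df[target_col]
val_inputs, val_targets = val_df[input_cols], val_df[target_col]
test_inputs = test_df[input_cols]
```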
Then, before we train models, I'm just going to create a helper function to evaluate models, which takes the model, the training inputs and the validation inputs. Here is what it does: assuming this is a trained model, it first makes predictions using the trained model on the train inputs, which gives us the train predictions. It then computes the mean squared error between the training targets and the training predictions. But maybe we can just drop these arguments and use the globals, because the model is what changes most of the time. So we have a function evaluate, which takes a model: it gets predictions on the training set and computes the mean squared error using the training targets and the training predictions; it then gets predictions on the validation set and computes the mean squared error using the validation targets and validation predictions; and it returns the root mean squared error for the training and validation sets, along with the predictions for both.
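A minimal sketch of that helper, relying on the train_inputs, train_targets, val_inputs and val_targets globals created above:

```python
import numpy as np
from sklearn.metrics import mean_squared_error

def evaluate(model):
    # RMSE and predictions on the training set
    train_preds = model.predict(train_inputs)
    train_rmse = np.sqrt(mean_squared_error(train_targets, train_preds))
    # RMSE and predictions on the validation set
    val_preds = model.predict(val_inputs)
    val_rmse = np.sqrt(mean_squared_error(val_targets, val_preds))
    return train_rmse, val_rmse, train_preds, val_preds
```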
So now evaluating models is just going to be a single line of code. Let's start with Ridge regression: from sklearn.linear_model import Ridge. Once again, if you want to learn about Ridge regression, you can check out the documentation here, or you can do the Zero to GBM course. Let's create a model; let's call it model1. Ridge... let's see, there are a few options here, and I think we can specify a random state. I'm just going to do random_state=42 so that I get the same result each time. Now, Ridge uses something called regularization in combination with linear regression in the objective function, and you can specify the value of this alpha here, so I would encourage you to play around with it. Let's do alpha=0.9, maybe. Ridge also has a bunch of other things you can set (you can set a solver and a bunch of other options), so I'll encourage you to try it out, but that's model1. Let's train our model by calling model1.fit on the train inputs; of course, we need to provide the targets as well. The model is now going to train, and it will try to figure out a set of weights that can be applied to the input columns to create a weighted combination of input columns that predicts the target value, which is the fare amount. So the fare amount is being expressed as some weight w1 multiplied by, let's say, the distance, plus some weight w2 multiplied by the pickup latitude, plus some weight w3 multiplied by the number of passengers, et cetera. It's a weighted sum that tries to predict the fare, and when we call fit, Ridge figures out a good set of weights.
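In code, that is roughly the following (alpha=0.9 being the value tried here):

```python
from sklearn.linear_model import Ridge

model1 = Ridge(alpha=0.9, random_state=42)
model1.fit(train_inputs, train_targets)

evaluate(model1)  # returns (train_rmse, val_rmse, train_preds, val_preds)
```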
Now we can evaluate the Ridge model by just calling evaluate(model1). Okay, hmm, I'm not sure what the issue here is. Let's go step by step and we should be able to figure it out; it doesn't seem like a big issue to me. Here we have train_preds = model1.predict(train_inputs). Yep. And similarly we have the validation predictions. I'm just going to change this to RMSE. Let's try to get the RMSE here; yep, this works too. Let's try to get these going. Okay, something's wrong with the validation predictions. What did we break here? Ah, I see. Yep. So that's just a quick note on debugging: whenever you have a function you need to debug, a good way is to break it down line by line, identify which line is causing the issue, and then go back to where you created that variable, or whatever previous step led to it. Right, live debugging is always fun. Okay. So let's evaluate the model, and
upon evaluation, this model gives us 8.0 as the training set RMSE and 8.2 as the validation set RMSE. That's somewhat better than our previous model. Okay, in this case I was probably getting 5.2 earlier, maybe without this alpha; let me try that again. Yup, so it's 8.0 and 8.2. It's somewhat better than our baseline model: not great, but still better. Let's check whether train_inputs still has the same shape; yep, it does. Okay. Now remember our submit-and-predict function, sorry, predict_and_submit. We can use this reusable function: give it model1, give it the test inputs, and give it a file name. Yep, so those are our predictions for the Ridge model. Let's take this set of predictions and upload them, and see what we get. So download the Ridge submission and go back here, upload it, and let's see what that does. Okay, it's almost up. Yep, let me just call that 'ridge'. So 7.72: not bad, not bad at all. That's better than what we had before. So we're getting there; we're getting better. Let's try a random forest. And of course, at this point I would also
go and add it to my experiment sheet: I would note down that Ridge gave me 7.72 on the leaderboard and around eight-point-something on validation, and so on. Okay, let's try a random forest now. I'm going to import from sklearn.ensemble, I believe, RandomForestRegressor, and then model2 = RandomForestRegressor, and I'm going to set the random_state to 42. And I'm going to set something to minus one; yeah, n_jobs, and n_jobs is the number of workers. That reminds me, though: maybe I made a
mistake while removing outliers. What's happening here is that when we remove outliers, we are returning a new data frame. So what we really need to do is train_df = remove_outliers(train_df), and val_df = remove_outliers(val_df), and finally the same for the test data frame. Yeah, so that has now actually, properly removed the outliers. Earlier we did not actually remove them: we simply created new data frames but did not assign them back. But that gives us an opportunity to see how actually removing the outliers affects the model. I'm training my Ridge regression model once again, and lo and behold, our model's error went down from 7.2 to 5.2. So just by limiting the training data, the columns of the training data, to the range of values within the test data, we
are able to train a much better model. Okay. Once again, we can test this out. I'm going to download this Ridge submission CSV file; once again, let's download that. And I think this is a great example of how much feature engineering can change things: we have gone from 7.2 to 5.2, which is about a 30% reduction in the error. So let's upload this: let's click late submission again and upload the new Ridge file. It's done, and that puts it at 5.15. Let's look at the leaderboard before we go on to the random forest. So 5.15: where does that put us, out of 1,478 submissions? Let's load those submissions quickly and search for 5.15. Okay, that puts us at around 1,167. So we've already moved up almost 300 places from our original submission, which was very low, like 9.8, which is way down here; 9.8 and 9.4 were somewhere around there, around the 1,400 mark, and 5.15 is somewhere here. So we jumped about 300 places just by doing some good feature engineering. Of course, I say "good" and "just", but it has probably taken people several weeks, or a month or more, to figure out that these features would be useful. So if you can think of more creative features, even with a very simple Ridge regression, or even a plain linear regression, you'll be able to move up even higher, maybe into the top thousand. And of course, let's not forget that we're still only working with 1% of the data, and this can keep getting better. So let's go and try a random forest. I'm going to try a
random forest regressor. In a random forest, we train a bunch of decision trees, each of those decision trees makes a prediction, and then we average the predictions from all the decision trees. I'm setting random_state=42 so that we always get the same set of predictions, and n_jobs=-1 makes sure that the trees can be trained in parallel. Then we have max_depth: by default the decision trees in a random forest are unbounded, but for a very large dataset you may not want to create unbounded trees, because they may take a very long time and they may also badly overfit the training data. So I'm just going to specify a max_depth of, I don't know, maybe 10, let's say. And by default, how many trees is it going to train? Let's see: n_estimators. The default for n_estimators is 10; let me train a hundred. Okay. And let's just time the training here; I'm going to put %%time here, which times the cell, and then model2.fit(train_inputs, train_targets). Okay.
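The model being trained here is roughly the following sketch:

```python
from sklearn.ensemble import RandomForestRegressor

# n_jobs=-1 trains the trees in parallel; max_depth keeps each tree bounded
model2 = RandomForestRegressor(n_estimators=100, max_depth=10,
                               random_state=42, n_jobs=-1)
model2.fit(train_inputs, train_targets)
```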
And while it trains, let's see if we have any questions. First question: "Would Parquet support a pre-compressed file?" I'm not sure exactly what the question is, but yes, I think Parquet would support a pre-compressed file. If you have a pre-compressed Parquet file, you can load it back using pandas; you may have to specify the type of compression, possibly. Second question: "Shall we use pickle?" Well, I don't know if you can use pickle here, but Parquet works just fine. "If we limit the range of the train data
to match that of the test data, aren't we reducing the generalization of the model? For instance, we won't be able to use the same model for a future dataset that might not have the same range as the past data." Absolutely, and that is why having a good test set matters. Your test set should be as close as possible to what your model is going to face in the real world, because otherwise the test set is useless. If we are going to use the model in the real world on data other than what is present in the test set (other kinds of data, or other ranges of values), then predictions on the test set are not very indicative, and our measured accuracy is not very indicative. Even if you're getting 90% accuracy on the test set, in the real world the model might be only 30% accurate. And this happens all the time; probably 80% of machine learning models face this issue. So what I would suggest instead is to create the test set in a way that captures the entire range of values the model can encounter in the real world, even if that means coming up with some estimates. I know this may not always be possible, but that's a great question, so thanks. "How do we know the number
of landmarks to create?" You don't; you have to try a few, then train some models, try a few more, do some exploratory analysis. Maybe draw a geographical heat map and see where there's a lot of traffic, et cetera. Okay. So the random forest
model has trained; it took nine minutes and 57 seconds. Let's see what this model does. First let's evaluate the model, so let's call evaluate(model2). Okay, the model is able to get to a pretty good place; it seems like we are down to 4.16. That's not bad at all. Let's make a submission. So 4.16 is the validation RMSE, and the training RMSE is 3.59. Let's make a submission here: predict_and_submit with model2, and let's call it 'rf_submission.csv'; it also requires the test inputs. Yep. Okay, so now we have generated a random forest submission; let's take that and download it. We are just two hours in at this point, and making good progress. Of course, putting this outline together definitely took more time, but even for a couple of days of work this is not a bad result at all. And remember, we are only using 1% of the data. At this point I might even just want to take this random forest configuration and put it into the submission comments, so that when I'm looking back at my submissions I can see exactly what this model contains. Interesting thought. Okay. It seems like this might
take a minute or two; yeah, there's probably some issue here, so why don't we train an XGBoost model in the meantime? The next model we're going to train is a gradient boosting model. A gradient boosting model is similar to a random forest, except that each new tree it trains tries to correct the errors of the previous trees. That's the "boosting" part, technically, and that's what makes it sometimes a lot more powerful than a random forest. We're going to use the XGBoost library: from xgboost import XGBRegressor, and then let's create model3 = XGBRegressor and give it a max_depth of three; let's make it five, maybe. The learning rate looks fine, and n_estimators of a hundred looks fine. Now, the objective: we may want to change the objective here to the squared-error objective, because we're dealing with the root mean squared error. Again, you can look up the documentation to understand what each of these means; in this case the objective, so let's look up "XGBoost RMSE objective". Yeah, you can always look it up. I think this is the one I'm going to use: reg:squarederror. So that's the objective, or the loss function. Let's maybe change n_estimators to 200, give it a little longer to train, let's set the random state, and let's set n_jobs for some parallelism. Okay. Let's train the model; then we will evaluate it, and then of course we will also predict and submit model3 with the test inputs, and let's call that 'xgb_submission.csv'. Okay.
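The gradient boosting model being set up here is roughly this sketch, reusing the evaluate and predict_and_submit helpers from earlier:

```python
from xgboost import XGBRegressor

model3 = XGBRegressor(n_estimators=200, max_depth=5, learning_rate=0.1,
                      objective='reg:squarederror',
                      random_state=42, n_jobs=-1)
model3.fit(train_inputs, train_targets)

evaluate(model3)
predict_and_submit(model3, test_inputs, 'xgb_submission.csv')
```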
So let's give that a minute to train. In the meantime, let's check this out. Okay, it looks like the random forest submission got to 3.35. Where does that put us on the leaderboard? 3.35; let's go down. It's getting pretty close: the top is around 2.8 and we're down to 3.35, still with only 1% of the data, still with a model that just took a few minutes to train. So we land at around 560 out of 1,478; that's in the top 40%. So we're in the top 40% already, and that's not bad at all. A top-40% model is actually a very good model, because most of the top submissions were trained for a very long time and also use a lot of ensembling techniques, et cetera. And here's that XGBoost model: it took about 34 seconds, a very quick model (I'm sure we can bump up the number of estimators by a lot more), and it is able to get to 3.98. Is that better than the random forest? Yeah, it seems like it's better than the random forest. So in just 35 seconds or so, we were able to train the best model so far, probably. So let's go down here, go to xgb_submission.csv, download it, and save it. Come back to late submission and put that in here. I'm also going to add the description: I'm going to drop the XGBRegressor configuration right into it, and let's submit that and see what happens. Perfect. It seems like we have made a submission, and that brings us to 3.20. A quick look at the leaderboard once again: 3.20. And while that loads, let me also just tell you what is coming up next. So far we have just evaluated a bunch of different models, and I would encourage you to try out a few more. I'm going to commit the notebook here as well, but the next thing we should be doing is tuning some hyperparameters. So let's see: 3.20, I believe. Okay, we are up to around 440. Pretty good; 440 out of 1,478 is okay, we've hit the 30% mark already. That's not bad. So the next thing we're talking about
is tuning hyperparameters. Now, tuning hyperparameters is unfortunately more of an art than a science. There are some automated tools available, but they will typically try a whole bunch of things, like a grid search, and take a long time. You ultimately have to train a lot of models and build some intuition about what works and what doesn't, but I'll try to give you a couple of tools here that you can use. Here's the first strategy, which is about picking the order in which you tune hyperparameters. What you should do is tune the most important and impactful hyperparameters first. For example, for the XGBoost model, the number of estimators (the number of trees you want to train) is the most important hyperparameter. For this, you also need to understand how the models work really well, at least in an intuitive fashion, if not the entire mathematics behind them; all of these models, once you understand them intuitively, can be described in a single paragraph or maybe two. So for XGBoost, one of the most important parameters is n_estimators, and you tune that first (we'll talk about how to tune it); then, with the best value of the first hyperparameter fixed, you tune the next most impactful hyperparameter, which in this case I believe would be max_depth, and so on. So: tune the most impactful hyperparameter and use its best value. And what do I mean by best? Use the value that gives you the lowest loss on the validation set while still training in a reasonable amount of time; it's time versus accuracy. Wherever you feel "this is giving me the best result on the validation set", use that value, and then, with the best value of the first hyperparameter fixed in all future models, tune the next most impactful hyperparameter. Let's say the best value for the number of estimators is 500: keeping that 500 fixed, you tune the next most impactful hyperparameter, like max_depth; then, keeping the max_depth fixed, you tune the next most impactful hyperparameter, and so on, going down four, five, six, seven hyperparameters. Then go back to the top and tune each parameter once again for the marginal gains. So that's the order: you go through the parameters, get the best value, and move forward. As I said, it's more of an art than a science, unfortunately, so try to get a feel for how parameters interact with each other, based on your understanding of each parameter and on the experiments that you do. Now, in terms of how to tune hyperparameters, there's an image that captures the idea really well, called the overfitting curve. Yeah, this is
the one I'm looking for. The idea here is that hyperparameters let you control the complexity of the model. For certain hyperparameters, when you increase the hyperparameter, you increase the complexity, the capacity, of the model in some sense. For example, if you increase the max depth of the trees, or you increase the number of estimators, you are increasing the capacity of the model: you are increasing how much it can learn. Let's say you vary the number of estimators from very small to very large. When you have very few estimators (a very small, very limited model), both the training error and the test or validation error are pretty high, because your model has very low capacity; it simply doesn't have enough parameters to learn enough about the data. As you increase the model's capacity (increase the number of estimators, or increase, say, the max depth), the model can start to learn more, so the training error starts to decrease and the validation error starts to decrease, up to a point. And then, at a certain point, the validation error starts to increase. This is what's called overfitting. This is where the model, instead of trying to learn the general relationship between the inputs and outputs, starts to memorize specific values or specific patterns in the training data (specific examples, or specific sets of examples) to further reduce the loss. As you make the model more and more complex, by increasing the number of parameters it has, by increasing, say, the max depth, it can memorize every single training input; that's what decision trees do if you don't bound their depth. When your model gets to that point, it's a very bad model, because all it is, is a model that has simply memorized all the answers; any time you give it a new question, it completely fails. It's like memorizing answers for an exam versus understanding the concepts: as you study the material and do some practice questions, your understanding gets better, but if you get to a point where you're just blindly memorizing all the answers, your understanding may actually get worse, because you won't know how to solve general problems. It's not generalizing well enough. So that's the point we want to find. And what we've done here for you
is to create a couple of helpers. One is called test_params, which takes a model class and a set of parameters, then trains the model with the given parameters and returns the training and validation RMSE. The other is called test_params_and_plot, where you provide a model class, a parameter name, a set of values for that parameter that you want to test, and the other parameters that you want to hold constant while varying this one. It then trains a model for each of those values and plots the resulting figure for you.
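A rough sketch of what these two helpers might look like; the plotting details are illustrative and the notebook's actual versions may differ:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import mean_squared_error

def test_params(ModelClass, **params):
    # Train a model with the given parameters and return train/val RMSE
    model = ModelClass(**params).fit(train_inputs, train_targets)
    train_rmse = np.sqrt(mean_squared_error(train_targets, model.predict(train_inputs)))
    val_rmse = np.sqrt(mean_squared_error(val_targets, model.predict(val_inputs)))
    return train_rmse, val_rmse

def test_params_and_plot(ModelClass, param_name, param_values, **other_params):
    # Vary one hyperparameter while holding the others fixed, and plot
    # the training and validation RMSE for each value
    train_errors, val_errors = [], []
    for value in param_values:
        params = dict(other_params)
        params[param_name] = value
        train_rmse, val_rmse = test_params(ModelClass, **params)
        train_errors.append(train_rmse)
        val_errors.append(val_rmse)
    plt.plot(param_values, train_errors, 'b-o', label='Training')
    plt.plot(param_values, val_errors, 'r-o', label='Validation')
    plt.xlabel(param_name)
    plt.ylabel('RMSE')
    plt.legend()
    plt.show()
```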
I'm going to show you what it does in just a second, so don't worry too much about the function code right now; here's what we're going to do. We're going to try to tune the number of trees. What's the number of trees we have right now? 200. So I'm going to try to figure out whether we should be increasing the number of trees or decreasing it. The way I'm going to go about this is by calling test_params_and_plot, and let's time that as well. In test_params_and_plot we pass, first, the type of model that we want to train, which is XGBRegressor; then the parameter name that we want to vary, which is n_estimators; then the values we want to try: let's try 100, let's try 200 (which we've already tried), and let's try 400, say, just doubling the number and seeing if that helps; and then let's set the other parameters: random_state to 42, n_jobs to -1, and objective to reg:squarederror. So I'm just going to pass the other parameters; these are called kwargs, or keyword arguments. Each key inside best_params is going to be passed as an argument to test_params_and_plot, and ultimately it gets passed down to the XGBRegressor model. Okay, this is going to take a while, so how about we start filling out some code in the meantime. What we're going to do after this is: into best_params, I'm going to add what I think is the best value of n_estimators that we should use. Then we're going to do some experiments with max_depth. What did we start with? We started with a max_depth of five, so maybe let's try three and seven, or three and six, for max_depth. So test_params_and_plot with XGBRegressor: we want to test max_depth, and we want to test the values three, five, and maybe six; seven, let's say, may take a long time. Oh, there it is. Okay. So it seems like the number of
estimators isn't really making a big difference at the moment. It seems like maybe we should reduce the learning rate or something and then try changing the number of estimators. Let me change the initial learning rate here to 0.05 instead of 0.1 and see if that gives us any benefit, but yeah, we can try a max_depth of three, five, maybe seven, let's say, and give it the best parameters. Then we are going to add the best max_depth to the best parameters, and we can try the same thing with the learning rate as well: we can try learning rates of 0.05, 0.1 and 0.2, and so on. Okay. Yeah, this isn't really doing much, so maybe the number of estimators isn't really helping; no need to worry about it, let's just go with a hundred for now. Then let's try a max_depth of three, five, and seven and see what that does, and then we'll try different learning rates. So this is what we want to do; I hope you're getting the idea here. What you want to do is first train a basic model. So have a set of initial params
that you want to start with. Then, for each hyperparameter, try out a bunch of different values: try decreasing it, try increasing it, maybe try five values, maybe try seven. Look at the curve that gets created and try to figure out where the best fit is, and the best fit is the point where the validation error is the lowest. Once you put in enough values, you will see a curve like this, and you want to pick not the point where the two curves are closest, not the point where the training error is lowest, not the point where anything is highest; you want to pick the point where the validation error is the lowest. Okay. Now, one caveat: sometimes the curve may not be as nice as this. Sometimes it may sort of flatten out, and if it's flattening out, that means it's still continuing to get better; but if going from this point to that point is going to take three or four times the amount of time to train the model, you're probably better off just picking the value where it's starting to flatten out, so that you can try more experiments faster. Okay, that's something worth thinking about. So here it is. You can see with max_depth, it seems
like if we, if you were to pick a max step of seven, that will actually train
much faster and I'm sorry, that would actually give us a much better model. And so I'm just going to pick,
So I'm just going to pick a max_depth of seven, let's say. Then we can try out some learning rates: I'm going to try out a bunch of learning rates here, and based on that I'm going to pick a learning rate. And then of course we can continue trying to tune the model. So you want to try this with all the different hyperparameters, not just these ones.
And here is a set of parameters that works well. Here's one where we have 500 estimators, and then we're trying a max_depth; let's go a little bigger, let's try a max_depth of eight, with a slightly lower learning rate, because as you increase the number of estimators you want to decrease the learning rate. Then there's subsample: for each tree, we only want to use 80% of the rows. And then there's something called colsample_bytree: for each tree that we build, we only want to use 80% of the columns. These are just a couple of things you can try, 0.8, 0.7, the same way that we've tried values with test_params_and_plot, and see where that takes us. So I'm going to run this model now.
So I'm going to train this model here, xgb_model_final, and I'm going to fit it to the training inputs and training targets. I'm then going to evaluate the model, and then predict and submit, writing the predictions out to a submission CSV file for this XGBoost model. Okay, so I'm just going to let this run and see where that gets us.
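The cell being run here looks roughly like the following sketch. The hyperparameter values are the ones discussed above but are still illustrative, and train_inputs, train_targets, test_inputs, test_df, and the submission file name are assumed names from earlier in the notebook.

```python
import pandas as pd
from xgboost import XGBRegressor

xgb_model_final = XGBRegressor(
    n_estimators=500,       # more trees ...
    max_depth=8,            # ... that are a bit deeper ...
    learning_rate=0.05,     # ... with a lower learning rate
    subsample=0.8,          # use 80% of the rows for each tree
    colsample_bytree=0.8,   # use 80% of the columns for each tree
    random_state=42,
    n_jobs=-1,
)
xgb_model_final.fit(train_inputs, train_targets)

# Predict on the test set and write a submission file in Kaggle's key,fare_amount format.
test_preds = xgb_model_final.predict(test_inputs)
submission = pd.DataFrame({'key': test_df['key'], 'fare_amount': test_preds})
submission.to_csv('xgb_submission.csv', index=False)
```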
In my case, I think the last time I trained this, it was able to get us to about the 460th position. I'm hoping we can beat that; I'm hoping we can get maybe into the top 25 or 26 percent, but it should be somewhere around that point. And again, what is pretty
amazing here is that we are still just using 1% of the data. We are throwing away 99% of the
train data, never looking at it. And the reason we are able to do that
is because the test set is really small. The test set is just 10,000 rows
and to make predictions on a test set of 10,000 rows, you don't
really need 55 million rows. Yes, it will help to add more data; using the entire dataset will definitely make the model better. But if you're always working with 55
million rows, then to do what we just did in less than two and a half hours,
it would take you probably a couple of weeks, maybe longer because one of the
things we were able to do here right now, while working with a sample is we
were able to fix errors very quickly. We were able to try out
new ideas very quickly. We were able to brainstorm and go from thought to action very quickly, whereas if you're working with 55 million rows, every
action that you take, every cell that you run is going to run for a couple of hours. And by the time you come back,
you're going to be tired. You're going to forget
what you had in mind. So speed of iteration is very important, and feature engineering is very important. Hyperparameter tuning is often just a fairly small step which gives you that last bit of a boost, but it's generally not the biggest factor. So let's just take this and submit it. This is looking pretty promising; it has gotten to 3.8. Let's see if that is any better than the best model that we had. Let's submit that. So that's why it's very important to plan your machine learning project well. It's very important to iterate. It's very important to try as many
experiments as quickly as you can, and track them systematically. It can make the difference between a machine learning model taking months and still not getting to a good result, versus getting to a really good model, something that can be used in the
real world in a matter of hours. Okay, let's see where this gets us. So we just submitted, and this got us to about 3.20. Let's check where that puts us on the leaderboard. 3.20; yeah, I bet it would still be under the 30% mark, which is pretty good considering this is a single model (most models on Kaggle use ensembles) and considering our model has taken, what is this, one minute to train? Not even 10 minutes; our model was trained for just one minute, and we haven't even fully optimized the hyperparameters yet. So there's a lot more we can do in terms of hyperparameters as well. Let's see: 3.20. Okay, that puts us at position 440. Yeah, so that is within the top 30%. And I encourage you to simply take this further: instead of 500 estimators, maybe go for 2,000 estimators and see what that does. So here are some exercises for you:
tune the hyperparameters for the ridge regression and for the random forest, and see what's the best model you can get. Repeat with 3% of the data, then 10%, 30%, and 100%, so basically 3x-ing each time: from 1% to 3%, 3% to 10%, and so on. See how much reduction in error 3x the data produces, 10x the data produces, 100x the data produces, and you will see that the reduction is not 100x, but the time taken definitely becomes a lot more. And finally, a last couple
of things: you can save the model weights to Google Drive. There are a couple of ways to do this. I'm not going to do it right now, but I'm just going to guide you to the right place here. The way to save model weights is to use a library called joblib. You can simply do from joblib import dump, and then you can take any Python object and dump it into a joblib file. So you can maybe just put in the model itself, that XGBoost model, dump it into a file, and then load it back and use it just like the XGBoost object it is. Or you could create a dictionary and put into it the XGBoost model and anything else you need to make predictions, like a scaler if you were using one, or an imputer of some kind; put all of those into the dictionary and then dump it. So that's how you save models.
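As a minimal sketch, assuming the xgb_model_final trained above (the scaler and imputer entries only apply if you actually used them), saving and loading with joblib looks like this:

```python
from joblib import dump, load

# Bundle everything needed to make predictions into a single dictionary and save it.
taxi_fare_artifacts = {
    'model': xgb_model_final,
    # 'scaler': scaler,    # include only if your pipeline used one
    # 'imputer': imputer,  # include only if your pipeline used one
}
dump(taxi_fare_artifacts, 'taxi_fare_model.joblib')

# Later, in another session or notebook, load it back and use it as before.
artifacts = load('taxi_fare_model.joblib')
predictions = artifacts['model'].predict(test_inputs)
```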
Then there's I/O, inputs and outputs, with Google Drive on Google Colab. You can mount your Google Drive this way: you say from google.colab import drive and then drive.mount('/content/drive'). When you do that, your Google Drive is going to show up under /content/drive, here on the left. Let me try that. I'm not going to run the whole thing right now, but your Google Drive is going to show up here on the left. I believe this one is going to ask you to take some additional steps: you will have to open a link and enter an authorization code, similar to adding your Jovian API key, and that attaches your Google Drive. Once your Google Drive is attached, you can take the joblib file that you created here and put it into your Google Drive.
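The Colab side of this, as a sketch (the folder name inside Drive is just an example):

```python
import os
import shutil
from google.colab import drive

# Mount Google Drive; the first time, this asks you to authorize access.
drive.mount('/content/drive')

# Copy the saved joblib file into a folder on your Drive.
os.makedirs('/content/drive/MyDrive/taxi-fare', exist_ok=True)
shutil.copy('taxi_fare_model.joblib', '/content/drive/MyDrive/taxi-fare/')
```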
So now you can have an EDA notebook, then a feature engineering notebook which takes the data, adds a bunch of features, and saves those files in Parquet format to Google Drive. Then you can have your machine learning notebook, which can pick up those files and train a bunch of models, and whichever are the best models, you can write those back to Google Drive. And then you can have an inference notebook which can load those models from Google Drive and make predictions on new data, or on individual inputs.
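For instance, the feature engineering notebook might end with something like this, and the training notebook would start by reading the same files back. The paths and dataframe names are assumptions, and Parquet support needs pyarrow (preinstalled on Colab).

```python
import pandas as pd

# End of the feature engineering notebook: save the processed frames to Drive.
train_df.to_parquet('/content/drive/MyDrive/taxi-fare/train_features.parquet')
val_df.to_parquet('/content/drive/MyDrive/taxi-fare/val_features.parquet')

# Start of the model training notebook: load them back.
train_df = pd.read_parquet('/content/drive/MyDrive/taxi-fare/train_features.parquet')
val_df = pd.read_parquet('/content/drive/MyDrive/taxi-fare/val_features.parquet')
```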
Making predictions on individual inputs is, again, something that I would suggest if you hit a wall at some point: look at some individual samples from the test set, put those into the model, see the prediction on that individual input, and check whether that prediction makes sense to you. Just eyeball the predictions, and then you'll get some more ideas. Then you can do some more feature
engineering, and that's the iterative process that you want to follow. You want to make submissions
every day, day after day. Now, one other thing that we've not
covered here is how to train on a GPU. You can train on a GPU with the
entire dataset to make things faster. There's a library called Dask that you can use, and another library called cuDF, the CUDA DataFrame, which can take the data from the CSV file and put it directly onto the GPU. Remember, on Colab you also get access to a GPU, so you can take the data and put it directly onto the GPU. Next, you can create training and validation sets and perform feature engineering directly on the GPU; it's going to be a lot faster. And most importantly, the training can be done using XGBoost on the GPU itself, and that's again going to be a lot faster, probably orders of magnitude faster. So the entire process of working with the full dataset can be reduced to maybe 10 or 20 minutes of work.
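A rough sketch of what that could look like is below. This assumes a RAPIDS-enabled environment with cuDF installed and a GPU attached; the file path is a placeholder, and depending on your XGBoost version the GPU option is either tree_method='gpu_hist' (older releases) or tree_method='hist' with device='cuda' (2.0 and later).

```python
import cudf
from xgboost import XGBRegressor

# Read the CSV straight into GPU memory; cuDF mirrors much of the pandas API.
train_gdf = cudf.read_csv('new-york-city-taxi-fare-prediction/train.csv',
                          usecols=['fare_amount', 'passenger_count',
                                   'pickup_longitude', 'pickup_latitude',
                                   'dropoff_longitude', 'dropoff_latitude'])

train_inputs = train_gdf.drop(columns=['fare_amount'])
train_targets = train_gdf['fare_amount']

# Train XGBoost on the GPU itself.
model = XGBRegressor(n_estimators=500, max_depth=8, learning_rate=0.05,
                     tree_method='gpu_hist')
model.fit(train_inputs, train_targets)
```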
Now, Dask, cuDF, and cuML have very similar APIs, very similar functions and arguments, to pandas and XGBoost, but some things are different and some things have to be done differently; unfortunately, it's not a hundred percent compatible API. So I've left you a few resources that
you can check out. Specifically, do check out this project by Alan Cohn. He was one of the members of the Jovian community who has created a model using the Dask library, and he used a hundred percent of the data. His model trains in under 15 minutes, I believe, and he was able to get to 2.89, which was in the top 6%. Of course, it took him several days to write out the code and learn the different things required to do this. But a single model, trained in under 15 minutes on the entire dataset, a hundred percent of the data, placed him in the 94th percentile, or the top 6%, of this competition. So you can check out his notebook as well; it's listed here, and it's a good tutorial on how to use Dask. So that's an exercise for you.
I want you to take away from this workshop, always document and publish
your projects online because they help improve your understanding. When you have to explain what you've
done, there are a lot of things that you've probably just copy pasted or taken
for granted or not really thought about that you have to now put into words and
that forces you to think and understand and fill the gaps in your understanding. So that's very useful to
improve your understanding. It's a great way to showcase your skills. If you're going to write on your resume
that you know machine learning under a skills section, without offering any evidence for it, there is no way somebody is going to believe that you know machine learning, and they don't have the time to actually interview hundreds of people and figure out what they know. So the best way to offer evidence is to write a blog post that explains what you did, and link it from your resume. And the last thing is that as people
read your blogs, or you share them on LinkedIn or Twitter or wherever, that will lead to inbound job opportunities for you. People will reach out to you. Recruiters will reach out, employers will reach out: "I saw the project that you did. It looks pretty interesting. We have a similar problem here at our company. Would you be interested in talking?" And you won't believe how much easier it is going to become
for you to find opportunities. If you consistently write blog
posts and publish your projects online, any project that you're
doing, please put it up online. Please add some explanations using Markdown, spend another hour or two to clean up the code and create functions, show that you are a good programmer, and publish the Jupyter notebook to Jovian. It takes just one step; we've made it so simple for you because we want you to publish these articles with us. You can run jovian.commit, or you can download the notebook (go to File, then download the notebook as .ipynb) and upload it on Jovian: you can go here to New and upload a notebook. It's really easy. But yeah, when you do that,
you can now share this notebook with anyone, right? And you can also write
blog posts like this one. And the benefit of blog posts is that
you don't have to show the entire code. You can make it much shorter
and you can focus on the bigger narrative or the bigger idea here. I think this is a great blog post
about the different steps involved here and the things that Alan tried
without showing maybe hundreds of lines of code, right? So it's a good summary blog post of the code, and it's a great way to share what you've done with somebody and summarize it. Now, one thing you can do is, on your
blog post, you can actually embed code cells from Jovian and outputs and graphs
and anything from a Jovian notebook. And you should check out this tutorial
on how to write a data science blog post. We have a tutorial here, where we write one from scratch; we did it a few months ago, and it will guide you through that process. So that was the machine
learning project from scratch. Not really from scratch because
we had written out an outline, but let's review that outline once again. We started out trying to predict taxi fares for New York City by looking at information like the pickup location and drop location (latitude and longitude), the number of passengers,
and the time of pickup. So we downloaded the dataset by first
installing the required libraries, downloading the data from Kaggle, using
open datasets, looking at the dataset files, seeing that we had 55 million rows
in the training set but just 10,000 rows in the test set, and eight columns. We loaded the training dataset and the test set. We then explored the training set and saw that there were some invalid values, but no missing values; the test set, though, had fairly reasonable ranges. Then, something we could have done is exploratory data analysis and
visualization, which is a good thing to go and do right now to get ideas for feature engineering. A great way to build insight about the dataset is to ask and answer questions, because that'll give you ideas for feature engineering. Then we prepared the dataset for
training by splitting the data into training and validation sets. Then we filled in or removed the missing values; in this case we removed them, and there were no missing values in our sample anyway. Of course, one of the things that we did while loading the training set was to work with a 1% sample, so that we could get through this entire tutorial in three hours, but that also had the unexpected benefit that we could experiment a lot more, very quickly, instead of having to wait tens of minutes for each cell to run. Then we separated out the inputs and outputs, the input columns and the
output columns because that's how machine learning models have to be trained for
the training, validation, and test sets. We then trained a hard-coded model, a model that always predicts the average, and we evaluated it. We made submissions from that model. Sorry, we evaluated it against the validation set and saw that it gives us an RMSE of about 11. We then trained and evaluated a baseline model, a linear regression model, which gave us an RMSE of about 11 as well.
And the learning here was that our features are probably not good enough; we probably need to create new features, because our linear regression model isn't really able to learn much beyond what we can predict by just returning the average. So before you go out and do a lot
of hyper parameter tuning, make sure that your model is actually
learning something better than the brute force or the simple solution. Then we made some predictions and
submitted those predictions to Kaggle and that established a baseline,
which we would then try and beat with every new model that we create. Then when it came to feature engineering,
the low-hanging fruit was extracting parts out of the date: the year, the month, the day, the day of the week, and also the hour of the day. We then added the distance between the pickup and the drop using the haversine distance; we found a function online that we just borrowed.
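The borrowed function is essentially an implementation of the haversine formula; a vectorized version looks roughly like this. The column names match the competition data, while the function and new column names are just illustrative.

```python
import numpy as np

def haversine_np(lon1, lat1, lon2, lat2):
    """Great-circle distance in kilometers between two sets of points (vectorized)."""
    lon1, lat1, lon2, lat2 = map(np.radians, [lon1, lat1, lon2, lat2])
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = np.sin(dlat / 2.0) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2.0) ** 2
    return 6371 * 2 * np.arcsin(np.sqrt(a))

def add_trip_distance(df):
    df['trip_distance'] = haversine_np(df['pickup_longitude'], df['pickup_latitude'],
                                       df['dropoff_longitude'], df['dropoff_latitude'])
```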
We also added distances of the drop location from popular landmarks, like the JFK airport, Newark airport, LaGuardia airport, and a bunch of other places; you can possibly also add distances from the pickup location. We removed outliers and invalid data. We noticed that there was a bunch of
invalid data in the training set, and we noticed that the test set had a certain range of values for latitudes, longitudes, fares, and so on. So we applied those ranges as filters, so that our model is focused on making good predictions on the test set, which should be reflective of how the model is going to be used in the real world.
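That filtering step looked roughly like the sketch below; the exact bounds are assumptions chosen to roughly match the test set ranges and the New York area, so adjust them to what you observe.

```python
def remove_outliers(df):
    """Keep only rows with plausible fares, coordinates, and passenger counts."""
    return df[
        df['fare_amount'].between(1.0, 500.0) &
        df['pickup_longitude'].between(-75, -72) &
        df['dropoff_longitude'].between(-75, -72) &
        df['pickup_latitude'].between(40, 42) &
        df['dropoff_latitude'].between(40, 42) &
        df['passenger_count'].between(1, 6)
    ]

train_df = remove_outliers(train_df)
```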
We could also have done scaling and one-hot encoding, and that would have helped train the models a little, I'm sure. And then we saw how to save the intermediate data frames and also discussed that we can put them onto Google
drive so that we can separate out our notebooks for exploratory analysis,
feature engineering and training. We then trained and evaluated
a bunch of different models. First, we once again split the inputs and targets, and then we trained a ridge regression model, a random forest model, and a gradient boosting model. For each of these, we did some very quick and dirty hyperparameter selection, but even with that we were able to get to a really good place. We were able to get to the top
40% or so without much tuning. Then we looked at hyperparameter tuning, where we decided that we would tune the most impactful hyperparameter first and then, keeping its value fixed, tune the next most impactful one. And by tuning, we mean picking the value of the hyperparameter where the validation loss is the lowest, where the model has not started to overfit but has still learned a little bit about the data in more general terms. So we tuned the number of trees, max depth, and learning rate, and we ran some experiments here. We saw that all of these parameters
could be further increased. Of course, we are short on time, so we can't really look at going to very deep trees that would take a couple of hours or so to train. But I encourage you to try those out till the point where you start seeing an increase in the validation error. And finally, we picked a bunch of good parameters and we trained a model, and that model was able to put us in the top 30%, which was pretty amazing considering we're still just using one percent of the data. And we looked at how we can save
those model weights to Google Drive. And we also discussed that the model can be trained on a GPU, which would be a lot better when you're working with the entire dataset, so that you don't have to wait for hours to train your model. Of course, it requires some additional work, because you need to install a bunch of libraries and do some setup to make things work. But there's definitely a few
resources that you can check out here. Maybe it could be a topic for
another workshop, where we could talk about training classical machine learning models on a GPU. Finally, we talked about the importance
of documenting and publishing your work. I cannot overstate this. Any work that you do, please document it and publish it, publish it to Jovian. If you're writing a blog post, go to the Jovian blog and check out the Contribute tab here, and you can get your blog post featured there. We share it, not just with the
subscribers of the blog, but it also goes out in our newsletter, which
goes out to over a hundred thousand members of the Jovian community. So it's a great way to get
some visibility for your work and get discovered. So finally, I just want to share some references, and then we'll take a few questions; if you have questions, do stick around. The first one is the dataset: this is the New York City taxi fare prediction dataset, definitely one of the more challenging datasets that you'll find on Kaggle. But with the right approach, you can see that it's all about strategy and faster iteration; you can do a lot with just a little bit of data. If you want to learn shell scripting
a little bit, I'd definitely recommend checking out Missing Semester from MIT, to learn Bash and how to deal with the terminal. Then there's the opendatasets library. This has been developed by Jovian to make it easy to download data from Kaggle, and you can use it in all your projects. All you need to do is specify your Kaggle credentials.
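Using it is essentially a one-liner; it prompts for your Kaggle username and API key and then downloads the competition files into a folder:

```python
import opendatasets as od

od.download('https://www.kaggle.com/c/new-york-city-taxi-fare-prediction')
```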
Then, for exploratory data analysis, do check out this tutorial on building an EDA project from scratch. Again, it's a follow-along kind of tutorial that you can apply to any dataset, just as this entire strategy can pretty much be applied to any dataset from Kaggle; maybe only the specific parts, like the feature engineering, are going to change. Then do check out the course
Machine Learning with Python: Zero to GBMs; that's a useful course if you want to learn machine learning from scratch. And do check out the blog post by Alan Cohn on this particular dataset; it's really useful. There is this experiment tracking
sheet that we talked about; it's very important to stay organized as you try dozens, if not hundreds, of experiments, so that you don't lose track of what the best hyperparameters, the best models, and the best features are. Then, if you want to learn more about datetime components in pandas, you can check this out. There are some more resources about the haversine distance, and here is the RAPIDS project, which builds all these alternative libraries that work directly with GPUs, which we have on Google Colab, fortunately. And if you're looking to
write a blog post, there's again a follow-along tutorial we have on how to write a data science blog post from scratch that you can follow. Next, a few examples of good machine learning projects: these are all projects created
by graduates of the Zero to Data Science Bootcamp that we run. It's a six-month program where you learn data analysis, machine learning, Python programming, and a bunch of analytics tools, build some real-world projects, and then also learn how to prepare for interviews and apply for jobs. So here's one you should check out: Walmart store sales, a great project on forecasting Walmart weekly sales using machine learning. So it's about retail, and it covers all the specific aspects that we have talked about in this table of contents. Here's another, predicting used car prices. One thing that you get to see with machine
learning is how generally applicable it is to so many different kinds of problems. So again, a very interesting model to
check out, and also very well documented; a great project to check out. Here's one about applying machine learning to geology, predicting lithologies using wireline logs. I can't say that I understand
the entire project, but I can definitely see the pieces. The pieces that you can pick up are: defining the machine learning problem, understanding what the inputs are, what the outputs are, what kind of problem it is, what kind of models you need to use, and then going through the process of training good models, experimenting, and staying organized. Here's one about ad demand prediction, predicting whether a certain ad is going to be clicked on. Here's one on financial distress
prediction, predicting whether somebody will face financial distress within the
next year or two and another machine learning project on credit scoring. And I hope you notice a similar
trend that is there across all of these projects, which is how to apply
machine learning to the real world. All of these are on real world
datasets from Kaggle, right? So with that, I want to thank
you for attending this workshop. We are almost running up on three hours. We will take questions, but for those who
want to go, thanks a lot for attending. We are planning to do more workshops
every week or every other week. So do subscribe to our YouTube channel. And of course, if you are learning
machine learning or data science, do go on jovian.ai to sign up, take some of our courses, build some interesting projects, and also share these courses
with other folks who might find it useful. Um, we also have a community discord
where you can come chat with us and we have a community forum as well. And if you are pursuing a career in data
science, definitely talk to us about joining the zero to data science bootcamp. We think it could be a great fit if you're
looking to make that career transition. So that's all I have for you today. Thank you for joining. I will see you next time. Have a good day or good night. Okay, let's take the questions. There is a comment, a question from Partic: really loved the session, understood everything right from creating a project pipeline, feature engineering, saving files in Parquet format, uploading our submission files with descriptions of model parameters, hyperparameter tuning, et cetera. I just had one question, not regarding the session, but how can you find a problem statement that is unique? Okay, how can you find a
unique problem statement? Right. Yeah, so I don't think that there are many unique problem statements out there right now. Even with the datasets that you find online, you will find that many people have created machine learning projects from those, but that should not stop you from working with a dataset, because everyone brings their own perspective. Everyone's going to do their own analysis. Everyone's going to train their own
kind of models, try their own ideas. So you will almost certainly learn a lot from the process, even if there are a hundred other projects on a dataset like New York taxi fare. Where 1,500 people have made submissions, right, out of those 1,500, many people may have trained models for
several days or probably like months and still may not have been able to make
the top 500, but with some smart feature engineering, you might be able to get
to the top 400 in just a couple of days. So it shouldn't stop you
from trying that dataset. And the second thing is about finding good problems. I would say that you should try
and find problems where a lot of people are already working on that
problem, because that is an indication that it's a good problem to solve. So when I came across the New
York taxi fare dataset, I saw that it's a large dataset. I saw that like over a thousand people
had participated in the competition. So that probably means that it's a
very, very interesting problem to solve. So in a somewhat counter intuitive
sense, the more people have tried a particular problem, the more interesting it is, unless it gets to a point where it becomes an instructional problem, one that is taught in courses; for example, the MNIST dataset or the CIFAR-10 and CIFAR-100 datasets are generally used for teaching. And because they're used for teaching,
pretty much everybody goes through creating models for those problems. So you want to pick something that
is not used in some course or some tutorial, which is very popular, but
at the same time is not very, um, like obscure where you don't understand
what the problem statement is. Or even if, whether it is a machine
learning problem somewhere is there's that sweet spot somewhere in between just
like model training, I guess there's that sweet spot somewhere in between where you
find some really good problems, but most important thing you should look at is
independent of whether it's unique or not how much you're going to learn from it. What would be a good reason to
test a split, a test set, to be so small from the training set? Well, um, I believe it could just
be that, since this was Google Cloud that ran the competition, maybe they just wanted to see how much additional benefit you can get with 10 times or a hundred times more data. How much additional juice
can you extract out of it? That's one piece. The second, I guess, could just be that Google simply has so much data. But I don't know why the test set is so small; I don't know, these are all guesses. The next question: can you teach me how to
make a customized dataset? Well, I think that would be a topic for another day, because there is a lot involved. Every time Kaggle works with a company, I know that they spend a lot of time creating the dataset, because on the one hand it should be possible to make predictions using the inputs and the targets; there should be enough signal in the data. But on the other hand, sometimes you introduce something called leakage,
where one or two features might completely end up predicting the outcome. So for classical machine learning, I
would say it's not very easy to come up with your own custom datasets. And of course there's the whole
issue of labeling itself, right? If you have to label all the data yourself, that's going to make things much harder for you. But for deep learning, when you're working
with image recognition problems or with natural language problems, there, it's
a lot easier to create custom datasets. And again, that's a topic for
another day, but, uh, we do have a tutorial on building a deep learning
project from scratch on our YouTube channel that you can check out. So thank you again for joining. You can find us on www.jovian.ai, and I'll see you next time. Thanks and goodbye.