Build a Machine Learning Project From Scratch with Python and Scikit-learn

Captions
Hello, and welcome to this workshop on how to build a machine learning project from scratch. Today we are going to walk through the process of building a machine learning project, and we're going to write some code live. We'll start by downloading a dataset, then processing it, training a machine learning model (in fact, a bunch of different machine learning models), and evaluating those models to find the best one. We will also do some hyperparameter tuning and some feature engineering. Now, before we start, if you're looking to start a new machine learning project, a good place to find datasets is Kaggle, so I just wanted to show you this before we get into the code for today. Kaggle is an online data science community and competition platform, and there are a couple of places on it where you can find good datasets for machine learning projects. The first is competitions. Kaggle has been around for close to 10 years at this point, I believe, and has hosted hundreds of competitions. You can go to kaggle.com/competitions and scroll back all the way, to competitions from 2010 or 2011, I believe, and use those datasets to work on your machine learning projects. For example, the dataset we will be looking at today comes from a competition called the New York City Taxi Fare Prediction challenge. This competition was conducted three years ago by Google Cloud, and the objective was to predict a rider's taxi fare, given information like the pickup location, the pickup date and time, and the number of passengers. You can learn a little bit about the competition, look at the data before downloading it, and also look at a lot of public notebooks shared by other participants. Reading others' notebooks is a great way to learn, and in fact a lot of the techniques we are covering today come from public notebooks. You can also look at the discussions if you have any questions about how to go about doing a certain thing. Now, one of the best parts of Kaggle is that you can actually make submissions to the leaderboard. You can go to "My Submissions" and click "Late Submission", and although you will not rank on the leaderboard, your submission will still be scored and you can see where you land among the entire set of participants. In this competition, for example, over 1,400 teams landed on the leaderboard, and getting anywhere into the top 30 to 40% of a Kaggle competition, even one that has already ended, is a sign that you're probably building really good machine learning models. So that's one place on Kaggle where you can find datasets for machine learning, and you have at least a hundred options to choose from; if you're building a first or second project, I would just go there. Apart from this, on kaggle.com/datasets you can find hundreds of other datasets. One of the things I like to do when I'm searching for datasets for machine learning, especially classical machine learning (as we call it, to differentiate it from deep learning), is to go to the filters, select the file type CSV, and set a minimum limit on the file size. A minimum of 50 MB generally gives you a large enough dataset to work with. Apply those filters, then sort by the most votes.
That leaves us with about 10,000 datasets to choose from. Finally, I put in a query or keyword to filter datasets by a specific domain. So here, for example, are all the datasets related to travel. Since these are sorted by the most votes, somebody has already done a lot of the exploring for you, and you can just look through the first five or ten datasets. Not all of them may be suitable for machine learning, but many are, and you can open up datasets and read their descriptions; many descriptions mention several tasks that tell you how you can apply machine learning. Another thing you can do is go to the Code tab and search for machine learning terms like "random forest", and you can see that people have used the dataset to build machine learning models. So that's another good place to find datasets. You have hundreds of real-world datasets to choose from, because most of these datasets, and most of the datasets in Kaggle competitions, come from real companies that are looking to build machine learning models to solve real business problems. With that context, let's get started. Today we are going to work on this project called New York City Taxi Fare Prediction, which, as I mentioned, was a Kaggle competition a few years ago, and you can learn all about it on the competition page. What you're looking at right now is a Jupyter notebook hosted on the Jovian platform, on my profile. In this notebook there are some explanations and there is some space to write code, and we are going to start writing the code here. Of course, this is a read-only view of the notebook, so to run it you click "Run" and select "Run on Colab". We are going to use Google Colab to run this notebook, because this is a fairly large dataset and we may need some of the additional resources that Google Colab provides. When you go to this link (I'm going to post this link in the chat right now), you will be able to click "Run on Colab", and you may be asked to connect your Google Drive so that we can put this notebook into your Google Drive and you can open it on Colab. Once you're able to run the notebook, you should see this view. This is the Colab platform, colab.research.google.com, a cloud-based Jupyter notebook where you can write code, and any code you execute is executed on Google's servers in the cloud on some fairly powerful machines. In fact, you can go to "Change runtime type" and from there you can even enable a GPU and a high-RAM machine, which I encourage doing if you're using either of these. All right. Whenever you run a notebook hosted on Jovian on Colab, you will see an additional cell of code at the top. This is just some code that you should always run at the beginning, because it connects the Colab notebook to your Jovian notebook, and any time you want to save a version of your Colab notebook to your Jovian profile, you will be able to do that, but you need to run that single line of code first. All right, with that out of the way, let's get started.
So we'll train a machine learning model to predict the fare for a taxi ride in New York City, given information like the pickup date and time, the pickup location, the drop-off location, and the number of passengers. This dataset is taken from Kaggle, and we'll see that it contains a large amount of data. Because this is a short workshop and we're doing all of this live, we'll attempt to achieve a respectable score in the competition using just a small fraction of the data. Along the way, we will also look at some practical tips for machine learning, things you can apply to your projects to get better results faster. I should mention that most of the ideas and techniques covered in this notebook are derived from other public notebooks and blog posts, so this is not entirely original work; nothing ever is. To run this notebook, as I said, just click "Run" and "Run on Colab" and connect your Google Drive. You can also find a completed version of this notebook at this link, which I'm going to drop in the chat in case you need to refer to the code later. Okay. So here's the first tip I have for you, before we even start writing any code: create an outline for your notebook. Whenever you create a new Jupyter notebook, especially for machine learning, fill out a bunch of sections and try to create an outline for each section before you even start coding. The benefit is that this lets you structure the project, organize your thought process into specific sections, and focus on individual sections one at a time without having to worry about the rest. You can see here, if you click on the table of contents, that I have already created an outline in the interest of time, with sections and subsections, and inside the subsections there is also some explanation of what each subsection covers. So here's what the outline of this project looks like. First, we're going to download the dataset. Then we'll explore and analyze the dataset. Then we'll prepare the dataset for training machine learning models. Then we are going to first train some hard-coded and baseline models, before we get to the fancy tree-based and gradient boosting kinds of models, and we'll make predictions and submit predictions from our baseline models to Kaggle; we'll talk about why that's important. Then we will perform some feature engineering, then we will train and evaluate many different kinds of machine learning models, and then we will tune hyperparameters for the best models. Finally, we will briefly touch on how you can train on a GPU with the entire dataset. We will not be using the entire dataset in this tutorial, but you can repeat the tutorial with the entire dataset later. And finally, we are going to talk a little bit about how to document and publish the project online. So let's dig into it, and if you have questions at any point, please post them in the Q&A and we will stop periodically, if possible, to take questions. All right. As I said, for each section it's always a good idea to write down the steps before we actually try to write the code, and Jupyter is great for this because it has Markdown cells where you can write things and modify them as required. So here are the steps: first we will install the required libraries, then we will download the data from Kaggle, then we will look at the dataset files.
We will then load the training set with pandas and then load the test set with pandas. So let's get started. I'm going to install the Jovian library; well, that's already installed, but I'm going to put it in here anyway. We're going to use a library called opendatasets for downloading the dataset, and we're going to use pandas, NumPy, scikit-learn, and XGBoost, and I believe that should be it. So I'm just going to install all of these libraries, and I've added --quiet to hide any outputs from the installation. Now, whenever you're working on a notebook, it's important to save your work from time to time, and the way to save your work is to import the Jovian library by running import jovian and then running jovian.commit. When you run jovian.commit, you will be asked to provide an API key, so I'm going to go to my Jovian profile, copy the API key, and come back and paste it here. What this does is take a snapshot of your notebook at the current moment and publish it to your Jovian profile; as you can see here, this notebook, "nyc-taxi-fare-prediction-blank", was created just now, and this is version 1. Every time you run jovian.commit in your notebook, a new version of the notebook gets recorded, and the benefit of having a notebook on Jovian is that you can share it with anybody. I can take this link and post it in the chat. Of course, you can also make your notebooks private or secret if you'd like to, and you can add topics to your notebooks so that other people can find them. Okay, coming back: download the dataset. We are going to use the opendatasets library, which can connect to Kaggle using your Kaggle credentials and download the dataset from this link for you. Here's how it works. I first import opendatasets as od, and now I can run od.download, and I need to give it a URL. There is a URL for the competition, and I just provide dataset_url. Now, when I run this, opendatasets is going to try to connect to Kaggle using my Kaggle credentials, and the way to provide your Kaggle credentials to opendatasets is to go to kaggle.com, click on your avatar, go to your account, scroll down to the API section, and click "Create New API Token". When you click it, a file called kaggle.json is downloaded to your computer. You then need to take this kaggle.json file, come back to Colab, go to the Files tab, and upload it. Unfortunately, you will have to do this every time you run the notebook, so I suggest downloading the kaggle.json file once and keeping it handy, like I have here on my desktop, so you can upload it whenever you need it. Now, this kaggle.json file downloaded from my Kaggle account has my username and a secret key, so you should never put the secret key into a Jupyter notebook; otherwise somebody else will be able to use your Kaggle account. But within your own Jupyter notebook, when you run od.download, it is going to read the credentials from the kaggle.json file and download the dataset for you. You can see here that this dataset is pretty large, about 1.56 gigabytes, and of course it's a zip file, so after expanding it's going to become even larger.
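As a rough sketch, the setup and download steps described above would look something like this in a notebook cell (the competition URL shown is an assumption based on the competition name; od.download expects the kaggle.json file to already be uploaded):

    # !pip install jovian opendatasets pandas numpy scikit-learn xgboost --quiet

    import opendatasets as od

    dataset_url = 'https://www.kaggle.com/c/new-york-city-taxi-fare-prediction'
    # Reads credentials from the uploaded kaggle.json, then downloads and extracts the data
    od.download(dataset_url)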
It's going to download this dataset to a folder called new-york-city-taxi-fare-prediction, so I'm just going to put that folder name into a variable so that we have it handy when we want to look at the files. On the Files tab here you can see the new-york-city-taxi-fare-prediction folder, and inside the folder there are five files. Now, there was a question: should you try to follow along right now? I would say right now you should probably just watch (you will have a recording of the session), and you should try to follow along with a different dataset later, but it's totally up to you. All right, so now the data has been downloaded and we have this data directory variable, which points to the directory where the data lives. Let's look at the size, the number of lines, and the first few lines of each file. First I'm going to use the ls -lh shell command. This is a shell command, not Python; every time a line starts with an exclamation mark, it is passed directly to the system terminal. So I'm just going to run ls -lh, and I need to access this folder, whose name is stored in our variable. You can pass the value of a Python variable using curly braces: when we put something inside these braces, Jupyter replaces the entire expression with the value of the variable, which is the new-york-city-taxi-fare-prediction folder name. ls -lh on the data directory shows us that this is a total of 5.4 gigabytes of data, and almost all of it is the training set at 5.4 gigabytes; that's a pretty large training set. The test set is just 960 kilobytes, and finally there is a sample submission file (as I mentioned, you can submit predictions on the test set to Kaggle) and some instructions, which we can ignore. Okay, those are the sizes of the files. Let's look at the number of lines in the important files. The way to get the number of lines is the wc -l shell command, and once again I'm going to use the data directory and look at train.csv. There you go: new-york-city-taxi-fare-prediction/train.csv contains 55,423,856 rows, which is a lot of rows. The test set contains 9,914 rows, which is a lot smaller than the training set. Let's also look at the sample submission CSV file: it contains 9,915 rows, just one additional row compared to the test set. That could just be an empty line, so I wouldn't worry too much about it. Finally, let's look at the first few lines of each file; I'm going to use the head shell command this time.
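The file-inspection cells described here amount to something like the following sketch (the data_dir value matches the folder name created for this competition):

    data_dir = './new-york-city-taxi-fare-prediction'

    # Lines starting with ! are passed to the shell; {data_dir} splices in the Python variable
    # !ls -lh {data_dir}                # file sizes
    # !wc -l {data_dir}/train.csv       # number of rows in the training set
    # !wc -l {data_dir}/test.csv        # number of rows in the test set
    # !head {data_dir}/train.csv        # first 10 lines of the training set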
So here we have the first 10 lines of train.csv. Remember, it has 55 million lines, and these are just the first 10. This is a CSV file, so the first row contains the names of the columns and the following rows contain the data. It seems like every row has a unique identifier called the key. Then there is a fare amount for that ride; for example, here the fare amount is 4.5. Then you have the pickup datetime, the date and time of the pickup. Then you have the pickup longitude and the pickup latitude, the geo-coordinates of the pickup, and then the drop-off longitude and drop-off latitude, the coordinates of the drop-off, around -73 and 40. And finally you have the passenger count, which is the number of people who took the ride. Okay, so that's the training data; looks simple enough. Let's look at the test data. The test data looks similar: we have the key (every row has a unique key), and then a pickup datetime, pickup longitude, pickup latitude, drop-off longitude, drop-off latitude, and passenger count. Great. Now, one thing that's missing from the test data is the fare amount, which is what is typically called the target column in this machine learning problem, because, remember, the project is called taxi fare prediction. We need to build a model using the training data and then use that model to make predictions of the fare amount for the test data, and that's why the test data does not have the fare amount. Now, once you make predictions for the test data, you need to put those predictions into a submission file, and here is what a sample submission file looks like. There is a key, and you will notice that these keys correspond exactly, row by row, to the test dataset, and then there is the fare amount, which is supposed to be the prediction generated by your model for the test set. This sample submission file just contains 11.35, the same answer for every test row, but it is supposed to be your model's prediction. You need to create such a file and then download it (so I'm going to download it right here onto my desktop), and then you can come to the competition page, click on "Late Submission", and upload this file containing the key for each row in the test set and your prediction. Then you can make a submission, once it's uploaded, and your submission is going to be scored. The score for this submission is 9.4. Now, what does the score mean? You can check the Overview tab and go to the Evaluation section to understand what the score means. This score is the root mean squared error, which is simply a way of measuring how far your predictions are from the actual values. You are not given the actual fare amounts for the test data, but Kaggle has them, and when you submit the submission CSV file, your predictions are compared to the actual values, which are hidden from you. The differences are calculated, those differences are squared and added together, you take the average of the squared differences, and then you take the square root of that average; that's called the root mean squared error. On average, it tells you how far your predictions are from the actual values. So, for example, our recent submission had a root mean squared error of 9.4, which means our predictions are on average off by about $9.40. We'll see whether that's good or bad in some time, but we definitely want to do better than that.
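As a toy illustration of the metric just described (these numbers are made up, not from the competition):

    import numpy as np

    targets = np.array([4.5, 8.0, 12.5])             # actual fares
    preds = np.array([11.35, 11.35, 11.35])          # a constant prediction, like the sample submission
    rmse = np.sqrt(np.mean((targets - preds) ** 2))  # square the errors, average them, take the square root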
Now, one thing you can do is check your submission against the leaderboard to see where you land. It seems like people have gotten to a pretty good point, where they are able to predict the taxi fare to within about $2.80, and we're at about $9.40, which is pretty high if you ask me, because most taxi rides cost $10, or maybe $10 to $15, so if you're off by nine, your prediction is practically useless. But that makes sense, because right now we've just put in a fixed prediction; we've basically submitted the sample file. So that's how that works. One tip I have for you here is that you should write down these observations. Any time you have an insight about a dataset, write it down and document it so that it's there for you later if you need to come back to it, and again, Jupyter is a great way to do that. Okay, so that's what we did: we downloaded the data from Kaggle using opendatasets, we looked at the dataset files, and we noted down the observations. The training data is 5.5 GB and has 55 million rows; the test set is much smaller, less than 10,000 rows; and the training set has eight columns: key, fare amount, pickup datetime, pickup latitude and longitude, drop-off latitude and longitude, and passenger count. The test set has all columns except the target column, the fare amount, and the submission file should contain the key and the fare amount for each test sample. Okay. Now I'm going to save my notebook at this point; I'm going to save regularly so that I don't lose any work, and you should do that too. Next up, we are going to load the training set and then load the test set (you can check the table of contents if you're ever lost). So here's one tip: when you're working with large datasets, always start with a small sample to experiment and iterate faster. Loading the entire dataset into pandas is going to be fairly slow, and not just that, any operation you do afterwards is going to be slow too. So, just to set up my notebook properly, I'm going to first work with a sample and then maybe come back and work with the entire dataset. We're going to work with a 1% sample, which means we are going to ignore 99% of the training data, but that still gives us about 550,000 rows (1% of 55 million), and I think that should still allow us to create a pretty good model to make predictions for the roughly 10,000 rows of data in the test set. We're also going to ignore the key column, because we don't really need the unique identifier that is present in the training set, and just loading it into memory can slow things down. We are going to parse the pickup datetime while loading the data, so that pandas knows it is a datetime column; pandas has a special way of dealing with those, and informing it up front makes things faster. And we're going to specify data types for particular columns, so that pandas doesn't have to figure them out by looking at all the rows, which again speeds things up significantly. So with that, let's set up the data loading. First, let's import pandas as pd. We are going to use the pd.read_csv function, and we need to provide the file name, so I'm going to provide the data directory plus /train.csv. Then there are some other parameters we can provide. Now, we want to pick a certain set of columns.
So I'm going to provide a value for usecols; that's one. Then we are also going to provide data types, using dtype. And finally, we also want to pick a sample. There are two ways to pick a sample. We can either just pick the first 1%, or the first 500,000 rows, using nrows: if you provide nrows with the value 500,000, that's going to pick the first 500,000 rows for you. Or there's another way to do it, using something called skiprows, where we can provide a function that is called for each row and, based on the row index, tells pandas whether or not to keep the row. I'll show you both. Let's start by putting in usecols. Let me create a variable called selected_cols, and I'm going to put in all the columns here except the key. I'm just going to take this, put it into a string, and split it at the commas, which has the nice effect of giving us a list of columns. There you go. So we are going to use selected_cols. Then I'm going to set the data types, so let's set up a dtypes dictionary: I'm going to grab all of these and use float32 for the data type. Of course, not all of them are float32; we want to use uint8 for the passenger count. So those are the data types, and that's the value of dtype. Now, we could provide nrows equal to 500,000 (you can also write numbers with underscores to make them easier to read) and we would get the first 500,000 rows, but I want a random 1%. For that, I am going to use skiprows and pass in a function called skip_row. It gets the row index, the row number, and here's what we're going to do. Of course, we want to keep the first row, the header, so if the row index is zero, we do not want to skip the row, and we return False. Otherwise, here's a quick trick we can apply. Let's say I want my sample fraction to be 1%, which is just 0.01. I'm going to first import the random module, which can be used to generate random numbers between zero and one. Now, consider the expression random.random() < sample_frac. Because the random numbers are picked uniformly, there is exactly a 1% chance that random.random() is less than the sample fraction, so we should keep the row only when this expression is True, and skip it otherwise. In other words, we should skip the row if random.random() is greater than 0.01, which happens with 99% probability, and keep it the rest of the time. That's what our skip_row function does: for 1% of the rows it returns False, which means keep the row, and for 99% of the rows it returns True, which means skip the row. Okay. Now, one last thing I'm going to do here is call random.seed with a particular value: I'm going to initialize the random number generator with the value 42, so that I get the same set of rows every time I run this notebook. Okay.
So I encourage you to learn more about seeds, and this is going to take a while to run. Oh, sorry, this is not parsed as a datetime by default; we also need to provide parse_dates, a list of datetime columns, separately. Let me do that, and then we'll talk about it. Yep. So I encourage you to learn more about random number seeds, and always fix the seed for your random number generator; that's the next tip. That way you get the same results every time you run the notebook; otherwise, you're going to pick a different 1% each time you run it, and you're not going to be able to iterate that well. Now, there was a question: can you explain the significance of using shell commands for checking the dataset instead of Python? The simple reason is that these files are so large that loading them into Python can itself slow down the process a lot. Normally I would recommend using the os module from Python, but in this case I've recommended shell commands because these files are so large, and shell commands are really good at working with large files. Okay, there's another question: how do we know it's a regression problem? Here we're trying to predict the fare amount, and the fare amount is a continuous number; it can be $2.50, $3.20, $5.70. That is what is called a regression problem. A classification problem is one where you're trying to classify every row into a particular category, for example, trying to classify an insurance application as low risk, medium risk, or high risk. Okay, this is taking a while. In fact, it has been running for a minute and a half, and this is while we're working with just 1% of the data, so you can imagine that with 100% of the data it's going to take a lot longer, and not just this step, but every single step after it. It took about one minute 36 seconds to complete. Okay. Here's an exercise for you: try loading 3%, 10%, 30%, and 100% of the data and see how that goes. All right, let's load up the test set as well. I'm just going to load it with pd.read_csv, using the data directory plus /test.csv, and I'm just going to provide the dtype here, and that's it; I don't think we really need to provide anything else, because the test set is pretty small. Let's look at the training data frame as well; maybe let's print it out here. It's just called df at the moment. There you go: we have the fare amount, pickup datetime, latitudes, longitudes, and all the expected values. The test data frame (we're going to keep the key for the test data frame, because we're going to use it when making submissions) has the pickup datetime, longitude, latitude, drop-off coordinates, and passenger count. Looks great, and I can just commit again. All right, so we're done with the first step, downloading the dataset. That took a while, but we are now well set.
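Put together, the sampled loading described above might look like the following sketch (the column names come from the dataset; helper names like skip_row simply follow the walkthrough):

    import random
    import pandas as pd

    data_dir = './new-york-city-taxi-fare-prediction'
    sample_frac = 0.01  # keep roughly 1% of the 55 million rows

    selected_cols = ('fare_amount,pickup_datetime,pickup_longitude,pickup_latitude,'
                     'dropoff_longitude,dropoff_latitude,passenger_count').split(',')
    dtypes = {
        'fare_amount': 'float32',
        'pickup_longitude': 'float32',
        'pickup_latitude': 'float32',
        'dropoff_longitude': 'float32',
        'dropoff_latitude': 'float32',
        'passenger_count': 'uint8',
    }

    def skip_row(row_idx):
        # always keep the header row; keep ~1% of the remaining rows at random
        if row_idx == 0:
            return False
        return random.random() > sample_frac

    random.seed(42)  # fix the seed so the same 1% sample is picked on every run

    df = pd.read_csv(data_dir + '/train.csv',
                     usecols=selected_cols,
                     dtype=dtypes,
                     parse_dates=['pickup_datetime'],
                     skiprows=skip_row)

    test_df = pd.read_csv(data_dir + '/test.csv',
                          dtype={k: v for k, v in dtypes.items() if k != 'fare_amount'},
                          parse_dates=['pickup_datetime'])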
Let's explore the dataset a little bit. We're just going to do some quick and dirty exploration; we're not really going to look at a lot of graphs, and I'll talk about why. The quickest way to get some information about a data frame is df.info(), and this tells us these are the seven columns, these are the number of entries, these are the data types, and this is the total space it takes in memory. That last one is an important thing to watch as you go to 100% of the dataset: you can imagine that it's going to take a hundred times more memory, roughly 1.5 GB of RAM, and that's why we are using Colab. It also seems like there are no null or missing values, so that's great. Now, another thing you can do is df.describe(), which gives you some statistics for each numerical column. For the fare amount, the minimum value is -$52 and the maximum value is $499. The mean, or average, value is about $11, and the 50th-percentile value is $8.50, so we already know that 50% of rides cost less than about $8, and in fact 75% of rides cost less than $12.50. That gives us a sense of how good our model needs to be: if we're trying to predict the taxi fare and 75% of fares are under $12.50, I want my prediction to be in the plus or minus $3 range, otherwise I'm off by a lot; that's what we'll try to infer. You can also look at the pickup latitude and longitude, the drop-off coordinates, and the passenger counts. Now, there seem to be some issues in this dataset, as is the case with all real-world datasets. The minimum pickup longitude is around -1183, which is just not valid at all; there are no such longitudes, and no such latitudes either, so we may have to do some cleaning. This is just wrong data. There also seems to be a maximum passenger count of 208, which again seems quite unlikely to me; you can see that 75% of the values are 2 or less. So these are things we may have to fix later. One thing that is missing here is the datetime, so let me just grab the pickup datetime and look at the minimum and maximum values. You can see that our dates start on the 1st of January 2009 and end on the 30th of June 2015, so it's about six and a half years' worth of data. Once again, all these observations are noted here: 550K rows as expected, no missing data, the fare amount ranges, the passenger count ranges, and there seem to be some outliers and data entry errors that we may need to deal with. Let's look at the test data now. Nothing surprising here: 9,914 rows of data across these seven columns, no fare amount, and the ranges of values seem a lot more reasonable. The pickup longitudes are roughly between -75 and -72, so that's good, and the passenger count also seems to be between one and six. Now, here's one thing we can do: if our model is going to be evaluated on the test set, which is supposed to represent real-world data, then we can limit the inputs in our training set to these ranges. Anything that is outside the range of the test set can be removed from the training set, and because we have so much data (55 million rows, or 1% of that, which is still a large amount of data), we can later just drop the rows that fall outside the test ranges. Keep that in mind.
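The range-based cleanup mentioned here could later look something like this sketch (the function name and column list are illustrative, not from the workshop code):

    def remove_outliers(df, test_df):
        # keep only training rows whose values fall within the test set's observed ranges
        cols = ['pickup_longitude', 'pickup_latitude',
                'dropoff_longitude', 'dropoff_latitude', 'passenger_count']
        for col in cols:
            lo, hi = test_df[col].min(), test_df[col].max()
            df = df[df[col].between(lo, hi)]
        return df

    # e.g. df = remove_outliers(df, test_df)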
Finally, let's check the pickup datetime minimum and maximum for the test set too, and you see that these values also range from the 1st of January 2009 to the 30th of June 2015, which is interesting because this is the same range as the training set. That's an important point, which we'll use while creating the validation set. All right, so let's commit this. That was quick enough, and we already have a lot of insight. Now, what you should do at this point, or maybe later once you've trained a few models, is create some graphs (histograms, line charts, bar charts, scatter plots, box plots, geo maps, since you have location data here, or other kinds of maps) to study the distribution of values in each column and to study the relationship of each input column to the target column. That's a useful thing to do, not just right now, but also once you've created new features when we do some feature engineering. Another thing you should try to do is ask and answer some questions about the dataset. What is the busiest day of the week? What is the busiest time of the day? In which months are the fares highest? Which pickup locations have the highest fares? Which drop-off locations have the highest fares? What is the average ride distance? And keep going: the more questions you can ask about your dataset, the deeper an understanding of the data you will develop, and that will give you ideas for feature engineering and make your machine learning models a lot better. Having an understanding of the data is very important for building good machine learning models. And if you're looking to learn exploratory data analysis and visualization, you can check out a couple of resources: we have a video on how to build an exploratory data analysis project from scratch, and we also have a full six-week course on data analysis with Python, zerotopandas.com, that you can check out. Now, one tip I would like to share here is to take an iterative approach to building machine learning models: first do some exploratory data analysis, a little bit like we've done without even plotting any charts, then do some feature engineering and try to create some interesting features, then train a model, and then repeat to improve your model. Instead of doing all your EDA for maybe a week, then doing a lot of feature engineering for a month, and then training your model and discovering that most of what you did was useless, use an iterative approach and try to train a model every day or every other day. Okay, so I'm going to skip ahead right now, and maybe I'll do some EDA after we're done with this tutorial. All right, so that was step two; we've made good progress. We've downloaded the data and we've looked at the data, so let's prepare the dataset for training. The first thing we'll do is split training and validation sets. Then we will deal with missing values (there are no missing values here, but in case there were, this is how we'd deal with them), and then we'll extract the inputs and outputs for training. We will set aside 20% of the training data as the validation set: we have about 550,000 rows, and 20% of those will be set aside as a validation set, which will be used to evaluate the models we train on the training data. So the models are trained on the training data, the 80%, and then the evaluation (calculating the root mean squared error) is done on the validation set, the 20%, for which we know the targets, unlike the test set for which we don't. What the validation set allows us to do is
estimate how the model is going to perform on the test set, and consequently in the real world. Okay, so here's the next tip: your validation set should be as similar to the test set, or to real-world data, as possible. And the way you know that is by looking at the root mean squared error on the validation set; you can do that because you can get predictions from your model and you have the actual targets for the validation set, so you can compare those and calculate the root mean squared error. The way you know the validation set is close enough to the test set is when the evaluation metric of the model on the validation and test sets is very close. If the root mean squared error on the validation set is, say, $2, but when you submit to Kaggle the root mean squared error is $9, then your validation set is completely useless, and you're basically shooting in the dark, because you're trying to train different models to do better on the validation set, but the validation set score has no relationship to the test set score. So make sure that your validation and test sets have similar or very close scores, and that an increase in the score on the validation set reflects as an increase on the test set; otherwise, you may need to reconsider how your validation set is created. Now, one thing here is that because the test set and the training set have the same date ranges (the test set lies between January 2009 and June 2015, and the training set also comes from January 2009 to June 2015), we can pick a random 20% fraction of the training set as the validation set. Suppose the test set was in the future; say the training set was data from 2009 to 2014 and the test set was data for 2015. Then, to make the validation set similar to the test set, we should have picked the data for 2014 as the validation set and the data for 2013 and before as the training set. So keep those things in mind; it's very important to create validation sets carefully. To create the validation set, I'm going to import train_test_split from sklearn.model_selection (this is something you can look up; you don't have to remember it), and I'm just going to do train_df, val_df = train_test_split, splitting the original data frame and setting the test size (which in this case is the validation size) to 0.2. And now I can check the length of train_df and the length of val_df to make sure we have the right sizes: we have about 440,000 rows in the training set and a randomly chosen 110,000 rows in the validation set. I'm also going to set random_state to 42, just so that I get the same validation set every time I run the notebook. This is important because your scores may change slightly if you're creating a different validation set each time, and also, if you're comparing models across different validation sets, that can lead to data leakage and so on. So, to fix the validation set that is picked, I'm going to set the random state to 42. Okay, that's one piece.
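In code, that split is just a couple of lines (assuming the sampled data frame is called df, as above):

    from sklearn.model_selection import train_test_split

    # 80/20 split; random_state fixes which rows land in the validation set on every run
    train_df, val_df = train_test_split(df, test_size=0.2, random_state=42)
    print(len(train_df), len(val_df))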
Now, the other thing we need to do is fill or remove missing values. We've seen that there are no missing values in the training data or the test data, but we've only looked at 1% of the data, so it's possible that there may be missing values elsewhere. So here's one simple thing you can do: train_df = train_df.dropna() and val_df = val_df.dropna(). What does this do? It drops all the rows where any of the columns has an empty or missing value from the training and validation sets. You shouldn't always do this, but because we have so much data, and at least so far I've not seen a large number of missing values, I'm estimating that the number of missing values in the entire dataset is going to be less than one or two percent, so it should be okay to drop them. I'm not going to run this right now, but you know what it does. Next, before we train our model, we need to separate the inputs and the outputs, because the inputs and the outputs have to be passed separately into machine learning models. So I'm going to create something called input_cols here, and maybe let's first look at train_df.columns so that we can copy-paste a bit. Now, the input columns are these, but we can't really pass a datetime column by itself into a machine learning model, because it's a timestamp, not a number; we'd have to convert the datetime column, or split it into multiple columns. So I'm just going to use these for now: the latitudes and longitudes and the passenger count. And for the target column, I am just going to use the fare amount. So now we have the input and target columns, and we can create the training inputs: from the training data frame, we just pick the input columns (all this does is select a certain set of columns from the training set), and then we have the training targets, which is the target column from the training data frame. We can view the train inputs here and the train targets here. You can see that we no longer have the fare amount column in the inputs, but we still have all the rows, and in the targets we just have the single fare amount column. Let's do the same for the validation set: the validation inputs are the input columns from val_df, and the validation targets are the target column from val_df. Let's look at the validation inputs: 110,000 rows, which is already a lot larger than the test set, by the way, so that should be good. And here are the validation targets. Finally, the test data frame: remember, it doesn't have the target column, but we still want to pull out just the input columns, which we can use for making predictions, so let's just take the input columns from test_df as the test inputs. There are no targets; there's no fare amount in the test data frame, because that is what we have to predict. So there it is. Not bad; we're making good progress. In under an hour, we have downloaded the dataset, explored it a little bit, and prepared the dataset for training.
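The separation of inputs and targets described above might look like this (column and variable names follow the walkthrough):

    input_cols = ['pickup_longitude', 'pickup_latitude',
                  'dropoff_longitude', 'dropoff_latitude', 'passenger_count']
    target_col = 'fare_amount'

    train_inputs, train_targets = train_df[input_cols], train_df[target_col]
    val_inputs, val_targets = val_df[input_cols], val_df[target_col]
    test_inputs = test_df[input_cols]  # the test set has no fare_amount column to extract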
Now we are going to train some hard-coded and baseline models. So here's the next tip: always, always create a simple hard-coded model (basically a single value, a very simple rule, or some sort of baseline model, something you can train very quickly) to establish the minimum score that any proper machine learning model should beat. I can't tell you how many times I've seen people train models for hours or days, and then the model ends up producing results that are worse than what you could have done with a simple average. That can happen for a couple of reasons: one, you've not actually trained the model properly (you created a really bad model), or two, you made a mistake somewhere in the feature engineering, somewhere in preparing the data, or somewhere in making predictions. So it serves as a good way to test whether what you're doing is correct, and it gives you a baseline to beat. So let's create a simple model. I'm going to create a class MeanRegressor with two functions, and I'm going to make it very similar to a scikit-learn model. The first function is fit: it takes some inputs and some targets, and fit is used to "train" our simple model. Then we define a function called predict, which takes a bunch of inputs and returns some predictions. Here's what I'm going to do: in fit, I'm going to completely ignore the inputs and simply store a value self.mean, which is just targets.mean(), the average value of the targets. And in predict, I'm just going to return that mean for every input. One way to do this is to take the length of the inputs, or inputs.shape[0], and use np.full (let me import NumPy). np.full takes a shape and a value: for example, if I have 10 inputs and I always want to return the value 3, np.full(10, 3) returns ten 3s. So I'm going to return np.full(inputs.shape[0], self.mean). Again, if you take the train inputs, which is a pandas data frame, inputs.shape tells you the number of rows and columns, and shape[0] gives the number of rows, so I get the number of rows from the inputs that were passed in and return self.mean for each of them. So, yes, some object-oriented programming and some fancy NumPy, but ultimately what it's doing is this.
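Here is a minimal sketch of that baseline, pulled together from the steps just described:

    import numpy as np

    class MeanRegressor:
        """Baseline 'model' that always predicts the mean of the training targets."""
        def fit(self, inputs, targets):
            self.mean = targets.mean()  # the inputs are ignored entirely

        def predict(self, inputs):
            # return the stored mean once per input row
            return np.full(inputs.shape[0], self.mean)

    mean_model = MeanRegressor()
    mean_model.fit(train_inputs, train_targets)
    train_preds = mean_model.predict(train_inputs)
    val_preds = mean_model.predict(val_inputs)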
We first create a MeanRegressor model, let's call it mean_model, and then we call mean_model.fit, giving it the train inputs and the train targets; we're now going to "train" this so-called model that always predicts the average. Once we call mean_model.fit, it completely ignores the inputs, takes the targets, and simply calculates their average, a single value, and stores that in the .mean attribute. The average is 11.35; that's the average fare for the taxi rides. Then, when we want to get some predictions, say for the training set, we can call mean_model.predict on the train inputs, and that gives us predictions: it simply predicts the value 11.35 for every row in the training set. Similarly, we can get some predictions for the validation set with mean_model.predict on the validation inputs, and once again it simply predicts the value 11.35 for every row in the validation set. Now we may want to compare these predictions with the targets. How far off is this model? Of course it's going to be way off, because we are just predicting the average. So here are the train predictions and here are the training targets, values like 6 and 3.7 (you can ignore the other column; that's simply the row numbers from the data frame), and we are always predicting 11.35. Now, to tell how badly we are doing, we need to compare these two and come up with some sort of evaluation metric, and that's where we are going to use the root mean squared error, because that is the metric used on the leaderboard. So I'm going to import mean_squared_error from sklearn.metrics, and I'm going to define a function called rmse just to make my life easier. It takes some targets and some predictions, and it returns the mean squared error between the targets and the predictions; to get the root mean squared error, we need to set squared=False in mean_squared_error. All that said and done, I'm now able to get the root mean squared error. We have some training targets, the fare amounts for the training set rows, and we have some training predictions; let's call rmse on them, call the result train_rmse, and print it out. This is the root mean squared error for the training set, which means that on average the predictions of our model, which is always 11.35, are off from the actual value the model should be predicting by about 9. That's pretty bad, because the values we're trying to predict have an average of about 11 and a 75th-percentile mark of about 12; if you're trying to predict values in the range of, say, 10 to 20 and you're off by 9, that's a pretty bad model. And that's expected, because it's just a dumb model, but here's the thing: any model that we train should hopefully be better than this kind of model, that is, it should have a lower RMSE. Let's get the validation RMSE as well, using the validation targets and validation predictions. So our hard-coded model is off by 9.899 on average, which is pretty bad considering that the average fare is 11.35. Okay, great. So that was our hard-coded dumb model.
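A sketch of the metric helper just described, using scikit-learn's mean_squared_error with squared=False as mentioned:

    from sklearn.metrics import mean_squared_error

    def rmse(targets, preds):
        # squared=False turns the mean squared error into the root mean squared error
        return mean_squared_error(targets, preds, squared=False)

    # e.g. for the mean baseline from the previous cell
    print(rmse(train_targets, train_preds), rmse(val_targets, val_preds))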
Next, let's train a very quick linear regression model to see whether machine learning is even useful at this point. I'm going to import LinearRegression from sklearn.linear_model, and I'm going to create a linear model here. That's it; that's pretty much it. You can set a random state for some models to avoid randomization, but no, there is no random state here, so this is it: a linear regression model is just this in scikit-learn. And here's how you fit a model. By the way, we are expecting here that you are already familiar with machine learning, and if you're not, then I highly recommend checking out zerotogbms.com, a practical and coding-focused introduction to machine learning with Python, where we cover all of these topics and all of the models we are looking at today. Okay, so we do linear_model.fit with the train inputs and train targets. Great. And then, once it is fit, we can make predictions: linear_model.predict takes the training inputs and comes up with some predictions for us, and you can look at the predictions here and compare them with the targets. You can see that the predictions are all still close to 11; at least they're different this time, not the same prediction each time, but they're still way off. Let's also get the RMSE on the training targets and train predictions: the RMSE is 9.788, so that's not much better. 9.789 was our average model, and our linear regression is just barely better and still completely useless. Let's get the validation predictions and the RMSE on the validation targets and validation predictions: the root mean squared error here is 9.898, which is just 0.001, less than a cent, better than our average model. Now, at this point you might want to think about why that is the case, and here I would say it is mainly because the training data, which is just geo-coordinates at this point (latitudes and longitudes, et cetera), is not in a format that's very useful for the model. How is the model going to figure out that latitude and longitude are connected, that there's a pickup latitude and a pickup longitude, and that there is some distance between the pickup and the drop-off? All those relationships are very hard for models to learn by themselves, and that is where feature engineering is going to come into the picture. We are also not using one of the most important columns, which is the pickup date and time; fares are very seasonal in terms of month, day of the week, and hour of day. So our data in its current form is not very useful from a machine learning perspective, and we were able to establish that using the hard-coded and baseline models. However, we now have a baseline that all our other models should ideally beat. Now, before we train any further models, we are going to make some predictions and submit those predictions to Kaggle. Here's the next tip: whenever you're working on Kaggle competitions, submit early and submit often. Ideally, you want to make your first submission on day one and make a new submission every day, because the best way to improve your models is to try to beat your previous score. If you're not making submissions, you're not going to figure out whether you're heading in the right direction, whether you have a good validation set, or whether there's anything else you should be doing. On the other hand, if you're making submissions every day, you will have to try to beat your previous submission, and that will force you to move in the right direction. So how do you make predictions and submit to Kaggle? First, you have to make some predictions for the test set. We have the test inputs right in front of us, so all we need to do is pass the test inputs into the linear model (it's trained already using the training set), call predict, and we get some predictions from it. And of course we don't have any targets for these.
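Roughly, the linear regression baseline and its predictions look like this (reusing the rmse helper from above):

    from sklearn.linear_model import LinearRegression

    linear_model = LinearRegression()
    linear_model.fit(train_inputs, train_targets)

    train_preds = linear_model.predict(train_inputs)
    val_preds = linear_model.predict(val_inputs)
    print(rmse(train_targets, train_preds), rmse(val_targets, val_preds))

    test_preds = linear_model.predict(test_inputs)  # predictions to put into a submission file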
So the way we have to evaluate these predictions is by creating a submission file, and the way to create a submission file is this: we first read in the submission data frame, which is the sample submission file. Now, all we do is take the test predictions (here are your test predictions) and simply replace that column of data with them, because, remember, the rows in the submission file correspond one to one to the rows in the test file: the first row of the submission file corresponds to the first row of the test file, and so on. So I'm just going to do something like sub_df['fare_amount'] = test_preds, and now you should see that sub_df has the predictions for the test set; you can see that these are all different values. And now you can save that to CSV: sub_df.to_csv, giving it a file name like linear_model_submission.csv. One thing you need to do when saving a submission file, especially for Kaggle, is to specify index=False; otherwise, pandas is also going to add the 0, 1, 2, 3 row index as an additional column in your file, which you don't want. Okay. And now you will have this linear_model_submission.csv file, which you can download and submit. So our previous submission was giving us 9.409; that was the RMSE. Let's see what this one gives us. Let's click on "Late Submission", go here, and just call this "simple linear model". Let's give that a second; yep, it's uploaded, and the submission is scored 9.407, so not very different. One thing that we can now verify is that our test set metric is close to the validation set metric: remember, the validation set RMSE was 9.89, and the test set metric is 9.4. That's not too different; they're in the same range. It's not that the validation set RMSE was 2 and the test set RMSE was 10, so they're close enough. Of course, the validation set is a lot larger, about 110,000 rows, whereas the test set is just about 10,000 rows, and it's always harder to make predictions on larger unseen data than on smaller unseen data, so that could have an effect, but at least they're close enough for us to work with. Okay, so that's how you make predictions for the test set. Now, the next tip here is to create reusable functions for common tasks. Remember, we said that you should be making submissions every day, and if you need to make submissions every day, you should not be copy-pasting all this code around each time, because that just takes up a lot of mental energy and you make mistakes when you need to change values and so on. So it's always good to create functions like this. We create a function predict_and_submit, which takes a model and a file name, calls model.predict on the test inputs (you could even provide the test inputs as an argument), which gives you the test predictions; then it reads the sample submission with read_csv, puts the test predictions into the fare amount column, and saves the result to the given file name.
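A sketch of that reusable helper, assuming the sample submission file sits in the data directory under the name sample_submission.csv (the exact file name on disk may differ):

    import pandas as pd

    def predict_and_submit(model, test_inputs, fname):
        # generate test-set predictions and write them into a Kaggle submission file
        test_preds = model.predict(test_inputs)
        sub_df = pd.read_csv(data_dir + '/sample_submission.csv')  # data_dir from the earlier cells
        sub_df['fare_amount'] = test_preds
        sub_df.to_csv(fname, index=False)  # index=False avoids an extra row-number column
        return sub_df

    # e.g. predict_and_submit(linear_model, test_inputs, 'linear_sub_2.csv')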
So now we can do the same thing with one call: predict_and_submit, let's give it the linear model, the test inputs, and the file name linear_sub_2.csv, and you can see it does exactly the same thing, but now it's just one line any time you want to generate predictions and create a submission file. So here's linear_sub_2.csv; you can see what this file contains. It shows you a nice preview here, but it's ultimately just a CSV file containing the key and the fare amount. Okay, great. So that brings us to "make predictions and submit to Kaggle"; we're done with that, that was simple enough, and we now have this function that we can use any time to make predictions on the test set. Next, one thing you will want to do is track your ideas and experiments systematically, to avoid becoming overwhelmed with dozens of models, because you're going to be working on a machine learning project for at least a couple of weeks, and probably a couple of months or longer, so you need to keep track of all the ideas you're trying. Here's a tracking sheet we've set up for you; you can just create a copy (go to File, make a copy). Here you can put in ideas, like the kinds of models you want to try, or sample sizes ("try a 10% sample", say), just to keep a list of all the ideas you have whenever you have them. You don't have to try them right away (you probably can't), and you keep a note of what you expect the outcome to be whenever you have an idea, and then, once you try it, what you learned from it. So this is just idea tracking, where you list all your ideas, the potential outcome you expect, and what you learned. Then here you have experiments: each time you train a model, you give it a title and a date, add some notes, note down whatever hyperparameters you want, the type of model, the training loss, the validation loss, the test score (this could be from the Kaggle leaderboard), and a link to the notebook. Every time you save a notebook using jovian.commit — let's say I go here and run jovian.commit — it's going to give me a link. Yep, you can note down that this is version 10 of the notebook, and you can refer back to it. So over time you have all these versions, dozens of versions, and for each version you know exactly what the parameters were. Once you have maybe 30 or 40 models, you can look at the sheet and get a very clear idea of which models are working and which models are not. Okay, let's see if we have any questions at this point before we move ahead. I think we've answered why we are using shell commands. "Can we directly fit the model to the 99% of the remaining data after training with the 1% sample?" I'm not sure I fully understand the question, but what we have done right now is this: there are 55 million rows available for training.
I have simply taken 1% of that, or about 500,000 rows, for training the model, so that we can train things fast. What you would want to do later is, instead of using just 1% of the data, use maybe 10% or 20%, or all 100% of the data at the very end, train a model on the entire dataset, and then make predictions with that trained model on the test set. That should definitely be better than training the model on just 1% of the data. I hope that answers that. Next: "For regression problems we created a model which gives the mean as an output, and then we tried linear regression. What should be our approach for classification problems?" Good question. For classification problems, you can maybe just predict the most common class, or you can predict a random class and go with that. Most common class or random class is what I would suggest for classification problems. Okay, let's see if there are any other interesting questions. "Any specific reason we are using float32 and uint8?" Yep. So I actually looked up how many digits of precision float32 supports, and it supports about seven to eight significant digits, roughly, and that is good enough for longitudes and latitudes. Now, if you just specify float, pandas might pick float64, which takes twice the amount of memory, which can be a problem for large datasets. Similarly with uint8: I looked at the number of passengers, and it seems like that's in the range of maybe one to two hundred. If you just specify int, pandas is going to use int64, which uses 64 bits, but we can actually get away with one eighth of that, just eight bits, because the numbers we're dealing with are fairly small. So these are just techniques to reduce the memory footprint of the dataset. Okay, perfect. And there was a question about prerequisites for this workshop: zerotopandas.com is the prerequisite, but you can also watch this right now, start working on a machine learning project, and learn machine learning along the way.
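Going back to the dtype question for a second, here's a minimal sketch of passing those smaller dtypes to pandas when loading the data. The column names follow the taxi fare dataset, but treat the exact mapping as an assumption rather than the notebook's code.

```python
import pandas as pd

# Smaller dtypes roughly halve (or better) the in-memory footprint of the raw CSV
dtypes = {
    'fare_amount': 'float32',
    'pickup_longitude': 'float32',
    'pickup_latitude': 'float32',
    'dropoff_longitude': 'float32',
    'dropoff_latitude': 'float32',
    'passenger_count': 'uint8',
}

train_df = pd.read_csv('train.csv', dtype=dtypes, parse_dates=['pickup_datetime'])
```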
Okay, let's move on to feature engineering. We're halfway through now, so hopefully we'll be able to train a few models. Feature engineering means taking the columns of data you have and performing operations on them to create new columns which might help train better models, because machine learning models are fairly dumb: they assume a certain structure, a certain relationship between inputs and outputs. Linear regression, for example, assumes that the output is a weighted sum of the inputs, and that may not hold true in the current form, which is latitudes and longitudes. But suppose we were able to somehow calculate the distance between the pickup and drop-off points; then there would definitely be some sort of linear relationship between the distance to be covered and the fare. So by creating good features you're going to train much better models, because you've now applied human insight to provide features that are conducive to solving the problem within the structure the model assumes. Now, the tip here is to take an iterative approach to feature engineering. Don't go overboard; don't spend weeks creating features. Just add one or two features, train a new model, evaluate it, keep the features if they help, otherwise drop them, and then repeat with new features. So here are some features we're going to create, and I'm not taking an iterative approach here in the interest of time; I'm just going to create a bunch of features right away. The first is to extract parts of the date. We have totally ignored the date so far, because we didn't know how to put it into a linear regression model, but we can extract things like year, month, day, weekday, and hour. I think these are all useful: over the years, I would assume that fares increase; across months there must be some sort of seasonal trend; across days of the month as well, because maybe there are deliveries or things people have to do at the start or end of the month, like going to the bank; across weekdays, of course, there should be a difference between weekdays and weekends; and there should be differences across hours of the day. So we are going to extract all these parts out of the date. That's one thing we'll do. We will also deal with outliers and invalid data — it's a form of feature engineering, we're kind of cleaning up the data a little bit. We will add the distance between the pickup and drop location, so we'll see how to compute distance using latitudes and longitudes. And we will also add the distance from some popular landmarks, because a lot of people take taxis to get to places where they can't normally drive or park, or places like airports; there are also tolls involved, and tolls are included in the fare, so it might be useful to capture that as well. Okay, we're going to apply all of these together, but you should observe the effect of adding each feature individually. So let's extract some parts of the date. This is really easy, and once again I'm just going to follow my previous advice and create a function for it. Here we have a function add_dateparts that takes a data frame and a column name; it takes the column and creates a new column called column_name underscore year, where it extracts the year from that datetime column. Just to give you a sense of how this works: we have train_df, which is a data frame, and suppose I set col to pickup_datetime; then train_df[col] is just all the pickup datetimes, and if I call .dt.year on it, it's going to give me just the year for every row of data — you can see 2011, 2012, 2015, and so on. I can now save that back to train_df, say as pickup_datetime_year, and another way to write the column name is to just say col + '_year'. So that's what we've done, not just for year but for month, day, weekday, and hour, and we've put that into a function. So I'm just going to call add_dateparts on train_df, and of course we need to give it the column name. Let's ignore the warning for now, and let's do that for the validation set as well — add date parts to val_df — and let's do it for the test set as well. Hmm, it looks like there might be some issue with the test data frame; I'm just going to reload it. I see what happened. This is one of the things about doing things live: there are always some issues. So let's go back to creating the test set.
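Before we fix that, here's roughly what the add_dateparts helper looks like as code, a sketch assuming the datetime column has already been parsed with parse_dates.

```python
def add_dateparts(df, col):
    # Break a datetime column into parts the model can actually use
    df[col + '_year'] = df[col].dt.year
    df[col + '_month'] = df[col].dt.month
    df[col + '_day'] = df[col].dt.day
    df[col + '_weekday'] = df[col].dt.weekday
    df[col + '_hour'] = df[col].dt.hour

add_dateparts(train_df, 'pickup_datetime')
add_dateparts(val_df, 'pickup_datetime')
add_dateparts(test_df, 'pickup_datetime')
```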
Let's see, we have a useful table of contents here; again, this is why it's useful to have a table of contents. When reading the test set, you also want to specify parse_dates equals pickup_datetime so that this column is parsed as a date column. Now let's come back to extracting parts of the date, and let's add them to the training data frame, the validation data frame, and the test data frame. Okay, I think we have it everywhere. So let's check train_df, and you should now see that train_df has not just the fare amount and pickup datetime, et cetera, but also pickup_datetime_year, month, day, weekday, and hour, and you can verify the same for val_df and test_df as well. There you go. So with that, we have added the different parts of the date; we've already done some basic feature engineering. Dates are always low-hanging fruit for feature engineering — you can add more, things like start of quarter, end of quarter, start of year, end of year, weekend versus weekday, et cetera. The next thing we're going to add is the distance between the pickup and the drop location, and to do that, we're going to use something called the haversine distance. There are many formulas for this; the way I found it was just to look up online "distance between two geographical or map coordinates", "longitude latitude distance", something like that. So there is this formula, and it looks something like this: it has an arcsine and a sine squared and a cosine, et cetera. Then I looked up a way to calculate the haversine for pandas — I searched for a fast haversine approximation — and somebody very helpfully created an entire function, which I've borrowed directly here as haversine_distance. It takes the longitude and latitude of the pickup location and the longitude and latitude of the drop location, and it calculates the approximate distance in kilometers between the two points. This uses great-circle geometry, which accounts for the spherical nature of the earth and how latitudes and longitudes are defined, et cetera. We don't have to get into it, but there are resources linked if you want to. The interesting thing here is that this works not just with a single latitude and longitude, but also with an entire series — entire lists of longitudes and latitudes, basically a bunch of rows — and it performs the calculation for each row in a vectorized fashion, because it uses NumPy, so it's going to be very efficient. So we can directly use this to add the trip distance into our data frame: from the data frame we pick the pickup longitude and pickup latitude, then the drop-off longitude and drop-off latitude, and pass them into the haversine function, which computes the distance for each individual row and gives us a trip_distance column. So let's add trip_distance to train_df, to val_df, and to test_df, and now we can look at train_df. Yep, so here you can see there is a trip_distance column, and it's a distance in kilometers.
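Here's a sketch of the kind of vectorized haversine helper that gets shared around for this; the function and column names are assumptions, but the math is the standard haversine formula.

```python
import numpy as np

def haversine_np(lon1, lat1, lon2, lat2):
    """Approximate great-circle distance in km between two points,
    vectorized over pandas Series / NumPy arrays."""
    lon1, lat1, lon2, lat2 = map(np.radians, [lon1, lat1, lon2, lat2])
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = np.sin(dlat / 2.0) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2.0) ** 2
    return 6371 * 2 * np.arcsin(np.sqrt(a))  # Earth radius ~6371 km

def add_trip_distance(df):
    df['trip_distance'] = haversine_np(
        df['pickup_longitude'], df['pickup_latitude'],
        df['dropoff_longitude'], df['dropoff_latitude'])

add_trip_distance(train_df)
add_trip_distance(val_df)
add_trip_distance(test_df)
```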
This one seems like a fairly long trip, this one seems like a shorter trip — 1.3 kilometers — and then there are some trips like 7.1 kilometers. You can already probably tell that the fare for this trip is going to be maybe four or five times the fare for that one; let's see — yeah, the fare here is 18 and the fare there is 3.7, so it's about five times. So there's already a very useful feature for us to use. Next — and this is something I learned by looking at some discussions and notebooks on the competition page — we're going to do a slightly more creative bit of feature engineering: we're going to add the distance from popular landmarks, specifically to check whether the trip ends near one of those landmarks, and especially airports, because airports have tolls. So we're going to add JFK airport, LaGuardia airport, and Newark airport, and we're going to add the locations of Times Square, the Met Museum, and the World Trade Center. There are many more — you could have the Statue of Liberty, Central Park, or a bunch of other locations in New York — so this is something you have to look up. We'll add the distance from the drop location, but feel free to also add the distance from the pickup location; that's left as an exercise for you, and let's see if this helps. And here's the next tip: creative feature engineering, which generally involves some human insight, because no machine learning model — at least not the simple models we are training — will be able to figure out by itself that a certain location is very important, but we can figure that out quite easily given the context we have. So it involves human insight, or external data. Here we have picked the latitude and longitude values for JFK, LGA, et cetera, and this is essentially what's called external data; this data would never automatically become available to the machine learning model. Creative feature engineering is often a lot more effective at training good models than excessive hyperparameter tuning — running lots of grid searches, or training a model for multiple hours or overnight on tens of gigabytes with hundreds of columns. Keep in mind that just one or two good features can improve the model's performance drastically, so focus on finding what those one or two good features are; they can improve performance far more than any amount of hyperparameter tuning. And the way you find these features is by doing exploratory data analysis, by understanding the problem statement better, by reading up on discussions, and by discussing it with people. Also keep in mind that adding too many features is just going to slow down your training, and you won't be able to figure out what's actually useful — that's why iterating is very important. So here are the longitude and latitude for JFK, LGA, Newark, the Met Museum, and the World Trade Center, and once again we can use the haversine distance function: we're going to give it a data frame, a landmark name, and the landmark's lon-lat.
So from the lon-lat we get the longitude and latitude of the landmark, and we create a landmark_name underscore drop_distance column, passing into the haversine function the longitude and latitude of the landmark as the pickup and the longitude and latitude of the drop location as the drop-off. This is the interesting thing about the function: it doesn't have to be all numbers or all series — a couple of the arguments can be numbers and a couple can be series, and it will still work fine. So here is the landmark's location, and here is the longitude and latitude of the drop location; we add the drop distance from the landmark in this fashion. Of course, we need to do it for a bunch of landmarks, so we've created this add_landmarks function that does it for each of the landmarks — this is why creating functions is really useful, because now we can just call add_landmarks on train_df (I guess this argument should just be called df), add_landmarks on val_df, and add_landmarks on test_df. And now if you check the training data frame, it's looking nice: we still have the pickup latitude and longitude, et cetera, but here's where it starts to get interesting — we have the trip distance, the JFK drop distance, the LGA drop distance, the EWR drop distance (this one seems to be near the Met), and the WTC drop distance (this one seems to be at the WTC). So now we have a lot more interesting things here, and I think that's enough feature engineering for now, enough new features, but let's also remove some outliers and invalid data. Remember, if you look at the data frames with df.info and df.describe, and we look at test_df.describe, the test set seems fine; it has fairly reasonable values. You can see the pickup longitudes are in the -75 to -72 range, and similarly for the drop-off longitudes; the pickup latitudes have a minimum around 40 and a maximum around 41, so call it between 40 and 42; and passenger counts are between one and six. Since we are only going to make predictions on these ranges of data, it makes sense to eliminate any training data that falls outside them. And of course the training set also seems to have some incorrect values: -1183 is not a longitude, 208 people cannot fit in a taxi, and even this fare seems high — I wouldn't bet too much on that one, it could possibly be genuine, but there's definitely some issue here. So here's what we're going to do. We're going to limit the fare amount to between one and 500 dollars, which is already mostly the case, except that there are some negative fares, and we don't want our model dealing with negative fares; it just makes things harder. We may sacrifice predictions for one or two rows in the test set, but the overall gain in accuracy will be worth it. Then we're going to limit the longitudes to -75 to -72, the latitudes to 40 to 42, and the passenger count to one to six. So now we have this remove_outliers function, which takes a data frame and picks out the rows that match all these conditions: the fare amount is greater than one — and this is how you combine conditions in pandas while querying or filtering rows — and the fare amount is less than or equal to 500, and the pickup longitude is greater than -75 and less than -72, and the same for the drop-off.
And the pickup latitude is greater than 40 and less than 42, the same for the drop-off, and the passenger count is between one and six. Okay, so that's how we remove outliers. You don't always have to remove outliers: if your model has to deal with outliers in the real world, then you should keep them, but if your model doesn't have to deal with outliers, or you're going to train a different model for outliers, then it makes sense to remove the ranges of values that don't appear in the test data. So let's remove outliers from the training data frame and from the validation data frame; the test data frame doesn't have these outliers, so I wouldn't worry about that. And I would want to save my notebook again. There we go. All right, so now we have done a whole bunch of feature engineering: we've extracted parts of the date, we've added the distance between pickup and drop locations, we've added the distance from popular landmarks, and we've removed outliers and invalid data. Okay, next, here are a couple of exercises for you. You can try scaling the numeric columns into the zero-to-one range — right now all of these numeric columns have different ranges — which generally helps with linear models, or any model where the loss is computed using the actual values of the data. And you can also try encoding categorical columns: things like month, year, even day of week can possibly be treated as categorical columns, and you can probably use a one-hot encoder there, which makes it a lot easier for models to work with categorical columns. We won't do this, for a couple of reasons: one, we just want to keep it simple right now and get to a good model first, and then we can come back and try this later if we have time; and two, tree-based models are generally able to do a decent job even if you don't scale numeric columns or encode categorical columns, assuming the trees can go deep enough or you're training enough trees, which we will try to do. But that's an exercise for you: try scaling numeric columns and encoding categorical columns, and if you don't know what these terms mean, check out the Zero to GBMs course. Now, another tip I have for you: we've spent, what, close to an hour now preparing this data for training, and we still haven't trained our first machine learning model. What you can do is save these intermediate outputs and download those files, or even put them into Google Drive and load them back later when you're working in your next notebook. That way you save all the time of downloading the data, preparing the dataset, and running through all this code; you can start from this point with the pre-prepared data. You may want to do this for the entire dataset once — get those processed files and save them so you don't have to download and repeat this processing on the entire dataset of 55 million rows. And a good format you can use is the parquet format: you could just do train_df.to_parquet. The parquet format is really fast to load and to write, and it also has a very small footprint on storage.
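Pulling the last few steps together, here's a rough sketch of the landmark distances, the outlier filter, and the parquet save. It reuses the haversine helper from earlier; the landmark coordinates are approximate values I've filled in as assumptions, and the bounds follow the discussion above.

```python
# Approximate (longitude, latitude) pairs for the landmarks -- treat as assumptions
jfk_lonlat = (-73.7781, 40.6413)
lga_lonlat = (-73.8740, 40.7769)
ewr_lonlat = (-74.1745, 40.6895)
met_lonlat = (-73.9632, 40.7794)
wtc_lonlat = (-74.0134, 40.7118)

def add_landmark_drop_distance(df, name, lonlat):
    lon, lat = lonlat
    # The haversine helper happily mixes scalars (landmark) with Series (drop location)
    df[name + '_drop_distance'] = haversine_np(
        lon, lat, df['dropoff_longitude'], df['dropoff_latitude'])

def add_landmarks(df):
    landmarks = [('jfk', jfk_lonlat), ('lga', lga_lonlat), ('ewr', ewr_lonlat),
                 ('met', met_lonlat), ('wtc', wtc_lonlat)]
    for name, lonlat in landmarks:
        add_landmark_drop_distance(df, name, lonlat)

add_landmarks(train_df)
add_landmarks(val_df)
add_landmarks(test_df)

def remove_outliers(df):
    # Keep only rows that fall inside the ranges observed in the test set
    return df[(df['fare_amount'] >= 1.) & (df['fare_amount'] <= 500.) &
              (df['pickup_longitude'] >= -75) & (df['pickup_longitude'] <= -72) &
              (df['dropoff_longitude'] >= -75) & (df['dropoff_longitude'] <= -72) &
              (df['pickup_latitude'] >= 40) & (df['pickup_latitude'] <= 42) &
              (df['dropoff_latitude'] >= 40) & (df['dropoff_latitude'] <= 42) &
              (df['passenger_count'] >= 1) & (df['passenger_count'] <= 6)]

# The function returns a new frame, so remember to assign the result back
train_df = remove_outliers(train_df)
val_df = remove_outliers(val_df)

# Save the processed frames so the prep work doesn't have to be repeated
train_df.to_parquet('train.parquet')
val_df.to_parquet('val.parquet')
test_df.to_parquet('test.parquet')
```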
So it's a good intermediate format to use when you know you're going to load it back using pandas; CSV, unfortunately, is a very heavy format because everything has to be converted to a string. Similarly, we can do val_df.to_parquet and test_df.to_parquet, and you can see these files are here; you can download them, or you can even push them to your Google Drive — all of those things are possible. Now, another tip, or a corollary here, is that you may also want to create different notebooks for EDA, feature engineering, and model training: your initial EDA notebook is just where you experiment with different graphs, et cetera; your feature engineering notebook is where you create new features and output these parquet files; and your model training notebooks can simply use the outputs of your feature engineering notebooks. There's a way to connect Google Colab to Google Drive so you can organize all your work well. Okay, so let's now train and evaluate some more models. We'll train three kinds of models — although there are many more you can try — because we have limited time: linear regression (or a form of linear regression called Ridge), random forests, and gradient boosting models. Maybe I should just change this here; I'm going to change this to Ridge, and you can try things like Lasso, ElasticNet, et cetera. But before we train the models, once again we have to create inputs and targets. We've added a bunch of new columns, so let's look at train_df.columns. For the input columns, we're going to skip the fare amount and skip the pickup datetime; I'm still going to keep the pickup latitudes and longitudes, because decision trees might still be able to use them. So those are all our inputs — looks good — and then we have our target column, which is the fare amount. Now we can create train_inputs, which is train_df with just the input columns, and train_targets, which is train_df with just the target column; then val_inputs is val_df with the input columns and val_targets is val_df with the target column; and finally we have test_inputs, which is just test_df with the input columns. Okay, perfect. Then, before we train models, I'm just going to create a helper function to evaluate models. Initially it took the model, the training inputs, and the validation inputs: assuming the model is trained, it first makes predictions on the train inputs, which gives us the train predictions, and then computes the mean squared error between the training targets and the training predictions. But maybe we can just drop those arguments and use the globals, because the model is what's changing most of the time. So we have a function evaluate, which takes a model, gets predictions on the training set, computes the root mean squared error using the training targets and training predictions, then gets predictions on the validation set, computes the root mean squared error using the validation targets and validation predictions, and returns the root mean squared errors for the training and validation sets along with the predictions for both. So now evaluating models is just going to be a single line of code.
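Something along these lines — a sketch that leans on the global train/val variables, as described above:

```python
import numpy as np
from sklearn.metrics import mean_squared_error

def rmse(targets, preds):
    return np.sqrt(mean_squared_error(targets, preds))

def evaluate(model):
    # Relies on train_inputs/train_targets/val_inputs/val_targets from the global scope
    train_preds = model.predict(train_inputs)
    val_preds = model.predict(val_inputs)
    train_rmse = rmse(train_targets, train_preds)
    val_rmse = rmse(val_targets, val_preds)
    return train_rmse, val_rmse, train_preds, val_preds
```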
So let's start with Ridge regression: from sklearn.linear_model import Ridge. Once again, if you want to learn about Ridge regression, you can check out the documentation here, or you can do the Zero to GBMs course. Let's create a model — let's call this model1 = Ridge. There are a few arguments here, so I think we can specify a random state; I'm just going to do random_state=42 so that I get the same result each time. Now, Ridge uses something called regularization in combination with linear regression in the objective function, so you can specify the value of this alpha here — I'd encourage you to play around with it; let's do alpha=0.9, maybe. Ridge also has a bunch of other things you can set — a solver and a few other options — so I'll encourage you to try them out, but that's our model. Let's train it by calling model1.fit with the train inputs; of course, we need to provide the targets as well. So the model is now going to train, and it's going to try to figure out a set of weights that can be applied to the input columns to create a weighted combination that predicts the target value, which is the fare amount. The fare amount is being expressed as some weight w1 multiplied by, let's say, the distance, plus some weight w2 multiplied by the pickup latitude, plus some weight w3 multiplied by the number of passengers, et cetera — a weighted sum that tries to predict the fare — and when we call fit, Ridge figures out a good set of weights. Now we can evaluate the Ridge model by just calling evaluate(model1). Hmm, I'm not sure what the issue here is; let's go step by step and we should be able to figure it out, it doesn't seem like a big issue. Here we have train predictions as model1.predict on the train inputs — yep — and similarly the validation predictions; I'm just going to change this to RMSE. Let's try to get the RMSE here — yep, this works too. Okay, something's wrong with the validation predictions. What did we break here? Ah, I see. Yep. So that's just a quick note on debugging: whenever you have functions you need to debug, a good way is to break them down line by line, identify which line is causing the issue, and then go back to where that variable was created, or whatever the previous step was that led to it. Live debugging is always fun. So let's evaluate the model, and upon evaluation this model gives us 8.0 as the training set RMSE and 8.2 as the validation set RMSE. Well, that's somewhat better than our previous model. I was probably getting a slightly different number earlier, maybe without this alpha; let me try that again — yep, so it's about 8.2. It's somewhat better than our baseline model, not great, but still better. Let me check whether train_inputs still has the same shape — yep, it does. Okay, now remember our predict_and_submit function: we can use this reusable function, give it model1, the test inputs, and a file name. Yep, so those are our predictions for the Ridge model. Let's take this set of predictions and upload them and see what we get.
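To recap this step, the Ridge experiment in code form looks roughly like this; alpha matches the value used above, and the file name is just a placeholder.

```python
from sklearn.linear_model import Ridge

# Ridge = linear regression plus L2 regularization; alpha controls its strength
model1 = Ridge(alpha=0.9, random_state=42)
model1.fit(train_inputs, train_targets)

evaluate(model1)
predict_and_submit(model1, test_inputs, 'ridge_submission.csv')
```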
So let's download the Ridge submission, go back to the late submission page, and upload it. Let's see what that does. Okay, it's almost up — yep, let me just call that Ridge. So 7.72: not bad, not bad at all, that's better than 9.7, so we're getting there, we're getting better. Let's try a random forest next, and of course at this point I would also go and add this to my experiment sheet: I'd note down that Ridge gave me 7.72 on the test set, and around 8.2 on validation, and so on. Okay, let's try a random forest now. I'm going to import from sklearn.ensemble — I believe — RandomForestRegressor, and then model2 is RandomForestRegressor; I'm going to set random_state to 42, and I'm going to set n_jobs to -1 — n_jobs is the number of workers. That reminds me, though: maybe I made a mistake while removing outliers. Yeah, I think I did. When we remove outliers, we return a new data frame, so what we really need to do is train_df = remove_outliers(train_df), and val_df = remove_outliers(val_df). That now actually removes the outliers properly; earlier we did not actually remove them — we simply created new data frames but didn't assign them back. But that gives us an opportunity to see how actually removing the outliers impacts the model. I'm training my Ridge regression model once again, and lo and behold, our model's error went down from around 7.2 to 5.2. So just by limiting the columns of the training data to the ranges of values present in the test data, we're able to train a much better model. And once again we can test this out: I'm going to download this Ridge submission CSV file again. I think this is a great example of how much feature engineering — and data cleaning — can change things: we've gone from around 7.2 to 5.2, roughly a 30% reduction in the error. So let's upload this, click late submission again, and upload the new Ridge file. It's done — yep, and that puts it at 5.15. Let's look at the leaderboard before we go on to the random forest. 5.15 — where does that put us out of the 1,478 entries on the leaderboard? Let's load those submissions quickly and search for 5.15. Okay, so that puts us at about 1,167. We've already moved up almost 300 places from our original submission, which was very low — around 9.8 and 9.4, way down near the bottom at around 1,400 — to 5.15, somewhere around 1,167. So we jumped about 300 places just by doing some good feature engineering. Of course, I say "just", but it has probably taken people several weeks, maybe a month or more, to figure out that these features would be useful. If you can think of more creative features, even with a very simple Ridge regression or a plain linear regression you'll be able to move up even higher, maybe into the top thousand. And let's not forget that we're still only working with 1% of the data, so this can keep getting better. So let's go and try a random forest: I'm going to use a random forest regressor, and in a random forest we train a bunch of decision trees, and each of those decision trees makes a prediction.
And then we're going to average the predictions from all the decision trees. So I'm setting random_state to 42 so that we always get the same set of predictions, and n_jobs=-1 makes sure the trees can be trained in parallel. Then we have max_depth: by default, the decision trees in a random forest are unbounded, but for a very large dataset you may not want unbounded trees, because they may take a very long time and they may also badly overfit the training data. So I'm just going to specify a max_depth of, let's say, 10. And by default, how many trees is it going to train? Let's see what the n_estimators default is — let me train a hundred. And let's just time the training here: I'm going to add a time command to time the cell, and then model2.fit — not train — with the train inputs and train targets. Okay, and while it trains, let's see if we have any questions. First question: would this support parquet, which is a pre-compressed file? Yes, I think pandas would support a pre-compressed parquet file; you can load it back with pandas, though you may have to specify the type of compression, possibly. Second question: shall we use pickle? I'm not sure about pickle here, but parquet works just fine. "If we limit the range of the train data to match that of the test data, aren't we reducing the generalization of the model? For instance, we won't be able to use the same model for a future dataset that might not have the same range as the past data." Absolutely, and that is why having a good test set matters. Your test set should be as close as possible to what your model is going to do in the real world, because otherwise the test set is useless. If we're going to use the model in the real world on data other than the kind of data, or the ranges, present in the test set, then the predictions on the test set are not very indicative — the accuracy of the model is not very indicative. Even if you're getting 90% accuracy on the test set, in the real world the model could be just 30% accurate, and this happens all the time; probably 80% of machine learning models face this issue. So what I would suggest instead is to create the test set in a way that captures the entire range of values the model can encounter in the real world, even if that means coming up with some estimates. I know that may not always be possible, but that's a great question, so thanks. "How do we know the number of landmarks to create?" You don't; you have to try a few, train some models, try a few more, do some exploratory analysis, maybe draw a geographical heat map and see where there's a lot of traffic, et cetera. Okay, so the random forest model has trained; it took nine minutes and 57 seconds. Let's see what this model does. First, let's evaluate it, so let's call evaluate with model2. And the model gets to a pretty good place: it seems like we're down to 4.16. That's not bad at all. 4.16 is the validation RMSE, and the training RMSE is 3.59. Let's make a submission: predict_and_submit with model2, the test inputs, and a file name like rf_submission.csv.
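As a recap, the random forest experiment in code form is roughly the following; the file name is a placeholder.

```python
from sklearn.ensemble import RandomForestRegressor

# 100 bounded trees, trained in parallel; max_depth keeps training time and overfitting in check
model2 = RandomForestRegressor(n_estimators=100, max_depth=10, random_state=42, n_jobs=-1)
model2.fit(train_inputs, train_targets)

evaluate(model2)
predict_and_submit(model2, test_inputs, 'rf_submission.csv')
```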
Okay, so now we have generated a random forest submission; let's download it. We're about two hours in at this point, but making good progress — of course, putting this outline together took more time, but even for a couple of days of work this is not a bad result at all, and remember, we're only using 1% of the data. At this point I might even want to take this random forest configuration and put it into the submission description, so that when I'm looking back at my submissions I can see exactly what this model contains. Interesting thought. Okay, it seems like this might take a minute or two — there's probably some issue here — so why don't we train an XGBoost model in the meantime? The next model we're going to train is a gradient boosting model. A gradient boosting model is similar to a random forest, except that each new tree it trains tries to correct the errors of the previous trees; that's the technique called boosting, and that's what sometimes makes it a lot more powerful than a random forest. We're going to use the XGBoost library: from xgboost import XGBRegressor, and then model3 = XGBRegressor. Let's give it a max_depth of three — let's make it five, maybe; the learning rate looks fine, and n_estimators of a hundred looks fine. We may want to change the objective here to squared error, because we're dealing with the root mean squared error — and again, you can look up the documentation to understand what each of these means. In this case, if you look it up ("XGBoost RMSE objective"), I think reg:squarederror is the one I'm going to use; so that's the objective, or the loss function. Let's maybe change n_estimators to 200, give it a little longer to train, set a random state, and set n_jobs for some parallelism. Okay, let's train the model, and then we'll evaluate it, and of course we'll also call predict_and_submit with model3 and the test inputs; let's call that xgb_submission.csv. Okay, so let's give that a minute to train, and in the meantime let's check on the random forest submission. Okay, so it looks like we got to 3.35. And where does that put us on the leaderboard? 3.35 — let's go down; the top entries are around 2.8, and we're down to 3.35, still with only 1% of the data, and a model that took just a few minutes to train. So we land at around 560 out of 1,478 — that's in the top 40%. That's not bad at all. A top-40% model is actually a very good model, because most of the top submissions were trained for a very long time and also used a lot of ensembling techniques, et cetera. And here's that XGBoost model: it took about 34 seconds, a very quick model — I'm sure we could bump up the number of estimators by a lot more — and it's able to get to 3.98 on the validation set. Is that better than the random forest? Yeah, it seems like it's better than the random forest. So in just about 35 seconds we were able to train the best model so far, probably. So let's go down to the XGBoost submission file, download it, and save it here.
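And the XGBoost experiment, as a sketch in code, with the values discussed above:

```python
from xgboost import XGBRegressor

# Gradient boosting: each new tree tries to correct the errors of the previous ones
model3 = XGBRegressor(n_estimators=200, max_depth=5, learning_rate=0.1,
                      objective='reg:squarederror', random_state=42, n_jobs=-1)
model3.fit(train_inputs, train_targets)

evaluate(model3)
predict_and_submit(model3, test_inputs, 'xgb_submission.csv')
```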
Come back to late submission and put that in here, and I'm just going to add the description as well — I'll paste the XGBRegressor configuration right into it — and let's submit that and see what happens. Perfect, it seems like we've made a submission, and that brings us to 3.20. Let's take a quick look at the leaderboard once again: 3.20. And while that loads, let me also tell you what's coming up next. So far we have just evaluated a bunch of different models, and I'd encourage you to try out a few more — I'm going to commit the notebook here as well — but the next thing we should be doing is tuning some hyperparameters. So let's see, 3.20 — okay, we are up at around 440. Pretty good: 440 out of 1,478; we've hit the top 30% mark already, that's not bad. So the next thing we're talking about is tuning hyperparameters. Now, tuning hyperparameters is unfortunately more of an art than a science. There are some automated tools available, but they typically try a whole bunch of things, like grid search, and take a long time, and you ultimately have to train a lot of models and build some intuition about what works and what doesn't. But I'll try to give you a couple of tools you can use. Here's the first strategy, which is about picking the order in which you tune hyperparameters: you should first tune the most important and impactful hyperparameters. For example, for the XGBoost model, the number of estimators — the number of trees you train — is the most important hyperparameter. For this, you need to understand how the models work reasonably well, at least in an intuitive fashion, if not the entire mathematics behind them; and once you understand these models intuitively, each of them can be described in a single paragraph or maybe two. So for XGBoost, one of the most important parameters is n_estimators, and you tune that first — we'll talk about how — and then, with the best value of the first hyperparameter fixed, you tune the next most impactful hyperparameter, which in this case I believe would be max_depth, and so on. So tune the most impactful hyperparameter and use its best value. What do I mean by "best"? Use the value that gives you the lowest loss on the validation set while still training in a reasonable amount of time — it's a time versus accuracy trade-off. Wherever you feel you're getting the best result on the validation set, use that value, and then all future models you tune should keep that value fixed while you tune the next most impactful hyperparameter. So let's say the best value for the number of estimators is 500: keeping that 500 fixed, you tune the next most impactful hyperparameter, like max_depth; then, keeping max_depth fixed, you tune the next most impactful one, and so on — go down four, five, six, seven hyperparameters, and then go back to the top and tune each parameter once again for marginal gains. That's the order: you go through the parameters, get the best value, and move forward. And as I said, it's more an art than a science, unfortunately.
So try to get a feel for how parameters interact with each other, based on your understanding of each parameter and on the experiments you do. Now, in terms of how to tune hyperparameters, there's an image that captures the idea really well, which is the overfitting curve — yeah, this is the one I'm looking for. The idea is that hyperparameters let you control the complexity of the model: certain hyperparameters, when you increase them, increase the complexity, or the capacity, of the model in some sense. For example, if you increase the max depth of the trees, or you increase the number of estimators, you're increasing the capacity of the model — how much it can learn. Let's say you try numbers of estimators like 5, 10, 100, 500, 1,000, 10,000. When you have very few estimators — a very small, very limited model — both the training error and the validation error are pretty high, because your model has very low capacity: it simply doesn't have enough parameters to learn enough about the data. As you increase the model's capacity — increase the number of estimators or, say, the max depth — the model can start to learn more, so the training error starts to decrease and the validation error starts to decrease, up to a point. And then, at a certain point, the validation error starts to increase. This is what's called overfitting. This is where the model, instead of trying to learn the general relationship between the inputs and outputs, starts to memorize specific values or specific patterns in the training data — mostly specific examples, or specific sets of examples — to further reduce the loss. As you make the model more and more complex, by increasing the number of parameters it has or by increasing the max depth, it can memorize every single training input, and that's what decision trees do if you don't bound their depth. When your model gets to that point, it's a very bad model, because all it's good for is regurgitating memorized answers, so any time you give it a new question it completely fails. It's like memorizing answers for an exam versus understanding the concepts: as you go through the material, study, do practice questions, and spend more time, your understanding gets better, but if you get to the point where you're just blindly memorizing all the answers, your understanding may actually get worse, because you won't know how to solve general problems — it's not generalizing well enough. So that sweet spot is what we want to find. And what we've done for you is created a couple of functions: one called test_params, which takes a model class and a set of parameters, trains the model with those parameters, and returns the training and validation RMSE; and another called test_param_and_plot, where you provide a model class, a parameter name, and a set of values for that parameter that you want to test.
And then a list of the other parameters you want to hold constant while varying this parameter. It's going to train a model for each of those values and plot the figure for you; I'll show you in just a second what it does, so don't worry too much about the function code right now (there's a rough sketch of what these helpers look like after this section). Here's what we're going to do: we're going to try and tune the number of trees. What's the number of trees we have here? 200. So I'm going to try to figure out whether we should be increasing or decreasing the number of trees, and the way I'm going to go about this is by calling test_param_and_plot — and let's time that as well. In test_param_and_plot, we have first the type of model we want to train, which is XGBRegressor, then the parameter name we want to vary — the n_estimators parameter — and then the values we want to try: 100, 200 (which we've already tried), and 400; we're just doubling the number and seeing if that helps. And let's set the other parameters: random_state to 42, n_jobs to -1, and objective to reg:squarederror. I'm just going to pass the other parameters through; these are called kwargs, or keyword arguments, so each key inside best_params is passed as an argument to test_param_and_plot, and ultimately gets passed down to the XGBRegressor model. Okay, this is going to take a while, so how about we start filling out some code in the meantime. What we're going to do after this is add to best_params what I think is the best value of n_estimators to use, and then do some experiments with max_depth. What did we start with? We started with a max_depth of five, so maybe let's try three and seven, or three and six. So: test_param_and_plot with XGBRegressor, we want to test max_depth, and we want to test the values three, five, and maybe six — seven may take a long time. Oh, there it is. Okay, so it seems like the number of estimators isn't really making a big difference at the moment. Maybe we should reduce the learning rate and then try changing the number of estimators again — let me change the initial learning rate to 0.05 instead of 0.1 and see if that gives us any benefit. And we can try a max_depth of three, five, maybe seven, give it the best parameters, then add the best value of max_depth, and try the same thing with the learning rate as well: learning rates of 0.05, 0.1, 0.2, and so on. Yeah, so this isn't really doing much; the number of estimators isn't really helping, so no need to worry about it — let's just go with a hundred for now, then try a max_depth of three, five, and seven and see what that does, and then try different learning rates. That's the process; I hope you're getting the idea here.
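Here's a rough sketch of what those two helpers might look like; the exact names, plotting details, and parameter values are assumptions rather than the notebook's code.

```python
import matplotlib.pyplot as plt
from xgboost import XGBRegressor

def test_params(ModelClass, **params):
    """Train a model with the given parameters and return train/val RMSE."""
    model = ModelClass(**params).fit(train_inputs, train_targets)
    return (rmse(train_targets, model.predict(train_inputs)),
            rmse(val_targets, model.predict(val_inputs)))

def test_param_and_plot(ModelClass, param_name, param_values, **other_params):
    """Vary one hyperparameter, keep the rest fixed, and plot the overfitting curve."""
    train_errors, val_errors = [], []
    for value in param_values:
        params = dict(other_params)
        params[param_name] = value
        train_rmse, val_rmse = test_params(ModelClass, **params)
        train_errors.append(train_rmse)
        val_errors.append(val_rmse)
    plt.plot(param_values, train_errors, 'b-o', label='train')
    plt.plot(param_values, val_errors, 'r-o', label='validation')
    plt.xlabel(param_name)
    plt.ylabel('RMSE')
    plt.legend()
    plt.show()

# Example: vary n_estimators while holding the other arguments constant
best_params = {'random_state': 42, 'n_jobs': -1, 'objective': 'reg:squarederror'}
test_param_and_plot(XGBRegressor, 'n_estimators', [100, 200, 400], **best_params)
```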
What you want to do is first train a basic model, with a set of initial parameters you want to start with. Then, for each hyperparameter, try out a bunch of different values: try decreasing it, try increasing it, maybe try five values, maybe seven. Look at the curve that gets created and try to figure out where the best fit is — and the best fit is the point where the validation error is the lowest. Once you put in enough values, you'll see a curve like this, and you want to pick not the point where the two curves are closest, and not the point where the training error is the lowest; you want to pick the point where the validation error is the lowest. One caveat is that sometimes the curve may not be as clean as this: sometimes it sort of flattens out, and if it's flattening out, that means the model is still continuing to get better, but if going from this point to that point is going to take three or four times as long to train, you're probably better off just picking the value where it starts to flatten out, so that you can try more experiments faster. That's something worth thinking about. So here it is: with max_depth, it seems like picking a max_depth of seven would actually give us a much better model, so I'm just going to pick a max_depth of seven, let's say. Then we can try out a bunch of learning rates, pick one based on that, and of course continue tuning the model. You want to try this with all the different parameters, not just these. And here is a set of parameters that works well: here's one where we have 500 estimators, and then a max_depth of — let's go maybe a little bigger, let's try a max_depth of eight — and a slightly lower learning rate, because as you increase the number of estimators, you want to decrease the learning rate. Then there's subsample: for each tree we only want to use 80% of the rows. And there's something called colsample_bytree: for each tree, we only want to use 80% of the columns. These are just a couple more things you can tune — try 0.8, 0.7, the same way we've tried with test_param_and_plot — and see where that takes us. So I'm going to run this model right now: xgb_model_final, and I'm going to fit it to the train inputs and train targets. I'm then going to evaluate the model, and then call predict and submit with xgb_model_final and a new XGBoost submission CSV. I'm just going to let this run and see where that gets us. In my case, I think the last time I trained this it got us to about position 460; I'm hoping we can beat that, maybe get into the top 25 or 26 percent, but it should be somewhere around that point.
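While that runs, here's the final configuration in one place — a sketch where the exact numbers are a starting point from the discussion above, and the file name is a placeholder.

```python
from xgboost import XGBRegressor

xgb_model_final = XGBRegressor(
    n_estimators=500,       # more trees...
    learning_rate=0.05,     # ...paired with a lower learning rate
    max_depth=8,            # deeper trees, following the max_depth experiments
    subsample=0.8,          # each tree sees 80% of the rows
    colsample_bytree=0.8,   # each tree sees 80% of the columns
    objective='reg:squarederror',
    random_state=42,
    n_jobs=-1)

xgb_model_final.fit(train_inputs, train_targets)
evaluate(xgb_model_final)
predict_and_submit(xgb_model_final, test_inputs, 'xgb_tuned_submission.csv')
```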
And again, what's pretty amazing here is that we're still just using 1% of the data; we're throwing away 99% of the training data and never looking at it. The reason we're able to do that is because the test set is really small — just 10,000 rows — and to make predictions on a test set of 10,000 rows, you don't really need 55 million rows. Yes, adding more data, or using the entire dataset, will definitely make the model better, but if you're always working with 55 million rows, then doing what we just did in less than two and a half hours would take you probably a couple of weeks, maybe longer, because one of the things we were able to do while working with a sample was fix errors very quickly, try out new ideas very quickly, and go from thought to action very quickly. If you're working with 55 million rows, every action you take — every cell you run — runs for a couple of hours, and by the time you come back you're tired and you've forgotten what you had in mind. So speed of iteration is very important, and creative feature engineering is very important; hyperparameter tuning is generally just a fairly small step that gives you that last bit of boost, but it's usually not the biggest factor. So let's let that finish and submit. This is looking pretty promising: it has gotten to 3.8. Let's see if that's any better than the best model we had. This is the XGBoost tuned submission; let's submit that. And that's why it's very important to plan your machine learning project well, to iterate, to try as many experiments as quickly as you can, and to track them systematically. It can make the difference between a machine learning project taking months and still not getting to a good result, versus getting to a really good model — something that can be used in the real world — in a matter of hours. Okay, let's see where this gets us. We just submitted, and this got us to about 3.20. Let's check where that puts us on the leaderboard — 3.20. Yeah, I bet it would still be within the top 30% mark, which is pretty good considering this is a single model (most top models on Kaggle use ensembles) and considering our model took, what, about a minute to train — not even ten minutes — and we haven't even fully optimized the hyperparameters yet; there's a lot more we can do there as well. So let's see: 3.20 puts us at position 440, which is within the top 30%. And I encourage you to simply try, say, 2,000 estimators instead of 500 and see what that does. So here are some exercises for you: tune hyperparameters for the Ridge regression and for the random forest, and see what's the best model you can get. Repeat with 3% of the data, then 10%, 30%, and 100% — basically 3x-ing each time, from 1 to 3, 3 to 10, and so on — and see how much reduction in error 3x the data produces, then 10x, then 100x. You'll see that the reduction is nowhere near 100x, but the time taken definitely becomes a lot more. And finally, a last couple of things: you can save the model weights to Google Drive. I'm not going to do this right now, but I'll just guide you to the right place: the way to save model weights is to use this library called joblib.
You simply do from joblib import dump, and then you can take any Python object and dump it into a joblib file. You could dump just the model itself, the XGBoost model, load it back, and use it just like the original XGBoost object. Or you could create a dictionary, put into it the XGBoost model and anything else you need to make predictions, a scaler, an imputer of some kind, whatever, and dump all of that into one joblib file. That's how you save models.

Then there is input/output support for Google Drive in Google Colab. You can mount your Google Drive by writing from google.colab import drive and then drive.mount('/content/drive'). When you do that, your Google Drive shows up under /content/drive in the file browser on the left. Let me try that. I'm not going to run the whole thing right now, but I believe it asks you to take some additional steps: you open a link and enter an authorization code, similar to adding your Jovian API key, and that attaches your Google Drive. Once your Drive is attached, you can take the joblib file you created and put it into your Google Drive.

So now you can have an exploratory analysis notebook; then a feature engineering notebook, which takes the data, adds a bunch of features, and saves those files in Parquet format to Google Drive; then a machine learning notebook, which picks up those files, trains a bunch of models, and writes the best models back to Google Drive; and then an inference notebook, which loads those models from Google Drive and makes predictions on new data, or on individual inputs. That last part is something I would suggest if you hit a wall at some point: take some individual samples from the test set, put them into the model, look at the prediction for each individual input, and see whether it makes sense to you. Just eyeball the predictions and you'll get some more ideas, then you can do some more feature engineering. That's the iterative process you want to follow, and you want to make submissions every day, day after day.

Now, one other thing we've not covered here is how to train on a GPU. You can train on the entire dataset on a GPU to make things faster. There's a library called Dask, and another library called cuDF, or CUDA DataFrame, which can take the data from the CSV file and put it directly onto the GPU; remember that on Colab you also get access to a GPU. Next, you can create training and validation sets and perform feature engineering directly on the GPU, which is going to be a lot faster. And most importantly, the training can be done using XGBoost on the GPU itself, which again is going to be a lot, lot faster, probably orders of magnitude faster. So the entire process of working with the full dataset can be reduced to maybe 10 or 20 minutes of work.
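Before moving on, here's a minimal sketch of the joblib and Colab Drive steps just described. The object and file names are illustrative, and the commented shell command assumes the default Drive mount path.

```python
# A minimal sketch of saving everything needed for inference with joblib and
# copying it to Google Drive on Colab. Object and file names are illustrative.
from joblib import dump, load

artifacts = {
    'model': xgb_model_final,   # the trained XGBoost model
    # 'scaler': scaler,         # add a scaler, imputer, etc. here if you used one
}
dump(artifacts, 'taxi_fare_model.joblib')

# Later, or in a separate inference notebook, load it back and use it as before:
artifacts = load('taxi_fare_model.joblib')
model = artifacts['model']

# On Google Colab, mount your Drive (this asks you to open a link and paste an
# authorization code), then copy the saved file into it:
from google.colab import drive
drive.mount('/content/drive')
# !cp taxi_fare_model.joblib "/content/drive/My Drive/"
```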
Now, Dask, cuDF, and cuML have very similar APIs, very similar functions and arguments, as pandas and XGBoost, but some things are different and have to be done differently; unfortunately it's not a hundred percent compatible API. So I've left you a few resources that you can check out. Specifically, do check out this project by Alan Cohn, one of the members of the Jovian community, who created a model using the Dask library. He used a hundred percent of the data, his model trains in under 15 minutes, I believe, and with that he was able to get to 2.89, which was in the top 6%. Of course, it took him several days to write the code and learn the different things required to do this, but a single model trained in around 15 minutes on the entire dataset placed him in the 94th percentile, the top 6%, of the leaderboard. His notebook is listed here, and it's a good tutorial on how to use Dask. So that's an exercise for you.

And finally, here's the last thing I want you to take away from this workshop: always document and publish your projects online. First, they help improve your understanding. When you have to explain what you've done, there are a lot of things that you've probably just copy-pasted, taken for granted, or not really thought about, which you now have to put into words, and that forces you to think, understand, and fill the gaps in your understanding. Second, it's a great way to showcase your skills. If you write "machine learning" under the skills section of your resume without offering any evidence for it, there is no way somebody is going to believe that you know machine learning, and they don't have the time to interview hundreds of people to figure out what they know. So the best way to offer evidence is to write a blog post that explains your project, and link it from your resume. And the last thing is that as people read your blog posts, or as you share them on LinkedIn or Twitter or wherever, that will lead to inbound job opportunities for you. Recruiters and employers will reach out: "I saw the project you did, it looks pretty interesting, we have a similar problem here at our company, would you be interested in talking?" You won't believe how much easier it becomes to find opportunities if you consistently write blog posts and publish your projects online.

So any project that you're doing, please put it up online. Add some explanations using markdown cells, spend another hour or two cleaning up the code and creating functions, show that you are a good programmer, and publish the Jupyter notebook to Jovian. We've made it really simple because we want you to publish these projects with us: you can run jovian.commit, or you can download the notebook (File > Download as .ipynb) and upload it on Jovian by going to New > Upload notebook. It's really easy. And when you do that, you can share the notebook with anyone, and you can also write blog posts like this one.
And the benefit of a blog post is that you don't have to show the entire code. You can make it much shorter and focus on the bigger narrative, the bigger idea. I think this is a great blog post about the different steps involved and the things that Alan tried, without showing hundreds of lines of code. It's a good summary of the code and a great way to share what you've done with somebody. One thing you can do in your blog post is embed code cells, outputs, and graphs from a Jovian notebook. And you should check out this tutorial on how to write a data science blog post; we wrote one from scratch a few months ago, and it will guide you through that process.

So that was the machine learning project from scratch. Not really from scratch, because we had written out an outline, but let's review that outline once again. We started out trying to predict taxi fares for New York City from information like the pickup and drop latitudes and longitudes, the number of passengers, and the time of pickup. We downloaded the dataset by first installing the required libraries and downloading the data from Kaggle using opendatasets. We looked at the dataset files and saw that we had 55 million rows in the training set but just 10,000 rows in the test set, with eight columns. We loaded the training set and the test set. We then explored the training set and saw that there were some invalid values, but no missing values; the test set, however, had fairly reasonable ranges. Something we could have done, and a good thing to go and do now, is exploratory data analysis and visualization, because asking and answering questions is a great way to build insight about the dataset and get ideas for feature engineering.

Then we prepared the dataset for training by splitting the data into training and validation sets. Then we filled or removed missing values; in this case there were none in our sample. Of course, one of the things we did while loading the training set was work with a 1% sample, so that we could get through this entire tutorial in three hours, and that also had the unexpected benefit that we could experiment a lot more, very quickly, instead of waiting tens of minutes for each cell to run. Then we separated out the input columns and the output column for the training, validation, and test sets, because that's how machine learning models have to be trained. We then trained a hard-coded model, one that always predicts the average, and evaluated it against the validation set: it gave us an RMSE of about 11. We trained and evaluated a baseline model, a linear regression model, which gave us an RMSE of about 11 as well. The learning here was that our features are probably not good enough and we need to create new features, because our linear regression model isn't really able to learn much beyond what we can predict by just returning the average.
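Here's a minimal sketch of the two baselines recapped above: the hard-coded "always predict the average" model and a plain linear regression. It assumes train_inputs, train_targets, val_inputs, and val_targets already exist; the names are illustrative.

```python
# A minimal sketch of the mean baseline and linear regression baseline
# described above. Assumes train/validation inputs and targets already exist.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

def rmse(targets, preds):
    return mean_squared_error(targets, preds, squared=False)

# Baseline 1: always predict the mean fare from the training set
mean_fare = train_targets.mean()
mean_preds = np.full(len(val_targets), mean_fare)
print('Mean baseline RMSE:', rmse(val_targets, mean_preds))

# Baseline 2: linear regression on the raw features
linreg = LinearRegression()
linreg.fit(train_inputs, train_targets)
print('Linear regression RMSE:', rmse(val_targets, linreg.predict(val_inputs)))
```

Any model you build afterwards should beat these numbers; if it doesn't, your features probably aren't informative enough yet.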
So before you go out and do a lot of hyperparameter tuning, make sure that your model is actually learning something better than the brute-force or simple solution. Then we made some predictions and submitted them to Kaggle, and that established a baseline, which we would then try to beat with every new model we create.

Then, when it came to feature engineering, the low-hanging fruit was extracting parts out of the date: the year, the month, the day, the day of the week, and the hour of the day. We then added the distance between the pickup and drop using the haversine distance, with a function we found online and borrowed. We also added distances of the drop location from popular landmarks like JFK airport, Newark airport, LaGuardia airport, and a bunch of other places, and you could possibly also add distances from the pickup location (there's a short code sketch of these steps after this recap). We removed outliers and invalid data: we noticed that there was a bunch of invalid data in the training set, and that the test set had a certain range of values for latitudes, longitudes, and fares, so we applied those ranges as filters so that our model focuses on making good predictions on the test set, which should reflect how the model will be used in the real world. We could also have done scaling and one-hot encoding, and I'm sure that would have helped train the models a little. Then we saw how to save the intermediate data frames, and we discussed that we can put them onto Google Drive so we can separate our notebooks for exploratory analysis, feature engineering, and training.

We then trained and evaluated a bunch of different models. First, we once again split the inputs and targets, and then we trained a ridge regression model, a random forest model, and a gradient boosting model. For each of these we did some very quick and dirty hyperparameter selection, but even with that we were able to get to a really good place, the top 40% or so, without much tuning. Then we looked at hyperparameter tuning, where we decided to tune the most impactful parameter first and then, keeping its value fixed, tune the next most impactful one. By tuning, we mean picking the value of the hyperparameter where the validation error is the lowest: the point where the model has not yet started to overfit but has still learned as much as it can about the data in general terms. So we tuned the number of trees, the max depth, and the learning rate, and ran some experiments. We saw that all of these parameters could be increased further; of course, we're short on time, so we can't really go to very deep trees, which would take a couple of hours to train, but I encourage you to try those out up to the point where the validation error starts to increase. Finally, we picked a bunch of good parameters and trained a model that put us in the top 30%, which is pretty amazing considering we're still using just 1% of the data. And we looked at how we can save those model weights to Google Drive. We also discussed that the model can be trained on a GPU, which would be a lot better when you're working with the entire dataset, so that you don't have to wait for hours to train your model. It requires some additional work, because you need to install a bunch of libraries to make things work, but there are definitely a few resources that you can check out here.
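As promised, here's a minimal sketch of those feature engineering steps: date parts and haversine distances. The column names follow the competition's schema but should be treated as assumptions, and the landmark coordinates in the usage example are approximate.

```python
# A minimal sketch of the feature engineering steps recapped above: date parts
# and haversine distances. Column names and coordinates are assumptions.
import numpy as np

def add_dateparts(df, col='pickup_datetime'):
    """Extract year, month, day, weekday and hour from a datetime column."""
    df[col + '_year'] = df[col].dt.year
    df[col + '_month'] = df[col].dt.month
    df[col + '_day'] = df[col].dt.day
    df[col + '_weekday'] = df[col].dt.weekday
    df[col + '_hour'] = df[col].dt.hour

def haversine(lon1, lat1, lon2, lat2):
    """Approximate great-circle distance in kilometres between two points."""
    lon1, lat1, lon2, lat2 = map(np.radians, [lon1, lat1, lon2, lat2])
    dlon, dlat = lon2 - lon1, lat2 - lat1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 6371 * 2 * np.arcsin(np.sqrt(a))

def add_trip_distance(df):
    """Distance between the pickup and drop-off points."""
    df['trip_distance'] = haversine(df['pickup_longitude'], df['pickup_latitude'],
                                    df['dropoff_longitude'], df['dropoff_latitude'])

def add_landmark_drop_distance(df, name, lonlat):
    """Distance of the drop-off point from a fixed landmark, e.g. an airport."""
    lon, lat = lonlat
    df[name + '_drop_distance'] = haversine(lon, lat,
                                            df['dropoff_longitude'], df['dropoff_latitude'])

# e.g. add_landmark_drop_distance(train_df, 'jfk', (-73.7781, 40.6413))
```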
Training classical machine learning models on a GPU could perhaps be a topic for another workshop. Finally, we talked about the importance of documenting and publishing your work. I cannot overstate this: any work that you do, please document it and publish it, publish it to Jovian. If you're writing a blog post, go to the Jovian blog and check out the contribute tab there; we can feature blog posts, and we share them not just with the subscribers of the blog but also in our newsletter, which goes out to over a hundred thousand members of the Jovian community. So it's a great way to get some visibility for your work and get discovered.

Finally, I just want to share some references, and then we'll take a few questions; if you have questions, do stick around. The first one is the dataset itself: the New York City taxi fare prediction dataset, definitely one of the more challenging datasets you'll find on Kaggle, but with the right approach you can see that it's all about strategy and fast iteration; you can do a lot with just a little bit of data. If you want to learn a bit of shell scripting, I'd definitely recommend checking out Missing Semester from MIT to learn bash and how to deal with the terminal. Then there's the opendatasets library, developed by Jovian to make it easy to download data from Kaggle; you can use it in all your projects, and all you need to do is specify your Kaggle credentials. For exploratory data analysis, do check out this tutorial on building an EDA project from scratch; again, it's a follow-along tutorial that you can apply to any dataset, just as this entire strategy can pretty much be applied to any dataset from Kaggle, with maybe only the specific parts like feature engineering changing. Then do check out the course Machine Learning with Python: Zero to GBMs, a useful course if you want to learn machine learning from scratch, and the blog post by Alan Cohn on this particular dataset, which is really useful. There's the experiment tracking sheet that we talked about, very important for staying organized as you try dozens or even hundreds of experiments, so that you don't lose track of what the best hyperparameters, the best models, and the best features are. If you want to learn more about datetime components in pandas, you can check this out; there are some more resources about the haversine distance; and here is the RAPIDS project, which builds all those alternative libraries that work directly with GPUs, which we fortunately have on Google Colab. And if you're looking to write a blog post, there's again a follow-along tutorial on how to write a data science blog post from scratch.

Finally, a few examples of good machine learning projects. These are all projects created by graduates of the Zero to Data Science Bootcamp that we run, a six-month program where you learn data analysis, machine learning, Python programming, and a bunch of analytics tools, build some real-world projects, and then also learn how to prepare for interviews and apply for jobs. So here's one you should check out: Walmart store sales, a great project on forecasting Walmart's weekly sales using machine learning. It's a retail use case and covers all the specific aspects we've talked about in this table of contents. Here's another one, on predicting used car prices.
One thing that you get to see with machine learning is how generally applicable it is to so many different kinds of problems. So again, a very interesting and very well documented project to check out. Here's one about applying machine learning to geology, predicting lithologies using wireline logs. I can't say that I understand the entire project, but I can definitely see the pieces you can pick up: defining the machine learning problem, understanding what the inputs and outputs are, what kind of problem it is, what kind of models you need to use, and then going through the process of training good models, experimenting, and staying organized. Here's one about ad demand prediction, predicting whether a certain ad is going to be clicked on. Here's one on financial distress prediction, predicting whether somebody will face financial distress within the next year or two, and another machine learning project on credit scoring. And I hope you notice the trend across all of these projects, which is how to apply machine learning to the real world; all of these are real-world datasets from Kaggle.

So with that, I want to thank you for attending this workshop. We're almost running up on three hours. We will take questions, but for those who want to go, thanks a lot for attending. We're planning to do more workshops every week or every other week, so do subscribe to our YouTube channel. And of course, if you're learning machine learning or data science, do go to jovian.ai to sign up, take some of our courses, build some interesting projects, and also share these courses with other folks who might find them useful. We also have a community Discord where you can come chat with us, and a community forum as well. And if you're pursuing a career in data science, definitely talk to us about joining the Zero to Data Science Bootcamp; we think it could be a great fit if you're looking to make that career transition. So that's all I have for you today. Thank you for joining. I will see you next time. Have a good day or good night.

Okay, let's take the questions. There's a question from Partic: "Really loved the session, understood everything right from creating a project pipeline, feature engineering, saving files in Parquet format, uploading submission files with descriptions like model parameters, hyperparameter tuning, et cetera. I just had one question, not regarding the session: how can you find a problem statement that is also unique?" Right. So I don't think there are many unique problem statements out there right now. Even with the datasets you find online, you will find that many people have already created machine learning projects from them, but that should not stop you from working with a dataset, because everyone brings their own perspective. Everyone is going to do their own analysis, train their own kinds of models, try their own ideas. So you will almost certainly learn a lot from the process.
Even if there are a hundred other projects on a dataset like New York taxi fare, where 1,500 people have made submissions, out of those 1,500 many people may have trained models for days or even months and still not made the top 500, but with some smart feature engineering you might be able to get to the top 400 in just a couple of days. So it shouldn't stop you from trying that dataset. The second thing is about finding good problems. I would say you should try to find problems where a lot of people are already working, because that is an indication that it's a good problem to solve. When I came across the New York taxi fare dataset, I saw that it's a large dataset and that over a thousand people had participated in the competition, so it's probably a very interesting problem to solve. So, in a somewhat counterintuitive sense, the more people have tried a particular problem, the more interesting it is, unless it gets to the point where it becomes an instructive problem that is taught in courses, for example the MNIST dataset or the CIFAR-10 and CIFAR-100 datasets. They are generally used for teaching, and because they're used for teaching, pretty much everybody ends up creating models for them. So you want to pick something that isn't used in some very popular course or tutorial, but at the same time isn't so obscure that you don't understand what the problem statement is, or whether it's even a machine learning problem. Just like in model training, there's a sweet spot somewhere in between where you find some really good problems. But the most important thing to look at, independent of whether it's unique or not, is how much you're going to learn from it.

Next question: "What would be a good reason for the test set to be so small compared to the training set?" Well, I believe it could just be that this was Google Cloud running the competition, and maybe they wanted to see how much additional benefit you get from 10 times or 100 times more data, how much additional juice you can extract out of it. That's one piece. The second, I guess, could just be that Google has so much data. But I don't know why the test set is so small; these are all guesses.

Next: "Can you teach me how to make a customized dataset?" Well, I think that would be a topic for another day, because there is a lot involved. Every time Kaggle works with a company, I know they spend a lot of time creating the dataset, because on the one hand it should be possible to make predictions of the targets from the inputs, there should be enough signal in the data, but on the other hand, sometimes you introduce something called leakage, where one or two features end up almost completely predicting the outcome. So for classical machine learning, I would say it's not very easy to come up with your own custom datasets. And of course there's the whole issue of labeling: if you have to label all the data yourself, that's going to make things much harder for you. But for deep learning, when you're working with image recognition problems or natural language problems, it's a lot easier to create custom datasets.
And again, that's a topic for another day, but we do have a tutorial on building a deep learning project from scratch on our YouTube channel that you can check out. So thank you again for joining. You can find us at jovian.ai, and I'll see you next time. Thanks, and goodbye.
Info
Channel: Jovian
Views: 43,530
Id: Qr9iONLD3Lk
Length: 172min 30sec (10350 seconds)
Published: Sat Oct 30 2021