Kaggle's 30 Days Of ML (Day-1): Getting Started With Kaggle

Video Statistics and Information

Captions
hello everyone and welcome to my YouTube channel. In today's video we are going to look at 30 Days of ML, which is a challenge from Kaggle. For the first 15 days you learn something new each day, and in the next 15 days you work on a Kaggle competition. The tagline is "from machine learning beginner to Kaggle competitor in 30 days", and I think it's a very nice opportunity for beginners, so you should definitely do it. We're going to take a look at what's in day one, and we will continue for the next 30 days.

Today I got an email saying I have an assignment: follow the instructions in a notebook to get started with Kaggle, and join the machine learning Discord community. I would definitely encourage you to join the Discord — if you face any problems you can ask there and get solutions quickly, and there are thousands of members. If you have not set up your Kaggle account yet, we are going to do that now. I already have one, but you can set up your own: just go to kaggle.com, click on Sign Up, and get started from there.

Now, this is the notebook you have to follow, and it's quite easy. I will talk a little bit about the different steps mentioned here. First of all, kaggle.com/me will always take you to your Kaggle profile, and when you have not started anything yet, your profile will look empty. There are four different categories — Competitions, Datasets, Notebooks, and Discussions — and you get ranked from Novice to Grandmaster in all four. Competitions are hosted by different companies, and you can team up with other users and compete together. Then come Datasets.
Datasets are where people share different kinds of data. Notebooks are Jupyter notebooks, which we will learn about, and Discussions are about helping each other and progressing together: asking questions and replying to people. You go from Novice to Grandmaster in each of these categories — Novice, Contributor, Expert, Master, Grandmaster — and if you look at kaggle.com/progression, it tells you what it takes to reach each level.

If you have just joined and registered, you are a Novice, and you see a check mark for that in my profile. If I run one notebook or script, I get another check mark; make one competition or task submission, another; make one comment, another; give one upvote, another. That's how you become a Contributor. Then you can become an Expert — in Competitions, Datasets, Notebooks, or Discussions, or all of them together. For that you accumulate medals; for example, in Discussions you get one bronze medal for five upvotes, and the requirements differ per category. Then you have the Master level, with a lot more requirements, and finally the Grandmaster level.

They also explain how you earn medals, so you should definitely go and look at kaggle.com/progression. For competitions, for example, if you're in the top 40% you get a bronze medal, and if you're in the top 10% when there are only 0–99 teams you get a gold medal. With few teams it's difficult; with many teams it becomes relatively easier, but still not easy. You also get some points for participating in a competition.
Those points build your overall competition rank. You get points for everything: making a discussion post, creating a dataset that a lot of people use, creating a Kaggle kernel or notebook, or taking part in competitions — and if you take part in competitions, you will probably end up doing everything else too. There is a point calculation, and keep in mind that points decay over time: if you have, say, 100 points and you stop Kaggling, they will eventually decline toward zero. You can look at the formula if you want; you'll see the points are divided by the square root of the number of teammates, so with many teammates you earn fewer points — but teaming up is still encouraged, because with four teammates your points are only divided by two. Datasets, notebooks, and discussions work similarly: whenever you get an upvote, you get points.

Okay, now getting back to the notebook. Right now you are a Novice: you have just signed up for a Kaggle account and you don't know how anything works, but you have to get to the Contributor level, and it's not very difficult — you just need to follow some steps. We have already looked at your profile (kaggle.com/abhishek, because abhishek is my username, or kaggle.com/me, which takes me to my profile page) and read about the progression system: how it works, the different categories, and the performance tiers. The third step is submitting to an actual competition. Let's do step three first and then go to step four — we could also do it the other way around, but let's try it this way. So: submit to Titanic.
First, you will run a notebook and make a competition submission, following the instructions in another notebook linked from the assignment. Let's take a look at that notebook. Once you're logged into Kaggle you see a lot of things — data, competitions, people discussing all sorts of topics — so don't confuse yourself: look at the 101 competitions first, and Titanic is one of the most popular 101 competitions for beginners.

Let's see what the Titanic competition is about. I have opened the competition page, and since I have already joined, I see that I can submit predictions. If you have not joined and are opening the page for the first time, you will see a Join Competition button. Click it, and it will ask whether you understand the rules and regulations; if you're starting out, you should probably read what the rules say, then click "I understand and accept", and you're in the competition. Take a look at the Overview to learn what the competition is about, and only join if you actually want to.

In this competition you have to predict who survives the Titanic, and you are provided with some data. There is a Data tab, a Code tab, Discussion, Leaderboard, Rules — many different things. First of all, read the Description to see what you have to do. It says: use machine learning to create a model that predicts which passengers survived the Titanic shipwreck.
That's what you have to do. You can read more about the challenge; one sentence says: "In this challenge, we ask you to build a predictive model that answers the question: what sorts of people were more likely to survive? using passenger data." That gives you more information: you are provided with name, age, gender, socio-economic class, and so on, and you have to build a model to predict who survived the Titanic.

Next, go to the Evaluation page to see how your model will be evaluated. Here it's very simple: accuracy. If there are only four samples and you predict all of them correctly, that's 100% accuracy. You only have to predict zero or one — Survived is 1, did not survive is 0 — and the page shows how to format the submission CSV file.

Now let's look at the data; the notebook also talks about it. There are three files: train.csv, test.csv, and gender_submission.csv. train.csv contains the details of a subset of passengers — 891 of them. On the Data page I can click on train.csv and see what it consists of. It has 12 columns: PassengerId; a Survived column (549 deceased and 342 survived); then Pclass, Name, Sex, Age, and a bunch of other columns — the ticket number, the fare (how much the person paid), the cabin number, the port of embarkation. All this information is usually provided on the Data page.
For example, the data dictionary says SibSp encodes family relations — siblings or spouses — and Pclass is socio-economic status: upper, middle, or lower class. The table is quite self-explanatory: Ticket is the ticket number, Fare is the passenger fare, Cabin is the cabin number, Embarked is the port of embarkation (there are three different ports). So we have a training set, a test set, and gender_submission.csv. The training set contains the Survived column; the test set does not, because that's what we have to predict; and gender_submission.csv is a sample submission whose predictions assume that only female passengers survive — a submission based purely on gender.

Going back to our notebook: if the Survived column value is 1, the passenger survived; if 0, the passenger did not. The test data has the same structure as the training data (not the same values), except there is no Survived column.

Now you will start coding. When you're working on Kaggle you don't have to worry about setting anything up — no GPUs to configure, no libraries to install; everything is already done for you. If I want to code with a competition's data, I go to that competition's Code tab and click New Notebook, and it automatically attaches the competition's data to my notebook. Let's try that: it starts a Kaggle kernel — a Kaggle notebook — with some pre-filled code. To run a cell, all you have to do is click the play button, or press Shift+Enter or Ctrl+Enter.
I prefer Shift+Enter because it creates a new cell for me. While the notebook is starting, let's look at a few more things. There is a Data tab showing the input (titanic) with all its files, and whatever output you produce goes into an output directory. You're working in a Python environment, the accelerator is set to CPU — we're not using a GPU or TPU at the moment — and internet access is on, so I can reach the internet from my notebook.

Now let's see what the pre-filled code does. It imports a package called numpy, pandas for reading CSVs, and then os, which ships with Python itself. It walks through every file and directory inside /kaggle/input and prints all the file names: /kaggle/input/titanic/train.csv, test.csv, and gender_submission.csv. Let's remove that walking code and keep just the imports — import statements should usually be at the top of the file when you write a Python script or notebook.

Now we are going to load the training and test data. A CSV file can be read using pandas with pd.read_csv. Following the convention in the tutorial notebook, we call the variable train_data: train_data = pd.read_csv(...), where the argument is the path of the training file, /kaggle/input/titanic/train.csv. Now you have successfully loaded the training dataset.
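A minimal sketch of this loading step — note that the real path /kaggle/input/titanic/train.csv only exists inside a Kaggle notebook, so here we read the same structure (a subset of the real columns, with made-up rows) from an in-memory CSV so the snippet runs anywhere:

```python
import io
import pandas as pd

# On Kaggle the real call would be:
#   train_data = pd.read_csv("/kaggle/input/titanic/train.csv")
# Here we substitute a tiny synthetic CSV with the same layout.
csv_text = """PassengerId,Survived,Pclass,Sex,Age
1,0,3,male,22
2,1,1,female,38
3,1,3,female,26
"""
train_data = pd.read_csv(io.StringIO(csv_text))

print(train_data.head())   # first rows of the frame
print(train_data.shape)    # (3, 5)
```

The test set loads the same way (test_data = pd.read_csv(".../test.csv")); it simply lacks the Survived column.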
You can take a look at it by just writing train_data in a cell and pressing Shift+Enter (or the play button). If you only want a few samples, do train_data.head(), which shows the first five rows. Similarly, load test.csv into test_data and do test_data.head(). You will notice a lot of NaN values — "not a number", i.e. missing values. You will encounter missing values quite a lot when working on machine learning problems; we won't discuss them on day one, but maybe in the future.

Whenever you press Shift+Enter a new cell is created, and each cell has some tools — to delete or cut the cell, for example — that you can explore. The notebook also mentions that we use a Python module called pandas; if you don't know what pandas is, don't worry, and if you're eager to learn more, just google "pandas python". It will take you to the pandas website, which has a Getting Started guide. It's a really cool library — almost everyone uses it — and something you should know about.

We are still on step three, so I go back to the Titanic tutorial. It says: before making your first submission, explore a pattern — and remember what the sample submission assumes.
The sample submission, gender_submission.csv, assumes that all female passengers survived and all male passengers died — is this a reasonable guess? To check, we compute how many passengers survived and died for each sex. Let me copy the snippet from the tutorial: train_data.loc[train_data.Sex == 'female']["Survived"] selects the Survived column for the rows where Sex is female. If I print this, I get an index column (which you can ignore) and values that are zeros or ones. Taking the sum adds up all the ones — the number of women who survived — and dividing by the number of women gives a fraction; multiplied by 100, that's 74.2% of women survived.

This can be done in many different ways. train_data is a pandas DataFrame and .loc is one way to index into it; alternatively, train_data[train_data.Sex == 'female'] gives me all the rows where Sex is female, and .Survived then gives the Survived column — same result, 74.2%.

Now do the same thing for men. The tutorial calls the variable rate_women, so let's use the same convention: copy everything, change 'female' to 'male', and call it rate_men. Then we print the percentage of women who survived and the percentage of men who survived.
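The two rate computations can be sketched like this — using a tiny made-up frame in place of the real 891-row training data, so the numbers below are illustrative, not the real 74.2% / 18.89%:

```python
import pandas as pd

# Stand-in for the Titanic training frame (real data has 891 rows).
train_data = pd.DataFrame({
    "Sex": ["female", "female", "female", "male", "male"],
    "Survived": [1, 1, 0, 0, 1],
})

# Select the Survived column for one sex, then survivors / total.
women = train_data.loc[train_data.Sex == "female"]["Survived"]
men = train_data.loc[train_data.Sex == "male"]["Survived"]

rate_women = sum(women) / len(women)
rate_men = sum(men) / len(men)

print("% of women who survived:", rate_women)  # 0.666...
print("% of men who survived:", rate_men)      # 0.5
```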
At first it complains that it can't find rate_women, because we did not run that cell — so run everything from the beginning (Shift+Enter through to the end, or Run → Run All). The percentage of men who survived is 18.89%. This does tell us something, and this kind of work is known as EDA — exploratory data analysis. You can do this sort of analysis on different columns of the dataset on your own and see what makes sense.

Now we move to the next part: your first machine learning model, on your very first day. It's called a random forest — a really good model that has been used in industry and in research for a long time, and is still being used. You don't have to worry if you don't know what a random forest is on day one; you can google it and there's a lot to read. But let's look at the explanation and the diagram here.

The model is constructed of several trees — decision trees. Say this is one tree: it starts from the feature Sex, and it says if Sex is female, SibSp is less than 3, and Parch is less than 4, then Survived is 1; following the same path, if Parch is greater than or equal to 4, then Survived is 0. Decision trees are very simple, rule-based machine learning models — simple if/else. A random forest combines many different trees: not literally into one tree, but by taking results from all of them.
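The "combine the trees" step can be sketched in a few lines — here with hypothetical per-tree predictions for four passengers, just to show the majority-vote mechanics:

```python
import numpy as np

# Three trees each predict Survived (1) or not (0) for four passengers.
tree_preds = np.array([
    [1, 0, 1, 0],  # tree 1
    [1, 1, 1, 0],  # tree 2
    [0, 0, 1, 1],  # tree 3
])

# Majority vote: predict 1 when more than half the trees vote 1,
# i.e. when the mean over trees exceeds 0.5.
ensemble = (tree_preds.mean(axis=0) > 0.5).astype(int)
print(ensemble)  # [1 0 1 0]
```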
It takes the results of many different trees and, in the most basic way of understanding it, gives you a simple average. I create one tree based on some features and another tree based on some other features — we use subsets of the features, starting from a different feature each time — and then we combine the predictions of the trees. If tree one says Survived is 1, tree two also says 1, and tree three says 0, we predict Survived is 1, because the majority of trees vote for Survived = 1. That's known as majority voting, and models that combine several smaller models into one large model are called ensemble models. Don't be scared of the term; we'll come back to it in the near future.

Now we choose four columns: Pclass, Sex, SibSp, and Parch — if we go back to our data, those are the four columns we've chosen. We import RandomForestClassifier, which comes from another very cool machine learning library called scikit-learn; google it if you don't know it, though you probably do already. I have added the import to the first cell, so I have to rerun that cell.

My target variable here is Survived, so I take train_data["Survived"] as the target; looking at it, I see zeros and ones (the first column is just the index — ignore it). Now I have my target variable, and next I choose which features to train the model on: Pclass, Sex, SibSp, and Parch.
To select only those features, I can write train_data[features]; run that, and you get the training data restricted to the listed features. One thing to remember: Sex contains strings, and you cannot train your model on strings, so you need to convert them to numbers. We can use another pandas function, pd.get_dummies, on the subset of features: it creates dummy columns wherever required, so instead of the column Sex we now have Sex_female and Sex_male, each a binary 0/1 variable. This becomes your training input, denoted X. Similarly, create X_test the same way from test_data instead of train_data.

Now you train your random forest model. Look at the parameters we're using: n_estimators, max_depth, and random_state. Don't worry much about parameters at the moment — n_estimators is the number of trees, max_depth is how deep the trees are, and random_state is just a seed you set; if you don't set it, your results will have randomness in them. Instead of clf, let's call the variable model, fit the model on X and target, and then create predictions on the test set with model.predict(X_test). It runs quite fast, as you can see — we don't have a lot of data — and we get predictions: an array of binary values, zero or one.
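Putting those steps together — get_dummies, fit, predict — looks roughly like this. The frames below are tiny synthetic stand-ins for the real train.csv/test.csv (the real files have 891 and 418 rows), so only the shape of the workflow is meant to match the tutorial:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-ins for the competition's train/test frames.
train_data = pd.DataFrame({
    "Pclass": [3, 1, 3, 1, 2, 3],
    "Sex": ["male", "female", "female", "female", "male", "male"],
    "SibSp": [1, 1, 0, 1, 0, 0],
    "Parch": [0, 0, 0, 0, 0, 0],
    "Survived": [0, 1, 1, 1, 0, 0],
})
test_data = pd.DataFrame({
    "Pclass": [3, 1],
    "Sex": ["male", "female"],
    "SibSp": [0, 1],
    "Parch": [0, 0],
})

features = ["Pclass", "Sex", "SibSp", "Parch"]
target = train_data["Survived"]

# get_dummies turns the string column Sex into 0/1 columns
# Sex_female / Sex_male; numeric columns pass through unchanged.
X = pd.get_dummies(train_data[features])
X_test = pd.get_dummies(test_data[features])

model = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=1)
model.fit(X, target)
predictions = model.predict(X_test)
print(predictions)  # an array of 0/1 values, one per test row
```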
Now we create a DataFrame out of these predictions. I take the same IDs — test_data.PassengerId — and you have to keep the order of the IDs the same as the order of the predictions so that they match; you don't want to assign Survived to a passenger whose predicted probability of surviving was low. Then you save the output to a CSV, say my_submission.csv, with index=False, because that's how Kaggle expects your submission CSV to look. Save it, and if I look at the output I can see PassengerId and Survived for each row.

Once you're done exploring and everything runs in sequence, click Save Version and choose "Save & Run All". It runs everything, and you wait — the tutorial says the same thing: Save & Run All, and then you make your first submission.

After a while it's successful, and I can click "Open in Viewer". Remember that Kaggle gives you a certain quota for GPUs; we're not using a GPU, so it's fine, but since I'm done with the notebook I click the power-off button anyway to save some resources. Here is the generated notebook: I take a look — Survived or not — and there is the output. Click on Output and you have a Submit button (it appears because the notebook is associated with a competition; otherwise there is no Submit button). I click Submit, it asks me to select the notebook version and add a description — "my first submission".
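The submission-writing step can be sketched like this — the PassengerId values and the predictions list below are hypothetical, standing in for test_data.PassengerId and the model's output:

```python
import pandas as pd

# Hypothetical IDs and predictions, aligned row for row.
test_data = pd.DataFrame({"PassengerId": [892, 893, 894]})
predictions = [0, 1, 0]

# Keep PassengerId in the same order as the predictions, and write
# without the index column — the layout Kaggle expects.
output = pd.DataFrame({"PassengerId": test_data["PassengerId"],
                       "Survived": predictions})
output.to_csv("my_submission.csv", index=False)

print(open("my_submission.csv").read())
```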
I click Submit, and then I can click View My Submissions, and it gives me a score: 0.77511 — that's my accuracy. I click on the Leaderboard and try to jump to my position — it's not jumping to my position, oh my god, there are a lot of people today — but I have made my first submission to the Titanic competition, and that's your first step.

As a bonus, I'll tell you how you can improve your score. One thing you can do is increase the number of trees and the depth — but increasing them is not always going to work. If you increase them by a lot, the model learns the training data perfectly and then does not perform well on test data it has not seen; this is called overfitting, so you don't want to over-optimize your parameters. With that in mind, I'll set n_estimators to 250 and max_depth to 7, and maybe also add another feature, Age. I don't know if it's going to work, but let's see.

I run all the cells (we can also remove the percentage-survived cells), and now it says the input contains NaN — there are missing values in Age. If you have missing values, one thing you can do, in the step where we call get_dummies, is append .fillna(0), which fills any missing value with zero. Since it's an age, filling with zero isn't advisable, so I'll fill with -1 instead — an invalid age. I run everything again and now it works, so let's save a new version.

It's running again, so we wait a couple of minutes or less. Done — I click "Open in Viewer" and close the editor. When a notebook takes a long time to run, you can also leave it running, come back later, and I will show you where to find it.
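That missing-value fix can be sketched like this — again on a tiny synthetic frame with one deliberately missing Age, mirroring the .fillna(-1) chained after get_dummies in the video:

```python
import numpy as np
import pandas as pd

features = ["Pclass", "Sex", "Age"]
train_data = pd.DataFrame({
    "Pclass": [3, 1, 2],
    "Sex": ["male", "female", "female"],
    "Age": [22.0, np.nan, 30.0],  # one missing value, as in the real data
})

# Fill missing ages with -1 ("invalid age") so the model can train;
# fillna(0) would work mechanically, but 0 is a misleading age.
X = pd.get_dummies(train_data[features]).fillna(-1)
print(X["Age"].tolist())  # [22.0, -1.0, 30.0]
```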
Now it's done and I have the output again — maybe something has changed; you can check. It says version 2 of 2, so I have version two. I click Submit and look at my submissions again: despite adding a new feature, my score did not improve. Adding a new feature doesn't always mean improvement; maybe I have to tune some hyperparameters to improve it further. In the first iteration my score was 0.775, and now I have 0.75. You can keep experimenting this way — what if, instead of adding a new feature, I had just increased the number of estimators? That's easy to check: remove everything we added, drop Age (so we no longer fill NaN values), and keep only the changed n_estimators, max_depth, and random_state. That's the only change; let's try it one more time.

In the meantime, we could go to part four of the tutorial, "Learn more" — but no, we have to go to step four of the assignment: make a comment. A lot of people on Kaggle share notebooks: if I click on Code I find different notebooks, which I can sort with different filters, and when I open a notebook I see comments. There is also a section called "Your work", where the notebook we created lives. The third version is ready and successful — I don't need to keep the notebook running — so I can submit the third version and see if it improves anything (if it doesn't, we'll do something else, but not today). Viewing the submission: it's exactly the same as before, so it doesn't change anything. These four features alone can't do everything for us; maybe we need to create some new features — that's called feature engineering, and we'll learn about it in the future.
I was showing you the different notebooks you can find under Code. Go and learn from them — see how other people deal with things. If I open a notebook and like it (there's a lot going on in this one), I can give it an upvote by clicking the arrow. If I want to thank the author, I just use the upvote; if I want to ask something, I can write a comment — for example, "How did you end up choosing a random forest classifier?" — and post it, or respond to someone else's comment. That's writing a comment, and I've already shown you how to give an upvote. Discussions have both upvote and downvote options; notebooks have upvote only. You can also comment on discussions — if you know the answer to what someone is asking, reply, or say you want to know the same thing.

So this was day one, and we have finished the checklist. If you have done everything mentioned in the notebook, as I explained, you will become a Kaggle Contributor today. What's next? We will see tomorrow. That's it for day one — I will see you tomorrow on day two. If you liked this video, don't forget to click the like button, do subscribe to the channel, and tell your friends about it. See you, bye bye!
Info
Channel: Abhishek Thakur
Views: 79,243
Rating: 4.9655042 out of 5
Keywords: machine learning, deep learning, artificial intelligence, kaggle, abhishek thakur, 30daysofml, 30 days of ml, 30 days of machine learning, getting started on kaggle, how to start with kaggle, kaggle progression system, kaggle 101, titanic first submission, create a kaggle account, random forest classifier titanic, kaggle beginner tutorial, 30daysofml day 1
Id: _55G24aghPY
Length: 43min 41sec (2621 seconds)
Published: Tue Aug 03 2021