Beginner Kaggle Data Science Project Walk-Through (Titanic)

Captions
Hello everyone, Ken Jee here, back with another video for you. I've got a little confession to make. I always tell you guys to use Kaggle, to engage with the community, and to do projects there, and admittedly I haven't really been that active on the platform. So I decided it was time to be accountable and actually practice what I preach, and I recently started a new Kaggle profile specifically for this channel. That's going to be linked below in the description, and I'll also link to the actual analysis I go through in this video, so make sure to click on that workbook so you can follow along.

In today's video I'm going to work through the Titanic data set, do some analysis, and actually submit my results to the competition, so you'll be able to see what the process of trying to understand this problem looks like, some of the techniques and algorithms that I use, and how to submit your work. I want to make sure you know this video is focused more on how to think about a data science problem and how to approach one of these projects than on the implementation itself. A lot of people ask "where do I start?" and "how do I know when I'm done?" Those are the things I'll cover in this video.

The last thing I want to say is that it's totally fine to follow along as I go through this, but you definitely want to make sure you cite your sources whenever you publish something on Kaggle. You can fork the notebook and experiment with it, but if you don't do that, you should go to the top of your notebook and use markdown to say where you're getting the analysis or the cells from. I tried to do this one without looking at any additional notebooks; it was just what I was getting from the data and working through it on my own. I'll probably go back and try to improve the results and bring in other people's work, and I'll definitely cite that in the notebook itself. That was just a quick warning: you really want to be careful and make sure you give credit for other people's work in anything you put out. So without further ado, let's jump into the actual analysis.

All right guys, here is my Kaggle profile. As you can see it's really new; I haven't unlocked any achievements yet, but hopefully that will change in the near future. Let's jump into the notebook. We can find the Titanic competition here; it should be one of the ones toward the bottom, and it only shows up at the top for me because I've already participated in it. Before I enter any competition, and before I go through any workbook, I make sure to understand what's actually going on: where the data is coming from, who it's going to be valuable for, and everything along those lines. Reading through the overview, the idea of this data set is that we want to predict who survived the Titanic disaster. They give us a training set, we train our model, and then we try to predict which of the people in the test set survived the crash. It's a little morbid, but it's still a good use case drawn from actual history. The overview has pretty much everything you'd want to know, how to submit and so on. What's really relevant for me is the data, so if you click on the Data tab you can see how to submit, what the test data looks like, and what the columns look like (there are 11 columns), and you can also get some information about the individual variables.
For example, the number of siblings on board might be useful, and so might the number of parents on board; these are all very relevant things. I always want to make sure, before I do any feature engineering and before I start experimenting with the data, that I have at least a little understanding of what the data encompasses.

Next, let's actually move into my notebook. If I wanted to create a new one I would click here. I'm not going to go through this live, because some of the training takes quite a bit of time, and also maybe I'm a little nervous to do it live yet. Instead I'm going to go through one of the workbooks that I already worked through, but I'll still talk about some of the places I got stuck and some of the challenges that I had, and again I really hope this is helpful to you guys.

So I think this is it. We always talk about how to write about your work, and while this isn't a perfect case study, the way that I use markdown and the way that I use comments should hopefully be very clear to you. As you can see, I talk about what I plan to do with the data, I talk about what my best results were with the analysis I used, and I walk through each of the steps that eventually got me to submitting my results. We'll go into this a little bit, but one thing to note is that I actually write the overview at the end: I go through, do all my outlining, and then go back and make sure the way I went about it makes sense to me logically. For this competition they just give you the data; you include the data set at the top of the notebook and read it in with a short script.

Before I write anything and before I do any project, I just like to think about the problem in light of the actual data that we have, so these are some comments I write out before I run almost every analysis. I go through these steps. First, I want to understand the data: what data types are we working with, are they numeric, are we working with categoricals, what are some of the trends in the data, what are the averages, how many missing values do we have? Next, I look at histograms and box plots; this helps you understand the trends in the data. We might see, for example, that for Fare there are a lot of people who didn't pay anything. Is that something we have to dive into further? The following step is to look at the value counts for the categoricals. You can't really do histograms for categorical variables, so value counts give you a nice little bar chart to see which categories people fall into. After that, I explore the missing data a little bit more and start thinking about whether I'd want to remove the missing data or impute it. Following that, we start thinking about model building, and we look at the correlation between the different metrics: we see which things are related to each other and which are related to the dependent variable. That's especially important in a regression problem; this is clearly a classification problem, though, since we're trying to predict a yes or no: yes they survived, no they did not. Going forward, I just brainstorm a couple of things that I want to understand better. Did wealthy people survive? Did the cabin they were in affect their survival rate? Did age affect their ticket price? Is "young and wealthy" maybe a variable we want to explore? And is there anything else worth digging into later? I actually don't think I went through and analyzed all of these yet, so in a follow-up maybe I'll go through and see whether those actually affect the model that we have.
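To make that checklist concrete, here is a minimal sketch of the kind of first-pass commands described above. The file paths follow the standard Kaggle Titanic input layout, and the DataFrame names `training` and `test` are just illustrative, not necessarily what the notebook uses:

```python
import pandas as pd

# Read the competition files (paths assume the usual Kaggle Titanic input layout)
training = pd.read_csv('/kaggle/input/titanic/train.csv')
test = pd.read_csv('/kaggle/input/titanic/test.csv')

# Shape, data types, and non-null counts: a quick feel for what we're working with
print(training.shape)
training.info()

# Summary statistics: averages, spread, percent of samples that survived, etc.
print(training.describe())

# Missing values per column (Age and Cabin stand out)
print(training.isnull().sum())
```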
After that, I wanted to do some feature engineering, so I go through and make some new variables, and then we pre-process the data so that we can use it in our models. After that we start the model building: we think about whether we want to scale our data and whether that's relevant, and I'll talk about some of the assumptions there. Now, there are really two things in my mind that affect how your model performs. The first, which I think is probably the most important, is the data that goes in: the better the data you have, the better your model is going to perform. The other way is to tune your model and mess around with parameters, but I've found that you usually get the biggest jumps when you include better data, and that's where feature engineering is really important. I think that's probably the best way to get a couple percent increase in performance, even over model optimization.

So let's jump right into the light data exploration. What I like to do is run a couple of quick little commands to help us understand the shape of the data and those types of things. As I mentioned before, we have our training data set, which we've isolated here, and we want to understand the data types and also the null values. We can see that Age has quite a few null values and Cabin has quite a few null values, so we're going to want to start thinking early about how we're going to manage that. We'll go through and look at the differences further down, but this is just a good starting point. The next thing I always do is use the describe command, and this shows us things like the standard deviation, what percent of the samples survived, the average age of the passengers, the average family size, and so on. These all start helping us think about the data differently; we start making associations, we're reserving judgment for now, and we just keep collecting questions as we go.

When we do this I think it's useful to break the data into the numeric variables and the categorical variables: the numeric ones are things we want to understand with a histogram, and the categorical ones are things we want to understand with value counts. With this line of code I just make histograms for all of the numeric variables and plot the column name up top. Age follows a fairly normal distribution, while the siblings count does not, and neither does the parents count; the fare also does not follow a normal distribution. Fare, for example, might be something we want to normalize, because there's such a spike at a very low price. Some of the other columns have few enough distinct values that it probably wouldn't matter, and age is already fairly normally distributed, so we don't really have to think about normalizing it. If we were going to set up scaling, we would want to normalize these first and then scale them.

Now let's look at some correlations. As we can see, the number of parents and the number of siblings on board are fairly correlated, so families tend to travel together, and a few pairs show a clearly negative correlation. These correlations help us understand the different relationships in our data. If we were using regression this could be really important, because we want to avoid multicollinearity, which is when two variables are too highly correlated and end up having an outsized effect on the model.
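Here is a rough sketch of that numeric/categorical split, the histograms, and the correlation check, again assuming the `training` DataFrame from the earlier sketch; the exact column groupings are an assumption on my part:

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Numeric columns get histograms; categorical columns get value counts
df_num = training[['Age', 'SibSp', 'Parch', 'Fare']]
df_cat = training[['Survived', 'Pclass', 'Sex', 'Embarked']]

# One histogram per numeric variable, titled with the column name
for col in df_num.columns:
    plt.hist(df_num[col].dropna())
    plt.title(col)
    plt.show()

# Bar charts of the value counts for the categoricals
for col in df_cat.columns:
    counts = df_cat[col].value_counts()
    sns.barplot(x=counts.index, y=counts.values)
    plt.title(col)
    plt.show()

# Correlations between the numeric variables (watch for multicollinearity)
sns.heatmap(df_num.corr(), annot=True)
plt.show()
```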
The next thing we do is look at how survival rates differ across these different groups: for the people who survived, what was the average age, the average price they paid, the average family size, and the average number of children? We haven't tested whether these differences are statistically significant, but we can start to see that younger people might have a better chance of surviving, and so might people who paid more, so maybe the rich survived, which is a kind of sad story here. If you're a kid and your parents are on board, they might put you first, and if you're a child with siblings you might have a slightly lower chance of surviving. These are all things we want to take note of and think about when we're building our models.

Next we do something very similar to the histograms, but with the categorical variable data. We look at the survival counts (0 means did not survive, 1 means survived) and how many people are in each class. There are fewer people in two of the classes; the large group is probably the general tickets and the smaller ones are probably first class, but actually we should look and see what those codes are, so let's go back to the data page. We can see that for class, 1 is first class, 2 is second class, and 3 is third class. Going down, the value counts for ticket and cabin are kind of unintelligible, but we can see roughly the distribution of fares across the classes and cabins, and then we can see where people embarked from: Cherbourg, Queenstown, and Southampton are the three locations, and we can evaluate how much of an impact all of these things have.

After we've done that, we build a pivot table for each of these categoricals and compare it to our dependent variable, which is whether people survived (a short code sketch of these comparisons appears below). We can see that by class, proportionally a lot more people in first class survived than in the other classes, which suggests there's some value in understanding this, and it will probably be relevant in our model. The next one is men versus women, and it looks like the saying "women and children first" actually applied in this scenario. Finally, for embarkation we don't really see anything too relevant; maybe people who got on at one location have a slightly higher chance of surviving.

All right, now moving on to the feature engineering. We saw that for ticket and cabin there are just a ton of distinct values, and there are only eight hundred or so samples in the training set, so if we have too many columns that really doesn't work well with this amount of data. We want to simplify some of this through feature engineering. If we look at the actual cabin data, we see that an entry is basically a letter followed by a number, and that's what a cabin is. I wanted to separate these into individual cabins, and to do that I used a little bit of string splitting: we just split on spaces, and for the first feature I wanted to see whether a passenger had multiple cabins. Some people did have multiple cabins, but it looks like the vast majority did not, and we can see what the survival rate looks like across those groups. The next thing I wanted to look at is the letter of the cabin they were in: you would expect that cabins with the same letter are in roughly the same location, maybe on the same deck, or whatever the grouping might be.
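For the pivot-table comparisons mentioned a little earlier (survival versus class, sex, and port of embarkation), a minimal sketch might look like this, again with the `training` DataFrame assumed:

```python
import pandas as pd

# Average of the numeric variables for survivors vs. non-survivors
print(pd.pivot_table(training, index='Survived',
                     values=['Age', 'Fare', 'SibSp', 'Parch']))

# Survival counts broken out by class, sex, and port of embarkation
for col in ['Pclass', 'Sex', 'Embarked']:
    print(pd.pivot_table(training, index='Survived', columns=col,
                         values='PassengerId', aggfunc='count'))
```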
Again, I used a very simple regular expression to strip this out. Here 'n' stands for NA: if you recall, almost three quarters of the data was null for this variable, but we can actually treat that null value as its own category. If they don't have data for the cabin and we mark that, it might tell us something different, or something additional, about those passengers, so we don't just have to drop those rows and not use them; we can use this as a feature and maybe it can help us. It probably won't, but it's at least worth experimenting with. As you can see, a lot of the people in the null column did not survive, while people who actually did have a recorded cabin had a much higher survival rate, unless they were in A. So I think we can comfortably use the cabin letter as a categorical variable here, and that might give us a little more insight. It takes us from having fifty or a hundred different individual cabins down to fewer than ten or so categories.

The next thing we want to look at is the tickets. Each ticket number was either unique or very close to unique, but some had letters and some had just numbers, and I thought that might be telling; it's probably related to where they embarked from and things like that, but it was worth experimenting with. So I included a variable for that: if the ticket is just a number it gets a 1, and if there is some text involved it gets a 0. As you can see, I also explored all of the different ticket lettering conventions and didn't think any of them were common enough that we'd actually want to include them, so I removed those and kept the simpler variable. I probably could have aggregated some of them; I think CA and a few of the others could have been combined together, but I didn't really know what all of those prefixes were for, and if I wanted to go further in the analysis this is something I would look at. The next thing I wanted to do was see if this had any impact, and it looks like the survival ratios are pretty similar, so there wasn't really any value added from it, but again, it's worth experimenting. Something I want to make very clear is that when you're doing this, it is experimental. A lot of the things you try are not going to work, but that makes it so much more exciting when one of the things you try does work. So we experimented with the ticket letters here, and there wasn't anything too relevant. In some of the models we're going to use, it makes sense to just throw in as much data as you can and the model will sift through it: random forests and decision trees split on the variables with the largest differences first, so if there's nothing really significant there, the model probably won't split on that feature at all, or at least not early.

The next thing is that I wanted to look at the individual passengers' names, because I thought this might give us a little more information than just whether they were male or female. As you can see, there are doctors on board, Reverends, Majors, a Lady, and a Countess, and if someone is royalty or something like that, they might have a higher chance of surviving. For a lot of these titles the samples are very small, so I would have liked to aggregate them a bit more, but for the sake of time and for this data set I decided to keep it as is.
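As a rough illustration of the cabin and ticket features described above, here is a sketch in pandas. The column names `cabin_multiple`, `cabin_adv`, and `numeric_ticket` are illustrative placeholders, and `training` is the DataFrame from the earlier sketches:

```python
import pandas as pd

# Number of cabins listed per passenger (a missing cabin counts as zero)
training['cabin_multiple'] = training['Cabin'].apply(
    lambda x: 0 if pd.isna(x) else len(x.split(' ')))

# First letter of the cabin; str(nan) starts with 'n', so missing values
# become their own 'n' category instead of being dropped
training['cabin_adv'] = training['Cabin'].apply(lambda x: str(x)[0])

# 1 if the ticket is purely numeric, 0 if it contains letters
training['numeric_ticket'] = training['Ticket'].apply(
    lambda x: 1 if x.isnumeric() else 0)

# Survival rate by engineered cabin letter
print(training.groupby('cabin_adv')['Survived'].mean())
```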
There aren't too many distinct titles in the end. Again, I did this with some very simple regular expressions, or really just some basic Python string commands. If I wanted to really try to improve this, I'd group some of these titles together: maybe working people like doctors and Reverends, or military people, should be grouped together. It would also be interesting to see whether the captain went down with the ship or not; I think that's an idiom people talk about quite a bit.

So that's the data exploration, and again this isn't a very in-depth analysis; it's just enough to get moving on the model building part. I probably could have spent twenty hours on the data exploration alone, and that's kind of what makes a project unique: you can go down these different avenues and start thinking about these different questions, like the ones we just talked about, whether the captain survived or whether royalty had a higher rate of survival, and you can make some really cool visuals around that as well. I didn't want to focus this one as much on the exploration; I just wanted to go through a whole pipeline to get you familiar with the process.

Next we go into the actual pre-processing for the models. None of these models handle null data well, so we want to drop the null values from Embarked, and we also only want to include relevant columns, so I only included the ones we feature engineered or that were fine as they were. Next we have to transform all of this data, and I used pandas get_dummies. What that means is that when you have a categorical variable, say the cabin letter, where we have somewhere around seven to ten different categories, then for the model to use it, each of those individual categories needs its own column, and that column holds a zero if the passenger wasn't in that category and a one if they were. That's how we actually integrate categorical variables into our analysis; you can't just leave them as one column in most cases. Basically what I do here is take all of the data, joining the training and test sets, because it's easier to make sure the training data ends up with the same columns as the test data if I do it this way. For the purposes of a Kaggle competition it might also give us a little more information in the training set about the distribution of the test set. I want to be clear that this is not good practice for real-world data science: you generally want to fit a one-hot encoder and a scaler only on the training data, and then transform your test data using the encoder that was trained only on the training data, which ensures your distributions stay consistent. In this case, where we want as much information about the test set as possible, it can make sense to do it the way I did here, and it's also a lot shorter and easier, so for the sake of time that's why I set it up this way. Going forward, we just apply all of the same transforms across both the training and the test data. We also fill the missing Age values with the mean, and we fill the missing Fare values with the mean as well.
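Here is a hedged sketch of that pre-processing step: pulling the title out of the name, combining train and test so the dummy columns match, imputing Age and Fare with the mean, dropping the rows with no Embarked value, and one-hot encoding with get_dummies. The helper function and variable names are my own for illustration, not the notebook's exact code:

```python
import pandas as pd

def add_engineered_features(df):
    # Same cabin/ticket features as before, plus the title from the Name column
    df = df.copy()
    df['cabin_multiple'] = df['Cabin'].apply(
        lambda x: 0 if pd.isna(x) else len(x.split(' ')))
    df['cabin_adv'] = df['Cabin'].apply(lambda x: str(x)[0])
    df['numeric_ticket'] = df['Ticket'].apply(lambda x: 1 if x.isnumeric() else 0)
    df['name_title'] = df['Name'].apply(
        lambda x: x.split(',')[1].split('.')[0].strip())
    return df

# Combine train and test so both end up with identical dummy columns
# (convenient for a Kaggle competition, but not best practice in production)
all_data = pd.concat([add_engineered_features(training),
                      add_engineered_features(test)], ignore_index=True)

# Impute missing Age and Fare with the mean (median is arguably better for Fare)
all_data['Age'] = all_data['Age'].fillna(all_data['Age'].mean())
all_data['Fare'] = all_data['Fare'].fillna(all_data['Fare'].mean())

# Drop the couple of training rows with no Embarked value
all_data = all_data.dropna(subset=['Embarked'])

# One 0/1 column per category level
all_dummies = pd.get_dummies(
    all_data[['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked',
              'cabin_adv', 'cabin_multiple', 'numeric_ticket', 'name_title']],
    columns=['Pclass', 'Sex', 'Embarked', 'cabin_adv', 'name_title'])

# Split back: rows with a known Survived label are the training portion
train_mask = all_data['Survived'].notna()
X_train = all_dummies[train_mask]
y_train = all_data.loc[train_mask, 'Survived'].astype(int)
X_test = all_dummies[~train_mask]
```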
In theory we probably should have used the median for Fare, because it was not normally distributed. That's something I'll take into account, and when I go back through this I might experiment and see whether it actually helps. I don't think there were that many missing Fare values; I'd have to look up top, and you can check that pretty quickly since this notebook isn't as long as I thought. Looking at Fare, I actually don't think there were any missing values in the training set, so that's good to know. Next we actually drop the rows that had nulls in Embarked. Finally, I decided to try normalizing a couple of different things. I didn't end up using the normalized siblings count, because it just looked really wonky, but we normalized the fare and got a distribution that's closer to normal, which I think is good, so to me it made sense to use that instead of the raw fare data. After that we create our dummies down here, we split back into our train and test sets, and then we're ready to get going: we have our training features and our training labels.

What we want to do now is try a bunch of different models and experiment with how they do. The way we tell how they do is cross-validation. What that does is take our training set, randomly split off some of the samples, train a model on part of the training data, and then predict on the part that was held out. Because we're validating on data the model hasn't seen, this should give us a better estimate of how it will perform in the real world. For the first pass I just ran a bunch of basic models without tuning any parameters to see how things would go. I tried Naive Bayes, which I think of as the baseline: it's a very simple model and you would expect other models to do better, but sometimes you just want to start with the easiest thing. Next I looked at logistic regression, a decision tree, K nearest neighbors, a random forest, a support vector classifier, XGBoost, and then I tried a voting classifier. I wouldn't really worry about knowing what all of these models are right now; I think it's more important to understand how to implement them, and when you're just starting out in data science it's okay to try all the models and see what works best. It gets a little more intricate when you're trying to figure out what data you're using and how that fits a specific model, but luckily this data is very generalizable and pretty much all of these models should work fairly well. You do have to understand a little more of the math when you start getting into parameter tuning, but that's also an experimental process. So we import cross_val_score, and that's how we're going to evaluate the success of these models. To run any of these you just import the model from scikit-learn, create an instance of it, and fit it to the data. Further down I'll show you how to do this without cross-validation when we're actually predicting the results for the test set, and I'll also go through how I did some parameter tuning.
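A minimal sketch of that first spot-check pass might look like the following, using 5-fold cross_val_score and mostly default parameters. It assumes the `X_train` and `y_train` from the pre-processing sketch, and that the xgboost package is available (it normally is in Kaggle notebooks):

```python
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from xgboost import XGBClassifier

models = {
    'naive bayes': GaussianNB(),
    'logistic regression': LogisticRegression(max_iter=2000),
    'decision tree': DecisionTreeClassifier(random_state=1),
    'knn': KNeighborsClassifier(),
    'random forest': RandomForestClassifier(random_state=1),
    'svc': SVC(probability=True),
    'xgboost': XGBClassifier(random_state=1),
}

# 5-fold cross-validation: train on four folds, score on the held-out fold
for name, model in models.items():
    scores = cross_val_score(model, X_train, y_train, cv=5)
    print(f'{name}: {scores.mean():.3f}')
```

Scaling the numeric columns (for example with StandardScaler) would typically help KNN and the support vector classifier, which is the scaling decision discussed earlier.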
The next thing I think it's important to talk about is the voting classifier. A voting classifier takes a bunch of different classifiers and has them literally vote on whether they think a person survived or not. Let's say we have five classifiers and three of them say the person survived: in what's known as a hard voting classifier, the model would then output that it believes the person survived. If we're using a soft voting classifier, the models instead pass forward their confidence, the probability that they think this person survived. So let's use a two-model soft voting classifier as an example: say the logistic regression gives a 100 percent chance that this person survived and the K nearest neighbors model gives a 30 percent chance. Even though one is pointing in one direction and the other in the opposite direction, you average the 100 percent and the 30 percent, which gives 65 percent. That's over fifty percent, so the soft voting classifier would still say it believes the person survived. That's how the voting works. Generally, if you have some breadth of models, voting classifiers work very well because they help normalize your results and generalize the data a little better. In most cases, ensemble approaches (and random forests and XGBoost are already ensemble approaches) are really powerful techniques for solving problems, and they're generally a best practice when you're not using deep learning. In this notebook I chose not to use deep learning: the data set is very small, so it probably wouldn't be optimal, and it's also fun to go through without having to worry about the deep learning considerations.

For the voting classifier we also computed the cross-validation score, and you can see it's a little lower than the support vector classifier, I believe. That's okay: just because these models do great in validation doesn't mean they're going to do as well on the actual test set, and that's something to keep in mind. You still have to experiment with the test data, and when a model is doing really well here, it might actually mean it's overfitting slightly.

So let's go down and start thinking about model tuning. As you can see, I have a table of the performance after tuning, and almost all of our models improved. For Naive Bayes it doesn't really make a ton of sense to tune; there aren't many levers you can pull to improve it. With a decision tree there are, but we're using a random forest and XGBoost, which are both tree-based models that we'd expect to perform better anyway, so I didn't think it was worth spending time tuning the plain decision tree. What we're going to use to tune is grid search and randomized search. Grid search lets you put in a bunch of parameter values, tries all of the combinations, and reports which parameters gave the best results. That's what we do here: we go through, try all of these parameters, and these are the ones that ended up having the best performance for us, along with the score they achieved. We do that for each of the different classifiers. Each classifier has its own parameters, which is probably a tutorial for another time, but I believe I was fairly exhaustive with most of these classifiers in terms of the things you can try.
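To illustrate the hard versus soft voting idea, here is a small sketch with scikit-learn's VotingClassifier, again on the assumed `X_train` and `y_train`; the choice of base estimators here is just an example:

```python
from sklearn.ensemble import VotingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

estimators = [('lr', LogisticRegression(max_iter=2000)),
              ('knn', KNeighborsClassifier()),
              ('rf', RandomForestClassifier(random_state=1))]

# Hard voting: each model casts one vote and the majority class wins
voting_hard = VotingClassifier(estimators=estimators, voting='hard')

# Soft voting: average the predicted probabilities, so a very confident
# model can outweigh a lukewarm one (the weights argument can bias this further)
voting_soft = VotingClassifier(estimators=estimators, voting='soft')

for name, clf in [('hard voting', voting_hard), ('soft voting', voting_soft)]:
    scores = cross_val_score(clf, X_train, y_train, cv=5)
    print(f'{name}: {scores.mean():.3f}')
```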
For random forest and XGBoost there's technically an infinite number of parameter combinations you could try, and if I tried them all it would take days, months, years, whatever it would be. So I used a kind of funnel approach to find parameters that worked well for me. I started with a very broad randomized search. What that does is not try everything in the grid, not try all the options; it randomly samples from them and tells you which of the sampled combinations gave the best results. Then, once you have those best results, you can tune a little more around them and find something that does just a little bit better. It's a way to save time and maybe shorten your workflow a little. Here, our best random forest classifier came in at about 83.5 percent, and further down, for XGBoost, the best performance I believe was about 85.2 percent, which is really high. Spoiler alert: in this case it actually ended up overfitting, and it didn't produce the best submission, but it's interesting to have in here that it was such a high-performing run.

One thing I haven't talked about at all yet is feature importances. What these tell us is which of the variables we put in, which of the features, actually have the greatest impact on predicting whether someone will survive or not. Just looking at this, we see that how much they paid, their age, whether they were male or female, and whether they were in third class rather than first or second, those are the things that had the largest impact. Next we see things like whether they had parents or children aboard and whether they had multiple cabins, which is interesting. I think all of these are pretty cool and relevant, and they help us understand the problem a little better. Maybe if we experimented with the fare a little more we might get even better returns. Some of the embarkation features didn't seem that important, and maybe we could remove some of them; they might be slightly confounding for some of our models, and for other models they're probably not relevant at all.

So I went through and tuned these things, and I thought the performance of the tuned XGBoost as well as the tuned random forest was worth submitting, so I submitted both of those. But I also wanted to build another voting classifier with our tuned models. What I did is go through and do the exact same thing we did before: I took the best estimators from all of the tuned models and made a bunch of different voting classifiers. I tried a hard voting classifier, a soft voting classifier with just K nearest neighbors, random forest, and the support vector classifier, then one with all of them except XGBoost, and then one with XGBoost, to see how they would perform. In cross-validation, in the order I ran them, the best results came from this one, but when I submitted, that was not the case; the best results came from this other one, which I thought was very interesting. It looks like maybe the XGBoost one just overtrained a little bit. That's part of this process: it's iterative, it's experimenting, and there are a couple of things I still want to tweak to see whether they have a positive impact. One of the last things I did was check whether the weighting impacted the output.
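The funnel idea above, a broad RandomizedSearchCV followed by a narrower GridSearchCV, could be sketched like this for the random forest. The parameter grids here are illustrative ranges, not the exact ones from the notebook:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, GridSearchCV

# Step 1: broad randomized search samples a limited number of combinations
# from a wide grid instead of trying every single one
broad_grid = {'n_estimators': [100, 300, 500, 1000],
              'max_depth': [5, 10, 20, None],
              'min_samples_split': [2, 5, 10],
              'min_samples_leaf': [1, 2, 4],
              'bootstrap': [True, False]}
rf_random = RandomizedSearchCV(RandomForestClassifier(random_state=1),
                               param_distributions=broad_grid,
                               n_iter=50, cv=5, random_state=1)
rf_random.fit(X_train, y_train)
print(rf_random.best_score_, rf_random.best_params_)

# Step 2: narrower grid search around whatever the randomized pass preferred
narrow_grid = {'n_estimators': [400, 500, 600],
               'max_depth': [10, 15, 20],
               'min_samples_split': [2, 5],
               'min_samples_leaf': [1, 2]}
rf_grid = GridSearchCV(RandomForestClassifier(random_state=1),
                       param_grid=narrow_grid, cv=5)
rf_grid.fit(X_train, y_train)
best_rf = rf_grid.best_estimator_

# Which features the tuned forest leans on most
importances = pd.Series(best_rf.feature_importances_, index=X_train.columns)
print(importances.sort_values(ascending=False).head(10))
```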
So I did a grid search and experimented with different weights in the soft voting classifier. What that means is I can give additional weight to one or two of the models so they count for more in the final vote. I tried different weightings all the way across, and it turns out the weighting we already had was optimal; it produced the best score. This code down here is basically just fitting everything: for each of the classifiers we've made, we fit it to the data, and after fitting we can go down and predict the results. This is how you'd actually use the models: what it produces is a series with all of the predictions for the test inputs. In the next cells I just turn those into DataFrames and set it up so I can export them as submission files. I also ran a little more code to see what the differences between the outputs were, because if two of the voting classifiers produced exactly the same results on the test set, it wouldn't make sense to submit both. All of them had slight differences, somewhere between three and eight to twelve differing predictions.

The last thing I did was actually submit this workbook. You do that by creating the output files locally, and then at the top, when you go in and run this, you can click up here and commit the changes. You click that, run it, and it runs through your whole workbook. For this analysis, because I had all of the random forest tuning, that takes quite a bit of time, so I didn't include that part here. When you actually want to submit, you can click Output and you'll have all of your output files there. So let's say I wanted to submit a new version of the XGBoost predictions: I would go through and submit that, and, well, it looks like it didn't run, so that wasn't great, but you get the idea. As you can see, the best results I had were these two runs. I'm probably going to try to get it just a little bit better so I can break into that top 10 percent; that's just some fun tuning I'm going to experiment with, and hopefully we can do a little better and improve our performance.

Hopefully that was a good walkthrough. This was fairly in-depth and a long analysis, but hopefully I broke it down into simple enough terms to get you started. You can go through and walk through the code, and you can definitely ask me questions. I just wanted to get this Kaggle series started; I'm planning to do something along these lines almost every month. I think showing you how to do some of these projects can really help kick-start things and get you thinking about this in the right way. So as usual, thank you so much for watching, and good luck on your data science journey.
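For reference, the fit, predict, and export steps described in the walkthrough boil down to something like this sketch, using the `best_rf`, `X_train`, `y_train`, `X_test`, and `test` objects assumed in the earlier sketches; the output filename is just an example:

```python
import pandas as pd

# Fit the tuned model on the full training data, then predict the test set
best_rf.fit(X_train, y_train)
test_predictions = best_rf.predict(X_test).astype(int)

# The Titanic competition expects a two-column CSV: PassengerId and Survived
submission = pd.DataFrame({'PassengerId': test['PassengerId'].values,
                           'Survived': test_predictions})
submission.to_csv('submission.csv', index=False)
```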
Info
Channel: Ken Jee
Views: 140,106
Rating: 4.955164 out of 5
Keywords: Data Science, Ken Jee, Machine Learning, data scientist, data science journey, data science project, Beginner Kaggle Data Science Project Walk-Through, data science project walkthrough, kaggle competition, kaggle project, data science titanic, data science basic project, kaggle.com, kaggle basics, kaggle submission, data science kaggle, kaggle beginners, data science beginners, kaggle analysis, titanic analysis, kaggle titanic, classification, svc, random forest, kaggle, xgboost
Id: I3FBJdiExcg
Length: 38min 16sec (2296 seconds)
Published: Fri Jul 17 2020