Introduction to Data Science with R - Exploratory Modeling 1

Captions
Hi, welcome to video 4 in my series on Introduction to Data Science with R. Today we're going to be talking about exploratory modeling, and the reason we're finally getting to this point is that we've done enough data analysis to warrant some additional intuition, some additional reinforcement, for the things we may have gleaned from the data. I'm willing to bet there's been a non-trivial contingent of folks watching these videos going, "Man, I'm three hours in and we haven't done anything that even remotely resembles what I thought data science would look like, that is, building models." At the risk of sounding preachy, I've talked about this before, but I'll mention it again: sixty to eighty percent of the time spent on a data mining / data science project is relatively low-level work of getting data, cleaning data, massaging data, validating data, and confirming what things mean with subject matter experts in the business. Very little of it is modeling. Not to be too cheeky, but you can really think of the modeling aspect as the tip of the iceberg (pun intended, given that we're looking at Titanic data) of the whole data science lifecycle. That being said, exploratory modeling is, in my opinion, actually quite beneficial, and it's something I wanted to go through with you today given where we are in the lifecycle.

So let's start with the intuition of why you may want to do some exploratory modeling. First and foremost, many machine learning algorithms can, as they're being trained, look at the variables (the features, the columns, if you will) in the data and actually decide which of those variables are important in terms of being predictive and which are not. This is typically known as feature selection. Some models do this explicitly: for example, a logistic regression trained with what's known as L1 regularization (if you don't know what that is, it's fine, it doesn't matter) performs feature selection explicitly. And some algorithms do it implicitly: for example, as we'll see, random forests do this as part of the algorithm. As the forest repeatedly trains decision trees, each individual tree picks and chooses which variables are beneficial for predicting the data, and over the course of many, many trees you get an emerging picture of which variables tend to be important and which don't. This is great, because it can validate some of the intuition we've gleaned from looking at the data: the whole idea of "women and children first", and that wealthier folks tended to survive more often than poorer folks. We can use exploratory modeling to validate the things we've gleaned from analyzing the data.

There are a number of scenarios where this idea is helpful. First and foremost, in scenarios with lots of features (what is commonly referred to as high-dimensional data), exploratory modeling can assist us in identifying high-value candidate features first. Imagine you've got a data set with 300, 400, 500 features. Because your time as a data scientist is valuable, it may be beneficial to whittle that down a little: before examining each column in detail, target the ones that have a higher likelihood of being predictive. Exploratory modeling can help you with that, and in that situation, unlike in this video series, you might start with exploratory modeling at the very beginning. Since our dataset is small (it only has about 10 features), I wouldn't do that; I would start the way we have in this series. The second scenario is where we're at in the series: we've done some data analysis, and we want to use exploratory modeling to provide additional insight into, or confirmation of, what we've gleaned from the data. We'll see this again later in the series when we do some feature engineering: can we validate that our feature engineering is useful? In our case we've actually done a small bit of feature engineering already: the title feature we created as a proxy for both the gender and the age of a passenger on the Titanic is an example, and another one we'll see today, family size, is also a feature we engineered. So really we're combining both of these ideas in today's video.

OK, now here are some characteristics that you want in your exploratory modeling, right? It needs to be simple, fast, and effective. Let's talk a little bit about that.

It needs to be simple. Models have these things called hyperparameters, and we'll see this later in the video when we take a look at the random forest algorithm; in terms of our programming it's literally a function call, and the hyperparameters are the parameters you can tweak. The way to think about it is that hyperparameters are knobs and dials: ways of tuning, of dialing in, the training of the algorithm to fit the particulars of the data you have in your training set. Some algorithms, for example deep neural networks, have tremendous numbers of hyperparameters to worry about, and their effectiveness as models varies widely based on how you set those, which can be a non-trivial endeavor. Generally speaking, when you're just doing exploratory modeling you don't want that, because it takes too much time.

Another thing to worry about is data pre-processing. For example, do I have to convert factor (categorical) variables in R to numeric variables, essentially a set of binary ones and zeros, because the algorithm I'm using only works with numeric data and does not work with factors? While this isn't that big of a deal (later on we'll probably take a look at how to use the caret package to do this for us pretty easily), it's still extra work, so if you can avoid it, that's probably a good thing.

Very intuitively, we want our exploratory modeling to be fast: we want to pick algorithms that train as fast as possible. For example, logistic regression is a classification algorithm you could use for the Titanic competition, and generally speaking it trains very, very fast, so that's a good benchmark: if something is about as fast as logistic regression, great, that's pretty fast.

And lastly, we want our models to be effective. This is where, for example, logistic regression is very fast, but out of the box it's not necessarily particularly effective, because it is at base a linear model, and as we've seen in our analysis, the Titanic data set (which is actually quite typical, in my experience, of business data in general) presents a very nonlinear problem. This brings us to the last bullet point I've highlighted: we want our exploratory models to be powerful enough to work with the data we have at hand, so that we can put stock in the results they give us. In the case of Titanic this is a complicated, highly nonlinear data set: we've seen that if your title is "Master" and you're in first class, you basically survived.
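As a quick illustration of the factor-to-numeric pre-processing mentioned a moment ago, here's a sketch using base R's model.matrix (the toy `embarked` column is made up for this example; the caret package's dummyVars offers similar functionality with more control):

```r
# A toy factor (categorical) variable, invented just for illustration.
df <- data.frame(embarked = factor(c("S", "C", "Q", "S")))

# model.matrix() expands the factor into one binary 0/1 column per level.
# (The "- 1" drops the intercept so every level gets its own column.)
dummies <- model.matrix(~ embarked - 1, data = df)
dummies
```

Each row ends up with exactly one 1 across the three indicator columns, which is the kind of extra pre-processing step random forests let you skip.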
But if your title is "Master" and you're in third class, well, you may or may not have survived, and based on the data we've analyzed so far we don't really know why that would be the case; it may just be purely random, for any number of reasons. So you want things to be effective.

The whole idea, when you take these things in aggregate, is really cycle time: how many iterations of a model can I crank out in a given hour of activity? As a data scientist, my time is valuable and your time is valuable, and the more iterations we can crank out in a unit of time, the more value we can generate, because we can try different scenarios, explore, and create experiments. So we want our exploratory modeling to be fast, simple, and effective.

OK, that leads us to arguably the king of exploratory models, in my opinion: the random forest algorithm. In this video series we're not going to go in-depth on any particular algorithm we use, mainly because, quite frankly, there are better resources available on YouTube and on the Internet than I could provide, so why reinvent the wheel? In just a bit I'll point you at some great resources for learning random forests on YouTube, so don't worry if you don't understand what a random forest is; I'll give you some materials to learn that. What I want to emphasize for this video is that random forests meet all of the intuitive criteria we talked about above.

First and foremost, as I mentioned earlier, they do provide feature selection automatically. It's arguably implicit, but random forests do it, which is great. They're also extremely simple from a development perspective: they're arguably ridiculously easy to use in R out of the box, because the default hyperparameter values work very well. You can get quite a bit of value from just passing in your training data and your labels and telling the random forest to run, and the results are quite valuable with just that simple line of code.

Random forests are also really awesome because they can handle numeric, categorical, and correlated variables without pre-processing. For those of you who aren't familiar with correlation: basically, if I've got two numeric variables and they tend to have high values at the same time and low values at the same time, on a row-by-row basis, that means they're highly correlated. Many machine learning algorithms, especially linear algorithms, tend to have a problem when you present correlated variables to them, so often some pre-processing needs to happen. If you use something like logistic regression, you don't want to present two highly correlated numeric variables to the algorithm when you train it, so you have to worry about that. Random forests actually handle that pretty well, so you don't have to do much, which is great: it lowers the bar for doing high-volume iterations in our exploratory modeling.

Random forests are also relatively quick. They're not nearly as fast as logistic regression (logistic regression blows them out of the water, to be quite frank), but compared to other potential exploratory algorithms that offer all of the other benefits random forests provide from the data scientist's perspective, they're way faster. For example, any sort of multi-layer neural network with multiple hidden layers (a.k.a. deep learning): random forests are going to blow that out of the water in terms of training time. SVMs (support vector machines): same deal, random forests will generally be much faster. And random forests tend to stay faster than those other options as your data set grows: the increase in training time won't grow nearly as fast as it does for those alternatives.
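To make the correlation idea concrete, here's a small sketch with made-up vectors (x, y, and z are invented for this example) showing how you might use base R's cor() to spot a highly correlated pair:

```r
set.seed(1234)
x <- rnorm(100)
y <- x + rnorm(100, sd = 0.1)  # y moves almost in lockstep with x
z <- rnorm(100)                # z is unrelated to x

# cor() near 1 (or -1) flags highly correlated numeric features, the
# kind linear algorithms are sensitive to; random forests shrug them off.
cor(x, y)
cor(x, z)
```

The first correlation comes out close to 1 and the second close to 0, which is the quick check you'd run before feeding features to a linear model.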
Now, random forests are effective. When I first started Kaggling, two or three years ago, random forests were essentially the gold standard: almost everybody used them, and many, many teams won Kaggle competitions using only random forests. That has since changed with the advent of deep neural networks and boosted tree packages like XGBoost, but random forests are still a go-to algorithm, and they're definitely part of many winning teams' ensembles. They're still very powerful; they've just been eclipsed by some more recent technology. We will use XGBoost later in this series, for example, instead of a random forest for our submissions to Kaggle, but random forests remain a go-to algorithm, and you should not feel like you're shorting yourself at all if you use them in your work. So, great: random forests are awesome.

OK, let's go take a look at some resources on YouTube. Finding material on random forests is really easy: just go to YouTube and type "random forest kaggle" as your search criteria, and what I would recommend is about third on the list: a random forest tutorial from Kaggle, in three videos. This is a really great video and I'd recommend watching all of it. It's from a while back, but it's really focused on competing in Kaggle competitions, which, as you know, I heartily recommend to everybody I mentor; it's a great way to build your chops. The video talks about getting prepared for data science competitions on Kaggle, but in general it's equally applicable to anything you do in your professional life as a data scientist. The random-forest-specific material, if you don't want to watch the entire video, starts around the 40-minute mark, with some code the presenter wrote for his own random forest implementation, and it gives you a really nice high-level introduction to random forests.

This other video here is actually a lecture from the University of British Columbia, and it's very in-depth. If you search on the lecturer's name, Nando de Freitas (hopefully I didn't butcher that too badly), he also has a really good video on decision trees, and a random forest is essentially a collection of individual decision trees, so you can watch both of those. I know that if you watch all three of the videos I just recommended, you're looking at three or four hours' worth of material, but it's all goodness, so I would highly recommend it if you've got the time.

OK, that being said, let's go ahead and plop into R here; I'll maximize this. What I'm going to do is just run all of the code from the series up through the end of video 3, and as usual we'll see a bunch of output and a bunch of plots pop up. You'll notice that I've started adding comment blocks into the code; it's getting long, so from now on you'll see blocks in the file on GitHub that say, hey, this is where we're at for video 3, for video 4, and so on. Hopefully that'll be a little bit helpful.

OK, there we go. The first thing I want to do: if you've got a relatively clean R and RStudio installation, you probably don't have randomForest, so let's install it. Click the Packages tab over here, then click Install, and just type in randomForest. We want the plain old randomForest package; as you may have noticed, random forests are popular and there are many variations, but for our purposes we're just going to use the de facto random forest implementation. Once that's installed, let's just go ahead and restart the R session, and then we can load it.
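In script form, the install-and-load step just described looks like this (install.packages only needs to run once, which is why it's commented out here):

```r
# One-time install of the package (uncomment on the first run):
# install.packages("randomForest")

# Load the library for this session.
library(randomForest)
```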
As always, don't be afraid to use the help system. The function in question is, not surprisingly, just called randomForest, so go ahead and execute the help function on that and expand the pane so you can see it a little better. Here's the function definition in R. As I said earlier, all of these are essentially parameters for tuning it, and the great news is you don't need to worry about any of that right now. You can explore them in depth over time if you want to (I know I have as I've competed in Kaggle competitions), but for our exploratory efforts we really don't need to worry about them much.

OK, let's start going through the code. The first thing we want to do is explore just Pclass and Title. As you saw ad nauseam in previous videos, Pclass and Title were basically the pivot, the drill-through, that we used to look at the data, because our visual analysis showed those two things to be very highly predictive. So let's take a look: if we built a model with just those two things, what kind of results would we get?

What I'm going to do here is create a specific training data frame, which I'll just call rf.train.1, for the first training data set for our random forest exploratory modeling. I'm going to grab the first 891 rows out of the data.combined data frame, because those are the ones with labels, and I only want two columns: Pclass and Title. That's it: a very minimal data frame, just to explore modeling the problem space with these two features.

Then I'm also going to create a dedicated label variable. In the data.combined data frame we created a Survived factor with three levels (basically one, zero, and "None", because data.combined holds all of the data, including the test data, which has no labels). To actually train a model we just need the training data set's version of the Survived variable, which has two levels: you lived or you died. So I'm going to grab that directly, convert it with as.factor, and cram it into my rf.label variable. Now I've got my training data and my labels.

Next up: as we'll talk about in a second, random forests are, not surprisingly, random. If you want some sort of reproducibility, or some way to compare different runs of the random forest algorithm, setting the seed is really important, because otherwise you're going to get slightly different results every single time; that's by design. The way you counteract that is with set.seed, which basically sets the state of R's random number generator; by setting it to 1234, a classic example, we get reproducibility.

Next up, we're going to call the randomForest function, which is what actually trains an instance of a random forest for us. Using the parameters here: x, which corresponds to the training data, points at the data frame we created with just our two variables, Pclass and Title, and y is the labels, did you survive or not, which is our rf.label variable. The last two parameters you could in theory leave off when you're doing exploratory modeling, to be quite honest with you, but I tend to set them for the reasons I'll enumerate here. First of all, setting importance to TRUE tells the random forest algorithm, as it trains all of the decision trees that in aggregate make up the forest (hence the name), to keep track of the relative importance of the features and variables, so that I can report on them later.
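Pieced together, the data-prep step just described looks roughly like this; it assumes the data.combined and train data frames built earlier in the series, with the engineered Title column (a sketch of the series' code, not a verbatim copy):

```r
# The first 891 rows of data.combined are the labeled training passengers;
# keep only the two features we want to explore first.
rf.train.1 <- data.combined[1:891, c("Pclass", "Title")]

# Labels come from the original training set's two-level Survived column,
# not the three-level version stored in data.combined.
rf.label <- as.factor(train$Survived)
```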
And lastly, the ntree parameter says how many individual trees there are in the forest. If we go back to the help system, we can see ntree right there: the default hyperparameter for random forests is 500 trees, and to be honest, that's a reasonable default value. As we talked about before, I only up it to a thousand because a few extra trees aren't such a bad thing, so I tend to just use a thousand, but 500 would work just as well. You could potentially leave that parameter off in your exploratory modeling, because you're not going to get a huge difference in the results, but I just tend to use a thousand.

OK, great, let's train that. Actually, hold on: first I'll set the seed, and then train. You can see that was pretty fast, and the reason is that we're only building a thousand trees on just two variables times 891 rows, so it's not a lot of data. Obviously if your data set were huge it would take a lot longer; our data set is relatively small, so it was fast, which is exactly what we want for exploratory modeling.

By default, the object that gets returned from the randomForest function call (which I'm calling, intuitively enough, rf.1) is set up so you can print it out. All I have to do is run that line of code and we get some really useful output; let's take a look at it more in-depth. First it shows the call, how you invoked randomForest, then the number of trees, and the number of variables tried at each split. That last one is actually also a hyperparameter, called mtry, that you can tune, but for exploratory modeling the idea is not to worry about it; again, the way the algorithm is set up by default, it works pretty well, so you don't really need to mess with it. Now, what's below that is what's really interesting.
Let's start with this thing labeled OOB. OOB stands for "out of bag", and again, you'll learn more about this in the videos I pointed you to on YouTube, but at a high level: think of your training data as an Excel worksheet, a table, a matrix. The way the random forest algorithm works, first and foremost, is that it takes a random set of columns and rows out of your spreadsheet, and it does that iteratively, every single time it trains a tree. And it samples "with replacement", as the statisticians would say. Since I'm training a thousand trees, a thousand times I'm going to randomly select a collection of rows and columns from the spreadsheet, and within any single tree's sample I'm OK with selecting the same rows more than once; that's what a statistician would call random sampling with replacement. What that means is that, across the whole forest, the same rows and columns get selected a huge number of times, and, by the same token, for any particular tree's training a certain number of rows will never get selected at all. Those rows are outside of the bag, "out of bag", and you can hold them out: train the tree with the stuff that was selected, and then use the stuff that wasn't selected to test how accurate it is. That's one of the powers of the random forest: not only can it help you with feature selection, it can also give you an estimate of how accurate your model will be in general. Now, Leo Breiman, the gentleman who created the random forest, would tell you that this estimate is very, very accurate; my experience is that it varies, but it's awesome.
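Here's a tiny base-R sketch of that sampling-with-replacement idea, showing how some rows naturally end up out of bag for a given tree (the 891 is just our training-set row count):

```r
set.seed(1234)
n <- 891  # number of rows in our training data

# One tree's bootstrap sample: n row indices drawn with replacement,
# so some rows appear several times and others not at all.
boot.rows <- sample(1:n, size = n, replace = TRUE)

# Rows never drawn are "out of bag" for this tree and can be used to
# test it; on average that's about 1/e, roughly 36.8% of the rows.
oob.rows <- setdiff(1:n, boot.rows)
length(oob.rows) / n
```

Run it a few times without the seed and you'll see the out-of-bag fraction hover around that 36.8% mark every time.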
For exploratory modeling, that is. It's not necessarily the value I would use in an actual Kaggle competition, for example, to say, hey, I'm about ready to make a submission and I think my OOB estimate of the error rate will match what I see in my Kaggle competition. For that I would use something called cross-validation instead; we'll talk about that in a later video, but OOB is good for our purposes.

So what this output says is: across the thousand trees I trained, the OOB estimate of the error rate is about 20.76 percent. Or, the way I think about it: we're getting nearly eighty percent accuracy using just two variables, which is pretty good, and which matches our intuition from the data analysis that these two things together are extremely predictive of whether you lived or died.

The next thing in this output that's really important is the confusion matrix. What it basically says is: broken down by our labels, did you perish or did you survive on the Titanic, what were the results of this random forest, all up? You can read the intersections of the matrix either way (this axis could be the true value and that one the predicted value, or vice versa); for the sake of discussion, let's say the rows are the true values and the columns are the predicted values. So what this says is: our trained model correctly predicted that 538 records perished when in fact they did perish, and it predicted that 11 people survived when in fact they perished, which gives an error rate of around 2 percent for that class. So this model is about 98 percent accurate at correctly predicting when people perished, which is pretty good. Now, that being said, we already know that the data is skewed.
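To see where those numbers come from, here's the arithmetic on the confusion matrix. The 538 and 11 counts are the ones quoted above; the survived-row counts (174 and 168) are my reconstruction from the roughly 20.76% overall error and the 549/342 class split, so treat them as illustrative rather than exact:

```r
# Rows = actual outcome, columns = predicted (0 = perished, 1 = survived).
conf <- matrix(c(538, 11,
                 174, 168),
               nrow = 2, byrow = TRUE,
               dimnames = list(actual = c("0", "1"),
                               predicted = c("0", "1")))

# Overall error rate: off-diagonal (wrong) counts over all 891 rows.
overall.err <- (conf["0", "1"] + conf["1", "0"]) / sum(conf)
overall.err  # ~0.2076

# Per-class error rates: ~2% for perished, ~51% for survived.
conf["0", "1"] / sum(conf["0", ])
conf["1", "0"] / sum(conf["1", ])
```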
We can double-check that again by tabulating train$Survived, and we knew already that it's biased: most people perished. So that's not a surprising result; arguably, this current model is pretty good at pessimistic predictions, really good at identifying when you perished. Conversely, we can ask: how good are we at correctly predicting that somebody survived? Here we have the cases where we predicted that somebody perished when in fact they survived. Not so good: we tend to be a little pessimistic, because you can see here that our accuracy rate for survivors is less than 50 percent. We're really good at identifying people who did in fact perish; we're not so good at predicting that people survived when they actually did.

Moving on, this next line of code gives us a graphical representation of the random forest algorithm's feature selection. You run it and we get a plot; I'll zoom in so we can see it a little better. OK, so it's not super interesting with only two variables, but if you have more variables, what you'd see is a long list of variables on the y-axis (I won't go into what the importance metrics mean; you can look them up in the help system), and generally speaking, the farther to the right a variable's corresponding dot is, the more important that variable is. As we can see here, and as we expected from our data analyses, Title is actually more important than Pclass for determining whether you survived in this model. What this is saying is: if you build a model with just these two variables, Title is way, way more predictive than Pclass in predicting survival.
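The plot comes from the randomForest package's varImpPlot function (assuming the rf.1 model object trained earlier):

```r
# Dot plot of variable importance: the farther right a variable's dot,
# the more the forest relied on it. Requires importance = TRUE at training.
varImpPlot(rf.1)
```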
Moving on: in our analyses we were able to throw out a bunch of the variables on our first pass as not being particularly predictive; that was things like ticket, cabin, and where you embarked. We got rid of age and gender because we use Title instead, since Title is a single representation of both of those concepts combined. But one of the things we did identify as potentially predictive was SibSp, the number of siblings and/or spouses you were traveling with. So let's go ahead and add that to the model: let's train a new model with that added to the set of features, and we'll just call it model two. The rest of the code is basically the same, so I won't go through it.

OK, let's take a look. This is particularly interesting: with just the two variables we got an error rate of 20.76 percent, but when we add a third one we get more than a one percent increase in our accuracy, that is, more than a one percent decrease in our error rate. That may not seem like much, but in machine learning competitions, as well as business applications, a one percent improvement is often huge. For example, the famous Netflix Prize competition awarded a million dollars to the first team that could improve the efficacy of the Netflix recommender by only ten percent. So one percent is huge: if you can add one variable and get a one percent decrease in your error rate, that's big. And this confirms our intuition: we thought SibSp was predictive, and sure enough it is; it helps the overall model.

Now, how does it help? Pretty much as we would expect: it dramatically increased our accuracy for those who survived. Conversely, it had a negative impact on our accuracy for those who perished, but while that class's error went up by around nine percent, the survived class's error decreased by almost seventeen percent. Great, I'll take that trade-off any day of the week.
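The second model discussed above is the same pattern with one more column (again a sketch of the series' code, assuming the data.combined data frame and rf.label from earlier):

```r
# Add SibSp to the feature set and retrain with the same seed
# so the two runs are directly comparable.
rf.train.2 <- data.combined[1:891, c("Pclass", "Title", "SibSp")]

set.seed(1234)
rf.2 <- randomForest(x = rf.train.2, y = rf.label,
                     importance = TRUE, ntree = 1000)
rf.2
```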
Now, I have to caveat that: the statement is bold, but it makes sense for the purposes of this application, because overall accuracy is how the Kaggle competition for Titanic survival is judged. We essentially give equal weight to an incorrect prediction for someone who perished and an incorrect prediction for someone who survived; they're equal, we don't really care, and we'll mix and match those all day long as long as, in aggregate, we're as accurate as we can be. Now, there are many, many situations where that won't be true. For example, take a test that detects whether or not you have a fatal disease: a false positive, telling somebody they have the disease when in fact they don't, is bad, but it's not nearly as bad as a false negative, which says, no, we don't think you have the disease, even though you do. In that situation you wouldn't weight errors exactly the same, and as you do your exploratory modeling you would want to incorporate that into your analysis and understanding. But in our case here it's very simple, which is good: the two kinds of errors are equally fine, so we're willing to make the trade-off if our overall accuracy increases, which it has here; we've improved it by over a percentage point. That's great.

OK, so we also thought that Parch was potentially predictive, so let's take a look: let's create a model and evaluate it where, instead of SibSp, we use Pclass, Title, and Parch, and see what happens. OK, this is interesting: Parch is predictive, because we improve over the original model, from 20.76 percent down to 19.98 percent, but it's not nearly as predictive as SibSp, which is interesting. So both have predictive power, but one is better than the other. Inevitably, the next question is: what if they're like chocolate and peanut butter, are they better together?
we can go ahead and build another model: let's use Pclass, Title, SibSp, and Parch and see what the results are. Again, you'll notice that I'm using the same seed every time, so that we can compare the models legitimately. We know the random seed is the same, so the algorithm starts from exactly the same point on every run, and the only thing that varies between the results is the set of features we plug in. Okay, let's run that one. All right, sweet: the combination of SibSp and Parch, as we would expect based on our analyses, improves the model again. For ease of comparison, let's pull up the first model: we've gone a full two percentage points down in our error rate just by adding SibSp and Parch. What that probably tells us, intuitively, is that the number of siblings and spouses and the number of parents and children matters, as we would expect. We have the theory of "women and children first", and that holds up. We also have the theory that wealthier folks tended to survive, and that holds up too. What also probably holds up is that, as we saw, smaller families in second and third class had a tendency to survive better than larger families, simply because of the inertia of being down in the bowels of the ship: getting five children all the way up topside is, unfortunately, probably not likely. So this reinforces that intuition. Okay, excellent. We also said, hey, let's create a family size variable: for each individual record we start with one (I'm traveling with myself, so my family size is at least one) and then add the SibSp and Parch variables. This is really the idea of engineering a feature that says: this is the
total family size. Is that predictive? It essentially marries the ideas of SibSp and Parch together, and sometimes that's helpful, because algorithms can't necessarily intuit combinations of variables. For example, our decision trees may say "if your SibSp is over two and your Parch is over two, go down this branch of the tree", but the algorithm doesn't really understand the concept of total family size; it isn't that smart. So deliberately engineering a feature may be helpful. Let's take a look. We thought it was predictive, so let's try just Pclass, Title, and FamilySize. The reason we'll try this first is that family size implicitly incorporates both Parch and SibSp, so you could argue that also putting those two in the model may be redundant. We'll start with family size alone and see what we get. Ah, that's an interesting result: our total error rate is now 18.41% versus 18.52%. It's a small improvement, but an interesting one. This exploratory modeling seems to validate our intuition that family size actually matters, and that the random forest algorithm probably isn't very efficient at combining SibSp and Parch on its own to arrive at the idea of aggregate family size. So we give it an explicit FamilySize feature and see if it works, and the answer is: yes, it works better. Not a lot better, but in the case of Kaggle, 0.1% can sometimes be the difference between a top 25% placement and a top 10% placement, or the next bracket up, so it's nothing to sneeze at. More importantly, this exploratory modeling gives us ideas about where we may go with feature engineering, namely that family size matters. So a question to explore
potentially in the next video, or maybe one of the videos further down the line, is explicitly breaking family size out: characterizing it in terms of the number of children and the number of adults. Whether a wife is traveling with a husband may also be important, because family size matters, and it matters more in aggregate than SibSp and Parch by themselves. That's really good. Okay. Because this is so quick and easy to do, it makes sense to just try it explicitly. Our intuition was that if we add SibSp it isn't going to do much, because FamilySize already implicitly incorporates the knowledge, the information, the predictive power of SibSp. Well, it doesn't hurt to try it explicitly, so here we go. Okay, let's break this down. At the first level this seems pretty indicative: if we add SibSp, our error rate goes up, which means our accuracy goes down. SibSp seems to be confusing the issue, and we can see in the confusion matrix how that breaks out: with SibSp added, our error rate on the "perished" predictions goes down a tad, but our error rate on "survived" goes up disproportionately more, which nets out to a higher overall error rate and lower accuracy. That matches our intuition. Again, because this is so fast and can't really hurt, let's try adding Parch to the mix as well. As we saw before, Parch isn't as predictive as SibSp (I shouldn't say "nearly as predictive", that exaggerates the difference), so as we would expect, when we add it the error rate goes up even more. All right. And, well, this is interesting: the error rate hasn't actually changed here. So the error rate
between SibSp and Parch, when you already have FamilySize in the model, doesn't really change much. The problem is that we get worse at predicting survival, which is interesting. That may indicate that family size is the variable predominantly used for determining whether or not you perished, since it's the thing that didn't vary between these two models. Remember, setting the seed is trying very hard to reduce the variance between runs of the random forest algorithm, so that as we alter the features used in training, those features are the primary source of variation between runs. Thinking about it, the only thing that varies between this value and that value is whether we used SibSp or Parch, so if these error rates are exactly the same, maybe that's indicative of family size being the disproportionately important variable for determining whether somebody perished. We can take a look at the plots to verify; I've been ignoring the plots for a bit, but let's look at them now. Okay, here is our variable importance plot for the seventh random forest we built, which has the features Title, Pclass, FamilySize, and Parch. What you see here, once again, is that Title is way, way more important than anything else, which basically means that gender and age together are the single most important thing, driving this idea of women and children first. Pclass is the next most important, and it's our proxy for wealth: not nearly as important as women and children, but pretty important, which matches our intuition. FamilySize is the next most important after that, and it's non-trivially more important than Parch. You can go back and double-check the other plot and see that the same thing holds for SibSp: FamilySize
is more important. Let's see here: our fifth random forest is the best single model of today's exploratory session in terms of accuracy. I alluded to this last time, in the third video, but I think it's worth mentioning again: this is the idea of Occam's razor. Generally speaking, we prefer simpler models over more complicated models, and this is an example of that. If you have the right features, you don't necessarily need a lot of features to get a reasonably predictive model, and this kind of shows it. This was the next most accurate model, the combination with SibSp, and so on and so forth; I won't go through all the rest of these plots. So we saw that our fifth model is the most accurate; let's pull that up again, and, just to keep it easy, let's calculate what our accuracy rate is. Okay, so the random forest estimates that, in general, our accuracy on unseen data is going to be around 81.5% with just three features, which is pretty cool. For the sake of argument, let's take a look at where that would place us in the general scheme of things in the Kaggle competition, just as a benchmark. kaggle.com, there we go, Titanic, let's go to the leaderboard. Okay, cool, let's maximize that. The first thing we should note is that the Titanic competition is interesting for a couple of different reasons. One, it's very broadly participated in: it's basically an ongoing competition that will stop at some point and then just restart, and lots of people do it. These entries are "teams" (most of the time they're individuals, but they could be teams of people competing under a single name), so there are a lot of folks competing. The second thing is that this competition is ripe for
cheating, because, as you know from our data analyses, it's extremely unlikely that you could really build a model that gets you 100% accuracy, which is what all of these show; there just doesn't seem to be enough signal in the variables to determine that. However, you could literally build a "model" that goes and finds all the names in the test set. When we looked at the raw data in video 1, the full names were listed there, so you could go out on the internet, find everyone who lived or died by name, and build a model that way. I don't want to accuse anybody, but one submission with 100% accuracy seems a little fishy to me, and a lot of these others are interesting too. One submission at 99.5% accuracy? I've competed in this competition myself, and I can attest that these things are fishy. Realistically, when we scroll down and start to see blocks of identical scores, that's pretty indicative of legitimate scoring. The reason is that nowadays Kaggle allows the posting of scripts to a particular competition. Folks do a submission and post on the forums, "hey, I just did a submission with this script and it scores this amount", and all you have to do is tell Kaggle you want to submit the same script under your user account, and you'll get the same score. You can see here a bunch of 83s, which is probably indicative of this going on, because you've got a couple of entries with just one submission at exactly the same score, but then people who have done 25 submissions. That may indicate a scenario where a person has been competing, has done a bunch of submissions, and then found on the
forum a script somebody else implemented, submitted it, and found it was better than anything they had done themselves. Nothing wrong with that; it's all part of the learning process. Same here with the 82.75s, and then here's a whole bunch of 82.2s, and then some 81.8s. Wow, a lot of 81.8s. Which is good, because now we can say: if we take Dr. Leo Breiman at his word that the out-of-bag estimate of the error rate is a very accurate estimate of how well we'll perform on unseen data (the test data, in this case, that we would submit to the Kaggle website), we should expect to score around 81.59. Assuming that estimate is accurate, 81.59 would put us right around here, at 213th place, which would be a top 10% placement on our first submission. Not bad, not bad at all: top 10% with just three variables, just using a random forest algorithm, no tuning. The reason I bring this up is that it hopefully illustrates the method to the madness, why it's so important to do data analysis well and why it's so important to do feature engineering. It matters more than anything else: more than which algorithms you use, more than how you tune them. Those things are important, as I've said before, don't get me wrong, but they pale in comparison. Feature engineering, understanding the data, is the number one thing. With that, I will conclude. Hopefully you've enjoyed this video, and hopefully you're starting to see the big picture here (or maybe you already did, and you're just happy that I'm finally getting around to it). Either way, I look forward to the next video, and until then, happy data sleuthing!
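To make the equal-error-weighting point above concrete, here is a minimal base-R sketch. The confusion matrix counts are hypothetical (chosen only so they sum to the 891 training rows), not the actual results from the video:

```r
# Hypothetical confusion matrix: rows = actual class, columns = predicted class.
conf <- matrix(c(500, 49,    # actually perished: 500 correct, 49 wrong
                 91, 251),   # actually survived: 91 wrong, 251 correct
               nrow = 2, byrow = TRUE,
               dimnames = list(actual    = c("perished", "survived"),
                               predicted = c("perished", "survived")))

# Overall error counts both kinds of mistakes identically, which is exactly
# how the Kaggle Titanic competition scores submissions.
overall_error <- 1 - sum(diag(conf)) / sum(conf)

# Per-class error rates, the kind of breakdown a random forest's confusion
# matrix output gives for "perished" versus "survived" predictions.
class_error <- 1 - diag(conf) / rowSums(conf)

round(overall_error, 4)
round(class_error, 4)
```

With these made-up counts, the "survived" class contributes a higher per-class error than "perished", yet both kinds of mistakes move the overall number by the same amount per passenger.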
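The seeding discipline described above (reset the same seed before every training run so that only the feature set varies) can be illustrated with base R alone; 1234 here is an arbitrary seed chosen for illustration, not necessarily the one used in the video:

```r
# Resetting the same seed reproduces the same random draws, so two
# back-to-back training runs differ only in the features they are given,
# not in the random samples the algorithm happens to draw.
set.seed(1234)
draw_1 <- sample(1:891, 10)   # e.g., row indices for a bootstrap sample

set.seed(1234)
draw_2 <- sample(1:891, 10)   # identical draws after re-seeding

identical(draw_1, draw_2)     # TRUE
```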
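The family size feature discussed above is one line of arithmetic. A base-R sketch on a few made-up passenger rows (the values are hypothetical, not actual Titanic records):

```r
# Toy data: SibSp = siblings/spouses aboard, Parch = parents/children aboard.
passengers <- data.frame(SibSp = c(0, 1, 4),
                         Parch = c(0, 0, 2))

# Start each passenger at 1 (themselves), then add their relatives aboard.
passengers$FamilySize <- 1 + passengers$SibSp + passengers$Parch

passengers$FamilySize   # 1 2 7
```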
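Reading the variable importance plot amounts to ranking the importance scores the forest reports. The scores below are invented stand-ins that merely mirror the ordering described above (Title far ahead, then Pclass, then FamilySize, then Parch); real scores would come from the randomForest package's importance() function:

```r
# Hypothetical importance scores (illustrative only, not real model output).
imp <- c(Title = 120, Pclass = 35, FamilySize = 28, Parch = 9)

# Sorting descending gives the same ordering varImpPlot() shows graphically.
ranked <- sort(imp, decreasing = TRUE)
names(ranked)[1]   # "Title": the single most important feature
```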
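The 81.59 figure quoted above follows directly from the best model's out-of-bag error estimate of 18.41%:

```r
# Breiman's out-of-bag (OOB) error estimate for the best exploratory model.
oob_error_rate <- 0.1841

# Expected accuracy on unseen data is simply the complement of the OOB error.
estimated_accuracy <- 1 - oob_error_rate
round(100 * estimated_accuracy, 2)   # 81.59
```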
Info
Channel: David Langer
Views: 72,253
Rating: 4.9787798 out of 5
Keywords: Data Science, Data Analysis, Feature Engineering, Visualization, Data Wrangling, Data Exploration, R Programming, R Programming Tutorial, R Programming Training, Data Science with R, Data Scientist, Machine Learning with R, Programming, Tutorial, Training, Data Science Training, Data Science Tutorial, Machine Learning, Data Visualization, Data Science with R Programming
Id: UHJH7w9q4Lc
Length: 53min 0sec (3180 seconds)
Published: Sun Jan 24 2016