Introduction to Data Science with R - Data Analysis Part 3

Video Statistics and Information

Captions
Hi, welcome to this third video in my series on introductory data science with R. My apologies for the long hiatus between the first two videos. To be honest with you, I didn't expect to get the number of views on YouTube that I've gotten so far, so I've decided to redouble my efforts to complete this video series as I originally envisioned, and hopefully folks will continue to get value out of what I'm doing here.

With that said, a couple of housekeeping things. First and foremost, since the last video was made I've upgraded my Mac OS as well as R and RStudio, so things will look a little different than in the last video, but it shouldn't be anything difficult to reconcile. Just for the sake of completeness, you can see the version of R I'm running right now is 3.2.2 on the Mac, and I'm running version 0.99.489 of RStudio, in case you want to upgrade your system to match what I have here. All the code for this series is up on GitHub. What I'll do here is quickly run all of the code from the first two videos, and you'll see it generates a bunch of output and a bunch of plots. Go check out the first two videos if you haven't already seen them, or review them if necessary given the long gap between the second and third videos, to understand what's going on.

To recap: in video two we started going through the data analysis for the Kaggle competition related to the Titanic. This is an introductory competition on Kaggle, it's constantly ongoing, and I highly recommend that everybody I mentor getting into data science give it a try, because the data set is relatively small and relatively straightforward, yet it's still a relatively complicated problem to do well on in terms of accuracy. It's a classification competition, so the things you learn by executing on it are useful for a wide range of business problems, everything from fraud detection to customer segmentation; there are many, many business problems that are amenable to classification.

These are the data points available in the data set, and in video two we covered the analysis of all of the highlighted variables. We're going to pick up with these last four and close out the initial pass through the data. Then, in subsequent videos, we'll go back through and do more detailed analyses to figure out, for the variables we want to keep, how to improve upon them, and then get to actually training a model to see how successful our data analysis and feature engineering efforts were.

So we need to take a look at the ticket variable. The first thing I like to do when looking at a variable (let me maximize this) is a quick str, and as always, don't hesitate to use the help system. As indicated there, str stands for structure; it answers the question "what is this variable in R, what's its structure?" You can see that ticket is currently represented in the data.combined data set as a factor variable with 929 levels. That strikes me as not truly being a factor variable; with 929 distinct values it's unlikely that's the case. More than likely these should be treated as strings.
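For reference, a minimal sketch of that first inspection step, assuming (as in the earlier videos' GitHub code) the merged train/test data frame is named data.combined:

```r
# Inspect the structure of the ticket variable. It comes back as a
# factor with 929 levels -- a strong hint that it isn't really a factor.
str(data.combined$ticket)
```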
So that's what I'm going to do here. In this line of code I take the ticket portion of the data.combined data set and run it through the as.character function. As an aside, this is one of the things I wanted to mention: newer versions of RStudio have what Visual Studio calls IntelliSense, which is great; you can see it surfaces the documentation inline. as.character just transforms things from whatever representation they currently have into strings, that is, vectors of characters. I'll go ahead and execute that: take each factor level (the levels are in fact strings, as indicated by the double quotes), convert it away from a factor into a string, and cram it back into the ticket variable. This has the effect of transforming the factor into strings, and if we run the str code from above again, it now reports a character vector, 1:1309, of various strings. Great.

To get a flavor of what the data looks like now, we can ask R for elements 1 through 20. Remember that R, unlike many other programming languages, indexes from one. This syntax says: for the ticket variable on the data.combined data frame, give me indices 1 through 20, so it spits out the first 20 strings, which you can see in the output. The first thing that strikes me looking at this data is that there's not a lot of structure in it: we've got some bare numbers and some letter prefix codes of various kinds, and to be honest, without being an expert in how the shipping line that ran the Titanic decided to construct its ticket numbers, I have no idea what any of this means. As the data scientist, when you're working with data that is unknown or doesn't come with good metadata, sometimes you just have to play around with it to see what structure you can get out of it.

The first thing that strikes me is that I could grab the first character of each of these values, the As, the 1s, the 3s, the Ps, the Ss, and see if anything at all pops out at me. I can do that pretty easily in one line of code. This is where the power of R really starts coming in: once you get used to the syntax and the functions, you can do a lot of really useful work in very few lines of code, as opposed to an imperative language like C++, C#, or Java.
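Here is roughly what the conversion and quick inspection described above look like in code (same data.combined naming assumption):

```r
# Ticket isn't truly a factor; convert it to a character vector
data.combined$ticket <- as.character(data.combined$ticket)

# R indexes from 1 -- dump the first 20 tickets to get a feel for the data
data.combined$ticket[1:20]
```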
Let me unpack that first-character line piece by piece so you can understand it. The first building block is substr; again, don't hesitate to use the help system, your usual friend. substr is a built-in function that pulls chunks out of a string, very common if you've done any sort of programming at all. Here it grabs the ticket and says: starting at position one and stopping at position one, give me the first character, pull it out, and return it. The thing to remember is that, as with most functions in R, substr is vectorized: it's smart enough to know it's operating not on one string but on a whole bunch of strings, 1,309 to be exact, because as we can see, data.combined has 1,309 observations, one of whose variables happens to be ticket. This is great, because it means you don't have to write for loops all the time the way you do in other languages; Python of course has its own ways around that, but in classic languages like Java and C# you end up writing a lot of for loops. So: grab the first character out of each of the 1,309 strings in the ticket variable on the data.combined data frame, and return them.

Now, the substr call is actually wrapped in an ifelse. ifelse is basically a ternary operator, equivalent to ?: in languages like Java, C#, and C++, and it's very intuitive: here's my test condition; if it's true, do this; if not, do that. So here I check whether the ticket is equal to an empty string. I don't know for sure whether there are any empty strings, but since this was a factor with 929 levels and an empty string is a valid factor level, this is some defensive programming: if you happen to run into a ticket string that is empty, put in a space instead; otherwise give me the result of the substr call. As with substr, the ifelse function is also vectorized, so it knows this is not one logical test but 1,309 logical tests, which lets us iterate over all 1,309 tickets in one line of code and cram the results into a new variable called ticket.first.char.

I'll run that along with the unique function, which we've talked about before: whatever you give it as a parameter, it returns the list of unique values. You can see that my defensive check for an empty string turned out to be unnecessary, which is fine, and I get back all of the distinct first characters across those 1,309 tickets. Now I've got a much smaller set, maybe ten, fifteen, twenty values, which is reasonable to consider a factor variable. So I'll convert the ticket.first.char variable we just created and make it a factor. We've seen this before, but I'll pull up as.factor in the help system for the sake of completeness: it's basically a utility function that converts an array, list, or vector of things into an R factor variable. Then I'll stuff the result into a new variable, appropriately called ticket.first.char, on the data.combined data frame; that will just make some of the graphical analysis we're about to do easier.
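A sketch of those steps, using the variable names mentioned in the video:

```r
# Grab the first character of every ticket. Both ifelse and substr are
# vectorized, so this processes all 1,309 tickets in one expression.
# The empty-string check is defensive: "" is a valid factor level.
ticket.first.char <- ifelse(data.combined$ticket == "", " ",
                            substr(data.combined$ticket, 1, 1))
unique(ticket.first.char)

# Small set of unique values, so a factor is reasonable; add it to the
# data frame to make plotting easier
data.combined$ticket.first.char <- as.factor(ticket.first.char)
```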
As you've seen over and over in this series, we're going to use ggplot2 to visualize our data, because visualization is a great way to start data analysis, especially for classification problems: visual representations of the data are a good way to perceive patterns that could potentially be useful in training your models. I'll run that and, as before, zoom in, because bigger graphs are better than smaller ones.

Here we have a really high-level plot: along the x-axis are all the values we pulled out, the unique first characters of the various tickets, along with their counts, so I get an overall sense of the distribution. There are a lot more tickets that start with 3, 2, and 1 than anything else. The bars are color-coded by survival: the orange color means folks perished, and the turquoise means they survived. In general, it looks like slightly more people than not survived if they had a ticket beginning with 1, it's maybe close to fifty-fifty with 2, maybe a quarter or so with 3, and then it's a mixed bag, but P looks pretty good: if your ticket started with P, more than likely you survived. So on the surface, at this very high level, there is some signal that maybe there's structure in this data that can help me decide whether someone survived or perished on the Titanic based on the ticket.

Now, that may not be exactly intuitive, and sometimes in business problems, believe it or not, counterintuitive things actually are powerful predictors. That's arguably one of the big differences between applied data science in the business world and more purely scientific pursuits like statistical modeling and econometrics. Those latter two disciplines are focused not only on modeling accurately but also on providing explanations of why things are the way they are. For example, in econometrics: what is the statistically measurable effect of interest rate changes on employment, the economy, prosperity, whatever you can name? Those pursuits are much more focused on having statistically valid ways of explaining things, whereas often in the business world and applied data science, your business customers don't care as long as the thing works; they're not necessarily interested in the explanation. The reason I mention this is that sometimes you'll come across something in your data analysis that doesn't match your intuition yet is surprisingly effective, and if your particular problem is such that you don't need to explain why your model works, you just need to show that it does, then go for it, just roll with it.

But I'm a little skeptical here, so let's do another plot and drill in a bit. This first plot is just the high level: all the tickets in the training set, showing the relative counts of each first character and the relative proportion of survival.
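Roughly, the high-level plot described above. This sketch assumes, as in the earlier videos' GitHub code, lowercase survived and pclass columns and that rows 1:891 are the labeled training data; adjust the names to match your data frame:

```r
library(ggplot2)

# High-level view: counts of each ticket first character in the
# training rows, color-coded by survival
ggplot(data.combined[1:891, ], aes(x = ticket.first.char, fill = survived)) +
  geom_bar() +
  ggtitle("Survivability by ticket.first.char") +
  xlab("ticket.first.char") +
  ylab("Total Count") +
  labs(fill = "Survived")
```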
Now let's drill in and pivot all of that on pclass, which is the passenger class: whether you were a first, second, or third class customer, essentially the quality of your accommodations aboard the ship. What you see here is the plot broken out by pclass, the first class passengers, the second class, the third class, each showing the full range of ticket first characters. This starts to tell a more interesting story. For example, oddly enough, or maybe not, depending on the business rules in place when the tickets were issued, most folks in first class by far and away had tickets starting with the number 1, and as you can see, those folks disproportionately seem to survive; even more so with P, which is the second most common in first class. In second class, the most common start is the number 2, and you might think there's a pattern here: in first class lots of tickets start with 1, in second class lots start with 2. But the third panel of the plot shoots that down pretty quickly, because the single most common ticket first character in third class is also 2, so that pattern doesn't really exist. Moreover, as you look at these things, you start to wonder: there's an awful lot of turquoise here and here, comparatively, but we knew that already. We knew that people in third class, not surprisingly and unfortunately, perished far more often than folks in first class did in general, so you would expect most of these ticket bars to be predominantly turquoise in first class anyway.

So this is probably worthy of one more click down. What I'm going to do is take advantage of what we've learned to date: the most predictive variables we've seen so far are the combination of pclass and title, so let's pivot on that. It takes a second to run; we'll zoom in and full-screen it. This one's a little harder to read, but I think it's also more explanatory as to whether ticket actually has any underlying structure, any signal, that would make it a good way of knowing whether somebody was likely to perish or survive on the Titanic. And as you can see, it kind of doesn't, because we see the same pattern at a meta level: if you have the title of Master, which corresponds to a boy, a non-adult male, you're turquoise, meaning you survived, which we saw before. Miss, which ostensibly maps to unmarried adult females as well as female children, girls, is vastly turquoise, as before. So that's not really surprising, and generally speaking, what this shows me is that there's not much structure, not much signal, in the ticket variable, which matches our intuition. As I said before, sometimes that's not the way it works, but more often than not your intuition is pretty good; always double-check it with the data, of course, but here the data seems to match our intuition.
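The two drill-down plots just described can be sketched like this (same naming assumptions as above):

```r
# Pivot by passenger class
ggplot(data.combined[1:891, ], aes(x = ticket.first.char, fill = survived)) +
  geom_bar() +
  facet_wrap(~pclass) +
  ggtitle("Pclass") +
  ylab("Total Count") +
  labs(fill = "Survived")

# One more click down: pivot by the combination of class and title
ggplot(data.combined[1:891, ], aes(x = ticket.first.char, fill = survived)) +
  geom_bar() +
  facet_wrap(~pclass + title) +
  ggtitle("Pclass, Title") +
  ylab("Total Count") +
  labs(fill = "Survived")
```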
What that means is I probably wouldn't want to use ticket in my modeling, at least in the beginning. If there's not a lot of signal in the variable in question, it doesn't make sense to put it in the model, and that's because Occam's razor is very applicable to applied data science, and to modeling in particular: all things being equal, you tend to prefer simpler models over more complicated ones. That goes for the nature of the algorithm itself; for example, logistic regression, which is a classification algorithm, is far simpler than a deep neural network used for classification, and if they both produce about the same accuracy in your cross-validation, you'd probably prefer the logistic regression, because it's going to generalize better. Simpler models tend to perform better on unseen future data than more complicated models, in general; again, Occam's razor. The same thing applies to the data you use to train the model: if you can get the same level of cross-validation performance with a smaller data set, fewer rows or fewer columns or both, that's probably good, because it produces less complicated models, which again tend to generalize better to future unseen data. So ticket is out for the time being; maybe we'll add it back later, but for now we're probably not going to put it in the model.

Next up, let's talk about fares. This is the amount of money each passenger paid to ride on the Titanic. I'll start with the intuition that I have, which you probably share: wait a second, there's probably a very strong correlation between pclass and the amount you paid. First class tickets typically cost much more than third class tickets, just like on modern airlines, where first class costs more than business class and business class costs a lot more than coach or economy. But we want to be thorough; we don't want to leave any data on the table that could potentially help our model, so we should take a look. First and foremost, I'll take a look at the summary of the fare variable and spit out the number of unique values. Fare is obviously a numeric variable; I assumed that, though I could have double-checked explicitly with str, which would have said: it's numeric, you've got 1,309 of them, and here are the first few values. Instead I just asked for the summary statistics, and here's what I got. Some folks didn't pay any fare at all, which could be potentially interesting. Fully twenty-five percent of the fares, the first quartile, are about 7.896 or less; I'm assuming pounds sterling here, because I believe the Titanic was run by an English line. The median is about fourteen and a half pounds, so fully fifty percent of the passengers in the data paid less than fourteen and a half pounds. But you'll notice the mean is actually more than double the median, which means we'd expect this distribution to be skewed to the high end, and you have evidence of that in the max value of over 512 pounds; that's certainly going to skew things. So that could be potentially interesting.
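A sketch of that inspection:

```r
# Summary statistics for the fare variable, plus a count of unique values
summary(data.combined$fare)
length(unique(data.combined$fare))
```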
Not surprisingly, when we take the length of the unique values of fare, there's a large number of them. The reason I did this, as I illustrated in the previous video, is that sometimes even with a numeric variable, if there's a small number of unique values in your data set, you can transform it into a factor, which may be useful for certain analyses; it acts as a natural binning mechanism, where rather than creating numeric bins you use the actual unique values themselves as bins to help various types of algorithms. In this case there are way too many unique values, so I don't see any value in that.

So once again we go to ggplot2. Our intuition says we're going to have a skewed distribution here; let's validate that it's in fact present in the data. You'll see a warning message, which is new in this version of ggplot2 compared to past videos; it's basically saying "I tossed out a value", and that's because there's one NA. One NA out of 1,309 values isn't something I'm going to spend a lot of time worrying about, so I let ggplot take care of it, and it tossed it out. Zoom in, and you can see the distribution of the fares. We've got this outlier way out at 512 pounds, but you'll also notice some numbers going from 100 pounds on up. It's an almost exponentially declining distribution with a pretty long tail sliding out to the right, and, as you would expect, the large bulk of the folks on the Titanic, more than likely corresponding to third class, didn't pay a whole lot for their tickets; we saw from the summary statistics that fully half of the people paid fourteen and a half pounds or less. So a very long-tailed distribution, which matches our intuition from the summary statistics, and it's good to know that not many people were out there paying 500 pounds; that must have been one heck of a cabin, or stateroom, or whatever they got on the Titanic for that kind of outlier.

Let's drill down again to see if fare has predictive power. We'll jump to the plot we've seen before: grab the training data, because those are the records with labels, records 1 through 891; facet_wrap, essentially break out or pivot, the plot on the combination of pclass and title; and fill the bars with the colors for survived. This is basically the thing we've seen many, many times so far in this series. I'll zoom in; I know these plots are hard to read, and I apologize for that, but they're still visible enough that you can start taking a look. Some things at a meta level don't really surprise us here. For example, those with the title of Miss who happened to be in first class: very little orange, mostly turquoise, which is exactly what we expect; same for the title of Master in first class. Mrs in first class, which would be adult married women, is almost all turquoise with just a little orange, again what we would expect. And going all the way down to adult males in third class, with the title of Mr: yeah, not too great. There's a small blip here where you could ask, is there some signal there? Maybe.
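The two fare plots described above, sketched under the same naming assumptions:

```r
# Fare has too many unique values to treat as a factor; visualize the
# distribution with a histogram (the single NA fare triggers a warning)
ggplot(data.combined, aes(x = fare)) +
  geom_histogram(binwidth = 5) +
  ggtitle("Combined Fare Distribution") +
  xlab("Fare") +
  ylab("Total Count")

# Drill down: does fare add predictive power beyond pclass and title?
ggplot(data.combined[1:891, ], aes(x = fare, fill = survived)) +
  geom_histogram(binwidth = 5) +
  facet_wrap(~pclass + title) +
  ggtitle("Pclass, Title") +
  xlab("Fare") +
  ylab("Total Count") +
  labs(fill = "Survived")
```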
But that, again, is something I would worry about. If I actually tried to make my model pick out this one piece of data, saying "if you happen to be an adult male in third class and you happen to have paid this exact amount for your fare, then most of the time you survive", I would be worried about overfit. What I mean by overfit is that it's entirely possible to build a model that predicts your training data correctly 100% of the time. As I've talked about in previous videos, I could literally write an algorithm that says: if your name is this, I know you survived; if your name is that, I know you perished. That would be 100% accurate on your training set, but it would not generalize at all, because what if Billy Joe Jim Bob comes in, and Billy Joe Jim Bob is not a name that was in your training set? Your model wouldn't know what to do with that. That's overfitting. So creating a rule in your model that says "if you're an adult male in third class and you paid this fare, you're, say, 75% likely to survive", and then doing a simple random number generation and predicting based on how that rolls, is probably an instance of overfitting your data.

What I'm seeing here, in general, is nothing that tells me fare is going to add any signal, any additional predictive power, to my model over simply using the class. Take, for example, adult males in first class: these fare distributions don't really tell me anything, unless of course you're way out here with the outlier. This fellow out here paid over 512 pounds for his ticket, and using the axes, that corresponds to exactly one person, maybe two; if I had a rule that fit just that, I'd again be risking overfit. So I'd worry about using fare instead of pclass, or in addition to pclass; I don't think it's going to add a lot. As with ticket, I think we're not going to use fare in our initial modeling efforts. Now, of course, that's my opinion; part of learning data science is finding your own way, your own style, patterns, and techniques that you like. So there you have it.

OK, we've taken a look at fare and at tickets; now let's take a look at cabins. The intuition here would be that cabin potentially looks an awful lot like class, because you would imagine the cabins are denoted by some nomenclature that implies what deck, and therefore what class, you're in. Most typically, if you're in third class you're probably below decks, maybe even below the waterline, so you wouldn't have a porthole or a window or anything like that, whereas in first class you're probably up on the upper decks. It's the same on modern cruise ships: usually the more expensive, higher-class cabins are up above in the sunshine, and the cheaper cabins are down lower in the ship. So we can go ahead and check the structure real quick to confirm, and we can see that the cabin variable is a factor with 187 levels. Again, that's probably not really a factor variable; we should probably turn it into a string variable.
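A quick sketch of that check:

```r
# Cabin comes in as a factor with 187 levels -- almost certainly a string
str(data.combined$cabin)
```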
187 levels is really high, and by the way, one algorithm in particular that is extremely useful, and that we'll be using in this video series, is the random forest, and the default, the de facto I should say, random forest implementation in R cannot handle factors with more than 32 levels. So this is a problem on two levels, pun intended: first, it doesn't make sense, because at 187 levels it's probably a string variable; and second, it's not going to work with the random forest algorithm anyway. We should probably just change it to characters.

What we've got here is the same code we saw before with ticket: transform cabin from a factor into characters, stuff it back into the cabin variable on the data.combined data frame, and then spit out the first 100 values. Why did I pick 20 above and 100 this time? Purely from playing around offline before recording this video, asking how many I'd need to look at to get a high-level sense of the data. It turns out that with cabin, as we'll see in a second, you need a lot more dumped out, and now you can see why: a lot of records are blank. There was no magic in picking 100; I just kept increasing the number until I got a good sense of what the data looked like.

This output gives us a lot of good information about this particular variable. First, as I mentioned already, lots and lots of empties, so we'll need to deal with that. We also see some nomenclature, kind of what you would expect: a letter, which is probably the deck, F or G or C or D, and then a number. This is the opposite of what you'd see on a commercial airplane these days, where you have the row number and then the seat letter, but it's a similar idea; here it's letter and then number, which presumably reflects how the shipping line did things, though not being an expert in that space, I'm just guessing. Another interesting thing we notice is that an entry can actually list multiple cabins: this entry and this one are probably from the same group of folks, who had three cabins, C23, C25, and C27; these folks here had two cabins, D10 and D12. So now we've got a couple of different pivots from a structural perspective. Not only do we have the idea that these letters likely represent decks, which may actually be predictive, because the intuition would be that the further up the decks you were, the higher in the ship above the waterline, the closer you were to the lifeboats, and you were probably also a wealthier, more influential person, all of which probably combines into a higher survival rate than folks below decks. I'm also wondering, and I don't know for sure, whether the blanks are indicative of folks in third class. As I recall, I think I remember seeing a picture of third class accommodations on the Titanic, and some of them were rooms with multiple bunks, so maybe a blank cabin is indicative of third class, potentially.
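The conversion and dump described here, sketched:

```r
# As with ticket, cabin isn't truly a factor; convert to characters
data.combined$cabin <- as.character(data.combined$cabin)

# Dump the first 100 values -- many are blank, so 20 wasn't enough
data.combined$cabin[1:100]
```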
And since we know there were a lot more people in third class than in the other classes, that may reinforce this idea. But this alone probably won't give us a lot of structure, so what we're going to do is massage this data a little, as we did before with ticket and fare, and see if there's anything to it. The first thing I'll do is take all of the cabins that are blank and replace them with a "U" for unknown. You can see that here; as we've talked about previously, which acts like a SQL WHERE clause. This is like saying: go through the data frame and find me every place where the data.combined cabin variable equals the blank string, and give me those indexes so I can index into the data frame. That gives me all the indices where cabin is a blank string; then grab the cabin variable at those positions and cram a "U" into it. Do that, then double-check that the code worked correctly by running the dump again, and you can see that all the blanks we saw above are now Us. Great.

Then we'll take the first letter again. That makes sense at a high level: if there's any signal in this variable, it's most likely going to be denoted first and foremost by the deck, because a differentiation in survivability between, say, cabin C23 and cabin C103 strikes me as something that would be prone to overfitting. And you're going to hear me talk a lot about that going forward in this series: you don't want to overfit; overfitting is bad, bad, bad. So I'll start by creating a new variable called cabin.first.char that's just the first character of every cabin, and I don't need the defensive programming we saw earlier with ticket, because I already took care of the blanks up here. Let's run that piece of code. Great. Then I can check the structure: I made it a factor and it's got nine levels. I can also use the levels function if I want to explicitly list all the levels; oops, let me actually run the whole line, there we go, and we see all the expected labels on our new factor variable: decks A, B, C, D, E, F, G, T, and then unknown, U. Great. Let's cram that into a variable that we'll add onto the data.combined data frame as before; that just makes some of the ggplot work a little easier. We'll run that real quick.
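Those steps, sketched with the names used in the video:

```r
# Replace empty cabins with "U" for unknown -- which() acts like a
# SQL WHERE clause, returning the matching row indices
data.combined[which(data.combined$cabin == ""), "cabin"] <- "U"
data.combined$cabin[1:100]

# If there's signal here, it's most likely in the deck letter, so take
# just the first character (no defensive ifelse needed -- blanks are gone)
cabin.first.char <- as.factor(substr(data.combined$cabin, 1, 1))
str(cabin.first.char)
levels(cabin.first.char)

# Add to the data frame to make plotting easier
data.combined$cabin.first.char <- cabin.first.char
```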
As before with ticket, we'll do a high-level plot first. This is just saying: for each level of cabin.first.char, denoted by the x in the code, create a bar chart, and within each bar, color-code by survival. Nothing fancy, the same thing we've seen many times before. Zoom in, and here we see something potentially interesting that, on reflection, isn't necessarily all that interesting. At first blush these survival rates seem worth a look, and we should probably click down, and we will; but having spent as much time with the data as we have, this is where, as I said before, it starts to speak to you. I can look at this plot and say: nothing really jumps out at me, because yes, a lot of these individual bars seem to show disproportionate survival, but I already know these more than likely correspond to the classes anyway. I don't know that for sure, but based on the relative proportions of the population, I know that relatively small numbers of folks were in first and second class, and if deck A happens to be the top-level deck, which is probably likely, and corresponds to smaller numbers of folks, then these folks up here are probably first class, these here probably second class, and these folks here predominantly third class, and the proportions of survival are kind of what I'd already expect, because I've been spending so much time getting steeped in my data, and that's so, so important. But we'll drill in anyway, because we're data scientists and we don't want to run on our guts; we want data to always drive what we're doing.

So here's the next click down: let's validate our intuition that the lower letters actually correspond to first class, because they're higher in the ship, farther above the waterline. And that's borne out right here: A through E are very predominantly first class, with very few people in first class, relatively speaking, having no cabin defined, and then you start to see a big shift: the vast majority of folks with no cabin defined are in second and third class, and as you go through the later letters, they're predominantly the lower-class tickets. Here you can see D, E, and F in second class predominantly survived, but that's not surprising because they're second class, while E and F in third class didn't do so well, but that's third class. So this is starting to jibe with our intuition, which says maybe cabin isn't going to be super helpful for us, because there's a strong correspondence between the cabin decks and the class anyway, and as we know, class is obviously a very powerful predictor of survivability. We'll click down one more time.

Now let's run this one, just to be thorough: our plot pivoted on pclass and title. Again, nothing here really shocks us; there's no real discernible trend. For example, adult males in third class: if you were lucky, it looks like one dude with a cabin starting with E survived, while a couple of guys with F cabins didn't. If you had rules in your model specifically for those instances, I would say it's very, very likely you're going to overfit. And over here this distribution makes a lot of sense, because we know already that, generally speaking, adult males in third class didn't have very good survival rates. The rest of this just doesn't show a lot of signal that I'm seeing: first class adult females basically all survived, which we already knew, so I won't belabor it.
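The cabin plots described above, sketched the same way as the ticket plots:

```r
# High-level plot of survival by cabin first character
ggplot(data.combined[1:891, ], aes(x = cabin.first.char, fill = survived)) +
  geom_bar() +
  ggtitle("Survivability by cabin.first.char") +
  ylab("Total Count") +
  labs(fill = "Survived")

# Drill down by class, then by class and title
ggplot(data.combined[1:891, ], aes(x = cabin.first.char, fill = survived)) +
  geom_bar() +
  facet_wrap(~pclass) +
  ggtitle("Pclass") +
  ylab("Total Count") +
  labs(fill = "Survived")

ggplot(data.combined[1:891, ], aes(x = cabin.first.char, fill = survived)) +
  geom_bar() +
  facet_wrap(~pclass + title) +
  ggtitle("Pclass, Title") +
  ylab("Total Count") +
  labs(fill = "Survived")
```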
But one thing we did identify that could be interesting to check is: what about folks with multiple cabins? The intuition would be that if you've got multiple cabins, you're probably pretty well-to-do, because if you don't have a lot of money, you're going to cram as many folks as you can into one room rather than book more staterooms. Either way, it's a reflection of having a lot of money or having a larger family, and more pounds, more wealth, intuitively means more likely to survive. But let's check it out anyway, because again, we're data scientists and that's what we do.

Here we have another interesting line of code showing the power of R for data manipulation tasks, which is why I personally prefer it; I do work in Python, but I generally prefer R for this expressiveness of the language. We've seen ifelse already, and we know what as.factor is, so I won't belabor those. Let's take a look at str_detect; pull that up in the help. str_detect is a function in the stringr package, which we introduced, as I recall, in the first video. stringr is written, I believe, by Hadley Wickham, a guru and the author of many commonly used and powerful R packages, and this one makes string manipulation a lot easier. Specifically, we're going to use it to go through the cabin variable of the data.combined data frame, and as with the other functions we've talked about, str_detect is vectorized: it knows automatically that we're talking about 1,309 string values, so we're asking it to do 1,309 detections. We're saying: look at all the cabins and tell me whether you can find a blank space in each one, because we saw from the earlier data dump that, in general, multiple cabins are separated by a space. Now, we've got some dirty data here; ostensibly there's an "F G7", and I think that's OK because we're just doing a high-level analysis right now. If we decided to use multiple cabins as a predictor variable, we'd probably want to do more in-depth analysis and data cleansing on this variable, but for right now I'll just assume that's actually two cabins, even though the first one isn't fully filled out, it's just an F. The ifelse then takes that detection as its test and says: if a space is there, return "Y", otherwise return "N"; yes it's a multiple-cabin entry, or no it's not. We make that a factor and cram it into a variable that we add directly onto the data.combined data frame, called cabin.multiple. Run that, and now we can plot it.

I'm going to skip the intermediate drill-downs, because I don't want to belabor the idea now that we've seen it twice, and just plot this at the lowest level of granularity, pivoting on pclass and title simultaneously. And here we have it. Unfortunately, at least in the States, we tend to think of things as "yes" and then "no", but by default ggplot orders these things alphabetically, so these are the Ns and these are the Ys; just keep that in mind (for those of you outside the States, maybe that ordering isn't unintuitive). What we can see here is, again, nothing that really pops out at us. As we would expect, most adult males in third class did not have multiple cabins, not a shocker, and very few of them survived; we already knew that. On the flip side, looking at the Y columns, the second in each panel, there's not a lot there: multiple cabins are a very rare thing, and given that rarity I would be worried about its generalizability. That is, I would be worried that if we coded rules specifically to this variable at this point, we would be prone to overfitting.
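That feature creation and plot, sketched:

```r
library(stringr)

# str_detect is vectorized: check all 1,309 cabins for a space, which
# (per the data dump) indicates multiple cabins were listed
data.combined$cabin.multiple <- as.factor(ifelse(
  str_detect(data.combined$cabin, " "), "Y", "N"))

# Straight to the lowest level of granularity: pclass and title
ggplot(data.combined[1:891, ], aes(x = cabin.multiple, fill = survived)) +
  geom_bar() +
  facet_wrap(~pclass + title) +
  ggtitle("Pclass, Title") +
  ylab("Total Count") +
  labs(fill = "Survived")
```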
So I'm not seeing anything here either, at least on the first pass; maybe we'll revisit it later, but for the first pass of modeling I don't think we need to worry too much about it.

Lastly, and this will close out our first pass through the variables, let's take a look at where you boarded the Titanic: where you embarked. What we see here is the structure of the data.combined embarked variable, which is a factor with four levels, and looking at the levels of that factor, some of them are blank, then C, Q, and S. If we tab over to our definitions, C corresponds to Cherbourg, Q is Queenstown, and S is Southampton. My intuition is that where you got on the boat isn't really going to matter. Now, statistically, maybe there's a high correlation between folks of certain economic backgrounds disproportionately embarking from certain locations, but I think that would be a bit tenuous. Still, let's check it out to make sure we're doing our due diligence. Since there are only four levels here and one of them is blank, I'm not going to worry about populating the blanks with something like a "U"; I think this is fine as is. So we'll go ahead and plot it and take a look, again at the base-level drill-down, the pivot on pclass and title. And there you have it: this one here is unknown, and then Cherbourg, Queenstown, and Southampton, and again, nothing really jumps out at us, which matches our intuition. Interestingly enough, adult males in first class disproportionately came from only Cherbourg and Southampton; it looks like not many folks came from Queenstown. If anything, the signal you would pick up by focusing on embarked would be: if you happened to embark from Queenstown, you were disproportionately likely to have been a third class passenger. That's probably the single most powerful thing that jumps out at me, because very few folks in the entirety of first and second class came from Queenstown. But from a survivability standpoint, which is what we're actually interested in predicting here, there's not much going on, because as we know, if you had the title of Miss in first class, you predominantly survived, which is what this shows. If the orange here were concentrated in Queenstown, for example, that would be potentially interesting, but again, you would have to weigh that against the risk of overfitting. So again, I would say that in our first pass of modeling, embarkation is not going to be that important.
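A sketch of the embarked check described above:

```r
# Embarked is a factor with four levels: "", "C", "Q", "S"
str(data.combined$embarked)
levels(data.combined$embarked)

# Only four levels, one blank -- plot directly at the base drill-down
ggplot(data.combined[1:891, ], aes(x = embarked, fill = survived)) +
  geom_bar() +
  facet_wrap(~pclass + title) +
  ggtitle("Pclass, Title") +
  ylab("Total Count") +
  labs(fill = "Survived")
```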
So, to summarize: if we look at data.combined here, which contains all of the feature engineering work we've done to date, we're going to use pclass; we know that's super important. Name is not particularly useful on its own, as we saw in the first video. Sex we'll effectively get from our derived title variable, so we won't use sex directly. We've also determined that age has way too many missing values, and we don't want to use it if we don't have to; we found that title is a good proxy for the combination of sex and age, making it a potentially powerful single feature. We've taken a look at sibsp and parch, and we know there's some stuff in there, so we'll probably take a deeper look at both of those in the upcoming videos. Ticket we've ruled out, fare we've ruled out, cabin we've ruled out, and embarked we've ruled out. We know, as we talked about, that title is extremely important along with pclass; those two things are very, very predictive together. Family size looked like it had a lot of potential as well, and one of the things we'll take a look at in future videos is potentially the ratios between family size and sibsp and parch, for example; there may be some signal in that. ticket.first.char we ruled out, cabin.first.char we ruled out, and cabin.multiple we ruled out.

So this is the end of video 3. We've done a lot of goodness here; hopefully you're finding this useful and starting to see some of the method in the madness and the approach. I look forward to the next video, where we're going to dive in deeper on a smaller subset of our features and see if we can get to the point of building a model. Until next time, happy data sleuthing!
Info
Channel: David Langer
Views: 84,893
Rating: 4.9667015 out of 5
Keywords: R (Programming Language), Data Science, Data Analysis, Feature Engineering, Visualization, Data Wrangling, Data Exploration, R Programming, R Programming Tutorial, R Programming Training, Data Science with R, Data Scientist, Machine Learning with R, Programming, Tutorial, Training, Data Science Training, Data Science Tutorial, Machine Learning, Data Visualization, Data Science with R Programming, language, tutorial, programming
Id: aMV_6LmCs4Q
Length: 55min 32sec (3332 seconds)
Published: Sun Jan 17 2016