Introduction to Data Science with R - Data Analysis Part 1

Captions
Hi, my name is Dave Langer. Welcome to this first video in a series of tutorials on introductory data science using R. I'd like to start with some information about myself. I don't plan on doing this with every video, but I figured I would on the first one in case it's helpful for folks. As I mentioned, my name is Dave Langer, and I'm on LinkedIn in case you want to look me up. I work at Microsoft, and the only reason I mention that is that if folks looked me up on LinkedIn and found that out, they might wonder why I didn't say so, so I'm mentioning it up front. I have no ulterior motive other than the fact that I'm really into data science. As you'll see, I have no technical religious affiliation: even though I work at Microsoft, I love all kinds of technology, like R, Python, and Java, as well as things like C#, SQL Server, and Azure Machine Learning, so you won't get a lot of bias one way or another. I'm pretty interested in all technologies, and I try to use the right tech for the problem at hand wherever possible. In general, I've been working in IT for 17 years, with a two-year stint in a product group as the only exception, working on SQL Server in particular; if you've ever used the database projects in Visual Studio, I worked on that team for about a year. Just to give you some background, I've spent most of my time in IT working in software development, business intelligence, and data warehousing, and that's what really sparked my interest in data science about a year and a half ago, because I see data science, and machine learning and data mining in particular, as the next step in value beyond traditional business intelligence and data warehousing. You can think of BI and DW as a backward-facing, looking-back-in-time kind of analysis, whereas data science is largely about taking information and actually projecting forward in time, which is quite a bit more interesting, at least to me, and I'm assuming to you as well, otherwise you wouldn't be watching this video. Lastly, I do want to apologize up front: I consider myself quite the comedian, so I may sprinkle this video with jokes, and your mileage may vary on how funny you find them.

OK, so let's move on to the goals of this series. First and foremost, I want to make sure folks watching understand what the focus is going to be, so I'm not wasting their time. I'm going to focus on quote-unquote small data, and it's not because big data isn't cool; it absolutely is. It's just that I imagine my experience is like a lot of other people's, which is that you don't really face big data problems in your daily life in the business you support or operate. For example, I work at Microsoft, so as you might imagine we have lots of places where there is big data: Azure, Xbox, Bing, and so on. But where I work, in the supply chain organization in particular, we don't really have any big data problems. However, that doesn't mean you don't get lots of business value from doing data science on small data. In fact, I would think that if you look at the Fortune 1000 overall, you're likely to get a disproportionate amount of business value from small data, especially compared to the amount of effort you have to put in to get the value out; big data problems tend to be more difficult, quite frankly, so small data may be a bigger bang for your buck. OK, another goal of this series is R in particular: using R for data analysis and predictive modeling. As I mentioned before, one of the things that gets me really excited about data science is this ability to move from looking backwards and understanding what happened, to using data to predict or shape what will happen, and I think that's pretty cool. So I use R quite a bit.
I also use Python, but I tend to be more comfortable with R at this point in the game, and I also like its abilities for making visualizations. I'm sure you can do the same things in Python, but I'm more familiar with R, which is why I chose it for this series of tutorials. Lastly, if you're not already, I want to get you hooked on data science. I didn't write this explicitly down in the Word doc here, but I really hope these tutorials will help folks in ways that I wish I could have been helped when I first started teaching myself data science, predictive modeling, and data analysis about a year and a half ago. I had a lot of struggles, but it's great, and I love being involved in data science problems, so hopefully you'll get hooked on it as well. I may hit other things later in this series; I've been thinking about maybe doing some text analysis videos with Python or something like that, but for right now we're going to stick with R and some predictive modeling. OK, let's go ahead and get started.

I'm going to assume that you're reasonably competent technically, so I'm not actually going to go through the details of getting all this set up, but I'll give a little bit of an overview. What we'll be using in these videos is RStudio. You can get RStudio pretty easily, and it's free: just go to rstudio.com and download RStudio Desktop, the powerful IDE. It's multi-platform, so pick your poison. I'm using a Mac right now; you may be wondering why, since I do love Microsoft products. It just happens that I have a really good video recording package on my MacBook, and that's the only reason I'm using my Mac right now. Everything we do is equally applicable on Windows or Linux, so pick your poison. You'll also need to download R and install that as well; R is the language on top of which RStudio sits, and as you can see it's relatively easy to get. So I'm going to pause the video here for a second, and that's nothing more than a cue for you to pause your own video on YouTube and go ahead and download R and get it set up before we move on.

OK, I'm going to make the assumption that you've got R and RStudio set up and working. The next thing we're going to do is get over to the Kaggle website. The reason is that I love Kaggle myself; it's one of the things I've been using to help propel my skills in the data science arena, and it also provides some very interesting data sets that let us explore some very cool things in this series of videos. Joining Kaggle is free and very simple: all you do is go to kaggle.com, click Start Competing up here, and go to the competitions. Kaggle offers competitions where you can compete and earn prizes, and they also offer some general knowledge-based competitions that you don't actually win anything for, but that give you an opportunity to do some real data science work. In particular, we're going to use the Titanic competition in this series of videos, and the reason is pretty simple: everyone should be relatively familiar with the Titanic, at least in the US and probably Western Europe, and probably all over the world thanks to James Cameron. It was a tragedy that happened in the early 1900s: an ocean liner sank, as you can see pictured here, and a lot of folks perished. In data science it's really important to understand the problem domain so you have context when you're looking at the data, and Titanic is a very simple, very well-known problem, so it gives a lot of people context, which is one of the reasons I selected it. You click on Data here and you can see the various files.
What we're interested in here are the train and test files. When you click to download each of these files from the links here, if you're not already a member of Kaggle you'll be asked to create an account, so go ahead and do that and get the train and test files downloaded to a location on your hard drive in preparation for the next part of the tutorial. There are other files here that you may want to take a look at later, but we're going to ignore them; if you're curious, they're essentially other attempts at creating a machine learning solution that solves the problem. Oh, and I should probably mention what that is, sorry about that: the point of the Titanic competition is to take the data provided by Kaggle and build a machine learning model that accurately predicts who will survive and who will perish. These other files are example attempts at doing exactly that, which is why you see "model" at the end of the file names. You can take a look at those on your own time if you'd like, no problem, but we won't be doing a lot with them right now. OK, so I'm going to once again pause to give you a chance to download these files; go ahead and pause now on YouTube, and I'll come back in just a bit and we'll move on to the next thing.

OK, cool, we're back. We're going to keep this page up in the background in Chrome, because it's essentially our data dictionary, or metadata, and it'll be extremely helpful for what we're going to be doing next. But without further ado, let's pop into R and start doing some fun stuff. So here's RStudio. Some hardcore R people may think I'm kind of a weenie for using RStudio; I, however, love RStudio. It makes R much more usable, much more user-friendly, and that's actually a concern: R is not exactly the friendliest
programming language in the world, but it is extremely powerful. So the first thing I'm going to do, now that I've got RStudio set up, is set my working directory. This tells R where I've stored all of my data files, so I don't have to use full paths every time I want to work with a file. I'll choose a directory here, and you'll notice down in the console that that series of menu clicks translated into a command for the R engine, which essentially said: set my working directory to my home directory on my Mac, the kaggle folder, and then the titanic subfolder. OK, cool. Now, R is awesome for working with data. Once you get good with R code, and I don't claim to be a master of it by the way, you can do a ton of stuff with very few lines of code. For example, reading in CSV files is dead simple in R: you literally use what's known as the read.csv function and provide it the file name. Of course you could use the entire file path, but I set my working directory to avoid that, and since I know the CSV files have headers, I just say header = TRUE. I can run this code by highlighting these two lines and clicking Run up here, and you'll see down here that it gets executed in the R console automatically. RStudio also depicts my environment, showing which variables I have loaded in memory: a variable called test and a variable called train. If I click on these, I get a visual exploration pane of the data, and you'll notice that it's very spreadsheet-like; it's essentially a table of data, or a matrix if you're mathematically inclined. Behind the scenes, if you hover over a variable here in the RStudio UI, you'll see its data type.
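The steps just described look roughly like the sketch below. The file names are the ones downloaded from Kaggle, and the working directory path is only an example; the live demo at the end uses a tiny inline CSV so it runs anywhere.

```r
# The two Kaggle files are read like this once the working directory
# points at the download folder (example path, commented out here):
#   setwd("~/kaggle/titanic")
#   train <- read.csv("train.csv", header = TRUE)
#   test  <- read.csv("test.csv",  header = TRUE)

# Self-contained demo of the same read.csv call on a tiny inline CSV:
tmp <- tempfile(fileext = ".csv")
writeLines(c("PassengerId,Pclass", "1,3", "2,1"), tmp)
demo <- read.csv(tmp, header = TRUE)
nrow(demo)   # number of data rows read: 2
```

With the real files, `nrow(test)` comes back as 418, matching the environment pane in RStudio.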
RStudio says that test is a data.frame. A data frame is a data type inside R that is specifically designed for handling tabular data. As you might imagine, and as those of you who have worked with databases or spreadsheets know, a lot of the world's data is tabular in format, and R works very well with tabular data. You can see here that we've got 418 observations of 10 variables in our test set, along with the variable names and some data values. We can also take a look at the train set, and this is important: you'll notice that where test has Pclass as its first column of variable values, the train set instead has Survived, and that's pretty important in machine learning terms. What we're trying to do here is build a classification model; literally, we're trying to have the computer build an asset from this data that enables it to say, with a certain amount of certainty, whether a passenger perished or survived the sinking of the Titanic. What differentiates our train data from our test data is that we have these label values that say, yes, Mr. Owen Harris Braund did not survive, but Mrs.
John Bradley Cumings did survive, and here are all the data points associated with her. That allows the computer to use a machine learning algorithm, or algorithms, to learn from the data the patterns between who survived and who did not. This is a pretty key point: you use machine learning or data science techniques when the pattern in the data, the signal you're trying to understand, is either too complicated or would take too long for human beings to divine by hand, examining the data, understanding the patterns, creating algorithms or logic to implement the patterns, and doing the classification. That's what machine learning is really about, and we'll see that more as we go through the data analysis. OK, great, so we've got our data loaded into our test and train sets. Now, for my overall data analysis I want to combine these two data sets into one big superset of data, so we're going to go through a little R code here (I've already written it, because I'm sure you don't want to watch me type) to start combining them. The first thing you'll notice over here is that the train set has 11 variables, because it has the Survived label, and the test set does not. We're going to train our machine learning model with this label, the 11th variable, and then we'll have the test set to decide whether the model we built is any good. But to combine these two into one set, I need to add a variable to the test set, and that's what I'm going to do with this line of code; we'll take a look at it piece by piece. The first thing I'm going to do is use the data.frame function, which, intuitively enough, allows you to create a data frame.
Let's take a look at it. One of the cool things about R is that it has a pretty rich help system. For example, if you ever want to know more about the data.frame function, you can go down to the console, type ?data.frame, and hit Enter, and over here you'll get help on the data.frame function from the R help system. This is indispensable, quite frankly, as you're ramping up on doing data science with R: the R help system is your friend, and it covers a wide variety of topics, from functions to data types to syntax. It's just immensely helpful. So we're using the data.frame function to create a data frame, and in particular what we're saying is: I want to add a variable to this data frame that I'm creating, called Survived. That makes sense, because as we saw, there's a Survived variable in the train data set, and we need to add one to the test set to make it 11 variables, on par with the train set. Now, here I'm using another function from R, and let's look at that in the help system too: the rep function, which allows you to replicate elements; essentially, it's a way to repeat values. So what I'm saying is: hey R, repeat the value "None", and the number of times I want you to repeat it is equal to the number of rows in test, which is what nrow stands for. Again asking the help system, the nrow function returns the number of rows (or columns, in the case of ncol) of its argument. So the number of rows in the test set should come back as 418; repeat the value "None" 418 times and assign that to the Survived variable. Then I want to combine that variable with the test variable, and this syntax deserves some explanation. Data frames can be indexed using this kind of syntax between square brackets, which essentially tells R: I want to subset the test data frame by rows and columns. For example, if I was interested in the first row and the fourth column, I would write test[1, 4], and I can run this code by highlighting it and again hitting the Run button, or Alt+Enter or Command+Enter depending on your operating system. What you'll see is that a value gets spit back, and that's the cell that sits at the first row and fourth column in the view of this data set; you can check it, first row, one, two, three, yep, fourth column, and there's the value. That syntax accesses one thing; if I leave the subscripts blank, I'm telling R to use all the rows and all the columns. So basically what this code is saying is: take the entire table, the entire data frame of test, all 418 rows of 10 columns, or in R-speak 418 observations of 10 variables, combine it with a list of 418 strings of "None", return all of that as a big data frame, and assign it to a new variable called test.survived. For those of you who are programmers at heart, you may be wondering what the <- symbol is: it's the equivalent of the equals sign, a bit of an R oddity that says assign the result of this code statement to this thing here, which is basically a new variable. We'll go ahead and run this line of code, and sure enough, we've now got a new test.survived variable of 418 observations of 11 variables; take a look at it and you'll see "None" all the way down. Sweet. Now that the dimensions are the same between this modified version of the test data frame and the train data frame, both with 11 variables, I can combine them using the rbind function, which stands for row bind. Let's hit the help system again.
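The subsetting and replication just described can be sketched on a toy stand-in for the real test set (the actual data has 418 rows and 10 columns; these values are illustrative):

```r
# Toy stand-in for the Kaggle test data frame (illustrative values).
test <- data.frame(PassengerId = 1:3,
                   Pclass      = c(3, 1, 2),
                   Name        = c("A", "B", "C"))

test[1, 2]                # single cell: row 1, column 2
test[, ]                  # blank subscripts: all rows, all columns
rep("None", nrow(test))   # "None" repeated once per row

# Prepend the new Survived column, as described in the video:
test.survived <- data.frame(Survived = rep("None", nrow(test)), test[, ])
ncol(test.survived)       # one more column than test
```

On the real data, the same pattern produces the 418-observation, 11-variable test.survived frame.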
You can see there's also another function, called cbind, which intuitively enough stands for column bind, versus rbind. What rbind is doing is telling R: take the train data frame, 891 observations of 11 variables, as a table, and append to it, row by row, the test.survived variable, which is 418 observations of 11 variables. What that should give us in the end is a combined data frame, a table with 1309 observations of 11 variables. So let's run that line of code, and sure enough, that's exactly what we get. Take a look at it and you can see the zeros and ones, the labels, and if we scroll down far enough we get into the Nones: we've just appended them after the 891 train records, so rows 892 on down are the test records. Great, so now we have both sets of records combined into a single data set. OK, cool. That's a pretty long explanation for only four lines of code, but those lines do things that are pretty integral to data analysis and that would take many more lines of code in something like Java, for example. Now that we've got our combined data set, we should probably talk a little bit about types in R. So let's take a look at the str function, and once again pull it up in the help system. The str function essentially allows you to ascertain the structure of an arbitrary R object: what is its data type, describe this instance of this data type to me. So we'll run the str function on our combined data set and see what R tells us it's built out of. OK, you can see here that R says the data.combined variable is a data frame of 1309 observations of 11 variables, which we already knew, but it's good to confirm, and that it's composed of these particular variables and variable types. So the Survived variable is a character; you can see values like 0, 1, 1, 1, 1.
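Put together, the four lines of code referenced above look roughly like this, sketched on toy stand-ins (the real sets have 891 and 418 rows and 11 and 10 variables):

```r
# Toy stand-ins: train carries the Survived label, test does not.
train <- data.frame(Survived = c("0", "1", "1"),
                    Pclass   = c(3, 1, 2),
                    Name     = c("A", "B", "C"))
test  <- data.frame(Pclass = 2, Name = "D")

# Add a placeholder label so both frames have the same variables,
# then stack them row by row with rbind.
test.survived <- data.frame(Survived = rep("None", nrow(test)), test[, ])
data.combined <- rbind(train, test.survived)

nrow(data.combined)   # 4 here; 891 + 418 = 1309 on the real data
str(data.combined)    # structure: variable names and types
```

As in the video, scrolling past the train rows shows the placeholder "None" labels for the appended test rows.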
That makes sense, because we know that around row 892 we go from zeros and ones to the word None, since we put that in and then did our rbind to combine them, so it makes sense that Survived would be a character string. We can see the Pclass variable is an integer, with its various values, and then we look at Name and we see this thing called a factor. If you're not familiar with R, that may be a little confusing, but no problem: ask the help system about factor. Note that R is case-sensitive: apparently it didn't like a capital F (I didn't know that), but it did like lowercase f, and you can see the help system gives you some help on what factors are. The factor function is used to encode a vector as a factor, and here's what's really key for understanding them: factors are how R denotes categorical, or enumerated-type, variables. If you're an old-school programmer who used C or C++, or any of the derivatives of C like Java or C#, think enums: same idea, which is that you can define a variable or data type that only takes on a limited set of discrete values. Another way to think about it is that factors are akin to dropdowns in a web UI. If you've ever been on an e-commerce site and had to pick your state, province, or country of residence, you usually get a dropdown with a finite number of values; in the United States you'll primarily see 50 values for the 50 states of the Union. So factors are a way of encoding numeric or character data into a discrete data type that can then be used as part of your data science and machine learning activities, because as you might imagine, computers don't like strings in general, and machine learning algorithms like them even less, so factors are a way of translating these things into a representation that works well. You'll also see that the Sex variable here is a factor with two levels.
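The factor idea described above can be seen in a couple of lines; the values here are just illustrative:

```r
# A factor encodes a limited set of discrete values, like an enum
# in C/Java/C# or a dropdown in a web UI.
sex <- factor(c("male", "female", "female"))
levels(sex)       # the finite set of allowed values: "female" "male"
as.integer(sex)   # the underlying integer codes: 2 1 1
```

The integer codes are why factors work well for machine learning algorithms that dislike raw strings.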
The two levels are female and male, which makes sense. There's an Age variable, which is considered a numeric, and among the numeric values you'll see NA. You may be wondering what NA means: if this is a numeric, why is there a string value in there? The answer is that NA denotes, in R, the absence of a value; if you're familiar with programming or databases, NA is equivalent to null. Every other value in this list is numeric, but this one was missing, so NA denotes that there is nothing there. You can also see the SibSp and Parch variables, which are both integers, and so on and so forth. OK, so by default this is how R interpreted the data we read from the CSV files we downloaded from Kaggle. That's not all particularly useful as-is; we're going to have to change this around a little bit, but I did want to introduce the variables and variable types a bit so you understand why we'll be doing the things we do later in this series of videos. OK, so let's go back to Chrome real quick and take a look at these data types in terms of their definitions. Pclass is currently an integer in R, but you see here that Pclass actually represents the class of ticket the passenger bought: whether they were a first-class, second-class, or third-class passenger. In fact, you can see down here in the special notes that Pclass is actually a proxy for socioeconomic status, which is a very politically correct way of asking, essentially, were you rich or were you poor? OK, cool, so this probably shouldn't be an integer value; it's not really a number, and we're not actually going to do any multiplication or division on the Pclass values. It really is a factor: you're either first class, second class, or third class, and that's essentially all we need to know.
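Since Pclass is really categorical, the re-encoding described here looks roughly like the following sketch; data.combined is a toy stand-in for the real 1309-row frame:

```r
# Toy stand-in for the combined data frame.
data.combined <- data.frame(Survived = c("0", "1", "None"),
                            Pclass   = c(3L, 1L, 2L))

# Re-encode the integer class and the character label as factors,
# using $ to address a single column of the data frame.
data.combined$Pclass   <- as.factor(data.combined$Pclass)
data.combined$Survived <- as.factor(data.combined$Survived)

str(data.combined)               # both columns now show as Factor
levels(data.combined$Survived)   # "0" "1" "None"
</imports>
```

Running str again on the real data shows Pclass going from int to Factor, exactly as in the video.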
So we can go ahead and tell R that we want to change the Pclass variable using this line of code. What we're saying is: take the Pclass portion of data.combined, and you use a dollar sign in R to address the Pclass portion of the data.combined data frame, and turn it into a factor. So we'll run this line of code, and then we can see that Pclass, which used to be an int, is now indeed a factor. OK, that's good. We can also take a look at Survived. We know Survived shouldn't be a character string, because as I mentioned before, machine learning algorithms don't like character strings, so we should probably turn that into a factor as well. What we'd expect is to get three values for Survived once we convert it to a factor, and taking a look at it again: yep, there you go, Survived is a factor with three levels, 0, 1, and None, which is good. Cool, so we've massaged our data a little bit into a format that's more amenable to both machine learning and data analysis, by making the things that should be categorical variables into factor variables, as opposed to what R originally loaded them in as. OK, next up, let's take a look at survival rates in aggregate. As I talked about before, the whole point of this Titanic competition is to build a machine learning model that can predict whether a passenger survives or does not survive the sinking of the Titanic, just based on the data provided. So one of the first things we want to know is: what does the distribution look like of those people who lived versus those who did not? What you can use here is the table function, and again, when in doubt, use the R help system; there's the help file for table. It basically takes some data and puts
it into a tabular format for you. You can see the results here: we said, hey R, grab the Survived variable and tabulate it for me, and it did exactly that. Out of our 1309 records, 549 people perished on the Titanic, 342 survived, and 418 are None, which is what we'd expect, because those are our test records, not our training records. So the first thing to note, as you'd expect if you know anything about the history of the Titanic, is that the data is a little bit skewed: more people did not survive than did survive, in fact pretty close to twice as many people perished as survived overall. That's going to be important, because machine learning algorithms will often fit the data they're provided. For example, here it's a good bet that if you didn't know anything better and had to place a bet on whether someone survived or not, you'd say no, they probably perished, just because statistically they're more likely not to have survived. Machine learning algorithms essentially work the same way: the more skewed the data is to one side or the other, the more likely it is, without some help, that the algorithm will just pick the most common scenario. In the most extreme example, if only one person survived and 549 people perished, statistically speaking you would just always predict that a passenger did not survive, because you'd be right something like 99.99 percent of the time, and that's what a machine learning algorithm will do. So we always want to be cognizant of skew in the data, especially in the labels, because sometimes we have to do some pretty interesting, and sometimes difficult, things with the data if it's heavily skewed. This data is not too bad: less than two to one. If it were ten to one, then you'd have to start looking at some interesting techniques to
combat that. So this is interesting, but it is in line with our expectations, which is that, as we know historically, more people perished on the Titanic than survived. All right, so next up let's take a look at the distribution across the classes. As we know from our data dictionary, there were three classes of passengers: first class, second class, and third class. Now, generally speaking, you'd expect far more people in the lower classes than in the upper classes; let's see if that's actually true. OK, so execute this line of code, which again just runs the table function, this time on the Pclass variable, and what you get out is the distribution across the classes. This is both interesting and not at the same time. For example, there are a lot more people in third class than in the upper classes, and that makes sense: just as, if you've ever flown on a commercial airliner, you know the vast majority of people are not in business or first class; they're in coach, or economy, or whatever the designation is for folks who didn't pay a lot for their ticket. So that makes a lot of sense. What is interesting, though, is that you'd expect that rule of thumb to apply to the upper classes as well; for example, you'd expect more people in second class than in first, but in fact that is not the case. That could be potentially interesting from an analysis perspective, for creating a machine learning model: there may be some rules, some patterns, for understanding who perished and who did not, just based on the data. So this is interesting, and we should keep it in the back of our minds. You can get some pretty interesting information straight out of R just by using the table command, but human beings tend to find visualizations a much more productive means of understanding and
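The two tabulations above can be sketched on toy data like so (the real counts are 549/342/418 for Survived):

```r
# Toy stand-in mirroring the shape of the combined data.
data.combined <- data.frame(
  Survived = c("0", "0", "1", "None"),
  Pclass   = factor(c(3, 3, 1, 2)))

table(data.combined$Survived)   # counts per label value
table(data.combined$Pclass)     # counts per passenger class
```

Each call returns a named count per distinct value, which is exactly the skew check described above.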
gaining insight from data than just looking at tabular representations. So what we're going to do next is take a look at some graphs, some plots of this data, to see if we can derive more insight from it. In particular, we're going to use ggplot2. ggplot2 is essentially the de facto standard in the R community for creating high-quality visualizations of data. Getting ggplot2 is extremely simple; you don't get it in base R by default, but it's very easy to add. All you need to do is go to the Packages tab over on the right side of the RStudio UI, click Install, literally type in ggplot2, and click Install. What you'll see down here is that the console executes install.packages; obviously you have to have internet connectivity, but it'll download and install ggplot2, and then using the library command you can load that package into memory and access ggplot2's functions. Let's run it, and interestingly enough, it says: hey, I got an error, I can't load ggplot2 because I need a package called stringr. All right, no problem, let's load the stringr package, and then let's rerun this line of code. Excellent, so now we have ggplot2 loaded successfully and we can use its functions. Now, we've got a hypothesis here that I want to explore. We know there were 323 folks in first class; let's scroll up here so you can see that again. I've traveled on an ocean liner before, and maybe you have as well; typically what happens is that the more you pay for your room, the higher up on the ship you are, which also tends to put you closer to the lifeboats. So it would be interesting to see: did rich folks, that is, people in the higher classes, have a different survival rate than those in third class, the people who were lower in the
boat? So we can go ahead and test that. Since we can only tell whether you survived or not in the training set, we're going to convert Pclass to a factor on the train set, because if we do a quick str on train you can see here that the Pclass variable is not a factor like it is in the combined data set. We'll go ahead and fix that real quick by running this line of code, and then we're going to call ggplot. Now, this is a bit confusing, because the function is ggplot, out of the library ggplot2. In case you're curious, the reason is that there was a first version called ggplot which essentially got replaced by ggplot2, but the name of the function stayed the same. So you use the ggplot function out of the ggplot2 library. Let's take a look at some of the syntax here. ggplot is the function, and of course you can pull it up in the help system, and it says here: create a new ggplot. It even gives examples of how to use it, but you don't need to worry about that because I'll explain. The first thing that we pass into the ggplot function is the data that we're interested in plotting. More often than not that's going to be a data frame, which is what we have here: we have the train variable, which, as we know from over here in the upper right corner, is a data frame. Next we have to tell ggplot2 some aspects of the aesthetic, which is what aes stands for, and again we can look that up in the help system: generate aesthetic mappings. What that essentially means is that you're controlling the way the graph or plot is going to look. What we're saying is, hey, the aesthetic will have an x-axis corresponding to the Pclass variable of the train data frame, and we're also telling ggplot that we would like a fill color to be used. This is optional; you have to specify an x at a minimum, right, you have to have an x-axis at a minimum, otherwise you're not going to have a plot, but the
fill variable is optional. It's going to be helpful, though, because what we're saying is: color-code the resulting plot based on the survived variable. Now, as you see down here, survived actually isn't a factor, because we haven't converted it yet like we did with the combined frame before, so I'm just illustrating another option here: I can convert the survived variable, which is currently an int, into a factor on the fly by using the factor function. And again, whoops, pull it up, there we go, pull it up in the help system, there's factor, all right, we saw that before. So this is a function that will convert it to a factor on the fly. And then lastly we're telling ggplot: look, you've got the data, you've got some aesthetics where I'm providing what the x-axis is and what the fill color should be based on the survived variable; what I want you to actually plot is a histogram with a width of about 0.5, and go ahead and display that to the screen. These last bits just say: I want to add an x label called Pclass and a y label called Total Count, and this is saying I want to assign the fill a label of Survived. Let me just run this and it'll become much clearer in a second. Okay, and here you see the plot in RStudio. You can see the x label is Pclass, the y label is Total Count, and the fill over here is labeled Survived. Now, what you'll notice is that this part here generated the green and the reddish-orange colors. This says: give me a histogram for each value of Pclass, but then break it down, bisect it, based on the survived variable. That gives me a visual indication, by Pclass, of the total count of people who survived and people who didn't. And what you can see from this plot is that our hypothesis seems to be confirmed, which is that in general, if you were in third class, unfortunately you were far more likely to perish, all
things being equal, than you were to survive, whereas if you were in first class you seem to have had a much better chance of survival than perishing, and that kind of makes sense. And interestingly, we can see that in second class it looks to be pretty close to 50/50. So that's pretty powerful, right, because it's telling you a lot. It's saying, from a data analysis perspective, it seems that Pclass is pretty important for our problem at hand, which is determining the pattern of who survived and who didn't on the Titanic. Okay, now if we go back real quick here to Chrome: we've taken a look at these two variables initially. We said, let's take a look at the initial survival rates, what's the distribution, how does that work; let's take a look at the Pclass variable, what's the distribution; and we've also combined those two in this pretty intuitive plot, which is arguably more powerful, more informative, than just looking at the table presentations that we saw earlier. These right here are interesting, but when you combine the two visually, it really does pop. So there are a lot more people in third class, and they were far more likely to perish than people in first class. Interestingly enough, there were more people in first class than in second class, and as you move down, your odds of survival decrease pretty dramatically by the time you get to third class. Now, if we go back here to the metadata for the data set, the next thing we need to take a look at is the name variable, because data analysis is a pretty methodical process. So we're going to go ahead and take a look at the head command, but first there are a couple of things we need to cover before we talk about the head command. There's this as.character thing, and to understand why I put that in there and why we need it,
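For reference, the package setup and the Pclass plot walked through above can be sketched roughly like this. This is only a sketch, assuming train is the Titanic training data frame with Pclass and Survived columns as in the video; note that the video uses geom_histogram, while current ggplot2 versions prefer geom_bar for discrete counts like this.

```r
# One-time install (needs internet connectivity), then load into memory
# install.packages("ggplot2")
library(ggplot2)

# Treat Pclass as a categorical variable rather than a number
train$Pclass <- as.factor(train$Pclass)

# Bar chart of passenger counts per class, color-coded by survival;
# factor(Survived) converts the int variable to a factor on the fly
ggplot(train, aes(x = Pclass, fill = factor(Survived))) +
  geom_bar(width = 0.5) +
  xlab("Pclass") +
  ylab("Total Count") +
  labs(fill = "Survived")
```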
let's go ahead and look at the structure of the train data frame again. If we go down to name, you can see that it's been labeled a factor variable, which is pretty interesting. Essentially what R did was say, look, it's a factor with 891 levels, which means every row had a unique name, and it created a categorical variable with 891 levels. Another way to think about it: in a drop-down in either a Windows app or a web app, you would have 891 entries, a long list of entries in the drop-down, and that doesn't work for us. So what as.character does is say, hey R, I don't really want you to think of this as a factor; on the fly, just give me back the strings, essentially give me back the names, and then issue the head command on that. Now, if we look at the help system, the head command just allows you to take a look at the first or last part of an object. So essentially what it's going to do is say, grab all the names, turn them into character strings, and just give me the first few off the top. If we run that, lo and behold, we can see the first name, second name, third, and so on. This is particularly useful; the head command is extremely useful for giving you a general sense of the data. And what we're seeing here, first and foremost, is that the data is formatted in a very formal way. What you have here appears to be the last name with a comma, then a title, first name, middle name. You can also see here last name, title, and then I'm assuming that this is in fact her husband's name, and back here you actually get, I'm assuming, although it doesn't necessarily tell me, so let's double-check the metadata to be sure, her maiden name, the name she had before she was married. Let's see if there's anything, and no, there's not; however, we can probably infer that pretty easily. So for the time being, let's just assume that's in fact the case, that this is a married
woman, this is her husband's name, and this was her maiden name before she got married. We can see here there's a Mr. and a Mrs. and a Miss and another Mr., which is kind of interesting. The names are formatted in an interesting way that may be germane later on, so it's worth knowing. Okay, moving on. Let's take a look and see how many unique names there are across both the train and test sets. As we saw before with the train set, every name was unique, because we got 891 levels and there were exactly 891 observations in the train data frame. So let's see across the combined data set. We'll take the name from the combined data set, make it a character string, and then use the unique function. Let's pull that up. Okay, it's exactly what you'd expect: the unique function extracts unique elements from a vector, data frame, or array, with the duplicate elements removed. Exactly what we want. And then lastly we say, okay, give me the length function on top of that, the length of an object; it's almost silly for me to type this into the help system, because you can probably infer what it means, but let's just go through it all at once. Grab the name from the data.combined data frame, which should be 1,309 values, convert them to character strings, find out which ones are unique, and then tell me how many are actually unique. If we run that line of code, we get 1,307 unique names. That's interesting, because we would expect to get 1,309, which means we've got some duplicate names, probably two duplicates. So let's go ahead and take a closer look at that. Now, this line of code gets even more complicated, but let's break it apart so we understand what's going on; we'll go from the inside out. So now I've got some duplicate names, and I need to determine whether that's legitimate, for example, John Smith is
a common name, so maybe there's more than one John Smith, or whether I have some bad data in the data set, which I would need to know, because if I've got duplicate records I would need to either take them out or deal with them in some other way. It's nice to know what my situation is. So how we do that in code is we say, okay, go ahead and grab the names from the combined data set, convert those to character strings, and then invoke the duplicated function on that. Let's take a look at duplicated in the help system. Okay: determine duplicate elements. Great, so what this is going to do is go through all of that, and it says it determines which elements of a vector or data frame are duplicates of elements with smaller subscripts, and returns a logical vector indicating which elements, or rows, are duplicates. Okay, great. So that's going to say: look, I'm taking a look at 1,309 character strings, and I'm going to return to you which rows of those 1,309 are the actual duplicate rows. Sweet. And now we've got which. So, what's which? If you're familiar with SQL, it's very much like a WHERE clause; essentially it's a way of honing in on the data you're interested in. So what we say here, popping out a level to give it some context, is: from the data.combined data frame, I want to grab the name column, but only those records which are duplicated. That should pull out, we're thinking, four records. Well, it's definitely four records, but maybe it's two names each duplicated twice, or one name repeated three times; we'll have to see. But this should be pretty intuitive; it essentially is like a WHERE clause in a SQL query. Then take that, turn it into a character, and stick it in the dup.names variable.
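The unique, duplicated, and which pipeline just described can be sketched like this. Again, a sketch, assuming data.combined is the stacked train-plus-test data frame with a name column, as in the video.

```r
# How many distinct names are there among the 1,309 passengers?
length(unique(as.character(data.combined$name)))   # 1307 rather than 1309

# Pull out the names that appear more than once; duplicated() flags the
# repeats, and which() converts that logical vector into row indices
dup.names <- as.character(
  data.combined[which(duplicated(as.character(data.combined$name))), "name"])

# Show every record whose name is one of the duplicates, all columns
data.combined[which(data.combined$name %in% dup.names), ]
```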
So we go ahead and run that, and if we go over here we can see that, in fact, dup.names is a character vector consisting of two values: Mr. James Kelly and Miss Kate Connolly. Excellent. So what we see is that we've got two names that are each duplicated twice, four records in total, which would explain the difference that we saw earlier of 1,307 versus 1,309. Okay, so now that we understand which names are duplicated, Mr. James Kelly and Miss Kate Connolly, we need to go through the entire combined data set, pull out all of the records, all of the observations, that match those names, and display them. Then we can take a look and say: hey, are these different people, or are they the same people just duplicated in the data set? And we'll address that if we need to. How we do that is we say, okay, go through all of the names in data.combined, and as I'm iterating through all the names, if the current name I'm looking at is in the set of names given by dup.names, which we know from over here are Mr.
James Kelly and Miss Kate Connolly, go ahead and pull that record out. Again, remember, this is the which; if you're familiar with SQL, this says: grab from data.combined those observations, those rows, where the name is in the duplicate names, and return me all of the columns. Because, remember, from our previous example, when I only wanted one thing I put a number in there, but since I have no numbers, R says, oh, you want everything that comes back. So now if we run that code, we pull back four records, which is what we expected, and we can take a look at them. Okay, here's Miss Kate Connolly: she survived, she's in third class, female, age 22, she has 0 for these values, here's her ticket number and how much she paid for her fare, she had no cabin assigned, apparently, and she embarked at Q, which we'll see later is Queenstown, but that's not really germane right now. Now we go down to this other Miss Kate Connolly: she has None for survived, so she's obviously in the test set. She's also in third class, which is kind of suspicious since she has the same name, but she has a different age, a different ticket, and a different fare. So it's probably safe to assume that these are in fact just two women who had exactly the same name, and the same thing for Mr.
James Kelly, which makes sense. Kate Connolly and James Kelly are probably fairly common names in the US or the United Kingdom, and it wouldn't be unheard of for two folks with the same name to be on the ship at the same time. So it looks like we're pretty good there. From a name perspective we can conclude: great, it doesn't look like we have any true duplicates, all the names check out, and that's good. That's something that's extremely important when you're doing data analysis, especially in the context of trying to build a predictive model: you need to go through the data with a fine-tooth comb to try to find any anomalies that you can. This is an example of where we may have had an anomaly, but in fact we don't; we validated that these are different people, so the records don't need to be modified or removed in any way. Okay, now one thing that's kind of interesting to note before moving on is that within the names there are these titles, and they seem to be pretty important, because as we saw up here we've got Mr., we've got Mrs., we've got Miss; every record seems to have one, so they took great pains to put them in there. What that should tip you off to, as a data sleuth, is that maybe there's something interesting going on there; maybe there's a pattern, maybe there's some predictive power in these titles, so that's probably worthy of further investigation. So what we'll do here is load up another library. We saw this earlier when we tried to load ggplot2 and it said it wanted the stringr library. We'll go ahead and load it; it's probably already loaded, but it never hurts to explicitly load it again just to make sure. And we're going to take a look at what's up with this Miss and Mr.
thing, right? These titles: can we derive anything from them? In particular, is there any correlation between these titles within the names and any other variables, for example the SibSp variable? The first thing we're going to do is take a look at the str_detect function. Again, let's pull it up in the help system, and we can see that it detects the presence or absence of a pattern in a string. What we're doing is saying: grab every single name in the combined data set and detect whether "Miss." is in that string, and if so, again, here's our which, which essentially is like a WHERE clause in SQL. This thing says: grab me every record out of the combined data set where Miss is in the name, and give me back all of the columns, remember, because we've got a blank here, nothing, and then store the result in the misses variable. And I just want to take a look at the first five. Now, this is worthy of note; we haven't seen this syntax yet. This is the way of denoting a range; you're saying, look, I want a range of one to five of the rows, the observations, here. Similarly, I could do something like this as well, which would say, hey, give me the first five rows, one through five. Also worthy of note for those of you who are programmers: R indexes from one, not from zero. So, give me the first five rows and also the first five columns. Well, we don't want that, we want all of the columns, so we'll just delete that and run these two lines of code. You can highlight them both and just click Run, and there you go: you can see the first five rows, and they all contain the term Miss in the name, which is excellent, and we can see all of the rest of the variables associated with them. We can just look at it and say, hey, is there anything interesting going on here? Well, the first thing you notice is that out of the five
records, four of them survived, so that's 80%, which is probably significant given what we saw earlier when we took a look at the tabular count of survived versus perished. That's pretty good. What's even more interesting is that not only did four out of the five misses that came back survive, but four out of the five misses are also in third class, which is even better yet, because as we saw, your survivability rate declined dramatically in third class versus first class. So this is already getting pretty interesting from a data analysis perspective. Moving on: the names, okay, great, they're all female, which makes sense. And if you look at the age, the age is interesting because there's quite a variance here; the range is all the way from 4 to 58, which is highly interesting. Now, if you move over here, you can look at this first column, SibSp. Let's take a look at the metadata for SibSp. SibSp denotes the number of siblings or spouses on board. Okay, interesting. Now, we'll notice that the only one of the five here that has any value is four years old, so I think it's a safe assumption that a four-year-old girl is not married; that probably indicates she's traveling with one sibling and not one spouse. And you notice the rest of these are zero. Then we also have Parch here; if we go back to the metadata you can probably guess what that means: it's the number of parents or children on board. And again, with a four-year-old girl it's unlikely that she has children, it's probably biologically impossible, so I think it's safe to infer that this is a four-year-old girl who is traveling with one sibling and one parent. That's significant because, again, you'll see here all zeros. So what this seems to infer,
when you combine it with the ages here, is that Miss denotes a non-married woman, generally speaking, which is potentially interesting; there may be some predictive power there. Also, you can tell, at least from this very small sample, that not only does Miss tend to indicate an unmarried female, but they may also tend to be younger in age in general, which would make sense, because in the early 1900s it was pretty common for women to be married by a certain age. So that's interesting. Given that we just came up with a hypothesis that there may be some correlation between titles and age, let's go ahead and take a look at Mrs. now. We'll do the same exact code, except with a string of "Mrs." instead of "Miss.", and see what we get. Whoops, that's wrong, apologies. Okay, as I mentioned before, R is case-sensitive, and I had a capital letter in there, so once I changed it to lowercase it worked just fine. Okay, so with Mrs.,
as you would expect, we've got a number of patterns going on. All of these women survived; in general they tended to be in the upper classes of passengers; and they all seem to follow the same general pattern, which is their husband's name first and then what appears to be their maiden name, we're assuming, in parentheses behind it. In the modern day this would raise a bit of an eyebrow, but in general you can say that, yep, Mrs. seems to indicate, in general, older females, as you can see here, and it looks like most of them tend to travel with a spouse; maybe that's a sibling, but for the time being we'll infer they're mainly traveling with a spouse. And there you have it; this is also immensely interesting: we just sampled the first five and they all happened to survive, which, as we know from our previous distributions, is pretty notable. Let's also take a look at males. We're going to take a little bit different tack here: rather than going off of the actual title, let's just grab the first five male records out of the combined data set, take a look at their names, and see what happens. Okay, well, the first thing you'll notice is that it sucks to be a guy on the Titanic, apparently: all of these men perished, and most of them were in third class, with the exception of this gentleman here in first class. The ages are all adult males, though there's a missing value here, so we don't know about James Moran. But this record is interesting: age two, and it's not a Mister but a Master. So, like Miss, maybe a title of Master is indicative of a male child, a boy, as opposed to an adult male, a man. And this would also be borne out here, because a two-year-old is unlikely to be married to three women simultaneously; what's more likely is that this
Master Gosta Leonard Palsson is traveling with three of his siblings and one parent. That's probably also borne out by the fare: he's in third class, and 21 pounds for a third-class ticket is pretty expensive, so that would be indicative of a family traveling together. Okay, so that's particularly interesting. I think I'll wrap up this first video with one last look at names, in particular titles. We've looked at some tabular data here and we've inferred quite a few interesting things from the data already. It appears that title is pretty interesting; we've seen some general patterns around sex and class and survival rates, and these titles potentially proxying for gender and age, as well as for the sibling/spouse and parent/child variables. But let's go ahead and visualize the data, because as we've said before, tabular data is good, but visual data is usually even better. So I'm going to make a little room in my plot area here and clear this guy out. What I'm going to do is expand upon the relationships between survived and Pclass by creating a new title variable, literally extracting the embedded title value out of the name strings and adding it as a new variable at the end of the data frame. So we'll go from 1,309 observations of 11 variables to 1,309 observations of 12 variables, the last one being title, and that way we can actually plot out, visualize, how title is related to both survival rates and Pclass. That may be an interesting visualization. The way I'm going to do that, and let's go ahead and make this fullscreen to get a little more real estate, is to create a little utility function. This shows some of the syntax in R for defining a function. I'm going to define a function that takes a single parameter, the name, and I'm going to call this function extractTitle; makes sense. So the first thing I'm going
to do is convert the name with as.character, because as we've seen previously these are currently treated as factors by R, so we'll turn it from a factor into a character string. Then we'll use the grep function, and let's pull that up. If you're a programmer or familiar with Linux or UNIX, then you already know what grep is: grep is a pattern-matching function. What we're going to do is say: if you recognize the pattern "Miss." within the name, great, and if the length of that result is greater than 0, which essentially tells me, yep, you found it, it's within the name, just return "Miss." as a string. It does the same for Master, the same for Mrs., the same for Mr., and if you find something else, for whatever reason, return back "Other". Great, that's a pretty simple function, right? Moving on, here's some code that actually takes advantage of the function. We're going to create a variable called titles, and we don't have anything to stick in it right now, so we'll make it NULL. Then we're going to loop over all of the values in data.combined. We say: grab the number of rows, which we know to be 1,309, and go from one to 1,309, and each time call this function. The i value will be our index; it'll go from one to 1,309. So grab the i-th row, the i-th observation, from the data.combined data frame, grab the name, and pass it into the function. And take a look at this real quick: the c function stands for concatenate, or combine, excuse me. What this does is essentially say: take whatever comes out of this function call and add it to titles. Basically, we're just building up a vector, an expanding array, of the values that come out of the extractTitle function. So essentially it'll be a collection of strings; it might be Miss, Miss, Miss, Master, Master, Master, Mister, Mister, Missus, Mister, Master, Miss, you know,
you'll get a big long list, and eventually you'll get 1,309 of those, because we're iterating through the entire combined data frame. And at the last step we say: once I'm done with that, create a new variable on data.combined, making it a data frame of 1,309 observations with twelve variables, the twelfth one being a new one called title, and go ahead and cram in all the titles that I created by going through this loop and calling the extractTitle function. But before you do that, on the fly, convert that to a factor, because as we said before, you can't do a lot of really cool analysis on raw strings. So we'll turn the Miss, Master, Mrs., Mr., and Other string values into a factor of five levels, and that allows us to do some interesting analysis. So just highlight all this code and run it, and we'll see over here that we now have 12 variables, and if we click on it and pull it up, sure enough, you can see the title column has been added to the end and it's got all of our values. Just to double-check, we can look at the first three titles, Mr., Mrs., Miss, against the first three names: Mr., Mrs., Miss. Sweet, everything works great. Okay, lastly, let's go ahead and do a visualization on that. Now, since we're interested in the relationship between whether or not somebody survived, what passenger class they were in, and their title, we can only use the labeled data, which corresponds to the train data frame, which, if you recall, was the first 891 observations. So we're going to call ggplot again, good old handy ggplot2, and we're going to say: use the data.combined data frame, but only grab the first 891 observations, the first 891 rows out of the table, and grab all of the columns, all the variables. And I want an aesthetic where the x-axis is the title, so that would be Miss, Master, Mrs., Mr.,
or Other. I want you to color-code it, fill it, based on the value of survived, did they survive or perish, and then I also want you to pivot that based on passenger class. In ggplot2 terms that's called a facet_wrap, and of course, as with everything, you can ask the help system about facet_wrap, and here you go: facet_wrap wraps a 1d ribbon of panels into 2d, which is not a particularly intuitive description, so it's better just to see it. Once again, we'll go ahead and run this and see what the plot looks like. Here you go. Now, this is a very, very interesting plot, arguably. Here we have our facet_wrap, right, one panel for each Pclass, one, two, three, so we've gone from 1d to 2d; essentially that's what that not-particularly-helpful description from the help system was trying to imply. And down here we can see the titles, Master, Miss, Mr., Mrs., and Other, and of course they're color-coded. So right away some things pop. First of all, if you are a Mr. in third class, life is not so good for you: you were very, very likely to perish. If you were a Mrs. in third class, you're about 50/50; if you're a Miss, about 50/50-ish; a Master, unfortunately, you're still more likely to perish. So this is particularly interesting. There's an old adage in the US, and I'm assuming it's the same in the UK, and maybe Australia and other English-speaking countries, of women and children first, and this maybe bears that out a little bit, at least in third class, because it seems like you had a better chance of survival in general if you were a woman or a child as opposed to an adult male. Now, conversely, if you go to first class and take a look at it, things really pop as well. This is even more striking, arguably, from this women-and-children-first perspective: if you were a male child, a female child, or a married woman, well, we can't quite make that assumption about Miss at this point, so let's say a child or a
married woman, you were far more likely to survive than perish, but if you were an adult male, and we'll use Mr. as a proxy for adult male, in first class your odds of survival were far lower. So this is more indicative of this idea of women and children first, and it's also borne out very strikingly in second class, as you can tell by the relative colors. So this is a very, very striking plot from a data analysis perspective on this data set, and it's indicative of the power of doing data analysis: actually just looking at things and asking, are there hidden patterns, are there hidden pieces of information within the data that could help me solve my problem at hand? In this particular case, embedded within the name string was this title thing that we were able to pull out and add as a new variable, or in machine learning parlance, add as a new feature, to our data frame, to our table, to our data set, and then ask: is this thing particularly interesting? And sure enough it is; this title thing is extremely interesting, as can be seen from this plot. Now, whether or not it's still significant once we're totally done with our analysis, or whether we find something more, that remains to be seen. But at this point I think we'll wrap it up for this particular video. Hopefully this was helpful for you and has piqued your interest in the power of data analysis and visualization using R, especially if you're interested in the data science of predictive modeling. Thanks for watching, I hope to see you again in our next video. Thanks a lot.
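For reference, the extractTitle function, the loop that builds the title variable, and the faceted plot described in this last section can be sketched roughly as follows. This is a sketch, assuming the lower-case column names name, survived, and pclass are used on data.combined as in the video, and that survived has already been converted to a factor; note also that current ggplot2 versions prefer geom_bar over geom_histogram for discrete counts like these.

```r
library(ggplot2)

# Map a full name string to its embedded title. The order of checks
# matters: with "." acting as a regex wildcard, the pattern "Mr." would
# also match "Mrs.", so the more specific titles are tested first.
extractTitle <- function(name) {
  name <- as.character(name)
  if (length(grep("Miss.", name)) > 0) {
    return("Miss.")
  } else if (length(grep("Master.", name)) > 0) {
    return("Master.")
  } else if (length(grep("Mrs.", name)) > 0) {
    return("Mrs.")
  } else if (length(grep("Mr.", name)) > 0) {
    return("Mr.")
  } else {
    return("Other")
  }
}

# Build up the titles vector one row at a time with c() (combine)
titles <- NULL
for (i in 1:nrow(data.combined)) {
  titles <- c(titles, extractTitle(data.combined[i, "name"]))
}
data.combined$title <- as.factor(titles)

# Plot survival by title, pivoted (faceted) by passenger class; only the
# first 891 rows carry a survived label, so restrict to those
ggplot(data.combined[1:891, ], aes(x = title, fill = survived)) +
  geom_bar(width = 0.5) +
  facet_wrap(~pclass) +
  xlab("Title") +
  ylab("Total Count") +
  labs(fill = "Survived")
```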
Info
Channel: David Langer
Views: 1,379,616
Rating: 4.939497 out of 5
Keywords: R (Programming Language), Data Science, Data Analysis (Media Genre), Feature Engineering, Visualization, Data Wrangling, Data Exploration, R Programming, R Programming Tutorial, R Programming Training, Data Science with R, Data Scientist, Machine Learning with R, Programming, Tutorial, Training, Data Science Training, Data Science Tutorial, Machine Learning, Data Analysis, Data Visualization, Data Science with R Programming, language, tutorial, programming
Id: 32o0DnuRjfg
Length: 81min 50sec (4910 seconds)
Published: Sat Nov 08 2014