Exploratory Data Analysis With Excel - Part 2 - Basic Categoricals

Video Statistics and Information

Video
Captions Word Cloud
Reddit Comments
Captions
if you're a professional looking to have more impact using data if you're looking to up level your data analysis game this is the video series for you this is part two of exploratory data analysis with excel and if you've got here prematurely if you need to find the first video in the series just click up here in the card there's a link to the first video in the series okay the subject of this video is basic categoricals what we're going to do is take a look at the titanic data set and focus on the categorical features in the context of our overriding business question which is what are the patterns in the data that are highly associated with survival so let's flip over to excel and start working through the data okay you can see here that i'm in excel and what i've done prior to this video was just add a couple of new columns of data a couple of new features to the data oh and by the way i come from a machine learning background so i will tend to use terminology like columns variables and features and they all mean the same thing just a column of data so what you can see here is i added a new column called new survived and that's because what we established in video number one is that the survived column while it's encoded as numeric right zeros and ones it isn't actually a numeric it's a categorical essentially one it indicates that the passengers survived and zero indicated that the passenger unfortunately perished so what i did was just create a new handy dandy feature a new column of data called new survived and populated it using a simple excel function call i just checked to see if the b column the survived column is equal to one and i and if it is i substituted in the string survived otherwise i put in the string perished and this is going to be interesting for later on in the video for us is to actually work with these categories as text rather than use the numbers you could certainly use the numbers if you wanted to but it's a little bit cleaner and nicer if you actually have labels like this similarly what i did before the video was i added a new feature called new p class or a new column of data called new p class and i just used the good old ifs function up here to essentially say hey if your p class is equal to 1 make it the string first if it's two then make it the string second and if it's three then make it the string third and that's going to just make our analyses a lot more fun a lot more interesting out of the box excel comes with a really really great and powerful tool for exploring categorical data and that is the mighty pivot table so pivot tables are a great way to do a very early exploratory analysis on categorical features especially when you have a business question in mind so let's go ahead and add a pivot table and illustrate what i'm talking about here so let's go ahead and insert a new pivot table and i'm going to go ahead and put it in this existing worksheet and where we're going to where we're going to put it let's put it in p2 let's say so let's go ahead and location p2 okay and we'll just add pivot table and i'll just scroll over so that we can see it a lot easier okay pivot table awesome so if we just explore left to right you know columns from the left to the right because why not we're in a very early phase of our analysis here the first categorical feature that we're interested in is the p class column you could take a look at survived if you wanted to just to get the relative proportions but that's not as nearly as interesting as actually saying look what are the other category categorical features how do they relate to the over arching business question that we're looking at which is for example how does p class relate to survival rates because that's the thing we're interested in analyzing so that's pretty easy to take a look at what i'll do is i'll grab the new p class feature drag it to the rows and you can see up here i've got first second and third now some terminology here a very common terminology that is used in these types of analytical scenarios is the concept of a dimension in terms of a pivot table what you do is you drag columns of data to the rows and columns aspects of the pivot table and that allows you to build out a pivot table every time you drag a column of data to either to the rows portion of a pivot table or the columns portion of a pivot table you're adding a dimension so right now i have a one dimensional pivot table i'm doing a one dimensional analysis but i can add features quite easily so what i'm going to do is i'm going to add the new survived to the columns and now i've got a two dimensional pivot table here i got new p class as my first dimension and then i've got new survived as my second dimension and not surprisingly of course we can do cool stuff with the pivot table as you well know we'll just drop new survive to the values and we get some counts here but percentages are actually a lot more interesting to use so let's go ahead and convert these to row percentages here using our handy dandy excel skills okay and this is where we get our first insight what we can see here is the survived the survived column here the survival rates the percentages based on the data that we have by p class new p class and what we can see here is that not surprisingly or maybe surprisingly depending how you look at it survival rates tend to go down as you move from more expensive tickets first class tickets to less expensive tickets or third class tickets and once again we inferred in the first video that first second and third corresponded most likely to tickets on the titanic because it was the only logical explanation for the data that we saw so we're going to operate under that assumption so first and second and third now it's very tempting as a data analyst and i'll ever quote that because maybe you don't have the title but you're operating as a data analyst it's very tempting to start hypothesizing why you're seeing the things that you're seeing to start forming ideas of why you're seeing the things you're seeing so for example maybe you come up with the idea that oh the reason why i'm seeing this is because the lifeboats in the titanic were at the top of the ship first class passengers had cabins at the top of the ship and then second class and then finally third class at the bottom and it's a proximity thing right folks in third class were way far away from the lifeboats so it was extremely unlikely that they were going to make the journey up the ship all the way to a lifeboat it's a reasonable idea however it's not particularly useful in this context because we have no data in our data set that tells us relative proximity of folks to a lifeboat because the ship is not only vertically large the titanic was not only vertically large it was also horizontally large so maybe it's not just how far down in the ship you were but it was also how far back in the ship you were maybe i don't know the point is that we should only hypothesize with the data that we have because otherwise we can get off on a tangent that isn't really useful in the end for driving a business outcome because we can only work with the data that we have now you could for example maybe go find that data outside of this data set but for the sake of this video series we're only going to work with the data that we have because that's usually the case most of the time okay so we have the data so what we can see here right now is that the pattern and the data tells us that third class passengers perish unfortunately at a very high rate three out of four of the passengers in third class based on this data set perished now given my diatribe earlier about don't get fanciful with your hypotheses let's just work with the data that we have so reasonable another reasonable thing to check out is be like okay hey is there any is there any pattern in the combination of new p-class and sex so there is a nautical rule of thumb in bad situations that goes women and children first so one thing to explore potentially because we have the data in the data set is to see is there any sort of gender pattern in the survival rates so we can easily check that out by just grabbing sex and dragging it down to the rows and you get exactly what you would expect a three-dimensional representation of the data so we have new p-class we have sex over here and then we have new survived and what this does is pop some patterns in the data so check this out females in first class overwhelmingly survive almost 97 out of 100 females in first class at least based on the data that based on this percentage survived similarly in second class 92 percent of females in the data set survived and lastly in third class you've got a 50 50 split what's really super interesting is then when you compare them to the males in the same class because you can see here survival rates for males are significantly lower in first class than they are for females which given that we have the data gives us some credence to this idea of hey maybe just maybe women and children first is an underlying pattern in the data and we can also see here that males in second class they only survived unfortunately at 15.74 that is so low and in third class you can see it's even lower 13 and a half percent so what we see across all of the values of p class first class second class and third class is that females always survive at a much higher rate than males it's not as pronounced in third class here as you can see it's not as pronounced however it is non-trivial difference so that's a pretty good that's pretty good insight into the data and it also tells us some other things as well to help us guide what we're going to do later on in the series which is we're probably not going to focus a ton at least in the beginning on understanding which factors were significant in predicting female survival in first and second class because they already overwhelmingly survive we're not going to get a good return on our analytical investment here at least in the early stages we're going to have a much better much better return on investment if we can figure out for example why males survived or not in first class second class and third class and then maybe as well why the female survival rate was only 50 50 in third class that's where we're going to focus a lot of the energy in the next videos now we're not done we can certainly add another we can certainly add another dimension here we've got three dimensions going on let's add a fourth dimension let's go ahead and add the embarked column and what we can see here is a lot of goodness now unfortunately this also exhibits a problem with using multi-dimensional pivot tables the human brain is optimized for visual pattern recognition humans are visual creatures and tables of data are great and they work just fine and they're used all over the business world every day however when you start doing exploratory data analyses you tend to find that you want to use multi-dimensional types of techniques and pivot tables break down really really fast so a lot of what we'll be doing in later videos in the series is issuing the use of tables of data and moving towards data visualizations and excel offers a lot of data visualizations out of the box either plain charts or pivot charts and we will be using those extensively to try and visualize the data and use our human brains to see what patterns pop this is all this is all cool and you can see that there's some dif some differences here between the various ports of embarkation so what embark tells you is where did the passenger get on the titanic it's one of three locations sherborg queenstown or southampton and you can see there are some differences between them and maybe that's worthy of exploration i mean you can see that through the various classes and by gender you know there's not necessarily huge differences until you get down to the males like down here in third class and you can see that males in third class that got on in q embarked on cube only survived at 7.69 but those that embarked on c survived at 23.26 that might be potentially interesting to explore later on in this series now another thing that i should mention as well in terms of exploratory data analysis is that pivot tables only take you so far because one of the things that we will definitely want to do in our data visualizations later on in the series is combine the categorical data like we see here with the numeric data and let me get rid of embarked here real quick and let me explain that a little bit further what i mean by that males we have a working hypothesis using the data that we have of women and children first and we can see that females survived overwhelmingly in first class so that's like the women portion of that hypothesis there might be a male aspect to that as well so for example of the 36.89 of males that survived in first class how many of them were male children boys because if women and children first is a good pattern to use to understand what's going on in the data set then you would expect that male children in first class survived because they were put on the lifeboats ahead of adult males and maybe you might also see elderly males men that are old maybe 50 plus years old or something like that also survived at disproportionate rates because they were also given spots on the lifeboat in preference to adult males between the ages of like i don't know 20 and 50 let's say to do those sorts of analyses we need to be able to combine the age column of data the age feature with some of these other categoricals and we will be working with visualizations in excel that allow us to do those sorts of analyses one last thing i should mention along these lines is that excel has a lot of capabilities however some of the visualizations that it can do are limited so for example you can use tools like the r programming language to do dimensional analysis with five six seven eight dimensions simultaneously okay so when you're doing an eight dimension data visualization in r programming you need a really big monitor because you'll need the real estate to see it but it is possible excel breaks down typically in those high-level dimensions and our programming can also easily combine both numeric data and um categorical data as well so i teach a course by the way on our programming where i take excel users and i teach them our programming and these data visualization techniques that i'm that i'm talking about here and of course you can use other tools as well i tend to use our programming for this but tools like tableau and power bi are also useful to take over on this multi-dimensional high-dimensional data visualization analysis when excel just can't do it okay pivot tables excellent excellent way to do an initial pass of categorical exploratory data analysis next up in the series we're going to be investigating numeric data using histograms when video 3 is up i will put it either here or here and if you're interested in learning more you can go and check out the card up here i've added a couple of other videos that you might be interested in and remember that you can get the workbooks from the github and the link to the github is down below in the description under this video okay that's it until next time please stay healthy and i wish you very happy data sleuthing
Info
Channel: David Langer
Views: 2,814
Rating: 5 out of 5
Keywords: excel, microsoft excel, exploratory data analysis, exploratory data analysis with excel, tutorial, eda, data analysis, business analytics, data analysis toolpak, histograms, data visualization, boxplots, box plots, bar charts, pivot tables, pivot charts, excel pivot tables, excel pivot charts, excel workbooks, analytics, machine learning, machine learning models
Id: cHk81hqBv_I
Channel Id: undefined
Length: 17min 20sec (1040 seconds)
Published: Fri Apr 23 2021
Related Videos
Note
Please note that this website is currently a work in progress! Lots of interesting data and statistics to come.