Exploratory Data Analysis With Excel - Part 1 - Basic Numerics

Video Statistics and Information

Video
Captions Word Cloud
Reddit Comments
Captions
if you're a professional looking to up level your skills with data you want to have more impact at work by using business analytics this is the video series for you this is part one of exploratory data analysis with excel throughout this tutorial series i'm going to take a data set and walk you through the process that i use personally in my hands-on work as an analytics professional to explore dataset uh formulate business questions and then go look at the data and see if i find any patterns or any associations with a thing that i'm interested in that business question so a little bit of housekeeping this video will be a little bit longer probably than some of the others and that's simply because we need to get things set up so first of all in the description below there is a link to a github so every single one of the worksheets workbooks that you see in this video series will be uploaded to that github so you can go and grab them anytime you want you can grab part one or part i don't know 13 whichever however long this goes i'm not quite sure at this point so you can get the files so the link's in the description below second and this is important the subject of this particular video series is exploratory data analysis that doesn't have a time component so for example you grab a bunch of data about your customers from your database let's say and put it into excel those customers are just you know like they have static attributes like their name and where they live and how much product they've bought and things like that if you're interested in analyzing business data over time so for example sales or expenses or your digital ad click-through rates or your conversion rates on your website that sort of thing i have a video dedicated to that i call that kpi analysis and you can just check that out up here but don't click on it just yet okay it's there at any point you say hey this video isn't for me it's not what i really want to do right now so there we have it that's the housekeeping so first thing that we need to do is go to excel and enable something known as the data analysis tool pack because that is super awesome so let's go let's go ahead to excel right now okay you can see here that i'm in excel this is the data set that we will be using through the entire series but we're not worried about that right now what we want to do is enable the data analysis tool pack add-in in excel and this works for both windows and mac versions of excel if you have a modern version if you have an old version on on mac then it won't work but if you have like 365 or 2019 versions of excel you should be good to go so the data analysis tool pack is an awesome add-on that provides a bunch of really cool features for us to use to analyze our data to explore our data so let's go ahead and enable that so how we do that is we go up here to file and we click on file and then way down here we want to go to options and that brings up a nice dialogue called the excel options dialog and what we're interested in is add-ins so you go all the way down here to add-ins and click on that and we want to manage our excel add-ins down here so we're going to click on go and that'll bring up a little dialog here and there's a bunch of different add-ons the solver is really super awesome i might do some videos on that some other day but right now all we're interested in is this first one right here analysis tool pack and click that check check the box there and then click ok now you don't notice a change except for if you go to the data section of the ribbon here the data section of the ribbon and click on that you'll notice that way over here you now have a new subsection of the ribbon of the data ribbon called data analysis that is the analysis tool pack and this is awesome we're going to be using this in this video and we will likely use it in additional videos later on as well so we've got the analysis tool pack open that's awesome that's our infrastructure the thing you have to remember is that the power of your exploratory data analyses are not a function necessarily of the tool that you use we're going to use excel in this video series i also use the r programming language as well what you see here in this video series is the exact same thought process and mostly the same techniques that i use when i use rns instead of excel r just has some additional benefits that you just don't get in out of a box excel but we'll talk about that at some other time so the important thing is that you as the data analyst as the business person applying your knowledge of what's going on in the business and trying to figure out what the data tells you about that that's the most important thing it's not the tools it's what how you execute the analyses what you bring to the table as the analyst that's most important and second of course is the data you can't if you have no data there's no way to get insight so those two things are most important the tools doesn't matter so what we're going to be focusing on in this video series is the thought process the ideas the concepts the kinds of analyses that we that we execute to to derive insights from the data rather than focusing on the tool itself so you're going to hear me say this so many times throughout this video series and it's super super important when you're using exploratory data analysis in the business analytics space for analyzing business data the single most important thing that you need to keep in mind is a question if you want to use a fancy term a hypothesis but basically the same idea you want to have a question you can't think of exploratory data analysis as kind of this organic natural thing where you just sit down with a table of data like this and just start looking at it and you can do that however in business analytics my experience has universally been that if you want to have maximum impact you want to focus on a business question first and foremost the business question is the critical thing and this might be a bit controversial but i'm going to go ahead and say it anyway if you don't have a business question you should really stop and ask yourself why am i doing this data analysis business questions provide context they provide roi potential for your data analysis and you should always have one always and sometimes maybe it's a business question that you're that you want to investigate because it's related to another question that really interests you so for example you're interested in understanding how your digital advertising spend drives paid conversions so you might go then say okay what ads typically have the best click-through rate that question that sub question rolls up to the higher level question and that's okay that kind of that kind of thing is totally reasonable and totally expected first off we got to start with our data set here and ask ourselves what is the business question i'm going to air quote that in this data set that we're interested in answering now this data set here is a very famous data set it is the titanic data set and i picked this data set because it is a relatively small data set but it's very interesting from an exploratory data analysis perspective that's why i use this data set in all of my courses all of my paid online courses for example it teaches you a lot about how to do data analyses some folks might be saying well dave why don't you use a marketing data set or a human resources data set or a finance data set those are all legitimate questions the only problem is that those are all very domain specific what i want to teach is the general concepts that any professional can then take and use with their hr data or their financial data or their marketing data so not surprisingly one advantage of using this data set the titanic data set is that everyone's familiar with the problem which is unfortunately there were folks that perished on the titanic and this column right here the survive column gives you a one if the passenger survived in a zero if they did not so not surprisingly the high level business question that we're going to be looking at throughout this tutorial series is what pieces of data are highly associated with survival that's the business question that we're going to answer that we're going to try and answer with our exploratory data analyses now throughout the series we might break down into a sub-question for example we'll take a look at age in particular later on for reasons that will become clear later on and that'll be like a sub question how what what factors are associated with ages that's definitely something we're going to be interested in that's a sub question that then rolls up to the high level business question of what factors as evidenced by the data are highly associated with survival okay so now we have our business question great first up we say look we've got a collection of data here we've got you know some columns of data and i'm not going to go through what any of these mean if you're already familiar with the data set and what they mean that's totally okay you're still going to get value from this series if you don't know much about the data set that's the best because we're doing exploratory data analyses however one thing that we do know is that we want to understand how the rest of the data in this data set is associated with survival that's our overarching business question that guides our analyses so first up well let's just go left to right okay we're doing exploratory data analysis let's assume that we can infer a little bit of what's going on in the data just by the column names the column headers totally reasonable you're going to see that all the time in the real world so we know this is passenger id so if you're familiar with it obviously are you familiar with the concept of an id column this is an identifier so in the united states for example citizens of the u.s have an identifier it's called a social security number it's a unique identifier for each you know citizen the united states same idea here but if we didn't know that passenger id was an identifier column we could actually perform an analysis on this column and derive some insight and in fact the way i'm what i'm going to show you you're going to actually understand why this column is likely an identifier but let me let me zoom in on the data first so just you can see a little better so what we're going to do is we're going to take a look at passenger id column a and we're going to do a very quick analysis on this column all up we're going to explore it and how we're going to do that is we're going to use the mighty analysis tool pack so we click on data here in the data ribbon and we move over to the data analysis option here and we click that and that brings up dialog and there's a ton of really awesome things that you can do with this for example you can do regression analysis i teach a course that explains how to do regression analysis in excel to folks with no technical background and i use this particular option it is awesome okay but for this particular analyses what we're interested in is the descriptive statistics option so that's the one we're going to select here so we click on it highlight it in blue click ok and we get another dialog and the dialog says okay you need to give me an input range what are the cells of data that you want me to analyze with descriptive statistics and what we know is we want to do column a so we'll just do column a here and then we know that we got a label in the first row so we'll just go ahead and click that and then we want the output in this worksheet so let's just go ahead and go over here oops click the button there and let's just go ahead and put it in n2 let's say n2 that's where we're gonna put it so we're gonna do column a we're gonna do output range in two and then we're gonna go and do summary statistics here click check that and everything looks good and we just click okay and we chug away here and what we get here is some output for the passenger id column some descriptive statistics or summary statistics and this is wildly useful stuff okay so first and foremost what we want to do is take a look at the count of values without even scrolling through the bottom of the table what this tells us is that there are 891 rows there are 891 values in the passenger id column sweet that's good that gives us a lot of information about what's going on how many pieces of data do we have 891 excellent so what we need to do here is then take a look at some of these other things and see what's going on now this is what next is interesting is taking a look at the range the range is simply the maximum value found and you subtract off the minimum value this is super interesting because what this tells us is that okay we have 881 rows we have 891 pieces of data and the minimum value that we have is 1 and the maximum value is 891 and we have a range of 890 as a result which tells us that okay this is this is kind of interesting if we had just a number that was kind of randomish so something like revenue or somebody's age on the titanic or how much they paid for their ticket you wouldn't see something so closely associated with the raw amount of rows that you have so if we scroll back over what we can confirm here is that oh yeah it looks like this is just what is known as a monotonically increasing number notice that we got 60 61 62 63 so forth okay but even if i didn't look at that i can tell from this output here that there's some fishy going on in this data it's it's unlikely that a typical numeric piece of data in business analysis is going to have 891 rows with a minimum value of 1 and a maximum value of 891 okay so that's interesting let's let's contrast this with another analysis so we know this is a numeric column it's not a useful numeric column probably because it's just an identifier but it's a numeric column so we can use this technique of descriptive statistics on numeric columns of data now strictly speaking this is encoded as a numeric column it's ones and zeros however we know a priori it's part of our business question that in fact this is actually not really a numeric variable because it's simply a yes no did the person survive yes they have a one if the person perished unfortunately then it's zero so we know that this is actually a categorical piece of data it's got categories survival or not survival it's not actually numbers we can take a look at this particular column of data now and let's do an analysis on that on pclass and kind of explore and see what's going on so once again we go up to the ribbon we fire up the data analysis tool pack and we want to do descriptive statistics this time though we're going to go ahead and click on column c here and we're going to change the output range and keep everything else the same and let's go ahead and put that in column q or cell q2 let's say let's start the output there and then we click okay and then we've got p class okay so once again we've got something interesting going on here so notice that we've got a mode notice that over here the mode didn't really exist wood couldn't be calculated and if you're not familiar with the mode is is it says look out of all these numeric values which one is the most frequently occurring and notice that this one is a mode of three which is interesting because notice over here we couldn't calculate it because every piece every piece of data was different every piece of data was different right it was 62 63 64 6566 there wasn't any value that happened more often than any other so mode couldn't be calculated but notice over here that it can be calculated that's really interesting and then if we go once again we say okay there's 891 rows great and we say oh look at the range the range is 2 we have a minimum value of 1 and a maximum value of 3. this is interesting notice that we see some similarities between p class and passenger id in terms of the range of maximum value right that they seem to be well defined but notice that a big difference is that mode is not calculated over here because every value is unique whereas over here we say oh no look 3 is by far and away the most common value so if we scroll back over to the table and just kind of eyeball it what this tells us is that this is a number column to be sure the way it's encoded but maybe it's not really a number column because what what it's telling us is that we have a small number of values and the easiest way to do that of course is also to check the filter in excel you just click on that and you say okay i've only got three distinct values available in this thing so that's probably an indicator that this is actually a categorical variable which kind of makes some sense if we think about the context which is this is the titanic data set and the p class probably represents what class of ticket did the passenger have first class second class third class and notice how i didn't need the data dictionary i didn't need a business subject matter expert to tell me what this means i can use these techniques and just a little bit of knowledge about the data to kind of infer maybe what's going on and that's kind of the heart one of the hearts one of the hearts okay one of the ideas how about that of exploratory data analysis next up we're not going to look at name obviously because that's not numeric we're not going to look at sex because that's not numeric let's take a look at age that's probably a pretty interesting thing for us to look at we're going to look at the age column here column f so once again we'll go up to the data ribbon here we'll go to the data analysis tool pack we'll click on descriptive statistics and we're going to swap out column f for column c and then let's go ahead and put our output t cell t2 okay let me keep everything else the same right we want our summary statistics and we know we have a label in the first row boom okay now this this this is cool right this is where this is where the analysis tool pack really starts coming into it it's its own in helping you understand helping you explore columns of numeric data so here we have the age column and notice that the count is not 891 that tells us that we're missing data and we're missing quite a bit because you notice that we have 891 total rows of data in as verified by the count here and we can see that for peak lasting way because p class has no missing values but we can see here we have 714 so we're missing well over what 170 rows of data 177 rows of data yeah 177 rows of data so we're missing a lot of data here in the age column that's the first thing that this tells us which is problematic because missing values in when you're doing business analytics are not uncommon and unfortunately typically there's no way for you to rectify them because if they're not in a database or they're not in a computer system they're not in a spreadsheet there's no way for you to go out go out and collect the information so this is super interesting now what we've got here is a more of a typical kind of situation that you see with the numeric data in business analysis which is we've got a range of values notice how our minimum age is 0.42 so we'll assume that that's in years obviously because of the name of the column and maximum is 80. and we can see the range is pretty high and then what we can see here is like her median age which is like if you take all 714 age values the median is and you arrange them in ascending order right from lowest to highest the one that's right in the very center 50th percentile the halfway point in that sorted list of numbers that's the median which is 28 and the most frequently occurring value is 24 years of age and notice that we now can calculate a mean that's prob excuse me a mean or an average that's just a fancy way of saying the average which is interesting which is 29.669 or excuse me 29.699 years of age we're going to explore this in a later video probably video three using a histogram which is a data visualization that you use for numeric data to kind of understand what's going on with the range of values but this is a good first step it tells us a lot it tells us that we're missing quite a bit of data we've got ages ranging from less than a year old all the way up to 80 years old and we've got some measures of like well if prototypically you know if we just had to guess about the age of a passenger in the titanic what would we guess based on the data that we have we would say well they're probably in the range of you know 28 to 30 years old or so on average and a lot of folks around 24 years of age were on the titanic as based on the data set that we have what i would encourage you to do is go to the github and grab the files grab the excel files and perform this analysis on the sibspa and parch and fair columns because those are the remaining numeric columns and look for the kinds of things that i looked at for example is this really a numeric variable is this is this really a numeric variable because notice that p class for example when we looked at that wasn't actually a numeric variable it was actually a categorical variable just encoded with numbers so we always want to be cognizant of that for sipspa that's why what i pronounce it perch and fair so i'd really encourage you to do the analyses as homework using the analysis tool pack also this is the first video in the series when i record number two i will update the video and it will show up in one of these two places and as i mentioned before you can get the github from the description below and if you're interested in learning more about kpi analysis analyzing data over time you can go ahead and check out the video up here it's a good time to click it now if you're interested okay that's it for video number one i'm looking forward to this series it's going to be awesome until next time please stay healthy and i wish you very happy data sleuthing
Info
Channel: David Langer
Views: 6,026
Rating: 4.981132 out of 5
Keywords: Exploratory Data Analysis With Excel, tutorial, exploratory data analysis, data analysis, business analytics, data analysis toolpak, histograms, data visualization, boxplots, bar charts, pivot tables, pivot charts, excel workbooks, analytics, machine learning models, eda, excel, microsoft excel, excel tutorial, exploratory data analysis excel, exploratory data analysis in excel, exploratory data analysis using excel, excel exploratory data analysis
Id: 1zEFJHbG0aE
Channel Id: undefined
Length: 23min 56sec (1436 seconds)
Published: Wed Apr 21 2021
Related Videos
Note
Please note that this website is currently a work in progress! Lots of interesting data and statistics to come.