Exploratory Data Analysis With Excel - Part 3 - Histograms

Video Statistics and Information

Video
Captions Word Cloud
Reddit Comments
Captions
welcome to exploratory data analysis with excel part 3 histograms in this video we're going to talk about how to create and use histograms to explore your data in excel so if you've gotten to this point prematurely if you're looking for video number one just go ahead and click up here in the card and there'll be a link to the first video in this series okay let's go ahead and get started you can see here i'm in powerpoint and let's say hypothetically speaking you get a collection of numbers you get a pile of data maybe your boss gives it to you or maybe you pull from a database or you get it in an excel spreadsheet i don't know but this collection of numbers here simulates this idea that you're just given a pile of data and you're asked by somebody once again maybe it's your boss to say and he asks you she asks you make some sense of this data for me tell me something about it so you've got some options in this particular scenario a very common one which will resonate with you as an excel user is this you can calculate the average you can take all these numbers add them up divide by the total number of numbers that you have and that'll give you what's known as the average or the arithmetic mean and the average is a wildly useful thing it is a statistic that characterizes the balance point of all the data basically if you want to think about it you add up all these all these values you divide by the number of values you have and that essentially gives you a center point that balances all the data that's what the average conceptually is and that's a very useful thing that's one thing you could do with this group of numbers this pile of data another thing that you can do is take a look at the minimum and maximum values and just eyeballing the data you could probably tell that the minimum value is zero and the max value is nine and that's useful because it gives you some idea of how far spread out the data is the average tells you you know that balance point it talks about the centrality of the data you know where where is it located where is it centered but looking at the minimum value and the maximum value allows you to get some idea of like how spread out are the values and in fact there's a statistic called the range where you take the max value and you subtract the min value from it and that difference tells you the maximum distance between a min value and the max value and that is what's known as a measure of dispersion right how far apart are the data and that's a couple things that you can do another thing that you can do that's also very very useful is you can go through all of the data you can go through all of the individual values and say what is the count of each value how many times does each value occur in the pile of data and you can just use a table for that and you can see here i've got a table of the values 0 1 2 3 4 all the way through that should be 9 by the way not 0. sorry that's a bug in this line it should be 9. and you can see here the counts of the various values so we have two nines not two zeros one eight one seven two sixes you have zero fives interesting this is what's kind of known as a statistician would call this a way of understanding the distribution of the data not only are we understanding the values that we have 0 through 9 but also how many of each kind we have how many of each value do we have and for example we have zero fives and that's potentially interesting so this is essentially a frequency table i've got a pile of data how often does each value occur within that pile of data and this is an immensely useful thing a histogram is basically a data visualization of this frequency table that's all it is and one piece of terminology that we need before we actually slide over to excel and create some histograms is this the concept of bins and what you can see here is i have used bins corresponding to each individual unique value a value of zero zero a value of one two three four five six seven eight and nine those are the bins i've defined them based on the individual values but i don't have to as we'll see in a second in excel i have the ability to change the bins if i would like and that will change the shape the look of the histogram but at base this is what i want you to remember the histogram is nothing more than a data visualization of this frequency table let's pop over to excel and start creating some histograms okay i'm here in excel and all i've done is cloned i've copied the part 2 worksheet and just renamed it part 3 histograms it's all the same data that we saw in video number 2. a histogram is a numeric data visualization first and foremost so you use it with numeric columns of data numeric features numeric variables so for example we can look at the age column here in the table of data and take a look at the histogram for it and that'll be quite useful so what we do is we can highlight the column i'm just going to scroll over here to give me some real estate and what we do is we just go up to insert in the ribbon and we can select this statistics chart option click the down arrow and we want a histogram so voila we have a histogram right here what we can see here is essentially the frequency count based on bins and the bins are denoted right here as you can see this bin goes from 0.42 up to 6.12 years of age and the reason why we know that is because we picked the age column and we've inferred from previous videos that it is the age of the passenger on the titanic in years and we can see here that the lowest value which we saw in video number one is 0.42 and the bin includes every value starting at 0.442 up to and including 6.12 and these are the bends and what we can see here is the frequency counts so what we can see is that we have 47 values that are in the range that are in the bin of 0.42 up to 6.12 and here's the thing we can play around with this we can play around with this i don't really like these bins so what i'm going to do is i'm going to click on the bins and see how they're highlighted now i'm going to right click and go to format axis and excel says no problem dave you want to format the axis no problem and we get over here the format axis option and notice that it's set to automatic i can change that for example let's say i don't want a bin width of 5.7 years i want a bin width of exactly five years so i just go ahead and type in five hit enter and you'll see that i automatically adjust my histogram so i'm going to go ahead and make this a little bit bigger just so we have some fun and what we can see here now is we have 44 ages between 0.42 and 5.42 right on the titanic and notice that the bin width is now exactly 5 years it starts at 0.52 you add 5 years to it and that's the high end of the bin and what we can see here is a lot of information we can see that there's a relative spike in very young passengers in the titanic and a good chunk of the data is centered here between it looks like 15.42 and let's say 40.42 so a big chunk of the passengers are right in this area of the distribution this is very useful information now we can also change the bin with if we want to look at the data a little bit differently so let's double the bin width from five years to 10 years so i just go over to the format axis again click in 10. and notice that it's the same exact data the data is exactly the same it hasn't changed however now what we see is the shape is a little bit more mound it's more like a mountain it's a little bit more pronounced it's kind of this bell kind of shape this bell curve kind of shape but we can see here that our bins are from 0.42 to 10.42 and all this tells us essentially is that yeah it looks like there is a a good chunk of people in this range here of 20.42 to 40.42 which is kind of what we saw in the previous histogram but as you change the bin widths you get a different visualization so typically what you do is you play around with the bin widths and in the business world this is a little bit easier than you might think because usually you have business subject knowledge that helps you guide what the bin widths should typically be so for example in the case of the titanic we're interested in exploring this idea that we talked about in the previous video of women and children first bin widths of 10 are probably a little bit big for us to explore that idea so if we go back down to five years of bin with we get a little bit more information and we can see here oh yeah there's quite a bit uh relatively speaking quite a bit of children on board the titanic at least based on this data set that are 5.42 years or younger in age and actually we can see here that there's 20 and 19 so there's only 39 between these two bins and 44 in this bin so what we know is that we're a little bit skewed in terms of children to the very young so that is histogram it's very very useful but we can improve upon this in the last video i talked about this idea of dimensions that when we're doing exploratory data analysis we want to use data visualizations and we want to add as many dimensions as we possibly can that's usually when we start to see patterns in the data pop so for example this is what is known as a univariate data visualization or a single variable or a single dimension here we're just looking at h that's it which is useful don't get me wrong it's totally useful however it's not as useful in the context of our business question that we're trying to investigate in this series of videos which is what are the factors in the data that are highly associated with survival that's what we're trying to figure out what pieces of data tell us which passengers on the titanic are more likely to survive than others and this is useful this visualization is useful don't get me wrong but it's not particularly high powered in terms of answering that business question so what we would really like to do for example is to add another dimension to this histogram for example would be really cool is if we could color code the bars based on for each age bin for each bucket of ages how many of these folks in this age been survived versus perished that would be useful for us because remember we have a working hypothesis of women and children first and what we anticipate under that working hypothesis is that we would see survival rates for children to be higher than for adults but we can't tell that from this data visualization but there's an easy way to do that using pivot charts and that's what we're going to do next so we're going to add a pivot chart to this worksheet so i just click in the table here i'm going to close that down so i don't need it anymore and i'm going to insert a pivot table and i'm going to keep it in the existing worksheet and i'm just going to go ahead and put it right below the histogram to start with here so i'm going to go ahead and pick this cell right here for the pivot table and i'm just going to add it i'm just going to scroll down okay we're good i'm going to go ahead and scroll to the right here boom okay so the first thing we're going to do is we're going to manufacture a histogram using a bar pivot table bar chart that's what we're going to do pivot table bar chart or a bar pivot chart pivot bar chart not quite sure what the proper terminology is but we're going to build a bar chart from a pivot table and we're going to make it a histogram that's what we're going to do okay so the first thing that we do is we're going to go ahead and drag our age down to the rows and we'll get all the individual ages listed and what we want to do is we want to right click on this and we want to group it because what we want to do essentially is make the bins we want to bin up the data in the pivot table and we can say okay cool started 0.42 all the way up to 80. that's our min and max ages and we're going to go ahead and group by 5 because that's the bin width that we saw in the histogram earlier boom and you can see here we get all of our bins and notice this is a cool benefit by the way of doing it this way remember we were missing ages we were missing a lot of ages they were just blank they were empty and notice that we get that in this particular visualization so that's pretty cool next up we're going to go ahead and add in our new survived for our columns and then we're going to drag that to the values and what you can see here is we've got the counts this is awesome okay now all we need to do is go ahead and go up to insert in the ribbon go over to pivot chart and we're going to go ahead and insert a pivot chart now we want a column chart but what we want is a stacked column chart this is what we want this is going to be super useful for us and we click ok and boom look at that what i'm going to do is i'm going to go ahead and actually i think i'm want to move this down so we can actually have more real estate with that here it's going to be pretty cool okay move this down and i'm going to hide some stuff here because we don't need it i'm going to hide this we're going to have this get rid of these because i want these awesome this is great so this is a histogram it's basically histogram even though we've used a pivot bar chart to make it it is a histogram and we've added some dimensionality to it which is really cool because now we can see for example that yeah look at that the survival rate because orange means survive just just eyeballing that bar just eyeballing this bar here you can see that easily more of the children in this been survived than perished and you can see here that that inverts more folks perish rather than survive as you get older which is a good indication of women and children first and we can see here our blanks so these are our missing ages and we can see survival rates for those that are missing ages and we can see here it's maybe third a little bit less between a quarter and a third of folks just eyeballing the data now this is super super interesting and super useful notice how we added an extra dimension to the histogram and it becomes much more useful to us but what's really cool is is that we don't have to stop there using the mighty pivot table and pivot charts we can add slicers and slicers allows us to do even more cool things so let's go ahead and insert and we're going to go all the way over here we're going to add a slicer and we're going to add new p class and we're going to add sex as slicers and this is awesome awesome because now we can say look let's take a look at only males not good but what we can see here is that the only bright spot in the male data is children you can see here that very young boys survive more than 50 and then we can say okay cool well does that vary by p glass so we can say how about in first class oh yeah look at that you can see here the survival rates get improved for males in first class basically across the board look at second class if you're a child in second class if you're a boy in second class you basically survive if you're an adult male not so much in third class same thing now it's awesome courses we can go ahead and we can get everything back and then we can say let's take a look at females all up and we can see that generally speaking females survive much better on the titanic but we already know from video 2 that it's not equally distributed so for example look at first class basically all women survive all females survive in first class irregardless of age very few perished second class same thing third class that's where it gets problematic once again notice that very young girls survive disproportionately and also notice here that blanks females in third class without an age survive more than they perished so this right here shows you the power of data exploration with excel doesn't take much throw up a quick pivot table quick pivot chart get some slicers bam you're getting all kinds of goodness coming out of the data now what would be really useful though is if we could see something like this all at once because you notice here that i just gotta you know click the slicers and all this kind of stuff and i could go through the rigmaroo of like trying to create a six different pivot or six different histograms here which i could totally do but there's an easier way to do it and let me show you what i mean by that okay this is what i mean by that this is a awesome data visualization notice that the first row here are all females and the second row is males in each column corresponds to a p class i've got my histogram of age and i've also got parish versus survive notice that this visualization is super powerful what we had in excel was great don't get me wrong it was awesome however this is better because we can see all the data all at once and it's very very easy this data visualization was created in the r programming language and by the way i have a online course that teaches excel users how to create charts like this and it is very very easy by the way super easy anybody can do it with excel skills but notice how this multi-dimensional data visualization is so powerful because notice that i've got how many dimensions have i got i've got age that's one i've got sex that's another i've got p class that's a third and then i've got survival that's four so this is a four-dimensional data visualization based on histograms and this is wildly wildly useful stuff if you're interested in learning more about how to create data visualizations like this and trust me if you know excel you can learn how to do this you can click up here or you can check the description below the video there'll be links to another video on my channel where you can learn more about how you can create the data visualizations like this that's it for video number three histograms use them typically don't use them out of the box with just one dimension as we saw in the beginning you want multi-dimensional histograms that's when they become really really useful and in the excel world that basically means roll your own using a pivot chart with slicers and you're off and running next up in video four we're going to be talking about box plots which is a really really super useful technique and when that video is ready it'll show up either here or here on as a as a link that you can click and until next time please stay healthy and i wish you very happy data smoothing
Info
Channel: David Langer
Views: 2,340
Rating: 5 out of 5
Keywords: microsoft excel, excel, histograms, histograms in excel, exploratory data analysis, data analysis, data analysis with excel, exploratory data analysis with excel, pivot tables, pivot charts, analytics, business analytics, tutorial, excel histograms, histograms with excel, histograms using excel
Id: 7QHcyrOOuZU
Channel Id: undefined
Length: 20min 50sec (1250 seconds)
Published: Wed Apr 28 2021
Related Videos
Note
Please note that this website is currently a work in progress! Lots of interesting data and statistics to come.