How to Analyze the Distribution of Datasets in ArcGIS

Captions
In this tutorial we'll look at how to analyze the distribution of datasets in ArcGIS. In this case we're looking at two variables: the percentage of Hispanic residents in a community and the owner occupancy rate of the same community. I have here a shapefile of Massachusetts census tracts acquired from the Census, and I have joined to it American Community Survey demographic data for the percentage of Hispanic residents as well as the owner occupancy rate of the census tracts.

The question is, how do we analyze this data? The very first thing we want to do is get to know our data. Specifically, we want to ask how the data is distributed. In a GIS context, distribution takes on at least two meanings. The first has to do with geography: literally, where is the data distributed? The second has to do with the behavior of the values themselves.

We'll start with geography, since that's the most obvious place. Here we're looking at the distribution of the percentage of Hispanic residents in each census tract throughout the state. We can see areas where there are high and low values. To some extent the borders of the census tracts are a little distracting, so I'm going to change the symbology so that we can get rid of them. I'll right-click on one of the symbols, go to Properties for All Symbols, and eliminate the outline. That way it's not distracting and we can actually see the values a little bit better.

We can see right off the bat that there are a few places in the state that have high percentages of Hispanic residents: the immediate Boston area, or just north of Boston and the East Boston neighborhood, Lawrence, and then Springfield over here, plus a few other clusters throughout the state. But most of the state appears to have a relatively low percentage of Hispanic residents.

The complication, of course, of looking at data in this way is that it's partly affected by the classification scheme that we use. By default, ArcGIS will go to a
natural breaks (Jenks) method, and when you go to the properties to look at the symbology you'll see that listed over here in the upper right. That method uses a statistical technique to identify clusters in the data, in order to group values that are similar. When you click on the Classify button, you'll see how the data is broken up into those different classes. This is indicated by these blue bars on the histogram, the histogram showing the frequency of values that fall in a certain range. Here we can see that values that fall between zero and just under seven percent have the highest frequency of occurrence, so most census tracts have relatively low percentages of Hispanic residents. At the far end, where you get high percentages of Hispanic residents, there are actually very few census tracts that have those high values. What's interesting about the way these classes are broken up is that the ranges are very different sizes, and again that goes back to the method by which these ranges are constructed.

A more intuitive way is to use equal interval, where all the ranges are the exact same size, equally spaced, and this is, I think, what most people would imagine. When you look at the data that way, it looks slightly different. We lose a little bit more of the variation. The high spots still stick out to some extent, but most of our state is completely washed out.

Lastly, we could also try another method called quantile, in which we have the same number of observations in each class, meaning that each range essentially has the same number of census tracts that fall within it, just ordered from low to high. We can see that the low numbers have a very high frequency, there's a lot of them, and the high class has to cover a wide range of values because there are very few of them. When we look at the data that way, we see an entirely new pattern. We still see those clusters, but we see a lot more variation in the middle.
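
To make the difference between equal interval and quantile classification concrete, here is a minimal Python sketch. It is not how ArcGIS implements classification internally; the function names and the sample values are made up for illustration, and natural breaks (Jenks) is left out because it needs an optimization step that doesn't fit in a short example.

```python
import statistics

def equal_interval_breaks(values, k):
    """Upper bounds of k classes that each span the same range of values."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    # The last bound is set to the maximum directly to avoid rounding drift.
    return [lo + width * i for i in range(1, k)] + [hi]

def quantile_breaks(values, k):
    """Upper bounds of k classes that each hold roughly the same number of values."""
    cuts = statistics.quantiles(values, n=k, method="inclusive")
    return cuts + [max(values)]

def class_counts(values, breaks):
    """Count how many values land in each class.

    A value is assigned to the first class whose upper bound is >= the value.
    """
    counts = [0] * len(breaks)
    for v in values:
        for i, upper in enumerate(breaks):
            if v <= upper:
                counts[i] += 1
                break
    return counts

# Hypothetical, positively skewed percentages (NOT the Massachusetts data).
pct = [0, 1, 2, 3, 4, 5, 6, 7, 8, 10, 12, 15, 20, 30, 50, 80]

for name, make_breaks in (("equal interval", equal_interval_breaks),
                          ("quantile", quantile_breaks)):
    breaks = make_breaks(pct, 4)
    print(name, "breaks:", breaks, "counts:", class_counts(pct, breaks))
```

With skewed values like these, equal interval puts almost everything in the lowest class, which is the "washed out" map, while quantile spreads the observations evenly across classes at the cost of very uneven class ranges.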
And so the fact that you can look at the same data in three different ways and see three different patterns should make you somewhat critical of the map, and not take for granted what you're seeing. But again, I think we do see some patterns to it: certain population centers seem to have high percentages.

We can also look at the other variable, which is the owner occupancy rate. What I'm going to do is copy the layer, but instead of putting the copy in the same data frame, I'm going to insert a new data frame. What that does is allow me to look at the maps simultaneously later on in the layout. So I'm going to right-click on the layer, copy it, then click on the new data frame and paste it. It duplicates the layer, essentially, but we're operating in two different data frames. I'm going to change the value that we're looking at to percent owner occupancy, and it defaults again to natural breaks.

This looks very different from the percent Hispanic. In fact, it almost looks like a mirror image of it. You can see that much of the state has high owner occupancy rates, but there are certain pockets, again looking at Springfield, the immediate Boston metro area, and Lawrence, but also other places too, such as Worcester in the middle if you know the geography of the state, that have relatively lower owner occupancy rates. So that seems to be suggestive.

Again, what we want to do is cycle through different ways of classifying the data to get a sense of how the data looks when we map it differently, because some of the patterns we think we see might actually be misleading. So again we'll look at equal interval, and when we do that a little of the variation is lost, but we still see some of those pockets. When we look at quantile, in this case again there's more variation, and again those pockets. So there does seem to be some kind of inverse relationship between percent Hispanic
residents and the owner occupancy rates, but it's not quite as clean as we might like, and it seems to vary depending on how we map the data. That is a central challenge of analyzing data geographically: coming to a conclusion, or getting a clearer sense of what's going on between two sets of data that occupy the same geography.

But again, we're still trying to get a sense of the data itself. This way of looking at it, looking for geographic patterns and identifying where in the state high and low values occur, is definitely an important starting point. But the other way that we look at data distribution is to look not at the geographic pattern but at the behavior of the values themselves, and this is kind of a conventional statistical method of looking at our data.

You'll notice that when we go into the properties for a layer to change the symbology and go to the classification method, you are presented with this nice histogram that shows you the distribution of the data in that second sense. Here you can see, for the owner occupancy, the high frequency of census tracts with high values and the relatively lower frequency of census tracts that have low owner occupancy rates. You also get some of the basic descriptive statistics in the upper right corner here, but I want to show you another tool that accesses the same kind of statistics in a little nicer way.

When you're interested in looking at the distribution of data, you can use a set of tools called exploratory spatial data analysis tools, but they require the Geostatistical Analyst extension. So under the Customize menu you want to check your extensions and make sure that Geostatistical Analyst is turned on. Once that's checked, you can right-click on the upper bar here and activate the Geostatistical Analyst floating toolbar, and what this will do is
give us access to a lot of really neat tools. From the Geostatistical Analyst floating toolbar we'll go to Explore Data, and the first thing we'll look at is, again, a histogram. This histogram pops up in a floating window, and the lower part of the window is where we specify which variable or attribute we want to see. Here we're going to start with the percent owner, which is the owner occupancy rate, and you see right away the distribution of the data, again on a familiar histogram. The nice thing with this histogram is that you can control how many bars there are, the way the data is broken up.

Just to clarify, this histogram shows us the frequency of occurrence of different values. Take this bar over here on the far left. The way that we read it is that the number line along the x axis tells us the data values, meaning the owner occupancy rate, and the numbers along the y axis tell us the frequency, or how many occurrences there are of those values. What you have to pay attention to is the power of ten, ten to the power of minus two, which is to say that you have to move the decimal point over two spots in order to get the actual number. Same thing down here: ten to the negative two means you move it over two spots. What it means in a practical sense is that about 56 census tracts have values between 0 and 10% owner occupancy rate. That's very few, because there are actually, if you paid attention earlier, about 1,400 census tracts; you can see it up here in the count. Very few have such low occupancy rates. So, as we saw in the map, most parts of the state actually have pretty high owner occupancy rates, and on the far end over here we can see that something like 150 or so census tracts have owner occupancy rates between 90 and 100 percent.

What's really neat about this particular histogram, unlike the one that you see when you look at the properties, is that if you click on a bar itself you'll see the corresponding
census tracts that fall in that number range highlighted on the map. You can do that with every bar, you can do several at a time, and you'll see where those fall. This is a really powerful way of exploring your data, because in addition to seeing the distribution of your data you're also seeing how that relates to its geographic distribution, so you're getting two things at once.

The first thing to observe, though, just from a very conventional statistical sense, is that the distribution of owner occupancy rates is negatively skewed, which is to say that the data spreads out thinly in the left-hand direction, toward the lower values, while the majority of the data, the higher frequency of occurrences, is clumped over here toward the right. The significance of that is, one, it's very different from what we call a normal distribution. A normal distribution is a situation in which most of the data is in the middle of the possible values: you get very few on the low end and very few on the high end, and most are somewhere in the middle. It's kind of a starting point, an assumption about how you expect data to behave. It's also important from a statistical perspective, because there are a number of statistical tests and other processes that assume a normal distribution. If the data isn't normal, you'll have to make accommodations or look for different tools. In any case, it's a way of describing what's going on with the data.

It's apparent from the map itself that large parts of the state have very high percentages of owner occupancy, and we can see that reflected in the descriptive statistics in that little box. We see that out of the 1,471 census tracts there are some that have zero, meaning no owner occupancy, and some that have 100% owner occupancy. The mean, or average, is about sixty percent, which would
fall right about here, but notably the median is 65%. The median is just the middle-most number when the values are ordered from high to low, and 65, you'll see, is right here. The mean and median don't line up, and the reason is that the mean is pulled toward the lower end by the fact that you've got values so far to the left. We say that the data here is skewed negatively, as I said earlier, and the skewness is also reflected up here in a slight negative number, just an indication of the fact that the data is being pulled to the left.

The kurtosis is another descriptive statistic describing the shape of the distribution, specifically how heavy its tails are rather than how symmetrical it is. If the data were normally distributed about the mean, we would have a kurtosis of something like 3, and the lower value here tells you that the data departs from that shape. So it's not redundant with the skewness, but it reflects the same non-normal nature of the data.

Lastly, our standard deviation is telling us the level of dispersion about the mean. The mean, again falling about right here, would be the average, but the dispersion is pretty significant. For normally distributed data, about 68 percent, or roughly 70 percent, of all values fall within one standard deviation of the mean, and you can see that in order to capture that 70 percent here you have to go pretty far out in both directions. So the data is fairly dispersed: there's a lot of scatter, or dispersion, around the mean.

All of that is to say that this is what's going on with the data, and what's interesting is that it could be very different. If we switch over to percent Hispanic, we'll see there's a very different distribution compared to owner occupancy. So if we switch back and activate the other data frame, we can look at that for a second. We can see that, almost inversely compared to the owner
occupancy, for the percent Hispanic residents most census tracts have very low numbers. Again, we can click on that bar and we can see that... I'm in the wrong layer, so let me go back, because we were working off of that. [laughs] There you go. I'll have to switch to percent Hispanic. Here we go.

OK, so you can see that the vast majority of tracts have very low numbers. We see that over 1,200, almost all of the 1,400, have values that fall somewhere between zero and about 10%. It's only when we go to the far end over here, where you have values above, say, 76%, that you're looking at only a small subset of census tracts, and again they're concentrated in the Springfield, Boston, and Lawrence areas. For those of you who know the state that may not be entirely surprising, but it kind of confirms it.

Notice too that the distribution for the percent Hispanic is not normally distributed, so the data is not evenly distributed around the mean. In fact, this is highly positively skewed, because we can see that although we've got this big clump of high-frequency values on the low end, the values actually extend far out to higher numbers. So you've got a really wide spread, and the tail is stretched toward that higher, positive end, and you see those numbers reflected there. You can also see that there's a significant difference between the mean and the median. The mean, at about 11%, would place it about right here; that would be the average value. The median is actually far over here, and that's because, again, the mean gets pulled by those higher, more extreme values that are in the data set. We also see a higher standard deviation, which means that there's a lot of dispersion around that mean, because the data is highly dispersed. So there's a lot of variation within the data set, and it's not normally distributed. So
at the very least, what it tells us is that there are only a very few places with high values and a lot that have low values, but there's a big spread. And we see that the data distribution is very different between the owner occupancy rate and the percentage of Hispanic residents. That's a good starting point, because we're looking for similarities or differences between them, and as we see there are a lot of differences. But it doesn't necessarily tell us whether or not the data is related, or whether there's something else going on between them. It does, as we noticed earlier, suggest that there is something of an inverse relationship, because it does appear to some extent that the places that have high values in one have low values in the other. It also seems that the datasets have opposite distributions. When it comes to the percentage Hispanic, we see the bulk are low values that tend to be very spread out, with the high values very few and concentrated. It's the opposite situation when we look at the owner occupancy rate, although the owner occupancy rate isn't quite as skewed as the percentage Hispanic. So there are some interesting differences and similarities, with the two data sets being inverse, kind of mirror images of each other.

The last thing: the nice thing with these histograms is that we can bring them into our layout. If we want to have this in the layout, and it's kind of nice because we have the histogram built and we also get these statistics, we can click Add to Layout, and we'll see it in the layout. In my layout, again, we have two data frames: one showing the percent Hispanic and one showing owner occupancy. Let me make sure that we're looking at the same thing here. This is owner occupancy in the lower one, so we don't get them mixed up, and we can compare that to... I will get this eventually. Let's do this. OK, good enough. So we see here the owner occupancy, and we
see percentage Hispanic above, in two different data frames. The histogram that we created over here is for owner occupancy, and it's nicely formatted with the statistics for that particular map. We can do the same thing for the percentage Hispanic: we'll just generate that histogram again, specify that that's what we're looking at, and then add that to the layout. Then we can have the two graphs right in there, which gives us a starting point for comparing the two data sets.
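
The descriptive statistics discussed for the two histograms (mean, median, standard deviation, skewness, kurtosis) can be reproduced outside ArcGIS. The following is a minimal Python sketch, not the Geostatistical Analyst implementation: the `describe` function and the sample values are hypothetical, and it assumes simple population-moment formulas for skewness and kurtosis, with kurtosis on the scale where a normal distribution scores about 3.

```python
import statistics

def describe(values):
    """Summary statistics similar to those shown beside a histogram.

    Skewness and kurtosis use population-moment formulas; whether ArcGIS
    uses exactly these formulas is an assumption. Requires values that
    are not all identical (otherwise the variance is zero).
    """
    n = len(values)
    mean = statistics.fmean(values)
    # Central moments: the average of (value - mean) raised to powers 2, 3, 4.
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    return {
        "count": n,
        "min": min(values),
        "max": max(values),
        "mean": mean,
        "median": statistics.median(values),
        "std_dev": m2 ** 0.5,
        "skewness": m3 / m2 ** 1.5,  # > 0: long tail to the right; < 0: to the left
        "kurtosis": m4 / m2 ** 2,    # about 3 for a normal distribution
    }

# Hypothetical percentages, NOT the Massachusetts census data.
pct_low_heavy = [0, 1, 2, 3, 4, 5, 6, 7, 8, 10, 12, 15, 20, 30, 50, 80]
pct_high_heavy = [100 - v for v in pct_low_heavy]  # mirror image

for name, data in (("low-heavy", pct_low_heavy), ("high-heavy", pct_high_heavy)):
    stats = describe(data)
    print(name, {key: round(value, 2) for key, value in stats.items()})
```

In the low-heavy sample the mean sits above the median and the skewness is positive, the pattern described for percent Hispanic. The mirrored sample reverses both, the pattern described for owner occupancy, while the kurtosis is identical for the two samples because mirroring changes the direction of the tail but not its weight.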
Info
Channel: Marcos Luna
Views: 11,017
Keywords: ArcGIS, ArcMap, Census, ACS, ESDA, histogram, brushing, linking, Salem State University
Id: Dr1c9LDM3v4
Length: 20min 53sec (1253 seconds)
Published: Wed Feb 24 2016