Esri 2014 UC Tech Session: Spatial Statistics: Simple Ways to Do More with Your Data

Captions
I'm Lauren Bennett and this is my colleague Flora Vale. We're going to be talking about spatial statistics: simple ways to do more with your data. I'm on the spatial statistics team in Redlands, the development team that builds all the tools we'll be talking about today. Flora has been a member of that team since she started at Esri, and she created the slides for today, which I think you'll all agree are the most awesome slides ever.

We have to decide what workshops we're going to do, name them, and write their little paragraphs back in February, and February feels really far away from July. So in February we had this brilliant idea that we would do a workshop where we explain every single tool in the toolbox and demo every single tool in the toolbox, and it was going to be great. It is going to be great, though we had moments of asking ourselves why. This is the first time we're doing this session, and we're really excited about it. It gave us an opportunity to dig into some of the tools that maybe we don't use as much as others, because we have our favorites, but just like children, we really love them all equally.

One logistics note: they're going to simulcast this session in room 11A, so the slides and audio will be broadcast there. When you get tired of standing, you can head to 11A; we won't take it personally.

I want to start by talking about what spatial statistics are, since we're going to spend the whole session on them. If you looked it up in a dictionary, you'd find a fairly boring definition: spatial statistics are a set of tools, methods, and techniques that let you explore the spatial patterns, spatial distributions, and spatial relationships that exist in your data. That's true, but it's really important to note that these are not just traditional statistics that happen to be applied to spatial data. These are tools that inherently take the spatial nature of your data into account: things like proximity, area, length, orientation, how things are connected, coincidence, distance, and direction. There are all these different ways we use the spatial characteristics of your data in the mathematics of the tools, so they are truly spatial statistics. A lot of the time that idea of connectivity or relationships comes down to proximity in the math, but you'll see throughout the tools that all sorts of different aspects of your data feed into the underlying algorithms and methods.

All of that aside, the truth is that spatial statistics are really just an extension of what we do naturally every time we look at a map. It's what our eyes and minds do. When you look at a map, a thematic map, a point map, whatever it might be, you instantly think: there's a cluster over here, what's going on? There's not a lot happening over there, why is that? First we see the spatial patterns, but our brains are also incredibly good at starting to understand relationships. If you're looking at a map of an area you're familiar with and you see a pattern, you go beyond saying there's a cluster over here; you say, I know something about that area, it must be related to these things, these variables must be important. We do that instantly every time we look at a map, and that's why maps are so powerful.

But this really comes down to what makes spatial statistics so important, the question that gets me up in the morning: is it data or is it information? This distinction is being talked about more and more. Is a spreadsheet data or information? Just because you have data doesn't mean you know anything; data by itself isn't enough. We have to turn data into information, and I'd argue the same is true for maps. Is a map data or information? Of course I don't think a map is the equivalent of a spreadsheet. With a map we're getting there, we're starting to turn data into information, but I think of it as having one foot in data and one foot in information. With a thematic map, all you've really done is put the numbers from your spreadsheet on a map; you're still essentially looking at numbers, they're just color coded now. But we can do more than look at thematic maps or put points on a map. When I look at a spreadsheet, I can't stare at thousands of numbers and make sense of them. I immediately want to know things like the average value, the standard deviation, the minimum, the maximum. Very simple descriptive statistics, but they're the only way I can boil all that data down into information and start to make sense of it. The same goes for maps: there are ways to go beyond visualization, the equivalent of just staring at the spreadsheet, and turn that data into information, and spatial statistics are a really powerful way to do that. With that, I'm going to pass it over to Flora, who will go through the slides while I give a couple of demos.

Okay, hello. We really are going to go through every single tool in the toolbox. There are 21 tools, and we probably have less than an hour left at this point, so it's going to be fast. The goal is not to turn you into experts on all of these tools. The goal is to make you aware that the tools exist, maybe get you excited to learn more, and maybe inspire you to look at your data in some new ways. If we do inspire you to learn more, we have lots of other workshops, and at the end of the session I'll put the schedule up so you can see where to get more specific information about the different tools.
Before we start, I also wanted to point out that the little script symbols next to some of the tools mean they are Python scripts, so you can actually look inside and see what's happening. There's no secret: the math is right there, it's not a black box. And if you're so inclined, you can even tinker with the code and make adjustments. I do recommend you make a copy of the original script before you go changing things; that's a lesson learned from experience.

The tools are broken down into different toolsets, so we'll go toolset by toolset, starting with Measuring Geographic Distributions. This toolset is like the descriptive statistics you would use with non-spatial data: we'll look at things like central tendency, distribution, and standard deviations to measure what's happening with our data.

The first tool is Central Feature. Central Feature identifies the feature that is the most central. Say we have a set of points: we measure the distance from each point to all the other points, then see which point has the shortest total distance to all the others. The feature with the shortest distance is the central feature, and the tool outputs that feature. Lauren will show you how that works.

Here's a good use case for the Central Feature tool. We have the locations of all the fire stations, and the battalion chief wants to move his office to the fire station that is most centrally located, so that he's close to all the people who work for him. It's a very simple analysis, but it would be pretty difficult to do just by looking at a bunch of points on a map. Sure, you could get to the general area, but which one really is the best? We might as well pick the truly best one. We pick those stations and run the analysis. We also have options: you might want to weight stations that have more employees more heavily. We can weight these descriptive statistics so that locations with more importance in the analysis pull the central feature toward them. We do that here using the number of employees at each location, and when we run the analysis it returns the central feature. We can re-symbolize it so we can see it, and it lines up with station number five. A really simple analysis, and I think that's the important point: these tools are incredibly simple, but they can also be really powerful for solving problems. So that's Central Feature.
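A minimal arcpy sketch of the fire-station demo above. The workspace, the FireStations layer, and the NUM_EMPLOYEES weight field are hypothetical names, and exact signatures can vary by ArcGIS version:

    # Hypothetical data: a FireStations point layer with a NUM_EMPLOYEES field.
    import arcpy

    arcpy.env.workspace = r"C:\data\FireDept.gdb"

    # Identify the station with the shortest total distance to all other
    # stations, letting stations with more employees pull the result.
    arcpy.CentralFeature_stats("FireStations", "CentralStation",
                               "EUCLIDEAN_DISTANCE", "NUM_EMPLOYEES")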
Next on the list is Mean Center. Mean Center identifies the geographic center of your features. With Central Feature you choose one of your existing features, the one that's most central; with Mean Center you create a new feature at the center of all your features. We take the average of all the latitudes and longitudes, and that becomes our mean center.

Now you might be thinking, please don't demo Mean Center, that's ridiculous. But Mean Center can actually be really powerful, and I think this is true for a lot of the tools we're talking about, particularly in Measuring Geographic Distributions but really across the spatial statistics tools in general: some of the most powerful ways to use them are to look at things over time or to compare different distributions. In this case we're going to look at the mean center of population in California. We use our population data for 2010, weight it by population, and run the analysis, and we can see where the mean center of California's population currently sits. It's no surprise that it falls somewhere between Los Angeles and San Francisco, because those are the two big population centers in California. But what about a hundred years ago? Would the mean center have been the same in 1900, or 1950? We can run Mean Center multiple times, once for each time period, and we end up with a series of mean centers: one for 2010, 2009, 2008, 2007, and so on. Then we create a group animation with the mean centers group layer. The animation just turns layers on and off in that group layer: 1900, 1910, 1920, 1930, and it traces the shift in population that started when people moved to California in the gold rush. Then Los Angeles became the place to be, Hollywood became known throughout the world, people kept moving to Southern California, and now we see the distribution sitting right in the middle. So a ridiculously simple tool, one where, as Flora showed, you could basically do the math by hand, but you can still get meaningful information if you use it in the right context.

Next, Median Center. The median center is similar, but it's more robust to outliers. Again it's pretty simple: we take those latitudes and longitudes, find the median value of each, and that becomes our median center.

Here's a really good use case for Median Center. We're going to site a library, not just for reading but also for access to the internet, in a community where a lot of the population doesn't have a computer at home. We want to give kids access to computers and the internet at the library, and we don't want to site just one: we want one in each district. One thing that's really awesome about a lot of these tools is a parameter I haven't used yet, the case field. We look at the patron locations, and in this case we don't weight them; we use the case field instead. We give the tool a field in our dataset, and for each unique value in that field it groups features together and creates a separate median center. So the tool isn't going to create one median center, it's going to create three. This field right here is district: each patron has a one, two, or three associated with it, so the tool finds all the ones and creates a median center, all the twos and creates a median center, and all the threes and creates a median center. If we change the symbology on the output, we can see those three locations, one in each district, created based on that case field.
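A minimal sketch of the library-siting demo, again with hypothetical names (a Patrons point layer whose DISTRICT field holds 1, 2, or 3):

    import arcpy

    arcpy.env.workspace = r"C:\data\Library.gdb"

    # One median center is produced per unique value in the case field,
    # so this yields three candidate library sites, one per district.
    # "#" skips the optional weight field.
    arcpy.MedianCenter_stats("Patrons", "LibrarySites", "#", "DISTRICT")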
That's another really useful thing. In this case we're creating three median centers, but we could also use this for comparison. Say each of those districts were different gangs, and we wanted to see the median center of graffiti incidents for each gang: we could compare where their median centers fall using that same case field. So that's another powerful capability of these tools that a lot of people don't know about.

Now, you might be thinking those two tools seem pretty similar: what's the difference between the mean center and the median center? To illustrate, I'll add an outlier up in the top corner and we'll see what happens. Both the mean and the median shift because of the outlier, but the mean moves all the way up toward it, while the median only moves a short distance. It really depends on what your question is, but the median is definitely less affected by outliers, while the mean is the true mean.

Next up is Linear Directional Mean. This one is for linear features, for lines, and it identifies the average direction your lines are pointing. We take a set of lines, measure the angle of each line, and the average of those angles becomes our linear directional mean.

One pretty cool example of how we could use this: this map, which looks like it was drawn with one of those spirograph toys, is actually a really big mess of data. It's hurricane tracks, for many, many years, and what we want to understand is whether the average directions of those hurricanes have changed over time. We run the Linear Directional Mean tool on the hurricane trajectories, and again we use a case field, which this time is the year: we don't want the average across all years, we want the average for each year, so the tool creates one directional mean per year. It's pretty amazing how quickly we get the answer, because that is a lot of hurricane data: it goes back to the 1850s, around 38,000 hurricanes. The raw result isn't quite enough to understand, because all we see is one arrow per year and we don't know which is which. So I make a version symbolized by year, adding all the year values in order and bumping up the symbol size a little so the arrows stand on top. Now the darker the arrow, the more recent it is, and you can see the data runs from 1850 all the way to 2008: the lightest arrows are from the 1850s, up to the darkest ones in 2008. We can start to see whether there's any change in direction. I'm not sure I've come to a conclusion on this one, but we're a lot closer to turning those tracks into information than we were when we were looking at all the tracks at once.
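A minimal sketch of the hurricane demo, assuming a hypothetical HurricaneTracks polyline layer with a YEAR field:

    import arcpy

    arcpy.env.workspace = r"C:\data\Hurricanes.gdb"

    # One mean-direction arrow per year. "DIRECTION" means the from/to
    # order of each line matters, which it does for storm tracks.
    arcpy.DirectionalMean_stats("HurricaneTracks", "YearlyMeanDirection",
                                "DIRECTION", "YEAR")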
Next we move on to Standard Distance. Standard distance measures the degree of concentration or dispersion of your features. Think of it like the standard deviation of a set of numbers: you want to know the variance, except here we're looking at the variance in locations. We start with the mean center, compute the standard distance to all the other features, and that becomes the radius of our standard distance circle. You can ask for one, two, or three standard deviations.

This one is actually pretty cool. We're looking at two bookstores in the San Francisco area, and we have the locations of the customers who belong to each store's loyalty club. We want to see which store has a larger distribution throughout the city, a bigger pull essentially. We run the Standard Distance tool, and again we use the case field (it comes up quite a bit), this time on the store. The tool creates two standard distance circles, and we immediately see that the Steiner bookstore has a significantly smaller distribution than the Bosworth bookstore. Interestingly, the circles don't overlap at all. An overlap would be a pretty interesting place, and if you were another bookstore, the gaps are interesting too, because a gap is a place with no service. So that's Standard Distance.

Next, Directional Distribution, or the standard deviational ellipse. This one is pretty cool: it also looks at that standard deviation, but it takes the directional trend of the data into account. Again we start with the mean center, but then we find the standard deviation along the x-axis and along the y-axis, which gives the ellipse its tilt, and that becomes our standard deviational ellipse.

This is actually one of my favorite tools, which I know sounds crazy, but it's just really powerful. Whenever I have a point dataset, especially one with any temporal aspect to it, I almost immediately run this tool just to see what trends might be in the data. This is Ushahidi data in Libya. Ushahidi is a crowdsourcing application that is typically stood up during times of crisis, and these points are essentially requests via SMS for humanitarian assistance. One thing we could do with this data is just turn it on and off over time, but I think we can do a lot more than that, and the ellipses help us do it. We run the ellipse tool, and again we use the case field, but it's a little different this time: it's not a type, it's time. If we look at the data, I've grouped together the first week's worth of data, the second week's, the third week's, and so on, each with a unique number. When we run the analysis it creates a series of ellipses, and in a similar way to the hurricane demo I symbolize them by time period: the darker green ellipses are from the earliest weeks and the most recent is in red. Pretty quickly we get a clear picture that the trend has been changing dramatically over time.
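Minimal sketches of both demos, with hypothetical layer and field names (a Customers layer with a STORE field; a Reports layer with a WEEK number field):

    import arcpy

    arcpy.env.workspace = r"C:\data\Demos.gdb"

    # One standard distance circle per bookstore (case field = STORE).
    arcpy.StandardDistance_stats("Customers", "StoreReach",
                                 "1_STANDARD_DEVIATION", "#", "STORE")

    # One standard deviational ellipse per week of crisis reports.
    arcpy.DirectionalDistribution_stats("Reports", "WeeklyEllipses",
                                        "1_STANDARD_DEVIATION", "#", "WEEK")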
What I think is really powerful about a lot of these tools is that they can be a critical part of the data exploration process. You get a new dataset and you just want to understand some things about it, so these tools can actually guide your questions. In the beginning our question was "what's going on?", and now our questions are "why has the pattern shifted? What caused the shrinking of the distribution, and its movement?" Rather than just "what's going on," we have very clear questions driven by this introductory, descriptive, exploratory analysis. And that concludes the Measuring Geographic Distributions toolset: one down, three to go.

Next up is Analyzing Patterns, and I'd like to say they get more exciting as we go. In the Analyzing Patterns toolset we have lots of different ways to measure whether our data is clustered or dispersed, to quantify the degree to which our data has patterns.

The first tool is Average Nearest Neighbor. We create an index that tells us whether our data exhibits clustering by looking at the nearest feature to each feature. We measure the distance between each feature and its nearest neighbor, take the average of all those distances, and compare that to the expected average distance under a random distribution. That gives us the average nearest neighbor ratio: observed divided by expected. If the ratio is less than 1, the data is clustered; if it's greater than 1, the data is dispersed.

We have demos for all of these, and up until now every demo I've given was created by Jenora, who's sitting in the back. She's the best, she helped us out a lot, and she learned a little about the spatial stats tools along the way, so it wasn't a total loss.

This example looks at tornadoes in the Great Plains region. Of course we know tornadoes cluster: they are obviously much more likely in this region than in Connecticut or California. But what about within this region? Would we expect clustering here, is there clustering, and can we explore that? We'll use the Average Nearest Neighbor tool: I run it on all of those tornadoes and have it generate a report. These reports are pretty cool, and we also output a lot of information in the messages window. A little food for thought while this runs: how many of you leave background geoprocessing on and never look at the messages window? I love the messages window, and the spatial statistics tools rely on it: we put a lot of output there, and for some of our tools it is the only output. This is a great example of one: the output is a z-score and a p-value. The good news is we make it really easy; you don't even have to know what a z-score and p-value mean, because we can look at our results.
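A minimal sketch of the tornado demo, assuming a hypothetical Tornadoes touchdown layer; the statistics ride along on the tool's result object:

    import arcpy

    arcpy.env.workspace = r"C:\data\GreatPlains.gdb"

    result = arcpy.AverageNearestNeighbor_stats("Tornadoes",
                                                "EUCLIDEAN_DISTANCE",
                                                "GENERATE_REPORT")

    # Derived output order per the tool help: ratio, z-score, p-value.
    print("Nearest neighbor ratio: " + result.getOutput(0))
    print("z-score: " + result.getOutput(1))  # large negative = clustered
    print("p-value: " + result.getOutput(2))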
Quick show of hands: how many people use the results window? Not everyone raised their hand, so I'm about to blow some minds. Seriously, I don't know how I would live my life without the results window. Under the Geoprocessing menu there's a Results option. It stores a record every time you run a geoprocessing tool, which lets you rerun tools; you can also drag and drop results into ModelBuilder, and all the inputs and outputs come along, which saves a lot of time. One really great thing is that the results window is stored with your map document. You come back to a project six months later, and I'm sure this never happens to you, but it happens to me all the time: I have no idea what I did six months ago, I'm not sure what I did yesterday. That list is still there and I can follow my whole workflow. Often I'll delete the entries that aren't relevant when I'm done, so I keep just the ones I want to remember, and I use it as a really great way to save a workflow. I know that's a step away from what I was supposed to talk about, but it's important stuff. It works with map packages as well: we can send each other a map package and see the analysis that was run in it, so it's really good for collaboration too.

One of the other things you can do from the results window: when a tool creates a report or a PDF, as some of our tools do, you can open it right from there. This report shows us that given the z-score of negative 114 (and that is a very big z-score), there is a less than 1% chance that this clustered pattern could be the result of random chance. That tells us we have statistically significant clustering in those tornado touchdowns in the Great Plains. We give you the sentence, we give you a picture, we give you the numbers: something for all kinds of learners and thinkers. So that's Average Nearest Neighbor, with a little side trip into the results and messages windows. Oh, and for those of you who don't know, under Geoprocessing Options there are two useful settings: you can decide how long to keep your geoprocessing results (I always set it to never delete), and this is also where you can enable or disable background geoprocessing. If you have it on and didn't know how to turn it off, it's in those options, and then the messages window will be there.

Next up: Spatial Autocorrelation. This is another way of measuring whether your data is clustered or dispersed, but with this method we look at the correlation between the distances of features in space and the differences in their values. Do we have two features with high values near each other, or two features with low values near each other? Those pairs have a short distance in data space between them. If there's a positive correlation, meaning that as features get farther apart in space their values also get farther apart, then we have clustering. If it's the inverse, values getting more similar as features get farther apart, then we have a dispersed pattern. This tool helps us measure that.
For this one we're going back to California population, because I think it's a good example of how we can use spatial autocorrelation, and because we really use it in two ways. This question of "is it random, dispersed, or clustered" can seem like a silly question in some ways, because we know things are rarely random: people cluster together, crimes cluster together, income clusters together, disease clusters together. So while it is useful, as with the tornadoes, to ask "is there clustering?" to guide our decision-making and our questions, these global statistics are often most useful from a comparison standpoint. In this case we'll use the Spatial Autocorrelation tool to see whether clustering has been increasing or decreasing over time. We also use the Spatial Autocorrelation tool a lot when we do regression analysis. We have a whole session on regression, so if you're interested we'll point you to it, but one of the requirements of regression analysis is that the residuals are not clustered: we can't have clustered over- and under-predictions. So we use the Spatial Autocorrelation tool to test whether we have clustering, because it's a requirement of the method. Those are probably the two ways I use it most: comparing things over time or across distributions, and regression diagnostics.

So I run the Spatial Autocorrelation tool on the data from 1900, using my population dataset, and generate a report. Right away it tells us the z-score and p-value, and I can tell just by looking that the very small p-value means we have clustering. We can also open the report, and again we get the pictures and the words. Now, I've actually run this analysis ten or eleven times, once for each decade, and afterward I created a little table with each year and its z-score. It's that table I'll use to look at the pattern of clustering over time. We create a line graph of year versus z-score, and it's really interesting: over time the intensity of clustering was decreasing, and then between about 1930 and 1940 it starts to pick up again and the intensity of clustering increases. Things were dispersing up until the '30s, and starting in the '40s they come together and cluster again, which makes a lot of sense given the movement into cities we know was happening. Imagine trying to see that from thematic maps, just changing colors: it would be really hard to spot a change in the intensity of clustering, but with the graph it's really easy. So that is Spatial Autocorrelation.
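A minimal sketch of the decade comparison, assuming a hypothetical Counties layer with one population field per census (POP1900 through POP2010):

    import arcpy

    arcpy.env.workspace = r"C:\data\California.gdb"

    # Run Global Moran's I once per decade and collect the z-scores,
    # which can then be tabled and graphed as in the demo.
    for year in range(1900, 2020, 10):
        field = "POP{0}".format(year)
        result = arcpy.SpatialAutocorrelation_stats(
            "Counties", field, "NO_REPORT",
            "INVERSE_DISTANCE", "EUCLIDEAN_DISTANCE", "ROW")
        # Derived output order per the tool help: index, z-score, p-value.
        print("{0}: z = {1}".format(year, result.getOutput(1)))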
You're one step ahead of us: Incremental Spatial Autocorrelation is next. This tool is really just a series of Spatial Autocorrelation runs put into a graph for you. Lauren just created a separate z-score for each year; what Incremental Spatial Autocorrelation does is create a graph of z-scores over different distances. It's actually helping you decide the scale of your analysis. For some of the other tools we have to choose what we call a neighborhood size, we have to conceptualize spatial relationships, and this tool helps us do that: it runs the statistic a bunch of times and outputs a graph that identifies the peak distance, the distance exhibiting the most intense clustering. That number can be really valuable in some of our other tools, and we'll touch back on it in a bit. We don't really need to demo this one, because it's the same tool run over and over, with a nice graph at the end.

This is something I used to do a lot by hand, and I have the benefit of working on the spatial statistics team and sitting next to the developer. I said, you know, I'm doing this a whole lot, and our users are probably doing it a lot too, and I really don't want to run this ten times every time I need to figure out a good distance band; could you build a tool that does it for me and outputs a pretty graphic? And he did, which was really nice of him, because it makes a big difference, and now all of you get to take advantage of it too.

Next up is High/Low Clustering. Again, this is a global measure of whether the data exhibits clustering, but here we're not asking whether the data is clustered versus dispersed; we're asking whether we find clusters of high values or clusters of low values. Where before I described the correlation between distance in space and distance in value, here we take the product of the two values: if you get a really large number, the data exhibits clusters of high values, and if you get a really small number, it exhibits clusters of low values. Again, this is most useful for comparisons. All of these global statistics shine when you compare one year to another, or one place to another, to see which has the most intense clustering; this one measures the intensity of high-value clusters versus low-value clusters.
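A minimal sketch of Incremental Spatial Autocorrelation, reusing the hypothetical Counties/POP2010 data from above; the PDF report graphs z-scores over the distance bands so you can read off the peak distance:

    import arcpy

    arcpy.env.workspace = r"C:\data\California.gdb"

    # Ten distance bands, tool defaults ("#") for start distance and
    # increment; the first z-score peak is a good candidate neighborhood.
    arcpy.IncrementalSpatialAutocorrelation_stats(
        "Counties", "POP2010", 10, "#", "#", "#", "#", "#",
        "isa_report.pdf")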
The next tool in the toolset is Multi-Distance Spatial Cluster Analysis, or Ripley's K function. Ripley's K is, I think, one of the only tools where we look at clustering of the locations themselves. With all the other tools, each point had to have a value: to find clusters of high and low values, each point needs a value associated with it. For Ripley's K we're just looking at where the points are, measuring whether one feature being in a location attracts other features to that place or repels them. We're again measuring clustering and dispersion, but as a kind of spatial dependency: does my being right here attract other people to be near me, or does it push them farther away? We get a graph output for this one too, comparing the observed pattern to an expected distribution, and it works across different distances, so it measures whether we have clustering or dispersion at different scales. The tool steps through a range of scales and draws the line we would expect under randomness, with a confidence envelope, and then we compare the observed line against it. A line like the one on the slide, above the envelope at short distances and below it at long distances, means we have clustering at the shorter distances and dispersion at the farther ones.

Okay, that's another toolset wrapped up. Next is Mapping Clusters, which is one of our favorites. With these tools we're again looking at clustering, but now it's local clustering. First, though, the Similarity Search tool. This tool identifies the candidate features that are most similar to a feature of interest, based on a set of attributes. Let's look at a hypothetical example. Pretend we have a point dataset of potential store locations; I want to open a new store and can choose any of them. I also have a high-performing store that I know a little something about. For each potential location I know the population density, the average income, and the distance to competition, and I have that same exact information for my high-performing store. What I want to do is rank the potential store locations by how similar they are to the high-performing store based on those three variables. Depending on how many results I ask for, in this case five, the tool gives me the five locations with the most similar attributes, ranked, as an output.

That tool exists both in desktop and in ArcGIS Online, and I'm going to show it in ArcGIS Online, because I think it's really awesome that these tools are being exposed to so many more people now, and because it's amazing what you can do with something as simple as similarity search, especially when you combine it with geoenrichment. I have all these bike-share locations, and I have one location that I know has been doing really well. I want to figure out what other places we should expect to do well based on the characteristics of that location. The first thing I do is enrich those locations: Perform Analysis lets me enrich my data, and once the data is enriched we have the attributes we need to run the similarity search. There are thousands of variables to choose from; if we open the table, you'll see I chose data on age, population, per capita income, and education, a few variables I thought would be important or interesting with respect to the usage of these bike-share locations. The Find Similar tool lets me first choose the location I'm interested in, my location of interest; then I search among the enriched bike-share locations, choose the attributes that matter to me, and either rank all locations from most to least similar or ask for just the single most similar. In this case I rank them all and run the analysis, and the result is that this station right here is the number one most similar. We get the information about those variables, and most importantly we get the ranked locations and a picture of their similarity.
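A minimal desktop sketch of the store example above (the bike-share demo ran in ArcGIS Online); TopStore, CandidateSites, and the three field names are hypothetical:

    import arcpy

    arcpy.env.workspace = r"C:\data\Retail.gdb"

    # Rank the five candidate sites whose attribute values are closest
    # to those of the high-performing store.
    arcpy.SimilaritySearch_stats(
        "TopStore", "CandidateSites", "RankedSites",
        "NO_COLLAPSE", "MOST_SIMILAR", "ATTRIBUTE_VALUES", 5,
        "POP_DENSITY;AVG_INCOME;DIST_COMP")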
There are a ton of applications for this. One of the big use cases we had in mind when building the tool was a local city or county that, from an economic development standpoint, wants to say "I'm just like these other cities, but here's how I'm better." It can be really great for that kind of benchmarking: I'm similar in terms of these characteristics, and here are all the ways I'm different. There are also a ton of commercial applications for the Similarity Search tool.

Alright, Grouping Analysis. Grouping Analysis lets us find patterns in our data across multiple variables. We group our features so that within each group the features are as similar as possible, and between groups they are as different as possible. There are two different ways to run a grouping analysis. We can ask for groups without requiring them to be spatially contiguous, with no spatial constraint; in that case the tool runs a k-means analysis. Say we have these points, and this is what they look like in data space, arranged by value rather than location: if we ask for two groups it splits one way, for three groups another way, or four. It's finding the natural places to split the data into groups based on their attributes, finding the regions of data space that are most similar. If we choose a spatial constraint, it's a little more complicated, because features also have to be near each other to be in the same group. The tool finds seed locations based on how many groups you want, conducts a minimum spanning tree calculation, and then assigns features to groups, weighing how different they are in data space while requiring them to be proximate in physical space. To interpret the results of a grouping analysis we look at a box plot, where we can see how each group compares to the others on the variables you put into the tool.

This is a cool one, and it definitely makes a lot more sense once you see it in action. For those of you out there who are big nerds like we are, this is one of the most fun tools to play around with, because you learn a lot about your data: you can explore the relationships that exist in it and how things group together. I could spend all day running this on my data, and I'm guessing there's at least one other person in this room who feels the same way.
This is data about vulnerability in Africa from the CCAPS program. What they've done is create what they call baskets, which are different ways to think about vulnerability. One of them is about governance: if something catastrophic were to happen, what is the likelihood that the government would be able, or willing, to help the population in that area? It's also a measure of transparency, corruption, that sort of thing. That's the governance index. Another is about household resiliency: if something catastrophic were to happen, what is the likelihood that individual households could feed themselves? That has a lot to do with income levels, education levels, employment, all those sorts of things. There's also one about population density, and another about vulnerability to climate change. So there are four variables, and you can imagine each one as a field in this dataset. I can turn on, say, the household resiliency index and map it, then switch to governance and map that. We can map all four of these variables, and you can imagine that if this were demographic data we could create a bunch of thematic maps and try to correlate them in our heads, do the grouping mentally, but that would get pretty old pretty fast. The Grouping Analysis tool helps us make sense of all that data.

So we're going to analyze the vulnerability data. We'll create four groups, because I happen to know that four is the best number, and I got that by evaluating the optimal number of groups: there's an option that helps you figure it out. Sometimes you just have a number you need (I need ten sales territories, so obviously you put in ten), but if you're trying to find the natural groupings in your data, this option finds them by evaluating the data. We'll use population density, the climate index, household resiliency, and the governance index, with no spatial constraint. And what's really cool about no spatial constraint is that even when you don't force the result to be spatial, it usually is. That's no surprise to any of us, because everything is spatial: it's the first law of geography, Tobler's law, that near things are more related than distant things. So it's no surprise that even when we don't force the groups to be spatially contiguous, spatial patterns emerge, and it's often the spatial patterns that emerge from this analysis that are the most interesting.
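A minimal sketch of the vulnerability demo, with hypothetical layer, ID, and index field names (the tool wants a permanent unique integer ID field, here called MYID):

    import arcpy

    arcpy.env.workspace = r"C:\data\Africa.gdb"

    # Four groups, no spatial constraint (plain k-means in data space),
    # a PDF report containing the parallel box plot, and the
    # optimal-number-of-groups evaluation turned on.
    arcpy.GroupingAnalysis_stats(
        "Vulnerability", "MYID", "VulnerabilityGroups", 4,
        "POP_DENSITY;CLIMATE_IDX;HH_RESILIENCE;GOVERNANCE",
        "NO_SPATIAL_CONSTRAINT", "#", "#", "#", "#", "#",
        "groups_report.pdf", "EVALUATE")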
We get a lot of information from this analysis in its report. Looking at the map, the patterns are obvious immediately, but knowing there's a green group isn't particularly useful on its own. What is the green group? What's the orange group? We have no idea. That's what the report is all about: it's only when we combine the report with the map that we really start to understand the results. We'll talk a lot more about this report in our session this afternoon, but I'll focus on the parallel box plot. For each variable, the black box plot represents the global average, and each colored line represents where a group falls on that variable; for these indices, the higher the value, the more vulnerable. So we can see that the green group is about average in its vulnerability to climate, very vulnerable in terms of household resiliency, very vulnerable in terms of governance, and low in vulnerability in terms of population density. Or we can look at the red areas: high vulnerability on the climate index, low on household resiliency and governance, and about average on population density. So we can see how each of these groups is vulnerable in different ways. Rather than just saying these are the most vulnerable areas, period, we can say these areas are vulnerable based on these characteristics, and those areas have those vulnerabilities, and we can use that information to understand the underlying characteristics. You can also see there are distinct regimes here: the orange group up here is quite spatially contiguous, yet similar areas show up in other parts of Africa. So while we see spatial patterns, we also see that very distant places can share similar vulnerabilities. We get a lot of information just from exploring the spatial patterns that come out of an analysis that didn't constrain the analysis spatially at all. So that's Grouping Analysis.

Next up is Hot Spot Analysis. How many of you have run a hot spot analysis before? Yeah, it's usually the most popular tool, so we'll spend a little more time on it, though I know we're still pretty tight on time. Hot Spot Analysis identifies clusters of high and low values. It's a local statistic: it doesn't just tell us whether we have clustering of high or low values, it tells us where those clusters are. I'd like to talk a little about how this works, because I think some of us might be using the tool without really understanding what it's doing.

Let's take a look at forty-five of my favorite random numbers (I'm kidding, they're actually digits of pi). Say I pick them up, shake them, and drop them. You don't need to understand any probability theory to feel, intuitively, that it would be really unlikely for all of the high numbers to happen to fall together right here. It's obvious that that's not random. But it's not always that obvious, and what a hot spot analysis does is measure the likelihood of that happening: what are the chances this arrangement happened randomly?

Let's take this to a map, and first let's get clear on a couple of terms. Each of these polygons we'll call a feature. Each feature has to have a value, because again we're measuring clusters of high values and low values, and we also need variance between those values: we can't have clusters of values if all the values are the same. So each feature has a value, and there's variance between the values. The next term is neighborhood: around each feature, and including the feature itself, there's a set of features, and we call that the neighborhood. Then all of the features together, including the neighborhood, we call the study area. What a hot spot analysis asks is: is this neighborhood, derived from the original feature, statistically significantly different from the study area?
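For reference (this is the standard form from the literature, not something stated in the talk): the statistic Hot Spot Analysis computes for each feature is the Getis-Ord Gi*, which formalizes exactly that neighborhood-versus-study-area question:

    G_i^* = \frac{\sum_{j=1}^{n} w_{ij} x_j - \bar{X} \sum_{j=1}^{n} w_{ij}}
                 {S \sqrt{\left[ n \sum_{j=1}^{n} w_{ij}^2 -
                   \left( \sum_{j=1}^{n} w_{ij} \right)^2 \right] / (n-1)}}

where x_j is the value of feature j, w_{ij} is the spatial weight between features i and j (the neighborhood, with j including i itself), n is the number of features, and \bar{X} and S are the global mean and standard deviation. The result is a z-score: significantly positive for hot spots, significantly negative for cold spots.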
If the neighborhood is significantly higher than the study area, then the feature the neighborhood was based on is marked as a hot spot, and we have three degrees of confidence: you can be 90, 95, or 99 percent confident that the neighborhood is a cluster of high values, meaning that feature belongs to a cluster. Likewise we can be 90, 95, or 99 percent confident that a feature is a cold spot, belonging to a cluster of low values. And if the neighborhood is not significantly different from the rest of the study area, in other words it looks random, we mark it as not significant. The tool goes feature by feature: what's your neighborhood, is it different from the study area? Yes? Okay, you're a hot spot. Next feature, same question, and so on through all the features, until we have a hot spot map. So let's run a hot spot analysis.

I'm going to show you two examples. The first looks at obesity rates in Los Angeles County. This was the data from my thesis, so I hate it with a passion, but I'm going to show it to you anyway. We'll use Optimized Hot Spot Analysis, a tool that runs hot spot analysis under the hood and is optimized to make it really easy to run. In the session this afternoon we'll go into even more depth than Flora did about hot spot analysis, the decisions you need to make, and what Optimized Hot Spot Analysis is really doing under the hood; right now you'll just have to take my word that it's awesome and does a really good hot spot analysis. We run the analysis and it creates the map for us. (Oh, I ran it on the wrong variable; thank goodness for the results window. Percent 5B is the variable that relates to childhood obesity.) This is the percentage of fifth graders in each of these school zones who were overweight or obese; in some places it's over 50 percent. The map shows those areas, and the colors show where we have more confidence and where we have less.

We actually built Optimized Hot Spot Analysis with ArcGIS Online in mind, because we really wanted to get hot spot analysis into the hands of a lot more people. How many of you have run the Getis-Ord Gi* hot spot tool and accepted all the defaults? Fine, you don't have to raise your hands, but I know it's true. That's why we built Optimized Hot Spot Analysis. The thing about building tools is that once we build one, a model you create with it has to keep working essentially forever; that's our goal, and we can't go willy-nilly changing what the output will be, so the defaults are what they are. With Optimized Hot Spot Analysis we essentially created a new set of defaults, defaults based on interrogating your data, evaluating it, and making decisions based on that data.
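A minimal sketch of the obesity demo above, with hypothetical names (a SchoolZones polygon layer and a PCT_OBESE analysis field):

    import arcpy

    arcpy.env.workspace = r"C:\data\LACounty.gdb"

    # The optimized tool interrogates the data itself: it picks the
    # neighborhood distance and adjusts for multiple testing, so only
    # the input, output, and analysis field are needed here.
    arcpy.OptimizedHotSpotAnalysis_stats("SchoolZones", "ObesityHotSpots",
                                         "PCT_OBESE")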
Here's an example of using those new defaults: graffiti incidents in New York City. We go to Perform Analysis, then Analyze Patterns, and Find Hot Spots. If we have no analysis field, we have options: we can provide our own polygons (here we could use police precincts, for instance, or beats), or we can provide a bounding polygon that says where these incidents could have happened. Of course you could provide an analysis field as well, but in this case I just let it create the grid for me. We can see the result of the analysis, and if I click on one of these locations I can see, for instance, that the lighter pink is a hot spot at 90 percent confidence. So now you can do this kind of analysis right inside ArcGIS Online, and really get it into the hands of people who don't know the ins and outs of hot spot analysis but can still do a valid one.

Alright, last in this toolset is Cluster and Outlier Analysis. Cluster and Outlier is similar to Hot Spot Analysis in that it finds clusters of high values and clusters of low values, but it also identifies outliers. Again we separate the feature and its neighborhood from the study area, but now, instead of just asking whether the neighborhood differs from the study area, we ask: is the feature different from all other features, and is the neighborhood different from all other neighborhoods? That gives four possible answers. If the feature is significantly higher than all other features and the neighborhood is significantly higher than all other neighborhoods, we have a high-high cluster. If the opposite is true, feature lower and neighborhood lower, we have a low-low cluster. But we can also have a situation where the feature is significantly higher than all other features while its neighborhood is significantly lower; in that case the feature is a high-low outlier. And the inverse, a low feature in a high neighborhood, is a low-high outlier. A cluster and outlier map might look something like this: the lighter pink areas are the high-highs, the lighter blue the low-lows, the bright reds the high-lows, and the dark blues the low-highs. If I graph the results of this cluster and outlier analysis, values versus z-scores, we can see that where the z-scores are significantly low we have outliers, and where they're significantly high we have clusters, and I can draw that on the map: high-highs, low-lows, high-lows, and low-highs, each value falling somewhere in a quadrant. But a value has to be significantly high or significantly low enough: you'll notice the gray features in the middle that were not different enough to be marked statistically significant.
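A minimal sketch of the tool call demonstrated next: Cluster and Outlier Analysis (Anselin Local Moran's I) on the same hypothetical obesity data, with a made-up neighborhood distance of the kind reported in the optimized hot spot messages:

    import arcpy

    arcpy.env.workspace = r"C:\data\LACounty.gdb"

    # Fixed distance band: features within 8000 units of each other
    # (a hypothetical value) count as neighbors.
    arcpy.ClustersOutliers_stats(
        "SchoolZones", "PCT_OBESE", "ObesityOutliers",
        "FIXED_DISTANCE_BAND", "EUCLIDEAN_DISTANCE", "ROW", 8000)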
All right, so we're going to use that same dreaded Los Angeles data to do a cluster and outlier analysis. One of the things we sometimes use Optimized Hot Spot Analysis for is something it reports in the messages window: the tool does a bunch of evaluation (we'll talk about that this afternoon), including evaluating the optimal distance to use for the neighborhood in our hot spot analysis. That's for when you don't have a distance in mind, one that makes sense for the question you're asking, and you just want the tool to help you start exploring what an ideal one might be. We'll talk about that more later, but we're going to use that distance when we run the cluster and outlier analysis. So I'm going to run the analysis on that same data, using the right variable this time, and we're going to use a distance band; this is where we define what it means to be neighbors, how we define that neighborhood. We run the analysis and it returns this map. What we see here is a pattern similar to what the hot spot analysis showed us, but now we can see that within that hot spot there are actually clusters of low values surrounded by high values, and those can be incredibly important. Especially if you think about something like obesity: how did these areas manage to have such low rates of obesity while surrounded by places that are, in theory, quite similar to them? They're close by; what are the things they're doing right that we could try to mimic in other places? Or we can look at the high-low outliers: what are the things they're doing wrong, how are they failing when others around them are doing so well? Those outliers can really drive a lot of new questions, and they can help us, especially when we move into exploring relationships. If we know this one is doing well and these aren't, maybe I'm missing a variable, maybe I haven't taken something into consideration, and the outliers can help us find some of the variables that might be really critical.
All right, that was the end of the Mapping Clusters toolset, and this afternoon we're going to spend a whole 75 minutes talking about just that toolset and those four tools. So let's wrap this up with Modeling Spatial Relationships. Okay, this toolset has some regression analysis tools and a couple of other tools that help us define the relationships between features. We'll start with the Generate Spatial Weights Matrix tool. This tool creates an SWM ("swim") file, and basically it helps us answer the question of how the neighborhood is determined. If you remember, when I talked about hot spot analysis and cluster and outlier analysis, I said we were looking at a neighborhood, and you might have been wondering: how do I know what the neighborhood is? There are different ways to do this. The Incremental Spatial Autocorrelation tool we talked about earlier is actually how the Optimized Hot Spot Analysis tool picks its neighborhood: it looks for a peak distance, and that becomes the neighborhood. But you have more flexibility with the Generate Spatial Weights Matrix tool; you can define the neighborhood however you'd like. There are lots of ways to do it, and I'm just going to talk about one of my favorites, K nearest neighbors. With K nearest neighbors, K stands for the number of neighbors, so let's say I'm looking at four nearest neighbors: in this example, each feature gets a neighborhood composed of its four nearest neighbors. It doesn't matter how far away those four neighbors are; it's whichever ones are nearest. Each feature's neighborhood is defined that way. The next tool is similar: Generate Network Spatial Weights. This tool also creates an SWM file, but in this case we're looking at distances along a network. Let's say we want to use a drive time instead of a Euclidean distance, because we want to see how the relationships between these features vary when people are traveling from one place to another. Each feature's neighborhood becomes whatever falls within that drive time, and that's what decides the neighborhood for those other analyses, like hot spot analysis and cluster and outlier analysis.
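A minimal arcpy sketch of that workflow might look like the following: build a K-nearest-neighbors SWM file, then hand it to Cluster and Outlier Analysis instead of a fixed distance band. The dataset, ID field, and output names are hypothetical, and keyword and option strings follow the tool documentation but can differ slightly across releases, so check before running.

```python
import arcpy

arcpy.env.workspace = r"C:\data\la_obesity.gdb"  # hypothetical

# Build a spatial weights matrix where each feature's neighborhood is its
# 4 nearest neighbors, however far away they happen to be.
arcpy.stats.GenerateSpatialWeightsMatrix(
    "school_zones",                # input feature class (hypothetical)
    "ZONE_ID",                     # unique ID field (hypothetical)
    r"C:\data\school_zones.swm",   # output .swm file
    "K_NEAREST_NEIGHBORS",         # conceptualization of spatial relationships
    Number_of_Neighbors=4)

# Cluster and Outlier Analysis (Anselin Local Moran's I) using that SWM
# file instead of a fixed distance band.
arcpy.stats.ClustersOutliers(
    "school_zones",
    "PCT_5B_OBESE",                # analysis field (hypothetical)
    "school_zones_outliers",       # output feature class
    "GET_SPATIAL_WEIGHTS_FROM_FILE",
    Weights_Matrix_File=r"C:\data\school_zones.swm")
```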
Next is Ordinary Least Squares. This is a regression tool, a linear regression tool, and what it helps us do is estimate the relationships between different variables. I'm going to go through these last three slides really quickly, and then Lauren is going to talk about the tools all at once, since they're very closely related, so that we're not jumping back and forth. We also have an Exploratory Regression tool. This tool helps us find a properly specified ordinary least squares model: it tries lots of different combinations of all the different variables to help you find a properly specified model. It's an iterative process; it tries combinations of anywhere from one to five variables and gives you a lot of really useful outputs, like how often each variable is significant in your analysis. It helps you look at correlations, and it might also help you find a properly specified model. We have a whole other workshop on regression analysis and properly specified models, so I'm not going to teach you how all of that works right now, but please come back if you'd like to learn more; it will be the most fun you'll ever have learning about regression analysis. I think that's really true, I promise. And lastly, Geographically Weighted Regression. Again, it's a linear regression model, but this time we're looking at neighborhoods: each feature gets its own equation based on its neighborhood, so we can actually see how the model performs over space. We can see how the local R-squared varies over space and where the correlations are strongest, and that might help us focus our efforts when we try to address the issue at hand. So let's take a look at some regression results.
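Before the demo, as a rough illustration of what the first two of those tools look like when scripted, consider the sketch below. The field names (SPEND_PER_CAP, IMAGING_EVENTS, and so on) and paths are hypothetical, and exact parameter lists vary by release, so this shows the pattern rather than a drop-in script.

```python
import arcpy

arcpy.env.workspace = r"C:\data\medicare.gdb"  # hypothetical

# Ordinary Least Squares: one global equation relating spending to a
# handful of explanatory variables.
arcpy.stats.OrdinaryLeastSquares(
    "hsa_regions",                          # input features (hypothetical)
    "REGION_ID",                            # unique ID field
    "hsa_ols_results",                      # output feature class
    "SPEND_PER_CAP",                        # dependent variable
    "IMAGING_EVENTS;DEHYD_ADMITS;PCP_RATE") # explanatory variables

# Exploratory Regression: tries many combinations of candidate variables
# and reports which ones pass the checks for a properly specified model.
arcpy.stats.ExploratoryRegression(
    "hsa_regions",
    "SPEND_PER_CAP",
    "IMAGING_EVENTS;DEHYD_ADMITS;PCP_RATE;POVERTY;UNINSURED",
    Output_Report_File=r"C:\data\explore_report.txt")
```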
All right, this is our last demo, and since we have three minutes, I think we planned perfectly. I know, it's amazing, it's a miracle. We also don't have a session immediately after, so we can totally take questions, and we're going to be around, and we have a lot more sessions; next year we'll leave ten minutes, but I think it's a real miracle we got this far. So, Medicare spending. This is actually a sneak peek at a topic we'll be looking at in both of our regression sessions, and this is just a quick little story-map kind of thing that helps us walk through the analysis. What I really want to show is this: this is the Medicare spending data. In a perfect world, we'd expect to spend about the same amount per capita on every person on Medicare, a little more on one person, a little less on another, but we wouldn't expect whole regions of the country where we're spending a lot more per capita. And this takes into account things like cost-of-living differences and the age of the population; it's adjusted for that. We would expect randomness, but of course, I'm sure it comes as no surprise to anybody in this room, it's not random at all, and there are clusters where we're spending a lot more per capita.
So our question is: why? Regression analysis helps us answer the why questions that a lot of the Mapping Clusters tools bring up. We see the pattern, but it's not as if my research on childhood obesity ended with "okay, now we know where all the obese children are, my job is done." We want to know where those patterns are so we can do something about them. So our first question, naively, was: maybe we're just spending more money where people are sicker. We did a hot spot analysis of where people are sicker, and obviously those things are not related. We also thought maybe people are just getting better care in those places where we're spending more per capita; that's obviously not true either. We looked at that, and it also didn't really correlate; actually, we saw that people were getting better care in the places where we were spending less per capita, which might come as no surprise to some of you. So we used Exploratory Regression to explore the factors related to spending in each of these areas; Exploratory Regression lets us explore those correlations. One example: in this part of the southern United States, we found that imaging events were a really good predictor of spending. That means the more times people are getting CAT scans and MRIs, the more spending there is, and there's a lot of research on the idea that once a doctor has just gotten a new MRI machine, everyone suddenly seems to need an MRI. No offense to the doctors in the room; I'm sure none of you do that. But these are the kinds of things we can actually do something about. If it were simply that people are sicker here, there's not a whole lot we could do; we could try, but we might just end up spending more money. It's things like the way medicine is being practiced that we can really change. Dehydration admissions are another example: policies educating people about the importance of hydrating elderly patients, because dehydration is a big reason a lot of elderly patients end up in the hospital, so we spend money on something totally avoidable. So we did this analysis, and finally we found a properly specified model in this area, and we used Geographically Weighted Regression to see where each of these variables was most significant. We can see, for instance, that the darkest area here is where imaging events are a good predictor of Medicare spending. We're not going to get into how Geographically Weighted Regression works right now, but the idea is that we can focus our resources based on where each variable is important. If we know dehydration admissions are a good predictor in this area, we can focus those educational policies here; if we know MRIs are a good predictor over there, we can focus those policies there. In a perfect world we could put every policy everywhere, because we'd have unlimited resources, but resources are never unlimited, and being able to focus them in the places where we'll get the most bang for our buck is what GWR is all about. We're going to talk a lot more about that in the next sessions.
All right, thank you so much. Please fill out a session survey; we'd really like your feedback, we really appreciate you letting us know how we did, and tell us what other courses you'd like as well. [Applause]
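To round out that Medicare walk-through as a geoprocessing step, a hedged sketch of the final GWR run might look like this. As before, the dataset and field names are hypothetical stand-ins, and the parameter order follows the tool's documented signature, which has changed across releases, so verify against your version's documentation.

```python
import arcpy

arcpy.env.workspace = r"C:\data\medicare.gdb"  # hypothetical

# Geographically Weighted Regression: fits a separate local equation per
# feature, so coefficients and local R-squared vary over space and show
# where each predictor matters most.
arcpy.stats.GeographicallyWeightedRegression(
    "hsa_regions",                  # input features (hypothetical)
    "SPEND_PER_CAP",                # dependent variable
    "IMAGING_EVENTS;DEHYD_ADMITS",  # explanatory variables from the
                                    #   properly specified model
    "hsa_gwr_results",              # output feature class
    "ADAPTIVE",                     # kernel adapts neighborhood size to
                                    #   feature density
    "AICc")                         # bandwidth chosen by minimizing AICc
```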
Info
Channel: Esri Events
Views: 3,820
Rating: 5 out of 5
Keywords: Esri, ArcGIS, GIS, Esri Events, Esri 2014 UC Tech Session, Lauren Bennett, Flora Vale, Spatial Statistics, Simple Ways to Do More with Your Data
Id: 6LbN9cBFVyg
Length: 73min 53sec (4433 seconds)
Published: Fri Dec 29 2017