Week 1: Spatial Data, Spatial Analysis, Spatial Data Science

Captions
Okay, so as I mentioned the other day, I'll start with some jargon, some vocabulary, just so you know where I'm coming from. This is a lecture I give many, many times, in long form and short form, so you may have seen it if you looked at some of the things before. It is basically to set the stage as to why we bother with spatial data analysis in the social sciences; that's our point of departure. Then, what is this spatial analysis / spatial data science that we're going to explore this quarter; in other words, what are its different aspects? And then I'll close with some discussion of pitfalls in spatial analysis, the kinds of traps that you can easily fall into, and some examples.

Okay, so why spatial analysis in the social sciences? There are two dimensions of motivation: one is substantive, the other is practical. The substantive one comes from theoretical frameworks; the practical one comes from the data.

There's been a change in social science in general, and in particular in economics and in sociology, moving away from individual decision making in isolation to taking explicitly into account the interactions among individuals. So you do not necessarily make decisions in isolation, as in economics, as a utility maximizer subject to constraints; there are things like peer effects or copycatting. You do something because your neighbor does it, or you do something because your friends do it, and maybe if you were just sitting on an island all by yourself you wouldn't do it. These kinds of things are increasingly being taken into account in theoretical frameworks.

There is also the interconnection between social networks and spatial networks, or between social interaction in general and spatial interaction, which has a geographic imprint. In other words, is there a connection between the people you hang out with and the people you live near? In some cases there are very strong connections for certain types of actions. A classic example is parents of kids in kindergarten and elementary school:
because of the way public schools are organized in the U.S., you live in a catchment area, so all these people tend to live in similar neighborhoods, and they also interact through the kids, through school organizations and soccer practice and all this kind of stuff.

In economics there's also such a thing as spatial externalities. An externality is, in general, the phenomenon that your activities, your production or your consumption, also affect others' production or others' consumption. So if you're a chemical factory polluting and putting your pollutants in the river, and downstream there's a brewery, that brewery has higher costs because it has to clean the water more than it would if it had clean water to begin with. So the costs of that other factory are affected by your production levels. Again, you look at this not in isolation but taking the effects into account. And of course we'll be interested in particular in cases where these externalities have a spatial or geographic imprint, and we call these spatial spillovers.

In a policy context we'll be particularly interested in something called spatial multipliers. In essence that means that if you do something in a particular area, it has impacts beyond that area. Say you have a neighborhood improvement program; that would affect the crime rate in that neighborhood, but not only in that neighborhood, also in the adjoining neighborhoods. That is called a spatial multiplier. One of the things we won't do in this class, but in a spatial regression class which kind of builds on this one, is try to quantify this, to measure what the spatial multiplier effects are.

Okay, more theory. In sociology there's a paper, I'll refer to it later, by Andy Abbott, who's here. It's really about the Chicago School and historically how the Chicago School has thought about space and time, but he has a quote in there that all social facts are located. What that means is that you do not look at social phenomena in isolation, but you look at them as to how they are connected
to the physical reality in which they occur. There are things like spatial mismatch: for example, if all the jobs for your skill level are on the other side of the city, that is a mismatch. Or spatial disparities: disparities in income, disparities in educational level. As for mismatches, we're for example working on a project on the mismatch between the supply of social services and the demand for social services. Of course you can look at this in the aggregate, you just add up the numbers, but spatially, where are these services provided and where are the people that need them? You will hear me say the word "where" a lot. It's all about where. In fact, one big GIS company uses as a slogan "the science of where"; that's what they're about. Neighborhood effects, of course, are also very prominent in sociology.

There's another side to this; as I said, it's the practical side, and I tend to be a practical person. A lot of data are geolocated; in other words, we know where they occur and sometimes also when they occur. For example crimes: public crime reports tell you where it happens. There are increasingly sensor data; I'll show you some examples of that later on. Social media data: you can set this, and a lot of tweets have lat-long associated with them, so if you record these tweets and you follow them, you can see where somebody goes throughout the day. You can figure out broad movement streams of people, and there are some fascinating studies that use this information to actually have real-time information on commuting patterns, for example.

A little more technical is that oftentimes we study phenomena where there is a disconnect, a mismatch, between the scale of observation and the scale of the process. Typically, and this is changing, but typically in social science research we're limited to things like census data, and census data are recorded for administrative units, things like census tracts or counties. So oftentimes, out of convenience, social scientists equate these administrative units with the
units at which processes happen. For example, a census tract is not a neighborhood. I bet you don't even know what census tract you're in, but you do know what neighborhood you live in. So these census tracts are just arbitrary things. I mean, they're not totally arbitrary, but they're designed to count people; they're not designed to measure social processes. So if you study neighborhoods and you have data on census tracts, this is not going to be an exact one-to-one match. There will be spillovers between adjoining tracts, and that creates statistical problems.

Another example: in economics it is very common to treat the county as a labor market, and in some instances that's reasonable, but in many instances it's not. For example, most larger metropolitan areas, including Chicago, have several counties around them. In Chicago, Cook County is very large, but you have Kane County next to it, and a lot of people from Kane County commute into the city, so they are part of the same labor market. Again, there are imperfections in the alignment and delineation, and that creates statistical problems.

Then, more technical: in a lot of models we're actually measuring something, but there is a systematic error in it. For example, if you model a housing market, real estate markets, in applied economics there's something called a hedonic model. A hedonic model tries to explain the sales price of a house by the characteristics of the house: square footage, fireplace, garage, all that kind of stuff. What about neighborhood effects? You know they're there, but you can't necessarily measure them. In technical speak, in regression analysis, these are going to end up in the error term, but they show systematic patterns across space. So again, this is a data issue that will affect your statistical methodology and the way you deal with this.

Another issue that's actually very common arises when you have sensors. For example, you have pollution or air quality
monitors; there are not that many of them. So if you want to get a sense of what the air quality is in places where there is no monitor, you have to interpolate. We'll talk about this later. For interpolation, think about forecasting. If you ask me what the GDP of the U.S. is going to be next year, I may be able to give you a reasonable forecast, a reasonable prediction. If you ask me what it is going to be in the year 2200, it's way too far out; I have no idea. It's the same with spatial interpolation: if you're very close to the point, you will have a fairly good interpolation; as you get further and further away, it gets worse and worse. This is systematic: it gets worse with distance. So again, that's something you have to take into account.

And then the final data issue is what's technically called the change of support problem. That's a very statistical term, but in essence it means: what happens if you have data at different scales and they don't match? For example, you study school performance and you have data on the income of the parents, and you're interested in finding out whether the performance of the students in school is somehow associated with the income of the parents. Well, you have a problem, because the school district is one area, and the income for the parents you might get from census data. They don't match. They don't even nest. Nesting is when the smaller units group nicely into the larger units. That's the case for the census data, but that's not the case for the school districts. There are all kinds of issues where this happens, where the areas do not overlap, and that creates statistical problems. Increasingly in social science, and also in physics by the way, these are taken into account.

This sets the stage. As I mentioned, the readings for the course are not at all required; they're just kind of background. These are some of the readings that I put in for today. This is the article by Andy Abbott, "Of Time and Space: The Contemporary Relevance of the Chicago School". If you're
in sociology, this is very interesting background. These are some things from a project. As I warned you last time, this is kind of my preaching, right; I've been doing this for 30 years. I was part of a big project to try to integrate operational spatial analysis into mainstream social science. That was a long-term project, there's a book associated with it, and the paper which is in essence the proposal for it is one of the readings.

A few years ago there was this article in Science about the spatial turn in health research. This is a parallel development, and I want to say something about that. In medicine there are like two strands of thinking. One strand is that everything is biology: you get sick because you have the wrong genes or you were exposed to something funky. There's another that is paying attention to the social determinants, the social and environmental determinants of health. They're not necessarily diametrically opposed, but there is some tension between these two, and the social and environmental field has been gaining ground. A lot of this is very spatial: you get sick because you breathe bad air, because you live close to polluting facilities, or you commute through areas that are very polluted. That's very spatial. So increasingly the CDC and the National Institutes of Health have been focusing their research on taking this spatial aspect into account.

And then this is ancient history, but it's something that made many of us very excited at the time. It was an Office of Management and Budget circular under the Obama administration on effective place-based policies. This was the first time that place and space were specifically and explicitly incorporated in policy documents that were guiding the budget of the U.S. As I said, that's ancient history now, but there hasn't been a budget for a while, so we won't go there.

Many of you might have seen this picture before. This is the, I should say, classic picture of the cholera
epidemic in 1854 London and the so-called Broad Street pump. If you go to London you can see the Broad Street pump; there's a little historical plaque. In my other class we talk about this in detail, but this is seen by many, even though there's some debate about this, as the poster child for spatial analysis, because it's about location, it's about where, it's about patterns, it's about structure in the spatial data. These are all the buzzwords that we'll see over and over.

Just for fun I updated this. You go to the Chicago Tribune website and you see this; it is updated on a monthly basis. These are the locations of homicides in Chicago, and this is all public record. You can look it up in a number of different ways, and we'll talk about that in the lab. This is again a similar thing. We'll see later how to do this right and how not to do this, but this suggests patterns, this suggests structure; this suggests that the location of homicides in Chicago is not a random event. If it were a random event, you wouldn't see this spot where Austin is; it would be all kind of spread around. Because it's not random, there is structure, and part of what we'll be doing in this class is learning how to detect structure, identify patterns, and know to some extent how likely it is that these patterns are for real rather than just an accident, that is, a type I error, as we'll see.

So what is spatial analysis? It happens over and over, even with colleagues of mine: the first thing they say when I mention that I started the Center for Spatial Data Science is "oh, that's GIS". Well, yes and no. That's why I made explicit that GIS is not a prerequisite for this course, because it's not really about GIS; it's about spatial data. GIS is a tool or toolbox to manipulate and store and deal with spatial data, but we deal with the analysis of spatial data, so it goes beyond that. [inaudible] This is called KDD, for knowledge discovery from data, and really it's meant as knowledge discovery from databases, and that's where a
lot of the machine learning impetus comes from: you have these huge databases, what are you going to do with them, what can you see in them? There's this kind of sentence that people throw around: you go from data to information to knowledge to wisdom, eventually, maybe. And remember, from the other day, it's about a progression. It starts with very simple tools, with description, and as we move forward we get more and more refined and more and more formal, until we end up, but not in this course, with formal process modeling, regression analysis, simulation.

So this is the most important slide of the whole talk. What is spatial analysis about? What are the questions that you ask?

First: where are things? We try to identify patterns, hot spots, clusters, disparities.

Second: why are they there? It's one thing to find a pattern, which basically means things are not random, there is some rhyme or reason behind this. What is this rhyme or reason? That's not that easy to figure out sometimes. These are things like understanding location decisions, understanding movement patterns: why do people take the bus in some places and take the car in other places?

Third, context: it's about interaction. How do my actions affect my surroundings, my context, my neighborhood, and how does my neighborhood affect or constrain my actions? So there's this back and forth, and this back and forth is what makes statistical analysis with spatial data so complicated, because it's all about feedback. A phrase I like to use in the class is "I am my neighbor's neighbor". If my actions affect my neighbor, and my neighbor's actions affect me, then in effect I affect myself through my neighbor; it bounces back. Think of a system of interacting agents, each bouncing back off each other, and you end up with a fairly complex system. Teasing that out is a lot of what we're trying to do.

And then the last bullet is, how do you say, I forgot the word, it's prescriptive: given what we know about where things are, given what we
know about why they are there, where should they be? For example, and this is very real, the Array of Things is about to locate 500 sensors in the city of Chicago to measure pollutants, look at activity on the street, and all kinds of things. Where should they be? Not random. We have to understand the underlying patterns in order to make an optimal decision. And again, we also need to understand what optimal means; optimal means different things to different people, so we have to take the multi-objective nature of these processes into account. We won't go there in this class, unfortunately.

So when is an analysis spatial? I have a little example, and we'll come back to this later. If you change the location of things and it doesn't matter, then it's not spatial analysis; but if it does matter, then it is spatial analysis. To give you an example: I don't know how many of you are familiar with Milwaukee, which is just north of here. One of these maps is real, the other one is fake. This is using census data with the percentage of African Americans by census tract. Now I'll give you a hint: Milwaukee is, like Chicago, one of the most segregated cities in the U.S.
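
The test stated above (move the values to different locations; if nothing changes, the analysis is not spatial) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the 10 by 10 grid is a made-up toy "city", not the Milwaukee tract data; the function names are hypothetical; and Moran's I with rook (edge-sharing) neighbors is used here as one common measure of spatial autocorrelation, which is not necessarily the exact tool shown in the lecture.

```python
import random
from collections import Counter


def morans_i(grid):
    """Moran's I for a rectangular grid, using rook (edge-sharing) neighbors."""
    rows, cols = len(grid), len(grid[0])
    values = [v for row in grid for v in row]
    n = len(values)
    mean = sum(values) / n
    denom = sum((v - mean) ** 2 for v in values)
    num = 0.0
    weight_sum = 0  # number of (directed) neighbor pairs
    for r in range(rows):
        for c in range(cols):
            for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                rr, cc = r + dr, c + dc
                if 0 <= rr < rows and 0 <= cc < cols:
                    num += (grid[r][c] - mean) * (grid[rr][cc] - mean)
                    weight_sum += 1
    return (n / weight_sum) * (num / denom)


def histogram(grid, bin_width=20):
    """Count of cells per bin; location plays no role here."""
    return Counter(int(v // bin_width) for row in grid for v in row)


def shuffle_locations(grid, seed=0):
    """Keep the same values but assign them to random cells."""
    cols = len(grid[0])
    values = [v for row in grid for v in row]
    random.Random(seed).shuffle(values)
    return [values[i:i + cols] for i in range(0, len(values), cols)]


# Toy "segregated" map (an assumption for illustration): high values in the
# left half of the grid, low values in the right half.
real = [[90 if c < 5 else 10 for c in range(10)] for _ in range(10)]
fake = shuffle_locations(real)

print("same histogram:", histogram(real) == histogram(fake))
print("Moran's I, real map:", round(morans_i(real), 3))
print("Moran's I, shuffled map:", round(morans_i(fake), 3))
```

The two histograms are identical because shuffling only moves values around, while Moran's I is close to 0.89 for the clustered toy grid and should be near zero for the shuffled one. A statistic that changes under shuffling is using location; one that does not is not spatial.
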
So if it's segregated, then one color would be in one place and the other color would be in the other place, right? The other map takes this very same data and shuffles it around randomly. This is the expression of spatial randomness: things could be anywhere, equally likely. We'll get back to this. So this is the real map, and this is the fake map.

A map, as you might expect, is part of spatial analysis; it visualizes the spatial distribution of the data. Another way of visualizing the distribution of data is a histogram. A histogram takes these same percentages, puts them in bins, and shows as a bar chart how many of the tracts are in each of these bins. So you see there are extremes: there are a lot of tracts that have very few, then some tracts that have very many, and then a little bit in between. But the key point to take away here is that these two histograms are exactly the same. Spatially these are totally different cities, but in terms of the data, without the space, they're the same. So a histogram is not spatial analysis; it is spatial analysis when location makes a difference.

One of the tools that we'll see to measure spatial autocorrelation, that's the degree to which things are similar in neighboring areas (we'll get back to this), clearly demonstrates the difference between the two. In the random map it's flat, which indicates spatial randomness; there's no structure there. In the real map it's very steep and very high, because all the high-percentage African American areas are in the same places and all the white areas are in the same places. That's spatial autocorrelation. So this is the difference between what you're probably used to, standard statistical analysis, and making it spatial by making the location explicit and bringing the location into the analysis. The simplest thing is a map, but there are other things, and many of these other things we'll see as we progress through the course.

So what are we going to do? Well, we'll map, we'll show patterns, we'll try to discover
patterns; it's not always obvious what the patterns are. And then we'll do a little bit of modeling, but not too much in this class.

Big data is really kind of a buzzword. I like to refer to big data as "you know it when you see it". One way of defining big data is that it's big when it's bigger than what you're used to. Now this is different for different people, right? If you use, I was almost going to mention a brand but I won't, if you use a small laptop, you have a memory constraint, so very quickly, if you have terabytes of data, forget it. Then that is big data for you. If you're working with a cluster like the Research Computing Center here in Chicago, then terabytes is not big data for you; petabytes might be big data for you. So it all depends on the hardware that you have. I'm often in discussions where we have, for example, physicists, and what the social scientists think of as big data are toy data for the astrophysicists. So it's all relative; that's really what I'm saying.

But the important thing about big data is when the size of the data, or the speed at which it's acquired, challenges your current methodological toolbox. For example, we can all do maps, right? But can we do maps for streaming data? That's a different story. That data gets updated by the second or by the minute, so can we keep the maps up to date as the data get updated? Maybe. But can we do the analysis, and can we do it in real time? Nobody has any use for a weather model that takes a week to calculate tomorrow's weather, right? You have to take that into account, and this is where you hit the constraints of big data. If your model is so complex that it takes too long to calculate something, then you have to rethink the model, rethink the methods, rethink the algorithms, and that's where big data meets spatial analysis.

There are many issues, of course, associated with big data, and I won't elaborate too much on this, but one big statistical issue is:
what is this data? If you have data on everything and everybody, is this still a sample, or is this the population? That is an interesting question that has implications for the kind of statistical analysis. Actually, it's very interesting that in the statistics discipline, tests become less meaningful as you move into big data. What does a p-value of 0.05 mean if you have 200 million observations, right? Especially if you have two hundred million observations every few seconds.

Another change in the paradigm, although this is itself being challenged and changing, is an emphasis on correlation, not on explanation or causation; so an emphasis on prediction. In part this is because of the marketplace. Where is a lot of this big data stuff happening? In companies that have big data. Google, Facebook, those kinds of companies have oodles of data. They're not necessarily interested in understanding exactly why you buy a particular gimmick, but they want to be able to give you a recommendation for something that you will then buy. So they're trying to predict your next step, or predict your behavior, rather than necessarily understand the behavior. That's why sometimes some of these things don't really work that well. There was a big buzz a few years ago about this algorithm based on searches and things like that, predicting where the flu epidemic was going to be picking up. It was all the rage, and it seemed to be doing a pretty good job. The next time around it did a very terrible job, because something in the system had changed. That's what social science is about: understanding the system. But the predictive model had not picked that up, and so it wasn't doing well at that stage.

In the social sciences, as I said, it's all a little bit relative. The theoretical physicists and the astrophysicists kind of snicker when they hear us talk about big data, because it's not really big data, but it's big
for us, and it challenges our traditional methodological toolbox. Just to give you an example of something you might try one of these days: I'm assuming you all know what a scatter plot is. A scatter plot has two variables and has the points as they are measured on each of the variables. So you have a scatter plot cloud, and usually it has a nice little shape; you can draw a little line through it, right? Try making a scatter plot with a million observations. It's a blob. You don't see anything; there are a million points on top of each other. Unless your axes are really, really long, so you can stretch it a little bit, you will not see any individual observations. So our traditional toolbox, which is a scatter plot where you can see each individual point and click on it and identify it, falls apart, and we have to start thinking about binning processes or shading or different tricks to show how many observations are in the different spaces in your scatter plot.

So what are some of these data sets that challenge us? I like to refer to them as new data rather than big data, new data for the social sciences. Sensors: I mentioned the Array of Things; there are other types of sensors that are basically fixed points. Pay attention to these things, because they matter in terms of the analysis: whether the points are fixed or moving or random, all these things will have an implication for the kind of analysis we do. A lot of cities have come out with open data portals where they provide administrative data. We'll use an example of that in the first lab, and as I already warned you, it's not as easy as it seems; this is where the data munging and wrangling comes into play. Of course social media has been a goldmine for social science analysis. 311 calls, same thing: people call in about potholes, about graffiti, about all kinds of things, and that data is recorded with a timestamp but also with a location, so spatial analysis becomes possible. And you don't
have a good sense of how bad it was when I started doing analysis. My first sample dataset had 49 observations. The classic example of spatial autocorrelation in a textbook had 26 observations. If you have 26 observations and you come to me and say "I want to do my project on this", I'm going to tell you to go home and start over, right? That is not what we do anymore. So a lot of data now have locational stamps, and lots of it, like these 311 calls, have addresses, and that has changed the type of analysis that we can do.

Just to give you some examples: this is the Chicago Data Portal. They changed their website, so I couldn't reuse my old slide; I had to bring it up to date. But it's there; there's a lot of stuff there, like 311 calls and abandoned buildings. You can have a little map; it gives the little dots where people called in and said, "hey, this place looks like it's empty, can you do something about it?"

This is the Array of Things. That's a project by Argonne; Charlie Catlett is leading it, and it's between Argonne and the University of Chicago. It's a very ambitious measurement project to put sensors all over the city and measure a whole range of environmental attributes. There's also a camera in there, which of course makes people very nervous, but it's been manipulated so it can't see faces and things like that. It can measure movement, though, so it can track car traffic at an intersection. These things tend to be mounted on traffic lights, so they can sense the intensity of traffic, even the intensity of pedestrian traffic on the sidewalks, if there are sidewalks, and things like that. These are some of the actual and planned locations of some of the sensors, and you see this is at the very beginning; there are whole parts of the city where there is nothing.

This is another type of sensor, and I wanted to put this in here because it's not a sensor in the usual sense; it may not be what you think of when you think of a sensor. These are
Divvy bike locations, and every Divvy bike location is actually on the Internet. If you understand the API, you can measure how many bikes are taken out and put back in at any time. So this is a different kind of sensor that is not environmental or anything like that, but it does give you the pulse of the city. Where are people getting the bikes? Where are they dropping them off? That's a major operational issue, but it's also interesting behaviorally. The location of these dots is not random, even though just by looking at the map you cannot say that there is patterning. I will keep repeating that: don't ever say "I see a cluster here". No; you apply one of the techniques that we'll see during the course of this class, and then you can quantify it: yes, this is a cluster, it's not a random thing, it's not just me trying to see clusters.

This is an example of a really interesting study using cell phone calls. A few years ago the European soccer championship had Italy playing in the final, and when Italy scored, everything went off. You can actually see it; there's an animation of these phone calls, and you can follow people as they move throughout the city and make calls. This is one of the uses of new data. I mean, twenty years ago, well, I didn't do my dissertation twenty years ago, but when I did my dissertation, forget it; that stuff didn't exist. What this does is allow you to answer different kinds of questions: not just the same questions with more detail, which of course you can do, but also brand new questions, and that's the exciting part. This is some of the work I've done on social media; we can skip through this.

So what are the paradigms? There are two new things that have been added to the usual theory and experimentation of scientific discovery. One is computation. Computational social science is using computation as what some have referred to as a third approach to scientific discovery, and that means an explicit use of new techniques like simulation,
machine learning, and text mining, which is a really big deal, where all kinds of historic documents can be digitized and analyzed in quantitative ways, in ways that were never possible before. Data exploration, too: these are all becoming part of the computational social science toolbox.

And then even further is the so-called fourth paradigm, which goes as far as stating that theory is dead. There's a very famous little article in Wired magazine: theory is dead, we now have all this data, we don't need theory anymore. It's a little bit exaggerated, but the new data do change the playing field, and they do allow a lot richer exploration than theory allows. One of the beauties of theory is that it's simple, right? You reduce the complexity of reality to some kind of nice analytical framework where you can draw general conclusions, but in the process you lose some of the richness of reality. If you have millions and millions of data points, you have the richness of reality, so you can find things that theory does not necessarily point to.

A lot of this, as I mentioned earlier, is driven by industry, and that's kind of an interesting new phenomenon, where large companies like Google, Facebook, and Microsoft are really interested in mining their data, of course not necessarily to move social science forward but to make money, right? So they have their own motivations, but it does provide social scientists with incredibly rich data that they could not collect on their own, so it's not a bad thing. So, if you have the time, there's this article, I think there's also an additional one on social science, and then this book put out by Microsoft Research on the fourth paradigm, which is data-driven science: data as a driver of science, rather than experimentation or theory as the driver of science. It's a different way of looking at the world.

And then what is data science itself? There are probably as many definitions as data scientists. This Venn diagram is famous; I don't think it's
published anywhere; it's on the web. It's by Drew Conway, and it's basically: data science is everything, right? So you have substantive expertise, you know math and statistics, you know computer science and you can program, and if you can do all those things you're a data scientist. So what does this mean in reality? In reality it means that the data scientist does not exist. What does exist are data science teams, and typically a data science team has people in each of these circles who are able to communicate and be in the middle of the three circles. So it's not good enough to just be a social scientist; you have to be a social scientist who knows enough about math and statistics and algorithms and data to be able to talk to computer scientists and talk to statisticians, and vice versa. That's really what data science is about.

Some people have put it this way: a data scientist is a statistician who knows more computer science than other statisticians, or a computer scientist who knows more statistics than other computer scientists. You get the gist, right? It combines ideas from computer science, primarily databases, data mining, and machine learning, and ideas from statistics. There's this thing called statistical learning, and statistical modeling, and a lot of these techniques are actually the same but they have different labels. For example, logistic regression is a standard technique in statistics, and it's also one of the mainstays in machine learning. You might think these are two different things; they're not, they're the same. So that's what I'm trying to do with spatial data science.

Here are some of my favorite books. If you're an R person, you have to work through this book, all about how to manipulate data and do it efficiently with the packages; that's kind of a must-know if you want to be a serious data scientist. There are counterparts in Python as well, I should mention. Doing Data Science is a couple of years old, but it was put out by somebody who
came from Google and taught a data science class in a business school at Columbia so it's very much geared to a business narrative but it talks a lot about data wrangling visualization finding patterns those kinds of things which is what we will be doing they don't talk about space they make some maps but they don't talk about space so that's what I do I talk about space so the main thing about spatial science spatial statistics spatial econometrics is that we have an explicit treatment of the spatial aspects that means we take the where into account we take distances into account we take spatial arrangement into account and it all becomes part of our analysis so this is not just doing standard analysis and then making a map of the results a lot of standard machine learning will have maps but they are not spatial analysis they're just presenting the results what we will be doing and we'll be doing this fairly quickly is bringing these spatial data and their representations in maps or other types of graphs in as an explicit part of the analysis so for example you have exploratory data analysis in statistics which is a lot of graphs and tables and different ways of looking at the data we will have exploratory spatial data analysis where we'll do very similar things but we'll connect them to where they happen for example rather than just having a histogram as such we will know where these bars for the histogram are in geographical space and that's called a choropleth map it's the maps that you're probably used to they're basically the counterpart of a histogram but making the location explicit so as I mentioned before a lot of it is data preparation you know that means that maybe 80% of the quarter I should spend on this grunt work of how you deal with dirty data I won't do this in one lab you'll get a taste for it and then we move on with clean data sets and everything will be ready for analysis but the important thing is that it's all about data structures it's all about workflow 
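The link between a histogram and a choropleth map mentioned above can be sketched in a few lines. This is only an illustration under stated assumptions: the tract names and values are made up, quantile classes are just one possible choice of bins, and the helper names are hypothetical. The same class assignment gives both the bar heights of the histogram and the color class of each areal unit on the map.

```python
import statistics
from collections import Counter

# Hypothetical values for eight areal units (made-up numbers, for illustration only).
values = {
    "tract_A": 12.0, "tract_B": 35.5, "tract_C": 18.2, "tract_D": 51.0,
    "tract_E": 27.3, "tract_F": 44.8, "tract_G": 9.6, "tract_H": 30.1,
}

def quantile_breaks(data, k):
    """Return the k-1 interior cut points that split data into k quantile classes."""
    return statistics.quantiles(data, n=k, method="inclusive")

def classify(value, breaks):
    """Return the index of the class a value falls in (0 is the lowest class)."""
    return sum(value > b for b in breaks)

breaks = quantile_breaks(list(values.values()), 4)

# Choropleth side: each location gets a class, which a map would turn into a color.
classes = {name: classify(v, breaks) for name, v in values.items()}

# Histogram side: the same classes, with the location information thrown away.
histogram = Counter(classes.values())

for name in sorted(classes):
    print(name, "-> class", classes[name])
print("bar heights:", dict(sorted(histogram.items())))
```

The dictionary keyed by tract is what a mapping tool would join to the polygons; the counter is what a plain histogram would show.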
one of the big challenges in data science is how to make this data wrangling as automatic as possible you know Google just came out with a new tool and there's a couple of other competitors who are trying to make strides to make this process more and more automatic but if you work with data with missing variables with mistakes anybody who's done any kind of data analysis or for that matter data entry knows mistakes are made it just happens you know Murphy is your buddy it happens so how do you pick this up how do you detect this in an automatic way well one way is highly labor-intensive and that's the traditional way but you might be able to do that for a table with 20 entries maybe 80 entries you cannot do that for a table with 200 million entries you just simply can't do it so how can you make these processes more structured more organized and I like this little graph which comes from the book and it's the workflow right you start with the data you clean them up and this is called tidy because the package is called tidy and then you're in this circle where you start by manipulating the data transforming them visualizing them modeling them changing maybe the data transformations redoing this process until the light bulb goes off you infer something you figure something out and then you're not done and unfortunately we won't have a lot of time to spend on that but a very important part of the analysis is how you present and translate your typical audience will not be me it will be people who have no idea what you're talking about so a very important part of data analysis any data analysis including spatial data analysis is being able to translate the results to a non-expert audience okay spatial data science I already told you we'll do all this stuff we'll focus on the exploration the visualization we won't be doing modeling in this course that's for a separate course the key even though I will be using GeoDa for most of the analysis the key thing take-home 
message is there is no one-size-fits-all software package and the strength of a data analyst is to know tools and to know which tools are good for which types of analyses and typically you have to combine two different things and so you will never hear from me oh just do R or just do Python the bad news is you have to do both and you have to discover that for certain types of analyses the Python tools are better faster more scalable for other types of analyses the R tools are more intuitive easier to implement and so that's part of the skill set of being a data scientist and just to give you an example of some of the work that I did with my colleague Sarah Williams at MIT so we got what is called the fire hose and the fire hose is everything from Twitter and Foursquare Foursquare is defunct now but Foursquare used to let you like places so you go to restaurants and you become a citizen of the infrastructure built around it so basically the essence is that we knew where people were you know when entering this information so we have one week and basically this is the kind of analysis so our raw material was JSON files JSON is a structured format we had close to 6000 files with more than five million messages so is this big data not really but it was the biggest data set that I had handled up to that point now I'm working with a data set with 58 million observations and that's getting bigger so first of all you have to get this data and put it in a format that's usable there's a lot of garbage in Twitter so you have to get rid of that stuff and then since we're doing spatial analysis we have to move it into a spatial database so by now we've used Python we've used R we've used PostGIS then we do spatial aggregation which we'll do in the lab on Monday that's R or PostGIS now you can do it in GeoDa as well and then we do some statistics and visualize them in GeoDa and the end result is something like this a map with hot spots and cold spots 
for the New York City area so that's an example of what is involved in spatial data science so the data and the data questions and it's important to think about this so I'm bringing this up at the beginning but we'll get back to this as we move through the different techniques and the different kinds of data so to do this in a computer you have to abstract everything so you think of locations as points you think of street networks as lines you think of neighborhoods or counties or census tracts as polygons and then you formalize this whole structure that's called a data structure in computer speak so there's a standards organization it's called the Open Geospatial Consortium and they have specifications for these kinds of data structures and an important one is called the simple features specification and if you've ever done some kind of computer modeling you will recognize the symbols if you haven't they're just boxes and arrows and they show how everything is constructed from other pieces so the top concept the abstract concept is a geometry and so a geometry can be a point a curve a surface or a collection of these things and then you go back to the primitives and this defines how you store data in a computer that's a data structure so then for analysis we can think of broadly speaking four different types of data first points now points are very tricky points can be a number of different things the most common occurrence of a point is an event so you think of the location of you know traffic accidents where do they occur or they can be the locations of things like if you're interested in food deserts and food access you have to know where the stores are and the stores become represented as points right then we have surfaces continuous surfaces you know think of a 3d image so things like air quality or noise in a city or if you're an economist the price surface think of it as an abstraction if you know the price for every house in the city and 
every house becomes a little point is there a surface that you can drape over that so that you can then interpolate prices where you don't have information this is a classic example of spatial interpolation so we have surfaces then we have this is a misnomer it's called lattice data in the literature but it's really about areal units counties census tracts you know provinces states these will be represented as polygons in the spatial data structure and then finally we have networks nodes and links and so these can be street networks if you model traffic flows or if you want to model accidents on a street network or crimes on a street network it could be river networks if you want to model pollution or social networks which is an abstraction so the nodes are the people and the edges are the people you connect with so these are the kinds of formal mathematical structures that we will be working with it's all about geometry but it's geometry translated if you wish into algebra and then into computer code to be able to manipulate this I give this example because if you haven't taken GIS and you haven't been exposed to geography think of this problem if you have a map a county map and you have a city in the county you look at the map you know immediately that city is in that county right you see the point inside the polygon how would a computer know that how would the computer know that that point is inside that polygon because it's fine for you to look at a map but when you have many points and you have to figure out which points are in which polygons you can't do this anymore you have to have the computer do it for you so you have to think about this in abstraction and develop algorithms that will do something clever to figure out whether it's inside or outside or touching and all these kinds of things that's what a lot of the computer work in a GIS is about now we won't actually be doing that but we will 
be building upon it and it's important that you know that those are the building blocks that we build our analysis on top of okay space-time data unfortunately I don't have the time to cover that in this class but there's two very important classes of analyses one is called time and space where basically you follow the same cookie cutter the same spatial locations over time so these would be our sensors we have our Array of Things 500 sensors every minute we get information so the sensors don't change the data changes that's time and space the opposite is space and time think of objects if I have time I'll show you an example of taxi cabs in New York so the taxi cabs change their location over time so that's not a fixed cookie cutter but it's a moving cookie cutter so to speak so that's another set of problems a lot of applications in biology and animal tracking so you know the grizzly bears in the Northwest they all have a little thing in them so we know exactly where they are at all times so how do we make sense of this how can we use this information to figure out how much space they need I don't know if you've been in the West I just drove back from the west coast and it's very interesting in some areas there are these overpasses and the overpasses are not for people or for cars they're for animals and they're designed to make sure that the elk and whatever else it is you know don't get hit by cars I mean it's also so the cars don't hit because you don't want to hit a deer or an elk with your car but it streamlines these migration streams and the only way that can be done scientifically is by taking advantage of this animal tracking information building models for the movement patterns of these animals and then putting overpasses where they are likely to cross so this stuff is not in isolation it has real applications so the types of data often determine the questions just a few example questions so are traffic accidents located 
randomly in space or are they somehow organized in clusters this would be called point pattern analysis you know where I live downtown there's a lot of accidents and the accidents are because a two-way street becomes a one-way street and a lot of people have issues with their vision and can't see that it is a one-way street so they keep going and they hit the other car now you don't have to do statistical analysis to figure out that this is not random that this is something that you might want to do something about but there are for example a lot of studies of pedestrian accidents where pedestrians get hit and hurt so how do you find out where this is the case analysis of incidence of disease there's a very famous study on breast cancer in Long Island there was a suggestion that there were clusters of breast cancer in Long Island several studies some identifying the clusters some dismissing the clusters very tricky thing to do but again these are point locations and you have to determine are these structured somehow or are they random I mentioned spatial interpolation before that's the case when you have sensors hot spot detection cluster detection for example if you have mortgage foreclosures again I'll show you an example in a few minutes this is not random it tends to happen in certain neighborhoods more than in other neighborhoods how do you identify the clusters and then we can model this in spatial regression so some other gotchas is the data sampled or is it everything there was a big controversy in spatial data analysis at some point if you analyze the data for a map so you study all the census tracts in Chicago is that the population because there are no other census tracts so how can this be a sample you have it all or is it somehow a sample the answer is that it's sort of a sample because you change the notion of population to a super population so the particular layout that you observe for all the census tracts is one of any number that could have 
happened so that's how statistics come back into play we mentioned discrete versus continuous locations given or random let me spend a few minutes on some pitfalls just so you know the terms the ecological fallacy is a classic term in social science research it is basically confusing individual behavior with aggregate behavior you have this in sociology you have this in poli sci you have this in economics it's an aggregation problem so let's say you have data on crime rates by county and you explain crime rates by county that doesn't explain anything about criminal behavior right criminal behavior is an individual entity crime rates by county are aggregate units now you may think this is obvious but you wouldn't believe how many papers out there make this mistake and the mistake happens at the end when you write the conclusions where the conclusions are written as if they are about individual behavior but really they're based on aggregate magnitudes and aggregates mix in everything so just to make this simple an aggregate is a really good representation of individual behavior if all the individual behavior is identical so if all the people in the county are exactly the same then whether you take all the individuals or the county as a whole doesn't make a difference but the moment there's differences and especially when these differences are strictly bimodal low income people high income people on average they're medium income people but there are no medium income people in that county it's one or the other right so these kinds of things are the traps that you run into with the ecological fallacy another trap which is more geographical is the modifiable areal unit problem and that means that you can have different results depending on the scale of analysis and a classic article is called a million spatial autocorrelation coefficients by Stan Openshaw basically what he did is he looked at election results in Iowa started at the ward and computed a spatial 
autocorrelation coefficient then went up to bigger and bigger units and the results changed the signs even changed and so we really have two problems one is the size of the unit the aggregation problem the other one is specifically geographical and it's called the zoning problem that's the spatial arrangement of the units so this is something to be aware of when you do this and then the change of support problem I mentioned before let me get to the fun stuff so these are a couple of examples and I want to just elaborate a little bit on the kinds of things you do and the kinds of questions you address in you know the last few minutes or so so these are all real okay these are not fake these are real these are real locations of car thefts in San Francisco this is one of the sample data sets of the Center for Spatial Data Science I'll mention this again in the lab so we have a whole bunch of sample data sets this is one of them and so how are we going to approach this first of all what are these points these points are random events so they're not fixed right the data have randomness so we want to understand the spatial distribution of these points and just by looking at it you see this is not evenly spread throughout the city so we have techniques to bring out the spatial distribution and many of you are familiar with this this is now all over the place these heat maps you know and lots of pieces of software give you heat maps and they show you that you know the intensity of car thefts is much higher in you know parts of the city than in other parts of the city okay fine but does this really make sense wouldn't there be more car thefts where there are more cars everything else being the same so really to do this right and we'll get back to this later but I just want to kind of tease you a little bit and make you think about this to do this right you wouldn't just look at the car thefts as such but you need some kind of denominator right if there are a lot of cars same with accidents 
right you're looking at car accidents there'll be more accidents when there are more cars that doesn't mean that it's unsafer or there's something else wrong with it right if you look at diseases there'll be more diseases when there are more people everything else being the same so you have to correct for the denominator and that's one of the things we'll revisit later okay this is a similar exercise again this is a real one this is called the flu mapper and it's done at NCSA at Urbana-Champaign it's based on Twitter messages so they look at Twitter messages and it goes through a whole machine learning thing this is millions and millions of data points to find out the hot spots of mention of flu now whether this is accurate or not that's a different story very debatable but you can now see the high flu areas and you get a sense of the network that could be behind the spread of the flu again it gives you a sense of the kind of thing you can do these are sensor data this is LA it's a study I did several years ago we had a lot of house prices about 200,000 of those and we wanted to look at the effect of air quality on house prices you know any economist would say that everything else being the same a house with a better air quality would have a higher price the same house should have a higher price than with a poor air quality so is there a way we can extract this information from the data so you know houses typically don't have air quality monitors on top of them so in LA there's about 30 34 of these so we need to interpolate so we use a statistical technique to take the measures and then spread them out and assign a value to each of the houses it's another example this is the cluster and hotspot detection so this is data for Columbus Ohio Franklin County on foreclosures during the crisis the 2008 crisis again you know you can't see this from this map but it seems to suggest some patterning and in fact if you apply what we call a local test for 
spatial autocorrelation which we'll get to later in the quarter you can identify the hot spots and the cold spots and there is clear structure in the data then this is some fun Uber stuff you know Uber used to have its analytics in the open they don't anymore they still do a lot of analytics really interesting ones but this is an example of a flow map and again this isn't random but we have to understand what's behind this that drives the flows and then one of the analyses you can do is you know where are the points where people get picked up and where are the points where people get dropped off and one of the tools we'll see when we talk about visualization is a so-called cartogram if you're from New York you might not recognize the shape of Manhattan but this is a cartogram where the size of the areas is proportional to the data magnitude so it's not area but it's the number of Uber drop-offs in that particular area so for the city I mean for the New York metropolitan area Manhattan is way bigger than it is physically because that's where all the action of the Uber drop-offs is and I want to stop with a little animation I found this the other day and it's super cool let me just tell you what it is okay so there's this data set and it's kind of a challenge a grand challenge data set the New York taxi authority released all its data for 2013 or something like that so it's every taxi pickup and every taxi drop-off geocoded and time-stamped so taxi with license number so-and-so picks up at lat/long here at 4 a.m. in the morning drops off at lat/long there at 4:20 right and then because these cabs all have GPSs in them you actually know where they are and I think so I've sped it up so every second is 15 minutes and this is one particular cab as it moves and picks up sits around and then has a really good ride but then kind of has a big drought nobody wants to go back to the city and so they just kind of drive back and see the time it's 4:00 a.m. 
there's nothing you know it's just sitting around having some coffee then picked up somebody and then just kind of things pick up as we get closer to rush hour and I'll stop this at noon you know but you get the gist of it right so this is the kind of data that the new data now make it possible for us to analyze and we can analyze this in a number of different ways we can think of the spots where people get picked up as a point pattern we can think of the spots where people get dropped off as a point pattern we can aggregate them by neighborhood and find you know like the cartogram you saw the neighborhoods with a lot of pickups and with few pickups and we can go further and look at the actual flows you know where are these flows channeled how are they channeled from origin to destination so unfortunately it's only nine weeks there's only so much I can do we won't actually get to do this but we'll do a lot of the stuff a lot of the bits and pieces that get us there so that's where I wanted to end it today so it's kind of a setting the stage the slides are on the site the lab notes will be on the site ahead of time so I'll see you on Monday
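Aggregating pickups by neighborhood relies on the point-in-polygon question raised earlier in the lecture: how does a computer know that a point is inside a polygon. A minimal sketch of one common answer, the ray casting (even-odd) test, is given here. It is not the method used by GeoDa or PostGIS, which the lecture does not describe; the neighborhood shapes, pickup coordinates, and function names are all made up for illustration, and points that fall exactly on a boundary (the "touching" case) are not treated specially.

```python
from collections import Counter

def point_in_polygon(x, y, polygon):
    """Ray casting: shoot a horizontal ray to the right of (x, y) and count
    how many polygon edges it crosses. An odd count means the point is inside."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the horizontal line through y can be crossed.
        if (y1 > y) != (y2 > y):
            # x coordinate at which the edge crosses that horizontal line.
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

def count_by_area(points, areas):
    """Assign each point to the first polygon that contains it and count per area."""
    counts = Counter({name: 0 for name in areas})
    for x, y in points:
        for name, polygon in areas.items():
            if point_in_polygon(x, y, polygon):
                counts[name] += 1
                break
    return counts

# Hypothetical neighborhoods: two adjacent squares (vertex lists, made up).
neighborhoods = {
    "west": [(0, 0), (2, 0), (2, 2), (0, 2)],
    "east": [(2, 0), (4, 0), (4, 2), (2, 2)],
}

# Hypothetical pickup locations; the last one lies outside both squares.
pickups = [(0.5, 0.5), (1.5, 1.0), (3.0, 1.0), (5.0, 1.0)]

counts = count_by_area(pickups, neighborhoods)
print(dict(counts))
```

Looping over every polygon for every point is fine for a toy example; with millions of pickups a spatial index would normally be used to narrow down the candidate polygons first.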
Info
Channel: GeoDa Software
Views: 24,108
Keywords: geoda
Id: MmCYeJ27DsA
Length: 75min 56sec (4556 seconds)
Published: Sat Oct 07 2017