Introduction to Geospatial Data Analysis with Python | SciPy 2018 Tutorial | Serge Rey

Captions
welcome everybody to the geospatial tutorial. I'm Serge Rey from the University of California, Riverside, and it's my privilege to be one of the co-chairs for the SciPy meeting, so welcome. How many of you — is this your first SciPy? Excellent, so that means it's your first workshop here as well. Good. This is going to be a joint effort between myself and — do you want to introduce yourself? Yes, I'm Joris, from Belgium, currently working in France, and one of the GeoPandas maintainers. Then we also have Levi Wolf from the University of Bristol, who works on the PySAL team, and Dani Arribas-Bel, who's at the University of Liverpool, also on PySAL. So this is a joint tutorial between GeoPandas and PySAL — the first time we're doing this — so it should be exciting. It's going to be a four-part session: Joris will start, then Dani will take over, then we'll probably have the break, and in the second half it'll be myself and Levi. Because there are four of us, we have a chance to help if people are stuck with the installation or have questions while we're going through this.

What were we going to say about Binder? Yes — if you don't get everything working and installed but you want to follow along now, there is a "launch binder" button on the repo, and that should create a working environment with all the materials online. It may not be the fastest, but it should be a backup in case things are not working locally.

Maybe to start, to get a little bit of an idea: who has already used GeoPandas before? Some. Who has already used PySAL or related libraries? A few. For those who didn't use GeoPandas yet, have you already used the pandas library itself? Yes, most of you did. That's good, because we will assume a little bit that you're familiar with pandas — we won't explain everything about pandas — so it will be helpful that you are familiar with it.

So the first part will be an introduction to GeoPandas and related libraries to start working with geospatial data: reading in data, doing some queries on it, some manipulations. Then after the break we'll go to the PySAL part, which is more the spatial data analysis and spatial data science side.

To get started: is anyone not familiar with the Jupyter notebook? Everybody is familiar — you may have used it in the previous tutorial as well, so that's good. We start the notebook in the directory that you downloaded or cloned with git, and then you should see something like this. We are going to start with the first notebook, the introduction to geospatial data. Is this big enough to be readable in the back? I can increase it a bit more — yes, that's better, okay.

Maybe another question: who of you followed the CartoPy tutorial yesterday? A few — so can you explain what the difference is between raster and vector data? Does somebody want to try, or give one of the examples from that tutorial? Yes, indeed. Vector data — it's in the title here as well, geospatial vector data — is what this tutorial will mostly focus on: data where we have polygons (like neighborhoods or cities), points, and line strings. That's the type of geospatial data you will mostly see in this tutorial. On the other hand there is also raster data: think of satellite images, arrays of pixels. We will mostly focus on vector data, and that is because vector data fits naturally in a table: if you have points,
say a set of points of interest within a city, you often have additional information about those points or polygons — attributes — and you typically want to put that in a table where you have the different attributes for each point. Each row is then the metadata linked to a certain location in space. So this fits very naturally in a table and, in Python, in a pandas DataFrame. That's the kind of data we will start with.

The first thing, of course, is that we have to get some data. In geospatial data there are many different file formats, some very specific: shapefiles, GeoJSON files, GeoPackage files. GeoPandas provides an easy way to read all those different formats, using the GDAL library under the hood. GDAL is a C-based library from the open-source geospatial (OSGeo) ecosystem that is used in many libraries and applications for reading and writing spatial data.

Here's a small example. In the materials that you downloaded we provided a zip file from Natural Earth with all the countries of the world. We can use the geopandas read_file function to read in those data, and if you look at the first few rows you see it's a dataset of all the countries of the world. There is a geometry column — that's the important thing to see — and in this case they are polygons, because countries are polygons. We can also quickly plot it, and you will see we have all the countries of the world.

Someone asked about coordinate systems — yes, the plot method itself is naive to that: it will just put the coordinates it gets on the map. We will see methods later on to convert coordinates, but that is something GeoPandas does not do automatically; you have to do it yourself, to make sure that the datasets you combine are in the same coordinate system.

So what did we get? The countries dataset is a GeoDataFrame. As I already explained, we have our geometry column, and the other columns are the attributes of those geometries. Something special about this compared to a normal DataFrame is that you can always access the geometry column with the .geometry attribute, and that gives you a GeoSeries. Just like a GeoDataFrame is a subclass of a DataFrame, a GeoSeries is a subclass of a Series: it has all the functionality of a normal Series, plus some additional methods and attributes specific to geospatial data. As a small example, we can get the area of each polygon with the area attribute on our GeoSeries, and later on we will see many other methods that are available on the GeoDataFrame and GeoSeries.

It is of course still a DataFrame, so everything you know from using pandas will still work: we can access the population column and calculate the mean, or we can filter our dataset and take a subset. In this case I take the continent column and compare it to 'Africa'; that gives a boolean Series of True/False values, and because it's a boolean Series we can use it to filter our original DataFrame, keeping only the countries that have 'Africa' in the continent column. In that way you can easily use the familiar pandas syntax to take a subset of your data — in this case, all the countries of Africa.
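A minimal sketch of those first steps, assuming the Natural Earth countries file shipped with the tutorial materials; the exact file name and the column names (pop_est, continent) are assumptions and may differ in your copy:

```python
import geopandas

# Read the zipped Natural Earth countries dataset (path is an assumption;
# adjust it to the file in the tutorial's data directory).
countries = geopandas.read_file("zip://./data/ne_110m_admin_0_countries.zip")

countries.head()                    # attribute columns plus a 'geometry' column
countries.plot()                    # quick map; no reprojection is done for you

countries.geometry                  # the geometry column as a GeoSeries
countries.geometry.area             # area of each polygon, in the units of the CRS

countries['pop_est'].mean()         # plain pandas still works on a GeoDataFrame
africa = countries[countries['continent'] == 'Africa']   # boolean filtering
```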
So, as I said, the GeoDataFrame allows those specific spatial operations. Up to now we have seen a small example with polygons, but as we already said in the beginning, you can have points, you can have lines — or line strings, in the terminology we will use here — or you can have combinations of multiple points, multiple lines, or multiple polygons in a single geometry. For our countries that is a bit of a lonely representation, just polygons, so now we are going to import a few other datasets, also from the Natural Earth website: one with some cities, which are represented by points, and some rivers, which are represented by line strings. If I change this print into a type call, you will see that a single river is a shapely LineString. Shapely is the library that GeoPandas uses under the hood to represent those points, lines and polygons — the geometry objects — and it's also the library that provides all the specific geospatial methods. It does that by interfacing with the GEOS C library, which is, for example, also used by PostGIS, so it's actually the same C library under the hood as PostGIS that provides all the functionality for GeoPandas.

Up to now we read data in from a file; if needed, you can also construct points or polygons yourself. For example, for a point you provide x and y — longitude and latitude — and for a polygon you provide a list of coordinates. Very easy. Further on in the tutorial you'll see quite a lot of methods that are available on the GeoDataFrame; an important thing to remember is that those often also work on individual objects. For example, with the area attribute I got the area of each polygon in my GeoSeries above; if I have a single polygon, I can also get the area of that one polygon. So there is mostly a one-to-one mapping of methods between the GeoSeries and the individual objects.

Then, related to a question we had before: people who followed the CartoPy tutorial will now be quite familiar with the importance of a projection system for putting something on a map. GeoPandas, as I said, is naive with respect to your coordinate reference system — it will not automatically reproject or anything like that — so you need to ensure yourself that, for example, if you are plotting multiple things together, they both have the same coordinate reference system. It does provide functionality for that: your GeoDataFrame, as well as a GeoSeries, has a crs (coordinate reference system) attribute, and in this case you will see that the countries have EPSG:4326, which indicates the most widely used system of plain longitude/latitude based on WGS84. We can plot this without any specific map projection, but we can also change it: we can use the to_crs method, and an easy way is to provide the EPSG code — almost every coordinate reference system has such a code, and I put a link to a website where you can search for the correct code that you need. In this case I'm going to convert our coordinates to Mercator, World Mercator here. If you look at the data before the conversion, the numbers are within the -180 to 180 degree range, because the data was in longitude and latitude.
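A short sketch of constructing geometries by hand and of the CRS handling just described; the World Mercator code EPSG:3395 is an assumption for the target projection, since the exact code isn't stated in the recording:

```python
from shapely.geometry import Point, Polygon

# Hand-built geometries behave like the ones read from a file: the same
# attributes (such as .area) exist on a single shapely object.
pt = Point(4.35, 50.83)                            # (x, y) = (longitude, latitude)
square = Polygon([(0, 0), (1, 0), (1, 1), (0, 1)])
square.area                                        # 1.0

# CRS handling on the countries GeoDataFrame loaded earlier.
countries.crs                                      # EPSG:4326, WGS84 longitude/latitude
countries_merc = countries.to_crs(epsg=3395)       # World Mercator (assumed code)
countries_merc.plot()
```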
If you now look at the converted data, you will see the range of the coordinates is much higher, and if we plot it you will see it now has a different projection — Greenland is much bigger than it was before. Why are those projection systems, or coordinate reference systems in general, important? For one, if you want to plot things, you get quite different results. The CRS conversion functionality is also important if you get data from different sources: often they are provided in different reference systems, so you want to convert them to a common one. And it is specifically important for us because you can calculate distances with GeoPandas, but it will just assume that your coordinates are in a Cartesian plane. If you want a distance metric you can understand — like how many kilometres two points are from each other — you typically want it in metres and not in degrees. So if you have a dataset in longitude/latitude and you want to calculate distances in metres with GeoPandas, you need to convert it to a coordinate reference system that has metres as its unit instead of degrees. That's another good reason why, in GeoPandas, you might want to do this.

Okay, to end this initial introduction before we get you to some exercises, here is an example of how you can plot different things together. I already showed how to plot the countries dataset; we assign the result to ax, which is a matplotlib Axes object, so we can use it to add the other layers as well, by saying that I want to plot my rivers and put my cities on the same map. For the rest it's all based on matplotlib, so the typical ways to style plots are available here as well. In this case I say I don't want any fill (no facecolor), I only want an edge around the polygons, and the cities have to be in red. That then looks something like this — so we have the points, lines and polygons together. Are there any questions up to now about this part?

If not, we will go to the notebook with a bigger exercise so that you can start practicing a bit — this one, the case study on conflict mapping. I will give you a short introduction about the dataset and then you can start with the exercises. The dataset I provide here is about artisanal mining sites in eastern Congo. The organization from which I got the data, IPIS, visited many sites and inspected them, so they have data on how many workers are active, whether a site is controlled by an armed group or not, which minerals are being mined — it's a survey of mining sites in Congo. You have some background there; they provide a public API, and I put an example of how you can interface with that API, but let's start with the exercise of reading the GeoJSON data, because you can also download it as GeoJSON — it's in the data directory in the repo. So the first exercise is to read the data from the GeoJSON file and first inspect it a bit. A final note before you start: I would advise you to really try it yourself, but if you are stuck and we are not available — well, first try to ask us for an explanation or a hint — but if you want to see the solution, or you are ready and want to compare, you can always uncomment this line and run the cell and you will see a possible solution.
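Going back to the layered plot described just before the exercise, a minimal sketch, assuming the countries, rivers and cities GeoDataFrames loaded earlier all share the same CRS:

```python
import matplotlib.pyplot as plt

# Layer several datasets on one matplotlib Axes; styling keywords are the
# usual matplotlib ones, passed through by GeoPandas.
ax = countries.plot(facecolor='none', edgecolor='black', figsize=(12, 6))
rivers.plot(ax=ax)
cities.plot(ax=ax, color='red')
plt.show()
```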
It will maybe not be exactly the same as your solution, and that doesn't mean your solution is bad — there are often several ways to do something — but in that way you can check it. And if you are much faster than we are going, you can already go a bit further: run the code here and do the exercises below as well, so you can certainly already go up to this point.

Let me walk through it and highlight some questions that we had. To start with reading a file, we use geopandas read_file and pass the path to the file, like this, and we call the result data_visits. We can use the pandas functionality we know to check the first few rows. To give you a bit of an idea: there are quite a few columns and we don't need all of them for the tutorial, so I provide some code to select a subset of the columns. If you now look at it, what data do we have here? We have the mine data, the date when it was visited, and so on. I also provided a little bit of code to preprocess the data: we only take the sites that were visited by IPIS, by again doing a boolean filtering operation — keeping the rows where the visitor column contains the "IPIS" string and where there are more than zero workers, so the active mines — and I sort them by date. Some mines were visited several times, so I deduplicate by grouping by the site code and taking the last visit. Finally we convert the result back to a GeoDataFrame, because by doing the groupby we lost the fact that it was a GeoDataFrame. So what we have now is a bit more than two thousand mining sites.
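A sketch of that preprocessing, under the assumption that the column names used below (the visitor, worker-count, visit-date and site-code columns) match the IPIS GeoJSON; check data_visits.head() for the actual names in the tutorial file:

```python
import geopandas

# File name is an assumption; adjust to the GeoJSON in the repo's data directory.
data_visits = geopandas.read_file("./data/ipis_cod_mines.geojson")

# Keep sites visited by IPIS with at least one worker (column names are placeholders).
visits = data_visits[
    data_visits['visit_onsite_by'].str.contains('IPIS', na=False)
    & (data_visits['workers_numb'] > 0)
].sort_values('visit_date')

# Deduplicate: keep the most recent visit per site, then restore the GeoDataFrame
# type, which is lost by the groupby.
data = visits.groupby('pcode').last()
data = geopandas.GeoDataFrame(data, geometry='geometry', crs=data_visits.crs)
```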
The other dataset that we are going to use throughout this exercise is the protected areas — the national parks in Congo. We again provide the link where we got the data; I will just show you the result if you do it similarly — sorry, apparently I don't have the latest version here — this zip file, yes, thank you — so this should work, I think... no. Apparently some people already had that problem; sorry for that. Let's quickly check the data directory. The problem is indeed that shapefiles are annoying. For those not familiar with them: a shapefile is a file format, but it's not a single file — it's actually multiple files that you always need to keep together, which is why people often put them in a zip or another archive. But then, depending on how it was zipped, the support for reading it directly from the archive is not perfect: it works — we used it in the previous notebook — but it doesn't always work, for example when the files sit in a subfolder inside the zip. In case you don't get it working with the archive, you can simply unpack it and then point to the directory, or to the .shp file inside it — it doesn't matter, both work. Because reading from the zip is a bit complex in this case, I actually put that line in the notebook: you can either copy-paste that one, which will work with the zip file we provided, or — the easier thing in this case — unpack the archive and read it like this.

Just to quickly get an idea of what is inside this dataset, we do a plot, so you recognize it a little bit: each polygon is one national park. If you look at the protected areas table, you will see there are quite a few columns, but the last one is the geometry column of polygons, and somewhere in there is the name of the national park. Okay, we'll again give you a bit of time to get here, and you can continue with this part of the exercise; if you didn't succeed with reading the data, the working line is inside the solution.

Okay — you have quite a lot of material, so it's no problem if you're not ready. I'll go quickly through some of the solutions and then we will switch to doing some actual spatial operations. You might already have noticed something if you check the coordinate reference system of both — actually, I will first show this. If we now want to have the protected areas and the mining points on the same map, to see their relationship, how can we do that? We plot one dataset — this returns a matplotlib Axes — and then on that ax I also want to plot the protected areas, say with a green color because they're national parks. You will see that to do this we first have to convert our two datasets to a common reference system. Another reason we would like to do this is distance-based calculations. For example, here we import Point from shapely to create a single point for the city of Goma — but be careful: in shapely it's always x, y, so longitude, latitude; 1.66 degrees South is the north/south value, which is our y, so it goes last, if I'm not mistaken. Then the question was: calculate the distances of all mines to this point and find the five smallest — the five mines closest to Goma (Goma is near the border with Rwanda). How do you calculate this? We have our mining sites; we can use the distance method, and if we give it a single point it computes the distance from each of the points in the dataset to Goma. If you then want the five closest — the five smallest distances — we can, for example, sort the values ascending, and the first five values are here. But of course 0.38 degrees doesn't say that much, so if you want to use GeoPandas for distances we again need to convert to a coordinate reference system that has metres as its unit, not longitude/latitude. So both are reasons to convert the coordinate reference systems — let's convert them. The mining sites, as I already showed, were in longitude/latitude, and the protected areas, as downloaded, were apparently in a Mercator projection; we will convert both. Typically, if your data don't span the full world but a certain country or city, you can use the local UTM zone and its EPSG code, so that you won't have big distortions in the shapes or in the distances you calculate, as long as your data fit reasonably within that zone.
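A sketch of that distance calculation, reusing the cleaned-up data GeoDataFrame from the earlier sketch and anticipating the conversion shown next. The Goma coordinates are approximate, and EPSG:32735 (UTM zone 35S, which covers eastern DR Congo) is an assumed choice of metre-based CRS, since the exact code isn't spelled out in the recording:

```python
import geopandas
from shapely.geometry import Point

# Shapely is always (x, y) = (longitude, latitude); Goma is roughly 29.22 E, 1.66 S.
goma = Point(29.22, -1.66)

data.distance(goma).nsmallest(5)          # in degrees while the data is in EPSG:4326

# For distances in metres, reproject both the data and the point first.
data_utm = data.to_crs(epsg=32735)        # UTM 35S (assumed)
goma_utm = geopandas.GeoSeries([goma], crs="EPSG:4326").to_crs(epsg=32735).iloc[0]
data_utm.distance(goma_utm).nsmallest(5)  # now in metres
```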
Again, you can look at the link here, which will indicate for which area this UTM zone is appropriate. How do we convert our data? Remember, we can use the to_crs method; we want to use this EPSG code, so we provide it here, and that will convert it: our data was in longitude/latitude and now it no longer is. We can call this the UTM version, and I'll quickly do the other one as well — we do this for both the mining sites and the protected areas — and if we now plot them together, you will see they are in the same area. I'll give the parks this green color and make the points a bit smaller: we can use the typical keywords to style the plot, like markersize, and we can for example give the points some alpha to see where they overlay (there are too many points to really see the effect here, but you can use the typical matplotlib functionality). In this case we may also want to set the axis off — what that does is just remove the box around the plot, because those x and y labels are not that interesting here — and we can make the figure a little bit bigger, say 15 by 15. The next exercises quickly customize the plot a bit more; you can always go through those if you want, but for the tutorial we will now switch to some actual spatial operations. For that you can open the second notebook, on spatial relationships and operations.

Can people read this, at the very back? More or less — no binoculars required? Cool. All right, welcome everyone to the second part of the tutorial. I'm Dani Arribas-Bel from the University of Liverpool, and in this second session what we're going to do is look at what are called spatial relationships, or spatial predicates: how can we put in relation different observations that have a location, a pin on the map. Before anything else, let's import pandas and geopandas and read all of the data, to make our lives easier later. This is almost like going back to middle school, when you had different shapes and objects and you wanted to combine them in different ways. There's a bunch of different operations GeoPandas supports, and the ones we're going to look at today are of this kind: if we have two objects — and those could be a polygon, a point or a line, so all vector objects — we can check whether one is within the other; we can check whether the two touch each other, which for polygons builds up to the contiguity concepts that Levi will look at later, or whether a line touches a polygon; or whether two objects cross, which could happen between two lines (useful if you're looking at roads or streets) or between a polygon and a line, as we'll see in a little bit; or whether there's an overlap, and so on. These are just a few examples — there's a whole list of spatial relations, and like for everything that matters in life there's a Wikipedia page for it, so I won't bother you much more with that for now. The way we've designed this is that first we're going to look at these relationships at an individual, atomic level.
Say you have a polygon, or you have a point or a line, and you want to check whether that point is within the polygon, or whether the line crosses the polygon. That's useful, and conceptually — for me at least — it's the most helpful way to get my head around what GeoPandas is doing at the end of the day. But in most cases you won't have one polygon or one line: you'll have a whole street network made up of segments, and you want to check each of them, and you don't want to do that manually. GeoPandas makes it easy to broadcast those operations across a lot of observations.

So before anything else, let's get started with the atomic relationships. You can tell it wasn't me — it was some Belgian on the team who wrote these exercises — because we just pick a random example: the country of Belgium. Actually, let me first go over what this does. What are we doing? We're querying the countries data frame and keeping all of the observations where the name column equals Belgium — of which there's only one, last time I checked — and then for that row we keep only the geometry column. That still gives us a GeoPandas object, and because what we want is exactly the geometry, we use this helper called squeeze: if there's only one row it will squeeze the result down, and because there's only a single value it really squeezes it into the geographic object itself. If we do that, the notebook has a really nice rendering, so when you just run belgium you get the shape of Belgium. That's object number one. Let's do the second one: Paris — there's another member of the team who happens to work in Paris even though he's not French — so let's pick Paris and Brussels as two points. That gives us two point objects, and the approach is exactly the same: we query the data frame, keep the geometry only, and squeeze it into the geographic object.

Yes — that question is a very good point: I just typed belgium and all of a sudden it draws Belgium; I didn't have to say belgium.plot(). The reason is — actually, let me quickly show you the difference — if I do this, it's going to print the polygon and then plot the polygon. Because belgium is already the polygon object, there's a built-in rendering in the notebook that will draw it. This is only for a quick check — you can't really use it to combine different shapes or anything; it's just a quick lookup, so to speak. Good question. And it only works in Jupyter notebooks, or in a rich console display; it's equivalent to how — I don't know if you've noticed — pandas DataFrames print as a string in an interpreter but render as a nice HTML table in the notebook. This is the equivalent of that.

Then we do the same for the two points, Paris and Brussels, and that gives us a similar representation, but because it's a point it's not terribly useful unless you have multiple points or more complicated objects. Still, it's there, and I actually use it a lot when you don't know your dataset and want to make sure it is what you think it is: it'll print the point, and at least you know you're in business.
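The query-and-squeeze pattern just described, as a sketch; it assumes the tutorial's countries and cities GeoDataFrames use a 'name' column for the country and city names:

```python
# Filter to one row, keep the geometry column, and squeeze down to the shapely object.
belgium = countries.loc[countries['name'] == 'Belgium', 'geometry'].squeeze()
paris = cities.loc[cities['name'] == 'Paris', 'geometry'].squeeze()
brussels = cities.loc[cities['name'] == 'Brussels', 'geometry'].squeeze()

belgium          # in a notebook this renders a small drawing of the polygon
type(brussels)   # shapely.geometry.point.Point
```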
Then finally — we have two points, we have a polygon — let's create a line. For that we import the class LineString from shapely. This is another geometric object, and the API is really neatly designed: you can create it in different ways, but the simplest is to just pass a list of points, and it gives you a line that goes from point one to point two. In this case we're creating a line that goes from Paris to Brussels. Same as with the other ones, this line is not terribly complex so the rendering doesn't help us a lot, but if you had a more intricate trajectory it would still render it, and that can be useful to explore.

We can also combine them. These are objects, like everything else in Python and pandas, so we can combine them very easily: in GeoPandas we can create a GeoSeries that actually contains different geometries. Usually — in most of my use cases, at least — you wouldn't want different types of geometry in one dataset, so you wouldn't combine points with polygons with lines, but you can if you want, and it's really easy: you create a GeoSeries, as Joris presented before, from a list that includes all of the objects, and then we pipe it directly into the plot function, which gives us a plot. And — to go back to the earlier question — this is not the lightweight representation; this is a full-fledged matplotlib plot that combines all of the objects in the GeoSeries.

Okay, I'm going to switch gears a little bit. I'm going to give you two minutes for a super quick example — so all of you come back from sleep mode while you were pretending to listen to me — and ask you to create a figure similar to this one, but connecting France, Spain, Paris and Madrid; as a prize you can guess where I'm from, and it's not France. So, two minutes to recreate a similar plot with France, Spain, Paris and Madrid, and if you have any questions just raise your hand and we'll be moving around. I'll move on, just to keep it dynamic — I'll post the exercises to the GitHub repo later, but they're here.

You should have got something similar to this. How have I done it? First I pull out Spain the same way: query the data frame for the row of Spain, keep the geometry, and squeeze it into a geometric object. Same for Madrid on the cities data frame. France and Paris we already had. Then we throw everything into a list that contains Spain, France, Madrid and Paris, and on the fly we build a line that goes from Paris to Madrid; all of that gets squashed into a GeoSeries, which we pipe directly into the plot engine. Any questions? What — oh, that's because it's a complete map: I queried it as 'fr', and France includes its overseas territories — French Guiana. You're testing my geography; I don't want to go on the record, but it is French Guiana — and not that I work in a department of geography or anything. All right, so that's a quick way of pulling out geographies and then playing with them very easily.
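A sketch of the line and the mixed GeoSeries from that exercise, reusing the query-and-squeeze pattern; the 'name' column is again an assumption:

```python
from shapely.geometry import LineString
import geopandas

spain = countries.loc[countries['name'] == 'Spain', 'geometry'].squeeze()
france = countries.loc[countries['name'] == 'France', 'geometry'].squeeze()
madrid = cities.loc[cities['name'] == 'Madrid', 'geometry'].squeeze()

# A LineString is built from a sequence of points (or coordinate pairs).
paris_madrid = LineString([paris, madrid])

# A GeoSeries may mix geometry types, which is handy for quick plots like this one.
geopandas.GeoSeries([france, spain, paris_madrid, paris, madrid]).plot()
```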
But what I said we were going to do in this notebook is, now that we have a couple of points, a polygon and a line, learn how to put those in relationship: how can we check whether the point of Brussels is within Belgium, or whether the point of Paris is within Belgium or not, and so on. Of course, in this toy example it's all very obvious, but the logic is the same when you blow it up to a much larger dataset.

Let's start with the simple stuff. We can query whether Brussels is within Belgium — this almost reads like English — and we get back True. Under the hood it takes a point object and does what's called a point-in-polygon operation with the polygon we passed for Belgium. We can do the same for Paris: is it within Belgium? It is not. This uses the point as the reference and then calls the within operator. We can also flip it around: if I have a polygon, the polygon can contain a point or not. In this context, does Belgium contain Brussels? True. Does France contain Brussels? False. So contains and within are the two sides of the point-in-polygon coin. If you have a lot of polygons you might want to use contains, and if you have a lot of points you might want to use within — it depends on your ultimate goal which one is more useful.

That was points and polygons. Now, what if we have a line? Let's go back to the line that goes from Brussels to Paris — which is Joris's commute. Does Belgium contain it? No, because part of the line is within Belgium and the other part is within France. Again, what contains checks is whether the object on the left — in this case the polygon — fully encapsulates the object on the right. And you can use intersects to see whether there is any sort of touching between this polygon and this line, even if it's not complete: the line might not be fully within the polygon, but is part of it inside? The line does intersect Belgium. (Yes — your problem there is that your line is drawn like that.) Okay, any questions so far on these basic relationships?

All right, I'll move on. I said this is very useful conceptually, to understand under the hood how it works at an atomic level — a point and a polygon, a line and a polygon — but usually in the real world we don't have one line and two points: we have a lot of points and a lot of polygons, and we want to use these relationships comfortably with larger datasets. Happily for us, GeoPandas does that very easily: it broadcasts almost everything you can do at the atomic level, efficiently, at the GeoDataFrame level. Remember how belgium.contains(paris) was False? That was one polygon checking whether it contains a point, but we happen to have a lot of polygons in our countries dataset, so we can say directly countries.contains(paris), and what it returns is a column of booleans: for every row in the countries dataset — for every country — it checks whether that country contains the point of Paris, and returns the output.
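A sketch of those checks, first at the atomic level and then broadcast over the whole GeoDataFrame, reusing the objects pulled out above:

```python
from shapely.geometry import LineString

brussels.within(belgium)       # True
paris.within(belgium)          # False
belgium.contains(brussels)     # True
france.contains(brussels)      # False

commute = LineString([brussels, paris])
belgium.contains(commute)      # False: only part of the line is inside Belgium
belgium.intersects(commute)    # True: it does cross Belgium

countries.contains(paris)      # boolean Series, one value per country
```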
This will be True only once, right? Well — fair enough, yes, that's true; I don't think Paris, Texas is in this dataset. But the heavy lifting happens in that one line: all of the computation required for the check. Then, for a human like me who doesn't like to read False and True but actual names, you can combine this with the standard pandas logic for querying data frames, and subset the countries data frame to only those rows where "countries contains Paris" is True. We can just do this — and yes, maybe we should expand the dataset to have a Texas that has a Paris in it and makes it True twice — but again, there's nothing spatial happening in this line. This is also really useful to show that a lot of the pandas logic for manipulating tabular datasets applies entirely here; we're just extending it with geographic, or geospatial and geometric, capability. (Yes — is this a copy or a view? It's whatever pandas does in the standard case — it's a boolean slice, the same way of querying a data frame; I'm not sure exactly whether that is a copy or a view — I'd say a view — okay, Joris says it's a copy.)

All right, here's another quick example that works with any sort of geometry GeoPandas supports. We pull out the Amazon river in the same way: query for the Amazonas river in the rivers data frame, keep the geometry, squeeze it, and then we can look at which countries cross — or intersect — the Amazon, and it gives us Brazil, Colombia and Peru. Again, at the core it's a geo query — crosses — that returns a boolean Series, which we pipe into a simple filter on a pandas data frame, and that returns the countries we want.
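The filtering step and the Amazon example as a sketch; the river name column and the value 'Amazonas' are assumed to follow the Natural Earth rivers dataset used in the tutorial:

```python
countries[countries.contains(paris)]        # the row for France

amazon = rivers.loc[rivers['name'] == 'Amazonas', 'geometry'].squeeze()
countries[countries.crosses(amazon)]        # Brazil, Colombia, Peru
# countries[countries.intersects(amazon)] works similarly, with a looser predicate
```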
So far we've seen crosses, contains and intersects; there's a full reference in the notebook, and these are all built on the so-called DE-9IM model for predicates between geographic objects. There's a Wikipedia page for that if you want to read it, and if not, the Shapely manual on this is fairly comprehensive, so if you're really interested in the details please go there.

With that, another two-to-three-minute quick exercise for you, to shake off my boring voice: can you find all of the countries that are contiguous to Brazil in the dataset? Here's a hint: by contiguous I mean countries that touch Brazil. I'll give you a couple of minutes. ... Yes — it's French Guiana, right: French Guiana does touch Brazil, so France is a neighbour of Brazil. That's one lesson you can walk out of this workshop with that you probably didn't know. On the Python side there are a couple of things to learn from this: first we pick out Brazil the same way — query for the name Brazil and squeeze the geometry into the geometric object — then all of the heavy lifting happens in countries touches Brazil, and then we just filter the countries dataset to keep those rows. Any questions on this? No? Cool.

So then the final part of this notebook covers spatial operations. So far we've talked about spatial relations: how we can put two objects in relationship and check whether they meet the requirements of different types of relations — whether they touch, cross, intersect and so on. Another big set of GIS functionality — really geographic, geometric functionality — in GeoPandas is spatial operations: things you can create based on geographic objects. Not just checking existing objects, but creating new objects that follow some rule you're interested in. I'm not going to cover all of them — we could spend a whole day on Shapely alone — but if you learn the basic grammar, it all pretty much works the same and the patterns are the same: the kind of objects you get back will always be GeoSeries or GeoDataFrames, and it's really just a matter of what the spatial operation you're interested in does.

A very common operation is the buffer. If we have a point — or any object, but a point is the easiest one to understand — we can draw a buffer of a given radius, say X metres, around it, and that is really a polygon, a circular polygon. That's very useful to combine later on with all of the relation operations we've just seen: if you want to see whether any point is within 500 metres of another point, one way is checking distances; another way is to create a buffer around that point and then check for point-in-polygon with the other ones. The real power of all of these comes not at the atomic level but when you start combining them — and because at the end of the day everything happens in the same Python objects, shapely geometries stuffed into GeoDataFrames and GeoSeries, it's all very easy to combine them, chain one after the other, and build fairly complicated operations. In this case, what we have is a simple plot of the polygon of Belgium that we had before, the Brussels location, and around Brussels we've drawn a buffer of 0.5. Can anyone tell me what that 0.5 means? Because usually you want to say: I want 400 or 500 metres around my point of interest — so what is 0.5?
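A sketch of the contiguity exercise and of the buffer idea just introduced; as before, the 'name' column is assumed, and the buffer distance is in the units of the data's CRS:

```python
# Countries contiguous to Brazil: French Guiana makes France a neighbour of Brazil.
brazil = countries.loc[countries['name'] == 'Brazil', 'geometry'].squeeze()
countries[countries.touches(brazil)]

# A buffer turns a point into a circular polygon around it. With the data still in
# longitude/latitude, 0.5 is half a degree rather than a distance in metres.
brussels_buffer = brussels.buffer(0.5)
brussels_buffer.contains(brussels)    # True: buffers combine with the predicates above
```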
Does anybody remember? It's degrees, right, because our data right now are expressed in longitude and latitude. If you wanted real distances, as Joris said before, you need to convert the CRS to something expressed in metres (or miles), and then create the buffer; in that case — if you had converted to something expressed in metres — you could just pass 500 and that would be a 500-metre buffer. Sorry, say the question again — the question is: if you have a point, you create a buffer around it in lon/lat, can you then convert that buffer into a different CRS? The answer is yes, because at the end of the day the buffer is nothing but another polygon object, which you can reproject and manipulate in the same way you've manipulated any other object before.

So if we try that — and this is going to be live programming, so things can go wrong — what was the code? Does anybody remember the EPSG code for metres in Europe? Which one? Nope — anybody have anything else? Oh, sorry — actually, can anybody tell me what this error is? This is really good, I run into this problem all the time, so I'll highlight it here to hopefully save you some time. If you look at the error, it says "'Polygon' object has no attribute 'to_crs'". to_crs is a GeoPandas method, so it will only work on GeoSeries or GeoDataFrames. Shapely objects don't have a CRS attached to them: they have coordinates, but no CRS attribute. That attribute comes with the GeoPandas objects, whether a GeoSeries or a GeoDataFrame, and because of that the to_crs conversion and all of that happens within GeoPandas. I told you live programming is never a good idea — there you go — sorry — what was the Belgium one — there you go.

Okay, so what have I done here? This is the same buffer we had before, in degrees. We created the buffer in degrees, then we stuffed it into a GeoSeries and made sure we explicitly set the CRS for that GeoSeries — because it's unprojected, this is the same as the cities' CRS, so you could specify it that way. Once you have the GeoSeries, you can use to_crs to convert, and we reprojected into an EPSG that is expressed in metres. If we plot it, we see that the original buffer — which is a perfect circle when you create it — is not necessarily a perfect circle when you reproject it; it's more of an ellipse. Any questions? No? All right.

Once we have the buffer — as I said, these are all building blocks, like Lego: any single piece doesn't do much by itself; it's when you put it all together that it makes a boat or a car or whatever you're trying to build. The real power comes when you start combining, say, buffers with intersections, with unions, and so on.
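A sketch of the fix just described: wrap the shapely buffer in a GeoSeries, declare its CRS, then reproject. The metre-based target EPSG:3395 (World Mercator) is an assumption; any projected CRS in metres would do:

```python
import geopandas

# Shapely objects have no CRS, so to_crs() lives on the GeoPandas containers.
buffer_ll = geopandas.GeoSeries([brussels.buffer(0.5)], crs="EPSG:4326")
buffer_m = buffer_ll.to_crs(epsg=3395)

buffer_m.plot()   # a perfect circle in degrees becomes an ellipse after reprojection
```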
The intersection, just as a couple of examples, takes the geographic objects you pass and returns the part that is shared by all of them. The union merges all of the area you pass into one. The difference returns the area that is not shared — in this case between the buffer and the Belgian polygon. As I said, these are just a couple of examples; there are a lot of other operations. I'll just mention one final one that is particularly useful: in many cases what you want is to take a bunch of polygons and create one polygon out of them that contains all of them. There is a difference between union and unary_union — I can never remember it exactly — but you want to use unary_union; that's the lesson. So in this case, if you want a polygon of Africa: you have all of the African countries, and what you want at the end is a single polygon of Africa, not a composite of smaller polygons. You start with the African countries, with a simple query, and then you call unary_union, and that gives you a polygon combining all of the geometries you passed in. Any questions?

I have another quick example, but for the sake of time I'm going to skip it because it's too easy for you, and move over to the last bit I wanted to cover. Any questions before that? Yes — the question is: do the intersection, union, difference and so on live in Shapely or in GeoPandas? The answer is that the heavy lifting happens in Shapely — which is to say, in GEOS operations — but you can access it through GeoPandas. The same goes for what we did above with intersects, contains, crosses and so on: all of that is really Shapely operations; GeoPandas exposes them in an easy way to use with your GeoDataFrames and GeoSeries. More questions? No? All right.
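A sketch of those set-style operations at the atomic level, plus unary_union to merge many polygons into one, reusing objects from the earlier sketches:

```python
belgium.intersection(brussels_buffer)   # the part of the buffer inside Belgium
belgium.union(brussels_buffer)          # both merged into a single geometry
belgium.difference(brussels_buffer)     # Belgium with the buffer cut out

# Merge many polygons into one: all African countries -> a single Africa polygon.
africa_poly = countries[countries['continent'] == 'Africa'].geometry.unary_union
```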
So then let's move on to the third notebook, on spatial joins — and hopefully I can get through it in the twenty minutes we have. Before anything, let's import everything we'll need, data and libraries. A quick recap: we're going to look at spatial joins. Computationally they're very different, but conceptually they're very similar to non-spatial joins. If you're familiar with pandas and with joining and merging data frames based on key columns, it's conceptually the same idea: if you have two data frames, or two tables, that have some sort of geographic pin — something you can put on a map — we're going to join them based on the location of their observations rather than on a traditional key column.

Here's the non-spatial example first. Let's pick just a few cities — Bern in Switzerland, Brussels, London, Paris — and then manually, just for the sake of the example, add the countries they're in; this is just a subset of our cities data frame: there's the name, there's the geometry, and then there's a country. Let's also subset the countries, so we have two small, manageable data frames that at least I can directly relate to, rather than a few hundred observations that don't mean much to me. The idea of a non-spatial join is that you have a table with a column for country — in the cities table, the country each city is in — and another table with information on countries, which has a key that is the country code. If you have that common column in the two tables, you can join them together with the join or merge commands. There's nothing spatial there — I mean, you're doing something spatial in the sense that you're joining cities to countries, but computationally you're not joining anything spatially: you just say, I know this table has a column that is shared with this other table, and we're going to use that common key to join them. What you end up with is a merged table that has the data you had for the cities — the name, the geometry and the country code — and then, appended on the right, the information for the countries. Again, nothing implicitly spatial in this join.

Now, that is great when you do have the key column in both tables, but that's not always the case. In many cases — and to me this is one of the key powers of space and geography — you can use location as a kind of common index to combine data without having that extra column. How can we do that? GeoPandas has an sjoin, which builds on the join in pandas and does a spatial join. In this case, where we have cities and countries, it goes back to the previous notebook and point-in-polygon: is the point within the polygon? If so, then I append all of the information of the polygon to the point. Again, we want to do that join without having to say explicitly which country every city is in — we have that information encoded in their location. Let's go ahead and run it. The way you do it: sjoin tries to reuse as much of the join API in pandas as possible, so if you've played with that, a lot of it is similar, or tries to mimic it, translating it into geographic grammar, so to speak. We join cities, which is a bunch of points, with countries, which is a bunch of polygons; then there's an op argument for the operation, and here all of the spatial predicates we've looked at before — within, crosses, intersects, touches and so on — apply; and then how, which is the traditional database phrasing, and in this case we say left: the cities data frame is your reference table, and to that you append the information of the one on the right, the countries one. Does this make sense? Cool. And what do we end up with? A table that has all of the information that cities had and all of the information that countries has, joined by location: because we used within, every point gets the information of the polygon it falls within. Does that make sense? Any questions?
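A sketch contrasting the attribute join with the spatial join just described. The small subset tables and their key columns ('country', 'iso_a3') are placeholders, and note that the op= keyword from this 2018-era API is called predicate= in newer GeoPandas releases:

```python
import geopandas

# Attribute join: both tables share a key column (names are placeholders).
merged = cities_sub.merge(countries_sub, left_on='country', right_on='iso_a3')

# Spatial join: no shared key needed, the point-in-polygon test acts as the key.
joined = geopandas.sjoin(cities, countries, op='within', how='left')
joined.head()
```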
Yes — the question is: what if there's a point that is not within any polygon? Before I tell you a lie: that really depends on how you specify the how. If you say left, it keeps every observation in the left table, and if something in the left doesn't find a match in the right, it just puts missing values for those rows: you get a row with all of the columns, but the columns that come from the right are missing data. If you say inner — which is actually the default — it gives you just the subset where everything matches, so to speak. Does that make sense?

Then there was a question about the geometries: the joined output has only one geometry column, which in this case is the one for the cities — what happened to the one for the countries? It gets discarded, and that is because, by convention, a GeoDataFrame has only one geometry column. I don't know whether there's a hard technical restriction, but you shouldn't, in the sense that the idea of a GeoDataFrame is one geometric object per observation. If you wanted two — and Joris will probably answer this better — my view is: make it two rows, or two tables. Say you imagine that for countries you want to keep both the polygon and the point of the capital: instead of having one row with two geometries, you would either use two tables, as we've been doing here, or potentially two rows, one for the country with the polygon and one for the country with the point of the capital. By convention GeoPandas has one geometry per row, which is in agreement with how PostGIS works, and so on.

Yes — another twist of the previous question: you do have the how argument, so could you say how='outer'? Say there were some cities without a country and some countries without a city: it would still keep them all, just with missing values on the other side's columns — and it would only keep one geometry, I believe. Actually, sorry, I'm wrong and I just lied to you: there is no outer, for that very reason. You can do left, and it keeps the geometry on the left; you can do right, and it keeps the geometry on the right; or you can do inner, and it uses the intersection but retains only the left geometry. Does that make sense? There's no outer — sorry, I should have said. Any more questions?

Okay, then let's move on to the final part — there are a couple of exercises I'm going to skip for time. Spatial joins are really about combining data, but you don't modify geometries: with a spatial join you have, say, points and polygons, you put them in relationship, but the output of the join still has the original geometries. Sometimes — in a lot of cases — that's what you want; in other cases what you actually want is to generate a new set of geometries through that operation. The spatial overlay relates to the spatial operations the way the spatial join relates to the spatial relationships: if you want a combination that results in new geometries, you use an overlay, and you specify the spatial operation you want, again with the how argument. Let's say, for instance, that we have the African countries and we draw a buffer around the cities — say we're delimiting urban areas in Africa — and what we really want is a map of the rural areas, all of the parts of Africa that are not within an urban area. Then we call the overlay: we pass Africa as the polygons and the buffered cities as the second layer,
and what do we want as the output? We want the difference of those two — and that's what the overlay returns. Just to reiterate: the spatial join combines information, using location as the join column, but it does not modify the geometries; the spatial overlay does modify the geometries, and you can use all of the spatial operations to modify them and create combinations. Any questions?

So why was the buffer required here? It was just the example I had: let's keep Africa — that's our polygons — and then we have our cities, which are points, and by definition points have zero area; they have no dimension, it's a single location. But in the example I gave, we want to keep the non-urban areas — let's call them the rural areas, rural Africa. How can we define that? We take every city and define the urban area as a buffer around that point — in this case two degrees around the point — so we're turning the cities from points into polygons (circles, really), and that's what we're calling, in this simple toy example, urban areas. Then how do we get the rural areas? By taking all of the country minus the part of the country that is urban, which is this. It's two here because the cities are expressed in longitude and latitude, so it's degrees; if you projected to something in metres, that number would be metres, and you could say a buffer of 500 metres, or 1 kilometre — if you did 1 kilometre you would have to enter a thousand — because it's always in the unit the CRS is expressed in.

What do you mean — so the question is: how can you create tailored buffers for every city? The way I would do it — and this is again a good example of combining different operations — is with an apply function, applied row by row, and as part of the function you pass to apply you define the value you want. You could, for instance, have another column in every row — imagine we had population information for the cities — and then, as part of that function, say: take the population column, transform it in this way to come up with a radius, and that radius goes into the buffer call and generates the buffer. At that point it's just more general Python programming rather than specific GeoPandas functionality — and I think that's a feature, not a bug, because what GeoPandas is doing is making all of the GIS operations pythonic, in a way that you can plug them into the rest of your workflow. Does that make sense? Maybe on the break I'll try to live-code that — I'm not going to finish off with an embarrassment. Any more questions on this?

Yes — does GeoPandas read geodatabases? There is a driver, yes — so basically it can accept KML and similar formats as well.
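A sketch of the rural-Africa example described above, assuming the countries and cities GeoDataFrames are still in longitude/latitude (hence the two-degree buffer):

```python
import geopandas

africa = countries[countries['continent'] == 'Africa']

# Turn the city points into toy "urban areas" by buffering them (2 degrees here).
urban = cities.copy()
urban['geometry'] = cities.buffer(2)

# Cut the urban areas out of the African countries: the overlay returns new geometries.
rural_africa = geopandas.overlay(africa, urban, how='difference')
rural_africa.plot()
```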
Any more questions on this? (Question: does it read geodatabases?) There is a driver for that, yes, and KML as well. I never try to comment on the drivers because Joris always knows more, but basically, it doesn't interoperate in any way with ArcPy; as far as I can tell it does pretty much the same things, or it's comparable, except that this is built on a fully open-source stack, while they have their own custom engine. And, I know it's sacrilege in this room, but if anyone uses R, the sf package is pretty much the same: it relies on GEOS for all of the geospatial operations. So at the end of the day, when you draw a buffer, whether you do it in R, in QGIS, or in GeoPandas, the heavy lifting is being done by the same code, which is GEOS.

All right, shall we have a break, or do we switch? One more thing before that: there is a notebook with a set of exercises for these two notebooks, because we had a bit too much material for the two hours, so you can always continue there. The exercises apply everything Danny showed, everything from the spatial operations on, and, same as the previous session, the solutions are included if you get stuck. So this concludes the first part, which was really focused on getting to know GeoPandas; after the break we will still use GeoPandas, but combined with many of the other spatial data science tools. Let's say ten, fifteen minutes of break, which would be twelve o'clock, so we should start then.

All right, welcome back. In the second half (I'm Serge, I think I introduced myself before) we're going to shift gears a little bit. The first part was about geoprocessing: dealing with the rich variety of spatial data formats and coordinate systems, and the spatial operations. We could call that deterministic spatial analysis; you're interested in the spatial relationships between your observations. This afternoon we're going to switch to methods of visualizing spatial data (there were some notebooks in the first part we didn't get to), with an eye towards exploring spatial data to identify interesting patterns, and then ultimately to modeling spatial data; Levi will be covering some of the modeling capabilities. We're primarily going to focus on the library PySAL, the Python Spatial Analysis Library, which started a long time ago, before we had things like Shapely, before we had GeoPandas, before we had cartopy, and we had to write those pieces ourselves. The people working on PySAL come from spatial econometrics, spatial statistics, and geocomputation, and we wanted to develop analytics that would sit on top of elegant, strong data-processing and geoprocessing tools, which we didn't have back in the day. So now we're very excited, because we can focus on the geo-analytics and partner with GeoPandas and related libraries. That's the part of the workflow we're at, if you want to think of it in terms of a scientific workflow. I'm going to focus on notebooks five and six and then turn it over to Levi.

Visualization is a massive area in scientific computing, and in geography in particular. I am not a cartographer; I'm more of a spatial statistician, so we're only going to dip into a little bit of cartographic theory here, around choropleth maps. "Choro" means region: we have areal-unit data, polygons, and we want to understand how some attribute, some substantive variable you're interested in studying, is distributed. Most of our work is in the social sciences, so we could be looking at income levels, housing prices, or market segmentation.
The observations could be polygons or points, and we have an attribute measured at each location; we want to understand the spatial distribution of those attribute values. We're going to look at two coupled sets of methods: visualization methods, because mapping is very powerful, and quantitative statistical measures that get at the same question, to understand the spatial distribution of these patterns.

So, choropleth mapping. The effectiveness of a choropleth map is largely a function of two things. The first is: given the statistical distribution of the attribute you're interested in, what is the appropriate classification scheme? This is essentially a one-dimensional clustering problem: you group observations into mutually exclusive classes based on value similarity and maximize the differences between the classes, so you're summarizing the statistical distribution of the data, and there is an embarrassment of riches of classification methods, some of which we'll look at shortly. That's one blade of the scissors; the other blade is the symbolization you use once you've selected a classification scheme. Those two together drive how effective your visualization is, and "effective" depends on the context: you could be a scientist exploring your data to understand the spatial distribution, or you could be trying to identify outliers, from a spatial perspective, a value perspective, or some combination.

So in this notebook we look at how to carry out map classification, ultimately to build a choropleth map. Classification first. We work with a data set from Berlin: we use GeoPandas to read in the data, which comes from a workshop we gave in Basel a couple of months ago where we showed how to download the data and do the aggregation from points to polygons. These are Airbnb listings aggregated to polygons, i.e. neighborhoods or districts in Berlin. That's the data set we're working with; we can point you to the older notebooks that build it, which use some of the functionality Danny and Joris covered this morning.

The variable we're interested in is the median price. Is the font big enough for people in the back? I'll make it bigger. This is the median price of all the listings in a given district, and we're looking at the first five districts in the data frame here. So that's the variable of interest: I could be an economist interested in how the housing market functions in Berlin. We first look at the statistical distribution of those values, abstracted from geographical space. This is the median price, I think in euros per night, for each district, and here we summarize the value distribution. There's no geography here; the "space", if you will, is the value of the attribute, the median price. It's a somewhat skewed distribution, not atypical for income or dollar-denominated data: there are a couple of very high-priced listings, and the center of the distribution sits off to the left. We're using seaborn here to visualize the distribution.
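A minimal sketch of that summary, assuming the districts GeoDataFrame is called `df` and the column is `median_price` (both names are assumptions):

```python
import seaborn as sns
import matplotlib.pyplot as plt

# histogram plus kernel density of the attribute, with a rug of the individual values
sns.distplot(df['median_price'], rug=True)
plt.show()
```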
We also throw in a rug plot, which shows the locations of the individual values along the price scale at the bottom. That gives you a sense of the statistical distribution; ultimately we're interested in how those values are distributed in geographical space, across the city of Berlin, across these polygons, the districts or neighborhoods. To get at that we pick back up with the plot method on the GeoDataFrame, using the defaults. To do a choropleth map in GeoPandas, we take the data frame we've created (we know how to do that from this morning), pass a column argument for the attribute we want to plot, the median price, and we get a default color scheme and a default classification, which we don't know yet; we'll get into that. So that's the view of the spatial distribution of the median prices. Anyone from Berlin? Okay, so I won't get called out for not knowing anything about Berlin.

Maps are very powerful because they tap into the way our brains work: we are pattern-recognizing animals, and we wouldn't be here if our ancestors weren't good at recognizing patterns, so effective map design is about tapping into that. But you can argue we're perhaps too good at pattern recognition, in that we pick up false positives, things we think are there but aren't, because of how the visualization was implemented. A couple of dimensions complicate things for real-world maps: these polygons are not all the same size, nor the same shape, and larger shapes tend to dominate our attention; we're drawn to them. That makes things difficult, because statistically each observation should carry equal weight, but once we put the data on a map things get much more complicated.

So let's look under the hood a little. First we make the map a bit bigger. The default for that first map, I believe, was quintiles for the classification: we take the housing values, sort them from lowest to highest, break them into five equal-sized groups, and assign a color hue to each group from a sequential palette, so we get an impression of where the high and low values are. We do that again here, being explicit: the scheme argument is the classification scheme, and that comes from PySAL. GeoPandas has a wrapper around several of the classification schemes, not all of them, because some have more complex function signatures, but this is the default we saw earlier, just bigger. So we've bumped up the size of the map, made the classification scheme explicit, and we can add more detail: now we add a legend so we see the breaks for the different bins under this quantization of the attribute. The darker colors are the lower rents, the lighter hues are the higher income, sorry, higher-valued rents. Questions so far? All right: the default is quintiles, but we could do quartiles by just changing the argument k to 4, so that instead of splitting into fifths we split the value distribution into quarters.
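Written out explicitly, the choropleth call looks roughly like this (same assumed names; `scheme=` hands the classification off to PySAL/mapclassify):

```python
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(12, 8))
df.plot(column='median_price', scheme='quantiles', k=5, legend=True, ax=ax)
ax.set_axis_off()
plt.show()
```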
You can change that to whatever you want, octiles, deciles, just by manipulating k. The classification scheme is important, but so is the color ramp, and there's a lot of interest in color ramps in visualization. ColorBrewer, which comes from Cindy Brewer (she was a colleague of mine a long time ago at San Diego State), is a very effective website for helping you decide which color scheme to use for a choropleth map, depending on whether your attribute is quantitative or categorical, whether the distribution is diverging, and so on, and on your intended medium: are you printing a hard-copy map or putting it up on a projector? It's an excellent tool, and many libraries in the Python world have already pulled ColorBrewer in and extended it with things like palettable. Here we just change the color scheme and keep the number of classes constant, so we can first look at the impact of the choice of color ramp: the default versus orange-to-red versus green-to-blue, the same attribute with just that one dimension of the symbolization changing. Alternatively we can change the classification scheme, and we'll get into these in a second: quantiles, or equal interval, or an optimization approach, Fisher-Jenks, again the same collection of values for the neighborhoods, but a different classification scheme. And obviously you can change both the color scheme and the classification scheme.

GeoPandas supports the PySAL classifiers that have a common signature, the ones that typically take k, the number of classes you want, but there are other classifiers in PySAL. There's a package called mapclassify that just does the classification; it doesn't do the visualization, that's not the goal. The goal is to make it quick and easy to get different classifiers, which you can then couple with whatever you want: cartopy, GeoPandas, folium, and so on. The classifiers in mapclassify include box plot, equal interval, Fisher-Jenks, Fisher-Jenks sampled (which is for large problems), head/tail breaks, and more, and we're going to look at a couple of them. Have people encountered map classifiers before, or is this new to most of you? Okay, so we'll take a slightly slower pace than I was planning.

Equal intervals, the first one we're looking at, takes the minimum value in the data set and the maximum value, asks the user how many classes you want (in this case I set five), and then subdivides that range into five equal-width intervals. We call the equal-interval class, pass in the attribute y, which is our housing values, and k, and we get back an instance of a classifier object whose repr is just the classification: you see we have five classes, the upper and lower bound of each class, and the count of observations falling in each class. This is purely based on the value distribution; there's no geography here. The classifier object has other attributes too: yb is the bin for each observation, similar to the labels in scikit-learn clustering, telling you which bin each observation falls into. The second classifier, which was the default in the GeoPandas choropleth, is quantiles: it has the same kind of signature, you pass in the attribute y and the number of classes.
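A sketch of the classifiers on their own, using the standalone mapclassify package (recent releases spell the classes EqualInterval, Quantiles, and so on; older PySAL versions use Equal_Interval and similar). `y` is the array of median prices, an assumed name:

```python
import mapclassify

ei = mapclassify.EqualInterval(y, k=5)   # five equal-width bins between min and max
q5 = mapclassify.Quantiles(y, k=5)       # five bins aiming for equal counts

print(ei)          # the repr shows the bin edges and the count in each class
print(ei.yb[:10])  # bin membership per observation, like labels_ in scikit-learn
```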
The repr is the same kind of summary. These are quantiles, quintiles here if you squint at the screen, and the idea, unlike equal intervals where the widths are the same across classes, is that the number of observations in each class should be the same: equal counts. But what do we actually see? Do we have equal counts? Which class has the most observations? The first one has 39, which is more than one-fifth of the observations. This is a common problem when you're doing choropleth mapping with values that have a lot of duplicates, a lot of ties. For quintiles you sort the data and find the 20th percentile; if you have a lot of duplicated low values, that percentile could be, say, 32 euros, and if a hundred neighborhoods share the value 32 you have a problem, because when you rank the data you get ties. How mapping packages handle ties is not consistent; what we do is bump the break up past the ties, so if the tie was at 32 and there were 132 of them, we move the top of the first quintile to the hundredth-ranked observation. At least we tell you what we do; other packages make you guess. Looking at the bins, you can see there are only 57 unique values in this attribute, and that causes problems for quantile classifiers.

So how do you decide which classifier to use? It depends on your objective, and there are different objective functions. One that's built into the classifiers is a measure of fit, much like judging a clustering solution: the absolute deviation around the class medians. We compare the fit of several classifiers, all with five classes. I want to maximize the internal homogeneity of the classes and the external heterogeneity: differences between observations in different classes should be as large as possible, and differences between observations in the same class as small as possible. Lower values are better for this measure, and Fisher-Jenks has the lowest fit; it's guaranteed to.

But fit isn't always the ultimate objective. Why not? Some classifiers are designed to detect outliers, extreme values that could be errant observations, data errors, or genuinely interesting values you want to flag. There are two classifiers you could use for this. One is the standard-deviation/mean classifier, which defines classes by distance from the mean in multiples of the standard deviation, so you're explicitly getting out into the tails. The other is the box plot; we'll see images of box plots later, but the idea is that you have the median, the 25th to 75th percentiles as the core of the data, and then the whiskers and fences that help bound the outliers. It will be clearer when we get back to the maps.
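Going back to the fit comparison, here is a sketch of comparing schemes by the absolute deviation around class medians, assuming the `.adcm` attribute exposed by recent mapclassify releases; lower is better, and Fisher-Jenks should come out lowest:

```python
import mapclassify

schemes = {
    'EqualInterval': mapclassify.EqualInterval(y, k=5),
    'Quantiles':     mapclassify.Quantiles(y, k=5),
    'FisherJenks':   mapclassify.FisherJenks(y, k=5),
}
for name, classifier in schemes.items():
    print(name, classifier.adcm)   # absolute deviation around class medians
```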
To use these classifiers in GeoPandas when they're not built in, say the box-plot classifier to identify outliers, you have to wire it up yourself. First we make a simple list of legend labels: 0 means low outlier, an extreme low value; 1 means low whisker, you're on the whisker of the box plot; Q2 and Q3 are the 25th-to-50th and 50th-to-75th percentiles, the central observations that aren't extreme; then the high whisker; then the high outliers. Those are our legend labels. Then we loop over the box-plot classifier, extract the bin label for each observation, attach that list to the data frame as a new column, and plot it. GeoPandas supports that: we just treat it as a categorical column and exploit the plotting function.

In this case there are no low whiskers, so no low extreme values, but we do have a couple of high outliers: observations more than 1.5 times the interquartile range above the 75th percentile, so the really extreme high-rent neighborhoods. Now we're pairing the statistical distribution with its articulation in geographical space. Questions on the classifiers? It's fairly straightforward: they give us different ways to group the data, and we pair that with the plotting functions in GeoPandas to do a choropleth map. We do not have a low outlier in this case. (Question about eyeballing the classes.) Right, I teach a fifteen-week course on geovisualization, so we spend a lot of time on classifiers and we're only getting a thumbnail sketch here, but yes, the human eyeball might say these all look similar; we also did Fisher-Jenks, which is an optimal classifier that will do better than we can do by eye. We also support user-defined bins: not a classifier as such, but if you want to pre-specify where the breaks should be, you pass them in and get back an object with all the same functionality. Other questions? Okay.
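A sketch of that wiring, assuming the same `df`/`median_price` names and that the box-plot classifier returns bin indices 0 through 5 in the order of the labels below:

```python
import mapclassify

labels = ['low outlier', 'low whisker', 'Q2', 'Q3', 'high whisker', 'high outlier']
bp = mapclassify.BoxPlot(df['median_price'])

# translate each observation's bin index into a readable label and attach it
df['price_class'] = [labels[b] for b in bp.yb]

df.plot(column='price_class', categorical=True, legend=True, figsize=(12, 8))
```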
So now we're visualizing the data, and as I said, our brains are really powerful pattern detectors, but sometimes we fool ourselves and find apparitions, people seeing Jesus in the toast, that kind of thing. It's a well-documented phenomenon; there's a lot of literature in psychology on our pattern recognition, because we were trained by survival to spot things that might eat us. So now we're going to couple the visualizations with statistical tests, so we don't fool ourselves. This is the notion of spatial autocorrelation. We know what correlation is: you have two variables and you ask whether they're positively correlated (when one is high, is the other high; when one is low, is the other low) or negatively correlated, an inverse relationship. "Auto" here means we're looking at a single variable, in this case the housing prices, and we're interested in whether the value in one neighborhood is related to the values in the nearby neighborhoods. It's autocorrelation because it's one variable, but it's still a form of correlation.

We use the same data and the same plot as before, and we start with what's known as global autocorrelation, where we pair two types of similarity. The first is value similarity: how far apart observations are in attribute space, where high rents are far from low rents and similar rents are close. We couple that with a notion of geographical similarity, and how you express that opens up a large literature. We're going to use what are known as spatial weights, which define neighbors; they're similar to an adjacency matrix in network analysis, for those of you who took the network tutorials, and have a similar mathematical representation. You can define neighbors in many different ways; we'll use simple contiguity, which leverages some of the geoprocessing we saw this morning: neighboring districts are those that share a vertex or an edge on their polygon borders, and from that we construct, with PySAL, a queen contiguity structure, meaning that if any two polygons share at least one vertex, they are neighbors. We're reading the data frame here, peeking into that geometry column, and running an algorithm that detects the topology.

Our queen W tells us who is a neighbor of whom; that's our geographical similarity. You can think of it as a matrix, although it isn't stored as one: each row tells us, for polygon i, who its neighbors are, with a one in column j of row i if i and j are neighbors and a zero otherwise. We then transform those ones to be row-standardized, so that summing across a row gives 1; the weights get row-normalized. We use the weights, the w_ij's, to compute a weighted average of the attribute values in the local context: for a given district we know its neighbors, so we take the average of the housing values across that neighborhood. That's where the spatial weights come in; it's simply a weighted average of the neighbors' values, and this new variable is called the spatial lag, which we then relate to the original housing price. So it is in fact a correlation; we've just constructed a spatially explicit variable that captures local context.

We can plot the spatial lag. It's like a smoother, for those of you who do remote sensing, except this isn't a pixel image; these are irregular polygons. It tells us what's going on around each focal polygon: it takes the information in my neighborhood and brings it back to my location. What we're interested in now is these two maps: the original price on the left, the quintile map, and the spatial lag on the right, and we want to explore the correlation between them, a bivariate correlation in a sense. This is another fifteen-week course, spatial statistics, so we're doing a quick dive. Are these two maps correlated or not? This is not a scatter plot; it's visually very complex to judge, so we're going to use statistics, and to develop the intuition for why those statistics work, we'll back up and make it simpler.
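A sketch of the weights-and-lag construction just described, using the split-out libpysal package (under the 2018 metapackage the same calls live in the flat `pysal` namespace); names continue from the earlier sketches:

```python
from libpysal.weights import Queen, lag_spatial

w = Queen.from_dataframe(df)   # neighbours = districts sharing at least one vertex
w.transform = 'R'              # row-standardise: each row of weights sums to one

y = df['median_price'].values
df['price_lag'] = lag_spatial(w, y)   # weighted average of each district's neighbours
```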
Instead of five classes we'll use a black-and-white map, high values and low values: we convert the housing values to a binary high/low variable and plot that map to reduce the cognitive load. The dark polygons, black, are above the mean housing price or rent; the white ones are below. Now we're interested in whether those black polygons are randomly distributed in space, and conversely the white polygons, or whether there is some structure there, and we want statistical tests to answer that. The oldest tests are called join counts. Is that map random or not? Our null hypothesis is: if we have, say, 68 black polygons and we assign them to locations on the map at random, what would the map look like if there is no structure in the process generating the data? We need a test against that hypothesis.

Join counts work with joins, and a join is simply a pair of neighboring polygons; we know how many neighboring pairs there are, because that comes from our W. For each join we ask what type it is: a poor neighborhood next to a poor neighborhood, low-low, white-white; a rich neighborhood next to a rich neighborhood, black-black; or heterogeneous, black next to white. Those are the only three types of join possible, black-black, white-white, or black-white, and with join counts we count how many of each type we got in the actual data versus a reference distribution under the null, the null being that whatever governs housing markets in Berlin is spatially random. Questions on those concepts?

So we need to count the joins, and in PySAL there's the esda module, for exploratory spatial data analysis. It uses our W object together with the binary attribute: for join counts you pass in yb, the binary-valued attribute, and the spatial weights, and it creates an instance of the join-counts object. It found 121 black-black joins on the actual map, 114 white-white, and 150 black-white, and that exhausts the joins: the number of joins is an attribute of the spatial weights (because it's a binary weights matrix it tells us how many neighboring pairs there are), and there are 385.

So now we're back to the question: given 68 black polygons and that spatial structure for the joins, how many black-black joins would we expect if the black polygons were placed at random? We actually got 121, but we need a reference distribution to decide whether that count is higher or lower than expected under the null. In PySAL there are two ways to get one. Analytically: statisticians have derived what the distribution should be, so it's like looking up a table in the back of a stats book; but those results don't work so well, because every map is different, these aren't regular lattices, we have different geometries. So instead we rely on computationally based inference: we play creator of the universe and simulate the null hypothesis by taking the actual housing values and shuffling them around the map at random, because the null is that there is no spatial structure. We create one synthetic map and recalculate our statistic, so now we have two values, the observed one and the one from the synthetic map; this is called a permutation of the values. We repeat that, say, a thousand or ten thousand times, depending on how much you want to do, and each repetition gives us a draw from the null distribution.
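A sketch of the join-count step with esda (pysal.esda in the 2018 metapackage); join counts want binary weights, so we build a fresh queen W rather than reusing the row-standardised one:

```python
from esda import Join_Counts
from libpysal.weights import Queen

yb = (y > y.mean()).astype(int)   # 1 = "black" (above-average price), 0 = "white"
wb = Queen.from_dataframe(df)     # keep the default binary transform

jc = Join_Counts(yb, wb)
print(jc.bb, jc.ww, jc.bw)        # observed black-black, white-white, black-white joins
print(jc.mean_bb)                 # average black-black joins across the permuted maps
print(jc.p_sim_bb)                # pseudo p-value for the black-black count
```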
From that we collect all these draws and compare our observed value against the null to decide whether it's significant; it's a pseudo-significance value, and it gets computed when we create the instance of the join counts. On average, across the fake maps, you would have gotten about 99 black-black joins; but we have all the synthetic values, so if we do a kernel density of the null values, that's the blue curve, what to expect in a universe where the housing market is spatially random. There are no spatially random housing markets in the world that we know of, so this is no surprise, but how extreme is our observed value? That's the red line, way out in the tail; in fact none of the values from the fake maps was as extreme as what we observed. So this is highly autocorrelated data; there is strong spatial structure, and hopefully you get a sense that this is a very nice complement to the visualization, because map patterns are complex to process. That's our pseudo p-value, one of the computed attributes: it's simply the number of synthetic values as extreme as or more extreme than the observed one, plus one (because we pretend ours came from the null), divided by the number of permutations plus one; I think it was 999 plus the one we had. That's all that value is.

Now, we threw away a lot of information with the binary conversion, but that was just to develop intuition for the logic of autocorrelation tests. Did that sink in? I know it's compressed. There's an industry of spatial autocorrelation tests; new ones come out all the time. Next we look at keeping the variable continuous: a test for the actual values, not the binary ones, called Moran's I, named for Moran, the statistician who invented it. Same idea: you pass in the value array y and your spatial weights, and you get back an instance of Moran's I. It has an observed value of 0.09. Is that large or not? We don't know; we need a reference distribution, and again there are two routes, analytical and computational. We show the computational one here, and for the continuous variable it's slightly different, but the pseudo p-value is still below 5%, the conventional cutoff for significance; it's almost 2%. That's not as extreme as the binary case, but remember that in the binary case we threw away a lot of the variation: all the high values were treated equally even though some are very extreme, whereas here we retain the full range of values. So it's still significant.

Lastly, to finish this section on autocorrelation tests: those two tests were global. They answer a question about the map as a whole, is it random or not, yes or no, with a quantification. There's a newer literature on what are called local spatial statistics, designed to identify the locations that are driving the overall pattern, or that deviate from it, the spatial outliers; this is where notions of hot spots and cold spots come in. We do the same kind of thing here, starting from what's known as the Moran scatter plot.
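A sketch of the global Moran's I on the continuous prices, reusing the row-standardised `w` and the `y` array from the sketches above:

```python
from esda import Moran

mi = Moran(y, w, permutations=999)
print(mi.I)      # observed statistic, about 0.09 in the session
print(mi.p_sim)  # pseudo p-value from the permutations
```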
In the Moran scatter plot, the x axis is the housing price for a given neighborhood, and the y axis is the spatial lag, the weighted average of prices in the neighborhood surrounding that district. The scatter plot lets us think about different types of local spatial association. Think of a vertical line at the mean: to the right are the above-average-rent districts, to the left the low-rent districts. Now the horizontal line: above it are districts whose surroundings have high values, below it districts surrounded by lower values. Put the two together and you get four quadrants. If we're in the upper right, what can we say? The neighborhood is high, and the market surrounding it is also high: a high-high location, quadrant one. Quadrant two (mathematicians go counterclockwise, don't ask me why) is a neighborhood below the average, to the left, whose neighbors are high: low surrounded by high, the donut hole in the middle with the good stuff around the outside. Down in quadrant three we have low-low: low housing values surrounded by low housing values. And finally quadrant four: high values surrounded by low values, the diamonds in the rough. Quadrants one and three are positive spatial association, value similarity in space, high next to high or low next to low; quadrants two and four are negative spatial association, low surrounded by high or high surrounded by low. If the map were random, these points would be spread evenly across the four quadrants, but they're not: they're concentrated in one and three, and the slope of that line is Moran's I, the global statistic.

Now the local statistics. Instead of one statistic for the whole map, we get a statistic for each location, and it tells us multiple things. The q attribute tells us which quadrant of the scatter plot each observation falls in, one, two, three, or four, and then we test whether each one is significant using permutations. It's more complex now, for technical reasons: we have to hold the focal location fixed and permute the complement set, because you can't be your own neighbor (otherwise you would inflate the spatial dependence), which means that if we do ten thousand permutations, we need a separate set of ten thousand permutations for each location. Then we can map the results: here we map the locations that were in quadrant one of the scatter plot and statistically significant, meaning something is going on with that location and its neighbors; they're coincident in space, both high. And where are they? In the center of the city, in this case. Those are our hot spots; the other locations are not significant, so we treat them as "don't care", since we just want to find the hot spots in this market.
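A sketch of the local Moran statistics and a hot-spot map (same assumed names; quadrant 1 with a pseudo p-value below 0.05 is treated as a hot spot):

```python
from esda import Moran_Local

lisa = Moran_Local(y, w, permutations=999)

hotspots = (lisa.q == 1) & (lisa.p_sim < 0.05)   # high surrounded by high, and significant
df.assign(hotspot=hotspots).plot(column='hotspot', categorical=True, legend=True)
```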
The other types of spatial association: the cold spots are the locations that were in quadrant three and are statistically significant, and we plot those. Finally the outliers: the locations in quadrant two, the donuts, and quadrant four, the diamonds in the rough. I'll jump to the final map. This gives you a view of the spatial distribution of the housing values from a spatially explicit perspective, because now we've coupled the quantitative summaries of local spatial association with the actual geography of Berlin. This is an exploratory set of methods: it leads to questions about why the map looks the way it does. The job of these methods is to raise those questions, not to answer them; they're exploratory, and they fit into that part of the spatial analytical workflow. Questions? (Question: so testing the significance of local hot spots is basically permutation?) Yes: for this polygon right here, we know how many neighbors it has, say six, so we take the set of all the other values, excluding this one, randomly select six, and recalculate its local statistic; then you count how many of the synthetic statistics are more extreme than what we observed.

All right, Levi is going to take over; I'll let him say what he's going to do. So, I figure we've got about an hour left. By the way, my name is Levi Wolf; I'm an assistant professor at the University of Bristol. We have essentially three notebooks left. One of them is a problem set based on the local autocorrelation stuff we just did; it would give you a more hands-on sense of how these things work, what it means when you detect a cluster, and how to connect that, subjectively, to your own maps. Would you like to do that, or move on to regression and clustering? Problem set, raise your hand: only a couple. No problem set: okay, quite a few, so we'll move past it. If you want to check it out later, it's the "Gini in a bottle" case study on the Trump vote: the relationship between economic inequality and the change in vote from Obama to Trump, and there's some interesting stuff you can find in the local structure of the vote by county.

So the first thing we'll do is move on to the regression notebook, notebook number seven. When we talk about using geography in data science applications, in most useful applications you can think of it as a data-borrowing technique. When Serge was talking about the spatial lag, what we were doing was constructing a new feature that gives us the average of a particular feature in the area surrounding every single observation. So if I have spatial data and I come up with some way to build a weights matrix (a weights matrix, as we said before, is like an adjacency matrix for the graph of connectivity between the polygons, and you can build one for point data too, which we'll do in a second), then once you can express all of these spatial relationships at once and you have a way to compute these neighborhood averages,
you can quite often use them in data science as synthetic data. Really, we're using geography as an instrument to give us additional features; in other contexts we would call this feature engineering. Here I import a couple of packages we'll be using and read in the same Airbnb data we saw in the other notebook, but this time we also get the Airbnb prices at the point level: a collection of roughly 14,000 nightly listing prices for Berlin. Some listings have different minimum stays, so implicitly they're priced slightly differently; if this were a full analysis you'd have to account for a few things like that, but in general this is the spatial distribution of our prices across these residential districts.

The first thing we'll do is a kernel regression. Some of you may not be aware of what a kernel regression does. Do you have a question? No, okay. Classically, kernel regression in feature space is most often used for classification; you may have heard of voting classifiers. Given a new observation, you figure out which class it belongs to by looking at a fixed number of neighbors, or a distance-decayed radius of nearby observations, in feature space, and taking a vote among them: if I'm a new observation and most of my neighbors are in class two, you predict that I'm in class two as well. In a prediction context you take the neighborhood average instead: for a new Airbnb listing, I might predict its price as the average of its ten nearest listings. scikit-learn does this kind of thing inside its nearest-neighbor regression framework, and we'll get to that in a second.

Here I pull out some features we might think relate to the price of an Airbnb: the number of people the listing says it accommodates ("sleeps six across three beds", or whatever), the overall rating, the review scores rating, the number of bedrooms, bathrooms, and beds, the actual price, and the geometry. I'm just extracting this from the original listings data frame, which has far more information than we need. Then I structure it like a regression problem, with some predictor variables and a target, the listing price, again extracted from the data frame so we can work with it in shorthand. The last thing I do is pull out the coordinates of all the listings by going through every row of the data frame, pulling out its x, y coordinates, and stacking the whole thing together into an array.
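The notebook does this row by row; a vectorised sketch gives the same array, assuming the point GeoDataFrame is called `listings`:

```python
import numpy as np

# one (x, y) pair per listing, pulled straight from the geometry column
coordinates = np.column_stack([listings.geometry.x.values,
                               listings.geometry.y.values])
```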
So coordinates is now just a NumPy array of projected data; this happens to be in Web Mercator, which is accurate enough for the scale of computation we're doing here, and we'll use it to conduct these regressions. The simplest, most fundamental neighbor regressions live in scikit-learn's neighbors module, which has a bunch of functions; some operate similarly to the PySAL weighting functions for spatial weights, though ours tend to be more geographically focused, whereas the nearest-neighbor machinery in scikit-learn is much more general and aimed at feature space. We split our data into a training set and a test set for predicting these Airbnb prices, and then, in the next cell, I fit three different neighbor regressions.

The first is what I'm calling the spatial neighbor regression: it only looks at the prices of the Airbnbs we observe, and it predicts the price of any test observation from the closest elements of the training data, basic spatial distance, inverse-distance weighted, as a way to predict the holdout set. You could do this with any feature data you want to train a model on, check whether it holds up, do all sorts of hyperparameter optimization; some cities have urban structures where you'd need anisotropic distance measures, which depend on the angle at which observations are related to one another. Then we check this against an attribute version, which again picks prices from the training set but uses the exogenous regressors, the additional information we have about the listings (accommodation, number of beds, that kind of thing), a more classical framework with no geography at all; it picks the training observations nearest to the holdout set in feature space. Finally we mix the two: a hybrid model that naively uses the spatial coordinates, longitude and latitude, as features themselves. So there are three models: a spatial one, which is basically surface interpolation; a more classical one, which uses nearby feature values to predict the outcome; and a hybrid of the two. Then we make our predictions. (Question: you don't have to declare the estimator three times?) I just did it for clarity; initializing a scikit-learn estimator is pretty cheap, so you can change that if you want.
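A sketch of the three fits; the column names, the neighbour count, and the 2,000-observation holdout are assumptions, and the columns are assumed to have no missing values:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

X = listings[['accommodates', 'review_scores_rating',
              'bedrooms', 'bathrooms', 'beds']].values
price = listings['price'].values

(coords_train, coords_test,
 X_train, X_test,
 y_train, y_test) = train_test_split(coordinates, X, price, test_size=2000)

spatial = KNeighborsRegressor(n_neighbors=10, weights='distance')
spatial.fit(coords_train, y_train)          # location only: surface interpolation

attribute = KNeighborsRegressor(n_neighbors=10)
attribute.fit(X_train, y_train)             # feature space only, no geography

hybrid = KNeighborsRegressor(n_neighbors=10)
hybrid.fit(np.hstack([X_train, coords_train]), y_train)   # naively mix the two

for name, model, test in [('spatial', spatial, coords_test),
                          ('attribute', attribute, X_test),
                          ('hybrid', hybrid, np.hstack([X_test, coords_test]))]:
    print(name, model.score(test, y_test))  # variance-explained style score
```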
Now, when you look at the accuracy of these: the spatial model has about 10% of variance explained, so just by using the outcome itself, with none of our exogenous regressors, we get an explained-variance score of about 0.10. The exogenous-regressor model, the one that uses the X information (which could potentially be a huge amount of information), attains about 33%. And when you mix the two together naively, you absolutely tank your model score. What this is supposed to show is that if you want to use spatial data, you have to be careful about how you include it: often it is not enough to just throw raw longitude and latitude into your features. You should use the concepts we're teaching you, such as the spatial lag, neighborhood averages as engineered features, rather than naively including spatial position. Does that make sense? Any questions on this little example? Would anyone like to see an alternative configuration or a refit? (Question about the split.) Yes, I believe I'm only holding out about two thousand observations, so that can change things. The other thing is that I haven't ensured the holdout set is spatially random, so there could be spatial structure in the holdout set that affects the prediction accuracy; we won't cover that here, but it can all be assessed with the local statistical techniques we discussed earlier. All of that exploratory analysis applies to any type of spatial data; you just have to figure out how to express the topology. Any other questions on this basic example? None, okay.

If you're interested in these kinds of kernel-based regressions for local data, there's a package in the analytics libraries we work on called GWR, geographically weighted regression. GWR is like a mixture of these two concepts, feature-space regression and spatial regression: in essence it fits a model at each site. If you've heard of kriging, or of spatial process models, these belong to a family of local smoothing models that all work in similar ways; this one is a generalized additive model that uses spatial position to localize predictions and estimates.

The second thing we'll do is use the spatial lag, this data-borrowing technique, in a more classical regression setup, rather than the k-NN regression stuff. Here I need to do some conditioning on the weights matrix: by default, when we construct a spatial kernel weight, the diagonal gets filled with ones, and we want it filled with zeros so that the focal observation contributes nothing to the estimate of its own surroundings. So I fit a kernel weight, which applies a Gaussian kernel around every observation and adapts the bandwidth so that a particular k-nearest-neighbor criterion is satisfied.
I believe we include a whole notebook that goes in depth on these weights matrices and on modelling spatial relationships specifically, but here I'm just using this to... it takes a while. Oh, bummer, k is a hundred, okay. The point of doing this is that, in the same way that in the previous notebook we constructed an adjacency graph using the queen contiguity criterion, this gives us a weighted way to construct a spatially weighted average over the nearest hundred observations in our data set. If you've used PostGIS or something like it for spatial indexing queries, what these weights objects represent, conceptually, is the output of an all-pairs spatial relationship query, done once ahead of the computation so we don't have to redo it while running the techniques; it lets us express everything in matrices and linear algebra rather than in a query language like PostGIS uses. So now we've run this weighting function and we have a kernel weight: a Gaussian kernel centered on every single Airbnb, giving a distance decay, a Gaussian radial basis function, in geographic distance away from that observation. It's just a way of saying that this observation is related to that observation by this spatial kernel; you've all seen heat maps, so this is probably somewhat familiar.

Our data matrix, remember, is accommodates, review scores, bedrooms, bathrooms, and beds. The kernel weights matrix is just an n-by-n matrix that has been specially constructed to give us certain properties, so when I compute WX, pre-multiplying my data matrix by the weights matrix, I get a new data matrix whose columns are the neighborhood averages of every column in the original. This usually scales pretty well: once I have the weighting matrix I can use it as a filter to construct spatial averages. So WX is a NumPy array that reflects a weighted summary of the observations neighboring each specific observation. If we grab WX[0] and compare it with X[0], they are not the same thing: WX[0] summarizes the surroundings of the first observation, while X[0] summarizes the observation itself, and if we look at the actual neighbors of that observation in the kernel weights, it has a hundred of them, each with its Gaussian kernel weight. If we want WX on the same scale as the input data, we have to do one more thing: row-standardize and construct the lag again, which I didn't do in the notebook originally, so we do it here. Now, our focal first observation accommodates two people, but the hundred nearest Airbnb listings accommodate around three on average; the focal observation has a review score rating of 100, but the hundred nearby listings average about 92. Since we've constructed this kernel weight, all of these become, in effect, queries whose computation we did ahead of time.
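A sketch of the kernel-weights/WX step with libpysal; the sparse-matrix handling of the diagonal and the row standardisation are one way to do it, not necessarily the notebook's:

```python
import numpy as np
from libpysal.weights import Kernel

# adaptive Gaussian kernel over the 100 nearest listings
kw = Kernel(coordinates, fixed=False, k=100, function='gaussian')

W = kw.sparse.tolil()
W.setdiag(0)                     # a listing should not contribute to its own average
W = W.tocsr()
row_sums = np.asarray(W.sum(axis=1)).ravel()
W = W.multiply(1.0 / row_sums[:, None]).tocsr()   # row-standardise

WX = np.asarray(W @ X)           # neighbourhood average of every column of X
```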
Using statsmodels (though you could use your favorite regression package and achieve the same thing), we first check a standard OLS predicting prices from the attribute information we had before, and it's not a very good model: an r-squared of about 0.29, with most of the effects significant except the number of bathrooms. Obviously we're not doing inferential science here on the theory of why Airbnb prices are what they are, but it is interesting to note that bathrooms don't significantly affect Airbnb prices. Then, in addition, we construct a hybrid table, this X-WX table (let's grab the head), which contains our focal observations and their lags: for every attribute we're using, we grab the lagged attribute as well, and we use them together in a regression. When we do that, the r-squared bumps up a little. We've done this feature engineering, the neighborhood average over the hundred nearest observations, and in the new regression some of these context variables turn out to be statistically significant, useful predictors, so the model improves overall. For instance, with the review scores, the listing's own review score matters, but the quality of the surrounding Airbnbs also matters, and in fact that lag coefficient is larger than the original one. It gives you a handle on how desirable the area is, an intrinsic, unmeasured quantity that we've effectively instrumented with the neighborhood average of reviews: if you're staying in an area full of highly reviewed Airbnbs, that tells you something about the context in which that listing sits. Does that make sense?
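A sketch of the two regressions with statsmodels, continuing with the `X`, `WX`, and `price` arrays from the previous sketches:

```python
import numpy as np
import statsmodels.api as sm

baseline = sm.OLS(price, sm.add_constant(X)).fit()
print(baseline.rsquared)            # around 0.29 in the session

XWX = np.hstack([X, WX])            # focal attributes plus their neighbourhood averages
augmented = sm.OLS(price, sm.add_constant(XWX)).fit()
print(augmented.rsquared)           # bumps up once the spatial context is included
print(augmented.summary())
```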
There's an additional collection of estimators in PySAL, under the spreg (spatial regression) module, covering various spatially explicit specifications. Let's try to fit one on y and the X-WX table values... oh no, it's not going to work. Okay, there are some incompatibilities between pieces we're working on in PySAL right now. But there are other methods, the so-called lag regressions. In the previous section we fit a linear regression in which the WX terms reflect the neighborhood averaging we did. There's also another class of models (you can fit them in PySAL and elsewhere; Stan or PyMC is a little difficult) which instead fits something more like our kernel regressor: you can throw in WX if you want, but you don't have to, and the outcome now appears on both sides. This mixes in the strength of the kernel regression, which borrows similarity in the outcome to predict new outcomes; here you're treating the outcome as endogenous. You're saying the nightly price of an Airbnb is not determined in isolation: if I'm running an Airbnb (an Airbnb landlord, or whatever they call them; I don't love the term, but I'll use it) and I'm pricing listings, I'm aware of the prices in the surrounding areas. I don't have full price-setting power; to set my price I need to be competitive with the listings spatially next to me, because they have a similar set of amenities, access to similar urban areas, similar transit, and so on. Because of that we can't really expect all of these observations to be independent of one another, and one way to model it is with this endogenous outcome. These models get more complicated, because they become nonlinear regressions; mathematically there's probably no way to solve this efficiently with this particular kernel-weights configuration, but as the weights get sparser they fit very well, so tens of thousands of observations are fine as long as the connectivity is really sparse.

That's what we have on the regression section. Any questions about data borrowing as a conceptual way to bring geography into your modelling through these neighborhood averages? (Question about Bayesian approaches.) Yes, I've written some code to sample spatial models in a Gibbs sampler for multilevel work, and some of this in Stan, but most of what's in spreg comes from a maximum-likelihood estimation perspective, so in general we don't support Bayesian inference; only the packages I've written under PySAL fit those kinds of models. There's also some work on local smoothing of attributes, more on the outlier-detection and small-area-modelling side. If you are interested in fitting these endogenous-outcome models in a Bayesian way, there are some tools in Python, but it's probably best to fit them in R, where there's a well-established sampler, a Gibbs sampler, Metropolis-within-Gibbs; the gradient-based stuff gets difficult because this reduces to an expression that's hard to differentiate, so the conditional posteriors of the autoregressive component don't sample very well. I had a paper out last year looking at that kind of thing, and it's tough. (Question: what does "data borrowing" refer to?) It refers to constructing new attributes, either from WX or from WY. How formal would you like it? I thought data borrowing was a good descriptive name for what's going on; formally, yes, it's a form of weighting your regressors, but semantically your regression gains strength because you're borrowing additional information based on the spatial configuration of the observations. And you're not really making a more complex model; well, you do get more parameters to fit, so technically you lose some degrees of freedom, but in general it's not a more complex model. You can stuff this into any place you'd use a linear model, which I presume is all places for most people, and bring in geography through these data-borrowing techniques. The endogenous regression is more difficult: in general there's no closed-form solution, depending on what you do, so you may need a different sampler, and in a logit situation you can't fit it in certain ways. Does that make sense? Do you all have an idea of how you might apply this in your future linear models? Okay. So then we can move on to clustering, or we could do the problem set; I was going to take another poll, but maybe not, we'll just do the clustering.
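For the lag models mentioned above, here is a heavily hedged sketch using spreg's ML_Lag estimator with a deliberately sparse k-nearest-neighbour W (a dense kernel W of this size would be impractical, as noted); the choice of estimator, the 'LU' method, and the weights are assumptions for illustration, not the tutorial's code:

```python
import numpy as np
from libpysal.weights import KNN
from spreg import ML_Lag

wk = KNN(coordinates, k=10)      # sparse connectivity, which these estimators prefer
wk.transform = 'R'

# spreg expects y as an (n, 1) column; 'LU' is intended for large, sparse W
lag_model = ML_Lag(price.reshape(-1, 1), X, w=wk, method='LU', name_y='price')
print(lag_model.summary)
```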
Okay, so then we can move on to clustering. We could do the problem set; I was going to run another poll, but maybe not, we'll just do the clustering notebook. Is everyone following along fine? Does anyone need a break? Yeah, okay, I guess you've been here for three hours, and that's a long time. Is everyone on the clustering notebook? Okay, let's change gears.

So when it comes to clustering: Serge talked a lot about what we have in PySAL for map classification, and from a traditional geographic perspective, map classification is something we do for univariate data when we're visualizing the spatial structure of a distribution. This is only supposed to be an introductory workshop, so we're not getting too deep into the theory of clustering your data, but if any of you have used things like multi-dimensional k-means, you know it gives you a dimension-reduction technique: it lets you take large amounts of data and express them as latent categories. That's kind of what map classification is supposed to do, but now we can do it with much more data than before. So what I'm going to show you here is how to search for and use these clustering methods on larger data. Does that make sense? Cool.

The first thing we're going to look at again is this Berlin listings data set; I'm just showing you again how it looks. Then we pull up the scikit-learn clustering module, which you should all have installed if you followed the directions, and again I'm going to pull out a coordinate array. Most of what GeoPandas is useful for is working with your spatial data, but scikit-learn doesn't really care where the data comes from once it's a NumPy array, so we have to play nice with that. It would be nice if we could just hand it a GeoSeries and it understood how to get the coordinate array, but it's definitely not there yet.

The first clustering technique you may have heard of is DBSCAN. This is a density-based scanning method that looks at local reachability, tries to find areas of high density and interconnection between observations, and discovers where regions exist. We're going to apply DBSCAN just to the spatial information, so we're not considering anything about the attributes themselves; we're going to do what's called regionalization. Regionalization is slightly different from generic clustering: in a spatial context, regionalization is a graph partition, where we try to split the data into contiguous, geographically well-defined zones, and DBSCAN applied to spatial coordinates will give us that. Just as we usually use DBSCAN in feature space to get well-defined zones with internal attribute similarity, this gives us well-defined zones where observations tend to be closer to one another inside the zone than outside, but only in the geographic dimension, not in the attribute dimension. So here we've applied DBSCAN to the spatial coordinates, and I've tuned the epsilon; you have to do a hyperparameter search, and it's not insensitive to the projection. When you use these DBSCAN clusters, you get information about what the algorithm discovers as the core samples of each cluster; in other contexts, like affinity propagation, you'd get things called exemplars. So when you hear "core sample indices", these are the observations that are most representative of their cluster.
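A minimal sketch of that coordinate-only DBSCAN step, assuming the hypothetical `listings` GeoDataFrame of points is in a projected CRS (eps is in the units of that projection, and the value here is purely illustrative and needs tuning):

```python
# Minimal sketch of running DBSCAN on the listing coordinates alone, as described.
# The eps value is illustrative and depends heavily on the projection in use.
import numpy as np
from sklearn.cluster import DBSCAN

coords = np.column_stack([listings.geometry.x, listings.geometry.y])
db = DBSCAN(eps=1000, min_samples=25).fit(coords)

listings['density_cluster'] = db.labels_           # -1 marks "noise" / unaffiliated points
print(len(db.core_sample_indices_), 'core samples')
```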
Then we get the label array, and as Serge was saying about the map classifiers, this .labels_ attribute is mainly what we're interested in: we run these clustering algorithms, we get the labels, and then we use the labels to visualize complex patterns in the data. So what I'm recommending is that when you want to do this kind of study, you go through, you use your scikit-learn classifier (or whatever you like; maybe you have a favorite model in caret), and then you pop the labels back down into GeoPandas and plot. Mainly, most of the data-analysis workflow is: pull in your data, do some exploratory statistics to get your head around the spatial structure of the data, build some spatial weights matrices, use those to do your data science, then bring it back into GeoPandas, plot it, work with it, export it, make dashboards, things like that.

When we've done this clustering with DBSCAN, what we've shown is that every single one of these is a latent density cluster: they represent areas where Airbnb listings are closer to one another than they are to any others, and we see that mainly what we find is that Berlin is one big cluster. Tuning the epsilon value will give you different resolutions on those clusters; it tells you how far apart observations need to be in order to be considered disconnected. Many of you may have heard of HDBSCAN, a hierarchical density-based clustering technique based on DBSCAN, which does some of this tuning for you but also introduces more implicit parameters. The point of this was not to show you the best clustering; the point was to show you that some of these methods, applied just to the latitudes and longitudes, will give you these zonations, and a lot of people want to solve zonation problems, so keep that in mind when you do them.

[Audience question about epsilon.] Yes, or it would error, because you don't get connected graphs. So if we drop an order of magnitude from the epsilon and pop it into the plot, we'll see that most of the observations become either unaffiliated, which is now the purple, or... yes, DBSCAN itself is highly sensitive to that structure. One thing I find helpful when working with these kinds of classification labels is to sort on the labels, because when you get the classification vector back from scikit-learn, minus one means no classification found; when you sort on the labels you can plot those underneath the rest of the map, which gives you a nice impression of where the clusters are found. So here, instead of saying that at a thousand feet observations are considered disconnected, I've said that at a hundred feet observations are considered disconnected, and used that to get a sense of where these latent clusters of density exist. If you look closely, you can actually tell that some of the clusters found by DBSCAN correspond to clusters of neighborhoods in the actual neighborhood shapes, but we're not going to get into that, because again it's contingent on a lot of hyperparameters.

[Audience question.] Yeah, it should be... hmm, not sure; originally that should be a p, and I'm not sure why that's the case right now. I'll have to check on that and get back to you, sorry about that. Regardless, GeoPandas is smart enough to know how many clusters there are in a given discrete data set, so it will figure that out and we don't have to tell it.
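Here is a short sketch of the sort-on-labels trick just mentioned, pushing the DBSCAN labels back through GeoPandas so the "-1" noise points are drawn underneath the clusters (it reuses the hypothetical `density_cluster` column from the previous sketch):

```python
# Plot the DBSCAN labels with GeoPandas, sorting so noise (-1) is drawn first.
ax = (listings
      .sort_values('density_cluster')
      .plot(column='density_cluster', categorical=True,
            markersize=2, legend=True, figsize=(8, 8)))
ax.set_title('DBSCAN density clusters (noise drawn underneath)')
```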
Any other questions on basic zonation of data sets? No? Okay, let's move on.

So, what about actual areal, polygonal data? How many of you tend to work on point data sets? Many, okay. How many tend to work on areal data sets, where you've got districts? Okay, a couple. I think there are a lot of people who do econometrics or program analysis who work on counties or zip codes or things like that. A lot of this is applicable to what we call lattices, these collections of polygons that touch one another, but in general it's also applicable, in theory, to anything that's connected by an adjacency graph. All of these lattice statistics are themselves graph methods, weighted-graph methods, or what are sometimes called labelled-graph methods when a node has particular properties; slightly different representations of the same thing we're going to be talking about here.

So again we've got this neighborhood data set with the median price of the listings, and first, just like before, we're going to take a map-classification-style approach: we'll look for three clusters in the price data, and the price data alone. If you recall the median price distribution, we should probably find two, maybe three clusters; here I'm forcing it to be three. You can do some assessment, and we'll talk in a second about how that works. Now we're mapping what this distribution of median prices looks like when we use Ward agglomerative clustering on the input data; it's still univariate, we're just visualizing the patterns in the price clustering.

Have you all heard of the various kinds of cluster-certainty metrics, like silhouette scores? Have you used them before? No? Okay. A silhouette score is kind of like a regularized distance, although it includes negative values as well. It's a measure of how certain we are about a given classification: a function of how close the observation is to the center of its own cluster versus the center of the closest cluster we could have picked instead. So it's like: we put it in cluster A, and our second-best choice would be cluster B; how far is the observation from its own cluster relative to that second choice, standardized? It usually varies between minus one and one: minus one indicates a lack of fit, one indicates a good fit. scikit-learn has a silhouette_samples function that computes this measure of fit for every single observation in our sample, so here we get a measure of how strongly the prices we fit before, in this map, are classified into their groups. Just like before, you can do spatial statistics on these kinds of things, and mapclassify has some functionality related to this as well, but we're not going to deal directly with the silhouette scores for the moment. When you visualize the classes from the Fisher-Jenks algorithm, which is a k-means-style optimal classifier for one dimension, that's a different solver than the Ward agglomerative clustering, and we get a different spatial distribution of our attributes. So depending on which clustering you think is relevant here, this one uses Ward and this one uses a Fisher-Jenks solver for class homogeneity, you'll see that you get different structures, and the structure of certainty changes as well: this silhouette map is quite different visually from that one.
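To make that concrete, here is a hedged sketch of the attribute-only clustering and the per-observation silhouette scores, assuming a GeoDataFrame `neighborhoods` with a `median_price` column (the names are illustrative, not necessarily those in the notebook):

```python
# Attribute-only (non-spatial) clustering on median price, plus per-observation
# silhouette scores as a measure of classification certainty.
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import silhouette_samples

price = neighborhoods[['median_price']].values
ward = AgglomerativeClustering(n_clusters=3, linkage='ward').fit(price)

neighborhoods['price_cluster'] = ward.labels_
neighborhoods['silhouette'] = silhouette_samples(price, ward.labels_)

neighborhoods.plot(column='price_cluster', categorical=True, legend=True)
neighborhoods.plot(column='silhouette', cmap='RdBu', legend=True)
```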
Again, we can use map-comparison techniques, statistical methods, to compare these two maps, and of course you can plot the decision boundaries and see how these relationships vary across different clusterings. When you're trying to summarize the quality of an overall fit, you use an average of the silhouette scores: if you don't want the silhouette scores by observation and would rather have a single, map-wide estimate of how good your clustering is, you can use the sklearn.metrics.silhouette_score function, which averages those per-observation silhouette scores, weighted by the number of observations in each cluster. As you see here, the Fisher-Jenks solution is slightly better on homogeneity and separation than the agglomerative clustering method.

Again, we can do multi-dimensional versions of this, because these are all just standard scikit-learn clusterers. So if we wanted to fit, let's just do an agglomerative one because it's simple, if we fit this agglomerative clustering and instead of using median price... oh wait, these are the districts, so we may not have as much information; we don't have all of the attributes. In this particular data set I've only aggregated up one attribute, so here we're just dealing with that one, but since these are scikit-learn classifiers you can take any number of features, condense them into a labeling, and then do this kind of certainty analysis.

The second thing that's helpful is this. We've seen zonation: with DBSCAN you can work out where your observations are and how close they are to one another. We've seen the map-classification-style approach, where you look for high-dimensional clusters in your data and map the output of those clusterings. The other thing you can do is look for zones that are homogeneous on a given attribute, and that's what I'm about to show you. To do that we usually need to build a topological relationship again, so I'm going to use a particular adjacency graph: a rook adjacency. Rook adjacency is like queen adjacency, but it means polygons can only be neighbors by sharing an edge, so if two polygons share only a point they are not neighbors under rook, but they are neighbors under queen. Does that make sense? It's just an analogy to chess that we use as shorthand. So I'm saying that I want to find clusters that are homogeneous in their attributes, but I also want them to be internally connected; I want them to form connected partitions of my global graph. To do this, we can fit our price clusters in scikit-learn using this structure: the agglomerative clustering function can take a connectivity matrix, and PySAL constructs connectivity matrices from our weights objects on the fly. When you look at rook.sparse, that's a sparse matrix representing the adjacency matrix in that weights object; any of the various kernel or KNN weights constructions we have will also express themselves as these adjacency graphs, which you can then use. You could ship this off to NetworkX if you want to use alternative graph-partitioning methods, but this way you can go really quickly from the adjacency structure of a spatial data set in GeoPandas straight into scikit-learn.
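A hedged sketch of those two pieces, reusing the `neighborhoods`, `price`, and `ward` objects assumed above; the `Rook.from_dataframe` call is the libpysal spelling, which may differ slightly from the PySAL version used in the tutorial:

```python
# Build the rook contiguity graph from the polygons and expose it as the sparse
# connectivity matrix that scikit-learn's agglomerative clustering can consume.
from libpysal import weights
from sklearn.metrics import silhouette_score

rook = weights.Rook.from_dataframe(neighborhoods)
connectivity = rook.sparse                       # scipy sparse adjacency matrix

# Single-number summary of the earlier attribute-only clustering:
print(silhouette_score(price, ward.labels_))
```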
All right, so using these methods, what we're going to do is look for three clusters again, but this time requiring them to obey this connectivity graph, this adjacency matrix, and again we're only clustering on the median price. We find three clusters with relatively homogeneous prices given the connectivity structure we fit. Now, you probably don't need to know a lot about Berlin to be able to tell me what this divide is. It's not actually precisely the correct divide, but there are some strong east-west divisions in Berlin, right? And there's a lot that's gone on since then, too. In general we can fit these kinds of clusterings, and you can vary the number of clusters you look for: say we wanted six, we get six clusters that are each internally adjacent but have relatively homogeneous prices. So if you want to do this kind of weighted zonation, you can do it really quickly by moving straight from these connectivity objects into scikit-learn.

One other thing we can often use is probabilistic clustering, and scikit-learn has a really nice Gaussian mixture model framework; anyone who uses Gaussian mixture models in other contexts can do exactly the same kind of analysis. Here I'm going to fit a Gaussian mixture classification, forcing the random state to a particular value so we get reproducible results, passing in the price, and then we get a predicted class based on our mixture model. The most helpful thing is that instead of the silhouette scores, which are what's called a post-hoc measure of fit (given the classification, we compute the silhouette score), we get probabilistic measures of class membership built directly into the model. Most of the other clustering methods, like agglomerative clustering and k-means, don't actually think about the silhouette score while they're fitting; they only think about local measures of variance or deviation. Here, instead, this is built into the likelihood function of the Gaussian mixture model. So when we look at predict_proba, which is how certain we are that an observation falls into one of these three, or six, or K classes, we get a full matrix of prediction strengths. Here I'm just taking a basic argmax, grabbing the maximum class, and that gives us a sort of strength of label assignment, and then again we can visualize how strong our map predictions are. But one thing you'll notice is that in general, when we do the aspatial clustering methods, we get unconnected graphs, and the structure in the silhouette scores tends not to be very spatially patterned. What happens if we compute the silhouette samples of something like the agglomerative method? Did I save it? Ah, that's a bummer, I thought that cell was there; I guess not, we'll come back to it. The point, anyway, is that when you visualize these things for the spatially constrained versions, these homogeneous zonations, you'll find that the spatial structure of the certainty is also very patterned. So when you enforce spatial structure, you usually lose something in terms of the fit of the clusters.
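Tying the two steps above together, here is a hedged sketch of the contiguity-constrained ("regionalization") clustering followed by a Gaussian mixture model whose predict_proba gives built-in class-membership strength; it reuses the hypothetical `neighborhoods`, `price`, and `connectivity` objects from the earlier sketches, and the random state is just an illustrative value:

```python
# Contiguity-constrained clustering on median price, then a Gaussian mixture model
# with probabilistic class membership instead of post-hoc silhouette scores.
from sklearn.cluster import AgglomerativeClustering
from sklearn.mixture import GaussianMixture

regions = AgglomerativeClustering(n_clusters=3, linkage='ward',
                                  connectivity=connectivity).fit(price)
neighborhoods['region'] = regions.labels_
neighborhoods.plot(column='region', categorical=True, legend=True)

gmm = GaussianMixture(n_components=3, random_state=1234).fit(price)
membership = gmm.predict_proba(price)             # n_obs x n_components
neighborhoods['gmm_label'] = membership.argmax(axis=1)
neighborhoods['gmm_strength'] = membership.max(axis=1)
neighborhoods.plot(column='gmm_strength', legend=True)
```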
Here I'm showing you the relationship between prediction strength in the spatially constrained version and in the Gaussian mixture version. The point is that when you use the spatially constrained approach, when you say "I want contiguous zones," those contiguous zones are usually never going to be as homogeneous as zones you don't force to be contiguous. So you have to be mindful: how much do you really want that contiguity, how much does it matter to the analysis you're doing? If it's just for pretty maps, it may not actually be worth it, because the difference between the two can be rather stark. However, for the zonation problems, you can move rather quickly from doing the topological computations in PySAL, straight out to a matrix form, and into scikit-learn using the tools we've just talked about.

Are there any questions on the clustering notebook? We've seen pure spatial clustering, attribute clustering and visualization, and then what we'd call regionalization: contiguous, homogeneous zonation. No questions? This is really a semester-long course rather than a short tutorial. All right, we've got about four or five minutes, so I guess we'll break just a slight bit early. If you have any questions for me or my co-presenters, come up and talk to us, we'd love to chat. Otherwise, there are a couple more resources in the repository: like I said, there's an additional case study, we've also got additional teaching materials on spatial data science and analytics in our various geographic data science repositories, and a book in the works. So please keep in touch and keep following us for more updates on this kind of stuff. Thanks. [Applause]
Info
Channel: Enthought
Views: 55,415
Rating: 4.9598799 out of 5
Keywords: python, geospatial, Data Analysis, SciPy 2018
Id: kJXUUO5M4ok
Length: 188min 3sec (11283 seconds)
Published: Tue Jul 17 2018