If you're attempting to use the environment.yml file to set up an environment, there's a build issue on some platforms with some versions. What I would encourage you to do is download the environment.yml file from the tutorial resources and replace the one that exists in your current copy of the workshop. We also just redid the file; we had to change it so that it explicitly specifies where we're getting cartopy from, so you could also just edit your file locally and add cartopy like the rest of the packages under the dependencies, and that should work, hopefully. We're very sorry about this sort of inconsistent mismatch: we checked that it worked on Windows, macOS, and Linux, but since then there seem to be some sporadic failures. So for those of you who do not have the environment configured yet, continue to raise your hands as issues arise. There is also a Binder environment for this workshop, which allows you to execute the code in the cloud if you'd rather do that than worry about setting up the environment, but I think it's always best for you to have your own local copy running so you can interact with the data.

So I guess what we'll do first off is introduce... hello, this is the tutorial on intermediate methods for geospatial data analysis. Welcome. This is the second in a series of workshops: we gave one last year on an introduction to geographic data analysis, and that dealt with the primitives of geographical information, so how you represent things like points, lines, and polygons in geographic information systems in Python, how to do some basic querying and manipulation of that data, and then some more advanced topics on statistical exploration and the analysis of spatial structure in data. My name is Levi Wolf; I'm a lecturer at the University of Bristol in the United Kingdom. And I'm Serge Rey, professor at UC Riverside; I run the Center for Geospatial Sciences there and I'm a co-chair of the meeting this year, so thank you all for coming. We're going to tag-team today: Levi's going to deal with the first half up to the break, and then we'll swap and I'll take over. That means I can range around and help people in the first half, so if you have a question or get stuck on something, raise your hand.

Cool. The format of the workshop is broadly organized according to the schedule listed in the README in the tutorial materials. There will be problem sets for each collection of notebooks. We're going to start off working on a notebook on spatial relationships; this is about classic concerns in GIS, like who's nearest to me, and how spatial joins relate different geographical objects together. Then there's going to be a problem set where you'll take the material we've walked through and apply it to some new data, to try to answer some questions about the structure of 311 cases reported in Austin. So broadly speaking, it'll be a presentation on a notebook, which you all should walk through as we're doing it, and then a problem set where you get about 20 minutes or so to try to solve the problems using techniques from the tutorial. Before we proceed, raise your hand if you're still having environment issues... okay, quite a few. Does anyone else have any issues generically, about the content of the tutorial or about what we're going to do, before we get started? Awesome, okay. If you are still having environment issues, can you keep your hand raised so that Serge can come around and check on some of the statuses, and then I will proceed to the first set of content.
So, are all of you aware of how to open Jupyter notebooks, in either JupyterLab or the Jupyter notebook viewer, so that you can walk through the tutorial? Normally that would involve opening up Anaconda Navigator, getting to where the files are located on your hard drive, and then opening the file. We're going to start off with the first notebook, GDS 1 - Relations (the .ipynb file).

Okay. Geographic data comes in a lot of different representations and conceptual structures. In this notebook we're mainly going to be talking about the structures and relationships of what's called vector data, and vector data usually comes in three different flavors: points, which conceptually represent a specific single location in geography; lines, which represent some collection of continuous space one-dimensionally along a particular path (usually composed of individual straight line segments in a GIS); and finally polygons, which are patches, usually with lines defining their boundaries or internal holes. We're going to talk a little bit about how to do some pretty plotting of this data, in a somewhat more advanced way than you typically learn with standard geopandas tools, and then we're also going to talk about geocoding, which is the process of constructing geographic data from textual data, and also constructing textual data from geographic data. Then we'll talk a little bit about zonation, which is one way you can conduct a sort of spatial group-by of points, and then about classic ideas behind spatial joins, spatial queries, and in particular spatial nearest-neighbor queries.

So the first thing we'll do: we'll be introducing about four or five libraries in this notebook, but the central one for most of this workshop is the geopandas library (and this is incredibly small; cool, single-document mode, cool). We'll be using these three libraries, geopandas, matplotlib, and pandas, pretty extensively throughout the entire workshop. In this notebook specifically we'll also cover a little bit of the contextily library, which is responsible for grabbing basemap tiles; the geopy library, which deals with geocoding and reverse geocoding data; and the scipy library's spatial submodule, in particular some of the things you can do using scipy.spatial.

First off, geographic data comes in a large variety of formats, and oftentimes when you're working with people who don't do GIS work on a day-to-day basis, you're going to have to learn how to parse geographic data by hand. One of the more common ways that people export geographic data, especially for point data, is just as a flat text file. So here we've read in the two data files we're using today; they're compressed comma-separated-value files. This is actually usually a little nicer than what people will give you; they'll usually just give you raw CSVs, which tend to be wasteful and somewhat dirty. This is a set of data on neighborhoods in Austin, Texas, that's in a CSV, and then a set of Airbnb listings that have been scraped from Airbnb, containing information about their price, their amenities, and the different descriptions that the host has written about the property.
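A minimal sketch of that read step; the file paths here are assumptions, not necessarily the paths used in the tutorial repository:

```python
import pandas as pd

# Paths are illustrative; pandas infers the gzip compression from the .gz suffix.
neighborhoods = pd.read_csv("data/neighborhoods.csv.gz")
listings = pd.read_csv("data/listings.csv.gz")

listings.head()
```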
When we look at this listings data, we can see that most of it is textual. There are five rows shown and 107 columns, and at some point we see a geometry column, but in general we see that there's a ton of information about spatial context: we get some columns that refer to the city, some that refer to the state (let's see here: city, state), some information about the zip code, and we actually get this internal variable called market, which is the area that Airbnb thinks this particular listing competes in. You get latitude and longitude, which is the actual point location of the listing, and then you also get this other column at the end called geometry, and because it's a database export it comes out as well-known text.

In the neighborhoods data we have polygon information; these are going to be a collection of polygons that represent the area of every single neighborhood in Austin, Texas, and these polygons only have three bits of information with them, three columns. You've got a hood ID, which is like the neighborhood identification number; you have information about the group the neighborhood falls in, if any; and then we have this column called wkb. WKB is a very common format for exporting geographic or geometric information from spatial databases, and it stands for well-known binary. Well-known binary usually comes in a couple of different flavors; this looks like a large string, where the very first part provides some information about what type of shape this is, and then the rest of it, basically everything that comes after that prefix, encodes the coordinates of that polygon in a textual representation. Okay, so this well-known binary is a common format used inside databases like PostGIS; if you use SQLite, the SpatiaLite extension will also use this kind of export by default. So this is a very common mode of geographic data, but it's not really that clean, right? You're probably used to hearing more about things like GeoJSON, shapefiles, or GeoPackages, and those are much cleaner: you can usually just use geographic information systems to read those in, like we did in pandas, using geopandas.read_file; we'll do that later. Instead, to create geometries here, you'll have to learn about the different constructors that take textual data and process it into shape data.

One of the constructors that was added to geopandas quite recently is this points_from_xy functionality. points_from_xy parses longitude and latitude information; in this case it's stored as x and y, so it's longitude then latitude, and it turns that into a geometric representation in Python. So when you run points_from_xy on x-y coordinates (here we're sending it the listings' longitude and the listings' latitude), it turns that information into a list of points. Now, this is an intermediate workshop in geographic data science, so we won't necessarily cover a lot of Shapely, but Shapely in Python is the library that provides the geometric primitives for geographic data. When we mentioned before that geographic vector data usually comes in three flavors, points, lines, and polygons, Shapely is the package that defines all of those classes, their structures, and their relationships. If you're interested in learning more about these very basic geographic primitives for how to represent geographic data, the last workshop, the beginning intro to geographic data
analysis, provides a really thorough overview of those geographic primitives. Now we're going to skip that thorough intro on geographic primitives and move to using them to make geographic information science happen. Okay, so we're not going to spend a lot of time on the underlying mechanics of the Shapely library, but that's what happens when you use points_from_xy: it takes textual or float representations of coordinates and turns them into this proper representation in Shapely.

To build a GeoDataFrame, there are a lot of different ways you can do it. The most common way to work with the geopandas GeoDataFrame constructor is to send some kind of data-frame-like object and then tell geopandas where you expect the geometric information that corresponds with each row to live. So here we're sending this listings data frame, which has a bunch of columns and a bunch of ways the geographic information is represented, but we've just built up this collection of geometries in geometry down here, right? We called geopandas points_from_xy to take the raw longitude and latitude coordinates and turn them into Shapely representations, and then we're telling geopandas that this list of Shapely objects corresponds with every single row of that original listings data frame. So we build the geometries, and then we send something that looks like a data frame, plus some information about the geometries, down to the constructor. Now, there are a lot of different ways this constructor can work: if you have a Shapely object stored in every single row of your data frame, you can give the constructor a string (the column name), and then the geopandas constructor says, okay, I know where to look for the geometry, and it'll construct a GeoDataFrame just fine. So now I've got a GeoDataFrame where, at the end, I have this column full of geometries that are the correctly processed versions of our original latitude and longitude; and if I look at the very first one, it's still a Shapely point. It's getting rendered as text here because of the representation in the IPython notebook, but this is no longer a float or a string, it is a Shapely point. So geopandas needs to know about your data-frame-like object, and it needs to know where the geometry lives. The geometry could live in something else, like a list of Shapely objects, or it could live inside the data frame and sit in a column; they just need to be aligned, and the geopandas constructor zips them up together and makes a distinct object called a GeoDataFrame.

Okay, this is kind of special, because in pandas you've usually got all these different from_whatever constructors that allow you to take a bunch of different data structures and turn them into a pandas data frame. This strategy for geopandas means that you can take basically anything you can turn into a pandas data frame, then just shove in that geometry column later and turn it into a GeoDataFrame. So all of the constructors from pandas still work and give you pandas data frames, but once you need to do spatial analysis, you can convert them into a geopandas data frame and use that. So for x and y coordinates, we can use this handy helper function, geopandas points_from_xy.
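Put together, that construction step looks roughly like this, a sketch assuming the longitude and latitude column names from the listings table:

```python
import geopandas as gpd

# longitude is the x coordinate, latitude is the y coordinate
geometries = gpd.points_from_xy(listings.longitude, listings.latitude)

# pass the plain DataFrame plus the aligned list of Shapely points
listings = gpd.GeoDataFrame(listings, geometry=geometries)

type(listings.geometry.iloc[0])   # shapely.geometry.point.Point
```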
But for parsing well-known binary, or the equivalent representation known as well-known text (whenever you see something like POINT followed by a pair of coordinates, that's what well-known text looks like), the Shapely library has parsers for these various kinds of geographic representations, so you don't need to worry about writing a parser for well-known binary or well-known text yourself. If you import the wkb module from Shapely, you can use it to turn the string into a shape; in this case the well-known binary is in a hexadecimal encoding. If the well-known binary weren't in this encoding, I'll show you what that looks like in a second, but it's standard in geographic databases to export well-known binary in a hexadecimal encoding rather than a raw encoding. So here we're using this well-known binary column. Remember, this is the neighborhoods data frame, which looks like this, with three columns, one of them the hood ID, and at the end this well-known binary string, which is really long, and we're going to take each row and parse that information into usable data. That's the first well-known binary string, and if we use Shapely's wkb.loads with hex=True, we parse that into a shape. If we want to do that for every single row of a pandas series, we can use the apply method of that series. Here we're using a lambda, which is an anonymous function, but that may be too terse; so if I wanted to, say, write a parse_geometries function that takes a well-known binary string, it would return wkb.loads of that string with hex=True, and if we wanted to parse the whole column, we'd do the same thing we did above on each string by applying parse_geometries. Now every single row is a polygon, a Shapely polygon; it's been turned from this large string representation into the correct object representation. That's what the first neighborhood in the data frame is, that's what the second neighborhood in the data frame is, and so on.

But even though we've converted all of these things into Shapely objects, we haven't actually solved the problem of making a GeoDataFrame. Like before, I was talking about the geopandas GeoDataFrame constructor, and that constructor needs to know where your data-frame-like object is, plus a collection of geometries, which you can use to create a GeoDataFrame. So here's our neighborhoods data frame. Oh shoot, sorry, I ran the cell ahead too quickly, let me go back and repopulate it; cool, sorry about that. So the neighborhoods data frame again, which looks like this, where you have that well-known binary, and after we parse it we get another column called geometry that has the parsed representation. Right, but it's still not actually a geopandas data frame; it's still a standard pandas data frame. So in order to use it for spatial computations, we have to use the geopandas GeoDataFrame constructor. Now, this constructor implicitly assumes that your geometry is stored in a column called geometry: the default is that if you give a data-frame-like object to the geopandas GeoDataFrame constructor, it's going to look in a column called geometry to see if you've stored a geometric object in your data frame already. So just by default this looks in that geometry column, grabs the correct geometry, and now we've converted that into a GeoDataFrame.
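Roughly, that whole round trip looks like the sketch below; column names like wkb follow the description above, and the variable names are my own:

```python
from shapely import wkb
import geopandas as gpd

# parse one hex-encoded well-known-binary string into a Shapely shape
first_shape = wkb.loads(neighborhoods["wkb"].iloc[0], hex=True)

# parse every row; a named function reads the same as the lambda in the notebook
def parse_geometries(wkb_string):
    return wkb.loads(wkb_string, hex=True)

neighborhoods["geometry"] = neighborhoods["wkb"].apply(parse_geometries)

# the constructor finds the column named "geometry" by default
hoods = gpd.GeoDataFrame(neighborhoods)
```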
Yeah? Yep, set_geometry only exists if listings is a GeoDataFrame... if so, that's very new. Wait, I don't think that's the case, is it? Whoa. Oh, that means that recently there was an update to geopandas that accounts for this refactor, which I think was released in the last week. If that's the case, it must be because importing geopandas monkey-patches the original class, so if this is true then this is really new; good call, I didn't think that had been released yet. 0.5.0? Yeah, so that is possible: if you do have set_geometry, it will now monkey-patch into the original data frame. And set_geometry, which we'll talk about a little later, also allows you to switch between geometric representations inside a single data frame. So if you have, say, a polygon with two different representations, or a point that has a buffer around it that you computed (an area you're going to search inside of), you can use set_geometry to switch between those representations. And recently, I guess, set_geometry has also been patched to process geographic information inside columns and cast it directly into Shapely objects at the start; that's interesting that that's there, and I wonder if it's actually processing the plain text in the geometry column. Anyway, set_geometry works by looking at any column, just like the original GeoDataFrame constructor. That's interesting; I think that was just merged in 0.5.0, because that was a request that I made, so that's cool, I'm glad they put it in.

So anyway, when we're building the GeoDataFrame for the neighborhoods, by default it looks inside the column called geometry, but you could call your geometry column anything you want in geopandas. This can be kind of confusing, and if you look at some issues for geopandas you'll see lots of people ask this question: once you have a GeoDataFrame, the difference between the geometry column (a specific column in the data frame called geometry) and the actual geometry attribute (whatever the data frame considers its active geometry column) matters, because those two things are different. So if you had called the output of your parser up here, when we parsed the WKB, something else, like parsed_geometry, you could still construct a data frame and have that column be called parsed_geometry. Some database installations I've seen have geometry columns called the_geom, and if that's the case, you just pass the geometry column you expect and geopandas will treat that as the geometry. But by nature, if you use the .geometry attribute, that will always look for whichever series the data frame has registered as its geometry, which may not necessarily be named geometry. To show you what I'm talking about, here you can use .rename; let's say we rename geometry to new_geom. Now the geometry column has a different name, but .geometry can't find it, because it's looking for the column that was set previously; set_geometry will make the geometry attribute refer back to the name we want. And if we have this renamed data frame, then the geopandas GeoDataFrame constructor on it would have to say geometry='new_geom' in order to get a GeoDataFrame back. So this division between the column called geometry and what geopandas actually records as the active geometry: you have to keep those two concepts distinct.
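A short sketch of that distinction, using the hoods frame built above (the new_geom name is just illustrative):

```python
import geopandas as gpd

# rename the geometry column; the *active* geometry still points at "geometry",
# so renamed.geometry would now fail to resolve
renamed = hoods.rename(columns={"geometry": "new_geom"})

# re-point the active geometry at the renamed column
renamed = renamed.set_geometry("new_geom")

# or, equivalently, tell the constructor which column holds the shapes
renamed = gpd.GeoDataFrame(hoods.rename(columns={"geometry": "new_geom"}),
                           geometry="new_geom")
```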
For most cases, if the column you want to work with is called geometry, you're fine, but sometimes people encounter problems when they're using set_geometry, or resetting the geometry, or renaming columns, and you will get into trouble if you don't see the difference between those two concepts. So, just as an aside since we were talking about the set_geometry stuff: you want to be careful about what you're calling these geometry columns; the default name is geometry, and most operations will work correctly if you use the default name.

Writing out to file is done using the to_file method on each geopandas data frame. Setting coordinate systems, though, is rather important when you're working with geographic data: coordinate reference systems are geodetic conventions that allow you to agree on the accuracy and representation of geographic data. When you're working with coordinate reference systems in geopandas, this is conducted through the .crs attribute, and this is another thing that is set to change in a new release of geopandas, but right now the proper convention tends to be specifying a correct proj4 string in this manner. Since we've parsed the data from latitude and longitude in both cases, we're going to set the initial coordinate reference system in this manner: when you have data that comes from plain text and has no coordinate reference system information, you set its coordinate reference using a dictionary where the key is init and the value is itself something like a key-value pair, the referring authority you're using and then the code for that projection. This could also be a proj4 string; that should be fully supported as well. So here we're setting coordinate reference systems on our data frames, and then I'm going to write these out to GeoPackages for later use. That's kind of the higher level of working with geopandas data frames: constructing them from plain text, and working with coordinate reference systems, initializing them when you don't have one. Okay, the second part, writing the listings, will take a while, because the GeoPackage format takes a bit.

Are there any questions just on this basic construction of the data frames and working with geometric columns, the difference between the column named geometry and the actual geometry attribute? Anything? Yeah? Yep, so you can always add an arbitrary number of columns, and you can switch between them using that set_geometry operation. So if you wanted to have both a projected and an unprojected geometry inside a data frame, you could do that. The standard way to do that would actually be (and we'll do this in a second) listings.to_crs, and then you cast it when you need it; but if you have large data frames where doing the projection every time is computationally difficult, then you might want to store the pre-computed points. But yeah, you can method-chain a lot of these things: the to_crs method will reproject your data, and then you can call a method on top of that to plot it; we'll do that in a second, I can show you. But yeah, you can always store arbitrary numbers of geometry columns and then use set_geometry to recover them.
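Pulling those pieces together, the CRS-and-write step looks roughly like this. The init-dictionary form is the convention geopandas used at the time of this workshop (newer releases prefer something like crs='EPSG:4326'), and the output file names are just illustrative:

```python
# the data were parsed from plain longitude/latitude, so declare WGS84
listings.crs = {"init": "epsg:4326"}
hoods.crs = {"init": "epsg:4326"}

# write both out as GeoPackages for later use
listings.to_file("listings.gpkg", driver="GPKG")
hoods.to_file("neighborhoods.gpkg", driver="GPKG")

# reprojection can be chained when needed, e.g. into web Mercator for plotting
listings.to_crs(epsg=3857).plot()
```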
Are there any other questions on geometries and construction before we move on? Yeah, I'm pretty sure it assumes that by default, but I'm not sure when you'd use it. More often than not, if you're coming from a context where you're parsing data from plain text, you'll need to get the exact information about what's going on in the coordinate reference system. Here I'm just using this EPSG code because it's simpler and more concise for the example, but if you know more about the context of whatever data you're reading in, you should use that information. Yeah, sure, got it. So the question was: when should you use no_defs, or additional default options, on the proj string? I tend to just use the EPSG codes and get away fine; I think for geodetic operations with high precision you've got to be more careful about parsing stuff from plain text anyway if it's uninitialized, but in general geopandas will recognize coordinate reference systems when reading a file, so if you have a projection that you want to use, it should be fine. Any other questions? Yeah, the GeoPackage file is a spatial data format which stores all of the attribute data (the columns that aren't geometric) together with the geometries; my understanding is it's implemented inside something like a SQL database. You don't really need to know a lot about how it operates; it's just another geographic data format. There are quite a few drivers available in geopandas, and these all come from another library called Fiona, which is responsible for reading and writing vector data, but geopandas just sends all of that to Fiona. Any other questions? Cool.

So this next part is relatively straightforward: I'm going to show you how to do geocoding. Raise your hand if you've heard of geocoding before. Okay, quite a few of you. Have you ever used geopy before? Some, okay. So geopy is sort of the standard library to do this in Python. I'm showing you how to use the Nominatim API, which is an OpenStreetMap geocoding API, but geopy has access to many, many different types of geocoding endpoints, so if you or your organization has one that you find easier to use, you could always use that here. geopy works by instantiating a geocoder object, which then processes all of the queries that go through the API, here the Nominatim API. So we're initializing that geocoder object, and it has a bunch of different ways we can work with the API to geocode information. The first one is the actual URL the API is exposed at, so if you have that you could always go to it; this is the geocoder endpoint we're actually going to be using. Today I'm just showing you how to use Nominatim from the geopy package, but there are quite a few other endpoints available.

When you're working with geocoding, it can go in two directions, and I'm going to show you how to do both here. Geocoding, when you say "I'm going to geocode some information," tends to mean operating from an address and producing a location; classically, an address goes in and a particular coordinate in space comes out. Reverse geocoding looks up addresses given coordinate values.
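A minimal sketch of setting up that geocoder and running a forward geocode. The user_agent string and the query are my own placeholders, and Nominatim's usage policy asks for a descriptive user agent and a light request rate:

```python
from geopy.geocoders import Nominatim

# the user_agent string identifies your application to the Nominatim service
coder = Nominatim(user_agent="scipy-geospatial-tutorial-example")

# forward geocoding: text in, coordinates out
location = coder.geocode("University of Texas at Austin")
print(location.address)
print(location.latitude, location.longitude)
```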
In our Airbnb data we have a bunch of coordinates, a bunch of longitude and latitude values, and we might want to know the address, or the closest address, for each of those listings. If we do that, we would be reverse geocoding, because we're going from actual coordinates into textual data. There are a bunch of options for how to conduct that query, but the main point is that the coordinates you wish to obtain the closest human-readable address for are the very first argument. That needs to be a pair in latitude-longitude order (note that this is y-x order, not x-y order), and then we code that point for that Airbnb listing into the closest address. This is done by taking those coordinates, sending them to the OpenStreetMap Nominatim service, and then pulling back down some representation in text. So this very first listing in our data frame says that it's at 314 West 37th Street. Now, this also gives you information about the raw content of that request; that raw content can give you additional information, things like the bounding box of where the service thinks the address is, or the OpenStreetMap neighborhood where the point lies. So if you ever want to access the raw contents from any of these APIs, you use the raw attribute; Nominatim here provides that amount of information.

Since what we want to do is repeat this for every single row, maybe to get a textual representation of where our Airbnbs are, we can do the same trick we saw before and define that operation using an apply method. Now, in pandas, when you apply a function to a data frame, it ends up operating column-wise by default, and what we want to do is take the latitude and longitude of each row and send that to our query. So, kind of like how numpy works, we have to use the axis argument to make sure that we operate over the pairs inside each row rather than operating down each column. Here, what's happening is we're getting the reverse geocoding of each coordinate, grabbing the address, and making sure that operates on each row's coordinates. When you run that, it makes these 20 requests using the default settings of the geocoder and returns a list of addresses corresponding to those coordinates... and we get a timeout, because the internet in this space probably isn't too nice; the service timed out. In general, on better internet this would run fine and return all of the textual representations, just like the original one here, where we'd get the full address string of each request.

Now, we've been working backwards; when you want to turn an address string into a location, you just use the geocode method. So if I wanted to geocode the actual textual representation of a location, then using geocode you get the same kind of object back from geopy: if you use reverse, you get a location whose textual representation corresponds to the point, and if you use geocode, you send the textual representation and at the end of it you get back your latitude and longitude, here roughly 30 and -97. So working with geopy lets you do this kind of conversion from latitude and longitude into addresses, or from addresses back into latitude and longitude, using these geocode and reverse methods. Of course, you want to be careful, because there are a ton of conditions on how you're allowed to use this service that you have to respect to ensure proper usage.
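Sketched in code, the single reverse lookup and the row-wise apply look something like this; the 20-row limit mirrors the demo above, and in practice you'd want to rate-limit and handle timeouts per Nominatim's usage policy:

```python
# one reverse lookup: (latitude, longitude) order, i.e. y-x rather than x-y
first = listings.iloc[0]
print(coder.reverse((first.latitude, first.longitude)).address)

# the same thing row by row; axis=1 makes apply work across each row's pair
addresses = listings.head(20).apply(
    lambda row: coder.reverse((row.latitude, row.longitude)).address,
    axis=1,
)
```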
So what I would encourage you to do is play around with some of the coordinates in the data frame and try a couple of requests yourself; see if you can't generate a couple of correct geocodes from the coordinates in the listings data set. You would want to use reverse to get back the address, and then try geocoding that address again to get back to the latitude and longitude. Let's do this for about five or six minutes, so everyone gets a functional feel for some of these coordinates, and we'll walk around answering any questions. And if the internet doesn't work, let me know and we'll move on, because it sounds like everyone's getting timeouts. So, go ahead.

Here we're going to be doing some cleaning of the data. This price data, again, comes as plain textual information: the prices are listed the way they would look on a website, with a currency indicator and commas separating the thousands. What we need to do is parse that into a different representation of the string, one with no information other than things that Python can understand. There are a lot of different ways to do this kind of string substitution; what I'm showing you here is one that separates it into individual operations, so you can understand at each step what is being substituted inside the string. What we want to do is take each price and remove that dollar sign, so here I'm just suggesting that we replace any instance of a dollar sign with an empty string, which says drop any dollar sign in the price. You could also do this using an lstrip or an rstrip command, which would strip only characters a string starts with or ends with; there are a lot of different ways to do this. The .str accessor, I should be sure to mention, makes sure that pandas applies string methods to every cell that you're talking about. So here I'm able to use this .replace, and this .replace is referring to a replace on the string; it's not the replace operation in pandas, which would take a cell value and replace it with another cell value. We're going to use the string methods to swap out dollar signs, commas, and things like that for stuff we can parse: first we swap out the dollar signs, then we swap the commas into a thousands separator that Python can understand. Some people haven't seen this representation before, but in Python you can use underscores in the middle of numbers as thousands separators, to give them visual chunking. Any listing that doesn't have a price, where the price information is just an empty string, should get replaced with the correct missing-data indicator, so we replace that with a NaN value, and then we convert the price into a float. Each of these things operates in sequence: swap out the dollar signs, swap out the commas, replace the missing values, and then convert to float. So now we've turned our data, which looked like string representations of prices, into numeric data we can work with; we're moving all the way from flat text information down to data we can do some analytics on.
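As a sketch, that cleaning chain might look like the following. I've used regex=False so the dollar sign is treated literally, and dropped the commas outright rather than swapping in the underscores described above, which Python's float parser would also accept:

```python
import numpy as np

listings["price"] = (
    listings.price
        .str.replace("$", "", regex=False)   # drop the currency indicator
        .str.replace(",", "", regex=False)   # drop the thousands separators
        .replace("", np.nan)                 # empty strings become missing data
        .astype(float)                       # finally, numbers we can analyze
)
```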
So now that we've done that, we can make these histograms of nightly prices in Austin, and if you look at this scrape there's one listing with a nightly price around $10,000; an Airbnb at $10,000 a night. One of the nice things about this data is that you can actually use the listing URL to look back up what that scrape corresponds to, so here in our data we see that the Arrive Sapphire on Lake Austin is the most expensive listing, and when you actually look at that listing on the web, this is what you get: 9,000 bucks a night, 12 beds, very nice space, right? So our data includes everything from things that are very cheap all the way up to this. For most analytical purposes we don't want to deal with these extreme outliers; we want to deal with the majority of the data. You could use outlier-robust methods here, but for some of the later analytics we're just going to say we're simply not interested in the Arrive Sapphire, and to filter in this manner you can use the quantile functions in pandas to split up your data. Here we're just taking the middle 98% of the data, dropping the top 1% and the bottom 1% of prices, and when you do this you still get a pretty unequal distribution of prices (oh, that's unintentional zooming there), a pretty variegated distribution of prices, but none of them are quite like the Arrive Sapphire. So we're just filtering to get focused; you could use outlier-robust analytics instead, so don't worry about this, I'm just doing it here for the purposes of the workshop.

Now we get to using basemaps. Basemapping is useful because we want to be able to make pretty maps, and in Python there's a relative newcomer library that's effective for making basic matplotlib maps with geographically correct basemap information underneath them; that library is called contextily. contextily looks at web services that provide tiles, sometimes called tile servers, grabs the tiles from the tile server, and allows you to stick them underneath your map like an image. So what we're showing you here is how to do this kind of underlaying of images beneath vector data from geopandas; you could have your own images that you want to underlay, but here we're mainly going to talk about basemapping through contextily. The first thing I'm going to recommend we do, because we're using web tiles, is convert our data into the web Mercator projection; doing this lets us talk directly to the web tiles. Learning how to reproject the actual web tiles, and all the rest of that, would be an additional collection of topics about reprojecting raster data, and we're not going to hit that just yet. So first off, we reproject this into web Mercator, and what this does is provide an alternative representation: before, we had our latitude and longitude, and when we use to_crs, it takes the data and converts every listing into a new listing with new coordinates. You can use this in a way that allows you to method-chain; if you wanted to convert the data into a coordinate reference system just to do one operation, you can always chain off the reprojected data frame. But here we're just going to keep the web Mercator projection around for doing some mapping.
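A sketch of those two steps, trimming to the middle 98% of prices and reprojecting to web Mercator (EPSG:3857); the variable names are my own:

```python
# keep the middle 98% of prices: drop the bottom and top 1%
low, high = listings.price.quantile([0.01, 0.99])
in_range = listings[(listings.price > low) & (listings.price < high)]

# reproject into web Mercator so the data lines up with web tiles
listings_webmercator = in_range.to_crs(epsg=3857)
hoods_webmercator = hoods.to_crs(epsg=3857)
```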
When you make this, the geopandas GeoDataFrame has a total_bounds attribute, and total_bounds is basically the bounding box of everything in that geo data frame. The nice thing about it is that it's computed on the fly: if I subset my data, or change what observations I'm looking at, then total_bounds looks only at the data I'm talking about and gets the maximum extent from that data. This means you can use total_bounds to get a rough indicator of the bounding box of all the data inside your geo data frame at the time. contextily is a library that takes that bounding box, grabs all of the web tiles that intersect with it, clips them together into a single image, and returns that image together with its geographic extent. To do that, you have to specify those coordinates, the west, south, east, and north corners, and then you have to specify a zoom level. For those of you who have worked with web mapping before, the zoom level is kind of like how much detail there is on the basemap and how tightly it should be displayed. So here we're using something called splatting, or unpacking, to take that total_bounds information and convert it into the format contextily expects, each individual coordinate rather than a list of coordinates, and we're telling it to grab a zoom level of 10. Any time you have an iterable, you can use that star to say: look inside that iterable and pass each element; this is a nice way to work with ordered data in Python, using these kinds of unpacking operations. Using contextily and making this web request, we always get two objects back: you get the basemap, and you get that basemap's extent. So here you see the basemap, and it isn't a very pretty basemap indeed; it's actually a numpy array. The numpy array is multi-dimensional, it contains standard RGB bands, and the extent is just another iterable that contains the corners of the image, telling you where each pixel sits. Now, this is kind of confusing sometimes, because this extent is in a different order than the extent we sent contextily, and that's because in geographic analysis it's common to give coordinates in terms of two corners, while in matplotlib it's common to give coordinates in terms of left, right, bottom, top (or top, bottom; I can't quite remember). But the point is, basically, you've got one format that's interleaved x values and y values, and another format that's xmin, xmax, ymin, ymax, so just so you know, there's going to be a lot of that kind of conversion between the two when you're working between geographic data in raster format and matplotlib.

For plotting in geopandas, there are a lot of different tutorials on this, but one of the nice things about geopandas is that the plot method is built in and allows you to do some really simple choropleth mapping. If I have spatial data and I want to make a map, I can just say: give me the geographic representation of that data. That works for points, and it works for polygons; it works for anything. It'll take that data and give you a plot of it just fine, and you can make very simple choropleths, because by default it'll handle the coloring and styling for you.
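The tile grab described a moment ago looks roughly like this in code; the zoom level of 10 comes from the walkthrough, and the ordering of total_bounds (west, south, east, north) happens to match what bounds2img expects:

```python
import contextily as ctx

# unpack (west, south, east, north) straight out of total_bounds
basemap, extent = ctx.bounds2img(*listings_webmercator.total_bounds, zoom=10)

basemap.shape   # a plain multi-band numpy array
extent          # (left, right, bottom, top): matplotlib's imshow ordering
```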
So here what we're doing is actually making a composite plot and stepping through it line by line: we're making a figure that's big enough to contain the axis, and then in that figure we're plotting the boundary of the neighborhoods. What is neighborhoods.boundary? Well, every geopandas data frame has a bunch of different methods for accessing things about its geometry, and neighborhoods.boundary gives you the line that encloses each polygon. Using .boundary, you can start talking about spatial relationships that involve just the boundary points of those polygons. So here I'm using neighborhoods.boundary.plot and coloring them orange-red, and then I'm putting the listings inside of those as small points and coloring them green, and when you run that plotting call you get this kind of figure. Just as an aside, you can also focus on the boundary of shapes by doing facecolor equals none and, I think, edgecolor equals orange-red, but I prefer thinking about the boundary solution because it gives you more of a sense of exactly what's happening (take the boundaries off the polygons, then plot those boundaries) instead of thinking about how matplotlib parameterizes the plotting of patches. If it's easier for you to think of facecolor equals none, do that. So you can make these composite maps using multiple different geo data frames in the same coordinate reference system.

To add images, you use the imshow command, which sticks the image below the other stuff you're going to plot on top of it. Here we've gotten the basemap from contextily, we've gotten its extent, I'm using interpolation so it looks a little prettier than normal, and then we're doing the same call from above: we want orange boundaries and green listings, but at the very start we're using imshow to stick the basemap below. We got the basemap from contextily, and now you've got that spatial information here in a map with a pretty basemap. So that's the basics of basemapping: you use imshow to provide the basemap, you use the boundary plot to send the boundary information, and you can do this for any arbitrary number of things you want to plot.

Using further features of contextily, you can grab a bunch of different kinds of basemaps. contextily by default has built-in tile providers: a bunch of OpenStreetMap providers and a bunch of Stamen providers, and that's just what's built in. All of these are URLs that you can use in the contextily bounds2img function to grab different basemaps. So here I'm grabbing the Stamen Toner basemap and making a map in the same way as before, but using a different image for my basemap. There are a bunch of different ones available, and you can also use an arbitrary URL, if you know of a tile server that you like; I used to use Carto tiles, and the Positron tile is really good for digital cartography. If you want to use a tile server yourself, it has to be formatted in this format, which is slightly different from some of the standard ways (if you go to the OpenStreetMap tile server listing, it's a slightly different format), but then if you pass that URL, it queries that server for tiles and you can make a map with those basemaps.

I have a stub here for using geopy and contextily to make a map of your hometown, but we're going to be running out of time if we continue at this pace. Well, actually, no, we'll take this. So, in the next five to ten minutes, try geocoding an area that you know, use the bounding box from that location to get a web tile, and then try to use that web tile to make a map; just show that web tile.
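A sketch of one way that exercise might go. The place name, the padding around the point, and the zoom level are all placeholders of mine, and note the ll=True flag, since the geocoded point comes back in longitude/latitude rather than web Mercator:

```python
import matplotlib.pyplot as plt

place = coder.geocode("Bristol, United Kingdom")   # any place you know

# pad the point out to a small lon/lat box, then let contextily fetch tiles;
# ll=True tells bounds2img the bounds are in longitude/latitude
pad = 0.05
img, ext = ctx.bounds2img(place.longitude - pad, place.latitude - pad,
                          place.longitude + pad, place.latitude + pad,
                          zoom=12, ll=True)

plt.imshow(img, extent=ext)
plt.show()
```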
Okay, so that might look something like: coder.geocode of, I don't know, somewhere you know, and then with that latitude and longitude, contextily bounds2img; you'd want to say give me the area around that, so give a little bit to the left and right, and then use plt.imshow to show the map. So for the next five or ten minutes, try to piece those three together (you might have to use the ll=True argument in contextily) and make a basic image of some location that you know about. Like before, raise your hand if you have questions on how to string these three together. It's all right if you didn't get this immediately; there will be a part in the problem set and the exercise where you'll be building some things from data as well.

So, there's a part in this notebook about building up areas for regionalization or for visualization, and in this case we're building up a boundary over a bunch of different spatial operations. I think it actually might make sense to take a break in about five minutes... or, let's just take a break now, regroup here at three, and then we'll go through the last bit of this notebook and then do the problem set, rather than what I said before. I think it makes sense to take the break now, because this is before we pivot into more complicated geographic operations: queries, group-bys, geographical unions, things like that. So let's take a break now until about three o'clock, and then at three o'clock we'll work back through these group-bys, joins, and other things. All right, back at three, and we'll move to the next part of the notebook.

In the interest of time, we're going to move quickly through the building-areas section and go directly to the spatial joins part, where we talk about how to relate spatial datasets together; we'll also talk about querying for which objects are nearer to which other objects. But to do that we still need to run the building-up-areas-for-visualization component, so if you're comfortable, go ahead and run all of the cells using Shift-Enter, from "building up areas for visualization" down to the part on spatial joins, or you can follow along with me instead and do that run on your own time. It's just that we want to make sure all the material gets covered; there are exercises for all the stuff we've posted online for the workshop, and we're very responsive when things are posted to the repo, so if you work through stuff we don't cover and have questions about it, feel free to ask in the Slack or, preferably, on the GitHub repository, and we'll answer questions there.

What we're doing in that collection of code is taking the listings we've got and building polygons out of those listings based on where individuals claim they live. We take the coordinates of every single Airbnb and say: okay, if this person says they live in this neighborhood, and this person says they also live in that neighborhood, group them together, and then try to build a polygon out of all the places people say are the same neighborhood. At the end of that process we get a collection of neighborhoods in Austin from the Airbnb data, drawn in blue. So that whole part walks you through how points can aggregate together into polygons, either using convex hulls or better, but slightly more complicated, hulls.
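The simplest version of that aggregation, a convex hull per claimed neighborhood, might look like the sketch below. The neighbourhood column name is an assumption about how the listings label the neighborhood people say they live in, and the notebook's fancier hulls would replace the convex_hull step:

```python
# group the points by the neighborhood people claim, merge each group's
# geometry, and wrap each merged group in its convex hull
claimed = listings[["neighbourhood", "geometry"]].dissolve(by="neighbourhood")
hulls = claimed.convex_hull   # one polygon per claimed neighborhood

hulls.plot(color="blue")
```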
It walks you through how to take a point data frame and, out of that, construct polygons from a group-by. That's a generic operation in GIS known as moving, or changing, supports, and in this case we're moving supports through a group-by operation.

Okay, at this point we're working with spatial joins. Now that we've got two polygonal datasets, the Airbnb neighborhoods in blue and a set of official boundaries in red, we might want to quantify how those two sets of polygons, or the points underlying them, relate to one another, and this is often done using spatial joins. There are spatial joins and spatial queries, but here we're mainly going to talk about spatial joins. Spatial joins are useful when you're asking questions about relationships between geographical or geometric objects: how many Airbnbs fall within this neighborhood or that neighborhood, or how many Airbnbs that claim they're within this neighborhood are also within some other neighborhood. We'll build some contingency tables and some simple counts, but in general all of this operates using formal spatial queries and spatial predicates. The full implementation of this is known as the Dimensionally Extended nine-Intersection Model; if you've done geographical queries before, you may have seen or heard about that set of relationships, expressed in terms of predicates about shapes intersecting, being contained within, or crossing one another. Here we'll mainly work with the three predicates supported by the geopandas spatial join: one geometry intersecting another geometry, which means they share any points in common; one geometry containing another, where all of geometry B has to fit inside geometry A; or the flip of that relationship, within, where all of geometry A has to be inside of, or contained by, geometry B. These three operations are accessible through the sjoin function in geopandas, and geopandas spatial joins relate two vector datasets together in this manner.

So earlier we had our listings and our neighborhoods (these two... sorry, .plot, I'll do this in the other order). We had two pieces of data: the neighborhoods, and then the individual listings that sit inside those neighborhoods, and we made a couple of different maps of that on the fly. To count the number of listings that fall within a neighborhood, we would use a spatial join, and here that spatial join is saying: give me all of the listings, all of the things on the left, that fall within the polygons, the features on the right. Spatial joins in geopandas allow us to do this kind of operation. When you run this operation of listings within hoods, you get a geo data frame that looks pretty much like listings, with all the information from our original Airbnb listings, but then over at the end we have information about which neighborhood each listing fell in.
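That call is roughly a one-liner; op= is the keyword this generation of geopandas used for the predicate (newer releases call it predicate=), and both frames need to share a coordinate reference system:

```python
# every listing, tagged with the neighborhood polygon it falls within
listings_in_hoods = gpd.sjoin(listings_webmercator, hoods_webmercator, op="within")

listings_in_hoods.head()   # original columns plus index_right and the hood's columns
```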
If we join listings to neighborhoods through the spatial join, each listing gets the index and any other relevant column from that other data frame, every single bit of information about the neighborhood the listing falls within. This spatial join is denominated in terms of the left data frame (all of the listings), and then it takes all the elements from the right data frame that apply to each row on the left. So if one Airbnb were within multiple neighborhoods, depending on how you configure it, you'd get multiple rows for that listing; here every Airbnb is in one neighborhood, and the neighborhood's data is propagated into index_right, which is the index in neighborhoods that the listing falls in, and then the ID of that neighborhood. Does that make sense? When you make a call to the spatial join, which listings are within which neighborhoods, that's how it works.

So what this does, presented nice and cleanly, is allow us to speak about how the neighborhood in our original listings data and the official neighborhood relate to one another. We had neighborhoods in the original listings data with names like University of Texas or West Campus, and then we have information coming from an official neighborhood boundary with names like 78705, and this reflects their different provenance: one of these is designed for marketing purposes, trying to give people an idea about where the listing they're trying to rent is, and the other serves an administrative function. So if we wanted to, now that we've related our points data and our neighborhood data, we could try to figure out the relationship between those two polygon sets: we can get a contingency table of, if you're in Allandale, you might be in these other two neighborhoods in the official data. Likewise, if we just wanted to know how many listings fell within a single hood ID, you could do something like this: a group-by, which tells you what polygon each point was in and then gives you the count of all the things that fall in there. I'm calling out price, but you could use ID, it doesn't matter; and if you wanted the average price within a neighborhood, it then becomes a standard pandas group-by operation. So using the sjoin up here to go from two data frames to a single data frame that expresses their spatial relationships lets you treat the result as before, just like a standard geo data frame.

I don't think that's right... so in that case you can use intersects as your predicate, but there's no default way to do that. If you wanted to filter it, you could access the .area attribute of every row and then pick the one with the biggest area, but that would have to be a secondary group-by to figure out, for my original index in the left data frame, which match on the right has the largest area. There are multiple ways to go through that.

So here we're building a contingency table between the neighborhoods the listings say they're in and the neighborhoods this other dataset would say they're in. Now that we've done the spatial join, we can just work with it as a standard pandas table, and this contingency table is built using the same sorts of group-bys and unstacks that you may have learned in a pandas course. So this allows us to move from points through to polygon representations and talk about structural relationships between those things.
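In code, those counts and the contingency table are plain pandas once the join exists; column names like hood_id and neighbourhood are assumptions standing in for whatever the two datasets actually call them:

```python
# how many listings fell within each official neighborhood
counts = listings_in_hoods.groupby("hood_id").price.count()

# average nightly price per official neighborhood
mean_price = listings_in_hoods.groupby("hood_id").price.mean()

# contingency table: claimed neighborhood vs. official neighborhood
crossed = (listings_in_hoods
           .groupby(["neighbourhood", "hood_id"])
           .size()
           .unstack(fill_value=0))
```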
Further, we could talk about how many unique names, how many different place names, there are in any single government neighborhood, and doing that involves another kind of group-by: from our listings-in-hoods frame we can aggregate up everything we see in each government neighborhood, again using pandas' fundamental constructs. Group by the neighborhood people say they're in and give me its unique names; these are just default pandas operations. The standard way to work with these spatial joins in Python, more generally, is to do the spatial join, get the object that gives you all of the structured relationships, and then do your aggregations or operations later on. So if you had to filter based on, say, which match has the most area, or which has the highest median price, or something like that, you're always going to run the spatial join to get the relationships and then process it afterwards using pandas. Most of the time, the actual important spatial operation is only done in the spatial join command itself.

Moving on from this part: lots of this is pandas manipulation. Here we're grabbing the distinct names, finding out how many names fall within each of these other polygons, and this is again all pandas; it is not spatial manipulation at all. We do this group-by, find the unique names, take the len, and merge it back, and then here we make a choropleth of how many neighborhoods people say they live in within each government hood. So we're constructing this polygonal relationship between the listings and the government boundaries, and the rest is all pandas manipulation. At the break we were talking with some people who wanted to get into more explicit spatial statistics, so we're going to move on a little more quickly to make sure we get to the rest of the material.

Here is an example where we're asking: what are the properties of my nearest neighbors? You might have heard of nearest-neighbor queries. Spatial joins give you a set of operations between two different spatial supports: does A intersect B? Well, then they're considered related. Is A within B, is my listing within this neighborhood? Then we'll consider them related. Nearest-neighbor queries are a bit different: nearest-neighbor queries, or distance and range queries, are structural relationships between observations that try to communicate things like, what's the average price of the nearest five listings, or what's the average speed limit of the next five roads on this path? These kinds of operations tend to use search areas called buffers. If you've done GIS before, you've probably heard about buffers: buffers take a geometry and expand it by a particular radius. A buffer around a point is shown here; at each red location we're increasing our search area by 2,500 meters, and that increased search area is shown in blue. You can buffer things in geopandas using the .buffer method; a lot of these geographic and geometric primitives are already built into the geopandas data frame, so, just as an example, listings.buffer is the way to take a geographic data frame and expand all of its geometries by a particular pre-specified amount. Buffers come up quite a bit in geographic analyses.
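As a quick sketch of the buffer step (the radius here is arbitrary, and the frame needs to be in a projected CRS like web Mercator so the units are meters):

```python
# a 500 m search disc around every listing; each result is a polygon
search_areas = listings_webmercator.buffer(500)

ax = search_areas.plot(color="lightblue")
listings_webmercator.plot(ax=ax, color="red", markersize=2)
```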
Let's say we had a query like: what's the average price per head within, say, a thousand meters of each Airbnb? Just to do a small example we're going to focus on neighborhoods in the downtown, so we subset to only the downtown listings, and that's purely for computational reasons: it gets pretty difficult to do this at large scale. Then, if we buffer each one of those, you'll see something like what we saw before, where each observation has a large search radius around it and the center of that search radius is shown in red. If we made the radius smaller, say we called it a hundred, the buffers would be smaller; that's what the buffer method is doing. All of these new shapes around the points, all of these big circles, are polygons, and stacking them together we can use some pandas machinery: we're building a new data frame from the search buffers and giving it the price per head, computed as the downtown listings' price divided by the downtown listings' accommodates column, which holds how many people the Airbnb can accommodate, so you get this indicator of price per head. Oh right, the data has already been converted, so the units work out.

Now that we have these search buffers and we have our downtown listings again, we can do a spatial join and ask: what is the average price of all these other listings, using these search buffers? When we do that we take all of the search buffers, look back through all of our listings, find every single listing that falls inside each buffer, and create another data frame expressing the spatial join. It looks just like the one we constructed before, but it now uses the intersects operation, and at the end we get the index of the other Airbnb it matches to and the price per head at that Airbnb. So by stacking buffers and spatial joins together you can answer these kinds of questions: buffering gives us a polygon that says anything within this area should be considered related to my focal point, and then the spatial join collects everything back up so we can aggregate.

When we do the buffering and compute average prices with that group-by, you get a data frame where, after grouping by the original listing ID, you can average over all the other listings whose buffers intersect the original listing and aggregate by price. So if I group the buffer join by the listing ID and pull out one chunk, just to show what it looks like: each chunk is one listing (that ID is unique, so it's one Airbnb), and index_right, from our spatial join, is the index of every other listing within a thousand meters. When we aggregate the price per head over all those listings, we get the average price per head within a thousand meters. Again, once the actual spatial operation is done, all the rest is pandas group-bys and descriptions of that original spatial structure.

Now, some of you may be interested in performance. I'm not going to walk through this, but there was a reason we filtered down to the downtown listings: the Airbnb data set overall is about 11,000 observations, and waiting for this buffer-and-join to run on a laptop for 11,000 observations is quite slow.
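A sketch of the buffer-plus-join pattern just walked through; downtown, price and accommodates are assumed names from the notebook, and a 1,000-metre radius is used as in the example:

```python
import geopandas as gpd

# price per head for each downtown listing (column names are assumptions)
downtown = downtown.copy()
downtown["price_per_head"] = downtown["price"] / downtown["accommodates"]

# a 1,000 m search polygon around every listing
buffers = gpd.GeoDataFrame(geometry=downtown.geometry.buffer(1000), crs=downtown.crs)

# which listings fall inside which search area?
hits = gpd.sjoin(buffers, downtown[["price_per_head", "geometry"]],
                 predicate="intersects")

# average price per head within 1 km of each focal listing
avg_pph_1km = hits.groupby(hits.index)["price_per_head"].mean()
```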
Using SciPy's spatial module there are high-performance ways to compute these spatial relationships; the KD-tree and its Cython implementation are really efficient for this. I detail that at the end of the notebook if you want to go through it, but for time purposes I think we'll skip the high-performance part and leave it for another day. If you'd like, read it on your own time or ask questions in the repository; we do detail how to scale this kind of thing up, and that's why we shrank the problem down when we asked about the average price per head in the downtown area. The end of this notebook says a lot about how to do that price-per-head work.

We've come to a bit of a juncture here. There's a problem set we could work on: you solve it and we go through it together, and it basically checks your understanding of the main concepts from the previous section. Or we could take a short break to retool and then move on to the statistics. If you'd like to do the problem set, raise your hands. Okay, let's move on to the statistics.

Cool; this is Serge taking over. Just to reiterate: the problem sets also have solutions in the repository, so you might want to look at the solutions after you bang your head against a problem for a little while, but don't look at them initially. If you're like me, you don't learn much just reading the solution; you have to bang your head a little first. We're going to shift gears here, once I get this going, and move to the GDS3 notebook, the one on geography as a feature. Much of what Levi covered in the first sections is basically deterministic spatial analysis: you're looking at the spatial relationships, the intersections, joins and containment, between two geometric primitives, and the Shapely library under the hood is the key piece GeoPandas uses to implement that. But that's not the extent of geospatial data analysis. Very often we have attributes for each of these geometric units, be they points, lines or polygons, and we want to understand the statistical distribution of those attributes and how it relates to the spatial distribution of the features; spatial analysis is really about putting those two things together. So, armed with the basic concepts of deterministic spatial analysis, we now want to sneak up on probabilistic spatial analysis, that is, geospatial statistical analysis that takes the geometric properties of the data into account. That's a mouthful, but in machine-learning parlance we're going to start treating geography as a way to create new features, new variables that express spatial relationships, which we then incorporate into the statistical analysis.

We're importing a bunch of libraries, many of which Levi already talked about. The new one we'll deal with in the second part of the talk is a library called PySAL, which we maintain and whose goal is spatial statistical analysis in a variety of flavors. We assume we have things like GeoPandas and Shapely to handle the geoprocessing and get the data into the format we need; when PySAL started, a long, long time ago, things like GeoPandas and Shapely didn't exist, so we spent a lot of time writing shapefile readers just so we could bootstrap the statistical part.
The world is a much richer place now and we have excellent geoprocessing libraries. We're going to use some of the same data we've already built up; we cleaned the data and changed the support from points to polygons, and we'll read some of it back in, starting with the listings data we saw earlier this morning. We pull contextily in and do some plotting so we know what the context is: two layers, the point layer for the listings, using color to denote the price of each listing, and the neighborhood boundaries, the polygons. If you're an urban economist interested in housing markets and in what explains the spatial distribution of those colors, you want to leverage the methods we're talking about this afternoon.

One of the key constructs in spatial analysis for this type of data is the so-called spatial weights, which let us express potential relationships between the spatial observations, whether they are polygons, points, or segments on a network. You can think of spatial weights as an adjacency matrix, if you're familiar with social networks: say we have n polygons, 100 polygons, then the weights form an n-by-n matrix where the value in row i, column j tells us about the potential spatial relationship between polygon i and polygon j. Two things to note. First, what those values are is yet to be determined: are they binary, are they continuous? We'll cover that. Second, we never actually store the full n-by-n matrix, because many of the values would be zero and we don't need to store those; we leverage sparse representations of these weights matrices, though conceptually you can think of them as n-by-n matrices saying this is a neighbor of that, based on some criterion. In PySAL we've spent quite a bit of time making these efficient and scalable, but under the hood the attributes of a spatial weights object start from two dictionaries. The neighbors dictionary tells us, for each unit, say each polygon, who its neighbors are: the key is the ID of the focal polygon, and the value is simply the list of IDs of that polygon's neighbors. In this contrived example we create a neighbors dictionary for three polygons labeled a, b and c: a is a neighbor of b, b is a neighbor of a and c, and c has only one neighbor, b, so you can think of three polygons in a linear row. That expresses who is a neighbor of whom; that's the neighbors dictionary. We also need the values of the weights, the values that would go in row i, column j of the weights matrix if we actually stored the whole thing, and that's up to us. Here we're saying the weight for the relationship between a and b is 1 for that neighbor pair; but for b, whose neighbors are a and c, we assign a different value to each join: 0.2 from b's perspective for its relationship with a, and 0.8 for b's relationship with c; and conversely, for c's relationship with b the weight is 0.3. So we have these two dictionaries.
The neighbors dictionary tells us who is a neighbor of whom, and the weights dictionary tells us something about the value, if you will, of each particular neighbor-pair relationship. In this case the pairs agree on whether they're neighbors, but the weights for a given pair are not equal-valued, so they're not symmetric; we'll see examples where the weights are symmetric and others where they're not. That's a toy example just to get our heads around the underlying data structure of the weights class in PySAL, and it wouldn't be much use if you had to build it by hand for a map with, say, eleven thousand points for the listings; writing that out manually would be a royal pain. So PySAL now leverages GeoPandas and related tools: we can read the different spatial formats Levi mentioned earlier and build the weights from the original data, and we'll do a couple of examples of that.

Finishing up the toy example: lp is libpysal, we call the weights module, and the class is W, uppercase W. If we pass in the neighbors dictionary we get back a binary weights object, an instance of the spatial weights class, and its weights attribute is a dictionary holding the weights. We could also pass in both the neighbors dictionary and the weights dictionary, and then the weights are no longer binary as in the first case. So: if you only pass the required argument, the neighbors dictionary, you get binary weights back; if you don't want binary weights and you have the information, you pass it in as the weights keyword.

There are different ways to get the values for those weights; it's an industry, people come up with new ways all the time. We'll start with the simplest, contiguity weights: whether i and j are neighbors is based on a contiguity criterion. That seems simple at first glance but is more complex once you peek under the hood, and here we're going to leverage Shapely and GeoPandas. Backing up for a second: contiguity, do things touch? Conceptually simple, but programmatically not so simple: what do you mean by touch? In deterministic spatial analysis, if we're dealing with shapefiles where one polygon is represented as a bunch of x,y coordinates and a second polygon has its own x,y coordinates, the contiguity criterion uses those coordinates in different ways. The simplest way to think about it is a checkerboard or chessboard: the different notions of contiguity correspond to the ways certain pieces can legally move. Take the center cell of the board: which ways can a queen move, how many moves can you make from the queen's position, in what directions? North, northeast, east, southeast, and so on: the queen can move eight ways. So queen contiguity for that center cell would be any of the polygons that share one of the vertices or an edge of the focal cell. Rook contiguity follows the rook's moves, north, south, east and west, so the rook does not get the off-diagonal, corner neighbors.
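A sketch of the toy weights object described above, assuming libpysal is installed; the letters and values mirror the two dictionaries just discussed:

```python
from libpysal import weights

# who is a neighbor of whom: three polygons in a row, a - b - c
neighbors = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}

# the value attached to each neighbor pair, in the same order as above
weight_values = {"a": [1.0], "b": [0.2, 0.8], "c": [0.3]}

w_binary = weights.W(neighbors)                         # weights default to 1
w_valued = weights.W(neighbors, weights=weight_values)  # user-supplied values

w_binary.weights   # {'a': [1.0], 'b': [1.0, 1.0], 'c': [1.0]}
w_valued.weights   # {'a': [1.0], 'b': [0.2, 0.8], 'c': [0.3]}
```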
Working that back to the geometries themselves: in real-world maps, unless you're working with rasters, your vector data are not regular, same-sized or same-shaped polygons; think about the counties of the US, they're very different animals. So we want to map this notion of queen and rook contiguity to an irregular lattice, a map with irregular polygons. For queen, we simply look for two polygons that share at least one vertex, one x,y (or lon,lat) pair, and they're queen neighbors. For rook it's more demanding: they have to share two consecutive points that define a common edge.

PySAL lets us build a queen-neighbor spatial weights object by passing in a data frame, the neighborhoods data frame we've been playing with today; we just read it in and it's the argument to Queen.from_dataframe, and that gives us a queen contiguity spatial weights object. It does the hard work for us: it reads each row of the data frame, looks at the geometry column, builds a spatial index, checks which polygons it should test for a shared vertex (one or more), and when it finds them it builds the weights and neighbors dictionaries and populates the object. We can plot these to see what it looks like: we have our polygons, the neighborhoods; we use the centroid of each polygon as a representative point and draw an edge between the centroids of queen neighbors. That's our adjacency graph based on queen contiguity for the neighborhoods in Austin. Questions on that so far? Yes, these are undirected; we make no distinction between a-to-b and b-to-a.

Other handy things the W object does: it checks whether you have disconnected observations, which are called islands. You could have a true island; some of the county geometries for the US, say, include the Channel Islands in California as separate polygons that don't share vertices with the mainland, so they're physical islands and geometrically they're islands too, meaning they have no neighbors in this weights structure, and that can cause problems for certain spatial statistics we'll get to. So you want to know if you have any islands, and if you do, there are ways to attach them before moving on to analyses where you cannot have islands. Fortunately for the Austin neighborhoods there are none: the islands attribute is an empty list. If there were islands, it would tell you which rows in the data frame don't connect under this neighbor criterion. Yes, for this constructor, Queen.from_dataframe, it's binary contiguity, so right now the weights are all ones or zeros.

That matrix is very sparse, but we can visualize the full matrix to get a sense of what it looks like. Using an image plot of the matrix, the axes are the IDs of our neighborhood polygons, and wherever there's a yellow cell, that represents a join, a neighbor relationship between i and j; the purple cells are the zeros, and that's exactly why we don't store the zeros: we typically work with the sparse representation because it's efficient memory-wise and compute-wise. How sparse is our Queen object? About 12% nonzero, so only 12% of the matrix has nonzero values, and that's actually a pretty dense weights matrix.
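A sketch of the contiguity builders, assuming neighborhoods is the polygon GeoDataFrame read earlier:

```python
from libpysal import weights

wq = weights.Queen.from_dataframe(neighborhoods)  # shared vertex => neighbors
wr = weights.Rook.from_dataframe(neighborhoods)   # shared edge => neighbors

wq.islands         # [] here: no polygon is disconnected from the rest
wq.neighbors[0]    # ids of the polygons adjacent to the first neighborhood
wq.pct_nonzero     # share of nonzero cells in the implied matrix, about 12%
```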
The nonzero values are all the same, all ones, binary, and we can check other properties of our graph, for those of you interested in graph theory. We have no islands, so every polygon has at least one queen neighbor, but you could still have multiple components in the graph: a bunch of connected polygons at the bottom and another bunch at the top. We want to know that too, because certain types of spatial analysis require a single connected component, meaning you can travel to any polygon by passing through neighbors; with more than one connected component that's not possible, you can't get from one group to the other through the contiguity graph. We can check that down here: we have only one connected component, and the component labels are all 0, meaning every polygon belongs to that first component (Python zero-offset); if there were multiple components you would see which component each polygon belongs to. I'll skip over this, but we've given versions of this workshop with other data sets (we usually switch the data to wherever we're giving the talk), and the last time Levi and I gave it was in Berlin, and Berlin's adjacency structure is very different, so there were multiple components.

There are different ways to represent the information in the weights object. We can convert the PySAL W object to an adjacency list, which can be useful for other downstream applications: here we create an adjacency-list data frame and look at its head. It has three columns: focal, which is the ID of the focal neighborhood polygon; the ID of one of its neighbors; and the value of the weight, which will all be ones in this case. We're exploding the weights object into a data frame, and this adjacency list, at least the top of it, is going to be handy for constructing other geographical features for the spatial analysis we'll do.

Here I read the listings data in. Levi cleaned it up earlier, but I didn't read in the clean version, so I'm going to clean it again (just skip over that) and assign the cleaned-up price to the listings data frame. Let me close this panel and make the notebook a little bigger so we've got more room. So we've cleaned up our price data and we have this new column in the listings data set; well, we hadn't put it in yet. Price, there we go, we add it in, and now we have the cleaned-up price for the listings. Using some of the techniques we covered earlier, we want the median price in each neighborhood: we're going from the points to the polygons and creating a new attribute for the polygons, the median price. We do that with a group-by: we group on index_right, the neighborhood index, and call the median method on the price column; for all the listings in a given neighborhood, get the prices, take the median, and return it into median_prices. That gives us a Series of length 44: there are 44 neighborhoods, and those 44 values are the medians of the listings in each one. Then we put that into our neighborhoods data frame: we take the median prices, convert them to a data frame, and merge with the neighborhoods data frame.
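A sketch of that change of support, assuming listings_in_hoods is the earlier spatial join of listings into neighborhoods and price is the cleaned column:

```python
# median listing price per neighborhood; index_right is the neighborhood row index
median_prices = listings_in_hoods.groupby("index_right")["price"].median()

# bring the 44 medians back onto the neighborhood polygons
neighborhoods = neighborhoods.merge(
    median_prices.rename("median_price").to_frame(),
    left_index=True, right_index=True,
)
```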
Plotting the merged result, what we see now, instead of the points nested in each of the neighborhood polygons, is a change of support: we've taken the median of the point listings within each polygon and used that to drive the hue, a choropleth classification for the map. The yellow is the highest median price (that neighborhood has the highest median) and the darker hues are lower values. Questions? This gives us a different visualization of the same Austin market than we saw with the point data. The point data is very rich, but eleven thousand points is a lot of information to process cognitively, and one reason to aggregate like this is to reduce the cognitive load, hopefully in a way that captures the overall structure without throwing away too much information. We are throwing away a lot by going to the median, but what we gain, hopefully, is the gist of the spatial structure of the market.

Once we have that visualization we want to start asking questions: what's the relationship between the prices in this neighborhood and in the neighboring neighborhoods? Is there a relationship, and is it constant over space? Is there so-called clustering in the data set, or are these maps random? Much of the rest of today is about methods that let us answer those questions from different perspectives. So, does this look random to you? Why not? Right, that yellow stands right out. And what does random mean to you in this context? Okay, you might expect more variety: more classes, more colors, more polygons with different colors, with their locations scattered. It's a hard thing to unpack, and we're going to try to unpack it for the rest of today; yes, we'll get to that for sure. The other things that make this difficult: first, we'll have to formalize what we mean by random; and second, visually it's complicated because these neighborhoods are not all the same size or shape, and our eyes get drawn to the large polygons even though, statistically speaking, each polygon is one observation. So it's hard to rely on visualization alone, and we're going to couple it with analytics.

What we're going to do is use that adjacency list, the focal neighborhood paired with one of its neighbors, then with another of its neighbors, and so on, to look at the relationship between the median price in the focal neighborhood and the median price in each of its neighbors, not all of them at once but each one. For example, in our adjacency list we've chained together a couple of merge operations to build more columns: the focal neighborhood is 0, one of its neighbors is 25, we know the weight from the queen criterion, we know the ID of the focal neighborhood from the original neighborhoods data set, we know the median price for the focal (center) neighborhood in the pair, and we know the median price for that neighbor. But each polygon may have more than one neighbor, and we want all the pairwise relationships: that same focal unit has three neighbors, 25, 3 and 21, and they don't all have the same price.
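A sketch of building that pairwise table, assuming wq is the Queen weights object and median_price the column created above (the weights' ids are assumed to line up with the data frame's integer index):

```python
adjlist = wq.to_adjlist()   # one row per neighbor pair: focal, neighbor, weight

prices = neighborhoods["median_price"]
adjlist["focal_price"] = prices.loc[adjlist["focal"]].values
adjlist["neighbor_price"] = prices.loc[adjlist["neighbor"]].values

# the new feature: how each neighbor differs from its focal neighborhood
adjlist["price_diff"] = adjlist["neighbor_price"] - adjlist["focal_price"]

adjlist.query("price_diff == 0")            # contiguous pairs with equal medians
adjlist.sort_values("price_diff").head()    # the largest drops across a border
```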
The focal neighborhood contributes the same price to each of those joins; what we're interested in is the relationship between the median price in the focal unit and the median price in each neighboring unit. So we can take the difference between the focal unit and each of its neighbors: for example, focal unit 0 and neighbor 25 have a price difference of -60, meaning the price is higher in the core; as you move to that neighbor it drops by $60 a night. We built this new variable up in the previous steps: a new feature, if you will, the price difference between the focal unit and the neighboring unit for each neighbor pair. And we can query it: there are some neighbor pairs with no price difference at all, neighboring neighborhoods whose median price is the same, which we can find with simple pandas queries, and at the other end we can find the contiguous neighborhoods where the price difference is large. Again, as Levi said this morning, we do the spatial operations to build new features that embed the geographical relationships, and then we use pandas queries to interrogate them.

So that's contiguity for polygons: do you share a border, and what do you mean by share? Sharing at least one vertex gives queen neighbors. We could have done the same thing for rook by changing the function we call, and the result would be slightly different because queen and rook have different sparsity properties; you typically get more neighbors with queen because it's a less stringent criterion than rook.

As we saw earlier, one of our data sets is the listings, the points themselves, and we might not only be interested in the neighborhood level, the relationship between the median price in this neighborhood and in neighboring neighborhoods, but in the individual listings: what's the relationship between the price of this Airbnb and the ones that are, quote, neighbors to it? That's what we'll talk about now. We'd like to use the same notion of contiguity but for points, and build a spatial W object for points. There are multiple ways to do this; we're going to use what are known as Voronoi polygons. Here are the 11,000 points, colored by price, and we want to consider how the price of one Airbnb may be associated with the prices of the neighboring Airbnbs; we're down at the micro level now, not the macro neighborhood level. PySAL has a function called voronoi_frames. Are people familiar with Voronoi polygons? For a point set, they partition the plane into mutually exclusive polygons where each point is the so-called generator of its polygon: each point gets a polygon around it such that any location falling inside that polygon is closer to that generator point than to any other point in the data set. Looking at the pictures should make this clear: here are our neighborhoods drawn on top of the Voronoi diagram. The cells run out to where the edges intersect, and some of them go a very long way; this little black dot corresponds to this whole area, it's a very different scale, so we'll have to clip the diagram to the extent of our study area, and that may take a little time.
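A minimal sketch of generating the cells with libpysal's voronoi_frames helper, assuming the listings are in a projected CRS:

```python
import numpy as np
from libpysal.cg import voronoi_frames

coords = np.column_stack((listings.geometry.x, listings.geometry.y))

# one Voronoi cell per listing, plus a GeoDataFrame of the generator points
cells, generators = voronoi_frames(coords)
```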
The clipping works like this: we take the extent of all the neighborhoods and blend them, dissolve them together, because all we care about is the exterior boundary of Austin, and we use that to clip the Voronoi cells so they don't run out toward infinity. That dissolved outline is our clipping polygon. We set the coordinate reference systems to be the same, because this is a spatial operation, and what's running now is clipping the Thiessen polygons, which extend to infinity, with the clipper polygon we just made, the boundary of Austin. That creates a new geo series of 11,000 polygons, one polygon for each listing now, with their edges clipped so they no longer run to infinity. What do they look like? That's the Voronoi polygon for the first listing, and if we plot them all and squint, those are the 11,000 Voronoi polygons: each listing has its own dominance area, if you will; if you dropped a pin inside a Thiessen polygon, the closest listing is the one that generated it.

This is nice: Voronoi polygons are very useful for our purpose, because now we're back in business for building spatial contiguity matrices. We have polygons, we're no longer dealing with raw points, so we can ask for the queen neighbors of these cells, or just the Voronoi neighbors, the ones that share an edge. We save the result to disk as a GeoPackage, and since there are 11,000 polygons, trying to view them all just gives a blob of edges over Austin, so we'll zoom in on a particular neighborhood, Hyde Park. Anyone from Austin? Hyde Park is north of the university, a pretty nice neighborhood; we pick it out of the neighborhoods data frame and subset to it, and that's what the Voronoi polygons look like for the listings in Hyde Park. Right, the sizes and shapes of those polygons are a function of the spatial distribution of the points, not of the values at the points; this is a purely geometric partitioning of the plane based on where the points are, so larger polygons mean fewer points nearby, and the nearest-neighbor distances Levi talked about will be larger for the generator points that sit in the larger polygons. (On the question about the directory being created without the driver: that's news to me. Was it created when writing the file or when reading it? I'd have to dig into it; I don't know off the top of my head.)

So there's Hyde Park, and we're plotting the bounding box of Hyde Park: on the left, all the listings in the data set clipped to that bounding box; then the Voronoi cells for the listings in Hyde Park, colored by the value of each listing; and then the adjacency graph for those Voronoi cells, just like the adjacency graph we drew for the neighborhoods, but now for the points. Questions on contiguity-type weights? We've done them for polygons and we've done them for points. I'm going to skip over this next part and just say a couple of words about it: there are other ways to define weights. You can use distance relationships, say nearest-neighbor relationships, or binary distance relationships. For example, I could define as neighbors any two points that are within a thousand meters of each other: conceptually, you put a buffer with a 1,000-meter radius around each point.
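A sketch of the clipping step and of the point-based weights being described, under the same assumptions (metric CRS, cells from voronoi_frames); the distance-band and kernel builders are included only as the alternatives mentioned here, and can be slow on all 11,000 points:

```python
import geopandas as gpd
from libpysal import weights

# a single outline of Austin, used as the clipping polygon
boundary = neighborhoods.geometry.unary_union

cells = cells.set_crs(neighborhoods.crs, allow_override=True)
clipped = gpd.clip(cells, boundary)   # cells no longer run out to infinity

# contiguity between listings, via the Voronoi cells that share an edge
w_points = weights.Rook.from_dataframe(clipped)

# distance-based alternatives: a binary 1,000 m distance band, or a decaying kernel
w_band = weights.DistanceBand.from_dataframe(listings, threshold=1000, binary=True)
w_kernel = weights.Kernel.from_dataframe(listings, bandwidth=1000)
```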
Any other point that falls within that buffer is a neighbor, so we get a binary matrix, but based on a distance cutoff. We could then say: we don't want binary, we'd like to down-weight listings that are further away relative to the closer ones. Even though two of my neighbors may both be within a thousand meters of my listing, the one next door at 20 meters should carry a larger weight than the one 999 meters away, so instead of a binary weight we'd have some distance decay in the weights. There are ways to do that in PySAL using different kernels for the distance decay, which let you be more flexible with the weights.

Let's see how we're doing on time. On visualization, just quickly, a couple of points: we made a couple of choropleths in the first half, and we'll go over them here because we'll need them when we start testing for spatial dependence, spatial autocorrelation, in the data. This is GDS4. In GeoPandas, if you have a data frame of polygon geometries with a continuous attribute and you call plot with column set to the prices, it will draw a choropleth map of that attribute; we'll look at an example in a second. Under the hood it uses a library called mapclassify, which provides classifiers for an attribute: how many classes the choropleth gets, and how the class breaks are determined. I'm reading in our listings, and first we look at the statistical distribution of the median prices. We saw the histograms earlier today, and this is a distribution plot of the median price, the statistical distribution; there's no geography here, it's just a column of numbers summarized the way a statistician would summarize it. But since the data live in a GeoDataFrame, when we call plot on the data frame and give it the median-price column, we get a different kind of distribution, a spatial distribution. This map uses a quintile classification of the median prices: for quintiles we take the data, sort it, and break it into fifths, and those become the bins for the classification; that's the default here. Let's make it a bit bigger so it's clearer what's going on. There should be roughly the same number of polygons in each color group: with quintiles and 44 neighborhoods, 44 divided by 5 doesn't come out even, but you get roughly that count.

Now we have more visual information to start thinking about this notion of randomness: there's a bunch of polygons that are yellow, the top end of the distribution, and they don't look randomly distributed. Even though the geometries are irregular, not regular pixels, we can intuitively see that they're close together rather than peppered around the map, and similarly for the dark blue and the green. We can add legends, which helps make this clearer. mapclassify, which comes out of PySAL, uses closed intervals for the upper bound of each class: to be in the first class you have to be less than or equal to 80 dollars a night, to be in the second class you have to be above 80 and up to 100, and in our trimmed data set the maximum neighborhood median price is 335 dollars a night. That's how the binning works by default. Instead of quintiles we can ask for quartiles: change k and you get a different number of classes.
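A sketch of the choropleth calls, assuming the merged neighborhoods frame with its median_price column; scheme-based plotting needs mapclassify installed:

```python
import mapclassify

# quintile choropleth: sort the medians and break them into fifths
neighborhoods.plot(column="median_price", scheme="Quantiles", k=5,
                   legend=True, edgecolor="grey")

# the classifier behind it, with the closed upper bounds that end up in the legend
q5 = mapclassify.Quantiles(neighborhoods["median_price"], k=5)
q5.bins   # the upper bound of each of the five classes
```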
With quartiles it's only four classes instead of five, and you can also play with the color schemes if you choose. There's a rich literature on the choice of color schemes for choropleth mapping, having to do with the measurement scale of your variable: is it interval or ratio, is it categorical, is it diverging? There's guidance on how to do that. So we could pick different color schemes to see the impact, or we could change the classification scheme. Quintiles are the default here, but there are many classifiers. Here we use equal interval, where the width of each class interval is equal; not the count of observations per class, that's quantiles, but the width. And Fisher-Jenks is an optimal classifier that tries to minimize the heterogeneity among the values grouped together and maximize the differences between classes. mapclassify itself, which GeoPandas uses under the hood, doesn't really plot; it wasn't intended to. If you call, say, EqualInterval, pass in a variable and tell it you want five classes, it returns an object holding the bins, which is basically what ends up in the legend of the map. So you have plenty of options for exploring the data, and which classifier you should use is related to the statistical properties of your data. We'll run through this quickly: down at the bottom there's a box-plot classifier, which is useful for identifying high or low outliers in your data set, a good exploratory technique, and there's head/tail breaks, which is useful if your distribution looks like a power law, where you wouldn't want quintiles. So there are classifiers that can be more appropriate given the statistical properties of your data set. Questions on mapclassify?

All right, now we start the statistical analysis. We looked at the maps and asked ourselves whether they look random, either the point maps or the neighborhood polygon maps, with regard to where the high values are, and if the maps don't look random visually, we would like statistical tests that confirm our visual inspection. We want that not just because it's a hip thing to do, but because visualization is really, really powerful. Our brains are pattern-recognizing machines; we wouldn't be here unless our ancestors on the savanna had really good pattern-recognition abilities. Your great-great-great-great-grandmother is out harvesting on the savanna, she looks over at the brush, sees the wind blowing, and sees something solid behind the wind-blown grass. She thinks: that might be a lion, or it could be a boulder. What did she decide, faced with that choice? Is it a lion, and should I run, or is it a boulder, and should I stay and go about my day? If it were me and there was a one percent chance that thing was a lion, I would run, and if our ancestors hadn't run, our species might not be around; that selection has played out over generations and generations. So we are really good at detecting patterns, really good, and that's why maps are so powerful. But they can be too powerful, because, number one, there aren't many lions left on the savanna, and number two, we don't do that kind of thing much
anymore; but we still have the hardware we inherited. So we're going to use these quantitative measures in conjunction with the visualizations, so that we don't fool ourselves, don't let that powerful pattern detector lead us astray and identify a cluster where there isn't one. That's the idea, and the concept is known as spatial autocorrelation. We've been rattling on about "spatial" all afternoon, but "autocorrelation" is worth unpacking. Correlation is about the linear association between two variables: my height and my weight, are they correlated? Positively, typically. Autocorrelation, though, means you're looking at the correlation of a variable with itself, auto-correlation, and in our case, with housing prices as the variable of interest, we want to understand its spatial autocorrelation. So what's the second variable, in the correlation sense of the word? We're going to build that up. Note that this is not spatial correlation: spatial correlation is when you have a map of, say, crime rates and a map of police expenditures, two different variables you can correlate, which is correlation in the traditional sense. Here we're talking about something different, spatial autocorrelation.

We're going to use the median price data, build it back up, and revisit the data on a bigger map. I've changed the color scheme so that now the darker hues are the higher values; these are quintiles. And I'm asking myself: are housing prices randomly distributed across these neighborhoods? To me it doesn't look like it: the highest quintile, the dark blue, tends to be co-located over here, the light green is dominant up there, and these blues are clustering down here. But again, remember the lion and our ancestors. Spatial autocorrelation is going to compare two types of similarity. The first is attribute similarity: are two housing values similar or not? If one is $100 a night and the other $300 a night, that's a difference of two hundred; is that similar? It depends on the distribution in our data set, and 100 versus 150 is closer than 100 versus 300. So there's a notion of attribute similarity, how different the values are irrespective of where they are. The second is spatial similarity: where are you measuring those two prices, and are those locations spatially similar? That's where the spatial weights come back into play. The weights are going to let us build a new spatial feature called the spatial lag, which is the average of the listing prices in the neighboring polygons. It's nothing more than a weighted average: for each neighbor of polygon i, we take the median price in that neighborhood and multiply it by a standardized weight, so that when we add up the products we get a weighted average, and that becomes the spatial lag for the focal polygon. It encapsulates all the information in the surrounding polygons in one value, and it becomes our second variable for the correlation: we correlate the median listing price in a neighborhood with the average of the medians in the surrounding neighborhoods, and we do that for each of the 44 polygons. So it is basically a correlation, but against a new, spatially constructed feature.
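A sketch of constructing that weighted average, assuming the Queen weights wq and the median_price column from before:

```python
from libpysal import weights

wq.transform = "r"   # row-standardize: each row of weights sums to one

y = neighborhoods["median_price"].values
ylag = weights.lag_spatial(wq, y)   # average median price among each unit's neighbors
```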
That constructed feature is the spatial lag, and it's where the "auto" comes from: it's still housing price, the same variable, but the new variable is spatially constructed. To get the lag we take the original variable, the median price, from our data frame and call lag_spatial from PySAL's weights module, passing in two things: the weights object (we're using the Queen weights we saw earlier) and the median prices. It essentially multiplies the weights matrix by the vector of prices and gives us a new vector, the average price in the neighboring neighborhoods. That's our new feature, ylag: for the first neighborhood, whichever it is, the average of the median prices of its neighbors is $115 a night; for the last one, the average in the surrounding neighborhoods is $154 a night. It's just another attribute, so we can map the spatial lag of the median prices, in quintiles, and put the two maps side by side: on the left the original quintile choropleth of price, on the right the spatial lag. Spatial autocorrelation is going to test whether those two maps are correlated, loosely speaking, and we created the second one using the spatial operations. Questions on the lag?

It's similar to the notion of a lag in time series. Did anyone go to the time-series tutorial this morning? It's a popular one. In time series, the sales price in February is typically related to the sales price in the previous month, January; January is the first-order lag of February, February is the first-order lag of March if we have a monthly model, and so on. But time is much easier than space: in time, the dependence runs one way, one direction; the future doesn't influence the past, the past influences the future, and every observation has one and only one first-order neighbor. In space it's much more complex. Think about a center polygon: how many neighbors does it have? It's hard to squint, but more than one. And it is also a neighbor of its neighbors, so what goes on in that polygon potentially influences what goes on in the neighboring polygons, and what goes on in the neighboring polygons can feed back. The spatial relationship is multi-directional and, depending on the time frame, can be simultaneous, whereas in time you have one neighbor and one direction. I bring that up because when people started developing methods for spatial autocorrelation, and regression models that incorporate space, the econometricians, the time-series econometricians, said this is silly: we have a massive literature on time-series econometrics, let's just use that. Which sounds sensible; if the machinery exists, why reinvent the wheel? Here's why it doesn't work. You'll see papers even today where people study, say, the relationship between school expenditures and school performance across US states, the data come alphabetically from Alabama to Wyoming, and they apply a Durbin-Watson test for temporal autocorrelation, as if Alabama came before whichever state is next. This literally happens. Clearly that's not how geography works; it's more complex, so you can't just use off-the-shelf time-series methods on spatial data, which shouldn't be a surprise, but it still
happens, and when you do apply time-series methods to spatial data the results are not the same. Back to the maps: the lagged map is a little smoother than the original, because the lag takes an average, borrowing information from the surrounding units. And that's why this isn't purely a simple correlation: by construction the lag values will be autocorrelated even if the original variable is not, and the tests we're about to discuss take that into account.

So now we develop the statistical tests for whether this map is random or not, and there are two families. Global tests answer the question: is the map random or not, with one statistic for the entire map. Then there are so-called local tests, which let us assess whether different types of clustering are going on in subsets of the data and identify hot spots and cold spots, things like that. We'll start with the global ones.

We take our variable of interest, the neighborhood median price, which is what the map is showing, and, just to get the main concepts down, we convert it to a binary variable: is the neighborhood's median above the average or below? Sorry, is the neighborhood median above the median of the medians, is it high or low? The median of the 44 neighborhood medians is about $109 a night, and we create a dummy variable, yb: 1 if the neighborhood's median price is above that threshold, 0 if it's below. So we have 22 neighborhoods above and 22 below, and we plot that: the black polygons are the ones whose price is above the threshold, the white polygons are the ones below. Black and white were picked on purpose, because the statistics we're going to use are called join counts, and conventionally they're described in terms of black and white, as we'll see in a second. The join counts count up the different types of joins and compare them to what we would expect if the process that generated our data were a random one. So our null hypothesis, if you will, is that whatever generates the map we see is a completely spatially random process; we develop a distribution for that random process and use it to ask whether the map we actually got looks like it came from such a process.

What is a join? A join comes out of our W: if w_ij is one, that's a join, one polygon that is a neighbor of another polygon. How many joins there are depends on what you used to define W, queen contiguity, nearest neighbors, whatever, but we have this matrix of ones and zeros, and wherever there's a one, that's a join. With respect to the attribute, though, there are three types of joins.
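A sketch of the high/low recoding and the black-and-white map described above (the column name is an assumption):

```python
y = neighborhoods["median_price"]

# 1 if the neighborhood median is above the median of the medians, else 0
yb = (y > y.median()).astype(int)
yb.value_counts()   # roughly a 22 / 22 split here

# black polygons are the "high" neighborhoods, white the "low" ones
neighborhoods.assign(yb=yb).plot(column="yb", cmap="binary", edgecolor="grey")
```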
Look at this polygon and that polygon: they form a join because they share this edge, but one of them is below the median and one is above, so that's a BW join, black-white or white-black, where the values differ across the join, one high and one low. We also have white-white joins: this join right here is a WW join, low next to low, value similarity on top of spatial similarity, because they're contiguous and their values are alike. And the third type is the black-black join: a neighbor pair, a join, where both polygons are classified as high. We know the total number of joins from the weights matrix, and we decompose them into how many are BB, how many BW and how many WW, given that we know the probability of being black is 22 out of 44, fifty percent. So imagine a process of 44 Bernoulli draws: you have 44 slots, you flip a coin for each polygon; heads, it's a black cell, move to the next one, flip again, and so on. That random process is our null hypothesis: we know half the cells will be black and half white, but where they are is the central question, and these 22 black polygons sure don't look randomly distributed to me.

So let's count the joins. PySAL has a package called esda, for exploratory spatial data analysis, that contains the autocorrelation tests we're using, so we import that. I create the binary variable yb again, just to be explicit, and I create a Queen weights object from the data frame, which reads the polygon geometries and builds the W object, so we know who is a neighbor of whom. Then we transform the weights to be binary: I don't want to row-standardize them, I'm not doing a weighted average like we did for the lag (we'll come back to that); what I want now is to keep the weights binary, because I want to count how many black-black, how many white-white and how many black-white joins there are. The transform property of the weights lets you row-standardize them, apply other standardizations, or set them back to binary, and here I make sure they're binary.

Then, using numpy (we imported numpy as np up above), we set the random seed. Why? Because we're going to use computationally based inference to decide whether the counts we get for these three types of joins differ from what we would expect under a random process, and the way we do that is with spatial permutations: we create fake maps by picking up the 22 black cells and randomly reassigning them, which gives us a new, fake map, and we calculate the join counts for that fake map. Now we have two realizations, our original map with its split of counts (we'll get to the split in a second) and a fake map with the values randomly permuted, and we generate a lot of fake maps, ten thousand, a hundred thousand, however many we want; for each one we get the split of the join counts into the three groups, and from those we build the reference distribution we'll use to assess whether it's believable that our map came from a random process. That's why we set the random seed: so you all get the same result I do, and we can reproduce this.
If you don't set the seed and rerun, you'll get slightly different results. Then we call the join-counts class from PySAL's esda, Join_Counts, and pass in our binary variable, a vector of 44 values, some ones and some zeros, plus our weights object, which we've set to binary. That computes the join counts for us and returns an object whose attributes hold the counts: there were 44 BB joins on our map, 44 cases where the polygons on either side of a shared border are both black, 43 WW joins, and 32 BW (or WB) joins, and those exhaust all the joins. How many joins are there in total? Our weights object wq has an attribute called s0, which adds up all the ones in the matrix; we divide it by two because the matrix is binary and symmetric (if a and b are neighbors then b and a are neighbors, but that's one join), and this confirms we've counted the joins correctly.

We know the split, and the question becomes: is 44 BB joins higher or lower than what we would expect from a random process, one where, for each polygon, we flip a coin, color it black on heads, and move on, with the process not caring where the polygon sits geometrically? For the fake maps, the spatial permutations, the coin-flipping analogue, the default when we call this method is 999 fake maps simulating that process, and the mean of the BB count under the null (because you're simulating the null that the map came from a random process) was about 28. So we know what we observed is higher; the question is how much higher. We use seaborn to plot the distribution of the BB counts from all the fake maps, the synthetic maps from the random permutations: that's their density, that's the average, and this red line is what we actually observed on our map, 44 BB joins. From this we can compute a pseudo p-value: the area under that density to the right of the red line. It's like a p-value, even though p-values are falling out of fashion, and PySAL computes it as an attribute, but conceptually it's integrating that curve, loosely speaking, and taking the area to the right of the red line. It comes out around 0.002, two in a thousand, well under one percent, so at the conventional 5% level this would be significant. What we observed is very extreme if we're to believe it came from a random process, so we reject the null that the process generating our data is a random one. We could be wrong; we don't know the true process, and we wouldn't be here if we did, but we do know that if it were random, getting a result like ours would be very unlikely: about two in a thousand. That's the logic of join counts: count the joins, and classify each one as BB, WW or BW.

How many different maps could be generated this way? Thought experiment: the permutation works on a vector of 44 values where, say, the first 22 happen to be ones, and a permutation shuffles the indices at random, so a value that was in one slot moves to, say, position 23, and another moves somewhere else entirely.
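A sketch of the join-count test just described, assuming esda is installed and that yb and wq are the binary variable and Queen weights from above:

```python
import numpy as np
import esda

np.random.seed(12345)   # any fixed seed makes the permutations reproducible
wq.transform = "b"      # keep the weights binary so we are counting joins

jc = esda.Join_Counts(yb, wq)

jc.bb, jc.ww, jc.bw   # observed black-black, white-white and black-white joins
wq.s0 / 2             # total number of joins, to check the decomposition
jc.mean_bb            # average BB count across the 999 permuted maps
jc.p_sim_bb           # pseudo p-value for the observed BB count
```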
The number of permutations you could do is n factorial, with n equal to 44 here, and 44 factorial is a really big number, so we don't do all n factorial permutations; we sample from them. That's what's happening, intuitively. We could never enumerate all of 44 factorial; there isn't enough time before the heat death of the universe, so we sample instead.

That was the binary case, and we threw away a lot of information just to pin down what we mean by high and low. There are other measures that handle continuous data, so we go back to our original variable, the median price; no truncating into high or low, we use the actual prices. Here we transform the weights to "R", which means row-standardized: it takes the row sum of the binary matrix, which is the number of neighbors each unit has, and divides the original values by that sum. If you had four neighbors, each of their weights becomes 0.25, so they get equal weight; five neighbors, 0.2 each; one neighbor, 1. That's what the R transform does, and it's what lets the spatial lag be a weighted average: if we kept the weights binary, the lag would be the sum of the prices among the neighbors, and we want the weighted average.

The test for the continuous case, where the attribute isn't binary, is Moran's I, and it has a very similar call structure: you pass in your attribute y and your weights object wq. We set the random seed again because inference relies on the same idea of permuting the values over the map: pick up the 44 floats and assign them to polygons at random (you're playing God in the housing market, creating a random housing market as the benchmark), do that many times, calculate the statistic for each fake map, and compare the statistic from the real map to that distribution, just like we did for the join counts. For Moran's I the statistic is called I, capital I, and we get a value of about 0.5957. What is that? It's no longer counting joins; loosely speaking, it reflects the correlation between the median price in a neighborhood and the average median price in the surrounding neighborhoods, the correlation of those two variables, but we interpret it using the same logic as the join counts. This is the distribution of Moran's I under the null, based on the permutations: it's centered a little below zero, and that's right, the expected value of the statistic under the null is not zero, even though, thinking of it as a correlation, you might expect zero; it's -1/(n-1), where n is the number of polygons, so it approaches zero as n grows but is always negative. That's what we get on average under the null, and this is what we got for our statistic, way, way out in the tail. In fact, none of the 999 fake maps generated a value as high as the one we see in the original data, and because the continuous variable keeps more information, we see a more extreme result and a smaller pseudo p-value. So again we would reject the null that whatever generated the housing-price map is a random process: the probability of getting such a result under a random process is less than one in a thousand.
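A sketch of Moran's I with the same ingredients, row-standardized weights and the continuous medians:

```python
import numpy as np
import esda

np.random.seed(12345)
wq.transform = "r"   # back to row-standardized weights for the lag-based statistic

mi = esda.Moran(neighborhoods["median_price"], wq)

mi.I       # observed statistic, around 0.6 for this map
mi.EI      # expectation under the null: -1 / (n - 1)
mi.p_sim   # pseudo p-value from the 999 permutations
```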
Right. So those are global tests: they tell us something about these maps as a whole, either that one for the join count or this one for the continuous case where we use Moran's I. In both cases we reject the null and conclude there's something going on in these housing markets — prices are not random in space. Questions on that? All right, so why does it matter that we rejected the null, you might be asking — why do you care about spatial autocorrelation? Why do you think you might care about spatial autocorrelation? Okay, so these are exactly exploratory methods — the esda module is exploratory. They're not developing causal frameworks, but they are uncovering patterns that perhaps we weren't aware of, and that leads to those kinds of questions: why is this not random, or why does this appear not to be random? That's what they're designed to do. The other purpose for doing autocorrelation analysis is when you get to the confirmatory modeling part, which we're not covering today. Let's say you were going to estimate hedonic models for housing prices: if the data is spatially autocorrelated, you just cannot use traditional methods — your inferences are likely to be wrong — so you want to know if the data is spatially dependent so that you can use the right methods. So, two reasons: pattern detection, and then diagnostics as to whether your assumptions for, say, classical regression modeling are going to be met or not. [audience question about alternative weight specifications, e.g. kernel-based measures] There's a whole bunch of kernel and spatial-feature combination stuff that's kind of the frontier of the literature right now; some of it is implemented in Python, some of it isn't available yet. [audience question about building weights from Voronoi polygons and using graph measures like eigenvector centrality] Yes — essentially you'd be doing network analysis on the contiguity graph — but different measures of centrality and graph structure will disagree, just as different measures of spatial statistical relationship will, so the representation you choose matters: a Voronoi-based contiguity is not the same as the one I showed, and neither are its local statistics. Other questions? Okay. So the Moran and the join count were global tests — is the map random or not? These local measures, local statistics, are spatially explicit in that they're going to allow us to detect whether there are so-called hot spots, cold spots, or outliers. Is the clustering stationary in the process? If the global test rejects the null, is it the same process everywhere? Is the process unstable, or, if it is stable, are there certain locations on the map that are driving the overall result? These are some of the applications of what are known as the local methods. So we're going to use the same data, same weight structure — queen — we make our lag price, and we're going to motivate this idea of local stats. Let me make this a little smaller. All right, so what we're doing here is plotting two variables. On the x-axis is the median price for each of the neighborhoods; the vertical line is the median of the — oh sorry, it's the mean, so that's the mean of the medians. The y-axis is our friend the spatial lag: for each neighborhood we know who its neighbors are, we take the weighted average of the medians in those neighbors, and that gives us the coordinate for this axis. The horizontal line is the average of the spatial lag.
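A minimal sketch of building that scatter plot by hand with matplotlib, reusing y and lag_price from the sketch above (splot also ships a ready-made Moran scatter plot if you'd rather not roll your own):

```python
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.scatter(y, lag_price, s=20)

# The crosshairs: mean of the median prices (vertical) and mean of the lag (horizontal).
ax.axvline(y.mean(), color="black", linewidth=0.5)
ax.axhline(lag_price.mean(), color="black", linewidth=0.5)

ax.set_xlabel("median price")
ax.set_ylabel("spatial lag of median price")
plt.show()
```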
So with the cross between the two averages we have four quadrants. Quadrant one, up here — what can we say about that neighborhood, is its price high or low? This guy right here: its price is about $250 a night. Is that high or low relative to this data set? It's high, because it's to the right of the vertical line. So that's a high neighborhood. How about its neighbors? Its value for the spatial lag — the average housing price in the neighborhoods that neighbor that neighborhood — is that high or low? It's high; it's above the horizontal line. So if you're in quadrant one, you are a market that has high median price and your neighbors are also high-valued neighborhoods. That's what's going on in this quadrant. Quadrant two — what do we have, say, this guy? That neighborhood: is its median price high or low? It's low. But what about its neighbors? They're all high. So this is a relatively low-priced neighborhood surrounded by higher-priced neighborhoods — not necessarily higher than it, but higher than average. So this is low next to high, where the first term is the focal unit: there we had high-high, high focal, high lag; here we have low focal, high lag. Quadrant three — if you're in quadrant three, what does that tell us? Low focal, low in the neighborhood. And finally quadrant four — what are these guys? High price surrounded by low-priced neighborhoods. Okay, so this is called a Moran scatter plot — Moran as in Moran's I — and this division into four quadrants gives us the ability to define the type of local spatial association across these four cases. In quadrant one you have positive spatial association of a high-high type: this value is high and so is the value in the neighborhood of that neighborhood, so it's high-high positive spatial association. This is also positive spatial association: you have value similarity — low values — but also spatial similarity, because these are neighbors. That's positive association, but of a different type: you might think of this one as a poverty cluster, so to speak, and that one as a wealth cluster. So on the diagonal you have positive spatial association: if you know something about the price in this market, and the data is positively spatially associated, that tells you about the prices in the context around that neighborhood; or if you know the context around a neighborhood, it tells you something about what you should expect to find in the focal neighborhood. [audience question about a point on the plot] This one up here — the price for the focal unit is three hundred and something, and the average in its neighborhood is under two hundred, so being below the diagonal means the focal unit is higher than what's going on in its neighborhood. So quadrants one and three are positive spatial autocorrelation from a local perspective — value similarity and spatial similarity. Quadrants two and four are known as spatial outliers, because you have value dissimilarity with spatial similarity: the neighbors are spatially similar — that's our definition of spatial similarity, whether you're a neighbor or not — but here you have a high value surrounded by low values, and here a low value surrounded by high values. So we have these four cases. These are high-high, the so-called hot spots — hot meaning high for the attribute you're measuring.
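To make the quadrant logic concrete, here's a hand-rolled classification based on the signs of the deviations from the two means — the local statistics below give you the same thing as a q attribute, so this is purely illustrative:

```python
import numpy as np

z = y - y.mean()                      # focal value above (+) or below (-) the mean
zlag = lag_price - lag_price.mean()   # neighborhood average above or below its mean

quadrant = np.select(
    [(z > 0) & (zlag > 0),            # 1: high-high (wealth cluster / hot-spot candidate)
     (z < 0) & (zlag > 0),            # 2: low-high  (the donut)
     (z < 0) & (zlag < 0),            # 3: low-low   (poverty cluster / cold-spot candidate)
     (z > 0) & (zlag < 0)],           # 4: high-low  (the diamond in the rough)
    [1, 2, 3, 4],
)
```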
We're looking at housing prices, but it could be crime rates, it could be disease rates, it could be car accidents — high for the attribute. So, high-high. These are low-low, or cold spots: low values next to low values. The two other cases are spatial outliers, and we have two of them. I like to call this one, where we have low surrounded by high, the donut: a donut with a hole in the middle — not a jelly donut — empty in the core, with the periphery where all the action is with regard to the attribute. And down here you have a diamond in the rough: high value in the core, not much going on in the periphery. So this direction is the outliers, and this direction is positive spatial association. The slope of this red line is the value of the global Moran's I statistic we saw earlier. If there were no spatial association on our original map, we would expect the dots to be scattered as a roughly circular cloud across the four quadrants, because there would be no association between the housing value in the focal unit and what goes on in the neighborhood — that's intuitively what randomness is about — but this is clearly not a zero-slope line. All right, so to calculate these local statistics there's actually a lot more that has to happen. Number one, statistics, plural: we no longer have one statistic for the entire map, we calculate a measure for each location, and that's why they're called local. The signature is the same, though: we call Moran_Local instead of Moran, pass in your y vector and your w object, and that creates a local Moran object with a bunch of attributes. Number one is q — which quadrant you are in. These are numbers between one and four: one is this quadrant, two is that one, then three and four. So observation one was in quadrant four, and observation forty-three (Python zero offset) was in quadrant one. Also, what goes on under the hood here are the permutations: just like for the global test, we need a reference distribution to decide if our statistic is extreme relative to the null. However, instead of 999 permutations for the one global statistic, we have to do that for each polygon — for each polygon — so this is a lot more computationally intensive, and one reason it's done per polygon is so that, loosely speaking, you don't include yourself as your own neighbor. We get a p-value, p_sim, for each observation: each observation gets a value for the local statistic, and then we decide whether it's significant or not using those distributions, just like for the global stat — but instead of one reference distribution for all values, we have a different reference distribution for each location. The interpretation is the same: where is my observed statistic in this reference distribution? We could plot 44 densities, but we're not going to. We're asking how many of our local statistics had p-values below 0.05 — and there are 18 of our 44 statistics that are significant by this criterion. Okay, so we want to find out: great, where are they, what are they? First, the hot spots: those are the ones that are in quadrant one and are significant — not just being in quadrant one, but also being statistically significant, being more extreme than you're comfortable with under the null. So those are the high-value neighborhoods that are co-located with other high-value neighborhoods.
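A sketch of the local call and the hot-spot selection, with the same caveat that the names (the GeoDataFrame neighborhoods, the column added for plotting) are illustrative:

```python
import numpy as np
import esda

np.random.seed(12345)

# One local Moran statistic per polygon, each with its own permutation-based
# reference distribution -- much more work than the single global test.
lisa = esda.Moran_Local(y, wq, permutations=999)

print(lisa.q[:5])                  # quadrant (1..4) for the first few observations
significant = lisa.p_sim < 0.05    # pseudo p-value per observation
print(significant.sum(), "of", len(y), "local statistics significant at 0.05")

# Hot spots: in quadrant one *and* significant.
neighborhoods.assign(hotspot=(lisa.q == 1) & significant).plot(
    column="hotspot", categorical=True, legend=True
)
```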
It's not enough just to be a high-value neighborhood — you have to be contiguous to other high-value neighborhoods. So that's our hot-spot map. The cold-spot map: these are the ones in quadrant three that are significant, so blue for cold, in a different part of the city. The donuts, which is quadrant two — hole in the middle — so this is a relatively low-priced housing market surrounded by high-priced markets; that's our donut, one of the outliers. And then the diamonds: neighborhoods that had high median price, surrounded by neighborhoods that had low median price, and significant. Putting them all together — we're only plotting the significant ones — we have those four quadrants: red is the hot spot, high-high, quadrant one; the gray is not significant, but we put those on so we can see them anyway; the blue is the cold spots, quadrant three; the donut is up here; and the diamond is down here. So if you're an investor, where do you want to buy property? [audience answer] Yeah — why? If there's diffusion, if the growth is going that way, that market might tip sooner; yeah, you could use it for those purposes. Okay, so those are the local statistics. They identify the interesting locations, whether they're strongly emphasizing the global pattern — which certainly this one and this one are, those are in quadrants one and three, similarity in space and similarity in value, high next to high or low next to low — or, less prevalent but still there, the outliers. Questions on the autocorrelation? All right, before we leave autocorrelation: these are all static, this is a cross-sectional analysis — we have data for one time period. So this notion that, well, if there's diffusion and the growth cluster expands then we probably want to invest here — that's making a guess about the dynamics of the process. One of the ways these statistics get used is with panel data: if we had the Airbnb data for multiple time steps, then you could have a time series of these scatter plots, one for last year, one for the year before, and you'd be interested in the movements across the quadrants. We have packages inside PySAL that deal with those kinds of questions as well. All right, the last notebook — and then maybe we'll have time to do the exercise — has to do with notions of clustering and regions: region building and clustering, clustering in the sense of multivariate clustering. You have a multivariate data set, say US census data, or in our case the Airbnb data: for each listing there are multiple attributes — the listing price, how many people it accommodates, and others. What we'd like to do is cluster together the listings that are similar to other listings in attribute space. That's a multivariate clustering problem. But since our focus is on spatial data science, we'd also like to incorporate geography into that clustering process, and introduce this notion of marrying, if you will, spatial similarity and attribute similarity and defining clusters from that perspective. That's what we're going to do with this notebook. We read in our neighborhoods and the listings, plot them, and we construct some new information: Levi calculated median price per head — in the attribute data you had the listing price as well as how many it accommodates, so on the right that's the ratio of those two, price divided by how many people can stay there. I'm not staying at that $12,000 place — I'd invite people if I were, but I'm not.
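The derived variable is just a ratio; something along these lines, where the column names price and accommodates are assumptions about the listings table rather than the notebook's actual names:

```python
# Nightly price divided by how many people the listing accommodates.
listings["price_per_head"] = listings["price"] / listings["accommodates"]
```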
So we're going to start with traditional clustering with scikit-learn. It has many clustering methods that you could use; we're going to apply it to our neighborhoods data as well as the point data, and we're going to look at two clustering methods. K-means first — is everyone familiar with k-means clustering, or has anyone never heard of it? I'll spend some time on it. Okay: k-means clustering is about having a set of objects that we want to assign to groups, i.e. clusters, such that everybody is assigned to one and only one cluster and everybody is clustered — so mutually exclusive and exhaustive. How do we do that? Well, we could take the IDs, sort them, and say the first five are in cluster one, the next five in cluster two, and so on. That would be mutually exclusive and exhaustive, but in terms of similarity it's probably random noise. So these clustering algorithms use distance, or similarity, in attribute space — how similar are the prices, how similar are the sizes, whatever attributes we have in the data — and we develop multi-dimensional similarity measures and then group the more similar listings together into those mutually exclusive and exhaustive partitionings. K-means works like this — we're doing it for the neighborhoods first: we have 44 neighborhoods and I want to form five clusters of neighborhoods, so I create a model called KMeans, that's from scikit-learn, and I tell it I want five clusters. What it's going to do is take those 44 neighborhoods and put them into five buckets — those are our clusters. How does it do it? It depends on the options, but you start at random: you randomly pick five of the neighborhoods as seeds, and then for each of the unassigned neighborhoods you look at which of those seeds it is closest to. So I first pick five random neighborhoods — those are the seeds for the five clusters — then I have 39 unassigned neighborhoods. I loop over those: I pick the first unassigned neighborhood, I see which of the five seeds it is closest to in attribute space, and I attach it to that seed, and I do that for the other unassigned ones, and now I have a feasible solution — five clusters, everybody assigned. But then I re-establish the center of each cluster by taking a multi-dimensional average of the members assigned to it, and then I step through the list again and ask: should I move this guy to a new cluster? And I keep doing that reassignment process until it converges. That's what k-means does, and it gives us labels. So we create a model, then we fit the model on our neighborhoods data with our clustering variables — and our clustering variables, I skipped over it, let's see what they are — well, we're just using median price, so we're faking multivariate clustering. The result of fitting the k-means model is a KMeans object, and one of its attributes is labels_ — the cluster each neighborhood was assigned to. So the first neighborhood is in cluster ID two, the last neighborhood is in cluster ID one. We attach those labels as a new attribute and then plot it. So that's k-means from scikit-learn on the median prices: it formed five clusters based on price similarity, no other attributes, and the color denotes which cluster you belong to — membership in a group, not necessarily whether it's a higher or lower value.
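A sketch of the k-means step with scikit-learn, on the single median price column (so "fake" multivariate clustering, as above); again the names are illustrative:

```python
from sklearn.cluster import KMeans

# One clustering variable here; normally X would hold several standardized columns.
X = neighborhoods[["median_price"]].values

km = KMeans(n_clusters=5, random_state=12345).fit(X)

neighborhoods["kmeans_label"] = km.labels_   # cluster id per neighborhood
neighborhoods.plot(column="kmeans_label", categorical=True, legend=True)
```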
So, anything interesting about this map? You see, for the green cluster, most of the polygons that were assigned to that cluster are co-located, and similarly for cluster zero, so there's some spatial structure there. And this is an aspatial clustering routine — it doesn't explicitly take geography into account — but the spatial distribution of the attribute, as we saw in the autocorrelation analysis, is not random, so it's picking that up. We can get the sizes of the clusters, meaning how many neighborhoods are in each of the five clusters — cluster number four only has one polygon, that central one — and then we get the mean of the medians for the housing prices in the five clusters: the cluster that only had one neighborhood has the highest value, 335, and that's this guy here. So that's k-means. Next we do agglomerative clustering. Agglomerative clustering works by starting with forty-four clusters — everybody's in their own cluster — but that's not useful because there's no summarization of the data. Then we search for the closest pair of neighborhoods in multi-dimensional space — here we're just using price — so we find the two neighborhoods with the smallest difference, and they form the first real, non-singleton cluster. After step one we have forty-three clusters: 42 singletons and this one cluster that has two. Then we keep going: we treat that pair as a single thing — we collapse it, take the average of the two neighborhoods and treat it as one unit — and then we search for the closest pair again, and it could join one with that pair or it could join two other ones, and we keep going until we have one cluster that contains everything. That gives us something that looks like this — if the wireless gods are willing — a dendrogram. Step one would have clustered neighborhoods two and ten; at step zero all the neighborhoods are unclustered; and then you keep going, joining singletons to clusters or clusters to clusters, until you're at the top of the tree where everybody's in one cluster. That's what agglomerative clustering does — agglomerate, move up — and how you measure similarity is the linkage option; Ward's is the distance measure. Once we're done we have the same properties on the object — we have the labels — and we can introspect that and get how big the clusters are, meaning how many neighborhoods are in each cluster, and we see some similarity: that fourth cluster falls out again, the singleton by itself, and we can plot that as well. We said we wanted five clusters for the agglomerative method, and what that does, if we go back to our dendrogram, is tell us where we're going to cut the tree horizontally, so that when we cut it we have five bundles of neighborhoods. If we cut it here, that would give us two clusters, split here; if we cut it here we'd have — well, I'm not going to try and count. So when you say k is five, it finds the horizontal cut at which you get five clusters; intuitively, that's what is happening. Again we see spatial structure, non-randomness — the membership in the clusters is not random in space — and then we can compare the two solutions, k-means on the left and agglomerative on the right, keeping in mind the labels themselves could be mismatched between the two.
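And a sketch of the agglomerative version plus the dendrogram via scipy, both with Ward linkage:

```python
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage
from sklearn.cluster import AgglomerativeClustering

# Bottom-up: start with 44 singletons and keep merging the closest pair
# (under Ward linkage) until only five clusters remain.
agg = AgglomerativeClustering(n_clusters=5, linkage="ward").fit(X)
neighborhoods["agg_label"] = agg.labels_

# Cluster sizes and the mean of the median prices in each cluster.
print(neighborhoods.groupby("agg_label")["median_price"].agg(["size", "mean"]))

# The full merge tree; cutting it horizontally at the right height gives the same five clusters.
dendrogram(linkage(X, method="ward"))
plt.show()
```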
Okay, so the geography is a result of these clustering processes, but it's not explicitly built into them. There can be reasons why you want to respect some spatial considerations in forming the clusters and not rely on the data alone to weave them into the solution. One reason is if you're thinking about neighborhoods — socioeconomic neighborhoods, for example: you are either inside your neighborhood or you're not; you don't leave your neighborhood to get to your neighborhood. So if these clusters were neighborhoods — cluster one is here and here — it's spatially fragmented, and that's not a socioeconomic neighborhood: you wouldn't say "I live in this neighborhood, but my neighborhood is also over there." You're either inside the boundaries of your neighborhood or you're not. So we could say we want to form clusters that are not spatially fragmented. For those kinds of problems, those kinds of applications — you may be running a business where you need to assign people in the field to service regions, and you don't want those regions to be fragmented, it's inefficient from an optimization point of view — we want to build those constraints into the clustering solution. Fortunately we can do that, leveraging the really nice architecture inside scikit-learn, by passing in the PySAL queen weights we've been using: we use its sparse attribute and pass it to the agglomerative model as the value for the connectivity option. What that does is repeat the clustering, but it only forms clusters whose members are spatial neighbors. So we didn't have to reinvent the wheel here — although we do have our own packages for regionalization — and this is really, really nice; it speaks to the strength of the SciPy ecosystem. This is one instance of something that happens a lot: you discover this other library that does exactly what you want, and we have a data structure that fits really nicely into it. So let's do this with Ward agglomerative clustering: we want five clusters, but now you see what we mean by no fragmented clusters — the blue one is not split; if you had the connectivity graph it wouldn't have two components, it's a single connected component, as are the brown ones and the green ones. So if you need to respect this type of constraint, you can do it with a fairly straightforward option. Does that make sense? Okay. Then we can do the same thing on the points. We've been going back and forth between changes of support — working with points, then switching to polygons — and we're going to go back to points here and make use of things we've already seen. We did the Thiessen polygons a couple of notebooks ago and stored them to disk — hopefully not in 3,000 files on someone's hard drive, or 11,000, sorry. We read that clipped Thiessens GeoPackage, which is a bunch of polygons, and in PySAL we call Queen.from_dataframe — it takes that data frame and gives us the queen neighbor relationships from those Voronoi polygons (Thiessen polygons — the names are different but it's the same thing). So now we have our queen contiguity structure for our eleven thousand points. We take the log of the prices because — what was the rationale? — the clustering does poorly when the price data isn't normally distributed, so we do a log transform, and then we do agglomerative clustering with the spatial constraint, just like we did before on the polygons, but now we're clustering the points with the Voronoi neighbors as our weights object — and it runs pretty quickly. We'll plot it.
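A sketch of the spatially constrained version for both supports — the GeoPackage file name, the listings column, and the number of point clusters are all assumptions:

```python
import numpy as np
import geopandas as gpd
from libpysal.weights import Queen
from sklearn.cluster import AgglomerativeClustering

# Polygons: the sparse queen matrix becomes the connectivity constraint, so merges
# only happen between spatial neighbors and no region comes out fragmented.
ward_spatial = AgglomerativeClustering(
    n_clusters=5, linkage="ward", connectivity=wq.sparse
).fit(X)
neighborhoods["region"] = ward_spatial.labels_

# Points: queen contiguity built from the clipped Thiessen/Voronoi polygons,
# clustering on log prices because the raw prices are heavily skewed.
thiessens = gpd.read_file("clipped_thiessens.gpkg")   # hypothetical file name
wt = Queen.from_dataframe(thiessens)
y_pts = np.log(listings["price"].values).reshape(-1, 1)

point_regions = AgglomerativeClustering(
    n_clusters=5, linkage="ward", connectivity=wt.sparse
).fit(y_pts)
listings["region"] = point_regions.labels_
```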
So now the color denotes which region each point was assigned to. We don't see the Voronoi polygons, but they were there under the hood, expressing the contiguity constraint. And just like with the polygons, we can query the clusters — the regions — and ask how many points are in each of the point clusters, or point regions. There's quite a bit more to chew on with eleven thousand points to cluster, and we can do it with agglomerative as well, and then plot the different clusters individually — where they are in lat/long space. Okay, questions on the clustering and region building? How are we doing for time? Okay, so do you want to do the last problem set, or ask questions about anything else? [audience question about the Airbnb data and how it was scraped] Yeah, that would be a great use case — clustering gets used in many, many domains, for sure. Okay, do you want to work on the exercise, or learn other stuff about PySAL — get an overview of what's in the library, the stuff we didn't cover, and pointers to resources? [audience responses] Okay. So the project itself is a federation of packages. If you go to pysal.org, that's the public face of it. The packages we covered today are libpysal, some esda, and mapclassify — I think those are the ones we hit — but there are many others. There's a new one called segregation that has a bunch of indices along the lines you're talking about; it just came out, it's used for segregation analysis but you could use it for other purposes as well. The way the library is installed: if you want all these things, you install the meta-package — conda install pysal or pip install pysal — and it'll get everything. We have libpysal as well as esda; giddy is for space-time analysis; inequality is for inequality analysis; point patterns, which we didn't really cover today, have their own module; segregation; spaghetti is for point patterns on networks, so if you have phenomena that are constrained to a network and you want to do spatial analysis on them, that's what spaghetti does; and region is in limbo — it's probably going to go away and get reinvented — but that's the regionalization stuff we ended with, forming compact regions. Then there's a modeling layer, which we didn't cover today: this would be for the confirmatory type of modeling — spatial econometrics, spatial general linear models, spatial interaction models where you're trying to predict migration flows or other interactions between origins and destinations (it's a model, that's why I put it in the model layer), and varying-component models that Levi heads up. We all work together on these different packages; there's a bunch of us on the team. On the visualization side, we saw a little bit of mapclassify today; there's a package called splot that Stefanie Lumnitz, who's here giving a talk tomorrow, developed; legendgram is a cool visualization for legends; and then there's a notebooks project that we're working on so you can run all this stuff locally and explore it before you use it.
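For reference, a sketch of the two install routes and the pieces we touched on today — treat it as approximate; the full list lives at pysal.org:

```python
# Everything at once, via the meta-package:
#   pip install pysal        # or: conda install pysal
# Or a la carte, just the pieces you need, for example:
#   pip install libpysal esda mapclassify splot

import libpysal     # weights and core spatial data structures
import esda         # exploratory analysis: join counts, Moran's I, local Moran
import mapclassify  # choropleth classification schemes
import splot        # visualization helpers built on top of the above
```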
[audience question about raster support] So, rasters: when PySAL started, it was in the vector space, because most of us at the time worked in the social sciences, where the dominant spatial data structures are vector data — polygons, points, and lines. The raster stuff was more in the geophysical realm — hydrology, ecology, that kind of thing — and so our research interests just weren't there; that's where we focused most of PySAL up until very recently. This is stuff we didn't cover today, although you have the notebooks — there is a set on rasters. Levi's been working more with rasters lately than I have, so I don't want to misspeak about it, but we've never added raster support inside PySAL because there are other really good libraries that have it — is that a fair thing to say? Yeah. So the problem sets are here, and the solutions are also notebooks that are fully worked out, so you have them. [audience question about working offline] I think — well, contextily is not part of PySAL; that's the one where you have to hit the tiling service; I don't think any of the others do. One of the motivations — we just refactored the library, so now if you don't want everything you can install just the specific packages you want, to be lightweight — and one of the reasons we did that is that we have partners who work at the Census, and they have really strict requirements about what they can install: they couldn't install everything, they wanted à la carte kind of stuff, and their work can't go off-site, so it has to be local. Any other questions? Well, we could stop there. [audience question about who uses PySAL] Yeah, there are people — well, this morning at breakfast I sat next to a guy who works for a large multinational petroleum company, which I won't name, that uses PySAL. He wouldn't tell me much about what they use it for, but you know. We've both worked with companies before that use a lot of PySAL and related tools for all kinds of research. It was born out of our own research needs, for sure, but a lot of people are using it. So if there's stuff that we didn't cover, or that's not in the library and you want it, let us know — it's a really friendly group, and people can get involved. Maybe we'll stop there. [Applause]