Unpacking the geospatial engineering toolbox – an overview of data science techniques for spatial da

Video Statistics and Information

Captions
Welcome, everyone, to the All Things Data track. I'm super stoked for the lineup we have today. My name is Ned, and I'm one of the hosts of the specialist track. We also have Ray and Pedro, who you'll see later on throughout the day; they're down the front. The only minor housekeeping thing I'm going to mention before we jump straight in is that we have a Discord channel. Look for the data track Discord channel, pop questions in there for when folks are taking questions, if we have time, and just say hello in the chat. A big shout-out to everyone in the online audience; it's great that you're tuning in as well.

With that said, let's jump into our proceedings. First up we have Christine Seeliger and Long Dang. They're going to be talking about "Unpacking the geospatial engineering toolbox: an overview of data science techniques for spatial data". Just some quick intros. Christine is the technical lead of WSP Digital's data science and analytics team. A scientist by training, she has worked across a wide variety of fields, from computer science and software engineering to biomedical research, fintech, and now civil engineering. She is continuously looking for new and interesting challenges to expand her knowledge and skills. Long is a junior data scientist at WSP who likes all things math, deep learning, and computer science. Long is most at home crunching through some algorithm coding problems with too much scratch paper and a pen. When Long is not at work, you can find Long watching anime, playing gacha games, going on random walks if the weather is nice, meeting friends and new people, or trying to get a neural network loss to go down. So please, a round of applause for our first talk.

Hi everyone, thank you for spending some of your time at the first half of this track. I think when you think about data science you probably think about statistical techniques, like the things that you can work on for tables, like the things
you find in Excel, or you may be thinking about unstructured data like images and text, the things that you work with when using deep learning models like ChatGPT or Stable Diffusion. They're all cool and stuff, but at my workplace, WSP, we have a lot of contact with civil engineering disciplines, so one of the data types that we work with a lot is actually geospatial data. It is everywhere and it's very useful. In this presentation we would like to give you a brief overview of what geospatial data is and what you can use it for. We'll give you an overview of the Python libraries and techniques that we use, and we'll give you some examples of the actual use cases, or projects, that we deliver to clients to give them insights from their geospatial data.

We'd like to begin by acknowledging that we're meeting on the traditional country of the Kaurna people of the Adelaide Plains, and pay respects to the Elders past, present and emerging. My name is Long, and with me is Christine, and we are data scientists on the team.

So the first question, and probably the most important one, is: what is geospatial data, and why is it important to us? The long and short of it is that geospatial data represents all the physical things that surround us. It can represent the buildings, this building, this room, the chairs, the roads around us. It can represent the electrical lines that bring power into our homes, and the water pipes that bring water to and waste away from our homes. Google Maps, as you can see on the right, is probably one of the best examples of how we use geospatial data in our everyday life. We use all this geospatial information about the layout of the buildings, of the city, and the public transport network to help me get to this convention center today. We also use it to find places to go on weekends, for example. And this is why geospatial data is important: we want to use all of this information to make better decisions, not only about how to move from place to
place, but also about how to build better cities, better environments and better communities. For example, we may want to use geospatial data to find the best place to build new social housing or new parks for underprivileged communities. We may want to understand how our infrastructure works, how it is all connected, where faults may happen and, if they do happen, where we can best address them.

Before we can delve into solving all these important problems, we first need to understand how we actually represent geospatial data for computers to understand. There are many ways to represent geospatial data, depending on whether you want to include verticality or not, but for this presentation we'll mostly focus on the 2D case, you know, like a map, where things live in two dimensions. In two dimensions everything lives on a coordinate system, so things can be represented as vector data. These are the intuitive things that you find in your everyday life, like shapes, lines and points. You can abstract over this representation using graphs, where the shape of things no longer really matters; only the connectivity of the objects you care about matters. 2D data also includes things like images and rasters, where things are recorded in pixels. Each pixel represents some actual real-life geographic area, and for that area you record some measurement, maybe the amount of rainfall, the amount of solar potential, and so on.

So this is probably the simplest vector data that you can think of: it's just a triangle and a square. But you can do more interesting things with them when you start moving them around, rotating them, or finding the intersection between them, which is highlighted in purple here. You can merge them, or use one shape to mask the other. All of these are common operations that you may think of when dealing with vector data. And you can think of this representation as not unique to geospatial data alone: because they are shapes, they are practically
everywhere; they are used a lot in computer graphics as well. So there is a lot of literature and many algorithms developed over the years that we can leverage for geospatial data analysis.

Speaking of tools and algorithms, on my team we use a very common suite of tools to extract insights from geospatial data. The favorites are QGIS, GeoPandas, Shapely and NetworkX, which we will go over shortly. We also use Rasterio and laspy, which are libraries for raster and lidar data, but we will not talk about them in this presentation because we don't have enough time.

So the first stop is QGIS. When we receive data from a client, or download data from the internet, we will usually just plug it into this tool and it will plot everything onto a map, and then we can scroll around just like in Google Maps. We can also click on the actual shapes themselves to find the attributes associated with them. Here we have a lot of orange polygons that are the footprints of houses, and we also have dots that represent points of interest. QGIS also has a lot of tools associated with it. It's a full-power analytics tool, so it can also do all the geometric operations that I talked about: it can merge things and find geometries of interest. As an open-source tool, QGIS also has a lot of third-party plugins that you can use to improve your analytics workflow, and you can actually write Python scripts that run in QGIS as well.

If we have determined that QGIS is not sufficient for our work, and we really want to dig deep into things and start automating things with code, the second stop is usually GeoPandas. It's kind of like pandas, which is a library for manipulating tabular data, except we have an additional geometry column, and it can read and work with the most common geospatial data formats like SHP, GeoJSON or GDB. In this code snippet on the right I'm importing GeoPandas and using the read_file API to read all the geometries for suburbs in South Australia, and this
read_file API is very similar to the pandas read_csv API, and you can pass arguments to it to specify which format you want to read. The next line of code, to_crs, is where the geometry column really shines. Here I'm using the to_crs API to perform an operation that transforms one coordinate system to another. When you think of a coordinate system, you probably think of latitude and longitude, which can locate anything on the globe. But it's not the only coordinate system, and in fact it's not even useful for some use cases where we want to measure precise distances in meters. You can't really have a distance in degrees; it doesn't make a lot of sense. So what we do is convert that coordinate system into one that is locally precise for South Australia, where we can measure distances in meters, and that's what we're doing with this API call. Then we can perform common pandas operations, so things that work on tables look similar here. Here I'm just indexing some columns and showing them in my Jupyter notebook.

If we want to go a bit further and directly manipulate the geometry itself, we turn to Shapely. Shapely actually underlies the GeoPandas geometry objects, so when I print the type of the first geometry in the geometry column, it comes out as shapely.geometry.polygon, and I can use a lot of the common geometry operations that Shapely provides. Here I'm using the unary union operation, which just merges all the geometries together. Because I'm merging the geometries of suburbs in South Australia, I'd expect to see the state of South Australia emerge, and it kind of does, except there's this big hole on the left, and I don't really know why. Maybe there's just no one living there. But as in all disciplines of data science, data quality is probably one of the biggest problems that we have, and geospatial data is no stranger to that problem either.

To really show how powerful these
libraries are, though, we need to dig into the use cases and the projects that we work on with our clients to bring value to them. The first project I would like to share with you is sewage pipe maintenance. The problem that we're trying to solve for our client, a water utility company in New South Wales, is to optimize the sewage mains maintenance program to achieve the best performance for their budget. The idea is that as sewage pipes age they become more prone to breakages or leakages, which release a lot of harmful things into the environment and also cost a ton to fix. So ideally we would like to replace or maintain the pipes before they get too old. Obviously you cannot replace everything everywhere all at once, so you have to make decisions on which pipes to replace now, within your budget.

For the scope of this work, we want to automate this optimization program, and we also want to add some new features. We want to tell the client how accessible a pipe is to maintenance equipment, and this breaks down into two sub-measures. One is how much of a pipe is built over, say by a property; you can see that the pipe highlighted in red is built over in certain places. The second is how accessible a pipe is from the road when it runs behind all these properties. Let's say I want to access the red pipe from the road: I may need to go past the building in gray, and the property may not have a large enough right of way for my equipment to drive through.

For this problem we have data for the pipes as lines. We also have the outlines of all the buildings as these orange polygons, which are actually generated from satellite images using a deep learning model, so that's an interesting application there. Finally, we have outlines of the lot for each property, and this data is publicly available from the New South Wales government.

So to solve the first measure, which is how much of a pipe is built over, the
algorithm is relatively simple. Intuitively, if I want to find how much of a pipe is built over, I just find the exact section of pipe that is underneath any other geometry. This is a geometry intersection problem, and in Shapely you can do that with a single API call, the intersection API. Any geometry has an intersection method that can be called with another geometry, and Shapely will handle the specific algorithms to intersect a line with a line, or a line with a polygon, etc. That's all well and good.

However, the problem of determining how much of the pipe is accessible to equipment behind properties is a bit harder. There's no single operation that you can just run to get the answer. Our first naive approach was to try to measure the gap between the building and the lot's fence line. The idea is that if the pipe is in front of the property then surely it is accessible, so that's the trivial case. If it's behind the property, then the question is whether there is a big enough gap between the building and the lot's fence line for the equipment to drive straight in, like a driveway that's not blocked by a garage. Visually, we're trying to measure the length of that red line between the building and the lot. In Shapely, distance is very easy to compute, assuming you are in the projected coordinate system that I mentioned earlier. You can use the distance API between any two geometries, and Shapely will handle all the specifics to measure the shortest distance between any two points on those two geometries.

This approach, however, breaks down when you add more complexity to the problem. Here I'm adding some extras like sheds, a garage and maybe a pool, so there are extra polygons in the lot now. The minimum distance between the building and the lot's fence line doesn't tell the whole story, because maybe there's sufficient distance but a shed further down the line is blocking it, as you
can see. So now you have to include the distance between the building and the shed as well. But if that distance is too small, maybe you also need to consider the first distance that we measured, because there may still be a path straight through. I guess you could come up with some algorithm to iterate over all these combinations of distances, but it's not a very naive algorithm anymore, and it's starting to become very complex and error-prone.

Fortunately there is a better approach, which uses buffering. Buffering is just expanding or shrinking a geometry by a set distance. In this illustration we are buffering, that is expanding, the original building geometries by one meter, so you can see the original geometries in orange with the cross on them, and the buffered outlines in a more transparent orange. I'm also shrinking the original lot geometry by one meter, and the remaining space is colored in green. The insight of this approach is that if we make a simplifying assumption about the shape of the equipment that we care about, and we're just going to say it is a sphere of some radius (and we can tweak this radius to account for the actual shape of the equipment), then the center of the sphere must lie in the green area. You can see this for yourself if you imagine placing the center of the sphere anywhere that is not in the green area: when you draw the sphere, it must intersect the lot's fence line or the original building geometry somewhere, because the radius would exceed the buffer distance. So if we now check that the green area contains both the pipe and access to the road, then surely some equipment that is a sphere would be able to go from the road to the pipe.

We can do this in Shapely using the difference and buffer APIs, which do exactly what they say they do. We compute the difference between the negatively buffered lot, so that's the shrunken lot, and the positively buffered building envelopes, to give the
green area. We then iterate through each polygon within that green area, just in case the buildings, when buffered, chop the area in half, and for each of those polygons you can check whether it intersects both the road and the pipe geometries. If it does, then the pipe is accessible.

The outcome of all of this is a set of condensed summary measurements for each pipe: say, how much of it is built over and how much of it is accessible. All of these measurements can then be fed to downstream applications, for example a dashboard or some reporting program, so that the client can make decisions, or they can even be fed into another machine learning program to actually perform that optimization in the first place.

Sometimes, though, we don't want summary measurements from geospatial data. We instead want geospatial data as the output from our initial geospatial data, because we may want to, for example, visualize it on a map. The second project I want to share with you is one where this is the case. In this project, heavy vehicle road access, the problem is that we want to evaluate road networks in New South Wales against electric truck dimensions. The idea is that the state of New South Wales wants to adopt electric trucks, but these are really large compared to ordinary trucks: they are larger and longer to account for their batteries. If you have ever driven behind a truck, you know that you don't want to be inside its turning curve, because as it turns it creates this really large swept path, which is the total area that it traces out as it turns. You don't want to be inside it, because then it's going to hit you, and that's not going to be a fun experience. So the client wants to understand whether the road network in New South Wales is sufficient for all these new vehicles. For that purpose we... sorry, I think my laptop is not liking me today. So, we have developed a simple algorithm that can simulate the swept path given
the path that the mover of the vehicle takes. But this is actually the start of a new problem, because this simulation algorithm needs the path that the mover takes, and the data that we were given does not have any information on what actual paths it can take to turn. The information we're given is a single line for each road geometry, which starts and ends where the road starts and ends. So in this intersection we have many separate roads, but we have no information that tells the algorithm that you can take roads one, two and three to turn right from the top. None of that information is available to the algorithm itself; all it sees is this collection of separate roads. So we need a way, given this collection of roads, to enumerate the possible paths that the truck can take and then output some actual road geometry that the simulation algorithm can then take. Sorry for that.

All right. The way that we solved it is by using the NetworkX library, which constructs graphs. OK, let's see if I can get this to work now. What we did is take each individual road and treat it as an abstract edge, so the start and end of the road are the start and end vertices in the graph, and we ignore all the actual coordinates that link those two vertices together. Then we define some entrance and exit vertices for the intersection; maybe you can say that if there's only one edge at a node, then that is an entrance or exit vertex. Then we can use a graph algorithm to find the shortest path between those two vertices, and that gives us a list of edges in the graph. We can then extract the coordinates of the road associated with each edge and concatenate them, and that forms the final road that the truck can take. In NetworkX this is done using the code in the bottom right. We first initialize this graph
component, this graph class. Then we iterate through each geometry, index into its coords array, and take the first and final coordinate entries as the start and end vertices. NetworkX actually allows you to use any hashable, immutable Python data structure as a vertex, so you can insert the coordinate pair right into it. We can also attach the coordinates of the original LineString geometry to the edge, which makes it easier to extract the coordinates and concatenate all the paths together. We can then smooth out this path using some additional NumPy vector math to create this red geometry, which informs the simulation algorithm that the truck can take this path to turn right. We can do the same for every other pair of start and end vertices, and that creates all these paths that navigate the intersection.

Now, if I click next it will jump all the way again, so I will just present without any visual aid before passing on to Christine. The output of this algorithm is the swept path geometry that we have simulated for the truck. We can then look at the widths along the swept path, and at the width of the road, and say: at this location the width of the swept path is too large, so there will be a problem, because this truck will go into the opposite lane and cause problems for other vehicles, or it may intersect some roadside assets, like maybe a fence or traffic lights. The problem is that we don't have very good labels for how wide a road is at any point. It would be good if we had the actual geometries for the road, like a polygon that covers the area of the road, but that is actually quite hard to get. So, as in all other data science problems, quality labels are also a problem in our case, where we don't have ground-truth geometry labels that we can compare this swept path simulation with. With that said, I will now pass it on
to Christine, who will take it over from here. So, yeah, just move to the next slide.

Yes. So with this last use case we are pushing the libraries and tools that Long was talking about a little bit to the side. They're not center stage, but they are still very important for producing inputs and outputs for our analysis. Our team frequently supports traffic modeling and simulation tasks with data analytics and automation, and occasionally the tools and frameworks that are in use by those teams are not necessarily suitable for the problem at hand. In this case, one of the issues they encountered was modeling the reversing of vehicles.

So what was the problem? We were asked to look at a precinct in a mountainous region that is frequented by a lot of tourists, particularly on weekends. This precinct has two access roads that are very narrow, and in a lot of places they're just single file. So you can imagine that if you're driving down the road with a lot of traffic, you quite frequently encounter vehicles going the other direction, which causes you to either be brave and scramble past them or, what is probably a little bit safer, reverse and find a spot where you can safely let them pass. The operators of this road were particularly worried about the latter: people reversing up and down the road and creating too much traffic on it. So they asked us to provide some foundations for traffic operations that they could use to alleviate that problem, for example using shuttle buses with different schedules and different sizes to get the vast majority of people down the road, or implementing boom gate solutions at either end of the road.

So how did we approach this problem? We needed something that could iterate over a lot of different situations fairly quickly, and we decided to break the road up into cells. Each of these cells would have properties, such as whether it is single or double file and, for example, the speed
associated with traveling through this particular location, and it would have a state, which represents whether a vehicle is on the cell, whether this vehicle is traveling east or west, or whether it is currently reversing or in a holding state, waiting for a vehicle to pass. For each time step in the simulation, the cells would update their state in a random order based on the surrounding cells. We introduced some more variability, for example in the arrival times of cars at either end of the road, based on distributions we got from traffic counts that were actually measured up front. What this allows us to do is very quickly test different operating scenarios, introduce the boom gates and buses and other things, and simulate, over thousands of typical days, how traffic could evolve on this road.

How did we implement this? At the core is a custom simulation that we wrote in Python, and one of the key inputs is a shapefile that contains the geometry of the road. You can imagine that you can easily load it and then use Shapely, which Long was talking about before, to break this road down into evenly sized segments. In addition, we put in the vehicle counts and a config file that relates the chainage along the road, and therefore its relation to the cells, to the speed limits or average speeds on those sections of road, and to which ones are the single-file locations.

Once you've done that, your simulation can essentially forget about the spatial context. However, it's really valuable to keep this information in the back of your head, because we implemented a visualization for this in Pygame, which basically translates the coordinates of the cells into screen coordinates. We could use this to take the client on our journey of developing the simulation and show them along the way how vehicles are actually moving along the bendy mountain road. The simulation also has a batch mode where you can run
thousands of simulations, and it just chucks out files with the parameters that we are interested in.

The last bit of this analysis that is missing is what we do with the output once the simulation has produced it. In good old data science style, we developed a Python notebook that reads all of this in for a single scenario and then creates aggregated outputs. Again, the spatial representation of these cells becomes quite important, because it allows us to produce a spatial representation of our results that can then be loaded, for example, into QGIS and put on maps to show people the actual context of what's happening. These are some example outputs. I need to say that the y-axis went missing from these plots, but it's a percentage of vehicles. They show, for example, the overall travel time on the road, the inter-gate travel time and the number of interactions that vehicles have. In particular, on the left you can see the hot spots of the reversal interactions that we were mostly interested in, and on the right you can see general vehicle interactions, which also include vehicles holding or queuing up behind others.

This approach worked quite well, and after this we conducted a second study with a very different topic but the same methodology, which looked at a mine site in WA and the road that is required to access it. This is a dirt road, and the biggest problem there is dust development. So we used this methodology to simulate different operating scenarios: water trucks, convoy scenarios to get workers in and out, and heavy vehicles traveling up and down this road with machinery. So, a very useful little methodology, and as you can see, the spatial context is really important and useful for showing your results. That's it from us. If there's still time, I think we're happy to take questions; otherwise, thanks for listening.

Thank you. Thank you so much for that talk. We are running a bit late, and so
you did a wonderful job, with a few minutes extra, but we might give that back to folks too, because we're going to try and stick to the 11 o'clock schedule. So we'll jump into saying thank you so much, Christine and Long, and we have some gifts for you. Thank you.
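
The intersection-path step described in the talk (each road treated as an abstract edge between its first and last coordinate, entrance and exit vertices taken as the vertices with only one edge, a shortest path found between them, and the road coordinates concatenated along that path) can be sketched roughly as follows. The talk does this with NetworkX; this sketch uses only the Python standard library, with a hand-written breadth-first search standing in for the library's shortest-path routine, so it is an illustration of the idea and not the presenters' code. The `roads` data, the road names and all function names are hypothetical.

```python
from collections import defaultdict, deque

# Hypothetical road centerlines for a small T-intersection. Each road is a
# list of (x, y) coordinates, like the coords of a LineString geometry.
roads = {
    "north": [(0.0, 10.0), (0.0, 5.0), (0.0, 0.0)],
    "east": [(0.0, 0.0), (5.0, 0.0), (10.0, 0.0)],
    "west": [(-10.0, 0.0), (-5.0, 0.0), (0.0, 0.0)],
}


def build_graph(roads):
    """Treat each road as an abstract edge between its first and last coordinate."""
    graph = defaultdict(list)  # vertex -> [(neighbor vertex, road name), ...]
    for name, coords in roads.items():
        start, end = coords[0], coords[-1]
        graph[start].append((end, name))
        graph[end].append((start, name))
    return graph


def terminals(graph):
    """Vertices touched by exactly one edge: entrances/exits of the intersection."""
    return [vertex for vertex, edges in graph.items() if len(edges) == 1]


def shortest_edge_path(graph, source, target):
    """Breadth-first search; returns road names along a fewest-edges path, or None."""
    previous = {source: None}
    queue = deque([source])
    while queue:
        vertex = queue.popleft()
        if vertex == target:
            break
        for neighbor, name in graph[vertex]:
            if neighbor not in previous:
                previous[neighbor] = (vertex, name)
                queue.append(neighbor)
    if target not in previous:
        return None
    path = []
    vertex = target
    while previous[vertex] is not None:
        vertex, name = previous[vertex]
        path.append(name)
    return path[::-1]


def stitch(roads, edge_path, source):
    """Concatenate road coordinates in travel order, dropping duplicate joints."""
    line = [source]
    for name in edge_path:
        coords = roads[name]
        if coords[0] != line[-1]:
            coords = coords[::-1]  # road is stored against the travel direction
        line.extend(coords[1:])
    return line


graph = build_graph(roads)
ends = terminals(graph)

# Path from the top of the intersection to the eastern exit.
source, target = (0.0, 10.0), (10.0, 0.0)
turn = stitch(roads, shortest_edge_path(graph, source, target), source)
print(turn)
```

Using the coordinate tuples themselves as vertices mirrors the point made in the talk that any hashable, immutable Python object can serve as a graph node. The stitched coordinate list is the kind of raw path that would still need smoothing before being handed to a swept-path simulation.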
Info
Channel: PyCon AU
Views: 268
Keywords: ChristineSeeliger, LongDang, pyconau, pyconau_2023
Id: QwVdbCiayBo
Length: 30min 45sec (1845 seconds)
Published: Thu Aug 24 2023