Getting Data Science with R and ArcGIS

Captions
All right, welcome everyone. It's just four, so I think we'll get started so we have enough time to get through this. We're here today to talk primarily about using R in the context of ArcGIS, but also a bit about data science and how it relates to using ArcGIS. My name is Shaun Walbridge; I'm a developer on the geoprocessing team. I primarily work in Python, but I'm also doing a lot of work now with R. And this is Marjean. I am Marjean Pobuda, and I am a product engineer on the spatial statistics team; I also serve part-time as product engineer for the R-ArcGIS bridge.

Great. So we're going to go through a lot of stuff here, and if you've been to a talk I've given before you've probably seen this already, but that very top URL is the slides: the slides we're looking at right now live at that URL, so feel free to follow along, or just write down that one URL. There are lots of links in this talk, and I find it much easier to give everyone the slides than to have people furiously writing down links while I spell them out. The second link, the GitHub repository, holds the slides themselves and also the demos we're going to give; there will be resources in there for all of that. They're not all there right now, but we'll get them up in the next couple of days. So those are two good starting points; feel free to take a picture of that. There's also a resources section in this talk. We used to have 75-minute sessions and now we only have an hour, so instead of talking like the Micro Machines man, I'm probably just going to skip the resources section and leave it to you as an exercise after we're done today. If we do have extra time, I'll look over it a little and give you a flavor of some good next steps.

Okay, so what is data science? That's a good starting point. There's this idea of a hype cycle you might have heard of, and data science is pretty much at the top of that hype cycle right now; we're somewhere around peak data science. But what is it really about? It's taking things we've had for a while and combining them in a new, interesting way. It's about taking statistics, which has a long history as a discipline, and machine learning, which is newer. Machine learning has gone through a number of evolutions; some people would classify it as a part of AI. Some labels in the scientific community come and go, because people have to write new and compelling grant proposals and certain labels stop being acceptable in certain communities. AI is actually one of those: there was something called the AI winter in the '80s, when you could not get any money for AI, so I think machine learning took up the mantle as the name, and now AI is making a comeback as data science gets bigger again. The basic idea is that you're applying those two domains, statistics and machine learning, to real-world data. You're not just doing it in the abstract; you're doing it with real-world data, and you're coming up with formalized methods that take that data through some kind of pipeline you can use and reuse. You're not doing this once to get one published result; you're building a system where you can use that result over time. So it maps well onto how we think about doing things in a business process context: we want something where we can change the process but then continue using it many, many different times.
Now, transforming that one-off analysis into this kind of formalized tool takes a lot of different fields to get there, so there are a lot of people bringing different aspects of the story to data science. So what is a data scientist? Josh Wills, who's at Cloudera and was here a few years ago at DevSummit, defined a data scientist as someone who is better at statistics than any software engineer and better at software engineering than any statistician. That's a pretty good working definition. The joke version is that a data scientist is a statistician who lives in San Francisco. We can go with either of those, but just to give you an idea, it's someone who combines what historically were two pretty disparate environments: doing statistics and actually doing software engineering.

Data science has an interesting context in geography; there's this kind of nascent field of spatial data science. As geographers we already have to rely on knowledge from many different domains. Geography is often an enabling discipline: it's typically not working in and of itself but combining with other domain knowledge to solve problems. We know that spatial data is more than just taking the X column and the Y column and saying, all right, we're done, we've got spatial. Even though an X and a Y are just two more floating-point values in a table, we know those two values carry a very special context and meaning that we can get a lot more out of. Part of what I think geography brings to this is that we can take data that has historically been handled in an aspatial context, contextualize it, and do new and interesting things that combine geographic science with the data science work that's going on.

In terms of the languages people use for data science, there are a lot of them, so I'll just give you a quick background. Python has a big community; C++ and Java, some of the more heavyweight languages, appear; and R, which we'll talk about a lot today. One interesting thing: I mentioned converting these workflows into productized, reusable systems, and that often means you have to take pieces from different stacks. You can't just say, we're a Java shop, we're only going to do Java, and if there isn't Java code to do it then we're not going to do it. That's impractical in a lot of situations; it's very common that a data science problem requires spanning multiple languages, and it's part of the reason we're looking at things like R. We want to meet people where they're at, because they need to access things in different environments. There's actually a recent report from the blog r4stats, produced maybe a month ago now, on the data science job world. It looks at the jobs labeled as data science and asks what programming languages they use.
SQL, which is arguably not even a programming language, is definitely the king; everyone has SQL somewhere in the back of their systems. I put some check marks here for the things that we as Esri have supported for some time, the things that are part of the core product: SQL, which you probably use, Python, Java, C++, and we have a bridge for integrating with SAS. But there's one missing from that top tier, which is R, and that's what we're here to talk about today: R is clearly used in a lot of data science contexts, and two years ago we didn't have any story for it.

One thing I also wanted to mention: in this kind of framework where you're trying to combine things from many different languages, one of the problems is how you actually get all that software into the environment you're working in. We've partnered with Continuum Analytics, who I would say are the preeminent provider of open data science solutions. They've built an open-source, industry-standard package management system called Conda. It initially came out of the scientific Python community (a lot of the folks who work there are founders of that community), but as Marjean showed in a really great demo in the plenary, we can use it for packaging up R as well, and you can actually use it for lots of other languages. It gets us closer to this idea that we could create an environment with a workflow that spans multiple languages at once. If you're interested in learning more about what Continuum is doing and the offerings they bring to the data science table, they have a session tomorrow at 10:30 in the morning in Mesquite G/H, and Brendan Collins, who's a great guy and knows a lot about their offerings and about the spatial world, will be talking about some of the things they're doing.

Okay, so let's talk a little about R. We've identified why R is interesting to us, but what is R, and what is it for? Esri has been doing a number of things to get into the R community. We built this R bridge, which we'll talk about primarily today, and we're also doing things at the logistical level. R is an open-source language contributed to by many companies and many individuals, so we've joined the R Consortium, a collective interest group of companies with an interest in making R do new things. They do some cool stuff: they take money from various companies like us and many others and give it back out to community projects; they identify interesting new projects and hand that money out to extend R in different directions. Historically our GIS has been more purely Python, but we want to meet people where they're doing their analysis.

Okay, so what are the things that make R really great? One is that it has data structures and operations that are very useful for statistical analyses. As a programming language and environment, it really came out of people needing to do statistical analysis, and that means the things in the core of the language are the things people doing data analysis want, built in at such a fundamental level that things that are possible but very painful in other languages are instead very simple in R.
There are a few of those we'll talk about in detail: things like data frames and the functional programming model R uses, plus something called a DSL that it implements, which we'll get to as well. R has also really snowballed into being the de facto language of statisticians, across a lot of different domains of statistics. My background is in ecology, and most of the big, well-known ecologists actually produce packages first: R packages are the output of their scientific research, and they publish papers based on the packages they've published. That's pretty powerful. If you need cutting-edge statistical routines, you can't do better than getting a stats package from someone who just created it, right at the edge of what's possible. There are about 6,400 packages now on CRAN, which is the package repository. Those vary greatly in quality: there are requirements to make a package fit certain criteria, but there's no formal rejection process, so one of the challenges is doing a little bit of work to find the right package for your particular problem.

R also has really great and very versatile plotting. You can make publication-quality plots, and it's actually pretty simple to do. Even with lots of new things happening in the visualization space, I think R's plotting remains probably the best solution out there, particularly if you care about the meaning of the visualization. (Sorry, I'm getting a little feedback up here.) It's different if you're just trying to make a chart for some meeting somewhere, where the interpretation is just "is that bar bigger than that bar"; that's fine. But sometimes you really care that every pixel of the output is rendered in a specific way so that you can interpret it directly. There used to be this game of academics taking papers where the data wasn't published, digitizing the PDF, extracting the relative heights of the bars in the charts, and converting that back into numbers to get the data out of the paper. R would be a good place if you wanted to do that, because your plots would be very accurate.

Okay, so we're going to talk a little bit about programming in R. We have to assume a very basic proficiency in programming for this talk, and we have resources for really getting into the guts of R at the end of this presentation. We're going to just cover the basics, get into the R bridge, and show how you can interface it with ArcGIS; if you want to learn more about R as a programming language, check out the resources at the end.

So what are the data types at the bottom of R? Most of them are things that, if you've done any programming, you've probably seen before; they're common across all the major programming languages. We have numbers, both floating-point and integer; we have strings, or characters; and we have Boolean values, logicals. Those are things pretty much any programming language provides, because they're so commonly useful across different environments. The differentiating factor in R is that you get a bunch of things at the very core of the language that aren't available in your typical programming language. Some languages have these as extensions, but they're not necessarily well integrated into the language, and you can't rely on them existing wherever your code runs, so they can be very tricky to use. These are things like matrices, the simpler one-dimensional case of vectors, and data frames, which we'll talk about more. There's also this idea of factors: if you have some ordinal data, you might want to store it with a lookup table, where you say there are only, say, four levels in this factor, and then you can reference it in a numeric context or through the textual value. R has all of that baked into the base of the language.
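As a quick illustration of those built-ins, here is a minimal sketch; the variable names and values are invented for the example:

```r
# Types found in most languages
n <- 3.14      # numeric (floating point)
i <- 42L       # integer (the L suffix)
s <- "Tobler"  # character (string)
b <- TRUE      # logical (Boolean)

# Structures R bakes into the base language
v <- c(1, 2, 3, 4)                    # vector: one-dimensional, one type
m <- matrix(1:6, nrow = 2, ncol = 3)  # matrix: two-dimensional

# A factor: ordinal data backed by a lookup table of levels
rank <- factor(c("low", "high", "medium", "high"),
               levels = c("low", "medium", "high"),
               ordered = TRUE)
levels(rank)      # the textual values: "low" "medium" "high"
as.integer(rank)  # the underlying numeric codes: 1 3 2 3
```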
Data frames are a particularly interesting thing, so I'd like to highlight them; they're something I really, really like. Basically, the idea is that you store your data (it could be just tabular data, but it could also be multi-dimensional data) in a way in which all the variables you're capturing are labeled explicitly and each observation you have is indexed. You can think of it as some kind of table or matrix where you've got labels for the different variables you're capturing and each row is indexed, and that tuple, that one row, represents a collection of related observations that are all part of one observation. You can then get at them either through the index or through the labeling. It sounds pretty simple, but it's actually a much, much better data model for doing analysis than what existed before. I like to bag on Excel (sorry to anyone who uses lots of Excel for interesting things), but it's really hard to take someone else's Excel data and work with it. The guy who designed VisiCalc, the intellectual predecessor to Excel, said it's really just a 2D layout program; it doesn't actually know about typing information, for example. You could easily have a column where you think you're getting nothing but 64-bit integers or floating-point values, but someone put a string in there, and now you have to do all this work to parse it out. You can't really rely on it at the base, because the data structure conflates things: it just lays things out in 2D, and the fact that you can sometimes sum a column correctly doesn't mean it's a good data model for storing all your data. That's why I love data frames.

I'll now walk you through what they look like in R. In this case I'm just going to take a very simple CSV file, something you might have sitting around already, and tell R to read that file: I point it at the name of the file, and I tell it that this particular CSV has a header row, meaning we've already labeled our columns. At that point I have a data frame that's come from a CSV file, and I can do things with it.
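A minimal sketch of that call; the file name is hypothetical:

```r
# Read a CSV whose first row holds the column labels;
# the result is a data frame
df <- read.csv("sales.csv", header = TRUE)

head(df)  # the first few indexed observations
str(df)   # each column's label and inferred type
```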
The other way of making data frames is to concatenate together existing objects. Here I'm defining some vectors (a vector is a stored collection of values of the same data type), and in this case I'm going to put together some information: the quarter of the year, some people's names (I've got some famous geographers here: Goodchild, Tobler, and Krige), and then, since in this scenario they're salespeople for some reason (I don't know why, it doesn't really make any sense), whether or not they met their quota in the particular quarter we're looking at. It looks like Tobler failed; he was probably off making some other really cool observations. So now, instead of making a data frame by pulling in data, I'm going to concatenate together these different labeled variables into a data frame. All I do is point it at the three variables I've created, and I get back a data frame whose labels come from the input variables. If I then just type d at the R prompt, we get this representation: we've got our labels here, and we've got our indexed observations here. That circled index is assigned automatically; if you come from the database world, you can think of it as a unique primary key. It's automatically generated for you, and it always keeps track so that the data is never ambiguous. (A sketch of this construction appears after the next paragraph.)

Okay, so that's the very basics of some data types in R. We also have spatial data models in R. There's a package called sp, and it knows how to deal with some pretty complicated representations of data. The first three we're probably all familiar with and work with commonly: the zero-, one-, and two-dimensional cases, our points, lines, and polygons. But sp even knows how to deal with more complicated models: solids, and space-time objects. Whether you can do analysis on space-time objects is more complicated (there aren't as many tools that know how to take those and analyze them), but you can at least represent them very simply in R. The general story here is the entity-attribute model. If you've dealt with spatial data before, this is what you're used to: you have some tabular data structure that represents the attributes you're storing, and alongside it you have the entities, the geometries themselves, with some linkage between the two. In the simple case of a point you can actually cheat: just put in an X column and a Y column and you're done. But in all the other cases you need to link those two views together, because lines and polygons aren't going to fit nicely into that tabular structure; they can have totally different sizes and scales. This is a common model we use all across GIS, and it's also implemented in R.
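Here is a minimal sketch of both ideas: the quota data frame described above, plus the point case of sp's entity-attribute linkage. The coordinates are invented for illustration:

```r
# Build a data frame by concatenating labeled vectors
quarter   <- c(1, 1, 1)
person    <- c("Goodchild", "Tobler", "Krige")
met.quota <- c(TRUE, FALSE, TRUE)
d <- data.frame(quarter, person, met.quota)
d  # typing the name at the prompt prints the labeled, indexed table

# The entity-attribute model in sp: attributes in a data frame,
# geometries alongside them, linked row by row
library(sp)
coords <- cbind(x = c(-117.1, -116.5, -117.8),
                y = c(33.8, 33.9, 33.7))
pts <- SpatialPointsDataFrame(coords, data = d)
summary(pts)
```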
Okay, so now I'm going to get you into the actual R-ArcGIS bridge, with a little background first: who are we targeting with this? One thing we found in talking with a lot of our customers is that they have a lot of different people in their organizations. They have some people who are often ultimately responsible for talking to stakeholders, and the stakeholder group is often very large: it could be the public, say if you're working in local government, but it could be internal stakeholders too. Those stakeholders are good at taking results, interpreting them, and making decisions between alternatives, but they're not really in the business of generating new results. Then they have analysts, who might be in the spatial department or in other areas within the company or group; they have very solid problem-solving skills and often very solid spatial reasoning skills, but that doesn't automatically mean they're programmers. And there might be a few people in the organization who are very, very fluent in programming, maybe some who do only programming. The ratio between these different classes varies between organizations, but we have customers where it might be as large as a thousand to one: for every highly fluent R programmer, they may have a thousand analysts on staff. That creates a mismatch, because if there's a limited number of people who are really fluent in R, any time something comes up that an analyst can't handle, they have to throw it over the wall to the programmer, and that's a bottleneck. We want to enable the analysts in the organization to do more and take advantage of some of the things in R without necessarily having to learn the entire language soup to nuts. That's one of the motivating factors behind creating this bridge.

To reiterate that in another context: we've got ArcGIS developers who already know how to do things like create tools and make toolboxes, and they want to integrate R in a way that fits the existing workflows they know and understand. For users of ArcGIS, maybe somewhere in their organization, or somewhere on the Internet, there's some R code that does exactly what they want, and they want to use it in a context they know and understand, which is something like a geoprocessing script. And finally, there might be R users who maybe don't use ArcGIS at all, who aren't really even interested in it, but whose organization's system of record is already ArcGIS. If the end result for them is having to spit data out into some other format and run it through some munging process to get it back in, that's problematic, because that ETL transformation is fragile. They might want to get into the organization's data, do analysis in R, and then put it back into the system of record, the traditional GIS database.

So that's the motivating story behind the bridge. What it basically does is let you handle those problems: you take your data in ArcGIS, stored the way you always store it, access it quickly in R, and return it at the end as R objects that get converted back into native ArcGIS data types. That could be feature classes, and it could be other kinds of things as well, like tables. It also ties into that sp package I mentioned: it knows how to take the spatial data you have and convert it into a native representation that R understands.
So I'll just link you to the package documentation, and with that I'll let Marjean take over and show you how this actually works.

Awesome, thanks Shaun. The point of this demo is really just to help you get your feet wet. We're going to start by showing exactly how you get started, how you get this infamous bridge, and a basic example of how you can use it to read and write data between ArcGIS and R. To begin with, we start here: if you just google for the bridge, type in "R ArcGIS bridge", the very first result that appears is its GitHub page. This is where the bridge lives. When you want to get the bridge, all you need to do is simply download it, and believe it or not, you're almost there. If you're curious about what the bridge actually is, we can take a peek at what we have here: the bridge is just a toolbox. So when you have a project where you say, hey, I think I could really use some help from R, you can go to that project, at any point in your analysis, and every project has a project pane where you can view the toolboxes associated with it. Because the bridge is simply a toolbox, you can select to add it, and then you can access the tools it contains. Specifically, it has four tools, the first of which is the only one you actually need to set up the bridge. The Install R bindings tool is incredibly simple, as you can see: if you've never installed the bridge before, you click Run and it will build the bridge. If you installed it at some prior point and you're not sure, you can just check "overwrite", which takes care of any existing bridges you've built; click Run and you're good to go. It's done, it's installed, that's it. I'm not going to run it now, because I actually have RStudio open, but this process is incredibly quick; I think it takes seconds, which is awesome. And once you have it installed, it's installed: any project you open, that bridge is still there, and you can open RStudio or R and access it, which is great.

So let's talk about why I might want to make use of this. Here I have crimes that occurred in San Francisco in 2014, and I have performed an emerging hot spot analysis on them. What I'm looking at is trying to get a sense of the spatiotemporal trends in my crime data. What does that mean? I have crimes occurring across my map, and I'm looking at locations, little hexagon bins on my map, and comparing the crime count for each bin to neighboring bins: how does this one bin compare to its neighbors, and how does that comparison change over time? This lets me get a sense of where I have locations with statistically significant increases in crime over time. For example, these deep red hexagons are locations where I know for a fact that crime is increasing over time and that the trend is statistically significant. Why is that relevant? If you're a police department and you want to know which areas might need more resources, this is very valuable information. But as anybody who works with data knows, we can't just do one analysis, call ourselves done, and say, great, we know the whole story. We also need to dive deep, figure out other questions, and probe our data to see all that we can learn from it.
One thing we know about crime is that population can play a large role in it, so this result, while great and a good starting point, is missing some information: I haven't factored in population. One thing I might want to do is calculate a crime rate. While I could use the Field Calculator and do a very simple division to get my crime rate per 100,000 population, maybe I want to do more than that; maybe I want to make sure the rates I create are robust. So this is a great time to hop into R and take advantage of some of the rate-smoothing functions that already exist in R and currently don't exist in ArcGIS.

Hopping into R is quite simple. I load the arcgisbinding library and initialize it; this just confirms: yep, you've got the bridge, you've got Pro, I know your license, let's go. The package has functions like arc.open, which I use to communicate what I'm working with: where my project data is stored and what data I'm trying to bring in. That data can be in a geodatabase, it can be in a shapefile, it can be in a table; you just need to specify exactly what the data is and where it's located, and you can bring it in. At that point we've started to bring in the data, but it's still in an ArcGIS data type, so arc.select does the process of converting it for me into an R data frame object, what Shaun was talking about. Once it's an R data frame object, I'm good to roll with pretty much whatever I want or need to do in R. There are a couple of nifty things about arc.select I want to highlight: you do have some options here. You can specify exactly which columns to bring in (I'm personally choosing to bring them all in), and you can also craft SQL queries, which is kind of nice if you want just a subset of your data.

Once I have the data in, I do a little bit of renaming (it might actually help if I ran that first command), and since it's an R data frame object I'm pretty much good to go. But the bridge does me one more solid: we have a function that converts that data frame object into a spatial points data frame or spatial polygons data frame, which is great because I have spatial data, and I don't have to worry about a lot of messy hassle getting it into the format I need. I can now take advantage of any functionality in R that I want. In particular, I'm using the EBest function, which performs empirical Bayes smoothing. It uses population as a measure of confidence: areas with high population are given higher confidence, and areas with lower population are shifted toward the mean. This helps me create rates that are a little more stable. And just like that, it's done; hopefully you can envision that this could be any function, any type of data aggregation, whatever you need to do. I can simply add the result back to my spatial points data frame, and when I'm ready to go back into Pro, I write it back. This is the part I really get excited about, because I don't know if you've ever had to fool with transferring data back and forth between different software, but it stinks. Here is where I decided to write that data to, and if I simply click Refresh, here is crime rate: all of that data with the newly calculated empirical Bayes smoothed rates. The whole round trip is sketched below.
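A condensed sketch of that round trip. The dataset paths and field names are invented, and it assumes the EBest function comes from the spdep package, which matches the empirical Bayes smoothing described here:

```r
library(arcgisbinding)
library(spdep)       # provides EBest for empirical Bayes rate smoothing
arc.check_product()  # confirms the bridge, Pro, and the license

# Point at the data and pull it into R
d  <- arc.open("C:/demos/crime.gdb/sf_crime_2014_hexbins")
df <- arc.select(d)  # bring in all columns

# Convert the ArcGIS data frame into an sp spatial object
sp_df <- arc.data2sp(df)

# Empirical Bayes smoothing: low-population bins are
# shifted toward the mean, giving more stable rates
eb <- EBest(sp_df$CRIME_COUNT, sp_df$POPULATION)
sp_df$EB_RATE <- eb$estmm * 100000  # rate per 100,000 population

# Write the result back out as a new feature class
arc.write("C:/demos/crime.gdb/crime_rate", arc.sp2data(sp_df))
```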
I can pull this into my map, I can run it in a tool, I can get back to work. So it's incredibly simple, it communicates very well, and it really opens up your options in terms of functionality.

Great, thanks Marjean. So one thing to ask is: if you're going to set up one of these tools, how does it talk? If you make a new tool that's an R-based geoprocessing tool, what's going to come in on the R side? This is a quick two-part table that answers that question. Most things we just take in as character arrays, as strings: if we send over a coordinate system, we just get back a string that represents it. But for the other data types we try to do the right thing: if we have a Boolean, it's going to be a Boolean on the other side; numeric data comes across just fine; and that pretty much holds for everything coming across. An extent comes back as a vector holding the values of the extent. So it's pretty logical, pretty easy to understand what you're going to get on the R side. A couple of them you might want to know the details of: if a folder is coming in, you get back the full path of the location of that particular folder. The basic approach is very simple: it lets you take things you would normally do using Python or ModelBuilder and get at them directly in R, with a native representation, when you make a tool that uses R under the hood.

So again, to review a little of what Marjean just showed you and bring it home: from the R side, you just say you want to use the arcgisbinding package and initialize the connection between ArcGIS and the binding package, and we're good to go from that point on. To open things, you set a data source by pointing arc.open at it. Then you often want to do some filtering operation: you have that input data, but you might want to trim it down. Part of the reason is that in R a data frame is represented in memory. The first time I used R, I was working on a global dataset with 720 million cells; I imported it into R, tried taking a mean, and R crashed, because it didn't have enough memory on the machine. So this is really great, because out of the gate you can say: I'm only interested in these fields, and only where certain conditions are met. That's nice from a data-model standpoint, because it means you're only looking at the things relevant to you, and it's also nice at a technical level, because the data frame won't take up much memory; it makes things simpler, and that's really handy at times.
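A minimal sketch of that kind of trimmed-down select; the field names and where clause are invented for illustration, and it assumes arcgisbinding is loaded and initialized as above:

```r
# Only pull the columns we care about, and only the rows that
# match a condition; the in-memory data frame stays small
df <- arc.select(d,
                 fields = c("OBJECTID", "POPULATION", "CRIME_COUNT"),
                 where_clause = "POPULATION > 1000")
```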
Now, when we've selected into this data frame like Marjean showed, we've still got an ArcGIS data frame: still a representation that ArcGIS understands, not yet actually an R data frame. It looks like a data frame, but it always references back to the geometry, because the geometry needs to be encoded in a special way. To do that final step, like Marjean just showed you, you use the arc.data2sp call, which takes the filtered data frame we just created and converts it into a fully native R representation that is both a data frame and the geometry, in a way R understands. Any downstream package that depends on sp can just take that, and you can do your work with it. At the end, you run the inverse function, arc.sp2data, to get back another ArcGIS data frame, which you can then write out to any ArcGIS data source. That's the basic workflow. I mentioned writing already; you can also use this to overwrite. You might have circumstances where you're just adding a new column and don't want a new dataset for each of these instances, so you can pass an additional parameter saying it's okay to overwrite, and it will do that for you as well.

There are also a few convenience functions in the package. Some of these deal with things that, if you've ever had to work with projected coordinate systems, you know are not a trivial undertaking. The internal representation many R packages use comes from the PROJ library, so-called proj4 strings, so we have a built-in pair of functions to convert between the well-known text representation ArcGIS uses and the proj4 strings you might use elsewhere. This is very handy; it makes certain kinds of operations simple that can otherwise be hard in R. You can also interact directly with the geometry: sometimes you don't want to make that data frame, you just want to poke at the geometry, and there are a couple of helper functions for that. And there are a couple of things for when you're using this in the context of a geoprocessing script: you might want to know something about the environment, or display something, and that's what these tools are for. There are two calls, for the progress position and the progress label; say you're running a script and you want to modify the progressor that shows up in the user's interface. You can set the label just using these calls. You can also get access to the environment. If you've used Python before, you're probably familiar with the environment; currently, in the R context, it's a read-only set of information. You can get back the workspace and the extent, but you can't modify them like you can in Python today; that's something we definitely have on the roadmap for a future version of this package.
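A hedged sketch of those calls together. The paths, proj4 string, and label are arbitrary examples, and the overwrite flag assumes a bridge version that supports the extra parameter mentioned above:

```r
# Round-trip between ArcGIS and native R spatial objects
sp_df <- arc.data2sp(df)     # ArcGIS data frame -> sp object
df2   <- arc.sp2data(sp_df)  # and back again
arc.write("C:/demos/crime.gdb/hexbins", df2, overwrite = TRUE)

# Convert between proj4 strings and the WKT that ArcGIS uses
wkt <- arc.fromP4ToWkt("+proj=longlat +datum=WGS84")
p4  <- arc.fromWktToP4(wkt)

# Inside a geoprocessing script: drive the progressor and
# read the (currently read-only) environment settings
arc.progress_label("Smoothing rates...")
arc.progress_pos(50)
env <- arc.env()
env$workspace
```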
Okay, I'm going to segue for a second here. We've talked about data science, and we've talked about R; what does it look like putting those two things together? One of the things that's happened in R over roughly the last eight years is Hadley Wickham, who's both a developer at RStudio and a professor at Rice University. He's been prolific: he has made a bunch of really amazing packages, he's actually grown the R community pretty dramatically, and he's done a lot of things that make workflows that already existed much simpler and better. Here I've shown a plot of some data that I worked on that uses his ggplot2 library; we got this very nice plot out of it, with a confidence interval and the outliers, all in a simple plot. The actual code to make this plot is very simple, probably about three lines of code, which is much easier than in most languages. He's also done a lot to make data manipulation and package authoring much simpler in R, with things like dplyr and devtools. And right now he's working with Wes McKinney (if you do anything in Python, Wes is the gentleman who created the pandas package) on a package called feather that lets you interoperate. I mentioned data frames; the equivalent thing in the Python world is the pandas DataFrame. They're working on a project that lets you use the same data frame structure across environments, so you could have some model that worked in Python, in R, and actually in other languages like Julia, and use the same data frame across all those environments. Your backing store in that case would be that shared data frame, instead of the first step always being a conversion into a data frame.

I mentioned earlier that one of the strengths of using R for statistical analyses is that it has these DSLs, or domain-specific languages; here's a quick example of that, sketched in code below. Typically, if you wanted to specify some formula, you'd have to come up with some clunky encoding. Because R knows something about the domain, because it knows you're going to be doing statistical operations, you can specify it in a very concise way. In this case we're creating a linear model and assigning the output to fit.results: we've said it's a linear model, lm, which is a function; we're making predictions on pollution; and then we have a number of covariates going into it. So we're just doing a simple linear model, and in this case the terms are purely additive, but you could also have interaction terms; there's a way of specifying that, and you can do all kinds of really complicated things. Suddenly you're specifying a two-way ANOVA with one line of code. You just learn what the symbols mean, which isn't too hard, and now you have a very concise way of specifying a model that might otherwise take many lines of code, or many clicks if you're using JMP or something. There are similar properties in other parts of the language that make doing statistical analysis much simpler than in an average programming language.
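A hedged sketch of that formula DSL; the variable names are invented, and df stands for whatever data frame holds them:

```r
# Additive linear model: predict pollution from several covariates
fit.results <- lm(pollution ~ temperature + traffic + density, data = df)
summary(fit.results)

# The same notation handles richer designs: '*' crosses two factors,
# giving main effects plus their interaction, essentially a two-way
# ANOVA in one line
fit.anova <- aov(pollution ~ region * season, data = df)
summary(fit.anova)
```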
There's also a package named caret that I think is worth mentioning. It's really trying to deal with model-specification consistency across many, many different packages. I mentioned there are over 6,400 packages on CRAN, and they often take slightly different approaches, so caret comes back and says: can we reformulate these models in a consistent way? You basically say, okay, I'm going to use caret, and caret under the hood knows how to talk to all those underlying packages. It's really handy if you're doing work across multiple different packages.

Another thing that's kind of emergent: if you saw some of the plenary session, you probably saw the use of Jupyter notebooks. Some of this really goes back to Donald Knuth, the very famous computer scientist. He had this idea that, both for teaching and for understanding, you really want an environment where you can have the documentation of the thing you're working on and the code side by side. We often live in a world now where the ultimate realization of our knowledge and understanding is going to be in code, but to get there it's really helpful to have that context right in the same place: you can have the formulas, you can have some textual representation. He's been looking at that for a long time, and in some ways the Jupyter notebook is part of that vision. There are tools directly in R to make this easy: there's a way of embedding Markdown along with your R code, so you can jump back and forth; you have a document, you jump into R, you jump back, documenting as you go. roxygen2 lets you do some really cool documentation things, like making LaTeX versions of things. And then there are Jupyter notebooks. Jupyter used to be called IPython notebooks a long time ago; the name Jupyter actually comes from the three initial languages it supported, which are Julia, Python, and R. So R has always been there; it's always been one of the fundamental languages they wanted to support, and you can actually use those languages together in that context. It fits nicely into that view of using data science to stitch together multiple languages.

Finally, there are also some really good development environments you can use with R. Marjean was just showing you RStudio: it's great, it's free, and it does a lot of really cool stuff. You can use Jupyter notebooks, as I mentioned. And last year Microsoft released R Tools for Visual Studio, which is a plugin directly in Visual Studio: it provides you a REPL, and it knows how to do all the kinds of things Visual Studio can do. So we really have best-of-class tools for interacting with data available to us for working with R, which is really great.

And just as an example of one of these kinds of packages: here's an operation you could do with a plain data frame, but something like dplyr makes it much simpler. What am I doing here? I've got some baseball batting data, and I'm going to take the batting information, group it by player ID, do a summarization across that, sort by the total value, and then look at the top five values. You saw on that jobs chart that SQL is still king; every environment has SQL. But one thing that's generally happening is that people say: I'm fine with SQL, but it takes a lot of time, I have to talk to my DBA, and it's hard to optimize. So people are looking at how to make these little domain-specific languages that give us something akin to SQL without so much work. What we just wrote here is spread out a bit unnecessarily, but it's basically five lines, without many characters on each line; to write the equivalent thing in SQL would probably require subqueries and maybe some extra joins. It's not trivial to get this in SQL. So this is really nice: if we're doing data analysis and R is available to us, it makes the data science process much simpler.
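A sketch of that pipeline, assuming the Batting table from the Lahman package; summing games played stands in for whichever statistic was totaled in the talk:

```r
library(dplyr)
library(Lahman)  # historical baseball data, including the Batting table

# Group by player, summarize, sort by the total, take the top five
Batting %>%
  group_by(playerID) %>%
  summarise(total = sum(G)) %>%
  arrange(desc(total)) %>%
  head(5)
```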
There are some challenges and tricks you need to be aware of, though. I mentioned that my first experience with R was one in which it crashed, but I still came back to it; it's such a great language. It does have some potential performance pitfalls you need to deal with at times, although there's been a lot of work to make that better, and I'd say there are pretty limited areas where you run into them on a regular basis now; also, the amount of memory people typically have has gone up a lot in the last few years. It's not designed to be a general-purpose language: if I were going to be stranded on a desert island and could only write in one programming language, I would probably not pick it, even though I love it. If you want to do something that has nothing to do with data analysis, nothing to do with an analytical process or statistics, it might not be the right fit. If you're trying to write a blogging engine, there's no tooling in the ecosystem for making a blogging engine in R. It's not that it's impossible; it's just that you're going to have to push a rock up a hill to make it happen. It also doesn't have a pure UI-only mode of interaction: if you make a plot, you can't just drag things around and label them. That's a typical complaint about environments that are programming-first, and in areas R does have a programmer-first mentality. It's typically taught in grad school, when you have many years to learn a codified set of knowledge; if you're not in grad school, it can be challenging to take on as a new thing. There are some things happening, like Shiny, which lets you interact with interactive widgets backed by R, but primarily the model is that you need to be fluent in R in order to use R. That's part of the reason we think this bridge is really great: it enables the folks in an organization who have some R experience to share their work with a broader group of people who maybe aren't as familiar with that world.

Okay, so now we're going to get a little deeper into the bridge. What does this look like? It looks like any other geoprocessing tool you've seen: it's just a tool, and we've got parameters. We've already talked about how those parameters move between the ArcGIS world and the R world, but when we click that Run button, instead of triggering a Python script or a built-in tool, it's going to run an R script. So with that, I'll hand it back to Marjean.
Perfect, thank you. So Shaun hit it perfectly: why exactly would we want to wrap R functionality? If you're a developer, your time is incredibly precious. If your organization is anything like ours, everybody wants a piece of your time, and you can't always give it. So if you have functionality in R that's regularly used in your organization and you want to share it with somebody who's capable of interpreting the results but maybe not of the scripting end, you can easily craft one of these geoprocessing tools. Additionally, really the whole point of this R bridge is that we're streamlining your work into whatever platform you prefer; Shaun laid out really nicely the three categories of users we're envisioning here. If you need your work in Pro, you can do that; if you want your work in R, you can do that.

So let's take a look at one example: the case where you have functionality in R that you regularly use and want to make available to other people who use ArcGIS but are not as comfortable with the scripting language. A little background: I'm going to show you the scan statistic today. It's the most widely used test for detecting clusters; essentially, it goes beyond the question of "is my data clustered?" to "where specifically is my data clustered?". Why would you want to do this? The data I have for you here is from Utah's open data source: noxious weed locations in Utah's national parks. A national park might have limited services and limited park rangers, and if they want to tackle the issue of noxious weeds and try to reduce them, they want to know: where is the area I should send people to make an impact on this problem? Where is the most potent area to start dealing with it and reducing it? The scan statistic is a great tool for that, and if this is an analysis they commonly need to run, we can wrap it.

So let's dive a little deeper into what exactly I mean when I say "wrap it". This is a tool I've created, and it entirely calls R functionality, specifically scan.test from spatstat. If we take a quick peek at exactly what this tool is, it's just like any other script tool, except for this one little piece right here: it's not calling a Python file, it's calling R. And that's it; you don't need a Python package to call your R, it's all being done here through ArcGIS. If you've ever created a script tool before, this is just the same: you can customize the UI experience of your script tool, so you can control what whoever is using that R functionality sees and does. You have the ability to give them parameter options with specific selections, if you want them to choose particular parts of the data, or perhaps model options, Poisson or binomial. You can specify defaults if there are certain values you think are important for them to consider first, and you can also specify symbology, so you can really customize how the results appear. The tool itself looks like any other normal geoprocessing tool, so when you share it with someone, this is what they see, and when they want it to work, they click Run.

So let's look at the R side of this. This is the script that's actually being called by this tool right now, this very thing right here. So what does the script that wraps R functionality look like?
It starts out pretty simple: it's really just a function with some in-parameters and some out-parameters. This is what you can think of as the template for any tool you're going to wrap; you're going to follow this same format, the same function with in_params and out_params. We can take advantage of some functionality Shaun mentioned; the bridge has a lot of different functions that enable you to craft these tools. One of them is arc.env, so you can read those environment settings and get a sense of what your user has, and if you see a potential issue, you can print out a message to warn them about it. Here you can do some simple work of making sure that whoever is running your tool has all the R packages they need: this is just a very simple check that says, hey, do you actually have this package? If not, let's install it into your library, so that when I actually call and load the packages, I know you have them. Anything the user puts into that UI as parameters, you can index, and once they're indexed, you're good to use them in your script however you want. Those are really just the finer points of setting it up; once it's set up, you're free to start rolling with your code and get as creative as you want to get.

So these are the various things a user is inputting for me. The dataset they're giving me is what I'm calling my occurrence dataset; I can subset that collection by agency, because they picked a national park; and I can use those throughout my code. Now, the last setup part, if you will, is taking advantage of the bridge to actually get that data in. How does this tool work? It's entirely utilizing the bridge, those same functions I showed you: arc.open, arc.select, arc.data2sp. I'm taking the occurrence dataset they gave us in the tool and opening it; they specified an agency when they ran the tool, so I'm using arc.select to bring in the specific national park they wanted; I'm then converting it, in this case, to an sp object; and then I'm converting that to a point process, because that's what my scan.test function is going to need. And here, just as Shaun mentioned, I'm customizing the experience: you're not seeing the tool run, because that's not quite as exciting, but I'm customizing what prints out for them while they watch the tool run, giving them notifications about what progress it's at and how it's doing.

Now here, this is all you: you can get as complicated or as simple as you want. This example is quite simple, because all I'm doing is calling the scan.test function, and that's all I'm interested in doing. But you're open here to be creative, to customize, and to take advantage of R functionality however you need to for your organization, which is incredibly powerful. With so many different libraries and functions, in my mind there's almost limitless possibility for what you can create these tools to do; the overall shape of such a script is sketched below. I am doing something here that I personally like: I'm crafting some messages for the user of my tool, and I'll show you what that looks like in the UI, but all I'm really doing is printing results, so they're not just getting something in their map; they're also getting information printed out for them when they run it.
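A hedged sketch of what such a script tool can look like. The parameter order, field name, scan radius, and the sp-to-ppp conversion through maptools are illustrative assumptions, not the exact demo script; the bridge makes the arc.* functions available when ArcGIS invokes the tool:

```r
tool_exec <- function(in_params, out_params) {
  # Make sure the user has the packages the tool depends on
  if (!requireNamespace("spatstat", quietly = TRUE))
    install.packages("spatstat")
  library(spatstat)
  library(maptools)  # provides the as(..., "ppp") coercion used below

  # Index the values the user entered in the tool's UI
  occurrence_fc <- in_params[[1]]  # input occurrence feature class
  agency        <- in_params[[2]]  # which national park to analyze

  arc.progress_label("Reading occurrence data...")
  d  <- arc.open(occurrence_fc)
  df <- arc.select(d, where_clause = paste0("AGENCY = '", agency, "'"))
  sp_pts <- arc.data2sp(df)

  arc.progress_label("Running the scan statistic...")
  # Convert the sp points to a spatstat point pattern and scan
  pts_ppp <- as(sp_pts, "ppp")
  result  <- scan.test(pts_ppp, r = 0.5, method = "poisson", nsim = 99)
  print(result)  # p-value and statistics show up as tool messages

  out_params
}
```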
And lastly, something a little bit sneaky. We are still working on raster support; we know, we've heard you, you want it, and it's something we are actively working to get developed for you. But this particular function is actually going to output a raster for me, of the likelihood ratio test statistic, and since the bridge doesn't quite work with rasters just yet, I can take advantage of other R functions to do that for me: when I'm writing my result, I'm also using writeRaster to put that raster output into my project (that workaround is sketched at the end of this section). Okay, so let's take a look; here it is. Here is the result. Just to drive it home: here was the original. Obviously we can see there are clusters, but if I want to try and pinpoint them a little further, this gives me a map that helps me do that. Additionally, I mentioned my messages, and this is all the stuff I crafted as the tool was running: I'm printing out exactly what that scan.test function found. They get their p-values, they learn some information about exactly how it went and how the data looked, and they have a result beautifully symbolized in their map that they can take advantage of, just by clicking Run. They have no idea all of that is going on, but it is, and you can craft it to do what you need it to do, which is incredibly powerful.

Great. So just to wrap up here: we've talked already about how to install it, and there are instructions online to get you started using it tomorrow; it's really easy. Where can you run this? That's a question we sometimes get. If you have a really old version of ArcGIS Desktop, it's not supported: 10.3.1 was the first version in which we implemented this bridge, and we also committed it to Pro 1.1. The only thing that's a little tricky: if you're still on Desktop, when you install R you get two versions, a 32-bit version and a 64-bit version. ArcMap is a 32-bit application, so you have to use the 32-bit version of R to talk to it; something to be aware of. RStudio has a 32-bit mode as well that comes with it; you can just switch it over and it works great. If you're using Pro, you can just use the 64-bit versions of both. You can also use the background geoprocessing mechanism in Desktop to get into the 64-bit version; if you have memory limitations and really need 64-bit R, you can do that in Desktop as well. So you can run it now; you need to get either Pro or ArcMap, and you can use it in a variety of contexts. It also works on the server side: on 10.3.1 and later it works on the server, and you can use it for running geoprocessing services. It's really not too hard to get going.

What's next? We're looking at implementing Conda for managing these environments. I talked about this idea that you might have a problem where you need to chain together different languages, so we're looking at making that even easier: in Pro 2.0 you can install R just like any other package, and you can install the R bridge just like any other package, using Conda. We really want to make that workflow even better in the future. And as Marjean mentioned, we do not support rasters right now. The good thing is that, as she just showed you, it might not always be necessary.
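The interim workaround Marjean described, as a minimal sketch; the raster object and output path are hypothetical, and it assumes the raster package:

```r
library(raster)

# The bridge can't write rasters yet, but R's raster package can:
# save the likelihood ratio surface as a GeoTIFF for the project
writeRaster(lr_surface, filename = "C:/demos/weeds_lrt.tif",
            format = "GTiff", overwrite = TRUE)
```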
One of the things that's different between feature data and raster data is that if your backing store is really just a GeoTIFF, then R knows how to work with it; R knows how to work with GeoTIFFs, and that's fine. Where we definitely need to add support is the more complicated cases, like working with a mosaic dataset, or raster data stored in a file geodatabase; there are things we need to do to enable those kinds of workflows, because that's not possible in R today. But that's something we're working on actively and will most likely deliver sometime this year.

Okay, I mentioned that we have a resources section here, and I want to leave at least a minute or two for questions, so I'll go through it quickly. (Do you want to drive here? Is that going to be easier? Thank you; over to you.) Okay, so real quick: one thing we've been asked for a lot is, this was cool, but where the heck are some resources to actually get up and running with this, besides just watching you lovely people? So I've been working really hard over the past couple of months to create some of these for you. Hopefully you find them useful, but please feel free to give feedback on what you need, because creating these is part of what I've been dedicating my job to. We have two brand-new web courses that just came out. One is on using the R-ArcGIS bridge: the basics of getting set up, reading and writing your data, and getting comfortable with that workflow. The other is entirely about wrapping script tools: if you want to wrap something awesome, hopefully you're talented, but if not, we have a web course where you can learn the basics; it shows you the template script and all the steps for getting up and running with that. We also have a Learn lesson that we put out, which is kind of there to help you get an idea of how powerful the bridge is: it's a workflow using some tools in ArcGIS and some tools in R, showing how seamlessly they integrate to expand your capabilities. And we also have the R vignette: if you want to know more about the functions in our package, we have written it up so you can get an idea of all the functionality included, even some gems we've left out today due to time.

Great. All right, so in the interest of time I'm not going to go through these resources, but there's a lot of stuff in here you might want to look over: other sessions, some details about using R (the CRAN task views are amazing; if you have a specific problem, check out the task views, they really get you pretty far, fast), some background on data science, and a few other things. There's a lot going on. Just to close out: one of the things about this is that it's really a community-oriented project. It's open source, which is a little different from a strict vendor relationship; you can take this and use it in other contexts, so really take a look at it and think of it in that community context. With that, I'll just say thanks to all the folks who worked on this, and thank you to Marjean, and we'll take any questions up at the front. We're out of time, so I don't want to make people wait. If you have iOS or Android, please give us a review; if you don't have that, then I will accept an Akkadian tablet. And that's it for today. Thank you. [Applause]
Info
Channel: Esri Events
Views: 7,692
Keywords: Esri, ArcGIS, GIS, Esri Events, analysis, statistical model
Id: KXCupqtb0-4
Length: 59min 39sec (3579 seconds)
Published: Wed Mar 29 2017