OpenGeoHub Summer School 2021 Day 3 - September 3 - Block 2

Captions
I didn't know that you were connected already, but no problem. Okay, so shall we start? Just a second, I will give you a sign. Yes please, we are now connected, we are live. Hannah, please introduce the next speaker, and let's try to be on time again.

So, welcome back everyone after the break. We continue the summer school with the talk of Martijn Tennekes. He is a data visualization specialist at Statistics Netherlands, working especially on spatial data visualization, and he is the developer of the tmap package, which I think many of you already know. Visualization of spatial data will also be the topic of this lecture. So Martijn, I hand over to you.

Thank you. First of all, I cannot start my video... okay, great. Now let me share my screen. Okay. So first I'll talk a little bit about visualization of spatial data in general, quite introductory; after that I'll go a little bit deeper into visualization itself, and then I will say something about tmap. Together with Jakub I have quite some exciting things to show.

First of all, visualization of spatial data. As you already saw, we have vector data, which is pretty much covered by the sf package, with the basic types points, lines and polygons, and of course spatial raster data, which I'll not go into just yet. What was already covered this morning, especially by Edzer, are coordinate reference systems. The point is: if you want to visualize spatial data, especially on a global scale, you need to think about how to project it. The world is 3D, it's a globe, like this orange, and you have to put it on a flat two-dimensional surface or device. That's what projected CRSs are for, and we need them if we want to visualize data. I mean, we could wait a couple of decades until holograms are in common use, but until then we have to do it with flat surfaces. Of course we have 3D
visualization, but even this orange, although rendered in 3D, is still a 2D projection when you plot it on a flat device. You can use projected CRSs for visualization, but be careful, because a projection is always a compromise. Especially when you work with global data you need to be very careful about the poles, about the 180 degrees longitude, and so on.

This is just an overview of the types of projections that we have, with some examples; that's just for the record, so let's skip this slide for now. For visualization purposes the projection does matter. This is a quite simple example with population density: the left-hand map uses Web Mercator, which is typically used by Google Maps and others, and on the right-hand side is a projected CRS that is used in tmap by default and is a good one for world maps. There are others, like Robinson or Winkel Tripel; it doesn't really matter which one, but you see that Australia is three times as large as Greenland, which is the truth, and not the other way around as on the left. For spatial density it really matters, because otherwise you get a biased image of where people live: here you would expect that many people live in Alaska, which is not the case.

So for statistics, what matters are the properties that a projection satisfies or doesn't satisfy. This map projection satisfies the equal-area property, which means the areas of the polygons are the true areas. It is still a compromise, though: the shapes can be distorted, especially New Zealand; this one is more realistic than that one. So it depends on the application. For statistics we say shape doesn't really matter, because equal area supports things like population density, but for other purposes, like geology, and of course navigation, shape does matter, and there we would recommend other projections. Yes, I see a question; please go ahead Brian, or otherwise we can wait until after the
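The projection comparison just described can be reproduced with tmap. A minimal sketch, assuming released tmap v3 and its built-in World dataset; the variable shown on the slide is not named, so using the `pop_est_dens` column (population density) here is an assumption:

```r
# Compare a world choropleth in Web Mercator vs an equal-area projection.
library(tmap)
data(World)

# Web Mercator (EPSG:3857): areas near the poles are heavily inflated.
m_merc <- tm_shape(World, projection = 3857) +
  tm_polygons("pop_est_dens", title = "Population density")

# Eckert IV ("+proj=eck4"), an equal-area projection suited for world maps.
m_eck4 <- tm_shape(World, projection = "+proj=eck4") +
  tm_polygons("pop_est_dens", title = "Population density")

tmap_arrange(m_merc, m_eck4)   # side-by-side comparison
```

In the equal-area map, Australia correctly appears about three times as large as Greenland.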
presentation, it doesn't matter to me.

For spatial data in R, I listed four core packages here; terra will be covered after this presentation. There is also the quite handy stars package with data cubes, which you have probably seen before. It's very useful both for vector data and for raster data, because you can have different layers which you can use as timestamps, for instance, so you can do a time-series analysis which you can also visualize. So far the basic spatial data summary in R.

As for the visualization part, now comes the core of the presentation. This is what I perceive as a good workflow to create a nice visualization. Of course you have to start with pre-processing of the data. Data has a content part, for instance employment or air quality, which basically answers the question: what do we want to visualize? The spatial data component answers the question: where are those phenomena that you want to visualize? So you need to pre-process, taking into account missing values, duplicated values, all those things. After that you join those data, and then you can start the visualization part. These are the two blocks where tmap comes into play, and other packages as well of course, but I'll just cover things that are in tmap.

The first block is about data-driven spatial transformation. It's an optional block; not many visualizations have these transformations, and it will be optional in tmap in the next update. Let me explain what this is. When you have spatial data, you can use the content data to change the features: you can for instance inflate or deflate polygons, as I will show in the next example. For that you need a variable of interest and some scaling function, which scales it to the size of the polygons, and then you apply a certain data-driven
spatial transformation. After that you have this transformed data, which still contains a content part and a spatial part, and then you need a visual mapping. This means that you map the data variables to visual variables. The most important visual variables are color (with the different types of color palettes), size and shape; others are alpha transparency, line type, fill pattern, and so on. Those are a little less common; they are not included in tmap yet, but they will be, and I'll talk about that in a second. Then we have the spatial object types: symbols, lines, polygons and rasters. So basically you can map a variable to a color palette, symbol size, symbol shape, line width, and so on. When you have done that, you can plot it on a map, and it can be a static map, an interactive map, etc.; that's up to the device, but at least you have the information to create the visualization.

An example of such a data-driven transformation function is the cartogram. You've probably seen it; it's a slightly playful way to visualize, in this case, the Happy Planet Index corrected by population. What we do here is distort the polygons such that their areas correspond to population size, and such that neighboring polygons still touch each other. There is also a non-contiguous variant of the cartogram, in which the polygons just shrink and don't touch each other. In this example I used population as the data variable, which I mapped to polygon size as the weights: population is between 0 and 1 billion, and I used a continuous scale to scale it to weights between 0 and 1, which are then passed on to the cartogram function. Then I used the Happy Planet Index, with values between 10 and 45, and I used intervals. You could use a continuous color palette as well, but intervals have the advantage that you can read the values better than with a
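A cartogram like the one described can be built with the cartogram package and then styled with tmap. A sketch, assuming tmap v3, the cartogram package, and tmap's built-in World dataset; the exact scaling the speaker used is not shown, so the defaults of `cartogram_cont()` stand in for it:

```r
# Contiguous cartogram: distort polygons so area ~ population,
# then colour by Happy Planet Index using an interval scale.
library(tmap)
library(cartogram)
library(sf)
data(World)

# cartogram_cont() needs a projected CRS; Eckert IV is equal-area.
world_eck4 <- st_transform(World, "+proj=eck4")
carto <- cartogram_cont(world_eck4, weight = "pop_est")

# The red-yellow-green diverging palette from the talk (note: the speaker
# mentions it is not colorblind-friendly, hence the planned new defaults).
tm_shape(carto) +
  tm_polygons("HPI", palette = "RdYlGn", title = "Happy Planet Index")
```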
continuous color palette. But these are choices, and they depend on the task and on the background of the user, all those things. In this case I used a diverging red-yellow-green palette. It looks pretty nice; however, I've been told that this color palette does not work very well for colorblind people. Therefore, in the next major tmap update, we will switch to another default. We haven't decided which one, but at least we want to take those people into account, because there are many of them.

Now let's go a little deeper into visualization itself. Visualization is all about losing information. We often want to visualize all information, because we don't want to lose anything; maybe it's important, so we're really afraid of losing information. However, big data, and spatial data in general, is often very detailed, and for many user tasks this high level of detail is not needed. If you want to visualize, for instance, climate change on a global scale, you don't need to know that some neighborhood in some city had a very high temperature at some point of the day, just to give an example; it's probably better to use some kind of kernel density smoothing to visualize global patterns. Information loss is therefore a key concept in good visual design: you have to think about which information to lose. The other side of the same coin is that the goal of visual design is to find the best way to depict some information. So the question "what information do you want to visualize?" is basically the same question as "what information can you get rid of?"

Take for instance the underground map; the question is whether realistic is always better. An underground map is not realistic. These are the realistic locations; however, people often find the schematic maps to be very useful. Why is that? Because of population density: the population density in the London city center is much
higher, and therefore the metro stops are closer to each other. In the schematic map, as a result, you can print the labels better, even at a very small font size. Also the directions: the schematic map uses only eight cardinal directions, while in reality the lines are much more organic, I would say. Due to this schematic way of visualization it is easier to digest this load of information. Maybe this is also a case of data-driven spatial transformation, where you use population density to move the points, the underground stops in this case, a little farther from each other in order to be useful for public transport planning.

We borrowed these concepts of information loss from information theory. I was in Oxford for half a year, where I worked together with Professor Min Chen, and we focused on the design space for origin-destination data visualization; here's the reference. The design space we composed consists of four dimensions: the first two cover the data structure of the nodes and edges, and the third and fourth the visual variables, so basically how to depict them. This is just an example of applying these concepts of information loss to a specific type of spatial data, namely origin-destination data.

While we worked on that, we saw a new design emerging, because I had this dataset of Dutch commuting, which basically describes how many people commute from one municipality to another, and eventually we designed this map. It's called the donut map; let me show you, it's here. I can put the URL in the chat for everyone. It may be a little bit slow over this connection, but that's okay. What we basically did is, instead of whole lines, we draw half lines. This means that the flows between Utrecht and Amsterdam can be depicted by two half lines which together form a
whole line; I've been told there is also perceptual theory behind this as the underlying principle. In this case the pink lines are the commuting flows of people who work in Amsterdam, and the green ones of people who work in Utrecht. The donuts summarize where the people who live in Amsterdam work, and the majority also work in Amsterdam. You can also see the commuter towns: a lot of commuter towns have people who work in Amsterdam, because of these pink parts, and many people here work in The Hague. That's a little bit trivial, but the interesting thing is that for policy makers it can be really useful, because they can find out whether so many people commute because of the number of jobs that is available, or because there is a good public transport system, or on the other hand not such a good one, so they can plan infrastructure based on this data.

Back to the presentation: visualization of spatial data in R. This is an overview; it's probably not complete, but as far as I know these are all great packages to visualize spatial data. The plot function is fast and straightforward; it's implemented for every class (sf, stars, terra, etc.) and is usually the fastest way to plot data. ggplot2 is really great; in the past it was lagging behind a bit because it was not well suited for spatial data, but today it is. tmap we started, I think, six years ago, and it is mature right now; currently we are planning a major update, which I'll talk about in the next slides. The mapview package is also really great. Both tmap and mapview are built on top of leaflet, the JavaScript library for interactive maps. tmap has two modes, two devices: the static plot device, which you see in RStudio, and the view device, the viewer pane, in which you see interactive maps. For these interactive maps we
use leaflet as the fundamental package. mapsf also looks really nice; I haven't used it myself, but it's very promising.

But now about tmap. This is the news I want to share. As I said, the current version is pretty stable and mature; there can be bugs at any time, as with any software, but it has grown to its full potential. However, we had trouble extending it, and therefore we had to develop a new framework. For the user some things will change, but the core concepts will of course stay the same, and it will be much easier to extend.

The next piece of news is that we are going to publish a book, me and Jakub Nowosad. The draft version is already online, over here. We have completed about half of it, probably a little more, but at least the fundamental things that you need for tmap and spatial data in R. If you want a quick overview of what tmap does, you can read the chapter "tmap in a nutshell"; you can go through it very briefly.

Now an overview. qtm: if you have a spatial data object, no matter what class, you can plot it with a quick thematic map, in this case volcanoes. Let me close my email for a second. Okay. If you plot it in the tmap way, it is similar to ggplot2 if you're familiar with that, but some things are done a little differently. Instead of ggplot() with a dataset and then the aesthetics, we use tm_shape. tm_shape is used to specify what I call a shape object: a spatial object, which can be an sf data frame, a stars object, terra objects, and so on. Then we have the map layers: tm_symbols, tm_polygons, tm_lines and tm_raster are the basic ones, plus tm_text. Then you can specify the aesthetics, which means the visual variables. tm_symbols has three: color, shape and size. In this case we keep the shape constant and we use a data variable to be
mapped to the size visual variable: elevation, which comes from the volcanoes data, and we have a title. This is the resulting map: shape 24 is apparently triangles, elevation is used for the size, and here's the legend, scaled to elevation, and so on and so forth. As you can see, you can design the map any way you want: background colors, a continuous scale for elevation; it's supposed to be a really flexible package. Here are some map attributes, like a scale bar, a compass, etc.

Let me see how much time I have. Okay. As I said, we have the plot and view modes, so I can show a little quick demo. I have the World dataset, so data(World), and if I do tm_shape(World) + tm_polygons("HPI"), I have the Happy Planet Index. If I then switch with tmap_mode("view") and re-run this code, I'm in view mode, which means interactive mode. This is all thanks to leaflet; the people who contributed to leaflet made this possible. For statistical mapping it's not really great, because, as I said earlier, Greenland is shown three times as large as Australia, but other than that it is very neat that you can zoom in and out.

Back to the tmap book. Not all paragraphs have been written yet, but I think the core is there. In this book we cover how to specify spatial data with the tm_shape function, then the layers: the base layers and the derived layers, the visual variables you can use, and the geometry that each layer expects. A layer can take more geometry types than its name suggests: for instance tm_polygons can also take points, and tm_symbols can also take polygons, in which case just the centroid is taken. And if you use multiple shapes in multiple projections, then, as described in the previous chapter, tmap automatically uses the CRS of one shape, which you can change later on. Chapter five
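The syntax just walked through can be sketched as follows, using tmap v3. The talk's volcanoes dataset is not bundled with tmap, so tmap's built-in metro dataset (metropolitan areas with a `pop2020` column) stands in for it here; the structure of the calls is the same:

```r
# Quick thematic map vs the layered tm_shape() syntax, plus view mode.
library(tmap)
data(World, metro)

qtm(metro)                              # quick thematic map: one call

tm_shape(metro) +                       # the spatial ("shape") object
  tm_symbols(size = "pop2020",          # data variable -> size visual variable
             shape = 24,                # constant shape (24 = triangle)
             col = "red") +
  tm_layout(title = "Metropolitan areas")

tmap_mode("view")                       # switch to interactive (leaflet) mode
tm_shape(World) + tm_polygons("HPI")    # re-run a map to see it interactively
tmap_mode("plot")                       # back to static plotting
```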
is about layers, and chapter six about visual variables; you've seen this picture before, and there we dive a little deeper. The other chapters are not written yet, except layout, which is already finished, so you can use different font sizes and so on. Please let me know if there are any questions or suggestions; we're open to that, as I mentioned on the slide. This book is written for tmap 3 and will be updated to tmap 4, because there are quite some changes in tmap 4, which I want to present now.

tmap 4: the major thing is that it will be extendable, and extendable in many ways. We can add map layers, and by "we" I mean developers other than me, who wrote the framework of tmap; anyone who is familiar with R can add map layers. You can think of tm_donuts, like the donut map that I showed, tm_hexagons, or tm_network: for instance, if you have an sfnetworks object and you want to visualize it straight away, why not a network layer? We have to think about how to use it and what kind of aesthetics it has, but the sky is the limit. Also tm_hillshade, and there could be many more, like a layer for kernel density estimation.

Aesthetics: there will be many more visual variables available; we have five new aesthetics for tm_polygons, and again it will be relatively easy for developers to add new visual variables: once you have a method to plot them, adding them is easy too. The graphics engine: tmap 3 had two modes, plot, which is built upon the grid graphics system, and view, built upon leaflet; but if there is another plotting mechanism or graphics engine that could be used, why not use it? And finally the spatial classes: tmap is built upon sf and stars, but it will also become easier to incorporate other classes. Here is a small example of the aesthetics; it's easier to just show the pictures. That was basically it about the aesthetics.
The syntax will change too, because there are many, many arguments: if you look at tm_symbols, for instance, there is a huge argument list, because we have three aesthetics and every aesthetic has a bunch of arguments. In tmap 4 it will be organized in this way: an aesthetic is still an aesthetic, like fill; the scaling arguments are bundled into a scale function, in this case tm_scale_intervals; and the things that relate only to legends are configured in tm_legend, for instance the title.

This is an example of three different aesthetics. As you can see, I've improved the continuous legend here; I still have to implement the landscape legends. Border line type is life expectancy: dashed means dying a little earlier in life, unfortunately, and the thicker, solid lines mean a little later. Line width is well-being, and the Happy Planet Index is the fill color. In this case it's best to live in Middle and South America.

Map layers: here is the cartogram which you already saw, and this, by the way, is an example of a diverging color palette which takes color blindness into account. The cartogram is used as a transformation aesthetic: it basically distorts the polygons such that the areas correspond to the number of people who live in the countries. The code is here; I'm not going into detail, but you can see again that the aesthetics remain the same, and the scale arguments are now in the scale functions.

The scale functions are summarized over here; currently I have these ones. That doesn't mean it's final; we will probably add a couple more, for instance date and datetime are not included yet, but I will include them. These are the main data types that are expected. And you see the differences in this map: it's life expectancy from the Happy Planet Index data. Let's check life expectancy in Africa using six different scales. Categorical: every unique value has its own color, and the colors are repeated because we don't have enough colors; not
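The reorganized syntax described above can be sketched like this. A sketch of the tmap 4 layer grammar as presented in the talk; tmap 4 was still in development at the time, so argument details may differ in the released version:

```r
# tmap 4 bundles per-aesthetic options: the visual variable keeps its name
# ("fill"), scaling moves into a tm_scale_*() function, and legend settings
# into tm_legend().
library(tmap)
data(World)

tm_shape(World) +
  tm_polygons(
    fill        = "HPI",                        # aesthetic: fill colour
    fill.scale  = tm_scale_intervals(n = 5),    # how values map to colours
    fill.legend = tm_legend(title = "Happy Planet Index"))  # legend config
```

Compare this with tmap 3, where the same map would be `tm_polygons("HPI", n = 5, title = "Happy Planet Index")`, with all options flattened into one argument list.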
very useful in this case. Intervals: very useful once you have a sequential palette, and you can really read the categories, so this orange is this one, 60 to 65. A log scale is not very useful for this data type; it could be useful for population density or for income and economic data, but not for life expectancy. tm_scale_ordinal is also a categorical variant: one takes a factor vector, the other a class ordered vector. Here too every unique value gets its own color, but it's also not very useful, because this scale is not equidistant; for instance 53 is missing. The continuous scale is also very useful; it gives a better image than this one, but the values are a little more difficult to read from the legend, so these are pros and cons, but both are very useful. tm_scale_discrete is new in tmap: it is basically a discrete numeric scale. In this case I'm not sure how valuable it is, because there are many values, but if you have five or six values and one is missing, then you still get an equidistant scale, so it will be very useful for that.

Multivariate scales: for one aesthetic you can assign multiple variables. In this case I have a bivariate choropleth. I still have to design the legend, but this is the concept: we have three classes each of well-being and footprint, and I can see, for instance, that North America has a high footprint and high well-being, whereas Europe has a medium footprint and high well-being, so it's pretty easy to read the values, and you have a bivariate choropleth in tmap 4. The way it's implemented is by using this mv function; mv stands for multivariate. If you have suggestions for another function name, please let me know, but I want to keep it short and tidy, so for now I've called it mv. The first variable, which defines the rows, is well-being, the second footprint; I have tm_scale_bivariate, and again for both rows and columns I assign a scale function, in this case intervals, and the
values are a bivariate color palette. I'll talk about color palettes a little later.

Now tm_donuts, for the donut map that I showed. This one is not implemented yet, but it will probably be implemented like this. This one is also not implemented, and I'm not sure exactly how to create this map, nor what kind of extension libraries we're going to make for it, but I think using sf would be great, and something like tm_shape(edges), or maybe tm_shape(network), and then half-line and donut layers. Here again the size corresponds to the number of employees who live in the specific municipality, and the parts of the donuts are also a multivariate thing: it's basically a symbol, and the content of the symbol is a donut chart, for which we need the number of employees who work in the highlighted municipalities, so it's again the mv function.

Legends: the legend positioning will be different; there will be more options to place them both inside and outside the map. You can specify it per aesthetic, so each aesthetic will have its own legend arguments, which you can specify in different ways; in old tmap all legends had to go into one position. What's also new: I have shortcuts, like a function to place all legends on the left. The options will be a little more complex, because the framework is more complex, but we plan to have a couple of "magic" convenience functions that do it all in one call. The nice thing about tmap 4 is that you can use autocompletion for that: if you type the legend-placement function, you can just place the legends to the right, and then you only have to specify the width, 0.2; let me see if this works. It works. So those things are really useful. Again, this is really a development version; a lot of things don't work yet. When it is
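The multivariate mapping described above might look as follows. This is a sketch based purely on the talk's description of the development version; the nested argument names (`scale1`, `scale2`) are assumptions and may not match the released tmap 4 API:

```r
# Bivariate choropleth: two variables mapped to one fill aesthetic via mv(),
# each with its own interval scale (3 x 3 classes).
library(tmap)
data(World)

tm_shape(World) +
  tm_polygons(
    fill = tm_mv("well_being", "footprint"),   # rows = well-being, cols = footprint
    fill.scale = tm_scale_bivariate(
      scale1 = tm_scale_intervals(n = 3),      # scale for the first variable
      scale2 = tm_scale_intervals(n = 3)))     # scale for the second variable
```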
more mature, I will ask all of you to give me feedback: what kind of functions do you need, are the argument names intuitive, and so on.

Also new: I've had this question a couple of times, whether people could make this kind of legend. We use two aesthetics, symbol size and symbol color, in one legend; basically the same data variable is mapped to two visual variables, and now we can say that the legends should be combined, so the size legend will be combined with the fill legend.

One of the last things I want to show: there is a "free" argument which decides whether the scales and legends of facets are free, so per facet, or shared among the facets. What's new is that this can also be done for a facet grid per row or column. This is a kind of theoretical example: income group is in the rows and economy in the columns; life expectancy, the fill, is determined per column, so free columns is true, and the legends for the red symbols, which show GDP, are determined per row. That's a new feature; I'm not sure how much it will get used, but at least it's there.

As for backwards compatibility: yes, it will be backwards compatible. The default options will be different, because we want to take colorblind people into account, for instance; however, there will be a style, as I see it now called "v3", which resets everything to the current set of options.

Color palettes: let me show that here. This function is basically comparable to the palette explorer which was in tmaptools. Fortunately I'm on Windows, so installation is a little quicker. Okay, it just renders a large PNG and opens it in the viewer, so you can browse through the palettes. What I plan is to have it in an HTML format, so that you can copy-paste the palette names, because they can be very long. As you see, we have not yet decided how to organize these;
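The free-versus-shared facet scales mentioned above already exist per facet in released tmap v3; a minimal sketch (the per-row/per-column variant from the talk is a tmap 4 feature and is not shown here):

```r
# Small multiples with per-facet ("free") scales and legends.
library(tmap)
data(World)

tm_shape(World) +
  tm_polygons("life_exp") +
  tm_facets(by = "continent",        # one panel per continent
            free.scales = TRUE)      # each facet gets its own scale/legend
```

With `free.scales = FALSE` all facets share a single legend, which makes panels directly comparable but can hide within-facet variation.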
so far they are organized as categorical palettes, sequential palettes, diverging palettes, cyclic palettes and bivariate palettes, and you see the families here, like the Carto and Brewer palettes, ocean, HCL, and the base R palettes. We still haven't decided on the defaults, but I think that's all I wanted to show. Suggestions are welcome: on the GitHub page there's this yellow "tmap v4" tag that you can use.

We have a lot of questions, and it's a good time to stop, so Hannah, please. All right, thanks a lot Martijn. I collected the questions on Mattermost, and there are a lot of them, so Martijn, please try to go fast, and the questions that we are not able to get to I will answer online.

Great. The first question, from Tom: can you please reflect on geemap, and how can our community catch up with cloud-based solutions to visualize data? They made a package to work with Google Earth Engine, so how do we catch up with that, because you can visualize really large datasets there. I think it has both pros and cons. The great thing about our R spatial community is that everything is open source, everything is connected, everything is possible. How do we catch up? I don't know; maybe we can have an interface to it. I've seen it but not used it myself. That's a good question.

The second question: would you say tm_shape is suitable for plotting high-resolution data as well, or is it recommended to always aggregate data before using tmap? That's a very good point. In tm_shape, let's see, there are a couple of options for data
simplification: we have a simplify argument which you can set to simplify polygons. For rasters, I'm not sure if it's in this chapter, but we have a max.raster option, which is set to, I think, 1 million, and which automatically downsamples spatial rasters, because you cannot really see more pixels than that anyway. So it's basically done automatically, and thanks to stars it's really fast.

Then a probably related question: does tmap also handle the use of satellite images as basemaps? It should: if there is a tile server of satellite images, we can of course use it as a basemap. I still have to implement basemaps for static mode, the plot mode, because it's not there yet; for now you have to use, for instance, the maptiles package to download the tiles or satellite images, and then you can use tm_raster or tm_rgb, but it would be better to have it as a base layer.

Next question: can tmap be used to export to KML, to be opened in Google Earth or similar? A good question. I've also been asked whether maps can be exported to ArcGIS, for instance; that's a little bit harder. For KML I can imagine that it's possible and not too hard, but it depends on the map: if you have a really complex map, then I think it's a little harder to create an exporter. You can always render a PNG as a ground overlay and just put that in, which is always possible. Exactly, though that would mean losing the vector structures. You can probably also export to SVG, which is even better.

The next question: do you have any experience with visualization of uncertainty in spatial data? I've thought about it; it's a really interesting one.
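The static-mode basemap workaround just described can be sketched with the maptiles package. A sketch, assuming tmap v3 and maptiles; `"Esri.WorldImagery"` is one of maptiles' built-in providers, and depending on package versions you may need to convert the returned SpatRaster (e.g. to a stars object) before passing it to tm_shape():

```r
# Download satellite tiles for an area and draw them under a vector layer.
library(tmap)
library(maptiles)
data(World)

nl <- World[World$name == "Netherlands", ]
tiles <- get_tiles(nl, provider = "Esri.WorldImagery", zoom = 6)

tm_shape(tiles) + tm_rgb() +                       # tiles as base layer
  tm_shape(nl) + tm_borders(col = "yellow", lwd = 2)  # data layer on top
```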
You can map uncertainty to a visual variable, say alpha transparency. You have to be careful about the color palette, but if the palette is, for instance, one from viridis, or one of the HCL palettes, which have more or less the same brightness, then you can use alpha transparency to encode uncertainty: the more transparent the color, the more uncertainty there is. That could be an idea, but I think it also depends on the use case. Yeah, thank you. I think Tom did something similar in the past, with whiteness or blurring or something like that. Yes, that's kind of similar indeed. Maybe you could also use the line type, like what I showed in the vignette, something like this. I'm not sure yet. There's also a nice approach from City, University of London: they create sketchy graphics. If you have a polygon, you can draw the exact borders, but you can also sketch it a little, which also indicates uncertainty; they use it, for instance, for bar charts of preliminary results, drawn as if by pencil. Maybe that could be used as well. All right, currently the last question: can you add something like histograms of classes or values to the legend of tmap? There's an example of how that should look included in the question. Yes, it was possible in tmap3, so there was, for instance, this one; a couple of years ago I made a histogram like this. For tmap4 we have to think about it, because I want it to be as general as possible, but for sure we plan to do that, and not just histograms: any chart which can aid the analysis. I've also wondered whether, if you have an interactive map and you zoom in, you could show it there, ideally in a pop-up; in many cases it
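The alpha-transparency idea can be sketched in plain R: pick a colorblind-safe palette and fade each color toward transparent in proportion to uncertainty. This is a generic sketch, not a built-in tmap feature (tmap v3 does not directly support a per-feature alpha aesthetic), and the helper name `fade_by_uncertainty` is made up for illustration:

```r
# Fade palette colors by uncertainty: 0 = certain (fully opaque),
# 1 = very uncertain (fully transparent). Base grDevices only;
# `fade_by_uncertainty` is a made-up helper name.
fade_by_uncertainty <- function(values, uncertainty,
                                palette = hcl.colors(7, "Viridis")) {
  stopifnot(length(values) == length(uncertainty))
  # map values into palette bins
  bins  <- cut(values, breaks = length(palette), labels = FALSE)
  rgbm  <- grDevices::col2rgb(palette[bins]) / 255
  # alpha decreases as uncertainty increases
  alpha <- 1 - pmin(pmax(uncertainty, 0), 1)
  grDevices::rgb(rgbm[1, ], rgbm[2, ], rgbm[3, ], alpha = alpha)
}

vals <- c(0.1, 0.5, 0.9)
unc  <- c(0.0, 0.5, 0.9)
fade_by_uncertainty(vals, unc)   # "#RRGGBBAA" hex colors, decreasing opacity
```

The resulting hex colors can be passed to any plotting function that accepts per-feature colors.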
will be useful. Okay, good, thanks a lot Martijn. Right on time, too, so we can move on to the last speaker of today, before we end the summer school with a panel discussion. Robert, are you there? I'm here. You can stop sharing, Martijn. Yes, I'm looking for the button... okay, good. And we continue with the last session. I would quickly like to introduce Robert Hijmans. Robert Hijmans is a professor in the Department of Environmental Science and Policy at the University of California, Davis, and he works on spatial data analysis in biodiversity, agriculture and health science, and he is the author of many R packages, for example raster and terra. So Robert, please go ahead; I'm looking much forward to your presentation. I hope it's not too early in California. I would have to think about that, because that's not where I am. It's a quarter past four where I am, and that's the same as where you are, I think. I would show my face, but the host doesn't allow me to, so you can look at my Starbucks drink instead; not that you miss much. Can you see my screen, though? Yes, we see the first slide. Then I'll just go with that. So hello, thank you Hana for introducing me. I'm Robert Hijmans, and I'm going to speak mostly about the new terra package, which is a very general spatial data analysis package. Thinking about how to couch what I could say about it, I thought: let's talk about it in the context of reproducible workflows. The reason I thought about that is twofold. One, I was thinking about what made me start working on R and developing R software, and there were two really important considerations. One, I wanted the freedom to do the data
analysis that I want to do, without any software vendor telling me what I could or could not do, but also without having to reinvent the wheel: I didn't want to write software from scratch, and within R, of course, it would build on the shoulders of giants. But the other big thing was reproducibility, especially in collaborations: when people come to me with results, it is very difficult to really trust those results if you cannot look at the code and see how they were derived. And then thirdly, later on, as I moved to a teaching job, it was also about teaching, and thinking about how to make relatively complex workflows as simple as possible, so that users of this kind of software can focus on what they're really interested in, with the minimal basic knowledge of coding that you need, but without being drawn into very esoteric code writing. So I'll first give a very brief introduction of the terra package, and then I will show its use through a case study that I also gave out as a challenge question, which was really more a challenge project; it was clearly a bit much to ask for, but I wanted to challenge people to think not only about reproducing something, but also about validity: reproducibility and validity, going perhaps a bit beyond yesterday's topic of cross-validation and just getting a number, which obviously is very important and dear to my heart, but there's much more to thinking about valid research. All right, so my context is spatial data science, and I thought, let's define it so we know what I'm talking about; I highlight these red words. Especially as scientists, which many of us now are, and many of you in the audience are or maybe aspire to become, we write code, so
that's crucial to create reproducible workflows: not one-off things, but workflows where we can show where results came from, and build on them and reuse them, to analyze often complex data. To me, complexity is really the key here. Often people talk about big data, and that's a part of it, but where it gets really interesting and difficult is the complexity: putting different types of data together, making sense out of it, using appropriate analytical methods, guided by domain knowledge. These methods are also something I hammer on a lot in my teaching; in a sense I feel people tend to find hammers and then see nails everywhere. People doing spatial data analysis tend to have come through a sort of GIS path and have pretty limited statistical background knowledge, but then they want to jump into this kind of spatial modeling, and it doesn't always work well. What you really want to have is some domain knowledge, some understanding of process, before you jump into analyzing data. And so being a good spatial data scientist is a really hard task: you need a whole set of different skills to do it well, and I think that is a challenge for me, and for many of you as well. And I'm going to talk a bit about reproducibility, in part also because I feel I need to start talking more about it, to educate my students but also myself, because still today, every time I write a paper, and the paper is done, and I want to write in the appendix or somewhere, "you can reproduce this with this code on GitHub", I feel like, well, actually, it's not so easy. So I think it's really important that we talk more about that: we talk often about data sharing, but very little about sharing code, and about thinking through what are appropriate
workflows for complex data-analytical problems. Again, I won't go into huge depth on any of this in the one hour I have, given the different things I want to touch upon, but I think that's the general broad stage, or topic, that we can maybe discuss now, and maybe also during the forum later. I think it's just important to get this conversation going, and maybe some of this has happened during these couple of days; I personally haven't been able to be in much of it. Okay, so first let's talk about terra. terra is in some ways similar to sf, in the sense that it replaces the old-school spatial data packages in R, terra especially for the raster data, but really terra has support for both raster and vector data. In fact, the raster package also did a lot of vector manipulation, building on rgeos. The old-school R spatial data stack was raster, rgdal, sp, rgeos, those four packages; lots of that has now been put into this new terra package. Why? Well, raster had many good things, I believe: I thought it was relatively easy to use, it had a lot of functions you could use, and one of the innovations at the time was that there were no file size restrictions, so you didn't have to load all your data into memory to be able to work with large amounts of satellite data. But it was also unnecessarily complex, and I was kind of interested in hearing Martijn speak about tmap, because something similar may have happened there: you start writing a package and there's a lot of path dependency. First I started with this RasterLayer idea, and then I said, oh, we should also have a stack, but what about this case, this should be something else, and before you know it you have all these different functions and objects, or classes, that really could have been designed in a much simpler way
with hindsight. And so, with that hindsight, I started working on terra a couple of years ago. There were some ugly things as well, like a file format not being supported, and certain functions being much too slow. So terra came out a year and a half ago now. It's very much like raster, and it's simpler; it's generally faster, actually not always, not everywhere, but in most cases; and it's much better in many small ways, too complex to point out, only small ways, but I think it really works much better, except that it's still a new package, so there are still some bugs that people are finding, and thank you all very much for doing that. But I'm quite confident to say: stop using raster, start using terra, and if you get stuck, let me know, and I'll make sure that terra can do what you need it to do. Just one more technical slide, on how terra is also different under the hood from raster. raster, like most R packages, or many R packages anyway, is mostly written in R. terra, however, is almost the opposite: it is almost entirely written in C++, really, with some C libraries. So essentially there's this, what I call, the "spat" library that does all the work, using the GDAL, GEOS and NetCDF libraries, and then there is an Rcpp module, a thin wrapper you could call it, an interface between R and C++, and that allows you to use it through R. The nice thing about this design, and I designed it this way for speed in part, but also because I didn't want to pollute the C++ code with Rcpp code, is that I can also essentially compile a standalone program, and in the long run I want to make a very similar package in Python, using a Boost module, so that you would have exactly the same interface in R as in Python. For most of you this doesn't matter as end users, but this has one big implication:
I try not to build too much, or at all if I can, on other R packages, because I don't want the dependency; I want to have standalone C++ software that I can use in different contexts. If you want to learn about using terra, this is where I would send you: rspatial.org/terra. There's a similar version for the raster package, as some of you may have seen in the past; it's nearly complete now as a copy of what was available for raster, and when it becomes complete I've actually got plenty of plans for much more, but that's where I would start. A final slide on terra: I think it's currently getting quite stable. There was a peak in development early this year, and then a peak in bugs over the past months; now it seems to have slowed down a bit, but surely there are still issues to be found. So please, if you find them, you make a huge contribution by reporting them. I've very frequently met people who say, oh yeah, I found this thing that didn't work, but they never told me. If you'd only told me, you would not only have helped yourself, you would also have helped so many other people. So if you want to make a small contribution, providing bug reports, or even simple feature requests, is tremendously helpful. All right, so now I want to jump into the case study, and through that case study I will show terra in action. I took this case study because it's one of a few papers that I've been working on to reproduce, and to write a bit about, and to have some thoughts about. The highlighted red area summarizes the finding: they say there are two groups of religious beliefs, ones that have moralizing high gods, basically a moral authority that tells you what's right and what's wrong,
and other religions that do not have that, and they find that those beliefs tend to be more common in poorer environments. Just to be sure here: I'm not making any statement about religious beliefs, or about your religious beliefs, or anything like that. I'm just imagining myself being some intelligent being from another planet, looking at planet Earth, and seeing what kind of patterns there are. So please, if you have religious beliefs yourself, keep in mind that it's from that distance that we look at this. I hope that some of you have looked at the paper and tried to reproduce some things. The first thing I did was make a map, but before that I have another general slide here, which again introduces a bit of my motivation. This is a paper I use a lot in teaching: "Why most published research findings are false". Partly I use it because, of course, it's a beautiful paradox, but also because I think it's really important to think about. Now, that paper is much more about statistical inference errors and things like that, beyond what I'll be talking about, but it's very clear that a lot of findings are incorrect, and it's really interesting to think about why that may be, what we can do about it, and how that can help your writing and your reviewing. And this could be one of those cases, so let's have a look. First I started out, as I said earlier: let's just get the data in and reproduce what we see. There are a couple of figures in the paper, and I'll only show this map, since we'll of course focus a bit on the spatial data. They have a map of where we do and do not see these particular religions, that is, the belief in moralizing high gods, and an underlying map
of net primary productivity. So let's go to RStudio, to my first script. I started out with making this map of net primary productivity. I use the terra package, and I use the geodata package, which allows easy access to spatial data. I have a function here, npp, based on this PDF, which actually is interesting: it's quite a challenge, if you read the PDF, to even derive what this function should be. The step from mathematical notation to code, and vice versa, can be really challenging. I think science still works on the notion that mathematical notation is king: if you can write this, then you really are a scientist. The problem, however, is that code is how we implement things, or tend to implement things, and in many ways it is what we really need in order to understand what's going on. Now, of course, there's the issue: this is R, what if you want to use Python or C++? But even then I would say, particularly for a simple function like this, it hardly matters. Anyway, suffice to say, it was a challenge to translate, but I made this function. I downloaded precipitation data, so if I type prec, I can look at that. Oh, that's actually not great here; let me get a bigger font. So prec is a SpatRaster; it has rows, columns, a number of layers, 12 layers, one for each month, and 12 sources, because each month is a separate file, and minimum and maximum values for each of the layers. If you ever used the raster package, you're familiar with this. I want the annual precipitation, so with the overloaded sum function I say: sum this up. And you can always wonder, of course, what does this do? We have these 12 rasters, and I could plot them; I have to go back to the original to do
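The npp function he refers to looks like the classic Miami model for net primary productivity, which reproduces the value of 192 he gets below for 10 degrees and 100 mm. The coefficients here are the standard Miami ones; his actual script may differ slightly, so treat this as an assumed reconstruction:

```r
# Miami-model net primary productivity (g dry matter / m^2 / yr); a sketch of
# the kind of function used in the talk. Coefficients are the standard Miami
# ones (assumed), not copied from his script.
# t: mean annual temperature (deg C); p: annual precipitation (mm)
npp <- function(t, p) {
  npp_t <- 3000 / (1 + exp(1.315 - 0.119 * t))  # temperature-limited NPP
  npp_p <- 3000 * (1 - exp(-0.000664 * p))      # precipitation-limited NPP
  pmin(npp_t, npp_p)                            # the more limiting factor wins
}

npp(10, 100)   # roughly 193, matching the "192" quoted in the talk
```

Using `pmin` rather than `min` keeps the function vectorized, which matters once it is applied to raster layers.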
that: plot(prec). It takes a little while, but we have these 12 layers of monthly precipitation. One design question in terra, of course, is what sum should represent: sum could be just the sum of all values, but it could also be the sum, across layers, for each cell, and that is the general design principle. There are, of course, ways to sum all values as well, using the global function, but generally we want to maintain the raster structure. So by summing, we now have annual precipitation for the world, per raster cell. A similar thing for average temperature, except there I take the mean. Now I want to have them together in a single object, so I use the c function, combine, like in normal R; in raster this used to be stack. So now I have the clim object; it has two layers, one called mean, the other called sum. That's not great, but that's because one was the sum and the other the mean, so let's assign names: we could call them "temperature" and "precipitation", but I'm too lazy for writing out long names like that, so I'll just assign those shorter names and do this again. Okay, now we have names that are easier to understand. We have Antarctica in here; that was not part of their map, there are no people living there, so there are no religions that we know of, unless the penguins have one, which is something we don't know about. So let's crop, from minus 180 to 180, so west to east, and from minus 60, which is about here, to cut out Antarctica. Okay, so now I have the input variables for this npp function, to compute the net primary productivity. Another question: how do I apply it? There would be two ways. You could actually write this function in such a way that the rasters would be the arguments, but more typically, and generally a better approach, is to write first a function that works
on a vector, and then apply that to the rasters, using one of the several apply-family functions. Well, one thing I see go wrong a lot is that somebody will write a function and use it, in this case with lapp, without checking the function first. So what you want to do, always, is check the function. Let's say 10 degrees and 100 millimeters, and we see: okay, 192. I actually forgot what the unit is, grams per square meter or something, but let's think: if it gets warmer and wetter, we should get higher productivity. Oops, that was not what I wanted to do, I don't know what I did there. Wetter: higher productivity; clearly rainfall was limiting at 100 millimeters. That makes sense; you can't have much plant growth at 100 millimeters. The next thing to check is whether your function is vectorized. So let's make a set of temperatures, put that into the function, and maybe plot temperature against the npp at one thousand millimeters of precipitation. Okay, well, it at least goes up; it's very simple, linear, with a maximum, so apparently rainfall becomes limiting. Let's see, very high rainfall: okay, now rainfall doesn't limit anymore, but you do see that at some point there's a maximum you can reach. All right, we could go on and on, but most of the questions on Stack Overflow I see about this kind of use of a function applied to rasters are from people who haven't looked at their function. Try your function first. And of course the other big thing you should never forget: put some NAs in there, say NA, 10, 15.
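His checking routine can be sketched like this, again using the assumed Miami-style npp function (redefined here so the snippet is self-contained): check known values, check vectorization, check missing values.

```r
# Sanity checks before applying a function to rasters:
# (1) known values, (2) vectorization, (3) behavior with NA.
# npp is the assumed Miami-style sketch, not necessarily his exact script.
npp <- function(t, p) {
  pmin(3000 / (1 + exp(1.315 - 0.119 * t)),
       3000 * (1 - exp(-0.000664 * p)))
}

# (1) at 100 mm, rainfall is limiting: warmer does not help, wetter does
npp(10, 100)   # ~193
npp(15, 100)   # same: still rainfall-limited
npp(10, 200)   # higher: more rain, more growth

# (2) vectorized? feed a vector of temperatures at fixed precipitation
temp <- seq(-5, 30, 5)
plot(temp, npp(temp, 1000), type = "l")  # rises, then the rainfall cap kicks in

# (3) missing values must propagate, not throw an error (pmin handles NA)
npp(c(NA, 10, 15), 100)   # NA for the first element, values for the rest
```

If the function errors on vectors or on NA, fix it before reaching for the apply-family functions.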
Very often functions fail for missing values; here they don't, we get missing values back, and it all works. Okay, so that's good, I like my function; let's now apply it to the rasters. This is somewhat new in terra: we have the app function, which is similar to calc in raster, and terra has lapp, which is similar to overlay in raster, and there are some others, rapp, sapp, all kinds of apply-type functions that come in very handy. So let's run it. The output file already exists, because I tried this out before, so I say overwrite = TRUE. Note that I provide a filename argument here; you don't have to do that, but almost all functions can take a filename argument. I often see people not do that and later want to write a file; if you want to write a file anyway, then you might as well do it in one step instead of two. This message here is a nuisance that we can ignore. So we have npp computed; there it is. Now, this is not necessarily the same way they compute npp, but I wanted to illustrate the use of lapp, where we have two layers in this one object called clim, and they are used as the arguments: the first layer is the first argument t, the other layer the second argument. The only thing you have to be sure about is that the order is right; or you can actually use the names if you want, you can match names, so that you don't have to worry about the order. All right, so now we have the npp. Let's have a look now at the data they used, which is available from PNAS; that's nice. There's an HTML file I can download and then read. I simplified the data a little bit, I won't go into that, but it wasn't important here. So now I have a data frame with the data that they used in their analysis: they have the societies, the group that each society belongs to, latitude and longitude of the
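A minimal self-contained sketch of the lapp pattern he describes, using synthetic data so it does not depend on the downloads above (the filename and overwrite arguments work the same way on real data):

```r
# lapp applies a function across the layers of a SpatRaster, passing each
# layer as a separate argument: app ~ raster::calc, lapp ~ raster::overlay.
# Synthetic two-layer "climate" raster for illustration.
library(terra)

clim <- rast(nrows = 18, ncols = 36, nlyrs = 2)
values(clim) <- cbind(runif(ncell(clim), -5, 30),    # layer 1: temperature
                      runif(ncell(clim), 0, 3000))   # layer 2: precipitation
names(clim) <- c("temperature", "precipitation")

# assumed Miami-style npp sketch, vectorized via pmin
npp <- function(t, p) {
  pmin(3000 / (1 + exp(1.315 - 0.119 * t)),
       3000 * (1 - exp(-0.000664 * p)))
}

# compute and write to file in one step; overwrite = TRUE if it exists
out <- lapp(clim, npp,
            filename = file.path(tempdir(), "npp.tif"),
            overwrite = TRUE)
```

Layer 1 becomes the first argument `t` and layer 2 the second argument `p`, which is why the layer order (or name matching) matters.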
location of this culture, and some other variables they used in their model; and I added this moralizing / not moralizing variable. So, from a data frame, with terra I can now make a SpatVector: we have looked at SpatRasters, and SpatRasters are created with the rast method; SpatVectors are created with the vect method. So v is a SpatVector, in this case of points, and it has all the data frame attributes in there as well. We can map it by itself, and you get a little legend there, or we can of course put it on top of an existing map. Okay, that's great, so we have reproduced some of their data. And I will say, actually, this paper is a great example where, by and large, it's quite reproducible. I can't get exactly the same numbers, but I get very close. So there was no trick there; there was none of these mistakes that are easily overlooked, a pure coding mistake, a mistake in the analysis, mixing up a variable, something that sorted wrong; nothing of that kind did I discover in this case. Maybe the hardest part to reproduce was this part here: "after simultaneously accounting for potential non-independence among societies because of shared ancestry and cultural diffusion, we find...". Okay, so how do we account for this, and what does that mean? Essentially it's Galton's problem: you find similarity between cultures, and you can attribute it to their environment, but if they're near to each other, it's actually more likely due to borrowing, or to common descent; they're really sister cultures that have an ancestor culture, if you like, from which both have inherited. We can come to that if there's time, but there's this whole question of how do you
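The vect step can be sketched as follows; the column names and values here are made up for illustration, not taken from their dataset:

```r
# Creating a SpatVector of points from a data frame with lon/lat columns.
# Column names and values are invented for illustration.
library(terra)

d <- data.frame(society    = c("A", "B", "C"),
                lon        = c(35.2, -99.1, 112.5),
                lat        = c(31.8, 19.4, -7.3),
                moralizing = c(1, 0, 1))

v <- vect(d, geom = c("lon", "lat"), crs = "+proj=longlat +datum=WGS84")
v                        # a SpatVector of points; attributes are kept
plot(v, "moralizing")    # quick map with a small legend
```

The data frame columns become the attribute table of the SpatVector, so the points can be plotted by any attribute or overlaid on a raster map.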
even delineate a culture: where does a culture begin, where does it end? There are, of course, no hard boundaries; it's very fuzzy. So this is a general, very major issue in data, and essentially spatial autocorrelation reflects this phenomenon. It also exists, of course, in all kinds of other data; it's just that there we don't have maps, and we don't have the same tools to show it. Similarly, if you have data from, say, 100 people, but they tend to be from the same families, there are all kinds of other ways in which independence is not guaranteed; but with spatial data we actually have nice ways to look at it, which in other cases you don't have. And so this is what they did: they used the ten nearest neighbors of a given observation as a covariate. Essentially what they're saying is: I may have a moralizing high god, but if my nearest ten neighbors also have that, then this point doesn't add so much weight to my inference, because it's probably from borrowing; but if my neighbors are different, then it should be an important one. So this is how I implemented that. terra has this nearby function, so I say: give me (I can take this off first) my nearest points; k is ten, so for each point, what are the ten nearest points? I take out the first column, because that's an id, so we don't really need it; they're always ordered in the sequence that the data came in anyway. So let's run this again: I'm making a matrix in which I replace each id with its actual value. Let's look at that: for the first one, none of the neighbors has a moralizing-god
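The nearest-neighbor covariate can also be computed without terra; here is the same idea in base R, a sketch that makes the logic explicit (terra's nearby does the neighbor lookup for you, and on lon/lat data it uses geographic rather than Euclidean distances, so treat this as an illustration only):

```r
# For each point, the mean outcome of its k nearest neighbors (excluding the
# point itself): the autocovariate described in the talk. Base R, Euclidean
# distances; `knn_autocov` is a made-up helper name.
knn_autocov <- function(coords, values, k = 10) {
  dmat <- as.matrix(dist(coords))
  sapply(seq_len(nrow(coords)), function(i) {
    nb <- order(dmat[i, ])[-1]               # sort by distance, drop self
    mean(values[nb[seq_len(min(k, length(nb)))]])
  })
}

# tiny deterministic example: two tight clusters with opposite outcomes
coords <- rbind(c(0, 0), c(0, 1), c(1, 0),       # cluster 1
                c(10, 10), c(10, 11), c(11, 10)) # cluster 2
vals <- c(1, 1, 1, 0, 0, 0)
knn_autocov(coords, vals, k = 2)   # 1 1 1 0 0 0: each cluster sees only itself
```

A high autocovariate next to a high observation is exactly the "probably borrowing" signal he describes.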
neighbor, but number two has two of them. And then let's do a normal rowMeans, and I get a single value per point, and we can again plot that. What we see now should not be surprising: we'd already seen that this was a phenomenon occurring mostly in Europe, North Africa and the Near East, so there all these points and their ten nearest points have the same values, whereas here you see a gradient, and here also; Central America is also kind of interesting, and here are some other gradients. So, nothing surprising there, but it's always good to look, to see if what we did actually adds up. Finally, we can look at the names of this vector, and we can write it out, in this case to a shapefile; again, it already existed, so I'm going to overwrite it on purpose. I take a shapefile here; it's a bit maligned, but still by far the most widely used file format, I think, for vector data. One of the real drawbacks, and maybe the main drawback, I think, is that if I now read this again, some of the names have changed: for example, the society name field has been cut short, because you only get about ten characters for a variable name. So be very careful with long variable names and shapefiles. Okay, so now we have prepared the data: we have the npp data, we have the spatial autocorrelation. In the interest of time, and also of your attention span, I'm not going to reproduce the whole statistical analysis; I'll just show essentially what they did, with some critique on that, and then two alternative ways to look at their data. I have to reread the data, and I take the data frame from the SpatVector, essentially getting
rid of the geometries, and I make a model: moralizing as a function of abundance, as the variable is now called, because I read it from the shapefile and the name got shortened. They have more complex models, but their main finding was that the presence of moralizing gods depends on having a low npp. We do... what was that, an error message? Oh, because I want to do a glm, of course. Let's look at the summary. Well, for most intents and purposes you can say there's very strong support, right? The p-value is essentially zero, so in the simplest of worlds you would say: this is scientific evidence for this finding, and essentially that's what they report. They do a bit more, and they look at all kinds of other things, but essentially they say: you see, there's this association. One interesting thing, I thought, to look at is this: I don't deny this correlation, or that this model fits that way, but if this were generally true, then you would expect it to happen not only in this region, but also in the Americas, and maybe also in the eastern part of the world. So how general is this, really? Or is this really something specific, particular to this part of the world? One simple way to look at that could be a nice exploratory data analysis, where I first look at the relationship between npp and the presence of moralizing gods; let me redo the figure here so that you see the labels, npp and the presence of moralizing gods. If you do that for all the data, you see this relationship, pretty strong; npp, by the way, is standardized here, but this is still low and that is high. Now, if you only take Euro-Africa, so between -20 and 65 degrees longitude,
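The model he fits is an ordinary logistic regression. A self-contained sketch with simulated data (the variable names are made up; in the talk the npp column carried a truncated shapefile name):

```r
# Logistic regression of a binary belief indicator on standardized NPP.
# Simulated data with a built-in negative effect, mimicking the reported
# association (moralizing gods more common where NPP is low).
set.seed(1)
npp_std    <- rnorm(200)                           # standardized NPP
moralizing <- rbinom(200, 1, plogis(-2 * npp_std)) # true effect is negative

m <- glm(moralizing ~ npp_std, family = binomial)
summary(m)   # expect a clearly negative coefficient and a tiny p-value
```

With real data the same call reproduces the "essentially zero" p-value he mentions, and refitting on geographic subsets (as he does next) only requires subsetting the data frame first.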
including southern Africa, we get a much stronger relationship, very strong in that part of the world. However, if we take only southern Africa, there's nothing; or the whole world except Europe and Africa: also absolutely nothing, no relationship. That is surprising, particularly if you think, for example, of southern Africa: southern Africa is actually very similar to northern Africa, in the sense that there's also desert, and there are also more humid areas, at least in the northern parts. So why would it be so different? What's going on here? Why do I think this is interesting? Now, we don't really have a setup to do Q&A here, but of course it's a nice thing to discuss in class, and there are all kinds of alternative hypotheses you can think about, or alternative models you can propose, and I think that's really why I like this paper a lot: it's one thing to say, well, we have this finding and it's all really interesting; it's a whole other thing to ask, does it add up, is it supported by your domain knowledge? And of course domain knowledge is tricky here, especially because it's not my domain anyway; I don't study the spread of religion. But one thing that struck me, and in their data it's not so simple to see, is that if you look at these cultures here: well, one thing is that there are so many more different cultures in some places of the world than in other places; is that correct? Let's not go there for now. But another big thing is that they took the culture as the point, the unit of observation, not the religion, and, for better or worse, that could possibly make a huge difference, because what we actually see here, if you look at which religions we have here: well, we have Christianity, and we have
Islam, and there's Judaism as well. So in my view you're really talking about one basic religion — Judaism — from which Christianity and Islam essentially derived. You could also say this is really just one religion here. They took their ten nearest neighbors, but if you think of this as one religion that spread out, all of a sudden you have a very different process, and then attributing that pattern to the environment, when it is just a spreading event, becomes a bit tenuous. Another thing I would point out: if you think about Islam currently — the time aspect here is, what does it say, 1700? — currently, of course, Indonesia has the largest Muslim population, yet it is only represented by one point here, and Malaysia as well. So one, two, three, four, five, six, seven — and then in the Philippines eight points or so, and some of those are actually different religions. So it's not very representative in terms of numbers, and you might argue that maybe you should just weight by population. There are always all these alternative hypotheses you could have and different ways you could look at these data. So I did one thing — well, I already showed this: it really depends on the region. I didn't show the p-values, but if you run the same model — the one with this essentially zero p-value — and leave this data out, you get a p-value of 0.94, about as bad as you could get. But you could also look at an alternative hypothesis and say: let's just look at Jerusalem. If I'm right, it's just a spreading event out of Jerusalem. So this is where Jerusalem is — we have this SpatVector. I create a new empty raster from the average temperature: by calling rast on tf I get a raster with the same structure as the average temperature data, but with no values. That often helps me
to rasterize things, or to otherwise create new empty data sets that I want to be sure align with a raster I already have. That, by the way, is another common mistake people make when they work with raster data — and some commercial software almost pushes you that way: people make all these rasters that don't align. The one thing you want to do when you work with raster data is think, from the outset: what is my area of interest (make sure it's large enough), what is my cell size or pixel size, and what is the origin — typically try to keep your origin at zero, so that from there you can easily make sure that all the other data you bring in align. The other big thing is trying not to reproject or resample your raster data, because there is always a quality loss. With vector data you can do that as much as you like; with raster data you want to avoid it. So I created this empty raster and rasterized this one point to it. Oh, that actually takes a little while — funny, because it's only one point, but as I said, not everything is faster. There you go. Let's see if we can actually see it — it's very hard to see, it's too small. So let's quickly compute the distance to Jerusalem: the distance from this one point to all the NA cells, and I'll use tf to mask out everything else — well, I can show it. What's going on here... plot the integer... OK. So this is the distance from Jerusalem. It looks odd, but that's because with longitude/latitude data you get these odd shapes. mask with tf will then set to NA all the cells that are NA in the temperature data, so that helps us see better what we're dealing with. I can plot the points on top of it
again, and I plot the original points twice — the first time for the white background, to give them a bit of a halo, which seems appropriate here. So you see, even by eyeballing it, it's not the worst of models. Where it doesn't seem to work very well is southern Africa, because there you're still not that far from Jerusalem — but there is this gradient. So let's quickly do a model of presence as a function of distance. I do summary, and again I see that the p-value is essentially zero, so this also seems a very good model. We can of course compare the models in different ways. Here I use AIC for three models: distance plus abundance, only distance, and only abundance — abundance again being NPP here. The lowest AIC score is best, so the combined model has the most support if you use AIC. Taken separately, the other two are very similar, but distance is actually a better model than abundance. We could also do cross-validation — well, it's not really cross-validation here, because I just use the testing data — and there's another measure, the AUC, which gives a somewhat similar result. All right, I've got one more left, but let's reflect a little more on this one first. The big question becomes: why is NPP still useful? Because there are all these points in more humid Africa where there are no moralizing gods — so that helps. I personally would take this distance model a bit further. I would claim: think about how these religions spread. They were actively spread, both by Christians in the north and by Islam mostly from the south, and they were spread westward by relatively high-population-density peoples living in the Sahel. The more rainforest-like region was more isolated, with lower population density — you can't have cows there — so there was this natural barrier,
and so the question is — there clearly is this separation — whether that is just because the people in the north simply weren't interested in conquering the south, or whether there are different reasons. That remains a question you can philosophize about a lot, but surely, if you took population density around, say, 1800 or 1900 into account, you would also change that model quite a bit. Just one last thing in R, and then I'll wrap up and have time for questions. A whole other thing you might come up with here: you could also do a species distribution model. Let's get bioclimatic variables. I thought I had downloaded them — I guess not. Typically, the way the geodata package works is that it downloads the data once; once it's downloaded to a particular folder, you won't have to download it again and again. The reason I developed this is obviously to help reproducibility, and just the ease of writing code. One of the worst things is always having to deal with getting hold of people's data, so if you want to share data used in scripts, put it on a Dataverse, or put it some place like a Google Drive, or at least provide some kind of link to it — otherwise it's just a mess. And when you download it, make sure you actually save it, because what's here today may be gone tomorrow. Anyhow — that filled the time, and now we have the bioclimatic variables that are often used in this kind of spatial distribution modeling, or species distribution modeling, where you essentially say: if I know the environments at the locations where I observed something, I should be able to predict where else that something could be, or maybe is,
or at least where there are environments similar to where I've observed it. So the first thing I do is extract from this SpatRaster the bio values for these points v. bio has 19 layers and v has 583 points, so dim(e) is 583 by 20 — and that 20 is because the first column is an ID for the points. Here it has no real relevance, because it's always going to be one-to-one, but if you had polygons you might have many values for a single polygon; that's why that first column is there. I remove it here, and remove the NAs — randomForest doesn't like them. Oh, I've got something... "undefined columns selected" — ah, this problem. OK, the random forest model. People always think about machine learning as being difficult, and I always say the machine learning part is very easy in principle: one line and you're done. The hard part is getting your data in shape to be able to actually do this, and the other hard part is thinking about what it all means — and what it maybe doesn't mean. But given a model you can make a prediction back: given model m and the predictor variables in bio, predict where these moralizing gods may occur. Random forest is a bit of a slow predictor, because of all these complex trees — I should have made a smaller one, perhaps, but there we go. Plot p, and plot these moralizing gods. Now, I think this is a really interesting result, because on the face of it you would say: oh, this is a great result. At the same time, it really shows why this method is so weak, I feel. Yes, it's great, because it reproduces what we already knew. But while we're using environment as the predictive variable, there are clearly environments elsewhere that should be quite similar to this whole region, yet it's all excluded — it's like: no, no, that's
not similar enough, with very few exceptions — because there were no blue points there. So unless you're very careful, these methods tend not to be very general: they may be reasonable interpolation methods, but they tend to be very poor extrapolation methods unless you take very good care. So these are just some examples of what else you might do with the data and how else you might think about it — that was really my challenge for you. By the way, I'm trying to work this out into a larger exercise that we'll put on the rspatial website, sometime later this year. So: I mentioned that most published research is wrong. That paper itself could be wrong, and that would then be a good thing, I suppose. There's also this problem of replication. The usual context for that is experimental replication, but I think there is still a big crisis of analytical replication too, even for those of us using R and scripts. So I want to end with a few comments about workflow design and what I do — hopefully that is useful to some of you, and maybe you have some ideas or better ways that you can mention in the discussion. I really want to emphasize it, because the one thing that keeps coming back when I work with graduate students and others is the complexity of actually sharing workflows. So this is how I make projects and how I organize code. You may have seen already, if you paid good attention, that under my project folder I had a data folder with a raw-data folder inside it. Always download your raw data into a folder and don't touch it with your hands, so to
speak: use scripts to clean your data. So those are the first scripts — data cleaning — and then maybe some final data sets that you actually use for analysis; other folders for graphs or docs or whatnot. Organize your code in separate files. Another real difficulty is these very long code files: I get so many messages like "oh, can you please help me, this doesn't work at line 350" — of some thousand-line file — and, well, I can't really help you then (I actually try anyway; I don't know why, but I do). The more you can separate steps out, the better. And what I should emphasize — this lovely cartoon really sums it up — is that whenever you do a project, the more you start out thinking about other people needing to look at and understand what you've done, the better. And that other person — I'm old enough to have learned this — is often your future self. I always think of it as asking this of my past self: it took me a while, but in the long run things will work out. And here this invisible person is writing a system — to me it's all about functions. What I keep hammering into my students is: you have to use functions. As soon as I see someone who actually starts writing functions — meaning not only putting code in separate files, but within those files having clearly isolated functions that each do a single task — that's when their code becomes a joy to look at. And I should add that I actually explicitly said in my exercise: do not use the tidyverse. The only reason for that is that I feel it introduces all kinds of new, interesting
things — maybe a lot of it is duplication, though it clearly has cool features — but I see two downsides. One is that sometimes very simple things in base R are overlooked, so you get more complex code through the tidyverse. The other is that the extensive use of the pipe — the magrittr symbol, or whatever you call it — seems to stimulate long, complex piped-together workflows rather than much cleaner functions used in much shorter statements. So there's one very general point I want to make to all of you: write functions. Right — thank you. That's my whirlwind tour, where I tried to talk about terra in the context of reproducible research, and I hope that was of interest to at least some of you. Thank you very much for your attention. — Great, thanks a lot, Robert. We're just in time, so we have five minutes left for questions, and there are some posted on Mattermost. Robert, can you turn on your video, maybe? — Oh great, thank you, it works now. I don't see myself, but maybe you do. — Yes, we see you, thank you so much. All right — great talk. There are five minutes for questions. Please, Hannah. — So the first question is about the extract function in terra: in raster's extract function there was an option for a buffer argument, which apparently is not there anymore in terra, and the question is whether this is going to come in the next version of terra or not. — Are you sure? Let me check... it may not be there yet — so the answer is: I didn't know that. If so, it's a really nice feature and it should really come back, and if you don't see it, please send me an email or put it on GitHub so I don't forget. Of course, you could make a buffer first and then extract, but I agree — that's a nice
question. Thank you. — OK, so the next one is from Edzer — well, I asked it myself already. So the question is about reproducibility: using open-source software for reproducible open science helps, because users can try to read the source code to understand what's going on if needed. Would R users be helped more in that respect by a package written mostly in R, like raster, or by a package written mostly in C++, like terra? — Yeah, it's a good question. It's interesting — this is true, though I think it operates at a very different level. For most functions in a stable R package you don't really have to look at how, say, the mask function works; you can see that it works. I'm much more concerned about the end user and how they can write better code on top of supposedly well-functioning R software, or whatever software you use. Nevertheless, this is of course often touted as a great benefit of open-source software, and it is true. I must say I've rarely used it, but particularly if something fails I do — or if I see that a function doesn't quite do what I want, it's like: oh, I'll grab that function and edit it. In that latter case especially, I guess that for R users an R function will be easier, for sure. With Rcpp it's very easy to write your own C++ functions, but not everybody has that skill. So — short answer: yes. — OK, so the next one is a detailed technical one: sometimes we need to plot a vector above a raster plot; this fails when the raster has many layers. How can we fix this using just terra objects? — I didn't quite hear the first part. — Sometimes we need to
plot a vector above a raster plot; this fails when the raster has many layers. — Oh, I see: you want the same vector on many layers. Let's see if I can do that here very quickly; I'm not sure I know it off the top of my head, but you could do something like this. I'll make an x — I'll just take the first four layers of the average temperature, to make it smaller — and then I would say plot(x), and there's a fun argument, and it should be something like a function that calls points(v). All right, I'll try one more time — it didn't exist? Oh, is that what I said? Sorry — the average is a single layer, probably. Ah yes, thank you — good that somebody's paying attention here. So now we have the four-layer average: one, two, three, four. Right — so you have a function hook: you can add a function that adds something to all the maps. And actually, it's interesting: there's a feature request on GitHub where somebody asked whether there could be some way to know which layer you are on, so that the function would get a layer argument or so, and based on that layer you could then draw something else. That's not there yet, but it might come. — For the sake of time, we have to stop now.
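[Editor's note] The model comparison in the talk is done in R with AIC(): lowest score wins, and the combined distance-plus-NPP model comes out best. The bookkeeping behind that comparison can be sketched in a few lines of Python; the log-likelihood values below are made up for illustration and are not the values from the lecture.

```python
import math  # not strictly needed here, but typical for likelihood work

def aic(log_likelihood, n_params):
    # Akaike information criterion: 2k - 2*ln(L); lower is better.
    return 2 * n_params - 2 * log_likelihood

# Hypothetical log-likelihoods for three fitted logistic models
# (illustrative numbers only).
models = {
    "distance + npp": aic(-210.0, 3),  # intercept + 2 predictors
    "distance only":  aic(-221.0, 2),
    "npp only":       aic(-223.5, 2),
}

best = min(models, key=models.get)
for name, score in sorted(models.items(), key=lambda kv: kv[1]):
    print(f"{name:15s} AIC = {score:.1f}")
print("best model:", best)
```

In R the equivalent one-liner is `AIC(m_both, m_dist, m_npp)`; the point of the sketch is only the decision rule, not the fitting.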
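[Editor's note] The distance-to-Jerusalem surface is built in the talk with terra's distance() on a raster holding one rasterized point, then mask()ed by the temperature data. A rough stand-alone Python equivalent — great-circle distance from every cell centre to one origin, with NA cells masked — could look like the sketch below; the coordinates used for "Jerusalem" are approximate and only for illustration.

```python
import math

def haversine_km(lon1, lat1, lon2, lat2):
    # Great-circle distance between two lon/lat points, in kilometres.
    r = 6371.0
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dlat = p2 - p1
    dlon = math.radians(lon2 - lon1)
    a = (math.sin(dlat / 2) ** 2
         + math.cos(p1) * math.cos(p2) * math.sin(dlon / 2) ** 2)
    return 2 * r * math.asin(math.sqrt(a))

def distance_grid(lons, lats, origin, mask):
    # Distance from `origin` to every cell centre; None where `mask` is None
    # (the analogue of terra's mask() against the temperature raster).
    olon, olat = origin
    return [[None if mask[i][j] is None
             else haversine_km(olon, olat, lon, lat)
             for j, lon in enumerate(lons)]
            for i, lat in enumerate(lats)]

jerusalem = (35.2, 31.8)           # approximate lon/lat
lons = [-10.0, 15.0, 35.2, 100.0]  # toy one-row "raster"
lats = [31.8]
mask = [[1, None, 1, 1]]           # second cell is NA, e.g. ocean

d = distance_grid(lons, lats, jerusalem, mask)
print(d)
```

terra does all of this (and the lon/lat handling that produces the "odd shapes" he mentions) internally; the sketch just shows why cells near the origin get small values and masked cells stay NA.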
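[Editor's note] The caveat that these models are "reasonable interpolation methods but very poor extrapolation methods" can be made operational: before trusting a prediction, check whether the new environment falls inside the range of the training data. The helper below is a hypothetical, minimal envelope check (in the R world one would reach for tools such as MESS maps instead):

```python
def training_envelope(samples):
    # Per-variable (min, max) over the training rows.
    cols = list(zip(*samples))
    return [(min(c), max(c)) for c in cols]

def is_extrapolation(envelope, point):
    # True if any predictor value falls outside the training range.
    return any(not (lo <= v <= hi)
               for (lo, hi), v in zip(envelope, point))

# Toy training data: two bioclimatic predictors per observed site.
train = [(12.0, 800.0), (18.0, 450.0), (22.0, 300.0)]
env = training_envelope(train)

print(is_extrapolation(env, (15.0, 500.0)))  # inside the envelope
print(is_extrapolation(env, (30.0, 100.0)))  # outside: prediction is suspect
```

This is deliberately crude — it ignores correlations between predictors — but it captures the lecture's point: a random forest will happily emit predictions for the second point even though nothing like it was in the training set.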
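[Editor's note] The download-once behaviour described for the geodata package — fetch to a folder the first time, reuse the local copy afterwards — is a generally useful pattern for reproducible scripts. A sketch of that pattern (the `fetch` callable and URL here are stand-ins, not geodata's API):

```python
import tempfile
from pathlib import Path

def cached_download(url, dest_dir, fetch):
    # Download `url` into `dest_dir` once; later calls reuse the local file.
    dest = Path(dest_dir) / Path(url).name
    if not dest.exists():
        dest.parent.mkdir(parents=True, exist_ok=True)
        dest.write_bytes(fetch(url))  # only hits the network the first time
    return dest

calls = []
def fake_fetch(url):
    # Stand-in for a real HTTP download; records how often it is called.
    calls.append(url)
    return b"bioclim data"

with tempfile.TemporaryDirectory() as d:
    p1 = cached_download("https://example.org/bio.tif", d, fake_fetch)
    p2 = cached_download("https://example.org/bio.tif", d, fake_fetch)
    print(p1 == p2, len(calls))  # second call is served from disk
```

The same idea also backs up his "here today, gone tomorrow" warning: once the file is on disk in your project's raw-data folder, the script keeps working even if the remote source disappears.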
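[Editor's note] The closing advice — write small functions that each do one task, rather than one long piped-together expression — translates to any language. A toy Python contrast, assuming a simple record-cleaning task (the function names are invented for the example):

```python
def strip_fields(rows):
    # One task: trim whitespace in every field.
    return [[f.strip() for f in r] for r in rows]

def drop_empty(rows):
    # One task: remove rows that have any empty field.
    return [r for r in rows if all(r)]

def clean(rows):
    # The workflow reads as a short chain of named, individually
    # testable steps instead of one long inline pipeline.
    return drop_empty(strip_fields(rows))

raw = [[" Kano ", "1"], ["  ", "2"], ["Accra", " 3 "]]
print(clean(raw))
```

Each step can be tested and reused on its own, which is exactly the property a single long `%>%` chain loses.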
Info
Channel: Tomislav Hengl (OpenGeoHub Foundation)
Views: 470
Id: O9sJhafrxqM
Length: 114min 51sec (6891 seconds)
Published: Fri Sep 03 2021