Hengl, T., Bonannella, C.: "Spatiotemporal Ensemble ML in R" training session

Captions
Here are the slides for the last session of the day. In the previous sessions I spoke a bit about machine learning, its advantages and disadvantages, and I also showed you examples of how you can use just buffer distances to make spatial interpolations. There is one more technique, similar to the buffer distances: using the distances to the nearest observations. That is the method from the meteo package developed by Sekulić et al., and it is quite clever: for each point you compute the distance to the first neighbour, the second neighbour, and so on up to, say, the tenth neighbour. So here I specify neighbours up to the tenth, and you see the algorithm computes the distances to the first up to the tenth neighbour, and it also uses the values at those neighbours. What do those value layers look like? Something like Voronoi polygons: it basically splits the area into zones based on proximity and assigns the values.

Then you again do an overlay, you train a model, and you combine these things, and you end up with a model that looks like this: zinc is a function of the distance to the first neighbour, the second neighbour, and so on, and also a function of the values at the first neighbour, the second neighbour, etc. So there are two types of covariates: not just geographical distances but also the neighbouring values. When you fit the model you get about 0.46, a bit less than with the buffer distances, but you only have 20 layers instead of 155. When you produce the predictions you get something like this — a somewhat different pattern — and we can look at it in Google Earth; let me see if I have enough RAM... yes, so let's take a look.

We can compare the two predictions from the two different algorithms side by side. This is the one based on the nearest-neighbour distances, and it has a lot of pixels jumping up and down, which is maybe a bit of a strange pattern. Maybe it is because of the size of this data set, or maybe just because we used only 10 neighbours; we could use the distances to 20 points. But it is quite different from what we get with the buffer distances, which is much smoother. Which one is better? Difficult to say; you could do cross-validation. In the extrapolation part the two methods are very different — that is where the real differences kick in — but you could definitely use that method too. So that is something we already discussed.
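A minimal sketch of the "distances to and values of the nearest observations" idea, in the spirit of the method of Sekulić et al.; this is not the exact code from the session (which uses the meteo package), just an illustration on the meuse data using FNN and ranger, and all object names here are mine.

```r
# Sketch of the "distances + values of nearest observations" idea (cf. the
# RFSI method of Sekulic et al. / the meteo package); NOT the session code.
library(sp); library(FNN); library(ranger)
demo(meuse, ask = FALSE, echo = FALSE)   # loads 'meuse' points and 'meuse.grid'

xy.obs  <- coordinates(meuse)
xy.grid <- coordinates(meuse.grid)
k <- 10                                  # up to the 10th nearest observation

# distances to, and values of, the k nearest observations (training points;
# get.knn excludes each point itself from its own neighbours)
nn.obs <- FNN::get.knn(xy.obs, k = k)
tr <- data.frame(zinc = meuse$zinc,
                 nn.obs$nn.dist,
                 matrix(meuse$zinc[nn.obs$nn.index], ncol = k))
names(tr) <- c("zinc", paste0("dist", 1:k), paste0("val", 1:k))

# the same 20 covariates for every prediction location
nn.grid <- FNN::get.knnx(xy.obs, xy.grid, k = k)
pr <- data.frame(nn.grid$nn.dist,
                 matrix(meuse$zinc[nn.grid$nn.index], ncol = k))
names(pr) <- c(paste0("dist", 1:k), paste0("val", 1:k))

m <- ranger(zinc ~ ., data = tr)
meuse.grid$zinc.rfsi <- predict(m, pr)$predictions
```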
The other thing we spoke about is modelling seasonality. That is something we are going to actually implement now, so it is not just theory: I will show you that function in practice, and we compute it for every pixel. It is not complicated and not very computational, so we can compute it for the whole world, for many, many rasters.

So, in this session we will talk about ensemble machine learning. I see ensemble machine learning as a remedy that helps prevent overfitting and also helps with extrapolation problems. So there is a good reason to use it, but it increases the computational complexity, and if you are not prepared to write more code and do more computing, then it is not the direction for you. But if you feel like "OK, I will do more computing" — and as I told you, computers get exponentially more efficient, so code that looks very computational now might run in a few seconds in a couple of years, maybe not in two years but perhaps in ten — then it is worth it.

What we are going to look at now is ensemble machine learning with the mlr package. There is also the SuperLearner package, and there is a tutorial that explains how to use it; I am not going to run that as a demo, but just be aware that there are several ways you can do ensembles — not too many, let's say three or four frameworks, from H2O to SuperLearner to mlr and mlr3. So there are a couple of frameworks in which you can apply ensemble machine learning.

I will show you this with three data sets, and the last data set I will pass to my colleague Carmelo, who will show you real data, really big data. We will look at training points for the whole of Europe for Fagus sylvatica, and you can see it is a large data set even to visualize: opening it in QGIS takes some time, and it is even difficult to see which points are occurrences — there are zeros and ones, where the species occurs and where it does not. So it is a really real data set, and we will show it in practice. That is the plan for this third block: three case studies, step by step, with a bit of introduction first.

So why are ensembles important? First, you can potentially increase accuracy; that is the first reason people do ensembles. But it is not a lot — there is no magic in ensembles. If you do ensembling you will see a really gentle increase in accuracy; it is never "I doubled the accuracy", that would be very strange, because it would mean one of the learners is really polluting the model. If you have stable and reliable learners, there is just a gentle increase in accuracy. It also usually helps with extrapolation problems. There is much more training — you do maybe ten times more training than with a single model — and you can derive what is called a model-free prediction error, meaning prediction errors that do not depend on any single model. Also, in mlr you will see in some sections that you can combine data preparation, model selection, feature selection and fine-tuning all at once: you just say "I want an ensemble with all the other stuff done", which is also nice — you do not have to separate the steps, you integrate everything.

Yes, to the question: I will show you that there are different ways, different approaches to build the ensemble. We often use the algorithm called the super learner, which is based on stacking, and stacking means you create a meta-learner. Since you are asking, I can show it: it is a kind of two-level modelling.
In the first level you do k-fold cross-validation to fit all the base models separately — let's say there are four learners here, you see four learners. Then you take their predictions and do the second-level modelling, where you fit the so-called meta-learner, and this meta-learner basically gives you the final model as a function of the individual models. So it is a model of the models. This is a bit complex, but in principle I can show you — it has been worked out in this example, I just have to run it. I actually had to restart the machine, but that is not a problem, I just have to re-run up to this plot; well, actually this part was saved.

So here, when I fit this ensemble model — I changed the learners a bit — the final model I get is a function of regr.glm, gamboost and nnet. This is basically a weighted average, when you think about it, because when I use a linear model as the meta-learner, the final value it predicts is a weighted average of the base learners. So it is very simple: the coefficients are the weights, and if a learner gets a weight of 0.09 you could just remove it; there is no point keeping it, you only compute inefficiently. That means I would go and replace that learner, maybe change it. So I said OK, this one does not work, maybe cv.glmnet instead, and I re-run it and we see whether anything changes. Well, this one I actually cannot fit, because for cv.glmnet I need at least two covariates, so it does not work either; I will have to put something else. Let me put ranger back in and see if that changes anything. Every time you run this it uses a random subset for the training — unless you set the seed number — so every time you get slightly different coefficients; do not get confused by that.

So here I do get the model, and the model again says: random forest — no point using it, it just does not help; only gamboost and the GLM matter. I would still keep it, you know, but have in mind that the uncertainty will then be higher: when we compute and plot the uncertainty it will look something like this, the uncertainty goes up because I kind of forced random forest in, and the uncertainty of that random forest propagates into the ensemble uncertainty. I think that is probably more realistic than only picking the learners that match the data, because otherwise you could be overfitting. But officially, when you look at the individual models, there is no need to use ranger here: it has a very low coefficient, a very low weight, and the coefficient is not significant. Three stars means significant at the 0.001 level, so 0.1 percent; two stars at 0.01; one star at 0.05 — those are the different significance levels.

OK, so that is the stacked meta-learner. This super learner was worked out by the folks from California; I think one of them did a PhD on this topic, and when they applied it they proved it by repeating the comparison many times. These are many repetitions — all the dots are individual repetitions — and you can see that when they add the super learner, its accuracy is always a bit better.
Well, not always — I am wrong there: sometimes the super learner can be equally bad as the best learner, but officially it should never be significantly worse than the best learner; if it is, something is wrong. Just think about it: if you already have the best learner in the ensemble, you cannot degrade the predictions, provided you do everything properly. And if you do everything properly when you pick the subset of learners, there is a good chance that the ensemble — our super learner — will be slightly better.

OK, so this shows, for the meuse data set, the different models I used, with subsetting and five folds — each fold is the j, j is the fold index — and you can see the predictions can be quite different; there are quite some oscillations between them. It is the same inputs, just different models. And then this one is the weighted average through the meta-learner, so you can see it kind of takes the best out of the multiple learners. Then you can also estimate the model error, and you see the model errors are high where you have high values — they will always be proportional to the values — but the biggest errors are where we have this isolated point with one very high value. If I look at the predictions, there is one point with a very high value, quite isolated, and every time you have an isolated point in space, unless it is really correlated with some covariate, the error there is usually going to be higher. Why? Imagine you do five-fold cross-validation: in those five folds that point will maybe be taken out twice and stay in three times, and in the two folds where you predict without it the model says "oh, I predict much less", and that gives you the higher error. Otherwise, where you have high values you also get high errors, but that is just the logic of high values: with a skewed distribution the errors are proportionally bigger where the values are much higher.

So that is the super learner, and you see we use the mlr software to do the ensembling. If you look at the steps, they are really straightforward. First you define your target learners — I can turn this one off, or turn it off and put in another one — so these are my target learners; this just defines them. Then I define the task, and you see it is a regression task. If I type the task name it shows me the definition: it is a regression task, I have 100 points, no blocking, no coordinates, no missing values, and only numeric covariates, no ordered factors or anything — so it is a simple model. Once I have defined the task I can specify the stacked learner. If I run this line I again just get a definition — there is no computing yet — I defined a learner called a stacked learner, where I basically defined the meta-learner; it says it is a regression learner, it does not yet say what the meta-learner is, that is only in the definition, and I could also set some hyperparameters, etc. So this is just the definition, and then when I call train() — train() is like in caret, you know, all these packages use train() — that is when the computing actually happens.
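A minimal sketch of this define-learners / define-task / stacked-learner / train sequence in mlr, roughly as described above; the data frame df with target column zinc is a placeholder, and the choice of base learners is just an example.

```r
# Sketch of SuperLearner-style stacking in mlr; 'df' (target 'zinc' plus
# numeric covariates) is a placeholder data frame.
library(mlr)

lrns <- lapply(c("regr.ranger", "regr.glm", "regr.gamboost"), makeLearner)

tsk <- makeRegrTask(data = df, target = "zinc")

# meta-learner = linear model fitted on out-of-fold base-learner predictions
stk <- makeStackedLearner(base.learners = lrns,
                          super.learner = "regr.lm",
                          method = "stack.cv")

m <- train(stk, tsk)

# the meta-learner is basically a weighted average of the base learners:
summary(m$learner.model$super.model$learner.model)
```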
If you look at it, it can be a bit computationally intensive. Let me see — at the moment it is a very small data set, so right now it is basically nothing, but later on it can get computationally intensive, sometimes you have to wait five or six seconds. To run the whole R markdown I put in a couple of data sets, because I want to give you an idea about spatiotemporal regression, so I use a relatively larger one, and it takes about ten seconds to train, depending on how many threads you have. It is good that you get the idea that this is quite computational. So that is mlr; I am in contact with the people from mlr, they are very helpful.

Yes — someone points out there are 59 learners available. If I take all 59, computationally I just waste resources. Of those 59, I think about 50 are legacy models and about five or six are interesting because they are cutting edge. If you have time, fine; if you do not have time, then, as the mlr developers themselves suggest — and it is great to have one of them here, lupus in fabula — if you pick, say, 100 multivariate data sets in R and run all 59 learners, in about 98 of them you will always get the same result, so there is no need to rerun all those legacy learners. But I would keep some GLM, you know, just to protect against overfitting: I showed you with a synthetic example that if you add one more simple learner like a GLM or cv.glmnet, you protect against overfitting when there is potentially a simple, linear structure — in that case the other learners can do damage. And some learners are very computational: support vector machines scale exponentially with the number of covariates, and at some point you say "I cannot use this", it becomes so computational that you can only use it for small data sets. Some we do not use simply based on experience, because they are too computational. The ones I gave you in the script are the ones we use in many, many projects: random forest, GLM or cv.glmnet, gamboost and support vector machines. You could add a few others, and the new mlr3 will have more models, but otherwise there is no need to run all 59.
This is the makeStackedLearner definition. In this definition you can do different things — let me open the help. You see you can use different methods for stacking. I usually use stack.cv, but there are others; it is computational, and if you want to speed things up I think hill.climb is faster. When you do stack.cv you also choose a resampling method, usually five-fold cross-validation for the refitting, and you can also pass a blocking parameter, which is very important — I will show you that in the example. It is important to block to avoid overfitting, especially with space-time data where you have stations; with station data blocking really matters. I had to learn this the hard way, by overfitting: we were looking at a great R-squared and had almost submitted the paper, and then we took whole stations out and the R-squared was about 40 percent, and we said "oh, what happened here?" Well, we had overfitted. So that is another important thing.

There is also the SuperLearner package, but as Patrick also told me it is not really maintained any more. It is really well worked out, really great text, and I have used it for some data sets; it works very similarly to mlr, it also does stacking, and it has a separate cross-validation function just to assess accuracy without fitting a final model. But it is not so well maintained — it was made by just one or two people.

And then I made the landmap package. This landmap package is a deus ex machina: it wraps up my knowledge so you can put everything in one line — "here are my points, here are my rasters, make an ensemble" — ensemble machine learning, feature selection and everything, it goes through all the steps in one line. You can make predictions, you can predict classes or numeric values, you get the uncertainty, everything, and you can plot it directly. So that is the deus ex machina — but be careful: if you have wrong assumptions, or a data set that is badly clustered, it cannot help you; the methods to deal with clustered data are not implemented in landmap at the moment, it only does spatial cross-validation.
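The landmap "one-liner" mentioned above could look roughly like the following; this is a sketch written from memory of the package README (using the meuse data loaded in the earlier sketch), so treat the exact arguments as an approximation rather than the session's code.

```r
# Sketch of the landmap one-liner ensemble (roughly as in the package README).
library(landmap)

m <- train.spLearner(meuse["zinc"],
                     covariates = meuse.grid[, c("dist", "ffreq")],
                     lambda = 1)

zinc.pred <- predict(m)   # predictions plus model (prediction) error
plot(raster::raster(zinc.pred$pred["response"]))
```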
So those are the reasons to do ensembling — I have more or less said them already. One more thing: if you put the term into Open Knowledge Maps you can see what people publish on ensemble machine learning. You enter a term and it shows you the publications, clustered — it does automatic clustering of related publications — and you can also see the subfields. It is a nice way to organize the publications.

OK, now we go to the real data. That was a lot of introduction, maybe a long introduction, but I needed it, because now I am not going to explain the seasonality modelling and so on again — I go straight to the modelling. So: space-time machine learning. This is the data set we published in 2012; as I said, back then we did space-time regression-kriging, or universal kriging, and we got the predictions I showed you. The space-time data look like this: these are three actual stations with daily measurements that go up and down, a kind of band of oscillations between days, and depending on whether you are in the mountains, on the plains or next to the coastline you get different amplitudes, and the peaks also happen at different times. The black line is just a spline fitted to the data, nothing special.

Let me look at this data set in R, if you can still follow me. So here is the data set, and I load it — it is on my disk, so I load it locally, it is not in a package yet. This is a typical meteorological station data set: you have the metadata with the station ID, then the date and the mean daily temperature, MDTEMP, and then cday, the cumulative day counted from 1970. So that is a typical data set, with almost 45,000 temperature measurements, and then we have another data frame with the station locations: the ID and x and y. That is what you start with; these are the training points.

Now we load the covariates. We have two sets: static covariates and dynamic covariates. For example, elevation and distance to the sea are static, they do not change through time. Here is the map of elevations, nothing special, just the elevation for the whole block — this is Croatia, in the Balkans. You could plot something like this for the Netherlands too, but you would have to decrease the range of the colour scale, of course, and you lose the drastic colours. So those are the static covariates, and for each pixel I also have latitude and longitude — the map is actually in projected coordinates, but I add the latitude and longitude for every pixel as well. I could plot that too: the latitude, for example, is slightly curved, you can see it on the image, because it reflects the real geometry of the Earth. Those are also static.

Then we have the dynamic covariates, and there is a lot of data there that you can play with. In a folder we have, let me see, 46 images; they are under the workshop R training input folder, and that folder holds all these GeoTIFFs. In our real projects it looks just like this, except there will be 10 terabytes of it. So there are lots of GeoTIFFs, and I do not have to read all of that data into R — that would be inefficient; with larger data you run out of RAM and you are finished.

What I do first is overlay the static data — I just do a simple overlay — so now for the 152 stations I know the values of the static covariates. Then I do a space-time overlay, and this is where, as I said, we are going to use the terra package, because it is very fast. I load a little function I made by sourcing the code, I copy the x-y coordinates into the data set, I also add a row ID so I do not make a mistake, and I convert the raw date to the Date format. Now I have to do one more thing: I have to strip the dates from the file names. If I look at the list of GeoTIFFs — the LST daytime list — you can see that the file name contains the date, 2006-01-01, and I now have to strip that date, because I know this is an 8-day data set, the 8-day MODIS LST product. We can also look at it in QGIS.
Now I strip the date from the name, and for the MODIS time reference I take four days before as the start and add four days after as the end, so that each image covers its eight days. That is where I strip the dates and convert them to a list of dates. If I open the first date — I already prepared it so we do not lose time — here is the MODIS image, the original one, what you download from the MODIS website, and you see there are quite some problems: there are missing pixels, and there can be artifacts because of clouds, like here. So it is not a perfect data set, but it is one of the best data sets for modelling daily temperature. It has problems, but we want to use it. As I said, we have 46 images in the time series for the one year, 2006.

Now we have created the begin and end times, so let's take a look: if I take the first five, we see the dates. The first one starts on the 28th of December 2005, so it actually reaches back into the previous year, and the end time is the same thing with the four days added, so that would be the 5th of January. So I defined, as I was sketching, the begin and end time for every GeoTIFF, and now I do the overlay, and I do it in parallel. As I said, this function is generic, you can use it with any data set. Here I do the parallelization and I say: run separately for each GeoTIFF. If I had 46 GeoTIFFs and 46 threads it would run them all at the same time — you can do the overlay simultaneously and it speeds things up drastically — but I do not have that many, we have six threads, so I run it in parallel on six. Let me run the whole thing from here and open the monitor... you see it is running in parallel. It went fast — it is not that many points — it takes literally two or three seconds, and this way I did the space-time overlay.

What you get from a space-time overlay are values in the long format — do you know the long versus wide format? Long format means that for each space-time row ID I get one value of the temperature. If I had done just a spatial overlay I would get 46 columns, one per image, and would have to rename them and so on; instead I get one long table, with exactly as many rows as the space-time matrix, 44,895 values. And you see there are zeros here, and these zeros are not good, because they are actually missing values — unfortunately "no data" is coded as zero here — so we have to be careful not to use the zeros as real values; they are not actually zeros, they are missing values.

Then I repeat the same thing for the night-time images, again in parallel, and then I merge. Now we have two data sets and we can join everything with dplyr: we join the meteorological station data with the results of the overlay, then we just add the static data, and off we go — we now have a space-time matrix. But we also want to add the geometric temperature — remember, we computed it — so we add an extra column, the daily geometric temperature. It is computed from the latitude (we have a latitude column), the elevation, and the day of the year. So now we have everything: a complete space-time regression matrix.
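The space-time overlay step could be sketched roughly as below; this is not the exact helper function sourced in the tutorial. The file-name pattern, folder, the pts data frame (with columns row.id, date, x, y in the raster CRS) and the number of cores are all assumptions.

```r
# Sketch of a space-time overlay with terra, run in parallel over GeoTIFFs.
library(terra); library(parallel)

tifs <- list.files("input/LST_day", pattern = ".tif$", full.names = TRUE)
# e.g. "LST_day_2006_01_01.tif" -> centre date of the 8-day composite
centre <- as.Date(gsub(".*_(\\d{4})_(\\d{2})_(\\d{2}).tif$", "\\1-\\2-\\3",
                       basename(tifs)))
begin <- centre - 4
end   <- centre + 4

ov.one <- function(i) {
  sel <- which(pts$date >= begin[i] & pts$date <= end[i])  # points in this window
  if (length(sel) == 0) return(NULL)
  r <- terra::rast(tifs[i])                                # open inside the worker
  v <- terra::extract(r, terra::vect(pts[sel, ], geom = c("x", "y"), crs = crs(r)))
  data.frame(row.id = pts$row.id[sel], LST.day = v[, 2])
}

# long-format result: one row per point-date, joined back to the matrix by row.id
ov <- do.call(rbind, parallel::mclapply(seq_along(tifs), ov.one, mc.cores = 6))
```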
Now we say: mean daily temperature is a function of the geometric temperature, the MODIS LST daytime and night-time, and the distance to the sea. The distance to the sea is the same every day, it is a static covariate, so we have a mixture of a geometric covariate and remote-sensing-based dynamic covariates. Then I do basically the same as before — the code hardly changes — except I now run ranger in parallel and I want to get the variable importance, so I modify the learner a bit and say: make a ranger that also does these extra things.

So we define the task, and then we can fit the model — but because we have station data, we have to use blocking. We block by station: we set this blocking parameter, you see it says blocking = TRUE, so whole stations are taken either out of the modelling or kept in. We do not split the rows randomly within stations; we take whole stations out, and that is very important to avoid overfitting. As I said, I burned my fingers on this more than once, and we even published a paper on it back in 2014. In that paper we showed that if you take whole stations out, versus leaving the stations in, the results are really different. This was the modelling with random subsetting — imagine, this is soil water content, the same model, everything the same — except here I subset randomly and here I take whole stations out, and it is a huge difference.

So how come the random forest fit so perfectly, with an R-squared of something like 0.96? I am not sure, but I think it is because when you keep rows from the same locations, the values get locked in with that combination of features, and the model then always knows what the values are; it gets extra help in predicting. That is not realistic, because it only tells you the accuracy of that random forest at the stations, while what we want to know is the accuracy anywhere in space — that is the real map accuracy: if you go anywhere in this area, how accurately can you predict soil moisture? The random-subset number only tells us the accuracy at the stations, and nobody is interested in that. When we computed it properly there was a really huge difference, and with whole stations taken out the random forest cross-validation was similar to kriging — similar accuracy, not much better or worse; for soil temperature it was a bit better than kriging, for soil moisture (water content) it was about the same. So that is also a very interesting visual explanation of overfitting.

That is what we do here with the blocking, and now we can go and train this model. It is a bit computational, so I use parallelization, which by default fully parallelizes — you see it is now using all the cores; these are what I call the happy green bars. By the way, when you do programming in R, more and more, good programming means you parallelize every step you can. So here are the green bars, and this is now a model on a somewhat bigger data set.
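A minimal sketch of what the blocking-by-station setup looks like in mlr; the regression matrix rm.st and its column names are placeholders standing in for the space-time matrix built above.

```r
# Sketch of blocking by station ID in mlr: whole stations are kept together
# in each fold; 'rm.st' (one row per station-day) is a placeholder.
library(mlr)

tsk <- makeRegrTask(data = rm.st[, c("MDTEMP", "temp.geo", "LST.day",
                                     "LST.night", "dist.sea")],
                    target = "MDTEMP",
                    blocking = as.factor(rm.st$station.id))

stk <- makeStackedLearner(lapply(c("regr.ranger", "regr.gamboost", "regr.glm"),
                                 makeLearner),
                          super.learner = "regr.lm",
                          method = "stack.cv")

m <- train(stk, tsk)
```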
The training took maybe five seconds, and that is with full parallelization. Now we have the model, and we see that it looks good: the neural net we could exclude — it has a very low coefficient, so we could throw it out — but the other two, gamboost and random forest, help explain an R-squared of up to 0.87, and that is a good result. The residual error is still relatively high, though, about 2.9, so roughly two-thirds of the points — one standard deviation — fall within about minus 1.8 to plus 1.8 degrees. Those are the errors we get for the daily temperatures anywhere in the area; that is the mapping accuracy with blocking — we assume it holds anywhere in the space we map, so it is the actual mapping accuracy.

Then we can make predictions, but let's do the variable importance first. The variable importance code is a bit longer — I do not have time to go through it, I wish it were a single line — but you can see that the LST night-time, interestingly, looks about three times more important than the daytime images for the daily temperature; I think there is more usable variation in the night-time images than in the daytime ones.

Now for the predictions: I want to predict for four days in August, so I have to subset — I do not read the whole year into R, I just read the images around that part of August. Then I have to add a few things: because I have the 8-day stack, I need to interpolate values between the 8-day composites in the MODIS space-time cube, and there is a nice function I made for you that you can use universally; it also runs in parallel, so let me run that. It takes the whole series and, because I only have an image every eight days, it interpolates the MODIS values for every pixel, so you get gradual changes. This time I just use the approx function, basically linear interpolation; I could use a spline or something, but then it gets more computational, and you do not need anything fancy — this is really just to fill in the gaps. Look at it this way: if I have gaps — say three missing values out of 45 images — then doing this approx fills in all the missing values. It is a very elegant way, and it is fully scalable; you can apply it to any type of data. It is super simple gap-filling: it fills the gaps based on the temporal neighbours only, not the spatial neighbours — you could of course combine it with spatial neighbours.

So what is the output? I will show you: we started with, I think, four dates, and now for every day from the 14th to the 21st we know the MODIS temperature, and if you plot it you see it is basically a linear model between the points. If you look at x1 and x2 — this is x1, what we started with, the 5th of August, the 13th of August, the 21st — we started with the 8-day composites and we finish with consistent MODIS values, gap-filled, with no missing values.
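The idea behind that interpolation/gap-filling helper could be sketched as below; this is not the exact function from the tutorial. It re-uses tifs and centre from the overlay sketch, and the target dates are an assumption.

```r
# Sketch of temporal interpolation / gap-filling: per pixel, linearly
# interpolate the 8-day composites to daily values with stats::approx.
library(terra)

lst8   <- rast(tifs)        # stack of 8-day composites (see overlay sketch)
dates8 <- centre            # their centre dates
days   <- seq(as.Date("2006-08-05"), as.Date("2006-08-21"), by = 1)

interp.px <- function(v) {
  ok <- !is.na(v)
  if (sum(ok) < 2) return(rep(NA_real_, length(days)))
  # rule = 2 extends the first/last available value to the edges; missing
  # composites are skipped, so gaps get filled from their temporal neighbours
  approx(x = as.numeric(dates8)[ok], y = v[ok],
         xout = as.numeric(days), rule = 2)$y
}

lst.daily <- app(lst8, interp.px)      # one output layer per day
names(lst.daily) <- as.character(days)
```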
OK, so now we prepare the prediction data, and now we can go and make predictions. How do we do the predictions? We have to copy the data — I just copy, because I do not want to keep everything in memory: basically I subset the new data, create the newdata, predict, then throw it away, then do another day, day by day. So I have a prediction machine, but it does not pile things up in RAM; it basically just writes a GeoTIFF, which pops up in the output folder. Here are the predictions — I already made them, so we do not have to remake them — and now we can look at them. This, as I said, is the original MODIS with missing values and artifacts, and here are the predictions: no missing values (except where the values are missing in all images), and the artifacts are minimized. You can also see the variation between days, how the values change from day to day — this is the result of space-time machine learning for predictive mapping.

I can plot it right away. I put in three maps — I could plot, say, eight maps, but I do not recommend it because it takes time; even plotting three maps takes four or five seconds. And you get this. This one does have some missing values, which is possible because for the gap-filling I only used four images — I should have used maybe two months of data and then filtered out all the missing values — so there are a few more missing values than I expected. But otherwise it looks fine: you can see that some days are warmer, some days a bit cooler, the model adjusts locally, and you make the predictions. So that was the daily meteo data set; I think it is a very nice example.
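The day-by-day "prediction machine" described above could be sketched roughly like this; m is the trained ensemble, covs.static the static covariate stack, and lst.daily the interpolated stack from the previous sketch — the covariate names and output paths are placeholders.

```r
# Sketch of the day-by-day prediction machine: build one day's newdata,
# predict, write a GeoTIFF, discard, continue — almost nothing stays in RAM.
library(terra)

for (d in as.character(days)) {
  covs.d <- c(covs.static, lst.daily[[d]])            # static + that day's LST
  names(covs.d)[nlyr(covs.d)] <- "LST.day"
  newdata <- as.data.frame(covs.d, xy = TRUE, na.rm = TRUE)
  # ...add the remaining dynamic columns here (night-time LST, geometric temperature)
  pr  <- predict(m, newdata = newdata)$data$response
  out <- rast(covs.static, nlyrs = 1)                 # empty raster, same geometry
  out[cellFromXY(out, as.matrix(newdata[, c("x", "y")]))] <- pr
  writeRaster(out, paste0("output/mdtemp_", d, ".tif"), overwrite = TRUE)
  rm(covs.d, newdata, pr, out); gc()
}
```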
Then there is the Cook farm data set. It is in the landmap package and you can read about it if you want. This is space-time plus 3D, so actually four dimensions, and I have the covariates and everything prepared. The model formula is a bit longer: I also have the soil depth — how deep you go in the soil — plus some static covariates and these dynamic covariates, the maximum and minimum temperature. So I have static and dynamic covariates, and if I fit just a random forest, what happens is that I overfit. That is exactly what I am doing now, and it is exactly what I did in that paper: I get an R-squared of almost 0.9 if I do no blocking, no ensemble, just a random forest on the data as-is. But then I repeat it with the blocking — now there is a blocking parameter — and the training takes a bit more time. You can see what it does: this is the first learner, and it has ten elements, meaning it fits ten models — five-fold, repeated twice, so ten models just for the random forest — then ten for the support vector machine, and so on; that takes time. Now the random forest is done, then it goes to the next learner, also running in parallel, then the last learner, and then it finally fits the meta-learner, and we have the ensemble machine learning model, now based on blocking as well.

And look at this: R-squared 0.4. We go from 0.9 to 0.4 just because I put in the blocking, so obviously that is very important, because otherwise you get the difference between this and this.

There is a question about that: the idea of random forest is that the trees themselves split the data, so do you need to split anything yourself? The answer is no, you do not do any splitting yourself: in this case I give it a numeric target and numeric covariates and it fits many regression trees; when they do their splitting they are optimized to minimize the RMSE, and then it makes an ensemble. Random forest itself is also an ensemble, by the way, but it is an ensemble built on samples from one model. The follow-up is whether it uses only part of the observations, or whether it has to take everything measured at that time — yes, it will take everything; but with the blocking I make sure whole stations are kept together. Maybe the figure in the article makes it clearer: these are the stations — the triangles, not the dots — and they go through depth, they have four depths. When you do the ensemble learning, you take whole stations out, so you take maybe a thousand measurements out, you model, you predict at that station, and then you get the R-squared of 0.4 — because you took whole stations out, you do not overfit. Could the true value be higher? I do not know; I would probably say no, I think 0.4 is the realistic R-squared. The only way to know would be to take a probability sample, go and make the same measurements through space-time, and then use that probability sample to validate how good the map really is — but most likely it would be around this number.

Then we make the predictions: again I prepare the new data, I call it new.st, I predict, and off we go — here is the map. I predicted only one depth and plotted it like this, so you see the station data and the predicted values, and you can see the hot spots, where the soil is a bit drier, and so on. I could also plot this in Google Earth — I can write a KML. Any more questions? Yes, please.

The question is whether this goes beyond boosting and bagging, and whether "ensemble" is just the general name. Ensemble is indeed the general name; the difference lies in the algorithm used to make the final predictions. If you look at what I showed for the stacked learner, there are different options. You can do a simple average — that is the computationally cheapest, but it assumes all learners have the same weight, which we saw is not the case, so it can do damage. You can do stacking without cross-validation, or stacking with cross-validation, which is the most computational: if you do it five-fold with repetition it means you fit about ten models per learner, so roughly ten times more models, way more training. Then there is hill.climb, and there is compress, which uses a neural net to compress the model from, I don't know, four learners down to one.
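Going back to the 0.9-versus-0.4 point above, a minimal sketch of the effect of blocking on cross-validation scores, using a plain random forest and comparing random folds with folds that hold out whole stations; rm.st and its columns are the same placeholders as before.

```r
# Sketch: random 5-fold CV vs. leave-whole-stations-out CV with plain ranger.
library(ranger)

cv.r2 <- function(df, fold) {
  obs <- pred <- rep(NA_real_, nrow(df))
  for (k in unique(fold)) {
    m <- ranger(MDTEMP ~ ., data = df[fold != k, ])
    pred[fold == k] <- predict(m, df[fold == k, ])$predictions
    obs[fold == k]  <- df$MDTEMP[fold == k]
  }
  1 - sum((obs - pred)^2) / sum((obs - mean(obs))^2)
}

df <- rm.st[, c("MDTEMP", "temp.geo", "LST.day", "LST.night", "dist.sea")]

set.seed(1)
f.random  <- sample(1:5, nrow(df), replace = TRUE)              # random rows
f.station <- as.integer(as.factor(rm.st$station.id)) %% 5 + 1   # whole stations

cv.r2(df, f.random)    # optimistic: rows of the same station in train and test
cv.r2(df, f.station)   # realistic "map accuracy": whole stations held out
```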
So those are the options in mlr; H2O has something else. I almost exclusively use stack.cv, because I want to do blocking, and blocking is very important.

Let me plot this — oh, I plotted it already, I just have to open it. Did somebody figure out how to reset Google Earth? Here is Windows 10; I am switching to my Windows machine, and let's take a look at the data set we just computed. So here is the soil moisture: red is higher soil moisture and blue is lower. I think it is connected with the soil texture, but you can see that the soil moisture clearly depends on the location: here there is a little valley, you can see it, and there we have somewhat higher values, an accumulation of soil moisture, and it is also higher over here. So it is interesting, and now I could go and predict — imagine how many maps I could make: 365 days times four depths, just for one year, just for one variable, and I have three variables, and if I want to map them all, plus the uncertainty, I can make ten thousand maps.

With that I am done with the first two case studies, and I will pass it on to my colleague Carmelo, who is going to show you the third example, from our European project. These were more like school examples; this is a real example. So let me pass you the microphone — it should work, let's see. OK, perfect.

What we are going to see here is a case of species distribution modelling. For those who are not familiar with it, I will give a gentle introduction to the problem with the slides. We will talk about correlative species distribution models, which basically try to associate values of predictor variables, in both environmental and geographical space, with the locations where the species was actually observed. We then use different models to estimate the similarity between these sites and produce two kinds of results: either, as you can see in this slide, maps that show a zero/one distribution — areas where the species could occur and areas where it does not occur — or probability maps, a gradient from 0 to 100 of where the species could possibly be present.

Based on this we simply say that the probability of occurrence, or the zero/one occurrence, of a species is a function of different predictor variables, which in our case are climate, terrain, spectral reflectance, and also the competition between species — inter-specific competition, which is usually not included in species distribution models because it is very difficult to assess. For these four groups of variables we use temperature, precipitation, snow cover and similar environmental factors, which we always gather from remote sensing data; the DTM, slope, morphology and so on for terrain; and for spectral reflectance we use Landsat data, because we are modelling a time period that goes from 2000 to 2020, and data sets such as Sentinel do not reach that far back in time.
The competition covariates are species distribution maps for species other than the target species we are mapping. We are focused on forest tree species here, and there is a big set of distribution maps from the JRC, the European Forest Atlas, and we use those maps to model the presence of other species in the same environment as the target species we are trying to model.

Based on that you might say that modelling is easy: we just take the data, plug it into a model and get a response. But as Tom already mentioned today, we first have to understand the data, and that is not so easy. In species distribution modelling, for example, we have to check whether the input is a presence-only data set — a data set whose rows only record where the target species was observed — or a more hybrid data set with both zeros, where the species was not observed, and ones, where the species was actually located. In that case we also have to check, for example, whether there are missing values in the covariates we want to use, whether distributions are very skewed, and whether the data set is very unbalanced between presence and absence data, because that would influence the predictions of the models.

We also have to understand the problem: are we trying to model the suitability of a certain area, or the realized niche where the species is actually located? That is different from assessing where the species could potentially be present. In Europe we have existing distribution maps, but they are not very precise, they have a low resolution, and they do not record very high probabilities for the species, because this kind of thing is very difficult to model. There are different European or pan-European data sets available that record either just the presence or also the absence of certain species; for example, a document published by the JRC in 2020 gives an overview both of the data sets that can be used to model forest tree species and of the state-of-the-art products already available showing where the species we are trying to model are located, produced with the same kind of methods I am showing here.

The data sets we are using are the national forest inventory data and the GBIF data set, which are both quite dense in terms of point density and also cover a long time span that matches the period we are trying to cover with our models. The other data sets, even if they are useful and important, are either ancillary or have a very coarse resolution: the sampling used to gather the occurrences can be as coarse as one point every 20 kilometres, so it is very difficult to argue that such data can be used to produce very high resolution maps at 120 metres or 30 metres.

As the basis for our models we are therefore using the harmonized tree species data set that was produced by Johannes Hasik during the summer school organized by OpenGeoHub in 2020. There is a GitLab repository where you can download the code used to produce this data set, which basically harmonizes the European forest inventory data, the GBIF data, and also points that come from the LUCAS project.
LUCAS is a project based mainly on gathering ground observations for land cover across all of Europe, and among those points we can also find points related to trees. How many points in total? It is 3.8 million tree points — and this is just the raw data set with all the observations. You can find different kinds of metadata inside the data set: for example, apart from the classical metadata like the coordinates of the point and the species, you can see from which data set it was downloaded — in this case the point comes from the GBIF data set — and also the location accuracy of the point. Points that come from the LUCAS data set usually have an accuracy of zero metres, because you actually have a picture of the tree that was sampled or surveyed, whereas for GBIF you can have points with a very high location uncertainty, even five kilometres, so it is very difficult to model high-resolution maps with that kind of data. That is one of the main points of my talk: you have to clean your data set and check whether the points you are using can actually be used — GBIF is also global, it does not just cover Europe, and it is one of the main data sets usually used for species distribution modelling.

We can now have a look at how the points are distributed in space: you can see how many points we have per country and where they are located. This is a kernel density map produced on a 10-kilometre grid, and you can see that, for example, Spain has a lot of points but they are clustered in this part, which is mostly plantations of olive trees and cork oaks; Belgium is one of the highest-density countries, where you have a tree point basically every kilometre, regularly sampled; and then you have problems in the eastern European countries, where there are very few occurrences and whole countries are not even completely covered.

So how can we try to remove, or at least handle, this bias in the data set? What we did is use the land cover classification maps we produced for Europe for 20 years: we created forest masks and overlaid all the points — each point is sampled in a specific year, so it has a time stamp — and we extracted the probabilities of the land cover classes for the whole 20 years. We then kept only the points that fell on coniferous forest, broadleaf forest or other kinds of woody vegetation, so that for the whole period we cover, 2000 to 2020, we are sure these points were forest, or at least trees.
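That land-cover filtering step could look roughly like the sketch below (not the production code): extract the land cover class at each point for each year and keep only the points that fall in a forest class in every year. The file names, CRS and class codes are assumptions.

```r
# Sketch of the land-cover filtering: keep only points that are forest in
# every year 2000-2020; file names and forest class codes are assumptions.
library(terra)

pts.v <- vect(pts, geom = c("x", "y"), crs = "EPSG:3035")
years <- 2000:2020
forest.classes <- c(23, 25)      # e.g. broadleaf / coniferous forest codes

is.forest <- sapply(years, function(y) {
  lc <- rast(paste0("landcover/lc_", y, ".tif"))
  terra::extract(lc, pts.v)[, 2] %in% forest.classes
})

pts.clean <- pts[rowSums(is.forest) == length(years), ]
```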
After that we cleaned the whole data set, and this is what we ended up with: a smaller data set with almost the same number of points for every species, plus absence data that comes from the LUCAS data set I mentioned. The only metadata you have is the year of the point, whether the point comes from a direct observation or from the procedure we were just talking about, the species we are targeting, and the tile ID. For the tile ID we divided the whole of Europe into a grid of 30-kilometre tiles — around 7,000 or more tiles containing points — and this is the column we will use for blocking during the ensemble machine learning modelling.

So here we do not have stations, but we still want to block, because we have this clustering problem. We picked 30 kilometres somewhat arbitrarily: if you take a larger distance you degrade the model, but if you take too small a distance you might overfit because of the clustering — that is why you use this kind of spatial block.

These are the covariates we will use in the data set. They are already filtered by a feature selection process, so there are just 180 of them instead of all the 600 you can see here, and they are divided into the four groups I mentioned: climate covariates, terrain covariates, reflectance covariates, and competition, which comes from existing suitability maps for the other species. To give an idea of what we use: for climate we have the long-term average temperature from the BioClim data set, produced from meteorological station observations going back to around the 1970s — a long-term average computed from the 1970s to 2013 — so we get an estimate of the average climatic conditions for the species. We also have time series of different climatic variables, such as air temperature, precipitation and surface temperature, computed by the Copernicus project from the ERA5 data set: one-kilometre daily observations of different variables that we aggregated by month, and for each monthly aggregate we took different parameters — the monthly average, the sum for precipitation, and the sum over the previous five years. So we have a long-term signal from BioClim, a short-term signal based on a five-year window, and also monthly values; all of these data sets were resampled by us and are available in the Open Data Science viewer that Tom already showed.

From terrain we use, for example, a proxy for continentality based on the distance to the coastline, and different derivatives of a high-resolution DTM for the whole of Europe, available at different resolutions on the data portal. And then there is the reflectance from Landsat. This is different from the Landsat data you can download from Earth Explorer or find on Google Earth Engine, because it has been reprocessed by the University of Maryland to be consistent across the different Landsat satellites we normally use, Landsat 4, 5, 7 and 8. What we did was download the whole data set and create four composites per year, one for each season — winter, spring, summer and fall. From those we removed clouds and cloud shadows using the quality data the University of Maryland already provides, and then for each composite we computed three quantiles to describe the distribution of the pixels; then we mosaicked everything and used a temporal moving window to gap-fill the data. That is what we get for the different quantiles of the pixel distribution for each year and each season: over a longer period you can extract the distribution, and if you take the 25 percent quantile you get the low reflectance, then the median, and then the high reflectance, and every season has its own. So basically from one variable, the Landsat reflectance, we derive three variables, because we aggregate through quantiles — and we do it that way so we can fill in the gaps and get something smoother.
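A sketch of what the per-pixel quantile compositing could look like for one band and one season; the file list and output names are assumptions, and in practice this would be tiled and parallelized.

```r
# Sketch of per-pixel quantile composites (25th, 50th, 75th percentile) over
# all scenes of one band in one season; file paths are assumptions.
library(terra)

scenes <- rast(list.files("landsat/red_fall_2019", pattern = ".tif$",
                          full.names = TRUE))

q3 <- function(v) {
  if (all(is.na(v))) return(c(NA_real_, NA_real_, NA_real_))
  quantile(v, probs = c(0.25, 0.50, 0.75), na.rm = TRUE)
}

q <- app(scenes, q3)
names(q) <- c("red_fall_p25", "red_fall_p50", "red_fall_p75")
writeRaster(q, "composites/red_fall_2019_p25_p50_p75.tif", overwrite = TRUE)
```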
You can see, for example, that in this mosaic from fall 2019, even after aggregating all the scenes for the RGB bands over Europe, there are a lot of gaps, and we gap-fill them based on a temporal moving window when we compute these distributions. So at the end of the day, for one pixel that was surveyed in, say, 2006, we have about 91 covariates coming from the different Landsat bands times the quantiles and the seasons; but since we have points observed across 20 years, in total we have almost 2,000 reflectance covariates for the whole time series.

Now that you have an idea of how the data set is built, we will skip the hyperparameter optimization and the feature selection for the sake of time and go to the model. Someone asks about the data volume — Europe at 30 metres, is that about two gigabytes? Yes, something like that, per band, and we have about 1,800 of them, each around two gigabytes, so you get the idea: it is a lot of data.

So let's go back. What we are facing here, as I mentioned, is that a zero/one outcome with a probability is no longer a regression problem but a classification problem. If you want to follow along in the R markdown, this is the distribution of the points we are using, with the occurrence data, and you can also look at them in QGIS — it takes a while to render all of them. The green points are the absence data that come from the LUCAS points, while the blue points are where this species was actually recorded. You can also check visually whether the species is really there by adding the Street View coverage from Google and clicking on one of the presence points to look at the street view and see whether the species that was surveyed is the one we are looking for — let's see, this one is very close to the street. There is also a plug-in integrated in QGIS that gives you a small window on the left where you can see the trees, so you jump to a point and validate it: OK, this species is the one we were looking for. Of course we did not check all the points. And as Tom told you, machine learning is a lot of data cleaning and a lot of validation and checking. Where does a point come from? It can come from LUCAS, from GBIF, or from a national forest inventory — we also have the plots from the forest inventories of the different countries in Europe — so you can always check in the original data set which source a point comes from and when. And of course, since trees do not move, it is relatively easy to do this kind of check.

So let's go back to the markdown. We have seen where the trees are located; now let's see what the data look like and how many covariates we have — as I said, we already filtered the covariates and we have around 180 for this model. Here in the markdown you can see how the data set looks; these are the same fields you could see in QGIS. And then this is the function that Tom prepared that fits a spatiotemporal ensemble machine learning model. In this case we are using random forest — ranger is the name of the parallelized implementation of the algorithm in R — and XGBoost,
Then we come to the function that Tom prepared, which fits a spatiotemporal ensemble machine learning model. In this case we are using random forest — ranger is the parallelized implementation of the algorithm in R — and XGBoost (extreme gradient boosting), so we already combine bagging via random forest with boosting via XGBoost, and we also fit a generalized linear model. Training the model is very fast, because the dataset is already clean and not very big, and we also get the feature importance. Here the image was not loaded yet, so let's run this; reading the data is also parallelized, because data.table reads everything in parallel. These are the fields I was talking about, this plots all the points, and there we are — you can open the progress bar, but I think it is already done, so it was really fast.

Looking at the details, it is more than seven hundred thousand points for this species. For every species we have a different number of occurrences, so the imbalance in the dataset is always different, and the distribution through time also varies — in some years we have more data, so there are peaks in time, which makes things more difficult.

Here you can see the meta-learner. As a meta-learner in this model we use logistic regression, because we want to constrain all the predictions coming from the base models to an interval that goes from zero to one. If we only wanted a binary classification, the default threshold would be 0.5 — every prediction below 0.5 is classified as zero and everything above as one — but that is not what we are trying to do here: we want probabilities, so we keep the values, scaled from 0 to 100. As you can see from the output of the model, XGBoost is actually the model that is least useful for this task. The problem is that XGBoost is an algorithm that is very sensitive to fine-tuning, and we skipped the fine-tuning here; what we found during the training of our models is that if you spend time fine-tuning this algorithm, it actually turns out to be quite useful.

From the slides, this is an example of what we get when predicting for one tile with the three algorithms used as base learners. As I said, XGBoost does not turn out to be very useful — you can see that its values are very smooth — while random forest and the GLM perform quite well. What the meta-learner then does is take as input a data frame with the same observations we used for training, using the blocking with the tile ID that I mentioned before, and combine the predicted values from every single base learner via the stacking method that Tom described earlier to produce the final prediction for the tile. This, for example, is the output of the model for 2015, with probability values going from 0 to 100: you can see that the low values correspond to lakes, cities and roads, while the high values are where the mountains and forests are, so it actually performs quite well. Usually you would use the area under the curve (AUC) to assess the quality of your model. As Tom already mentioned, we can use ensemble machine learning for different kinds of tasks, and it is usually the same step-by-step process, depending on how complex the task is; but once you learn how to do it, and once you learn what your data are telling you, it is not difficult to model the task you are interested in.
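In mlr terms — the package used throughout this session — a stacked setup of this kind could be written roughly as below. This is only a sketch, not the exact wrapper function used in the workshop: the data frame `dat`, the `tile_id` blocking column and the particular learner names are assumptions carried over from the hypothetical overlay snippet above.

```r
## Rough sketch of a stacked classifier in mlr (hypothetical data and columns)
library(mlr)

# assuming dat holds only the target, the tile_id and the covariates
cov_names <- setdiff(names(dat), c("occurrence", "tile_id"))

tsk <- makeClassifTask(data = dat[, c("occurrence", cov_names)],
                       target = "occurrence", positive = "1",
                       blocking = as.factor(dat$tile_id))   # 30 km tile blocking

base <- list(makeLearner("classif.ranger",   predict.type = "prob"),
             makeLearner("classif.xgboost",  predict.type = "prob"),
             makeLearner("classif.binomial", predict.type = "prob"))  # binomial GLM

# logistic regression as meta-learner; "stack.cv" feeds it cross-validated
# base-learner probabilities, so the final output stays between 0 and 1
ens <- makeStackedLearner(base.learners = base,
                          super.learner = "classif.logreg",
                          method = "stack.cv",
                          predict.type = "prob")

m <- train(ens, tsk)
```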
So if we look at the predictions — yes, there is another map where you can see the differences across the years. Some covariates in the dataset are static, so they have the same values for the whole 20 years, while others are time series, for example the climatic data, which differ from year to year. So where are the differences? Here, for example, you can see an increase in forest cover in this area, around the roads, while over here there is some fragmentation instead, and the differences are quite large — we go from values of around 0–20 to values of around 60–80. Of course, all these maps come together with maps showing the uncertainty of the model. So even if you see a big change somewhere, it might be an error of the model — perhaps all three base models are predicting poorly there — in which case the uncertainty is very high and you would say: this area I would not use as a valid value. In other areas, where the uncertainty is very low, you can say: the model is telling me there really is something going on here, a change, for example.

Any questions? Let's see if there is something in the chat. Here is one: can we apply a mask to ignore some areas? On the blocking: the best way to do the blocking is to do it systematically, like we do with the spatial blocking — you cannot play with it, it has to stay objective, so you need to allow randomization and do it very systematically. And yes, of course you can apply a mask to ignore some areas: for those asking about a mask, you can see here that some pixels are completely white, because we used a mask to avoid predicting in areas covered by water — we do the same for the seas, for example — and you can do the same with a mask created from your own data to avoid predicting in those areas.

For now it is 15 species — 15 species, yes; we want to do 60. It was a lot of computing, so we actually stopped at 15, but we have the capacity to do all 60 species in space-time over 20 years. We did switch from 30 metres to 120 metres, so 16 times less data volume, because otherwise it was too much, and we are also predicting other things. We will show these apps on Thursday — Thursday morning; it is a launch, actually a celebration, after six months of work, I think, or more. So it would be great if we can get countries to give us open data; we think we could then train a full forest-species model, so that we know exactly which are the key drivers and the key covariate layers that help with the mapping accuracy, and we could then also support forestry projects and restoration and reforestation projects — we would like to provide the data output. We will have two predictions: one is the actual, realized distribution of the species — that is why we use the reflectance data — and this will be visible in the slider, so you will be able to open it, let's say somewhere in Germany, and compare where the species actually occurs with where it could potentially occur. The difference between the two could be because there is a city there, so the species cannot physically grow in that spot, or because there is a problem with dispersal — with the species moving into particular areas.
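To make the uncertainty and masking points above concrete, one simple way to derive a per-pixel uncertainty proxy from the base-learner predictions and to mask out water is sketched below; the file names are hypothetical and this is not necessarily how the workshop maps were produced.

```r
## Sketch: model disagreement as an uncertainty proxy, plus a water mask
## (hypothetical file names)
library(terra)

pred <- rast(c("pred_ranger_2015.tif", "pred_xgboost_2015.tif", "pred_glm_2015.tif"))

# cell-wise standard deviation across the base learners:
# high values mean the models disagree, i.e. the prediction is less trustworthy
uncert <- stdev(pred)

# mask out water bodies so that no probabilities are reported there
water <- rast("water_mask.tif")          # 1 = water, NA = land
final <- mask(rast("pred_ensemble_2015.tif"), water, inverse = TRUE)

writeRaster(c(final, uncert), "ensemble_2015_with_uncertainty.tif", overwrite = TRUE)
```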
So I think we can stop now — thank you. I'll take over — sure, there you go.

So you saw the three case studies — a nice set of topics. I think you noticed that they address different problems; all three are maybe a bit complex, but they reflect the type of data you use in real life. Especially the last one comes from an actual project, yet we were able to take something complex and put it into an R markdown — we really worked hard to simplify it so that it can be run within basically less than one minute. A very nice thing about mlr is that you can have a classification problem but use regression as the meta-learner, so you combine classification and regression; when you work with a 0/1 variable it is kind of both classification and regression, they are very similar, whereas with multi-state variables it is a bit different and not so easy to combine — but here it worked very well. We also get an uncertainty value for every pixel, which people usually do not provide in species distribution models — right, Carmelo? They don't — so we actually mapped the uncertainty for every pixel.

I would also like to switch to mlr3 — I did not mention that yet, so maybe just for the end. As you see, lots of the code runs on mlr, but if you go to the mlr website they tell you not to use it, so don't get confused: the project is retired. These people automated everything and then retired — that is why I told you, you automate machine learning and then you retire. So they say: don't use it, go to mlr3. But mlr3 is not simple — there are lots of changes. It uses R6 classes for the data, which is a different, more object-oriented style of programming; the command names are completely different, and so is the logic; the framework is also split into multiple packages, and the complexity adds up. I just cannot start using it yet — the learning curve looks too steep at the moment. I am hoping to speak with Patrick in the next days — you do development for mlr3, right? — so you can show me some tricks, because we would like to translate lots of code, but at the moment I do not know how to set up things like this blocking.

mlr3 won some awards, I heard, so it is a really vital project, and I do see that it runs faster; with the object-oriented programming it looks more and more like Python, I think. It is also great that the community has grown — I think there are more people involved now, and there is also a package for spatiotemporal learning, right? Okay, great. But that is the future. For now, as I said, this all works: the code is fine, the mlr package still runs and will probably keep running for a couple of years more, but if somebody changes something it might break, it can drop off CRAN, and then you have to fix it yourself. For now it works, and everything we do is parallelized, so there is nothing inefficient about it — but of course it would be better to switch to mlr3.
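Since the question of how to set up this blocking in mlr3 is left open above, one possible translation — a sketch under the assumption that the hypothetical `dat`, `tile_id` and `cov_names` objects from the earlier snippets are available — is to use mlr3's "group" column role, which keeps all rows of a tile in the same cross-validation fold, much like mlr's blocking:

```r
## Possible mlr3 equivalent of the blocked cross-validation (a sketch,
## not the official translation of the workshop code)
library(mlr3)
library(mlr3learners)

task <- as_task_classif(dat[, c("occurrence", "tile_id", cov_names)],
                        target = "occurrence", positive = "1")
task$set_col_roles("tile_id", roles = "group")   # rows of one 30 km tile stay together

lrn_rf <- lrn("classif.ranger", predict_type = "prob")
rr <- resample(task, lrn_rf, rsmp("cv", folds = 5))
rr$aggregate(msr("classif.auc"))
```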
Yes, so with this I will now stop. Let me see if there are more Q&As in the chat — no — and on the Mattermost r-dev channel I also did not see many questions. For the people online: please keep asking questions — on Mattermost, just use the r-dev group — and if something is interesting I can also answer tonight, that is not a problem, but please keep using it.
Info
Channel: Tomislav Hengl (OpenGeoHub Foundation)
Views: 460
Id: vNApDJ0BmPE
Length: 91min 50sec (5510 seconds)
Published: Mon Sep 06 2021