Christian Knoth - Introduction to Deep Learning in R for analysis of UAV-based remote sensing data

Captions
So as I said in the plenary already, the aim of this tutorial is mainly to give you a basic understanding of the main steps needed to conduct deep learning in R. We'll go through the basic workflow: build a model, prepare your data to get it into the model, then train the model, and finally predict with the model. In addition to that, I have a few extensions covering things that are interesting for remote sensing, like pixel-wise classification, and getting a remote sensing image into a convolutional neural network and the results back out into a map of predictions.

As you see here, this is quite a lot, so don't get scared by that. The idea is really to go through part one and give you a basic understanding of how to do these things in R, but also of what the arguments and commands we're using actually mean, so that you don't just put in some activation or other without knowing what it means. We'll try to give at least a basic understanding of what's behind these code lines. The idea is then hopefully to raise your interest in the topic, so that you can, if you want, continue with this document even after the session. That's also why it's so long: I want it to be as useful as possible on its own, without me standing here saying things, so I have put quite a lot of descriptions in there.

One other thing: there are quite a few side notes, as I call them. The side notes contain background information that I cannot really go into during this tutorial, but I also didn't want to just leave you with some terms, so I put descriptions of them into side notes, and if you're interested you can click on those segments to get some more information. For example, during the training I'll be talking about loss functions and say that we'll be using the binary cross-entropy loss function to calculate our error. I'll leave it at that for the tutorial, but if you're interested you can click on the side note and get a bit more information on how binary cross-entropy is computed and what it actually is. So that's for all the things that I cannot go into right now but that you might be interested in looking into later.

So that's the plan for this tutorial. As I said, the basic steps we want to go through are: build your model, prepare your data, train your model, and predict with your model. And the scenario is the following: our task will be to develop a convolutional neural network for mapping the extent of wild rye, which is a certain form of grass, to put it in this unscientific way; I'm not a landscape ecologist. We want to map this wild rye that you can see growing in these patches here. The reason is that if this species becomes dominant in an ecosystem, it can have negative impacts on the biodiversity of that ecosystem, so it's a plant that needs management and therefore needs monitoring. This is a classical ecological monitoring task, one of the tasks that are the reason why tools like UAVs have gained so much interest in the field of landscape ecology. This is the test area, and the training areas are a bit further to the north, these two images here.
As you can see, all the data is very small; we have a very small toy example, because we hope this is something that can also be reproduced on a laptop without a GPU. We won't even use these whole areas, but only a few subsets out of them; you will see later how many samples we have.

So this is the scenario, and the first thing we have to do is build the model. Coming back to the overview that I already showed during the plenary: the first thing we need to wrap our heads around is what is actually happening in such a convolutional neural network as we are going to build it. As I said before, the idea behind these networks is that they are a stack of successive layers, and these layers transform the input data, in this case a small subset of an image, into different representations. These representations are hopefully useful, in the sense that at the end we get a representation, in this case an output of one number, that is close to what it should be, and what it should be is determined during the training. We hand the network an input image and the corresponding label. In this case we have a binary classification problem: we want to detect whether the target object, wild rye, is present in the image or not. For the training we will hand over these images and a corresponding label of one or zero, for target or non-target, and the model will then learn to conduct all these transformations such that as a result we get a number that is as close as possible to what it should be, zero or one.

The first thing we have to do is assemble the model out of these layers. We will go through the individual layers later on; first we will just add a few layers to build our model. In the keras package, which is the package we use, there are two ways to build models: one is the keras_model_sequential function and the other is the keras_model function.

Before we start, there's one thing I wanted to show you, because it might help you understand what we're actually doing. We are using the keras R package, which is our interface to the Keras API written in Python, which in turn is the high-level API for conducting TensorFlow operations. So the actual low-level tensor operations are done by TensorFlow; that's the backend. Then there's Keras, the high-level API for TensorFlow, and then there's the keras R package, which is what we use to get access to Keras. At some points we will also use the tensorflow package, which gives us direct access to the TensorFlow API; we will use that at two points in the tutorial. And we will use another package called tfdatasets, which we will later use for the pipeline that prepares our data. So now you know what I'm talking about when we later touch some Python things at certain points.

To a question from the audience: TensorFlow has several APIs, so you can also run it from other APIs; you don't necessarily need Keras, but Keras is a very popular way of using TensorFlow.
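For reference, this is the package setup the rest of the session relies on; a minimal sketch, assuming the keras, tensorflow and tfdatasets packages are installed (for example inside the provided Docker container):

```r
library(keras)      # R interface to the Keras API
library(tensorflow) # direct access to the TensorFlow API (the tf$... calls later)
library(tfdatasets) # data preparation pipeline used further below
```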
As I said, in Keras there are two ways to build models, and the one we will mainly need is keras_model_sequential. This builds a model that, as the name says, is sequential, in the sense that it has one input, one output, and a directed linear stack of layers, something like in this image here: there are no connections between layer one and ten, or between layer four and two; it's only one to two, two to three, and so on. This type of architecture is fine for most of this tutorial. Only when we want to build more complex model architectures, for example with more than one input or more than one output, or with specific connections between non-adjacent layers, do we need the so-called Keras functional API, which allows us to build quite flexible, different sorts of models. For now we'll stick to keras_model_sequential.

Building a model with keras_model_sequential is pretty easy, actually. You start by initiating an empty model, simply by calling keras_model_sequential. Then we use certain functions for adding layers to this model. Here in this chunk we've added the first layer with layer_conv_2d; that's a convolutional layer, and we'll talk about its arguments later. If you now get the summary of this model, you'll see that up to now it only has one layer, because we have only added one layer yet; this is going to get bigger and bigger as we add more layers.

By the way, when you first execute code that has to do with TensorFlow or Keras, you might get some mean-looking output like this. That's not a problem: it's mainly TensorFlow complaining that it's not running as fast as it could. I'm running this in the Docker container that I provided for you, and we're using an installation that is compatible with as many systems as possible, so we don't have GPU support enabled here, and we also don't have a version of TensorFlow optimized for a specific CPU, which would also be possible. That's mainly what these warning messages are about.

So we have now added one layer, with the function layer_conv_2d, and in just the same way we can add other layers. The functions we're going to use are layer_dense, layer_max_pooling_2d and layer_flatten, because convolutional, dense, max pooling and flatten layers are the layers we're going to need. Before we talk about what these layers are, let's just continue building the model by adding all the remaining layers; each line represents the adding of one layer.

One thing you might have noticed is that we are not reassigning the result of these functions back to the variable first_model, which is probably a bit surprising for you. We're just calling each function once, and that's because in Keras, models are modified in place: if you call one of these functions on the model, the model is directly modified by it. You don't have to reassign the result back to first_model; calling the function on the model is enough.
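Here is a sketch of what the whole assembled model could look like in code. The 32 filters of the first layer, the 3x3 kernels, the 2x2 pooling windows, the 256-unit dense layer and the one-unit sigmoid output are stated in the tutorial; the filter counts of the later convolutional layers (64, 128, 128) are assumptions chosen so that the output shapes and the roughly 1.4 million parameters discussed below work out:

```r
# initiate an empty sequential model
first_model <- keras_model_sequential()

# add layers; models are modified in place, so no reassignment is needed
layer_conv_2d(first_model, filters = 32, kernel_size = c(3, 3),
              activation = "relu", input_shape = c(128, 128, 3))
layer_max_pooling_2d(first_model, pool_size = c(2, 2))
layer_conv_2d(first_model, filters = 64, kernel_size = c(3, 3), activation = "relu")
layer_max_pooling_2d(first_model, pool_size = c(2, 2))
layer_conv_2d(first_model, filters = 128, kernel_size = c(3, 3), activation = "relu")
layer_max_pooling_2d(first_model, pool_size = c(2, 2))
layer_conv_2d(first_model, filters = 128, kernel_size = c(3, 3), activation = "relu")
layer_max_pooling_2d(first_model, pool_size = c(2, 2))
layer_flatten(first_model)
layer_dense(first_model, units = 256, activation = "relu")
layer_dense(first_model, units = 1, activation = "sigmoid")

summary(first_model)
```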
If we now look at the result by getting the summary, you can see what our model looks like: it's a stack of layers, each layer with certain parameters and a certain output shape, and what these output shapes mean will hopefully become clearer as we proceed with this tutorial. You can see we have convolutional layers, max pooling layers, a flatten layer and two dense layers. Let's check what these layers actually do.

First, a dense layer. What dense layers do is pretty simple: they consist of a certain number of units, and each unit calculates a weighted sum of all the input values it gets, adds a value, and returns the output. Coming back to the figure here, this would for example be a dense layer with one unit, because it takes all these input values and returns one output value, by building a weighted sum and adding a number and then giving back the result. The weights of this weighted sum, and the so-called bias that is added, are the parameters which this layer learns. All these layers, as I said, learn certain weights in order to transform the input in a way that gets as close as possible to the desired output. The layer we added last, this one here, has units = 1, so it has one unit and will return one weighted sum. The dense layer before it has 256 units, so it will return 256 weighted sums; in other words, this layer takes the input values and transforms them into 256 different representations of those values, kind of extending the feature space. And again, all the weights for all these weighted sums that are being calculated there are what the network actually learns.

In principle, we could build the model only out of dense layers and chain together a lot of them, which would then compute all kinds of representations of the input data and at the end give us an output, and we could also process images with such a network. The problem is that a model built like this can only detect global patterns in images: it cannot detect local patterns, and the patterns would also always have to be in the same spot.

This is where convolutional layers come in, and they're not that much different from what you do when you do feature extraction manually, for example when you run an edge detection with a Sobel filter or something like that: that's a convolution. Convolution means that you have a certain filter kernel that slides over the image as a moving window. All the pixels that this kernel can be centered on are within this yellow box, and what happens is that a weighted sum is built of each of the pixels within the moving window, with the weights for this weighted sum coming from the kernel. So the first output here would be input 1 times weight 1, plus input 2 times weight 2, and so on: a weighted sum of all the input pixels within this filter window, with the weights coming from the kernel. The next position would then be with the window moved from here to here, where the next weighted sum is built. I guess you get the meaning. Depending on how these weights are configured, this extracts certain features from the input image, for example edges. Here is an example of the first two channels of the first convolutional layer, and what they do to this input image: you can see that they are extracting different types of features.
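To make the weighted-sum mechanics concrete, here is a tiny base-R illustration, with hypothetical numbers that are not part of the tutorial code, of a single position of a 3x3 kernel:

```r
# one 3x3 window of an input image and a 3x3 kernel (hypothetical values)
window <- matrix(c(10, 12, 11,
                   13, 90, 14,
                   11, 12, 10), nrow = 3, byrow = TRUE)
kernel <- matrix(c(-1, -1, -1,
                   -1,  8, -1,
                   -1, -1, -1), nrow = 3, byrow = TRUE)  # a simple edge detector

# the output pixel at this kernel position is the weighted sum of the window,
# with the weights coming from the kernel
sum(window * kernel)  # 627: a strong response, i.e. an "edge" detected here
```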
One thing I forgot: each convolutional layer is not only running one filter, but multiple filters on the image; this is the equivalent of the units of a dense layer. For example, for the first convolutional layer here we said it should have 32 filters; this is what we specified with filters = 32. That means it's not only running one kernel but 32 kernels, 32 different feature extractors, over the same image, and the output is a so-called feature map with 32 bands, each band representing the response of one of the 32 filters to the input image. That's why we have an output depth of 32 in this case, and here I have only plotted the first two of those 32.

So we have now looked at two types of layers, the dense layers and the convolutional layers. Both of them extract information from images by transforming those images, and how this transformation happens depends on the weights of these layers. That is what the network learns, and it learns it by being exposed to actual data and the corresponding labels, as I showed here. Layers that have weights that need to be learned are also said to have a state, while the rest of the layers that we're going to use don't have any weights to learn; they're called stateless. You will see that later.

One of these stateless layers is the max pooling layer. What the max pooling layer does is basically downsample the resolution of the input. As you see here, this is max pooling with a 2x2 window: it again slides a moving window over the input, but it doesn't compute any weighted sums or anything; it just takes the maximum of the four pixels within this window and outputs that into the new tensor. So this value here will be the maximum of those four. It's a simple downsampling, as in the small sketch below.
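Here is the same idea as a small base-R sketch with hypothetical values: no weights involved, just the maximum of each non-overlapping 2x2 window:

```r
# 2x2 max pooling on a 4x4 input: each output pixel is simply
# the maximum of one 2x2 window
input <- matrix(c(1, 3, 2, 4,
                  5, 6, 1, 0,
                  7, 2, 8, 1,
                  0, 1, 3, 2), nrow = 4, byrow = TRUE)

pooled <- matrix(NA, 2, 2)
for (i in 1:2) for (j in 1:2) {
  rows <- (2 * i - 1):(2 * i)
  cols <- (2 * j - 1):(2 * j)
  pooled[i, j] <- max(input[rows, cols])
}
pooled
#      [,1] [,2]
# [1,]    6    4
# [2,]    7    8
```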
There are mainly two reasons for using max pooling layers. One is to reduce the size of the feature maps, so you don't blow up your network: as you have seen, with all these filters added by the convolutional layers, your feature maps are getting bigger and bigger, and one reason for max pooling is to reduce their size a little bit. The second reason is that the convolutional layers that come after a max pooling layer look at a larger spatial area. Remember that the input of such a max pooling layer can be the output of a previous convolutional layer: we have the response map of that convolutional layer, which is then aggregated, and the next convolutional layer looks at this aggregated feature map, so to say. Thereby, each convolutional layer later in the network indirectly looks at larger, more complex patterns. So it's a pretty common thing to have combinations of one or two convolutional layers and then one max pooling layer doing the downsampling, then again a few convolutional layers and another max pooling layer, just as we have done in this example: convolution, max pooling, convolution, max pooling, convolution, and so on.

What all these combinations of convolutional and max pooling layers do is extract increasingly complex, but also increasingly abstract, patterns from the image. That also means that if we look at the intermediate results, they become more and more difficult to interpret, because what is extracted there is more and more abstract. These increasingly complex features are then usually fed into a dense layer, which does the actual classification, classifying the input based on all the features that are fed into it.

The last layer we need to talk about is the flatten layer. This is more of a technical thing. As I said before, a dense layer computes weighted sums of certain input values, and technically these values need to be one vector of data, just one row of numbers. What we get out of a max pooling or convolutional layer, however, is a 3D tensor with pixel rows, pixel columns and image depth. The flatten layer just puts all these values one after the other into one long vector, which is then processed by the dense layer. So that is more of a technical thing.

To sum up: we assemble our model using convolutional and max pooling layers in combination, for extracting certain local features of growing complexity, and then we add dense layers that use these features for classification.

One thing you might have asked yourself: what about these arguments here? We've talked about the functions for adding layers, and the first argument is clear: that is the model to which we want to add the layer. For the convolutional layers we have another argument, filters, which is how many filters the layer has, meaning how many different features are being extracted. Then there's an argument called kernel_size; this is simply the size of these sliding windows. And the convolutional and the dense layers both have an argument called activation, which we set to "relu" everywhere except for the last layer, where it is "sigmoid". This stands for activation functions, and the reason we need them is that all the transformations we have talked about up to now are linear transformations: building weighted sums and adding a value, just like building the weighted sums within a kernel, are linear transformations. If you stack all those transformations together, the result is still one linear transformation, so we don't gain anything by stacking multiple layers; in the end we'd still have one linear transformation, and our network would not be able to learn more complex representations of the data, and thereby not be able to conduct more complex classifications.

To a question from the audience: how do I define the number of filters? That is a choice of the architect of the network, so to say. There are several best practices, and I wouldn't feel in the position to judge which is the best one, but what you have to keep in mind is that the more filters you have, the more parameters you get. So I would say: as many as needed, but as few as possible. If you look at our first model, which is still a pretty simple one, we already have about 1.4 million parameters, which is quite a lot. If you come from spatial statistics, you'll find this a ridiculous amount of parameters, but for deep learning it's still a comparatively low number.
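As a sanity check on that number, the parameters can be counted by hand; a sketch, using the layer sizes assumed in the model code above:

```r
# parameters of a conv layer: (kernel_h * kernel_w * input_channels + 1) * filters
conv_params  <- function(k, in_ch, filters) (k * k * in_ch + 1) * filters
# parameters of a dense layer: (inputs + 1) * units  (the +1 is the bias)
dense_params <- function(inputs, units) (inputs + 1) * units

conv_params(3, 3, 32) +      # 1st conv layer:        896
  conv_params(3, 32, 64) +   # 2nd conv layer:     18,496
  conv_params(3, 64, 128) +  # 3rd conv layer:     73,856
  conv_params(3, 128, 128) + # 4th conv layer:    147,584
  dense_params(4608, 256) +  # dense layer:     1,179,904
  dense_params(256, 1)       # output layer:          257
# total: 1,420,993 -- the roughly 1.4 million parameters mentioned above
```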
So, all the transformations that we've done so far are linear transformations, and in order to enable our model to also learn non-linear representations, we need activation functions; this is what we define with the activation argument. What activation functions do is this: the outputs of the operations I've shown before, the weighted sums plus a bias, or the weighted sums within the filter kernel, are actually not the final output of a layer. Instead, these outputs go into an activation function, which transforms those values non-linearly, and the output of this activation function is the final output of the layer. Which activation function to choose is not an easy choice; there's quite a bit of thinking behind that, and we cannot go into detail here, but I have a side note with a bit more explanation. As far as this tutorial is concerned, you can just remember these two main activation functions that we are going to use: one is called relu, the rectified linear unit, and the other is called sigmoid.

What the sigmoid activation function does is pretty easy to see: it takes the input values and squeezes them into the range between zero and one. You can see that for large negative values it is almost zero, for large positive values it is almost one, and here in between is the transition. The other one, the rectified linear unit, simply outputs zero for all input values below zero, and for all positive values it just passes the same value through. Again, the reason for using these activation functions is to introduce non-linearity into the model, so that it's not just learning linear representations.

Both of these activation functions have advantages and disadvantages. You could, or should, look into the side note on the vanishing gradient problem, which I also don't want to talk about now, but which is, amongst other things, related to the type of activation function you use. For our model, and actually for all the models we'll use here, we will use the rectified linear unit for all the hidden layers (hidden layers are the layers that are neither input nor output, but the layers in between, these layers here), and the sigmoid activation function for the final output layer. The reason we use the sigmoid activation function there is that we have a binary classification problem, and for such problems the sigmoid is a good activation function for the output layer, because, as you have seen, it outputs a value between 0 and 1, which can then be interpreted as the probability of the image containing the target or not; in our case, the image containing a wild rye patch or plant, or not. So, for example, the output of the convolutional layer that I've shown in this figure doesn't actually look like this, but rather like this: the result of the weighted sum is put through the relu activation function.
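Written out in base R, the two activation functions used in this tutorial look like this; a sketch for illustration only, since Keras applies them internally:

```r
relu    <- function(x) pmax(0, x)         # 0 for x < 0, identity for x >= 0
sigmoid <- function(x) 1 / (1 + exp(-x))  # squeezes any value into (0, 1)

relu(c(-2, -0.5, 0, 1.5))  # 0.0 0.0 0.0 1.5
sigmoid(c(-10, 0, 10))     # ~0.0000454  0.5  ~0.9999546
```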
Here is a summary of the model that we've built. I'm not sure if you can see the gray boxes on this screen; they're meant to be gray boxes, and you can probably see them on your own computers. This overview briefly summarizes what the model is doing. The green arrows symbolize convolutional layers, the blue arrows max pooling layers, the purple one the flatten layer, and the orange ones the dense layers. The gray boxes are the intermediate results; the numbers to the left or right of these boxes are the spatial extent in pixels, and the numbers on top of them are the image depth, the number of channels or bands of the feature maps.

For example, here is the input image: it has 128 by 128 pixels and three channels. It goes into the first convolutional layer, which transforms it into 32 feature maps, each with 126 by 126 pixels. The reason it gets a bit smaller is, as I said before, that the convolutional kernel cannot be placed over all the input pixels: at the edge of your image there are no neighbouring pixels left to calculate the weighted sum on. As you see, we have a sequence of convolutional and max pooling layers, and at some point we end up with an intermediate result of only six by six pixels, but 128 feature maps of that size. Then the flatten layer transforms this into one row of 4608 values, which are fed into the dense layers. The first dense layer computes 256 representations, doing 256 weighted sums, and the result of this goes into the last dense layer, which has only one unit: it takes all these values and computes one weighted sum, and outputs a number between zero and one, representing the probability of this image containing the target or not.

And again, all the weights that are involved here are learned during training: we expose this network to an ideally very large number of images and the according labels, and it learns all these weights such that the transformation of an image results in something that is pretty close to the label. If the label is one, it should predict something pretty close to one, and so on. This training is what we'll look at later.
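The shape arithmetic just described can be replayed in a few lines; a sketch, assuming "valid" 3x3 convolutions (which lose one pixel per edge) and 2x2 pooling (which halves the size):

```r
conv_out <- function(n) n - 2         # 3x3 kernel, no padding
pool_out <- function(n) floor(n / 2)  # 2x2 max pooling

n <- 128
n <- pool_out(conv_out(n))  # 63   after conv (126) + pooling
n <- pool_out(conv_out(n))  # 30   after conv (61)  + pooling
n <- pool_out(conv_out(n))  # 14   after conv (28)  + pooling
n <- pool_out(conv_out(n))  # 6    after conv (12)  + pooling
n * n * 128                 # 4608 values going into the flatten layer
```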
So this is how we built the model, and I hope you also got at least a basic idea of what we choose when we set the different arguments here. filters is the number of filters, i.e. how many different feature extractors are applied; the same is true for units in the dense layers, where each unit computes one weighted sum. activation defines which activation function to use, and the size arguments determine the window sizes: the filter kernels are three by three pixels, and the max pooling windows are two by two pixels.

As I showed here in this figure, an output of 0.93 can be interpreted as the probability that this image shows a target, or how certain the model is that it does; 0.93 means you can be pretty sure that there is a target in this image. The reason is that the model is trained with data where images that show the target get the label one and those that do not show the target get the label zero, and the weights doing all these transformations are trained such that all images in the training data that show targets result in a number as close to one as possible, and all those that do not show the target are transformed into a number as close as possible to zero. How this is done, we will see during the training.

To a few questions from the audience: Exactly, this is a binary classification problem, so it's only yes or no, with a certainty for the yes. You can also do multi-class classification, something we don't talk about here; the difference is that you would then use another activation function for the output, not a sigmoid but a softmax, which is kind of the multi-class version of the sigmoid. But here in this tutorial we are only talking about binary classification, detecting a certain object. And yes, 0.93 means there is a pattern visible there that makes the model react quite strongly, with a high number. Or some part of the image: that's actually a problem later on, because of course we don't only have images that are entirely target or entirely background, but also images that are, say, half target and half background. That's something we'll get to later when we do the pixel-wise classification, which also looks at where in the image the target is.

And no, it's not just trial and error. I'm not a very experienced model architect, but there are rules of thumb, like this mixture of max pooling and convolutional layers to extract the features, and what I've mainly paid attention to so far is not to get too many parameters, not to blow up the model too much. Usually there's this workflow: you start with a certain architecture, test it on your test and validation data, and then iteratively improve it. Whether there's a best practice beyond that, I don't know; it does take a while to see how good a model is if you cannot run it quickly. To be honest, I'm looking at it from the application side, and I'll give an outlook later in the extensions on using model architectures that are established for a task, that have proven to work well, sometimes even pre-trained for certain applications. You can also use pre-trained networks that have been trained on completely different input data and reuse at least a few layers of those models for your use case at hand; we'll see that later.

So far, regarding our code, we have only built our model with these functions here. At this point I can also show what the next steps will look like; I'll skip the "prepare your data" part for now and only show the remaining model steps. We have only used a few lines of code to build the model. When we later train our model, there are only two functions we need: compile, where we define certain parameters for the training, and then the fit function. After that, we only need one more line to predict, and then we are done with the prediction. So the only slightly tedious part is, as usual when working with data, the data preparation.

We've now built a first, very simple model, and now we want to prepare our data in a way that it can be fed into this model to detect the target. The first thing we do is simply organize our data into training and validation, and into target and non-target; nothing special here.
At this point we assume that the data is already prepared and sitting on our disk, as it also is for the tutorial: we have a training data folder with one subfolder of images showing a target and one of images showing something else, not the target. We first read in the file paths of the images showing the target, add a label of one to those, and build a data frame. So it's simply a data frame where one column holds the images and the other column holds the label, in this case all ones, because up to now we have only loaded the data that shows targets. We do the same for the images not showing targets, and then we merge both into one data frame. You can have a look at how it looks now: it's one large data frame with targets and non-targets, and at this point it contains only the file paths to those images and the corresponding labels; the actual reading of the data will come later.

What we do next is use the function initial_split, which you may know from other use cases of the rsample package. It splits our data randomly into training and validation. The two arguments that we're setting here are prop, the proportion of training data, which I set to 0.75 so that roughly 75% of all our data is assigned to the training data and 25% to the validation data; and strata, where we use the label column as a stratum for this split, to make sure that the zeros and ones are split in the same proportions, so that we have roughly the same 75-to-25 split for the target and the non-target class. If you split completely at random, you could end up with almost only targets in the training data, for example, and the rest in the validation data. Stratifying ensures the same class proportions in the training and validation data.

The result is an rsplit object: you can see that in total we have 287 samples, 216 of which are in the training dataset and 71 in the validation dataset. If you want to access the training or the validation dataset, you can do that with the functions training() and testing() on the split object. We can check whether this stratified split has worked by counting, in our training data, how many samples have label zero and how many have label one, and we see that both are 108. So we have an equal split within the training data, just as many zeros as ones, and the same is (almost) true for the validation data, which is what we want: no over- or under-representation of one of the classes we're looking at.
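Put together, the data organization described above looks roughly like this; a sketch, where the folder paths and the column names img/lbl are assumptions for illustration:

```r
library(rsample)

# gather file paths and attach labels: 1 = target, 0 = non-target
target_files    <- list.files("training_data/true",  full.names = TRUE)
nontarget_files <- list.files("training_data/false", full.names = TRUE)

data <- rbind(
  data.frame(img = target_files,    lbl = 1),
  data.frame(img = nontarget_files, lbl = 0)
)

# random split into ~75% training / 25% validation,
# stratified by the label column so both classes keep that proportion
data_split <- initial_split(data, prop = 0.75, strata = "lbl")
data_split                 # an rsplit object, e.g. <216/71/287>

training_data   <- training(data_split)
validation_data <- testing(data_split)
table(training_data$lbl)   # should be roughly balanced, e.g. 108 / 108
```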
Up to now we've only organized the data; what's now necessary is to actually read the data into something that can be processed by our convolutional network. In Keras and TensorFlow we work with so-called tensors, which are a generalization of vectors, matrices and so on to an arbitrary number of dimensions. For the type of data we're concerned with, I've tried to translate this into something we know from R: a 0D tensor is just one number, a scalar; a 1D tensor is a vector; a 2D tensor is a matrix; and a 3D or any higher-dimensional tensor would be represented by a multi-dimensional array.

One thing to keep in mind here is that datasets in convolutional neural networks are not processed all at once, but in batches; we will later determine how large our batches will be. You usually don't feed the whole dataset into the network, only a certain batch of data at a time, for example 10 samples, not the first 10 but a randomly drawn batch of size 10. This batch is usually the first dimension of the tensor. So when we're working with images, we are usually working with four-dimensional tensors: the first dimension is the number of samples, the second the number of rows, the third the number of columns, and the fourth the number of channels or bands. For example, a tensor of shape (2, 448, 448, 3) would hold two images, each having the size 448 by 448 pixels, with three bands.

So, as I said, we need to get our data into the form of these tensors, as needed by the convolutional neural network, and there are several ways to do that. Keras has some preprocessing functions, but here we are using the tfdatasets package, which provides access to a preprocessing pipeline for data as it is needed for convolutional neural networks. The first function that we're going to use is called tensor_slices_dataset. What this does is create a so-called tf_dataset object from the input, in this case our training data, i.e. the data frame holding all the image paths and labels of the training data. The TF dataset consists of a number of elements, each element being a pair of training data in the form of tensors: for example, one element is one image and the corresponding label.

One thing: if you've worked with Python before, you may know the concept of Python iterators. That's something that can be a bit daunting, so I will not go into detail here; I have another side note on this. But a TF dataset is, under the hood, a Python iterable, so you cannot simply print it as a list, as you would think should be possible; you have to use the functions as_iterator and iterate to get access to single elements of a TF dataset. This is just for us to inspect it: we don't need a list for the further processing, we need the TF dataset itself. So I've used as_iterator and iterate to create a list from this TF dataset, and this list then shows the single elements, each element being, as I said, a list of two tensors, one for the image and one for the label; I have only printed the first six or so elements. However, as of now, the image tensor is, as you can see, still only a string: we still only have the file paths, we haven't read any data yet. That is what we are going to do next.
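In code, creating and inspecting the TF dataset looks roughly like this; a sketch, where as_iterator and iterate come from the reticulate package:

```r
library(reticulate)  # for as_iterator() / iterate()

# one dataset element per row of the data frame: an (image, label) pair of tensors
training_dataset <- tensor_slices_dataset(training_data)

# a tf_dataset is a Python iterable; to peek at single elements
# (for inspection only) we turn it into a list via an iterator
dataset_list <- iterate(as_iterator(training_dataset))
dataset_list[[1]]  # a list of two tensors; the image is still just a file path here
```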
As I said, the data preparation is usually the most tedious part, so I will try not to go into too much detail, but just show the minimum that we have to do. We need to read the images; we need to convert them into floats, into numbers between zero and one, so we normalize or rescale the values, which is necessary, or at least advised, in order to help the network train properly (again, I have a side note on value normalization that you can read for further information); and the third thing we do is resize the images. We don't actually need that in this case, because the data as I've prepared it already has the correct size of 128 by 128 pixels, as the network expects, but if that was not the case, the images would be resized here to this predefined size using interpolation, with different resampling techniques available.

These are the three steps we have to take, and we do that using the dataset_map and list_modify functions. dataset_map is a function that helps you apply a certain function to each element of the TF dataset: if we take, for example, read_file and decode_jpeg and put them as a function into dataset_map, then this function is going to be applied to each element of the dataset. So we call dataset_map on our training dataset, which means that the following function is applied to each element of this dataset, and the function that we apply is list_modify. Because, as I said before, each element of the dataset is a list with two items, the image and the label, and list_modify allows us to modify a certain item of the list. In this case we want to modify the image item, and we modify it with these functions here.

This is another thing that might look a bit strange: these dollar operators. This is what I said before, where we use the tensorflow package to get direct access to the TensorFlow API, and use functions of this API for reading the data. The functions that we use are tf$io$read_file and tf$image$decode_jpeg: we first read the file and then decode it, using decode_jpeg because the images are stored on disk as JPEGs. We modify our image list item with the result of this, which means that after this step, we have read in the imagery. The second step, as I said before, transforms the data type into floating point values, and the third resizes the images. The idea is always the same; only the function that we apply to each element via list_modify changes: first it's the decoding, then the data type transformation, then the resizing.

Then there are three last steps. We have to shuffle the training dataset, because up to now it is ordered: we first read in only the images with label one, all the images containing targets, and then all the images containing no targets. In order to train properly, and not run the model first only on target images and then only on non-target images, we need to shuffle them, and this is what dataset_shuffle does. Then we use the function dataset_batch to create batches from our dataset; this is what I said before, that the dataset is not processed as a whole but in batches, and in this case we define a batch size of 10.
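The whole pipeline, condensed; a sketch, where list_modify comes from purrr, the shuffle buffer size is an assumption, and the final unname step is the one explained just below:

```r
library(purrr)  # for list_modify

# 1. read each file from disk and decode the JPEG into a tensor
training_dataset <- dataset_map(training_dataset, function(.x)
  list_modify(.x, img = tf$image$decode_jpeg(tf$io$read_file(.x$img))))

# 2. convert to float values in [0, 1] (value rescaling)
training_dataset <- dataset_map(training_dataset, function(.x)
  list_modify(.x, img = tf$image$convert_image_dtype(.x$img, dtype = tf$float32)))

# 3. resize to the 128 x 128 pixels the network expects (a no-op for this data)
training_dataset <- dataset_map(training_dataset, function(.x)
  list_modify(.x, img = tf$image$resize(.x$img, size = c(128L, 128L))))

# shuffle (targets and non-targets are still ordered), then batch
training_dataset <- dataset_shuffle(training_dataset, buffer_size = 128)
training_dataset <- dataset_batch(training_dataset, 10)

# remove the item names, which would otherwise cause an error later
training_dataset <- dataset_map(training_dataset, unname)
```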
So in each step, only 10 images are preprocessed, then the next 10 images, and so on. The last step is to unname our dataset, meaning the names "image" and "label" are removed. This is, as far as I know, simply a technical thing: if we don't do it, we get an error message in the next steps.

We can now have a look at the resulting dataset, just to see how it looks and how we have changed it; I'll go back to the HTML for that. I've done this iterator trick again to make a list out of the dataset and check the first element, and of this first element I check the first item, meaning the image tensor. This gives you a very long output, because it shows some of the actual values within this tensor; you don't need to pay attention to the exact numbers, it's just a preview. It also shows you the shape of the tensor and the data type. The shape is (10, 128, 128, 3), which means we're looking at the tensor holding the first batch of data: the batch size was 10, that's why the size of the first dimension is 10; the next two dimensions are 128 by 128, the size of each image; and the fourth one is 3, because we have three bands. And we see the dtype, the data type, is float, because we have changed that during the preprocessing. If you only want to look at the shape, you can also use the $shape operator. We can do the same with the second item of the first element, to check the tensor holding the labels corresponding to these images; that's a much less complex tensor, it only has these ten values, which are the ten labels of these ten images, and it's of data type integer.

With this, we have prepared our data, which, as I promised, is a quite tedious process; but once you've done that, it's pretty easy to feed this data into the network. You will see later, when we do the training, that you can just throw in this dataset, because all the preparation has been done already. We've done that with our training data; now we do the same with our validation dataset. These are all the same functions that we're using here; the difference is that we don't need the shuffling, because it doesn't matter whether the validation data is ordered or not: we're not using it for training. The validation data will only be used to give us an idea of how well the model is performing after each epoch; you will see that.

So now we have two TF datasets, one for the training data and one for the validation data, and with this we can finally start training our model. The training of the model, in terms of our code, consists mainly of two functions: the compile function and the fit function.
The compile function, as you see here, needs a few arguments: the model that we want to compile (compile means prepare for training), and in this case three other arguments: optimizer, loss and metrics. These three arguments are the keywords associated with the general workflow pursued during training, which I've sketched here. The first step is that the model is run on one of the batches that we've defined, in our case 10 images, in the first place with randomly assigned weights, so the output that we get will probably not make much sense, but it is computed. Then this output is compared to the corresponding label assigned to each of these input images, and the loss is calculated. The loss is kind of the error: how far off is our prediction from the label? How to calculate this loss is what we define with the loss function; here, as you can see, we have chosen binary cross-entropy as the loss function, and as I said before, I will not go into detail on how binary cross-entropy works, but you can look at the corresponding side note for additional information.

What we also have to assign is the optimizer, because that is what matters in the third step. We have computed the loss on a certain batch, and the next step is that the parameters, meaning all the weights of the weighted sums in the dense layers and of the kernels in the convolutional layers, are adjusted just a little bit, to reduce the loss on that current batch. The way this adjusting of weights is done is what the optimizer defines, and the learning rate, you could say, defines how large the steps are in which we adjust these parameters.

I shouldn't have chosen gray, because you again cannot see it very well, but here is a figure showing how this works. As I said before: run the model on a randomly drawn batch, compute the predictions, and compute the loss, i.e. how far off the predictions for this batch are from the labels; then slightly adapt the parameters using the gradient descent approach to reduce the loss; then run the model on the next batch and do the same, iteratively reducing the loss by always adjusting the weights, the parameters, a little bit. This adaptation of parameters uses the gradient descent approach. Again, I'll not go into detail, but you can imagine it like this: the loss function is computed, and the derivative of this loss function, the gradient, is computed; you get a loss surface, and you just go a little bit down the slope of this loss surface, changing the parameters in the direction in which the loss goes down. The gradient is kind of the multi-dimensional version of the derivative. And you do this for all the batches in your dataset; when the whole dataset is finished, one epoch is done. How many epochs we want to run, i.e. how long we want to train, we can determine in the next function, the fit function.

Here I have another description of these different arguments. One argument that I haven't talked about yet is the metric, in this case accuracy. This metric is not used for the training itself; it's just what we get out during the training, as information for us, on how well the model performs at a certain iteration. In this case we use accuracy, meaning binary accuracy, which is simply the fraction of samples, meaning images, that are correctly classified. "Correctly classified" here means that by default a threshold of 0.5 is chosen: every output of the model above 0.5 is considered a 1, and everything below 0.5 is considered a 0; this is then compared to the label, which is 0 or 1, and the fraction of correctly assigned images, or samples, is then calculated as the metric.
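In code, that is one call; a sketch, where the loss and metric are as described above, while the choice of optimizer is an assumption for illustration:

```r
compile(
  first_model,
  optimizer = optimizer_rmsprop(),    # the optimizer defines how weights are adjusted
  loss      = "binary_crossentropy",  # how the error on a batch is computed
  metrics   = "accuracy"              # reported for us, not used for training itself
)
```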
So that is what we specify with the compile function; then we use the fit function to actually do the training. What we need as inputs here are the model that we want to train, the training dataset, the validation dataset, and the number of epochs. As I said before, the model runs in batches, and when all the data has been processed, batch by batch, one epoch is finished and the next epoch starts again with the same data; this is continued for as many epochs as we determine in this function. Let me again catch up with the code; I hope I haven't forgotten anything. As you can see, there's a lot of background that I cannot talk about here, like what mini-batch stochastic gradient descent is, but it's all in the document for you to read later on, because I'm pretty sure I'll also forget some things.

One more note on this call: we actually wouldn't have to assign the result, because the model is modified in place, but assigning it to a variable makes Keras save the metrics for each epoch to this variable, so we can later have a look at the metrics at different epochs.
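And the fit call itself; a sketch, assuming the validation pipeline produced a dataset named validation_dataset, and using the 15 epochs mentioned below as an example:

```r
# assigning the result saves the per-epoch metrics for later inspection
diagnostics <- fit(
  first_model,
  training_dataset,
  epochs          = 15,
  validation_data = validation_dataset
)
plot(diagnostics)  # training/validation loss and accuracy curves
```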
So I'll try to run this now; let's see. This is the first epoch here; each step means calculating the loss and adjusting the weights on one batch, and when the epoch is done, on the right you can see the loss and accuracy on the training data, which is the data that is being used to change the weights. As I said before, it's a very small dataset, really a toy example; I know that you would usually do this on much larger datasets, but this is the reason why I can show it here at all.

To a question: yes, I think so, but I'm not 100% sure; it is much quicker on a GPU. I have a GPU at home and I can try to get access to that. I think I had prepared this, let me just try. So this is now my computer at home; it's running a normal consumer-grade graphics card, but that's already enough to do quite a lot of stuff. You see it's running now, and it is much faster: it's the same number of epochs, it's almost done, and I think the other one is still running. And this is an old two-core machine against a GTX 1080 Ti, I don't know if that tells you something; it used to be quite a decent graphics card two years ago. Now there's a new generation which, I've heard, has dedicated tensor cores, so the hardware is getting even more optimized for this kind of stuff; but that's not my area of expertise.

What you get as a result during this training process is also a plot of the validation accuracy and loss. As I said before, because I assigned the result to the variable diagnostics, I can later look at the diagnostics, for example to plot them. This here is the result of what I've done just now, and this is the one I had done before in the markdown. What you can see here is that after maybe 10 epochs we might have stopped already, because the values are not getting much better anymore.

By the way, one aspect which is very important, but not covered in this tutorial, is overfitting, which can of course happen quickly when you have this many parameters. Just one note here: what you should look at during the training is how the training curve and the validation curve behave. If, for the training curve, the accuracy keeps on increasing (or the loss keeps on decreasing), but the validation curve has already flattened out and is not improving anymore, then you're starting to overfit: the parameters are getting fitted more and more to the training data without any effect on the validation data. Remember, the validation data is not used for the training; it's just for looking at how this model performs on data other than the one used for training. If the accuracy on this data is not improving anymore, but it is improving on the training data, then you're basically overfitting.

In this specific case, however, we have to be careful with interpreting the result, because of this small toy example with very little data. If you try this yourself, the curves might look quite different from this one; there's quite a high vulnerability, so to say, to changes in certain parameters, because it's a very small dataset. You also see this in the fluctuations: these curves are smoothed, but if you look at single values, they fluctuate quite a lot, especially for the validation dataset, because it is even smaller, only 25 percent of the data.

To a question: yes, that's possible. As I said, this is the minimum example on a laptop; the other extreme would be running it on an Amazon Web Services GPU instance, for example, and then you can do pretty large trainings, but I haven't dipped into that field yet. Actually, there are smarter ways, which we also cannot cover in this tutorial: for example, you can configure the training with early-stopping mechanisms. You give it, say, 15 epochs, but set certain rules for early stopping, like: when the accuracy is not improving, or only improving slowly, then it stops early. So that's something you can do.

Now that we have fitted the model, we can use it for prediction, by simply using the predict function. The predict function just needs two arguments: our model, and whatever dataset we want to predict on; in this case, we first predict on the validation dataset.
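The prediction is then a single call; a sketch:

```r
# one probability in (0, 1) per image of the validation dataset
predictions <- predict(first_model, validation_dataset)
head(predictions)  # a one-column matrix: rows are samples, values are predictions
```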
Even more interesting than doing this with the validation data, though, is doing it on completely different data, which in our case is the test data set, which, if you remember, covers the rest of the image of this island.

So what we do now is use the model to predict on the actual test data, and this is what you see here. We again have to pre-process all the images. As of now we still assume that all the images have already been cut into subsets of the corresponding size and are sitting on our disk, so we only have to load them into tensors, and we do that the same way we have done before with the training and validation data: it looks a bit strange here, but using read_file, decode_jpeg, then convert_image_dtype, and so on. What we of course don't need this time is the shuffling, because we don't train on this data; we basically only transform the data into the format that we need. Then we do the same as before: we call the predict function with our model and, as the data set to predict on, the data from our test area. We again get a number of predictions, and in this case I have already turned these predictions into a map, which looks like this. I have not shown the code for that here, it is of course in the markdown, but what it basically does is this: it first turns each result into an array of the size of the corresponding image tile, with all cells holding the same value, namely the prediction value; it then makes polygons out of these arrays; and finally it puts them on the map. This is what the result looks like; I don't know if you can see it very well on the screen, sorry.

At first sight it doesn't look too bad, given that this is really a toy example with a quite small model and also a small data set. However, there are of course certain issues with it. The most prominent one, probably, and this is also what your question was referring to before, is the fact that we are predicting at the size of these tiles, so there are always edge cases where a tile contains both target and non-target, and our result therefore has a low spatial resolution. What we could do, for example, is decrease the size of the tiles, but that only makes sense up to a certain point, because we want to include the spatial context, and the more we reduce the tile size, the less spatial context we can include. We also create a huge overhead; sometimes the reassembling of all these tiles into one map takes longer than the whole prediction with the model. So what would actually be cooler than running this on tiles would be a pixel-wise classification, and this is what U-Nets do, and this will be the first extension of this basic workflow. Let me, spoiling a bit, just show you how the final prediction result would look at this finer level if you do the pixel-wise classification.

So, back to the prediction. Sure, but then I'm jumping; I've skipped everything that leads to this result. I was planning to give at least an outlook on how to get there. Oh yes, I forgot to mention: it's the same thing I did here as well. I excluded all the tiles that had a value below 0.5, and the rest of them still get a color corresponding to how high the value is; you could also show them all in one color.
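As a hedged sketch of what such a prediction pipeline could look like with the tfdatasets package, the folder name, tile size and batch size here are made-up illustrations, not the tutorial's exact code:

library(keras)
library(tensorflow)
library(tfdatasets)

# hypothetical folder holding the pre-cut test tiles on disk
test_files <- list.files("testarea_tiles", pattern = "\\.jpg$", full.names = TRUE)

test_dataset <- tensor_slices_dataset(test_files) %>%
  dataset_map(function(path) {
    img <- tf$image$decode_jpeg(tf$io$read_file(path))
    # convert_image_dtype also rescales pixel values to [0, 1]
    img <- tf$image$convert_image_dtype(img, dtype = tf$float32)
    # resize to the input size the model expects (128 x 128 is an assumption)
    tf$image$resize(img, size = c(128L, 128L))
  }) %>%
  dataset_batch(10L)   # no shuffling needed, since we only predict

# one prediction value per tile; turning these into polygons on a map is
# done separately, as described above
test_predictions <- predict(model, test_dataset)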
So, I think we have about 30 minutes left, with quite a lot of markdown left, but as I said before, I'm not planning to talk about all of it. I was planning to give at least an idea of how the pixel-wise classification works, because that is another form of model architecture which I guess is pretty interesting for us when we are doing remote sensing, so I will focus on that for the remainder. Are there any questions so far regarding this basic workflow?

Okay, I guess if you're familiar with Python, there's no reason to use another interface on top of it, but for me that was exactly the reason to start using it: that it is in R. I have a tiny little bit of experience in Python, but probably not enough for that; and under the hood it's all done in Python anyway.

Maybe let me just briefly summarize. As I said before, the more boring part is the data preparation, but other than that it's pretty straightforward. We have built our model with more or less these lines here; then we have prepared the data, which is a bit more complex, and as I said there are other, probably smarter ways to do that, I chose this one because I think it's quite intuitive to do it with the tfdatasets package; then the training is just these two functions; and the prediction is basically only one function. So that is pretty straightforward. There is a lot of background to these steps, especially to the training part, where I could only scratch the surface, but I do have a few side notes, and in the side notes there are also some additional links on where to go to get a bit more information. And again, just keep in mind that we have a very small data set, so this is not what you would normally do or publish in a paper; this is really just to show you something that works on a small machine and still produces results that can be visualized well.

There are certain things that can be done as extensions to this. One is, again, the pixel-wise classification, and then there are two things that I want to at least mention, because they are interesting if you're working with UAV data: they are methods for tackling small-data problems. These are data augmentation and using pre-trained networks. Data augmentation I can cover pretty quickly: it means that you take the images of your training data set and manipulate them, so you create synthetic new images. For example, you can flip them or rotate them, you can apply elastic transformations, and thereby emulate new data; you can also randomly change certain illumination or spectral conditions like hue, saturation, contrast, and so on. I have an example of this in the markdown; I will not show it now, but I just want to mention it. I've done it in this example and thereby increased my training data size by a factor of four, so I added three times the original data set size in augmented copies. A rough sketch of the idea follows below.
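As an illustration of the augmentation idea, not the exact code from the markdown, a flip and color-jitter step with the tf$image functions could look roughly like this; all parameter values and the training_dataset name are assumptions:

library(tensorflow)
library(tfdatasets)

# a minimal augmentation sketch: geometric flips plus slight random changes
# in brightness, saturation and contrast (values here are made up)
augment <- function(img) {
  img <- tf$image$random_flip_left_right(img)
  img <- tf$image$random_flip_up_down(img)
  img <- tf$image$random_brightness(img, max_delta = 0.1)
  img <- tf$image$random_saturation(img, lower = 0.9, upper = 1.1)
  img <- tf$image$random_contrast(img, lower = 0.9, upper = 1.1)
  # keep pixel values in the valid [0, 1] range after the color changes
  tf$clip_by_value(img, 0, 1)
}

# applied to a dataset of (image, label) pairs; the labels stay unchanged
augmented <- dataset_map(training_dataset, function(img, label) {
  list(augment(img), label)
})

Concatenating such augmented copies to the original data set, for example with dataset_concatenate, is one way the factor-four increase mentioned above could be achieved.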
Another technique that is useful when working with small data sets is using pre-trained networks. Again, I will not go into too much detail, but I would like to at least show the idea. The keras package comes with a number of well-established model architectures that you can simply load, and you can decide whether to load them with trained weights or not; you can sometimes also choose on which data set these weights should have been trained. That is a pretty interesting tool, I think. The function to use for this is application_ followed by a model name, in this case application_vgg16, because here we load the VGG16 model. This is a classic model, so to say; it's not state-of-the-art anymore because it's already quite old, but it's a pretty good model. Here we load its convolutional part and set the argument weights = "imagenet", so we load the model as it has been trained on the ImageNet data set. ImageNet is a huge data set, but the images in it are mainly animals, like cats and dogs, and everyday objects. We can still use some layers of a model that has been trained on this data in our use case, if we only use the first few layers, because, as I said before, these first few layers extract general features within images, like edges and certain shapes, not features that are too complex or too close to what the model has been trained on. And since in our case the target objects are also not too complex, as you have seen in the input images it is mainly this grass, it might be helpful to use the first few layers of a pre-trained network, then add our own dense layers, continue the training with this hybrid model, so to say, and freeze the weights of the first layers so that only the layers that we have put on top ourselves, the dense layers, are being trained. This is what I'm trying to show here: this is the model that we had before, and we replace the whole feature-extraction part by layers from a pre-trained model, in this case the pre-trained VGG16 model. That is what this sketch symbolizes, and this is how to do it: we load the model using the application_vgg16 function, then we freeze the weights, and then we extract the first 15 layers of this model, which is done here via $layers and then subsetting layers 1 to 15, and we write the result to the variable pretrained_model.
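A hedged sketch of this step, assuming the same 128 x 128 x 3 input size as before (the layer index 15 follows the description above; everything else is illustrative):

library(keras)

# load the VGG16 convolutional base with weights trained on ImageNet
vgg16 <- application_vgg16(
  include_top = FALSE,       # drop the ImageNet-specific dense layers
  weights = "imagenet",
  input_shape = c(128, 128, 3)
)

# freeze the weights so the pre-trained feature extractors are not changed
freeze_weights(vgg16)

# keep only the first 15 layers as our feature-extraction part
pretrained_model <- keras_model(
  inputs = vgg16$input,
  outputs = vgg16$layers[[15]]$output
)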
On top of this pretrained_model we then add just a flatten layer and the two dense layers as we had before, so we now get this hybrid model. If we then continue to train this model, only training the last two dense layers, and I've done that here, but only for six epochs, you can see that after three epochs we already have an accuracy on the training data of one hundred percent, which is very high, and between 92 and 98 percent on the validation data, so it converges pretty quickly. And this is what the result looks like. It's hard to compare it to the result from before, and again I wouldn't over-interpret the results anyway because of this small data set; this is just meant to show how the approach could work, so you can take more time to try it on your own and see how it works when you use pre-trained networks, or parts of pre-trained networks, to increase your accuracy. The good thing is: we don't have a lot of data, but we don't need to train all these feature extractors, we just use the pre-trained weights, which also makes the result a bit more generalizable, because this part has not been fitted to our specific data set at all, so it cannot be overfitted to it. And yes, it looks cleaner; we could look at some details, and I think the total accuracy would be higher, and you can also see that the model is much more certain about its decisions here, the values are higher, but again, I would be careful in interpreting single values and so on.
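Continuing the sketch from above, the dense head and the short training run could look roughly like this; the number of units, the learning rate and the dataset names are illustrative guesses, not the tutorial's exact values:

library(keras)

# put a flatten layer and two dense layers on top of the frozen VGG16 base
outputs <- pretrained_model$output %>%
  layer_flatten() %>%
  layer_dense(units = 256, activation = "relu") %>%
  layer_dense(units = 1, activation = "sigmoid")

hybrid_model <- keras_model(inputs = pretrained_model$input, outputs = outputs)

hybrid_model %>% compile(
  optimizer = optimizer_rmsprop(learning_rate = 1e-5),
  loss = "binary_crossentropy",
  metrics = "accuracy"
)

# only the new dense layers are trained; the pre-trained weights stay frozen
hybrid_model %>% fit(
  training_dataset,
  epochs = 6,
  validation_data = validation_dataset
)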
Okay, so this was just a very quick ride through data augmentation and using pre-trained networks; again, I hope you will find the html useful for getting more information on that. Now, as promised, let's talk about the pixel-wise classification. For this pixel-wise classification, or segmentation, we use an architecture called U-Net, which has been proposed by Ronneberger et al.; maybe I can quickly show you the original paper on that.

So, this is how the original architecture looks. The first part is actually pretty similar to what we've used before: it's a sequence of convolutional layers and max-pooling layers that, as I said, successively extract more and more complex features. If we would, for example, at this point after all these feature extractors, add a flatten layer and a few dense layers, then we would have a traditional image-recognition network: we could put a sigmoid activation function at the end of the dense layers and again get a number between zero and one for a binary classification problem, telling us whether or not the target is in the image. But the U-Net doesn't stop there; instead, this point is where the second half of the U starts. From here, the data is successively upsampled back to the original resolution of the input image. Let me get back to the html here; this is a very simple depiction with only three of these convolution blocks. So when you are down at the bottom of the U, so to say, you start upsampling the data again towards the original resolution, so that at the end you get a result that has the same resolution as the input. But the real trick is shown, or hidden, in these gray arrows, because the data is not just upsampled: at each step, after each upsampling, the corresponding layer of the contracting path, which is what this left part of the U is called, is concatenated to the corresponding layer of the expanding path, the right side of the U. Concatenated means they are literally just stacked together. This means that the following convolutional layers take into account the condensed information that has come out of all the convolution and max-pooling steps, all the context information that has been learned but that does not carry much location information anymore, because the max pooling reduces the resolution at each step. So the model uses this context information, but at the same time it also gets access to the corresponding higher-resolution layer of the contracting path, which at that step still holds the location information of the features that were extracted at that point of the network. This way, the model can better assign the detected features back to the locations where they have been detected. It's a bit difficult to explain, and probably also to understand, I admit, but the result is that we don't get just an upsampled image; we get an upsampled image that takes into account the location information about the features that have been extracted on the contracting path, the left side of the U. The final form looks a little bit like this, which is also why it's called U-Net: we have the contracting path, where a lot of abstract information is extracted, and the expanding path, where the images are resampled back to the original resolution, but these resampled outputs are combined with the location information from the contracting part to get a better localization of features within the resulting images. As a result, you can then produce a segmentation, or pixel-wise classification.

I have an example in this html of how to build such a model; this is how it looks. I will not go into every detail, but what you can see is that we have convolution blocks, each consisting of convolutional layers, two in this case, followed by a max-pooling layer. We have two of these blocks in this small example. Then comes what I call the bottom of the U, these two convolutional layers here, and then the expanding path starts. The upsampling is done by a transposed convolution; then comes the concatenation, which is what those gray arrows represent, where we glue together the upsampled output and the corresponding layer from the contracting path. In this example this is also done twice, so we have one concatenation here and one here, and each is followed by two convolutions that can then, as I said before, take into account not only the distilled information from the whole model but also the location information from the corresponding layer of the contracting path. A sketch of such a model follows below.
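Translated into keras R code, a minimal two-block version of this U shape could look like the following sketch; the filter numbers and the 128 x 128 x 3 input size are illustrative assumptions, not the html's exact values:

library(keras)

input <- layer_input(shape = c(128, 128, 3))

# contracting path: two convolution blocks, each ending in max pooling
c1 <- input %>%
  layer_conv_2d(filters = 32, kernel_size = 3, padding = "same", activation = "relu") %>%
  layer_conv_2d(filters = 32, kernel_size = 3, padding = "same", activation = "relu")
p1 <- c1 %>% layer_max_pooling_2d(pool_size = 2)

c2 <- p1 %>%
  layer_conv_2d(filters = 64, kernel_size = 3, padding = "same", activation = "relu") %>%
  layer_conv_2d(filters = 64, kernel_size = 3, padding = "same", activation = "relu")
p2 <- c2 %>% layer_max_pooling_2d(pool_size = 2)

# bottom of the U
b <- p2 %>%
  layer_conv_2d(filters = 128, kernel_size = 3, padding = "same", activation = "relu") %>%
  layer_conv_2d(filters = 128, kernel_size = 3, padding = "same", activation = "relu")

# expanding path: transposed convolution for upsampling, then the
# concatenation with the corresponding contracting-path layer (gray arrows)
u1 <- b %>% layer_conv_2d_transpose(filters = 64, kernel_size = 2, strides = 2, padding = "same")
u1 <- layer_concatenate(list(u1, c2)) %>%
  layer_conv_2d(filters = 64, kernel_size = 3, padding = "same", activation = "relu") %>%
  layer_conv_2d(filters = 64, kernel_size = 3, padding = "same", activation = "relu")

u2 <- u1 %>% layer_conv_2d_transpose(filters = 32, kernel_size = 2, strides = 2, padding = "same")
u2 <- layer_concatenate(list(u2, c1)) %>%
  layer_conv_2d(filters = 32, kernel_size = 3, padding = "same", activation = "relu") %>%
  layer_conv_2d(filters = 32, kernel_size = 3, padding = "same", activation = "relu")

# pixel-wise binary output: one sigmoid value per pixel
output <- u2 %>% layer_conv_2d(filters = 1, kernel_size = 1, activation = "sigmoid")

unet <- keras_model(inputs = input, outputs = output)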
We can also combine this idea with the idea of using pre-trained networks, as I have shown before. So what I've done here is use the layers of the pre-trained network for the contracting path, for the feature-extraction part, and then manually add the second half of the U. In a sketch it would look like this; again, it's not that well visible on the beamer, but I think you get the idea. We use this part of the model as a pre-trained model, again a model that has been trained on pictures of cats, dogs, pots, balloons, whatever, and add only the upsampling part, the expanding part, ourselves. With this architecture we have so far gotten quite nice results; this is also the one that I used for this final prediction result.

By the way, if you're following along, I hope you haven't already started training this U-Net, because that is the one thing that I would not do on a laptop; it takes a bit longer. Maybe I should have said this a bit earlier, sorry. It will work, I've tried it, but only for two epochs because I didn't want to wait any longer. It is, however, something that still performs easily on a GPU like the one I've shown before; it then takes only a few minutes. Maybe I can see if I can at least start this prediction; I'm just looking for the code chunks. Let me see if it works. I'm including all the augmentation that I've shown before and really running through it now. So, even the GPU is now complaining that it could be faster if more GPU RAM were available. Now it starts training; let's see how long this takes.

Okay, one last thing that I think is actually quite an important topic, but again I'm not going to talk about it in detail, is inspecting your network: shedding a bit of light into this black box that it still, to a certain degree, is. At least the feature-extraction paths, all these convolutional layers, are quite amenable to being visualized, so that you get an understanding of what your model has actually learned. What you can do, for example, and again I'm only showing the result more or less, is build a model out of only parts of your trained model, so that the outputs of this new model are the activations of certain layers. That way you get intermediate results and can see how your trained model responds to certain input images. This is what is done here. I'm not showing the code right now, but this is how the input image looks, for example, and here I have visualized, you can try this as well, the functions are in the html, the first four channels of all the convolutional layers in this net, so you can see the different forms of patches or patterns these layers react to. This is how to use this function, which is also defined in the code: you choose the image of which you want to see how the model layers react to it, you put in the model, and you put in a vector containing all the layers of which you want to see the results, and the channels of those layers. Here I have chosen layers two, three, five, six, and so on, and of each layer the channels one to four, so for all the convolutional layers I get the result of only the first four channels of each.
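The underlying trick, building a second model whose outputs are intermediate activations of the trained one, could be sketched like this; the layer indices and the plotting step are only for illustration, not the html's actual helper function:

library(keras)

# build a model that returns the activations of selected layers of the
# trained model (the indices 2 to 6 are arbitrary examples)
layer_outputs <- lapply(model$layers[2:6], function(layer) layer$output)
activation_model <- keras_model(inputs = model$input, outputs = layer_outputs)

# `img` is one preprocessed input image; add a batch dimension first
activations <- predict(activation_model, array(img, dim = c(1, dim(img))))

# e.g. plot the first channel of the first selected layer's activation map
image(activations[[1]][1, , , 1], col = gray.colors(64))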
Ah, it's done. I should have taken the time, but as you see, the GPU is already finished with this training, so that didn't take too long; again, this is, as I said, a quite small data set. For the example that I've shown here, we also did more training on larger areas, and then the training takes about half an hour on the machine that I've shown you, but it's still something that, at least at this scale, can be done on a kind of home computer. And this is again what the results look like.
Info
Channel: Tomislav Hengl (OpenGeoHub Foundation)
Views: 1,686
Keywords: OpenGeoHub, R spatial, Open GeoData, Geospatial Data
Id: 3wPgRS0XYjA
Length: 109min 23sec (6563 seconds)
Published: Wed Sep 16 2020