U-Net Paper Walkthrough

Captions
What is going on, guys, hope you're doing freakishly awesome. In this video we're taking a look at "U-Net: Convolutional Networks for Biomedical Image Segmentation". The paper is actually quite old, from 2015, but we're looking at it because it has been one of the most influential papers in segmentation.

We're just going to step through it. The focus is obviously going to be on the architecture rather than on the biomedical application they built it for, because I think for most of us that's not the interesting or important part. In the abstract they talk a little about the architecture, which consists of a contracting path to capture context and a symmetric expanding path that enables precise localization. If that doesn't say anything to you so far, don't worry, we're going to go through and understand the architecture. Anyway, they trained the network end to end, and it outperformed the previous best method, which at that time was a sliding-window convolutional network.

The typical use of a convolutional network is classification. We're used to that, with ResNets and the like: we input an image and get some class labels as output for what is actually in that image. But in many visual tasks, especially in biomedical image processing, the output should include localization, meaning a class label is supposed to be assigned to each pixel. If you're not familiar with semantic segmentation, here is an example of how that looks: at the top we have some image, and at the bottom each pixel of that image has been classified into a specific class, with each class mapped to a particular color.

They mention that Ciresan et al. (I haven't read that paper) trained a network in a sliding-window setup to predict the class label of each pixel. From my understanding, it works like this: to classify a single pixel, they take a crop around it, because obviously we need some context for what's going on, and run that crop through a separate CNN. Then they do that for every single pixel of the image, and you can imagine how expensive that must be to run and train. Here's the architecture figure; we'll skip it for now and come back to it. This sliding-window network is able to localize, i.e. to find the class for a particular pixel, and another benefit is that the training data in terms of patches is much larger than the number of training images, which is obviously a good thing in these biomedical applications, where it can be quite difficult to find training data. The resulting network won the EM segmentation challenge in 2012 using that sliding-window approach, but they mention some drawbacks, and the main one is that it's quite slow.
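To make that concrete, here is a minimal sketch of the sliding-window idea as I understand it; the patch size and the tiny classifier are my own illustrative choices, not Ciresan et al.'s exact design:

```python
# Sliding-window baseline: to label ONE pixel, crop a patch of context
# around it and run a small CNN classifier. (Illustrative sketch only.)
import torch
import torch.nn as nn
import torch.nn.functional as F

patch = 32  # hypothetical context window around the pixel being classified

classifier = nn.Sequential(
    nn.Conv2d(1, 16, 3), nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Flatten(),
    nn.LazyLinear(2),  # two classes, e.g. membrane / not membrane
)

image = torch.randn(1, 1, 512, 512)
padded = F.pad(image, [patch // 2] * 4, mode="reflect")

# One forward pass PER PIXEL: 512 * 512 = 262,144 CNN evaluations for a
# single image, on heavily overlapping (redundant) patches. That is the
# slowness the U-Net authors point out.
y, x = 100, 200  # classify the pixel at (row 100, col 200)
crop = padded[:, :, y : y + patch, x : x + patch]
logits = classifier(crop)  # shape (1, 2)
```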
So in this paper they build on a more elegant architecture: a fully convolutional network. Let's go to the figure. Here is the architecture of the paper, the U-Net architecture, which gets its name simply from its U shape. As an overview: we have the input image at the top left, and at the top right we have the output segmentation map. The input first goes through a contracting path, where it is downsampled, and then through an expansive path, where it is upsampled. Between the contracting and expansive paths there are skip connections, and we'll go into why they're there.

The input here is 572 by 572 with a single channel, because the images are grayscale, and the output in this case is 388 squared with two channels, because there are two classes. One thing to notice is that the output size is not the same as the input size. We'll go into this in more detail later, but the reason is that every convolution in the network is unpadded ("valid"), so each one trims the borders; at inference time they compensate by mirror-padding and tiling a larger input, which we'll get to.

So let's just follow the path. They first apply a 3-by-3 valid convolution, which is why the size here is reduced slightly, and they do that twice while bringing the number of filters to 64. They then downsample using max pooling with kernel size 2 and stride 2.
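Here's a minimal sketch of that first contracting step, assuming PyTorch; the layer choices follow the figure (two unpadded 3x3 convs with ReLU, 1 -> 64 channels, then a 2x2/stride-2 max pool):

```python
import torch
import torch.nn as nn

def double_conv(in_ch, out_ch):
    # Two "valid" 3x3 convs (padding=0), each followed by a ReLU.
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3),
        nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, kernel_size=3),
        nn.ReLU(inplace=True),
    )

block = double_conv(1, 64)                   # grayscale input -> 64 filters
pool = nn.MaxPool2d(kernel_size=2, stride=2)

x = torch.randn(1, 1, 572, 572)              # the figure's input tile
x = block(x)                                 # 572 -> 570 -> 568
print(x.shape)                               # torch.Size([1, 64, 568, 568])
x = pool(x)                                  # 568 -> 284, ready for the next block
```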
Then they use the same approach as at the beginning: two 3-by-3 convs with valid padding, while doubling the number of channels. So it's downsample, double the channels, two 3-by-3 convs, and they continue like that until they finally reach the expansive, upsampling path. There they upsample the feature map using a transposed convolution, bring in the skip connection, and concatenate along the channel dimension, so half the channels come from the skip connection and half from the upsampling. Then they use the same approach as in the contracting path, two 3-by-3 convs with valid padding, and repeat: upsample, two 3-by-3 convs, upsample, concatenating with a skip connection at every single upsampling step. Finally they apply two 3-by-3 convs once more and then a 1-by-1 conv, which doesn't change the spatial size in any way but changes the number of channels to however many classes they have, in this case two.

So that's what they do; maybe we want to understand a little more of why. You can imagine an intuitive alternative: take the image, send it through a 3-by-3 "same" convolution, and just keep doing that a bunch of times until you have your output. It's a fair question why we aren't doing it this way, preserving the input size the whole time. There are two main problems: first, we wouldn't be building a very large receptive field, and second, it's very expensive.

What I mean by receptive field: imagine we have some image and we apply a 3-by-3 conv. We look at a 3-by-3 region of that image with the weights of the kernel, do some computation, and that results in a single pixel after the convolution. So that output pixel has a 3-by-3 receptive field. If you stack those convolutional layers, the receptive field grows and grows, but unfortunately it doesn't grow fast enough. That's the reason we don't just stack 3-by-3 same convolutions, and it's also the reason for the contracting path.

So you can imagine that along this contracting path we're learning the "what": we're learning to summarize what is in the image. But unfortunately we're losing spatial information, the "where". That is where the skip connections come into play: they give us back that valuable "where" information.
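Here's a minimal sketch of one expansive-path step, assuming PyTorch and using the figure's deepest level: a 2x2/stride-2 transposed conv halves the channels and doubles the size, the skip connection is center-cropped to match, the two are concatenated along channels, and two valid 3x3 convs follow. The `center_crop` helper is my own:

```python
import torch
import torch.nn as nn

def center_crop(feat, target_hw):
    # Crop an equal border off `feat` so its spatial size equals `target_hw`.
    _, _, h, w = feat.shape
    th, tw = target_hw
    dy, dx = (h - th) // 2, (w - tw) // 2
    return feat[:, :, dy : dy + th, dx : dx + tw]

up = nn.ConvTranspose2d(1024, 512, kernel_size=2, stride=2)
convs = nn.Sequential(
    nn.Conv2d(1024, 512, 3), nn.ReLU(inplace=True),
    nn.Conv2d(512, 512, 3), nn.ReLU(inplace=True),
)

bottom = torch.randn(1, 1024, 28, 28)   # deepest feature map in the figure
skip = torch.randn(1, 512, 64, 64)      # matching contracting-path features

x = up(bottom)                          # (1, 512, 56, 56)
skip = center_crop(skip, x.shape[-2:])  # 64^2 cropped to 56^2
x = torch.cat([skip, x], dim=1)         # (1, 1024, 56, 56): half skip, half upsampled
x = convs(x)                            # (1, 512, 52, 52)

# The very last layer of the network is just a 1x1 conv mapping the final
# 64 channels to the number of classes (two in the paper):
head = nn.Conv2d(64, 2, kernel_size=1)
```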
One more thing: if you look at the actual sizes, a contracting-path feature map of 136 squared is skip-connected to an expansive-path map that is only 104 squared. So what they do is crop the 136-squared map to match the 104-squared one, exactly like the center-crop step in the sketch above; and the same goes for all of these skip connections, really: the sizes never match, so they have to crop to make them match. I guess a viable option would also be to use padding.

I hope that was a good introduction to the architecture; let's continue with the paper. The main idea, which we've kind of looked at already, is to supplement a usual contracting network by successive layers where pooling operators are replaced by upsampling operators. Then, in order to localize, high-resolution features from the contracting path are combined with the upsampled output, and a successive convolutional layer can learn to assemble a more precise output based on this information. You can imagine that this is the perfect combination of the "what" and the "where" of the original image. They continue: in their architecture the upsampling part also has a large number of feature channels, which allow the network to propagate context information to higher-resolution layers. That kind of makes sense: more feature channels allow it to propagate more information, I guess. As a consequence, the expansive path is more or less symmetric to the contracting path (as we saw, they're very similar), and it yields a U-shaped architecture.

One important thing to note here is that the network doesn't use any fully connected layers and only uses the valid part of each convolution, i.e. the segmentation map only contains the pixels for which the full context is available in the input image. I'm not entirely sure about the part that it "only contains the pixels for which the full context is available", but I get the first part: they only use valid convolutions, and that's also why, if we scroll up to the architecture again, the input is reduced slightly from 572 to 570 and then 568.
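Every size in the figure follows from that rule, so it's easy to check the numbers yourself. Just arithmetic, no framework needed: each valid 3x3 conv removes 2 pixels, each 2x2/stride-2 max pool halves the size, and each up-conv doubles it:

```python
# Reproduce every feature-map size in the figure, including the
# skip-connection mismatches that force the cropping.
s, down = 572, []
for level in range(5):            # 5 double-conv blocks, 4 pools between them
    s -= 4                        # two valid 3x3 convs: -2 pixels each
    down.append(s)
    if level < 4:
        s //= 2                   # 2x2 max pool, stride 2

print(down)                       # [568, 280, 136, 64, 28]

s = down[-1]
for skip in reversed(down[:-1]):  # expansive path, deepest level first
    s *= 2                        # 2x2 up-conv, stride 2
    print(f"crop skip {skip}^2 -> {s}^2, then two valid convs")
    s -= 4                        # two valid 3x3 convs after the concat
print(s)                          # 388, the output segmentation map size
```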
It continues with that pattern precisely because of the valid convolutions. They also mention that this strategy allows the seamless segmentation of arbitrarily large images by an overlap-tile strategy. I'm a little hesitant about exactly what they mean here, but looking at figure 2: you can imagine that the input image is very large, maybe too large to run through the network in one go, so they crop a region of it, this yellow part, and they also pad around the border. Specifically, they use a mirror-padding strategy: as you can see, the two sides are mirrors of each other, and the same at the top. The idea is that this blue region will be the input to the network, which is larger than the output, which is also why there's a mismatch. The reason they do it this way, from my understanding, is that they want the border pixels to have context: if you removed the padding, the borders would have no context for what's above or to the left. They just continue doing this for different parts of the image, the outputs are these different segmentation crops, and I guess they stitch those together in the end into the full image.

They then go on to say that another challenge in many cell segmentation tasks is the separation of touching objects of the same class. I'm not too familiar with the exact application they had, but I imagine they had different cells and ran into issues at the borders where cells were actually touching each other. So they proposed a weighted loss that really prioritizes getting the borders accurate.

Then they continue with the network architecture, and I've already explained parts of this. It consists of a contracting path (the left side) and an expansive path (the right side). The contracting path consists of the repeated application of two 3-by-3 convolutions (unpadded), each followed by a ReLU, and a 2-by-2 max pooling operation with stride 2 for downsampling. At each downsampling step they double the number of feature channels and, of course, halve the spatial size; I think this is very much inspired by VGG. Every step in the expansive path consists of an upsampling of the feature map, followed by a 2-by-2 up-convolution that halves the number of feature channels, a concatenation with the correspondingly cropped feature map from the contracting path, and two 3-by-3 convolutions, each followed by a ReLU. They also say the cropping is necessary due to the loss of border pixels in every convolution; as we saw previously, the sizes simply don't match, and you can't concatenate them without padding or cropping. At the final layer, a 1-by-1 convolution is used to map each 64-component feature vector to the desired number of classes.

One more important point: to be able to do the downsampling, you need a matching input size. They mention that to allow seamless tiling of the output segmentation map, it is important to select the input tile size such that all 2-by-2 max pooling operations are applied to a layer with an even x- and y-size. So you need to be a bit careful with the exact input shapes you choose, and they've done that with 572 as input and 388 as output.
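Here's a minimal sketch tying the overlap-tile strategy and the tile-size constraint together, assuming NumPy; `net` stands in for a trained U-Net that maps a 572-squared input tile to a 388-squared prediction, and the stitching details beyond the paper's mirror padding are my own simplifications:

```python
import numpy as np

def tile_size_is_valid(s, pools=4):
    """Check the paper's constraint: every max-pooled layer must be even."""
    for _ in range(pools):
        s -= 4           # two valid 3x3 convs before each pool
        if s % 2:        # pooling an odd size would silently drop a pixel
            return False
        s //= 2
    return True

def predict_large_image(image, net, tile_in=572, tile_out=388):
    """Overlap-tile inference: mirror-pad, predict tile by tile, stitch.

    Assumes image height/width are multiples of tile_out; real code would
    also mirror-pad the far edges to handle remainders.
    """
    assert tile_size_is_valid(tile_in)
    h, w = image.shape
    margin = (tile_in - tile_out) // 2              # 92 px of extra context per side
    padded = np.pad(image, margin, mode="reflect")  # the paper's mirror padding

    out = np.zeros((h, w), dtype=np.float32)
    for y in range(0, h, tile_out):
        for x in range(0, w, tile_out):
            tile = padded[y : y + tile_in, x : x + tile_in]
            out[y : y + tile_out, x : x + tile_out] = net(tile)
    return out
```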
We've now covered the main parts of the paper, in my view, so we'll go a bit faster. They also mention that, due to the unpadded convolutions, the output image is smaller than the input by a constant border width. On the output they used a softmax with a cross-entropy loss, which is also why they had two channels as output. You can imagine that instead of doing it that way, you could have a sigmoid on just one single channel and use binary cross-entropy instead; I'm not sure why they didn't do that, but it works just as well, I guess.

In the remaining part of the paper they go into the data augmentation they used, and they used quite a lot of it, because they only had around 30 training images, which is not very much data. They also go through the results, which are tied to the biomedical application; I don't know too much about that, so we're going to skip those two parts, but obviously they had good results.

Anyway, I hope that wasn't too fast and that I captured my understanding of the paper. I hope this was useful, thank you so much for watching the video, and I hope to see you in the next paper review.
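As a concrete footnote to the loss-function remark in the walkthrough: a minimal sketch, assuming PyTorch, of the paper's two-channel softmax cross-entropy next to the one-channel sigmoid alternative mentioned; for two classes they give the same loss up to reparameterization:

```python
import torch
import torch.nn.functional as F

logits2 = torch.randn(1, 2, 388, 388)        # paper's setup: 2-channel output
target = torch.randint(0, 2, (1, 388, 388))  # per-pixel class labels

loss_ce = F.cross_entropy(logits2, target)   # softmax + cross-entropy

# Alternative: a single channel whose logit plays the role of z1 - z0,
# since softmax over two logits reduces to a sigmoid of their difference.
logits1 = logits2[:, 1] - logits2[:, 0]      # shape (1, 388, 388)
loss_bce = F.binary_cross_entropy_with_logits(logits1, target.float())

print(loss_ce.item(), loss_bce.item())       # matching values (up to float error)
```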
Info
Channel: Aladdin Persson
Views: 15,332
Keywords: UNET paper explained, UNET architecture, Unet paper, Unet
Id: oLvmLJkmXuc
Length: 19min 55sec (1195 seconds)
Published: Thu Jan 21 2021