Implementing the original U-Net from scratch using PyTorch

Captions
Hello everyone and welcome back. This episode is a bit different from the others: we are going to take a look at U-Net and see how it is implemented in PyTorch. We will use the original U-Net paper. I've seen a lot of implementations of this paper and different kinds of U-Nets floating around on the internet, but all of them differ from the original paper, so I decided to implement it from scratch exactly like the paper itself. As you can see in the background, the paper is from 2015, so it's quite old, but it is still a very good paper and a lot of people use it for segmentation problems. If we scroll down and read the introduction, it says that convnets are generally used for classification tasks, but here they provide a kind of network that you can use for segmentation tasks.

This is the network architecture, and obviously there are a lot of terms you should know about: what convolutions are, what strides are, what padding is, what max pooling is, what a transposed convolution is, and what a ReLU activation function is. I'm sure many of you have been using U-Nets for quite a while but have never tried to implement one on your own, and that is what I'm going to show you in today's video; I'm not going to explain those basic terms, though. One thing you should remember is that a lot of people call the up-convolution a "deconvolution", which is kind of wrong, so we will use the term up-convolution.

This is the architecture they provide, and there is also an example: you can see a patch of the image, and the resultant image is smaller than the original. It would only stay the same size if we added padding to the original image; without padding, as stated in the original implementation, the input image is of size 572 x 572 and the output is 388 x 388. It's called U-Net because it's U-shaped, obviously; everyone says that. Follow the figure: the first image is a single-channel 572 x 572 image; we apply a convolution layer to it and the image size reduces because no padding is used; then we apply a convolution layer again and the size reduces again. We are applying two convolutions one after the other, so we can wrap that in some kind of function called double_conv, or two convolutions, whatever you want to call it. Then we apply a max pool, then two convolutions again, and we repeat this for a while. At the end we start expanding, so we apply up-convolutions. You will see a lot of implementations where they have not applied up-convolutions; I would say that's not the real implementation, because they have used bilinear upsampling instead. With up-convolutions you are learning filters, while bilinear upsampling just applies a fixed function to make the image bigger, so that's not what we are going to do. This paper is quite straightforward, and if you scroll down a little more you will see a couple of paragraphs on the network architecture, so we are going to follow that and try to implement it.
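To see why the size keeps shrinking, here is a quick sketch of my own (not typed in the video): an unpadded 3x3 convolution trims one pixel from each border, so two of them take 572 down to 568.

```python
import torch
import torch.nn as nn

x = torch.rand(1, 1, 572, 572)           # batch, channels, height, width
conv1 = nn.Conv2d(1, 64, kernel_size=3)  # padding defaults to 0 (unpadded)
conv2 = nn.Conv2d(64, 64, kernel_size=3)
print(conv1(x).shape)                    # torch.Size([1, 64, 570, 570])
print(conv2(conv1(x)).shape)             # torch.Size([1, 64, 568, 568])
```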
Obviously I'm not going to implement the convolution layers themselves; they are already done in PyTorch. We are just going to see how we can build this architecture using PyTorch. I have created a simple file called unet.py, and here I will implement the network. Let's first import the necessary stuff: import torch and import torch.nn as nn. Then we write a simple class UNet which inherits from nn.Module, and now we can start implementing this class.

So what do we have? Let's go back to the paper and scroll up a little. The first thing we have is the input image and the two convolutions. At the top you can see the number of channels: the one-channel image is converted to 64 channels, 64 channels to 64 channels, then we go to 128, 256, and so on; we keep increasing the number of channels in the first part, and in the second part we decrease them. Now we go back to our code, and we can write super().__init__() here.

One thing we will need is max pooling, and max pooling is not something that is learned, so we can just define it once and use it several times. Let's read the network architecture section of the paper a little more. It says the network consists of two parts, a left side and a right side: the contracting path and the expansive path. The contracting path follows the typical architecture of a convolutional neural network: it consists of the repeated application of two 3x3 convolutions (unpadded), each followed by a rectified linear unit, and a 2x2 max pooling operation with stride 2 for downsampling. That's why you should know all these terms. So we write self.max_pool_2x2 = nn.MaxPool2d(kernel_size=2, stride=2), and this is our max pooling layer.

Now we also need a double convolution layer, so I'm going to create a function called double_conv which takes input channels and output channels. This is a sequential network, so I'm going to say conv = nn.Sequential(...) and define the different layers inside it. nn.Conv2d takes arguments like input channels, output channels, and kernel size, so here it will be in_c, out_c, and a kernel size of 3, since the paper says two 3x3 unpadded convolutions. We need another layer, but before that the paper also says each convolution is followed by a rectified linear unit, so let's add the ReLU first, which is very simple: nn.ReLU(inplace=True). Then you have one more convolution layer, so I'm going to select both of these, paste them, and here instead of in_c it will be out_c to out_c, kernel size 3 again, and we return the conv layers. This is the first and most basic part, the double convolution, so we are done with that. Now we can implement the first part of the U-Net very easily, from the input down to the bottom of the U, by creating some layers based on this double convolution with some input channels and some output channels.
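Collected in one place, the helper described in this part looks roughly like this (a sketch; the exact variable names are my reading of what is typed on screen):

```python
import torch.nn as nn

def double_conv(in_c, out_c):
    # Two unpadded 3x3 convolutions, each followed by a ReLU,
    # as the paper describes for the contracting path.
    return nn.Sequential(
        nn.Conv2d(in_c, out_c, kernel_size=3),
        nn.ReLU(inplace=True),
        nn.Conv2d(out_c, out_c, kernel_size=3),
        nn.ReLU(inplace=True),
    )
```

Each call produces one of the paired-convolution blocks in the figure, e.g. double_conv(1, 64) for the very first block.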
I'm going to say self.down_conv_1 = double_conv(1, 64): it's a down convolution, and if you look at the figure the number of input channels is 1 and the number of output channels is 64, so that's what we have to do. Let's copy this line to create a few more of these. How many do we need? Going back to the figure, there is one double convolution here, one here, then two, three, four, and five, so we need five double convolutions, and 64, 128, 256, 512, 1024 are the numbers of filters; you can see they just keep doubling. So let's create five of these: down_conv_2 is 64 to 128, down_conv_3 is 128 to 256, down_conv_4 is 256 to 512, and down_conv_5 is 512 to 1024. This is what we have so far. Uh-oh, it won't work this way: I forgot to define the __init__ function, and all of this goes inside it. Okay, now it's fine.

Then def forward(self, image): it takes one image and runs the forward pass. Let's write the encoder part here. Your first output, x1, will come from self.down_conv_1, which takes the image, and after each down convolution you need to apply max pooling, so x2 = self.max_pool_2x2(x1), which takes x1, the output of down convolution 1. Now we have to repeat this, so let's write one more: x3 = self.down_conv_2(x2), and x4 = self.max_pool_2x2(x3). So far so good; it looks fine to me. We continue with x5, x6, x7, x8, and x9. How many down convolutions have we applied? down_conv_1, down_conv_2, down_conv_3; we need two more, so down_conv_4 takes x6 and gives x7, and down_conv_5 takes x8 and gives x9. It's just that the output of one layer goes into the next one.

Now let's try to run this. In the forward pass the expected input is batch size, channels, height, and width, so let's assume we have a random image of size 572 x 572, because that's the input size of the network in the paper: image = torch.rand(1, 1, 572, 572), with batch size 1 and 1 channel. I initialize the model, model = UNet(), and then call model(image) so it runs the forward pass. We have everything we need, so let's print a few things: print(x9.size()), and also the size after the first double convolution, print(x1.size()). Let's run it. "module 'torch.nn' has no attribute 'MaxPool'": okay, it should be MaxPool2d instead. Let's see; we still have some problems: the down convs were all numbered one, so let's fix them to two, three, four, and five. Okay, now it works. We printed the sizes after the first down conv and after the fifth down conv: the first one gives you batch size 1, which is 1 obviously, then 64 filters, and the image has become smaller, 568 x 568.
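Here is a minimal sketch of the model so far, reusing the double_conv helper from the previous sketch and the names from the video:

```python
import torch
import torch.nn as nn

class UNet(nn.Module):
    def __init__(self):
        super().__init__()
        # pooling has no learned parameters, so one instance is reused
        self.max_pool_2x2 = nn.MaxPool2d(kernel_size=2, stride=2)
        self.down_conv_1 = double_conv(1, 64)
        self.down_conv_2 = double_conv(64, 128)
        self.down_conv_3 = double_conv(128, 256)
        self.down_conv_4 = double_conv(256, 512)
        self.down_conv_5 = double_conv(512, 1024)

    def forward(self, image):
        # encoder: double conv then pool, five times (no pool after the last)
        x1 = self.down_conv_1(image)   # 64 channels, 568 x 568
        x2 = self.max_pool_2x2(x1)
        x3 = self.down_conv_2(x2)
        x4 = self.max_pool_2x2(x3)
        x5 = self.down_conv_3(x4)
        x6 = self.max_pool_2x2(x5)
        x7 = self.down_conv_4(x6)
        x8 = self.max_pool_2x2(x7)
        x9 = self.down_conv_5(x8)      # 1024 channels, 28 x 28
        return x9

image = torch.rand(1, 1, 572, 572)
model = UNet()
print(model(image).size())             # torch.Size([1, 1024, 28, 28])
```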
Comparing with the paper, we have 64 filters and the image size is 568 x 568 there too, so so far so good. Going back to the terminal, we can see that after the fifth double convolution our image size is 28 x 28 and we have 1024 filters; counting one, two, three, four, five in the figure, the fifth one has 1024 filters and an image size of 28 x 28, so we are good to go.

Now, in the U-Net paper they also do something extra around this upsampling, the up-convolution: when we do an up-convolution we also take the output from one of the previous double convolutions, crop it to match the size of the up-convolution output, and concatenate the two before passing them through a double convolution again. We do this at four stages. The first one is after the first double convolution, so let's make a note of that: this output is passed to the second part of the network, so let's just mark it with a comment (I don't need the size print anymore). Then we have the max pooling, then the two convolutions again, and that output is again passed on; one more max pooling and two convolutions, passed on again; one more max pooling and two convolutions, also passed on; and one more max pooling and the output of the two convolutions, 512 channels, which also goes to the next part. So now we know that in the decoder part of the network we need x1, x3, x5, and x7; we don't need x8, and we need x9 but in a different way. And notice that after x9 there is no max pooling, because it's not there in the paper, so we don't have to worry about it.

Now comes the second part, and the second part is mostly about transposed convolutions. Convolutions decrease the size if you don't use any padding, and transposed convolutions increase the size depending on how you use the strides. To use transposed convolutions you can use ConvTranspose2d from PyTorch. Let's write the first up-convolution: you have the 28 x 28 image, and from it (look at the blue part only) you are getting an image of size 56 x 56 which has 512 filters, channels, sorry. So we write self.up_trans_1 = nn.ConvTranspose2d(...), and you need the input channels, output channels, kernel size, and stride. You have to roughly double the size of the image. Is the total number of input channels 1024 or 512? Let me see: you have 28 x 28 and you are coming from 1024 channels, so the input is 1024. There is one more thing here: at the next stage the whole input is again 1024 channels, but only the blue part, 512 channels, comes from this up-convolution; the other 512 come from the cropped double convolution output. So we write input channels 1024 and output channels 512. Now the kernel size and stride: in the network architecture section they have written the kernel size, which is 2, and the stride is 2, so you have to work out the correct number for the stride.
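A quick check of the doubling behaviour (my own snippet, not shown on screen): with no padding, the output size of a transposed convolution is (in - 1) * stride + kernel_size, so a 2x2 kernel with stride 2 takes the 28 x 28 map to exactly 56 x 56.

```python
import torch
import torch.nn as nn

up_trans_1 = nn.ConvTranspose2d(in_channels=1024, out_channels=512,
                                kernel_size=2, stride=2)
x9 = torch.rand(1, 1024, 28, 28)  # shape of the fifth double conv output
print(up_trans_1(x9).shape)       # torch.Size([1, 512, 56, 56])
```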
So this is up_trans_1: 1024 input channels, 512 output channels, a kernel size of 2, and a stride of 2. Let's apply it. This becomes the decoder part of the forward pass, and now we can use just x, because we don't need to keep intermediate names here: x = self.up_trans_1(x9), and we can do print(x.size()). Hmm, this looks good; let's run it. In the end you're getting 512 channels and a 56 x 56 image.

Now the more interesting part comes: you have x, and you have to concatenate x with x7, so let's also print the size of x7 (and remove the other prints). We got 56 x 56, but the size of x7 is 64 x 64, so obviously you cannot combine them directly. There are multiple methods: you can pad this 56 x 56 image, or you can crop that 64 x 64 image with some kind of center cropping. In the original paper they cropped it, so that's what we are going to do: we are going to crop the image. I'm going to write a simple function, crop_img (it should rather be called crop_tensor, but okay). We have two tensors, the original tensor and a target tensor, and we make the original tensor the same size as the target by cropping; it has a lot of assumptions built in. target_size will be target_tensor.size()[2]; these are square images, so dimensions two and three are the same (it's zero, one, two, three: remember, batch size, channels, height, width). tensor_size is the same expression for the input tensor, and then we can calculate some kind of delta: delta = tensor_size - target_size. I'm assuming my tensor size is always larger than the target size, because we are only ever cropping. Then delta = delta // 2, so if delta was 8 it becomes 4, and we return tensor[:, :, delta:tensor_size - delta, delta:tensor_size - delta]: all batches, all channels, then delta to tensor_size minus delta for the height and the same for the width. Now we can create a new tensor y = crop_img(x7, x), cropping x7 to the same size as x, and also print(y.size()). "local variable 'tensor_size' referenced before assignment": okay, a typo in tensor_size; this should work now. Okay, so this was our original upsampling output, then we have x7, and this is the cropped x7. Great.

Now what we have to do is apply those same old two convolutions. How many double convolutions are we applying in the decoder? One, two, three, and we are applying one, two, three, four up-convolutions. We have up_trans_1 here, so let's write the network first, and then we apply a double convolution. I called the earlier ones down_conv, so I can call these up_conv; it's still just a double convolution, so we could actually try to create one function for the whole step and call it up_conv, why not. Let me copy the double_conv function, and let's see if we need the ReLU activation; I think we do. The paper says every step in the expansive path consists of an upsampling of the feature map followed by a 2x2 convolution that halves the number of feature channels, a concatenation with the correspondingly cropped feature map from the contracting path, and two 3x3 convolutions, each followed by a ReLU.
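The cropping helper, as I understand it from the video (a center crop that assumes square feature maps and that the tensor is never smaller than the target):

```python
def crop_img(tensor, target_tensor):
    # Center-crop `tensor` (B, C, H, W) to the spatial size of `target_tensor`.
    # Assumes square maps and an even size difference, which holds for
    # every skip connection in this network.
    target_size = target_tensor.size()[2]
    tensor_size = tensor.size()[2]
    delta = (tensor_size - target_size) // 2
    return tensor[:, :, delta:tensor_size - delta, delta:tensor_size - delta]
```

For the first skip connection, y = crop_img(x7, x) trims the 64 x 64 map down to 56 x 56 so it can be concatenated with the upsampled tensor.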
Okay, so we cannot actually create one single function for this, because we have to apply the transposed convolution first and then the double convolutions, so I'm going to keep them separate. I'll say self.up_conv_1 = double_conv(1024, 512): the up-transpose gives you 512 channels, but you combine it with the cropped image, so your channel count has increased back to 1024. In the forward pass I can say x = self.up_conv_1(torch.cat([x, y], 1)), concatenating along dimension 1, the channel dimension, and that's it. Now we can print the last size and see if this works. No, it shouldn't be x7 inside the cat, it should be y, because that is the cropped x7. Okay, we got a 52 x 52, 512-channel image; let's check that against the paper. Are we getting the same size? Yes, 52 x 52 with 512 channels. Awesome.

Now we just repeat this a few times, so I will take all of this. How many more of these do we need? Three, so one, two, and three. Let's change the variables: this was the first one, so these become the second, third, and fourth. And let's sort out the sizes: up_trans_1 was 1024 to 512, so up_trans_2 will have input channels 512 and output channels 256, with up_conv_2 = double_conv(512, 256); up_trans_3 will be 256 to 128, combined with the 128-channel skip to give up_conv_3 = double_conv(256, 128); and up_trans_4 will be 128 to 64 with up_conv_4 = double_conv(128, 64). In the end you should get a 64-channel image, and then you have an output layer: self.out is also a convolution, nn.Conv2d, with in_channels 64 and out_channels 2, as in the paper; if you have multiple objects to segment you can just increase the number of output channels. The kernel size is 1, so it's a 2D conv with a kernel size of 1.

Now back to the forward pass: we have up_trans_1 applied to x9, so let's copy that block. We apply up_trans_2 on x, then crop and concatenate x5 and apply up_conv_2, and we copy this two more times, with up_trans_3/up_conv_3 and up_trans_4/up_conv_4 taking x3 and x1. We wrote these functions so they take care of everything for us; we just need to repeat them. In the end I'm getting a 64-channel, 388 x 388 image, and we are done. Now we just add the output layer, x = self.out(x), and return x. Let's print this too. Awesome: we are getting a 388 x 388 image with two channels, batch size 1. One channel will be foreground and one channel will be background, or, since this is a two-class problem, you could also just use a single channel at the end.

So this is what you have now: the full network, the encoder part and the decoder part, and now you understand how the original U-Net is implemented. I've seen a lot of people write about U-Nets, but no one shows how they are really implemented; this is the original implementation. You have to remember to use the transposed convolution, not bilinear upsampling. And one more thing: you can see that the encoder part is nothing but a simple convolutional neural network of the kind you have been using for a long time now.
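Putting the pieces from the video together, the full model reads roughly like this (a sketch under the naming assumptions above, reusing the double_conv and crop_img helpers defined earlier; I have folded the x2/x4/x6/x8 pooling intermediates inline, which the video keeps as separate variables):

```python
import torch
import torch.nn as nn

class UNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.max_pool_2x2 = nn.MaxPool2d(kernel_size=2, stride=2)
        # contracting path
        self.down_conv_1 = double_conv(1, 64)
        self.down_conv_2 = double_conv(64, 128)
        self.down_conv_3 = double_conv(128, 256)
        self.down_conv_4 = double_conv(256, 512)
        self.down_conv_5 = double_conv(512, 1024)
        # expansive path: the transposed conv halves the channels and the
        # concatenation with the cropped skip doubles them again
        self.up_trans_1 = nn.ConvTranspose2d(1024, 512, kernel_size=2, stride=2)
        self.up_conv_1 = double_conv(1024, 512)
        self.up_trans_2 = nn.ConvTranspose2d(512, 256, kernel_size=2, stride=2)
        self.up_conv_2 = double_conv(512, 256)
        self.up_trans_3 = nn.ConvTranspose2d(256, 128, kernel_size=2, stride=2)
        self.up_conv_3 = double_conv(256, 128)
        self.up_trans_4 = nn.ConvTranspose2d(128, 64, kernel_size=2, stride=2)
        self.up_conv_4 = double_conv(128, 64)
        self.out = nn.Conv2d(64, 2, kernel_size=1)  # 2 classes, as in the paper

    def forward(self, image):
        # encoder; x1, x3, x5, x7 are kept for the skip connections
        x1 = self.down_conv_1(image)
        x3 = self.down_conv_2(self.max_pool_2x2(x1))
        x5 = self.down_conv_3(self.max_pool_2x2(x3))
        x7 = self.down_conv_4(self.max_pool_2x2(x5))
        x9 = self.down_conv_5(self.max_pool_2x2(x7))
        # decoder: upsample, crop the skip, concatenate, double conv
        x = self.up_trans_1(x9)
        x = self.up_conv_1(torch.cat([x, crop_img(x7, x)], 1))
        x = self.up_trans_2(x)
        x = self.up_conv_2(torch.cat([x, crop_img(x5, x)], 1))
        x = self.up_trans_3(x)
        x = self.up_conv_3(torch.cat([x, crop_img(x3, x)], 1))
        x = self.up_trans_4(x)
        x = self.up_conv_4(torch.cat([x, crop_img(x1, x)], 1))
        return self.out(x)

image = torch.rand(1, 1, 572, 572)
print(UNet()(image).size())  # torch.Size([1, 2, 388, 388])
```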
Just remember that you can swap that encoder out for any kind of network you want. If you want a pretrained model, you can replace this encoder with that pretrained model; you just have to take care of the concatenations at the end. Or you can write a generic decoder that works with different kinds of models, for example a decoder that works with ResNets, so you can have a ResNet-18 or ResNet-34, a small network pretrained on ImageNet, as the encoder and write a decoder for it, which is quite easy once you understand how this thing is actually implemented.

So that's the end of the video. This is also covered in my book, along with a lot of other things, so if you're interested, buy it after reading the description. If I have done something wrong here, let me know; I make mistakes from time to time, so it's possible I have made some mistakes in this video too, but I hope not. And if you have suggestions on what you want to learn next, let me know. Today we implemented U-Net from the paper, and we will be doing more videos like this in the future, so stay tuned. Thank you very much; if you have not subscribed, subscribe now, and like the video if you liked it. Thank you very much and see you next time. Goodbye!
Info
Channel: Abhishek Thakur
Views: 32,370
Keywords: machine learning, deep learning, artificial intelligence, abhishek thakur, unet, how to implement unet, unet original implementation, unet pytorch, pytorch implement unet, pytorch
Id: u1loyDCoGbE
Length: 43min 37sec (2617 seconds)
Published: Sun Jun 21 2020