Pix2Pix Paper Walkthrough

Captions
In this video we're going to take a look at the Pix2Pix paper, and then in the next video, yeah, you already know it, we're going to implement it from scratch, because that's just what we do. [Music]

All right, so the title is "Image-to-Image Translation with Conditional Adversarial Networks". What's pretty cool about this is that you have some input and the model maps it to some output, which is a bit different from the normal GANs we've looked at previously, where you just have a latent noise vector and generate an entire image from it. Basically, what they were able to do is take a segmentation map to a pretty convincing real scene, a satellite image to a map, say a Google Maps image, and vice versa, and also labels to building facades, colorization of images, and things like that.

What you might be wondering is why they didn't just use a regular CNN, where you would input the source image and have the CNN learn the mapping directly; we'll get to that. Anyway, they made a conditional adversarial network, and it's conditional because you send in an image rather than the latent noise we talked about. The idea is that we don't just learn the mapping, which we would if we used a plain ConvNet, but we also learn a loss function. That can be a bit confusing at first, what does it mean to learn a loss function, but the loss function is the GAN itself: the discriminator is inherently learning a loss function instead of us picking a specific hand-crafted one. The whole idea of this paper is that we don't have to hand-engineer our loss functions, because that is done inherently by the network. If we take the naive approach and just minimize something like mean squared error, which they tried, the Euclidean distance tends to produce blurry results. So you can think about what we actually want from a loss function, and it would be desirable if we could instead specify something like "make the output indistinguishable from reality". That's a funny way of formulating a loss function, but that's exactly what GANs do, and that leads us down the path of using a GAN instead of a plain regression loss.

All right, I'm going to try to keep this short, so I'll just go through the most relevant parts. They use a standard GAN loss, so they didn't use something like WGAN-GP; I actually tried that on this task and it didn't work well, which is what the authors reported as well. As you can see in the loss, the discriminator takes in both x and y: if we go back up to the images from the beginning, it takes the input x and the target y side by side, and they are concatenated across the channels before being fed to the discriminator.
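As a small illustration, here is a minimal sketch (assuming PyTorch, which is what we'll use for the implementation) of feeding the input and the target to a conditional discriminator by concatenating them along the channel dimension; `disc` here is a hypothetical discriminator module, not the paper's exact code:

```python
import torch

x = torch.randn(1, 3, 256, 256)      # input image, e.g. a segmentation map or satellite photo
y = torch.randn(1, 3, 256, 256)      # target image (real) or the generator's output (fake)

d_input = torch.cat([x, y], dim=1)   # concatenate across the channels -> shape (1, 6, 256, 256)
# d_out = disc(d_input)              # the conditional discriminator sees both images side by side
```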
That is also why they call this a conditional GAN: the generator is already conditioned on the input, but the discriminator is now also conditioned on the input. This is in contrast to how it's normally done, where the discriminator only sees the output image. So in the full objective they use that standard GAN loss we just looked at, where both x and y go into the discriminator, but they also add an additional term: an L1 loss between the target and the generator's output. They found it useful to use both, and the reason they use L1 instead of L2 is, as we saw before, that L2 produces blurry results, which L1 doesn't suffer from to the same extent.
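Putting the two terms together, here is a hedged sketch (again assuming PyTorch) of what the generator's objective could look like: a standard non-saturating GAN loss where the discriminator sees x and y concatenated, plus the L1 term. `gen` and `disc` are placeholder modules, BCEWithLogitsLoss folds the discriminator's sigmoid into the loss, and λ = 100 is the weighting reported in the paper:

```python
import torch
import torch.nn as nn

bce = nn.BCEWithLogitsLoss()   # GAN loss on the discriminator's raw outputs (sigmoid folded in)
l1 = nn.L1Loss()               # the additional reconstruction term
LAMBDA = 100                   # weight on the L1 term (value reported in the paper)

def generator_loss(gen, disc, x, y):
    y_fake = gen(x)                                      # translated image G(x)
    d_fake = disc(torch.cat([x, y_fake], dim=1))         # conditional discriminator on (x, G(x))
    adversarial = bce(d_fake, torch.ones_like(d_fake))   # try to make the output look real
    reconstruction = l1(y_fake, y)                       # L1 keeps outputs sharp (vs. blurry L2)
    return adversarial + LAMBDA * reconstruction
```

The discriminator's own loss would then be the usual real/fake classification on the same concatenated inputs.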
For the actual model, the generator they use is a U-Net, or rather a small variant of it, but the idea is the same. Hopefully you know about U-Net or have watched my previous video on it, but if you haven't: you take an image, run some conv layers, downsample, run some more conv layers, downsample again, do a couple more conv layers, then upsample using something like a transposed convolution, run some conv layers, upsample again, and so on, so you end up with this U shape, which is why it's called U-Net. In the original U-Net, each of the "rightward" steps is a double conv, the downward steps are either a stride-2 convolution or a pooling layer, and the upward steps are transposed convolutions (or an upsampling layer). The Pix2Pix generator differs in a couple of ways: each step is just a single conv with a stride of two, so the downsampling is folded into the conv itself rather than having separate downward steps, and the upward steps use transposed convolutions; and whereas U-Net only goes down to a feature map of roughly 30 by 30, the Pix2Pix generator keeps downsampling until it reaches a 1 by 1 feature map before it starts to upsample. In between the encoder and the decoder there are also skip connections: we take the feature maps from the encoder, the left half of the U, and concatenate them with the corresponding layers in the decoder. The whole idea of the U is that in the first half we learn what is in the image by building good features, and in the upward half we learn where things are in the image. We'll look at more details later, and the implementation in the next video should make it very clear.

Both the generator and the discriminator use Conv-BatchNorm-ReLU blocks, and here it's actually a bit confusing: they say ReLU, but they also mean leaky ReLU, so leaky ReLUs are used as well; we'll see the details of that later. For the discriminator they use something called a PatchGAN. They design an architecture, which they call a PatchGAN, that penalizes structure at the scale of patches: this discriminator tries to classify whether each N-by-N patch in an image is real or fake. What's more common is that you take an image, send it through the discriminator, and get a single scalar value between 0 and 1. What they did instead is output an entire grid of values, say a 3 by 3 grid (that's not what they actually used, but you can imagine it), where each value in the grid is between 0 and 1. If you look at what a single value in that grid is able to see in the original image, it corresponds to a patch that is much larger, maybe something like 20 by 20, so that single output value is responsible for judging a 20 by 20 patch of the input. Why is this advantageous? It has fewer parameters, it runs faster, and it can be applied to arbitrarily large images, and that's the case precisely because each output only looks at a single patch: you can resize the image as you want, since every output value still depends only on a small patch of it. That's obviously much cheaper than sending the image through a bunch more conv layers to reduce it to a single scalar.
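To make the grid-of-patches output concrete, here is a rough PatchGAN-style sketch (assuming PyTorch). The layer counts and channel sizes below are illustrative only, not the paper's exact 70-by-70 configuration:

```python
import torch
import torch.nn as nn

# Each output logit judges one patch of the (x, y) pair; there is no reduction to a single scalar.
patch_disc = nn.Sequential(
    nn.Conv2d(6, 64, kernel_size=4, stride=2, padding=1),   # 6 channels: x and y concatenated
    nn.LeakyReLU(0.2),
    nn.Conv2d(64, 128, kernel_size=4, stride=2, padding=1),
    nn.BatchNorm2d(128),
    nn.LeakyReLU(0.2),
    nn.Conv2d(128, 1, kernel_size=4, stride=1, padding=1),  # 1-channel grid of patch logits
)

out = patch_disc(torch.randn(1, 6, 256, 256))
print(out.shape)  # torch.Size([1, 1, 63, 63]) -- one value per patch, not one per image
```

Because the module is fully convolutional, the same discriminator can be applied to larger images; the output grid simply gets bigger.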
For the optimization, the training step just alternates: one gradient descent step on the discriminator, then one on the generator. They use the standard trick of maximizing log D(x, G(x, z)) rather than minimizing log(1 − D(x, G(x, z))), which gives non-saturating gradients. Then there is something that's a bit confusing: they say they use "minibatch SGD with the Adam solver", which to me sounds like opposites — either you use SGD or you use Adam — but when I looked through the source code they used Adam, and they also give the beta terms, which points to Adam as well. They set beta1 to 0.5, that's the momentum term, and beta2 controls the exponential moving average of the squared gradients.

They also had a kind of funny way of evaluating their GAN. They had a more standard metric as well, but they also ran "real vs. fake" perceptual studies on Amazon Mechanical Turk, which they detail a bit later on. Turkers — I didn't know that was a word — were presented with a series of trials that pitted a real image against a fake one; on each trial the images were shown for a second, and the Turker had to say which one was fake. They used that to judge whether their GAN was good or not, which is kind of funny to me, but anyway.

Moving on to the more important stuff: you can imagine asking what the patch size should be, that is, how much of the original image each discriminator output should be responsible for seeing. They tried different ones: 1 by 1, 16 by 16, 70 by 70, and the entire image. What they found is that 70 by 70 works well; with 16 by 16 you get these tiling artifacts, which you can also see in the figure. That matches my experience — I actually got artifacts randomly during training as well — and I guess they occur more when the patch is too small. They mention that they went with 70 by 70 to alleviate the artifacts, and it also achieved slightly better scores. What's also nice is that even though the model is trained on 256 by 256 images, it's fully convolutional, so you can still apply it to larger images; here they show a photo-to-map example at 512 by 512 from a model trained on 256.
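Going back to the optimization settings mentioned above, here is a minimal sketch (assuming PyTorch) of the optimizer setup: Adam with beta1 = 0.5; the learning rate of 2e-4 and beta2 = 0.999 are the values the paper reports. The `gen` and `disc` modules here are just placeholders:

```python
import torch
import torch.nn as nn

gen = nn.Conv2d(3, 3, kernel_size=3, padding=1)    # placeholder for the U-Net generator
disc = nn.Conv2d(6, 1, kernel_size=4, stride=2)    # placeholder for the PatchGAN discriminator

opt_gen = torch.optim.Adam(gen.parameters(), lr=2e-4, betas=(0.5, 0.999))
opt_disc = torch.optim.Adam(disc.parameters(), lr=2e-4, betas=(0.5, 0.999))
```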
As for the example outputs shown in the paper, I think they're pretty cherry-picked, but they do look good. There are more output examples as well, and all of them look really good, so I assume those are cherry-picked too; there were probably cases where it failed that they didn't show, but what they show looks pretty good. That was the most important part of the paper, so let's move on to the actual training details.

After the citations, in the appendix, they go through the details of implementing this, and they also have a repository with the source code. They denote a Convolution-BatchNorm-ReLU layer with k filters as Ck, and CDk is a Convolution-BatchNorm-Dropout-ReLU layer. All the convolutions use 4 by 4 kernels with a stride of two, so the convolutions in the encoder (the downward part of the U-Net) and in the discriminator downsample by a factor of two, while in the decoder they upsample by a factor of two, which makes sense. (A minimal sketch of these Ck/CDk blocks is included after the transcript.) Then there's the generator architecture: each encoder convolution downsamples by a factor of two while increasing the number of channels, so at the end of the encoder you're left with a 1 by 1 feature map, and then the decoder upsamples back up. That listing shows the decoder without the skip connections, but what they actually used is the U-Net variant where the skip connections concatenate along the channels, which means that instead of 512 channels you have double that, because you concatenate with the corresponding output from the encoder. As you can see, they also use dropout in the decoder; I first thought it was two dropout layers, but it's actually three, because there's one more in there as well — I was worried I had implemented it incorrectly, but no, they use three dropouts. Some other important details: they use a tanh on the generator output and a sigmoid on the discriminator output — they mention that over here — and the ReLUs in the discriminator are leaky with a slope of 0.2. They also don't apply BatchNorm to the first C64 layer, and I guess that's it.

All right, that's it for this paper walkthrough of the most important parts. In the next video we'll implement this from scratch to make things super clear. Hopefully this video was useful to you; like the video if you thought it was, and I'll see you in the next one. Bye!
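As referenced above, here is a minimal sketch of the Ck / CDk building blocks described in the appendix: a 4-by-4, stride-2 Convolution-BatchNorm-(Dropout)-ReLU block, with leaky ReLU on the encoder/discriminator side. This is one interpretation for illustration, assuming PyTorch, not the authors' exact code:

```python
import torch.nn as nn

def block(in_ch, out_ch, down=True, dropout=False, leaky=True):
    """Ck / CDk: 4x4 conv (stride 2) -> BatchNorm -> (Dropout) -> LeakyReLU or ReLU."""
    conv = (nn.Conv2d(in_ch, out_ch, kernel_size=4, stride=2, padding=1)       # halves H and W
            if down else
            nn.ConvTranspose2d(in_ch, out_ch, kernel_size=4, stride=2, padding=1))  # doubles H and W
    layers = [conv, nn.BatchNorm2d(out_ch)]
    if dropout:
        layers.append(nn.Dropout(0.5))   # the paper uses 50% dropout in the CDk decoder blocks
    layers.append(nn.LeakyReLU(0.2) if leaky else nn.ReLU())
    return nn.Sequential(*layers)

# Examples: an encoder step after the un-normalized first C64 layer, and a decoder CD512 block.
enc2 = block(64, 128, down=True, leaky=True)                     # C128
dec1 = block(512, 512, down=False, dropout=True, leaky=False)    # CD512
```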
Info
Channel: Aladdin Persson
Views: 9,735
Id: 9SGs4Nm0VR4
Length: 19min 33sec (1173 seconds)
Published: Wed Mar 03 2021