Diffusion Models Beat GANs on Image Synthesis | ML Coding Series | Part 2

Captions
What's up guys, in this video I'm continuing with the machine learning coding series, covering a follow-up paper to the previous video. In the last one we covered the Improved DDPMs (Improved Denoising Diffusion Probabilistic Models) paper, and in this one I'm going to cover the follow-up work, the "Diffusion Models Beat GANs on Image Synthesis" paper and the associated code. The idea will be, again, for me to give you some brief context, like a 10-15 minute overview of the most salient parts of the paper, and after that we're going to dig into the code and analyze two scripts: one is classifier train and the other one is classifier sample. Basically I want to focus on the guiding, so how we guide diffusion using pre-trained noise-aware classifiers, so that you understand what that means. In any case, what I've done is clone this repo, guided-diffusion from OpenAI; they have all the instructions you'll need. I had to do some modifications; again, if I get enough votes down in the comments I'll push my changes — I had to make minor changes to make this work on my machine. You can basically download various checkpoints here and just follow the instructions they have in the readme, which are super nice, and you should be ready to go. Without further ado, let's first start with the paper. Okay, so as I said, it's "Diffusion Models Beat GANs on Image Synthesis", and the two main innovations are: first, they improved the UNet architecture from their previous work, the Improved DDPMs paper, and second, they find a way to additionally condition the models so that they can achieve higher quality — they can trade off the diversity versus the quality of a sample, which is something that previous generative models such as GANs used to achieve using the truncation trick, and which autoregressive models achieve by simply modifying the temperature. We'll see in this paper how they achieve the same thing for the class of generative models called diffusion models. I strongly suggest you watch my previous video if you haven't, because I'm going to assume some knowledge from that one, but you can continue watching this one as well. Okay, so they say here: we hypothesize that the gap between diffusion models and GANs stems from at least two factors. The first one is that the model architectures used by recent GAN literature have been heavily explored and refined. We saw the same thing happening in the ConvNeXt paper, if you recall that one, where people focused a lot on transformers and a lot of innovation happened on that side of the spectrum, whereas CNNs were not as explored; what ConvNeXt showed is that if you take some of the innovations invented in the context of transformers and adopt them into CNNs, you get much better models. Similarly here, they took some ideas from other GAN papers, integrated them into the diffusion model architecture, and that improved the results quite nicely. So that's the first thing. The second thing is that GANs are able to trade off diversity for fidelity, producing high quality samples but not covering the whole distribution.
So that's the second thing I mentioned previously. They do that using this truncation trick; you can check out the StyleGAN paper — I think that's the one, if I recall correctly, where they use the truncation trick. In any case, let me quickly show you the dimensions along which they explored how to improve the architecture: increasing depth versus width while holding model size relatively constant; increasing the number of attention heads; using attention at 32×32 and 8×8 resolutions rather than only at 16×16; using the BigGAN residual block for upsampling and downsampling the activations; and rescaling residual connections with 1/sqrt(2). Those are some of the ideas they played with, and you can see some of the curves here; after taking the best of those changes, after a lot of experimentation I guess, they end up with the best architecture: number of channels 160, number of residual blocks 2, blah blah blah — anyways, not that important, I'm going to focus much more on guiding the diffusion than on the architecture. Just to wrap up the architecture part: "in the rest of the paper we use this final improved model architecture as our default: variable width with two residual blocks per resolution, multiple heads with 64 channels per head, attention at 32, 16 and 8 resolutions, BigGAN residual blocks for up and downsampling, and adaptive group normalization for injecting timestep and class embeddings into residual blocks." You might recall this adaptive group normalization for injecting timestep and class embeddings from my previous video, so do check that out if you want to know the details of how it works. Okay, so this is the important part: classifier guidance, that's what we want to understand a bit better. They mention that they already incorporate class information into normalization layers — so, if you recall from the previous video, the same way we incorporate the timestep information, through that same pathway, they incorporate the class information. That means they already had class-conditioned diffusion models, but they show in this paper that classifier guidance is complementary to the class conditioning, and because of that they achieve the best results with class conditioning plus classifier guidance. So let's see what it is. "Here we explore a different approach: exploiting a classifier p(y|x)" — so you input x and you get a distribution across labels, that's how you're going to implement this abstract distribution — "to improve a diffusion generator. Prior work showed one way to achieve this, wherein a pre-trained diffusion model can be conditioned using the gradients of a classifier." Okay, that's the important part. "In particular, we can train a classifier p_phi(y | x_t, t), parameterized by parameters phi and conditioned on x_t and t." So what is x_t? If you recall, this is the forward process. A quick reminder on diffusion: you start with a fresh image — let me draw it here, so we have a fresh image here — and then what you do is gradually start adding more and more noise.
So basically you have some image here, then the same image again, and you start adding Gaussian noise, and you keep repeating that process multiple times — for example 1000 or 4000 steps are numbers that make sense — and at the end you'll end up with a completely noised image, just pure Gaussian noise; I'm going to draw it like this, basically it's pure noise. What you're trying to learn is how to go backwards: this is the forward direction, and we want to learn how to go backwards, so how to start from noise and then gradually generate an image from the underlying probability distribution which we learned during training. Okay, so x_t is basically the t-th step of this forward process: we take the image at the t-th timestep — this end is usually denoted as 0, and this end is usually denoted as capital T, or T minus 1, whatever. In any case, that's what x_t is, and t is just a scalar which we're going to encode using, usually, sinusoidal embeddings, same as in the original Transformer paper. So, in particular, we can train a classifier, as I just explained, on noisy images, and then use the gradients of that classifier, as you can see here, to guide the diffusion sampling process towards an arbitrary class label y. The idea is that this gradient tells you how you need to tweak the current image x_t such that you maximize the class label of interest. Let's say you want to create a dog image: this will tell you how to tweak the image such that the dog class is maximized, i.e. how to tweak the diffusion process such that you start guiding it towards the part of the space where the dog images lie. We're going to see the details in a second, but that's the rough intuition for now. Okay, let's continue. They briefly mention the notation here: they're going to omit the t, just for conciseness' sake — it's not redundant, it's just easier to write it down like that. So here's how we're going to do this. We start with a diffusion model with an unconditional reverse noising process, p_theta(x_t | x_{t+1}), and the idea is, given x_{t+1} — the image with more noise — you predict the next step of the reverse process, as I previously explained: we want to learn how to go from more noise to less noise. To condition this on a label y, it suffices to sample each transition according to this distribution here; you can see we additionally have the class information. It can be shown — and they have a whole derivation in appendix H, so if you want to dig into the maths you can go there; I'm going to take this as given, like an axiom, because otherwise it would take too much time to explain the maths — and you can see here that Z is just an arbitrary normalizing constant.
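To summarize the key identity from the paper in one place (in the paper's notation, where Z is that intractable normalizing constant):

```latex
p_{\theta,\phi}(x_t \mid x_{t+1}, y) \;=\; Z \, p_\theta(x_t \mid x_{t+1}) \, p_\phi(y \mid x_t)
```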
We already have the first factor: that's the unconditional reverse process, which we learned. The second factor is going to be the classifier that we train on noisy images, where by noisy images I mean images taken from an arbitrary timestep of the forward diffusion process. Okay, that's the basic idea. So if you take this as given, let's now derive how we can actually sample from the distribution on the right-hand side, because, as they mention, it is typically intractable to sample from this distribution exactly. Let's see how we can break it down and make it simple and easy to sample from. We know that our reverse process can be parameterized like this — this is what we saw in the previous paper: we learn how to predict mu and the covariance matrix Sigma. Now, if we take the log of this expression, since this is e raised to something (that's the Gaussian definition), you end up with this expression here. So we're working our way towards a simple expression for sampling from a class-conditioned distribution, and that's going to be on the next page. Under certain assumptions which they mention — we can assume that the log of our classifier has low curvature compared to the inverse of the covariance matrix, which we predict from the unconditional diffusion model; this assumption is reasonable in the limit of infinite diffusion steps, where, I guess, the Frobenius norm of Sigma converges to zero — we can approximate log p_phi, the classifier, using a Taylor expansion around x_t = mu, and mu is basically the mean of the distribution from which we're going to sample x_t. Fairly simple. If we do that, we end up with this expression here; this component g is going to be very, very important. Under this approximation, and after the derivation here, we finally get the final expression: we take the log of the distribution we saw above, and we can omit Z because it's a constant so we don't care about it, and after some derivation we end up with this expression here. Again we can ignore C4, it's just a constant. So basically, if we want to sample from that complex, intractable distribution, what we need to do in practice is just sample from this Gaussian here, and as you can see the Gaussian just has a shifted mean: we have mu, which comes from the unconditional diffusion model, and we simply add this offset — we take the covariance matrix and multiply it with g, where g is, as you can see here, the gradient of our classifier evaluated at x_t = mu. I'm going to give you some intuition behind these formulas, but first let me just read this: "we have thus found that the conditional transition operator can be approximated by a Gaussian similar to the unconditional transition operator, but with its mean shifted by Sigma g."
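Written out, the result we just read (again in the paper's notation, with the scale factor s that shows up a bit later set to 1):

```latex
p_{\theta,\phi}(x_t \mid x_{t+1}, y) \;\approx\; \mathcal{N}\!\big(\mu + \Sigma\, g,\; \Sigma\big),
\qquad g = \nabla_{x_t} \log p_\phi(y \mid x_t)\,\Big|_{x_t = \mu}
```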
Okay, so let me try and give you some intuition for why this works. It's going to be a very hand-wavy explanation, but I think the intuition is correct at some level of abstraction. So g, again, tells you how you should tweak x_t, the image, such that you maximize a certain class y. Let's see what we do. We have an image, that's x_{t+1}; we pass that image into our diffusion model, and out comes mu and out comes the covariance matrix, and these two together parameterize the Gaussian distribution from which we would sample the next image, x_t. We can visualize this Gaussian as a point in a multi-dimensional space; for the sake of explanation I'm going to assume we're dealing with 3D space, but obviously it's much higher dimensional — mu actually has the same dimensionality as our image, so H times W times 3, and similarly for Sigma. Just for the sake of understanding the shift, the Sigma g part, I'm going to assume a coordinate system like this, and a Gaussian distribution whose mean mu is here, with some variance. Under normal circumstances, if we just have an unconditional diffusion model, we would sample a point somewhere from this distribution, and that would be our x_t. But because we additionally have this information coming from the classifier, the following happens: you take this mu and pass it as an image into the classifier — for ImageNet it obviously has a thousand classes — and the classifier tells us how we should tweak the image such that a particular class is maximized, maybe this one. With that gradient — those blue dots, which I'll denote as g — which has the same dimensionality as mu, we shift this mu, maybe over here, so our new distribution is centered here instead, with the same variance. So we shifted our original distribution, and now we're somewhere here. It's kind of messy, but you get the point: this shift vector, the blue one, is just the expression Sigma times g. That's the hand-wavy explanation of what's going on: the classifier knows in which direction you should move this image such that you maximize the class of interest.
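Here is a minimal PyTorch-style sketch of that mean shift. The real function in the repo (condition_mean in gaussian_diffusion.py) has a slightly different signature and takes model_kwargs, so treat the exact names and arguments here as my assumptions:

```python
import torch

def condition_mean(p_mean_var, cond_fn, x, t, y):
    """Shift the unconditional mean mu by Sigma * grad_x log p_phi(y | x_t).

    p_mean_var: dict with "mean" and "variance" predicted by the diffusion model.
    cond_fn:    callable returning the classifier gradient w.r.t. the noisy image.
    """
    gradient = cond_fn(x, t, y)                  # same shape as the image batch, e.g. (B, 3, H, W)
    new_mean = p_mean_var["mean"] + p_mean_var["variance"] * gradient
    return new_mean
```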
By just moving the mean in the direction of the classifier's gradient, you increase the probability of the class label, and that's basically everything you need to understand. Hopefully that's clear enough — that's how we combine the classifier and the unconditional diffusion model to form images from a particular class. Again, I'm kind of lying here, because as I said they are still class-conditioning the model; using the classifier we additionally, complementarily, improve this class-conditioned generation. Cool. Let me briefly explain a couple more things. Aside from using the classifier gradient, they additionally scale it; empirically they found this gives much better results. They mention: "our classifier architecture is simply the downsampling trunk of the UNet model, with an attention pool at the 8×8 layer to produce the final output. We train these classifiers on the same noising distribution as the corresponding diffusion model." Then they say: "in initial experiments with unconditional ImageNet models, we found it necessary to scale the classifier gradients by a constant factor larger than 1." To understand the effect of scaling the classifier gradients — this is what we denoted as g previously, the gradient of our classifier — note that if we scale by s, you can show quite easily that it's equivalent to this expression here. You can convince yourself this is true because the log of a product is a sum of logs, and because we take the gradient with respect to x, the normalizing constant drops out, so the two expressions are equivalent. You can see that we're basically raising the classifier distribution to the power of s, which means we're sharpening it. Let me continue reading: "Z is an arbitrary constant, and as a result the conditioning process is still theoretically grounded in a renormalized classifier distribution proportional to p^s. When s is larger than one, this distribution becomes sharper than the original distribution. In the above derivations we assumed that the underlying diffusion model was unconditional, modeling p(x). It is also possible to train conditional diffusion models and use classifier guidance in the exact same way. Table 4 shows that the sample quality of both unconditional and conditional models can be greatly improved by classifier guidance." This is the thing I was mentioning: we can have both a conditional model plus classifier guidance, because they are complementary. "We see that with a high enough scale, the guided unconditional model can get quite close to the FID of an unguided conditional model, although training directly with the class label still helps. Finally, guiding a conditional model further improves FID." That's what I mentioned multiple times: using the class-conditional model plus the guiding is what gave them the best results on various metrics such as FID, IS (Inception Score), precision, recall, etc.
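To write down that gradient-scaling equivalence from a couple of paragraphs back explicitly (Z_s is just another normalizing constant, which vanishes under the gradient):

```latex
s \, \nabla_x \log p_\phi(y \mid x)
  \;=\; \nabla_x \log \frac{1}{Z_s}\, p_\phi(y \mid x)^{s}
```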
Okay, quickly, let me address why they train the classifier on noisy images, because we're going to see it in the code later on. They say: "we train these classifiers on the same noising distribution as the corresponding diffusion model." Well, because we're going to be using it on x_t, which can be super noisy, and if we just trained our classifier on normal images — by normal I mean images without any noise — then the model would probably struggle, because noisy images would be out of distribution for it, and it would not give us the correct class for a noisy version of the original image. Because of that, we want to train the model on the whole process: this is our forward diffusion process, from step 0 up to step T minus 1, and we want to train on all of these images, so we'll be randomly sampling timesteps and making sure the model knows how to predict the correct class. This is kind of bothersome, and some of the later papers, such as GLIDE, introduce the concept of classifier-free guidance; I'm going to tell you more about that one in a future video. Okay, let's go back. Here we can see the trade-off we mentioned at the beginning of the video, between diversity and quality of the samples; they show with various metrics that this is indeed the case. As we increase the gradient scale, the hyperparameter s, the recall falls, which means the diversity — the mode coverage — of our generative model is going down, but the precision, which is basically a proxy for image quality, goes up. Similarly IS goes up, meaning the image quality is increasing, and FID also goes up, which here reflects the diminishing diversity. Okay, that's pretty much it. I'm going to skip the rest and just briefly mention a couple of sentences, and then we'll jump into the code. "Our proposed classifier guidance technique is currently limited to labeled datasets" — so keep that in mind, we do have to train it on labeled images — "but the effectiveness of classifier guidance demonstrates that we can obtain powerful generative models from the gradients of a classification function. This could be used to condition pre-trained models in a plethora of ways, for example by conditioning an image generator with a text caption, using a noisy version of CLIP." Because of this sentence, after reading this paper you could have assumed that OpenAI would start working on something like GLIDE or the DALL-E models, and indeed they did. "Finally, it also suggests that large unlabeled datasets could be leveraged in the future to pre-train powerful diffusion models that can later be improved by using a classifier with desirable properties." Again, if you read this paper you could have anticipated that OpenAI was going to scale up diffusion models, such as what they've done with DALL-E — and when I say DALL-E I basically mean DALL-E 2, because DALL-E 1 didn't even use diffusion models. Okay, let's jump to the code now. I did expect this paper overview to be a bit shorter, but do let me know whether you find this type of combination useful or not; feel free to comment down below what you think about this format. In any case, let's now start working on the classifier train script — this is how we train the noisy classifier I just mentioned.
So let me start and walk you through the actual script for training it. I'm obviously going to abstract away a lot of the details, because otherwise this would not be tractable for a video that has an ambition to be less than an hour long. So: some distributed setup we don't care about, logging, blah blah blah. This part we do care about, because I want to briefly show you how the classifier is constructed; we saw in the paper that it's simply the first part of the UNet model, up to the bottleneck, with attention pooling attached, and the second half of the UNet is ditched. That's how they set up the classifier architecture. Let's go inside — this just gives you some default arguments; I'm going to zoom in a little bit in case you can't see this, and close this part. Okay, let's step inside this line of code and see how the classifier is built. Here's the classifier: image size 128×128, blah blah blah, those are just hyperparameters and obviously they can differ; we're not using FP16; there is some width, depth, various other hyperparameters. This one enables attention at 32×32 in the latent space, as well as 16×16 and 8×8. This is the scale-shift norm I showed you in the previous video, which basically enables us to incorporate the class as well as the timestep information into the model. I'll skim over those details. We have image size 128, a specification for the number of channels in the UNet, and finally we transform the attention resolutions — remember, this is basically "32, 16, 8" — and after stepping through this for loop we end up with the same information, but expressed as how many downsampling steps you need to take before applying attention; we saw that in the previous video as well, so I'll skip over it. And here it comes: the EncoderUNetModel. The interesting bits: the number of output channels is hardcoded to 1000, because this is trained on ImageNet; the classifier pool is going to be attention, we'll see that in a second; I'll skip over the rest, it's not that important. Let me go to the constructor. I'll skim over this — they're just assigning these arguments to local variables, nothing fancy. We construct the time embedding, the MLP we use to transform the sinusoidal embeddings of our timesteps; nothing interesting, we can skim over all of this. If you recall from the previous video, we had the same structure: we initially have the input blocks, starting with a Conv2d, the convolution as the first processing layer; then the part where we add various residual blocks and attention blocks; and then the middle block. The main difference is that we don't have output blocks here; as you can see, we just have the pooling operator instead. Obviously these details are not that important; the only important part lies in the residual block, where you can see how information such as the timesteps and class information is incorporated.
I'm going to skip over all of this; you can see that sometimes we add attention blocks, and that happens precisely at resolutions 32, 16, and 8, and there are these wrapper functions which enable us to sometimes pass the temporal information and sometimes just the feature vectors. In any case, I'll skip all of this, because there's too much information and it's not vital for your understanding, it's just the architectural details of the UNet. I'll show you only this one part. As a reminder from the previous video: there is this part where we embed the temporal information, and then, because use_scale_shift_norm is set to true, we take the temporal information in the form of a scale and a shift and combine it with the features — this is how we merge the temporal information in. Let me just check whether they have class information here... doesn't seem to be the case; and indeed, it doesn't make any sense to have class conditioning in a classifier — that's only going to make sense for the diffusion model. In any case, let's continue. So we have the UNet trunk, and then we end up with this sequential model: some normalization, some activation function, and finally this AttentionPool2d. It takes the output features from the UNet trunk and, using an attention mechanism, forms a resultant feature vector which we then use to predict the logits; you can see the number of output channels is 1000, so it gives us a thousand-dimensional vector of logits coming out of this classifier. And that's it, guys. I skimmed a lot of details because they're not really that important; you just need a rough mental model of how the architecture looks: it has this encoder-type structure, and it has this attention pooling at the end.
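To make that mental model concrete, here is a toy, self-contained stand-in for the noise-aware classifier. The real EncoderUNetModel has residual blocks, attention layers, and a proper AttentionPool2d head, so everything below (module names, sizes, the crude global pooling) is a simplification I made up purely for illustration:

```python
import math
import torch
import torch.nn as nn

def timestep_embedding(t, dim):
    """Sinusoidal timestep embeddings, as in the Transformer / DDPM papers."""
    half = dim // 2
    freqs = torch.exp(-math.log(10000) * torch.arange(half, dtype=torch.float32) / half)
    args = t.float()[:, None] * freqs[None].to(t.device)
    return torch.cat([torch.cos(args), torch.sin(args)], dim=-1)

class ToyNoisyClassifier(nn.Module):
    """Toy sketch of the classifier: a small conv trunk that also sees the timestep,
    followed by pooling into ImageNet logits (the repo uses attention pooling)."""
    def __init__(self, ch=64, emb_dim=256, num_classes=1000):
        super().__init__()
        self.emb_dim = emb_dim
        self.time_mlp = nn.Sequential(nn.Linear(emb_dim, emb_dim), nn.SiLU(),
                                      nn.Linear(emb_dim, ch))
        self.conv_in = nn.Conv2d(3, ch, 3, padding=1)
        self.down = nn.Sequential(nn.SiLU(), nn.Conv2d(ch, ch, 3, stride=2, padding=1),
                                  nn.SiLU(), nn.Conv2d(ch, ch, 3, stride=2, padding=1))
        self.head = nn.Linear(ch, num_classes)

    def forward(self, x_t, t):
        emb = self.time_mlp(timestep_embedding(t, self.emb_dim))  # (B, ch)
        h = self.conv_in(x_t) + emb[:, :, None, None]             # merge timestep info
        h = self.down(h)                                          # downsampling trunk
        h = h.mean(dim=(2, 3))                                    # crude global pooling
        return self.head(h)                                       # (B, 1000) logits
```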
Okay, I skipped the GaussianDiffusion object here because we saw it in the previous video; I just want to focus on the differences. We shift the model onto the GPU and create the UniformSampler, which just helps us sample from the forward process: we pick a random timestep using the uniform distribution. You can obviously bias this and form different distributions; one of those, which I showed you in the previous video, focuses more on the timesteps with higher loss, which makes sense intuitively. We can skip these parts: we're not resuming from a checkpoint, there's a mixed-precision trainer wrapper we don't care about, and a DistributedDataParallel object for distributed training which we don't care about either. We have some data loading: batch size 256, image size 128×128, and the data directory which I downloaded before the video started. That's it, we just created the training and validation datasets; I'm not going to dig into how that works, it's a side thing. Then some logging, the AdamW optimizer, no resuming from a checkpoint, blah blah, and here it is: the main function for understanding how the classifier is trained, forward_backward_log. I've added a breakpoint to it, and we'll step into it a bit later; here it's just defined, and then we loop for a number of iterations — I've set it to 30 because I want a short training run, just for debugging purposes. Some logging, nothing interesting, and this is where the whole magic of the classifier training happens; after that we just do the optimization step, do the same forward-backward logic on the validation set, and finally some logging. So forward_backward_log is everything we care about — let's step into it. We first grab a batch of images and associated labels; let me show you the shapes, as always that's an important thing to do. If I print batch.shape, we see 256 RGB images of resolution 128×128, and we have the labels as well — this extra dict just contains the y's, and its shape is 256, so each image has an associated label. If I step over this and print the labels, we see a bunch of labels; because I'm using CIFAR-10 locally, even though this is an ImageNet training setup, we only see labels from 0 through 9, but that detail doesn't matter for understanding how the classifier training works. Okay, we put the labels onto our target device, the GPU in my case, same with the images, and now we take the uniform sampler and sample 256 timesteps. If I print t, you can see it's a pile of 256 numbers, one timestep per image in the batch, randomly sampled between 0 and 999. After that we call q_sample: we take the image and the timestep and do the noising, so from the x_0 image we create the x_t image. A quick reminder, I showed this in the previous video — there is literally a formula, and it's fairly easy: you take specific constants, multiply with x_start (x_start is x_0, the original image), and add noise scaled by another constant. Let me find the formula, it'll be easier. Okay guys, here it is, from the previous video: given x_0, the original image, we get x_t either by sampling from this distribution or, equivalently, by computing this expression — some constant times x_0 (or x_start, as it's called in the code), plus another constant times the noise epsilon.
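That formula, written out (alpha-bar is the cumulative product of the noise-schedule terms, epsilon is standard Gaussian noise):

```latex
x_t \;=\; \sqrt{\bar{\alpha}_t}\, x_0 \;+\; \sqrt{1 - \bar{\alpha}_t}\;\epsilon,
\qquad \epsilon \sim \mathcal{N}(0, I)
```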
Epsilon is just noise sampled from a normal distribution. Now let's go back to the code and see whether you can spot the resemblance between that formula and this line: again, a constant times x_start, plus some constant times the noise, and that's how we get x_t. That's why I'll skip over this function. We now have a batch of noisy x_t samples, and that's what we train our classifier on; again, the reason is that otherwise, if we trained only on the original images, the classifier wouldn't be able to estimate the gradients correctly for an arbitrarily noisy image. Okay, so here, instead of training on the whole batch at once — this is just a memory optimization — we chunk the batch of 256 into smaller, so-called micro-batches. Then we do a forward pass through our classifier; again, the classifier is basically the first portion of the UNet model. Let me show you the shapes: the micro-batch size is one, so we pass a single image through the classifier, together with the corresponding timestep — so if the timestep is 5, this image is x_5 — and we get back the logits. Stepping through, we end up in the forward pass of the classifier: we do the timestep embedding, which embeds the timestep using sinusoidal embeddings and then passes it through the MLP, ending up with something of shape 1×512. Then we pass through the input blocks, the middle block, and finally the attention pooling layer, and that's it. Depending on the module, some of them incorporate the timestep information into the features, and others, such as the Conv2d layer, don't, but in any case that's pretty much it. Now I can get back to the classifier training and skip ahead to the loss. Here we are: we got some logits, and we just compute cross entropy against the label, and that's our loss — basically standard classifier training. We then collect some metrics such as accuracy@1 and accuracy@5, do some logging, take the mean of the loss, and, because we have micro-batches, in the zeroth micro-batch we clear the gradients and then keep accumulating them across micro-batches, only doing the update once we've gone through all the micro-batches inside a single mini-batch. That's a minor optimization detail, but we've seen everything we needed: the main thing is that we call this q_sample function, which means we train not on the original images but on the x_t images, the noisy versions.
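Putting the pieces of that training step together in one hedged sketch (q_sample exists in the repo with roughly this signature; the micro-batching, logging, and mixed-precision plumbing are omitted here):

```python
import torch
import torch.nn.functional as F

def classifier_training_step(classifier, diffusion, x_start, y, num_timesteps=1000):
    """One batch of noise-aware classifier training, roughly what forward_backward_log does:
    noise the clean images to a random timestep, then do plain cross-entropy."""
    # pick a random timestep per image (the UniformSampler in the repo)
    t = torch.randint(0, num_timesteps, (x_start.shape[0],), device=x_start.device)
    noise = torch.randn_like(x_start)
    # x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps
    x_t = diffusion.q_sample(x_start, t, noise=noise)
    logits = classifier(x_t, t)              # (B, 1000)
    loss = F.cross_entropy(logits, y)        # standard classification loss
    loss.backward()                          # gradients accumulate across micro-batches
    return loss
```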
Okay guys, that's it. Just to be thorough, I want to show you the part where the temporal information gets merged in. So: ResBlock — I'll set a breakpoint inside its forward function, just so I can show you how we merge in the information from the embedding. Let's do another pass over another micro-batch, and here we are: we have the embedding, and at some point we hit the ResBlock. Here is the part where we merge the temporal information; I want to show you this once more. It's fairly simple: we do some additional processing — let's see what these emb_layers are: just a linear layer and an activation function, nothing fancy — and we end up with an embedding of shape 1×256. Now, as you can see, we start adding dummy dimensions so that we can merge this temporal information into the image features. Because use_scale_shift_norm is set to true, we enter this branch; the layers are split into two parts, the first being the normalization layer and then the rest — you can see how this out_layers sequential is constructed, not that important. The important part is here: we take our embedding, this vector, and split it into two parts, one is the scale, one is the shift — each is 128-dimensional — and we've added the dummy dimensions, so when we apply them, broadcasting kicks in and we merge them with the image features, which are of shape 128×128×128.
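Here's a minimal sketch of that scale-shift merge; the module names (emb_layers, out_norm, out_rest) loosely mirror how the repo's ResBlock splits its output layers, but treat the exact structure as an approximation:

```python
import torch

def scale_shift_merge(h, emb, emb_layers, out_norm, out_rest):
    """Merge the timestep (and, in the diffusion model, class) embedding into the
    image features h -- the use_scale_shift_norm=True path of a ResBlock."""
    emb_out = emb_layers(emb)                      # (B, 2*C)
    while emb_out.dim() < h.dim():
        emb_out = emb_out[..., None]               # -> (B, 2*C, 1, 1), broadcastable over H, W
    scale, shift = torch.chunk(emb_out, 2, dim=1)  # (B, C, 1, 1) each
    h = out_norm(h) * (1 + scale) + shift          # FiLM-style modulation of normalized features
    return out_rest(h)                             # remaining activation/dropout/conv layers
```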
Let me briefly show you how this looks inside my OneNote. We have a volume that came out of the UNet encoder — those are the image features from the latent space — and we saw that it has 128 channels and that the height and width are also 128. Whereas, on the other hand, we have this vector, our temporal embedding: it's also 128 along the channel dimension, but only 1 by 1 spatially. What broadcasting does is basically copy-paste it multiple times, so we end up with copies here, here, here, etc., and after that we can freely add it on top of the image features — and that's how we incorporate the temporal information. Cool, let's go back; I'm going to stop the execution of the classifier train script. We've seen how to train the classifier — do let me know whether you found this explanation useful and whether I can improve something, any feedback is appreciated; and if you like this video, share it with your friends. Let's now focus on the sampling function. I have prepared some arguments; again, I just took them from the readme and extracted them here, so these are the arguments I'm using, but you can find all of that in the readme file itself. Okay guys, let's run this script, and again I'm going to skip explaining a lot of details that are irrelevant to the actual goal — in this case we want to see how the classifier guidance works, and that's what I'll focus on. What this function returns is the model, which is the UNet model that predicts the epsilon, the noise, and the diffusion object, which just contains a bunch of the constants needed for the diffusion process; I'll skip over all of that. We load a checkpoint — as I said, I have a model path here; I went ahead and downloaded the model into this models directory inside my root directory, and everything is explained in the readme, so you don't have to worry about it. We push the model to the GPU and convert it to FP16, again just an optimization detail. We create the classifier — that's again our encoder, the half-UNet, and we already skimmed that code in the training script, so I'll ignore it — and, whoops, I'll just disable all the breakpoints, hit F5, and here we are; now I'll enable all the breakpoints again. We load the state dictionary from the classifier path — I downloaded this model as well and put it in the models directory, nothing smart there. We push the classifier onto the GPU too and set it into eval mode. Now, this is going to be the most important part: this is where the gradients are calculated for the classifier. I'll analyze it a bit later once we actually enter that function; here it's just defined, along with the model_fn. And now we do a bunch of iterations until we create the number of samples we want.
You can see here I specified a hundred, but that's just an arbitrary number, not that important. Okay, so we create random classes — depending on the batch size, we have four random classes being generated here. Let me show you what those are: if I print them in the debug console, we see four random class indices; I don't know offhand what they correspond to semantically, they're just classes 33, 918, etc. We don't care about the actual semantics behind the numbers. We store that class information inside this model keyword-arguments dictionary. And because we're not using DDIM here — DDIM is just an alternative sampling method whereby, if you have fewer than 50 diffusion steps, that paper showed you get better results compared to the original DDPM sampling; so DDIM only really makes sense in the regime of roughly fewer than 50 sampling steps, and here we're using far more steps than that — we're going to use p_sample_loop instead of the DDIM sample loop. I won't focus on the DDIM part because there are a lot of formulas and the logic is hard to explain; on a high level, treating it as a black box, you'd want it when you have fewer than 50 diffusion steps. Okay, that's it; here's where the whole magic happens. After that there's just some normalization, a permutation, blah blah blah, and we collect the images; basically a bunch of boilerplate code, this is where the main logic is. So we step inside here; we specify the target shape — our images are going to be 64 by 64.
Later on they have the upsampling models, which we're also going to ignore, but let's focus on this part. We pass in the model and the keyword arguments with the class information, and we pass this special function, the cond_fn we defined earlier, which is going to calculate the gradients — we'll see how that comes into play in a second. So here's p_sample_loop; what it does is keep generating samples, and I'm going to jump directly inside this function, because everything else is basically boilerplate. Here we generate the random noise image with the desired shape: four, because we have four images in the batch, and then the resolution of the target image, and we sample it from a normal distribution. Then we generate indices depending on the number of timesteps, which is 250 here — remember, we start from the completely noisy image at index 249 and work our way all the way down to zero, where we end up with completely denoised images from the underlying data distribution we learned during training. Let's see what t is — yeah, 249, as I said — and here is where the whole magic happens, in this p_sample step. After 250 steps we end up with our final image, and that's one of the drawbacks of current diffusion models: we need much more compute compared to, say, GANs, where a single forward pass gives you a very realistic, very crisp image. Okay, let's step inside the p_sample function. The main function here is p_mean_variance — that's basically our learned reverse process. After we calculate the out variable, we generate noise, and then comes the important part: this is where we do the conditioning, where we shift the mean via the Sigma-times-gradient expression, and then we form the image. This will contain x_{t-1} if we condition on x_t, or, alternatively, if you condition on x_{t+1}, it's x_t; in any case it's a single step closer to the final image. Okay, let's jump into p_mean_variance — this computes, as I said, our reverse process — and let's quickly skim over this part. I'll set a breakpoint here; we enter, and we pass a couple of things: first, the noise we just sampled, so that's four images, shape 4×3×64×64.
We also pass the timesteps — 249, 249, 249, 249, one per image in the batch — and we additionally pass the class information, because, if you recall from the paper explanation, we're not only going to use the classifier to guide, we're also going to use the class information to condition the model; those two do a similar thing but are complementary, so don't be confused by that. Let's step in. We hit the model_fn we saw previously, and because class conditioning is set to true, we pass x (the noisy images), the timesteps, and the conditioning information. Let's now quickly step into the UNet model's forward pass and see how the class information is incorporated into the image features — hint, hint: it's pretty much the same thing as with the temporal information. Here it is: we have the timesteps, we embed those with the sinusoidal embeddings and then some learned layers, and we end up with a tensor of shape 4×768, because we're dealing with a batch of four images. Now, because we have the classes set — here are our four classes — the only thing we do with them is embed them using this label_emb. (I hate these warnings; there is some launch.json setting to hide them, but I forgot to set it.) So let's see what label_emb does: it's just a simple embedding layer with some number of dimensions per class, nothing fancy. If I run it and check the shape, you can see it transforms our four scalars into four vectors of dimensionality 768, and then we simply add that on top of the temporal embedding. Everything else proceeds as usual: we keep incorporating that information — both the class and the temporal information — into the image features using the logic I showed you, the scale and shift in the ResBlock. I'll skip over the rest, we don't care about that. So we end up with the model outputs; since sigma is learned, the shape should be 4×6×64×64 — six channels because we predict the epsilon, the noise, and we also predict the quantity used to form the covariance. We saw all of this logic in the previous video, so I'll skim through it: we split the model outputs into two parts, the epsilon and — well, it's not the variance exactly, it's the v vector; let me show you the formula. If you recall from the previous video, we predict this v, and then we form the variance by interpolating between the two extreme variances with that expression. That's the calculation we just saw, so we ultimately get the variance, and now we need to predict the mean. Here is how: we first predict x_start, that's x_0, and then we use this posterior mean-variance computation to get the model mean.
All of this is the same as in the previous video, pretty much. So we've got the mean, we have the variance, and we return all of those variables: the mean, the variance, the model log variance, etc. By the way, let me quickly show you where we are in the OneNote explanation from the beginning of the video: what we've generated so far are these two, the mean mu and the Sigma, and now we need to do the shift part, and then we can sample our image — fairly simple. Okay, back here: we return all of these variables, we sample from a normal distribution, and there's this masking stuff going on because, if you recall, there's a discrepancy at the t-equals-zero timestep, where things look a bit different from all the other steps — a minor detail. Again, this is where the whole magic happens: this is where the shifting of the mean happens, using the Sigma and the gradients of our classifier. Also, just a note: there is a bug here. I submitted an issue, and the authors from OpenAI confirmed it's indeed an issue; I'll explain exactly what it is a bit later, but the funny thing is, it did not even matter. That's the thing with training machine learning models: sometimes you have a bug but the models are forgiving, and you still get something working, maybe slightly sub-optimal, with no obvious way to tell there's a bug. It's not like a traditional software system, where if something doesn't work, something crashes and the program fails; ML is a bit different. Okay, so here we form a new mean, and the new mean is formed by adding, on top of the old mean, the Sigma times the gradient — again, that's the whole idea. We pass the cond_fn, which we defined in the main script, along with the mean and the covariance we just calculated above; this dict basically corresponds to x_t, this is x_{t+1}, and this is the timestep. Let's step inside this function and see what's going on. First we calculate the gradient, and then, as you recall from the equation, we add the variance times the gradient onto our old mean — that's easy. Now let's step into this inner function. The first thing we do is take our noisy image, detach it from the computational graph, and set requires_grad to true, so that we can calculate the gradients with respect to x, because that's what we ultimately care about: we want the gradient of the log probability of our target class, as given by the classifier, with respect to x. We pass it through the classifier and get the logits (I accidentally stepped inside the classifier — we don't care about it, let's continue), and then we take the log softmax, giving us the log_probs. Let's check the dimensionality of log_probs: the shape is 4 by 1000 — four because we have four images, and a thousand log-probabilities because we're dealing with ImageNet.
So what we do here is take the targets y — remember, y holds the desired classes — and use them to select only the corresponding log-probabilities; these are the "selected" ones. If I print this, these are the log-probabilities we need to maximize if we want an image from that particular class, if that makes sense. Then — and this is the part that initially confused me — we sum up all of these selected log-probabilities and take the gradient with respect to the input, which is again a batch of shape 4×3×64×64, and we multiply by the classifier scale I mentioned, which is denoted as s in the paper. Let me show you that part in the paper: here is what we're computing, the s times the gradients; so, back in the code, this is the s and this is the gradient we're computing. The reason the summation works is the following: our loss is L1 + L2 + L3 + L4, where L1 is my shorthand for the log-probability of the first image, and so on. If we take the gradient with respect to x, then because x1, the first image in the batch, only influences L1, x2 only influences L2, and so on — each image only influences the log-probability at its own index — the gradient of the sum gives us a batch of gradients where the first slot holds dL1/dx1, the second holds dL2/dx2, etc. So we end up with a batch of per-image gradients, each with respect to its corresponding loss, and the gradient tensor has the same shape as the images, 4×3×64×64.
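Here's a sketch of that guidance function; it mirrors the cond_fn in the repo's sampling script fairly closely, but I'm passing the classifier and the scale in explicitly rather than closing over script-level arguments, so treat the exact signature as an assumption:

```python
import torch
import torch.nn.functional as F

def cond_fn(x, t, y, classifier, classifier_scale=1.0):
    """Gradient of log p_phi(y | x_t) w.r.t. the noisy image x, scaled by s."""
    with torch.enable_grad():
        x_in = x.detach().requires_grad_(True)
        logits = classifier(x_in, t)                            # (B, 1000)
        log_probs = F.log_softmax(logits, dim=-1)
        selected = log_probs[range(len(logits)), y.view(-1)]    # log p(y_i | x_i), one per image
        # summing is fine: each x_i only influences its own term, so the gradient of the
        # sum is a batch of per-image gradients dL_i/dx_i
        grad = torch.autograd.grad(selected.sum(), x_in)[0]     # same shape as x
        return grad * classifier_scale                          # the scale "s" from the paper
```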
And that's pretty much it — we now just apply the logic we saw in the paper, and guys, we're done: using the new mean and the log variance, we sample by adding exp of half the log variance times the noise onto the new mean, and that's how we get the next image in the reverse process. This just repeats on and on, but that was the gist of the paper. So, where is the bug I mentioned? I'll show you the issue in a second, but basically they pass x_{t+1} and t+1 here — they should have passed the out mean instead of x_{t+1}, because that corresponds to x_t, whereas this is x_{t+1}. It might be a bit subtle, so let me try to explain. As you can see, they call the conditioning function with x_{t+1} and t+1, so we end up computing the classifier logits at x_{t+1}. But if you recall from the actual paper — let me find the expression — we want to condition on x_t, not on x_{t+1}; otherwise the gradients we get are taken at x_{t+1} rather than at x_t, which slightly messes up the logic I explained: mu belongs to the x_t step, and it's that same mu we want to pass into the classifier, not the previous, noisier image. Despite all of this, it's very subtle — I don't even know how I noticed it, I guess I was being fairly pedantic stepping through this code — and they get pretty much the same results; they didn't even notice it until after the paper was published, or something like that. So here is the issue I opened a couple of days ago, which is already closed. I wrote that I think I found two bugs: first, shouldn't we pass the out mean, i.e. x_t, instead of x_{t+1} here, and similarly the smaller timestep instead of t — and I linked to the particular line of code, the same line I showed you. One of the authors replied: "yes, this is indeed a slight bug which we noticed shortly after releasing our work; however, we did try updating it to use the correct formula and found that it did not noticeably change results." Again, that's the tricky part about training machine learning models. The second thing I was confused about was the summation of those losses, but that turned out to make a lot of sense — it was just a minor mistake on my side. In any case, I'll link this issue down in the video description if you want to dig deeper. So guys, that's pretty much it: I showed you what the paper introduced — the two main things being the improved UNet architecture and the classifier guidance technique — and then I showed you how to train the classifier on noisy images in the classifier train script, and how to use that very same classifier inside the classifier sample script to shift the mean and thus sample images from the class-conditioned diffusion model.
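To wrap the sampling side up, here's the whole guided reverse step gathered into one sketch. The real p_mean_variance and conditioning logic live on the GaussianDiffusion class and take clipping options and model_kwargs, so the signatures below are simplified assumptions; I've also kept x_{t+1} as the argument to the guidance function to mirror what the repo actually does (the slight bug discussed above):

```python
import torch

def guided_p_sample(diffusion, model, cond_fn, x_tp1, t, y):
    """One guided reverse step: predict (mu, Sigma), shift mu by Sigma * s * grad, then sample."""
    out = diffusion.p_mean_variance(model, x_tp1, t, model_kwargs={"y": y})
    grad = cond_fn(x_tp1, t, y)                              # classifier gradient (already scaled by s)
    new_mean = out["mean"] + out["variance"] * grad          # the shifted mean from the paper
    noise = torch.randn_like(x_tp1)
    nonzero_mask = (t != 0).float().view(-1, 1, 1, 1)        # no noise is added at the final step t=0
    return new_mean + nonzero_mask * torch.exp(0.5 * out["log_variance"]) * noise
```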
There are a lot of details here, but hopefully you picked up something useful from this video. If you did, consider subscribing and share the video out, and until next time, bye bye!
Info
Channel: Aleksa Gordić - The AI Epiphany
Views: 12,339
Keywords: arxiv, paper explained, the ai epiphany, ai, deep learning, machine learning, aleksa gordic, artificial intelligence, code walk-through, diffusion models, diffusion models beat gans on image synthesis, classifier guidance, noise-aware classifiers, ddpm, guided ddpm, openai
Id: hAp7Lk7W4QQ
Length: 68min 55sec (4135 seconds)
Published: Sun Jul 17 2022