Diffusion Models | PyTorch Implementation

Captions
Hey there and welcome to this video. In the following I'll be doing a paper implementation of diffusion models. Diffusion models are extremely popular right now and are being applied to almost every task that involves some kind of generation. The most popular task they are applied to at the moment is image generation, and you have probably heard a lot about models like DALL-E 2, Stable Diffusion or Imagen. While all of them differ in their exact approach, they all share the same fundamental underlying technique, and we are going to implement this technique in this video. Before we get started though: in order to fully understand the video and the implementation, you should have a basic understanding of the theory behind diffusion models. If you are not familiar with it, you can take a look at my explanation video which I made recently, or at all the other incredible resources out there.

Okay, so what is this video going to be about exactly? At first we are going to do a basic implementation of diffusion models in PyTorch. This will consist of coding all the tools for the diffusion process, so noising images, sampling images and so on. Of course we will also cover implementing the famous UNet architecture as well as the training loop. Then we'll train this model on some dataset and look at the results. After that you'll have a great starting point to apply diffusion models to your own datasets. Furthermore, over time there have been many improvements to diffusion models, making them faster or better, which is why I want to talk about a few of them and also implement them. The two main things I want to cover are classifier-free guidance and exponential moving average; don't worry if you don't know what they are, I'll explain them later. However, these improvements will require us to rewrite some of the existing code to work conditionally, since classifier-free guidance only works with labels and our first implementation will be unconditional. Of course we're also going to train this and look at the results afterwards. After that you should have a pretty neat understanding of how to implement diffusion models, and maybe you can extend them even further with your own ideas.

So before getting into the coding, if you need a short refresher on diffusion models, this part will give you a recap. Diffusion models are generative models which became super popular in recent times for beating GANs on image synthesis. The main idea behind diffusion models is that we have two processes. The first one is easy: we just gradually apply noise to an image over a lot of steps until the image has turned into complete noise. The cool thing is that we can also do this in one single step, which saves us a lot of compute and time. Now the goal of the reverse process is to learn how to remove noise from an image, but as we saw, we don't instantly try to remove all of the noise; we do it step by step, which is just easier for the model to learn. For that we have a neural network which takes in an image and the current timestep t and predicts the image at t−1. Well, to be more specific, we saw that we can reformulate it so that we only predict the noise in the image; then we can subtract the predicted noise from the image at time t and get the image at t−1. We do this over and over again until we arrive at t = 0, and if the neural network was trained well, we now have an image which could occur in our training data.
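For reference, these are the two formulas this recap refers to, written in the notation of the DDPM paper (a summary of the paper's equations, not something shown verbatim in the video):

```latex
% Forward (noising) process in a single step:
x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon,
\qquad \epsilon \sim \mathcal{N}(0, I),
\qquad \alpha_t = 1 - \beta_t, \quad \bar{\alpha}_t = \textstyle\prod_{s=1}^{t} \alpha_s

% Reverse (denoising) step used during sampling (Algorithm 2 of the DDPM paper):
x_{t-1} = \frac{1}{\sqrt{\alpha_t}} \left( x_t
          - \frac{1 - \alpha_t}{\sqrt{1 - \bar{\alpha}_t}}\, \epsilon_\theta(x_t, t) \right)
          + \sqrt{\beta_t}\, z,
\qquad z \sim \mathcal{N}(0, I) \text{ for } t > 1, \text{ otherwise } z = 0
```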
Alright, after this short reminder, let's dive into the code. Okay, so you can see I've created a new project and have what will be our main file up here. At first I will just do the casual imports for all the things that we need. The first thing that we'll code, and this will be the most important thing of all, are the diffusion tools. As I said earlier, this covers setting up the noising schedule, a function for noising images, and also one for sampling images, and I'm just going to put all of this in one single class called Diffusion. The main parameters are the number of noising steps, the lower and upper end for beta, and the image size. The default parameters you see here are also the ones that we are going to use in all training runs: I'm setting the number of timesteps to a thousand as proposed in the first papers, I also use the same values for the betas, and the image size will be 64 by 64. I'm using a fairly small resolution here because I'm only training this on my RTX 3090 and higher resolutions would take much longer. Also, going with the flow of recent state-of-the-art models, you would probably rather train separate upsamplers to get to higher resolutions instead of just training one big base model, but that's just a side note. In this video we will also just be using the linear beta schedule as proposed in the first papers; if you want to use the cosine schedule proposed by OpenAI, it should be very simple to change though.

Okay, so these will be our betas, and remember the alphas that were introduced, which are basically just there to make the formulas in our algorithms look a bit nicer; we'll define these here. The alphas are simply one minus beta, and then we also need the cumulative products of the alphas. You'll see in a second how exactly we are going to use them, so don't worry if you don't remember that. The first really important function we'll need is one to noise images. If you remember from the explanation video, there were two options to do that: the first would be to iteratively add noise over and over again until we are at our desired step, but there was also this great insight that we can just do it in one single step, and of course that's exactly what we will be doing because it's much faster. You can see the formula on the screen, and basically all we do is take the square root of alpha hat and the square root of one minus alpha hat, then we also need some random noise, plug all of it into the formula, and voilà, this function is ready. Another helpful function that we need is one which can sample some timesteps, because we need that in Algorithm 1 for training, and we can do that in a single one-liner.
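A rough sketch of what this Diffusion class could look like, reconstructed from the description above; the class and method names (Diffusion, prepare_noise_schedule, noise_images, sample_timesteps) follow the narration but are not guaranteed to match the video's exact code:

```python
import torch


class Diffusion:
    def __init__(self, noise_steps=1000, beta_start=1e-4, beta_end=0.02,
                 img_size=64, device="cuda"):
        self.noise_steps = noise_steps
        self.beta_start = beta_start
        self.beta_end = beta_end
        self.img_size = img_size
        self.device = device

        # Linear beta schedule, plus the alphas and their cumulative product.
        self.beta = self.prepare_noise_schedule().to(device)
        self.alpha = 1.0 - self.beta
        self.alpha_hat = torch.cumprod(self.alpha, dim=0)

    def prepare_noise_schedule(self):
        return torch.linspace(self.beta_start, self.beta_end, self.noise_steps)

    def noise_images(self, x, t):
        # x_t = sqrt(alpha_hat) * x_0 + sqrt(1 - alpha_hat) * eps, done in one step.
        sqrt_alpha_hat = torch.sqrt(self.alpha_hat[t])[:, None, None, None]
        sqrt_one_minus_alpha_hat = torch.sqrt(1.0 - self.alpha_hat[t])[:, None, None, None]
        eps = torch.randn_like(x)
        return sqrt_alpha_hat * x + sqrt_one_minus_alpha_hat * eps, eps

    def sample_timesteps(self, n):
        # Uniformly sample random timesteps for Algorithm 1.
        return torch.randint(low=1, high=self.noise_steps, size=(n,))
```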
Okay, so now let's take care of the sampling. Our sampling function takes in the model which we'll use for sampling and the number of images we want to sample. At first we set the model to evaluation mode and wrap the following context in torch.no_grad. Generally, everything that we are doing here follows Algorithm 2 from the DDPM paper, which I also talked about in the explanation video. The first thing I'm doing here is creating our initial images by sampling from a normal distribution using torch.randn, and now comes the big loop going over all 1000 timesteps in reversed order, starting with the highest and going down to one. The first thing in the loop is creating the timesteps by making a tensor of length n filled with the current timestep, and after that I feed that into the model together with the current images. The last thing we'll need is noise, which we add and scale with the variance later. But remember, we only need noise for timesteps greater than one, because in the last iteration we don't want to add any; it would just make our final outcome worse, so in that case I simply set the noise to zero. And now for the final operation in each iteration: we alter our images and remove a little bit of noise by making use of the formula that you see in the algorithm, and that's already all there is to the main sampling loop. After that we set the model back to training mode and clamp the output to the valid range of -1 to 1. The plus one and the division by two are just for bringing the values back to the range of 0 to 1, which we then multiply by 255 to bring them into the valid pixel range, and finally we change the data type for saving them later. And that's it, easy right?
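Continuing the class sketched above, the sampling loop (Algorithm 2 of the DDPM paper) could be added roughly like this; again a reconstruction from the narration rather than the exact code:

```python
import torch


class Diffusion:
    # ... __init__, noise_images and sample_timesteps as in the sketch above ...

    @torch.no_grad()
    def sample(self, model, n):
        """Algorithm 2: start from pure noise and iteratively denoise."""
        model.eval()
        x = torch.randn((n, 3, self.img_size, self.img_size), device=self.device)
        for i in reversed(range(1, self.noise_steps)):
            t = torch.full((n,), i, dtype=torch.long, device=self.device)
            predicted_noise = model(x, t)
            alpha = self.alpha[t][:, None, None, None]
            alpha_hat = self.alpha_hat[t][:, None, None, None]
            beta = self.beta[t][:, None, None, None]
            # No extra noise on the very last step.
            noise = torch.randn_like(x) if i > 1 else torch.zeros_like(x)
            x = 1 / torch.sqrt(alpha) * (
                x - ((1 - alpha) / torch.sqrt(1 - alpha_hat)) * predicted_noise
            ) + torch.sqrt(beta) * noise
        model.train()
        # Map from [-1, 1] back to valid uint8 pixel values.
        x = (x.clamp(-1, 1) + 1) / 2
        return (x * 255).type(torch.uint8)
```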
Until now we've treated the model in the sampling loop as a black box which just takes in the noised images and timesteps, so let's start coding the UNet now. I switched over to the modules.py file and defined the basic construct for the UNet. It takes in the input and output channels, which will both be three since we are working with RGB images, but feel free to try it with just black-and-white images too. The next argument is the dimension of the timestep embedding, which you will see in a minute. As you probably know, a UNet has an encoder, a bottleneck and a decoder. Since our model will be very easy and straightforward, I won't be making separate modules for the three sub-components and will just put all the actual modules here, but you'll see which parts belong to the encoder, decoder and bottleneck. For now I'll just put the names of the modules here to give you a brief overview of what the UNet will consist of, and afterwards we'll code each of the modules themselves. At first we start with a DoubleConv, which, as the name suggests, is just a wrapper for two convolutional layers; you'll see the details soon. After that we put three downsample blocks, each followed by a self-attention block. The arguments to the downsample block are just the input and output channels, and for the self-attention block the first argument is the channel dimension and the second one is the current image resolution. Each downsample block reduces the size by two, so we go from 64 to 32 to 16 to 8. After that we have the bottleneck, which just consists of a bunch of convolutional layers, and then we go straight into the decoder part, which is basically the reverse of the encoder and consists of three upsampling blocks, again each followed by a self-attention block. At last we project back to the output channel dimension using a normal conv layer.

The forward pass is pretty straightforward too: we take as input the noised images and the timesteps. As mentioned earlier, the timesteps will only be a tensor with the integer timestep values in it, but here comes the interesting part: instead of giving these to the model in their plain form, we'll make it easier for the model and encode them. Most diffusion papers just use the sinusoidal embedding, and we are also going to go with that. I'll put some resources in the description if you want to know more about it, but due to time reasons we'll just take it as given. The only thing to note is that we provide the explicit dimension into which we want to encode the timesteps. So to give you an example, this is what it looks like if we take the following time tensor and encode it using the sinusoidal embedding; keep that in mind, because we will use that knowledge in some of the modules. Okay cool, and the rest of the forward function is literally the same as what we talked about in the init method, so nothing new here, except that you can see that the upsampling blocks also take in the skip connections from the encoder. We'll see in a minute how they are combined, but that's already all there is to the abstract layout of the UNet.

Let's now take a look at how all the modules work in detail. The first thing we will look at is the DoubleConv module. It is really just a normal convolution block as you have it in many other architectures and consists of a 2D convolution followed by a group norm and a GELU activation, and then another convolution and a group norm. Furthermore, there's also the possibility to add a residual connection by setting the residual keyword to true. Next up is the downsample block. Its main component is a max pool to reduce the size by half, followed by two DoubleConvs; it's as easy as that. But there's one more really important thing, and that is the embedding layer you see down here. Remember that we encode our timesteps to a certain dimension; however, since most blocks differ in their hidden dimension from the timestep embedding, we make use of a linear projection to bring the time embedding to the proper dimension. You can see this just consists of a SiLU activation followed by a linear layer going from the time embedding dimension to the hidden dimension. In the forward pass we first feed the images through the convolutional block, project the time embedding accordingly, then just add both together and return the result; that's it. Okay, now let's move to the upsample block. You can see that the init method is basically exactly the same, except that we have an upsample operation instead of the max pooling from the downsample block. The main difference in the forward pass is that we also take in the skip connection which comes from the encoder, and after upsampling the normal x we concatenate it with the skip connection and feed it through the convolutional block; at the end we also add the time embedding to it again.
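A possible sketch of the building blocks described above. This is a reconstruction from the narration: the sinusoidal encoding function, the residual/mid-channel details and the bilinear upsampling mode are assumptions filled in on top of what is said in the video.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def pos_encoding(t, channels):
    # Sinusoidal timestep embedding: t is a (batch,) tensor of integer timesteps.
    inv_freq = 1.0 / (10000 ** (torch.arange(0, channels, 2, device=t.device).float() / channels))
    t = t.unsqueeze(-1).float()
    return torch.cat([torch.sin(t * inv_freq), torch.cos(t * inv_freq)], dim=-1)


class DoubleConv(nn.Module):
    def __init__(self, in_channels, out_channels, mid_channels=None, residual=False):
        super().__init__()
        self.residual = residual
        mid_channels = mid_channels or out_channels
        self.double_conv = nn.Sequential(
            nn.Conv2d(in_channels, mid_channels, kernel_size=3, padding=1, bias=False),
            nn.GroupNorm(1, mid_channels),
            nn.GELU(),
            nn.Conv2d(mid_channels, out_channels, kernel_size=3, padding=1, bias=False),
            nn.GroupNorm(1, out_channels),
        )

    def forward(self, x):
        if self.residual:
            return F.gelu(x + self.double_conv(x))
        return self.double_conv(x)


class Down(nn.Module):
    def __init__(self, in_channels, out_channels, emb_dim=256):
        super().__init__()
        # Halve the resolution, then two DoubleConvs.
        self.maxpool_conv = nn.Sequential(
            nn.MaxPool2d(2),
            DoubleConv(in_channels, in_channels, residual=True),
            DoubleConv(in_channels, out_channels),
        )
        # Project the time embedding to this block's channel dimension.
        self.emb_layer = nn.Sequential(nn.SiLU(), nn.Linear(emb_dim, out_channels))

    def forward(self, x, t):
        x = self.maxpool_conv(x)
        emb = self.emb_layer(t)[:, :, None, None].repeat(1, 1, x.shape[-2], x.shape[-1])
        return x + emb


class Up(nn.Module):
    def __init__(self, in_channels, out_channels, emb_dim=256):
        super().__init__()
        self.up = nn.Upsample(scale_factor=2, mode="bilinear", align_corners=True)
        self.conv = nn.Sequential(
            DoubleConv(in_channels, in_channels, residual=True),
            DoubleConv(in_channels, out_channels, in_channels // 2),
        )
        self.emb_layer = nn.Sequential(nn.SiLU(), nn.Linear(emb_dim, out_channels))

    def forward(self, x, skip_x, t):
        x = self.up(x)
        x = torch.cat([skip_x, x], dim=1)  # concatenate the encoder skip connection
        x = self.conv(x)
        emb = self.emb_layer(t)[:, :, None, None].repeat(1, 1, x.shape[-2], x.shape[-1])
        return x + emb
```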
Last but not least we have the self-attention layer, and this is really just a completely normal attention block as you probably know it from Transformers. This can be seen well in the forward pass: we first have a pre-layer norm, which is followed by multi-head attention; then we add the skip connection and pass it through the feed-forward layer, which also consists of a layer norm and two linear layers separated by a GELU activation. The first and last operations which you can see here are just for bringing the images into the right shape, by first flattening them and then moving the channel axis to the last dimension, such that the attention can work properly. And voilà, that's it for the model part. Once again you see that it's quite easy and not as hard as you might have expected from diffusion models, right?

Okay, we're almost done, we only need to write a training loop now. But before that I will write a few helper functions that will be useful in the training loop. The first method is just for plotting images and the second is just for saving them. The next function is for preparing the data: I just have some basic transforms here which resize and crop the images, then convert them to a tensor and normalize them to be between -1 and 1. After that I define a dataset, where I just use the built-in ImageFolder, and then return the data loader. The last function you see here is just for setting up the folders for saving the model and the results.

Awesome, let's go to the training loop. At first I just define all the necessary things such as the data loader, the model, the optimizer, the loss, our Diffusion class which we have written before, and the logger to log training stats. Now we can start the actual training loop: we iterate over all our epochs and in each epoch run over the entire dataset. Everything that comes now follows exactly Algorithm 1 from the DDPM paper. Our data loader provides us with the images, then we sample random timesteps in the range between 1 and 1000, and after that we noise the images accordingly based on those timesteps using the noise_images function from the Diffusion class. This also returns the noise which was used, since we need it in the loss calculation. Then all we do is feed our noised images to the model, which predicts the noise in the images, and we take the mean squared error between the actual noise and the predicted noise. Then it's just the usual optimizer stuff, where we first call zero_grad, then backward the loss, and then take an optimization step. That's already all there is to the main training loop, it's extremely simple, right? I will just add two more lines for logging, and after each epoch we'll sample some images, log them too, and save the model.
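The training loop (Algorithm 1) might look roughly like this. `args`, `get_data` and `save_images` stand for the hyperparameter object and helper functions mentioned above, `UNet` for the model just described; these names are assumptions, not the video's exact code:

```python
import torch
import torch.nn as nn
import torch.optim as optim


def train(args):
    device = args.device
    dataloader = get_data(args)            # helper sketched above: resize, crop, normalize to [-1, 1]
    model = UNet().to(device)              # the UNet described earlier
    optimizer = optim.AdamW(model.parameters(), lr=args.lr)
    mse = nn.MSELoss()
    diffusion = Diffusion(img_size=args.image_size, device=device)

    for epoch in range(args.epochs):
        for images, _ in dataloader:
            images = images.to(device)
            # Algorithm 1: sample t, noise the images, predict the noise, take the MSE.
            t = diffusion.sample_timesteps(images.shape[0]).to(device)
            x_t, noise = diffusion.noise_images(images, t)
            predicted_noise = model(x_t, t)
            loss = mse(noise, predicted_noise)

            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            # (logging lines omitted)

        # After each epoch: sample a few images and save a checkpoint.
        sampled_images = diffusion.sample(model, n=8)
        save_images(sampled_images, f"results/{epoch}.jpg")
        torch.save(model.state_dict(), "models/ckpt.pt")
```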
At this point I just want to re-emphasize that diffusion models are generally pretty easy to implement, especially if you think back to the explanation video and how math-heavy that was. Another thing I want to mention is that when you look at other diffusion model implementations, especially the really popular ones, you'll see that their code is very different and much harder to understand regarding the sampling process. That's because they are still using the lower-bound formulation, which I personally don't understand, since the first DDPM papers clearly show that this is not necessary anymore. My only guess is that it makes sense for a few further modifications to diffusion models, such as learned variance, where we do need the lower bound again. But if anybody of you knows why that is done, feel free to share it in the comments. However, I think for understanding diffusion models this implementation is much better suited, since you can really focus on what was shown in the paper and don't get confused by a thousand lines of additional code which you didn't expect and have no idea where it comes from.

So I guess it's time to train this model now and see what we get. I'll be using a small landscape dataset for this unconditional training; it has about 4,000 images in it. I will train it for 500 epochs using a batch size of 12 and an image size of 64 by 64, on a single GPU with a constant learning rate. And with that being said, let's start the training.

So this has been training for 500 epochs now and you can see some results on screen here. I personally like them a lot, and I guess we can definitely say that it is able to generate landscapes. Here are also some more results from different epochs, and it's very interesting to see how these evolve over time until they become realistic landscapes.

Awesome. So now, as promised earlier, we'll also look at two improvements for diffusion models, which are classifier-free guidance and exponential moving average. Classifier-free guidance, or CFG, can be used to improve generations if we train with classes. It can help to avoid posterior collapse, which simply means that the model ignores the conditional information and just generates any image; an example would be if we wanted to generate an airplane but the model just generates an arbitrary class such as a car. CFG helps to avoid that and often also results in better sampling outcomes. But you already hear that we need classes for that, which is why we need to adapt our current code a little bit. We'll be using CIFAR-10 for that, which has 10 classes. Originally this dataset came out in 32 by 32 resolution, but I'm using an upscaled version at 64 by 64.

Let's first talk about the model changes. There are plenty of ways to condition a model, but one of the easiest to implement, and one that works reasonably well, is to just add the conditioning information to some intermediate result inside the model. In diffusion models we have an easy time, since we already have the timestep as a condition, so we can just add the conditional information of the label to that. And again, instead of adding the labels in their plain form, being a number in the range of 0 to 9 for the 10 different classes, we'll also use an embedding for them, but this time one that the model can learn itself. For that we extend our model with a num_classes argument and create a label embedding for all classes; note that it has the same number of dimensions as the time embedding. In the forward pass, literally the only thing we're changing is that, if labels are provided, we add their embedding to the timestep embedding, and that's already all the changes for the model. Easy, right?
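The conditional extension could be sketched as below; everything except the label embedding is the unchanged UNet from before (elided here with `...` comments), and the class and attribute names are my own guesses rather than the exact code:

```python
import torch.nn as nn


class UNetConditional(nn.Module):
    """Class-conditional variant: identical to the unconditional UNet except for the
    learned label embedding that is added to the timestep embedding."""

    def __init__(self, c_in=3, c_out=3, time_dim=256, num_classes=None):
        super().__init__()
        self.time_dim = time_dim
        # ... same DoubleConv / Down / SelfAttention / bottleneck / Up layers as before ...
        if num_classes is not None:
            # One learned embedding per class, same size as the time embedding.
            self.label_emb = nn.Embedding(num_classes, time_dim)

    def forward(self, x, t, y=None):
        t = pos_encoding(t, self.time_dim)   # sinusoidal timestep embedding as before
        if y is not None:
            t = t + self.label_emb(y)        # the only change: add the class embedding
        # ... the rest of the forward pass is unchanged: run x and t through the blocks ...
```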
Going over to our training code, we now expect labels to also come from our data loader, and during training we pass them to the model as well. Now our code works with class conditioning, cool. Okay, so how does classifier-free guidance work? During training, for roughly 10 percent of the time, we train unconditionally; that way the model learns to do both conditional and unconditional generation. During sampling we also sample both ways, but linearly interpolate away from the unconditional prediction towards the conditional one, and we do that in every iteration. I'll put the paper for CFG in the description so you can read it more carefully if you're interested. So the only thing we change in the training loop is that we set the labels to None for 10 percent of the time, which results in the model only using the timestep and no class information. Okay, and then we also need to modify the sampling loop. You can see the exact formula for CFG proposed in the paper here on screen; it's basically a linear interpolation between the conditional and unconditional predicted noise, which is why we can use the torch.lerp function, which does exactly that. So if the CFG scale is bigger than zero, meaning that we want to use CFG, we first predict the noise unconditionally and then just do the interpolation. That's all we need for enabling CFG; we'll do some comparisons later to see if it actually works better than training without it.
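A sketch of the CFG-enabled sampling loop, mirroring the earlier sample method; the standalone function form, its name and the cfg_scale default are assumptions:

```python
import torch


@torch.no_grad()
def sample_cfg(diffusion, model, n, labels, cfg_scale=3.0):
    """Sampling with classifier-free guidance: interpolate from the unconditional
    towards the conditional noise prediction in every denoising step."""
    model.eval()
    x = torch.randn((n, 3, diffusion.img_size, diffusion.img_size), device=diffusion.device)
    for i in reversed(range(1, diffusion.noise_steps)):
        t = torch.full((n,), i, dtype=torch.long, device=diffusion.device)
        predicted_noise = model(x, t, labels)
        if cfg_scale > 0:
            # Second forward pass without any class information, then interpolate.
            uncond_predicted_noise = model(x, t, None)
            predicted_noise = torch.lerp(uncond_predicted_noise, predicted_noise, cfg_scale)
        alpha = diffusion.alpha[t][:, None, None, None]
        alpha_hat = diffusion.alpha_hat[t][:, None, None, None]
        beta = diffusion.beta[t][:, None, None, None]
        noise = torch.randn_like(x) if i > 1 else torch.zeros_like(x)
        x = 1 / torch.sqrt(alpha) * (
            x - ((1 - alpha) / torch.sqrt(1 - alpha_hat)) * predicted_noise
        ) + torch.sqrt(beta) * noise
    model.train()
    x = (x.clamp(-1, 1) + 1) / 2
    return (x * 255).type(torch.uint8)
```

On the training side, the only change described is dropping the labels for roughly 10 percent of the batches, for example with something like `if random.random() < 0.1: labels = None` before the forward pass.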
The second improvement that we will implement is exponential moving average, or EMA. EMA is basically a way of enforcing smoother training: it literally smooths the trajectory of the model updates. So if the training is really noisy and the direction of the optimization changes a lot, EMA can smooth this trajectory and lead to a more robust outcome, since it's not as susceptible to outliers in the model updates as the main model. EMA works by making a copy of the initial model weights and then updating this copy based on a moving average of the main model's weights. The formula for updating a single weight using EMA looks like this: it's also an interpolation, between the old weight and the new parameter, weighted by beta. Beta is usually around 0.99, which means the new weight only affects the EMA parameters a little bit, preventing outliers from having a big effect.

Let's move to the code and implement it. I'm just going to create a new class for that, which takes beta as an argument. We'll let the EMA updates start only after a certain number of iterations to give the main model a quick warm-up; during the warm-up we always just reset the EMA model parameters to the main ones. After the warm-up we then always update the weights by iterating over all parameters and applying the formula I just showed you. Cool, that's it for the EMA class, so let's put it into action in the training loop: we first define it here and create a copy of the model, and then the only thing we add is a call to the step_ema function after every model update, and that's already it. We'll also compare some samples from the EMA model versus the original model after training to see its effect.

Okay, then let's start the training on CIFAR-10 with the new, updated class-conditional model which uses EMA and CFG. I trained this for 300 epochs and you can see the results here on screen for the different classes. These are from the normal model without using CFG. If we use the EMA model, you can already see a significant difference just from looking at it. On the other hand, if we only use CFG, then these are the results, and in the bottom right corner we have images sampled with both EMA and CFG. I guess we can definitely say that EMA and CFG help and result in better generations, and there isn't even a big trade-off that we need to accept for getting better samples. But that's just judging visually; to show that it's actually better than the normal model you could compute some metrics like FID, but for time reasons I won't be doing that here.

Alright, so this sums up the training of the conditional diffusion model with classifier-free guidance and EMA. You see that it's really not that hard, and all of this can be implemented in relatively few lines of code. And that's it for this video, where we implemented diffusion models from scratch in PyTorch. I really hope you enjoyed it, and feel free to suggest other topics I should cover or things that can be improved. Also, thank you for a thousand subscribers, and with that being said, I wish you a nice day.
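The EMA helper described above could be sketched like this; the warm-up threshold, the exact beta value (the video only says it is usually around 0.99) and the method names are assumptions:

```python
import copy


class EMA:
    """Exponential moving average of model weights, updated after each optimizer step."""

    def __init__(self, beta=0.995):
        self.beta = beta
        self.step = 0

    def update_model_average(self, ema_model, model):
        for ema_p, p in zip(ema_model.parameters(), model.parameters()):
            # EMA update for a single weight: w_ema <- beta * w_ema + (1 - beta) * w
            ema_p.data = self.beta * ema_p.data + (1 - self.beta) * p.data

    def step_ema(self, ema_model, model, step_start_ema=2000):
        # Warm-up: keep the EMA model identical to the main model for the first steps.
        if self.step < step_start_ema:
            ema_model.load_state_dict(model.state_dict())
        else:
            self.update_model_average(ema_model, model)
        self.step += 1


# Hooking it into the training loop (sketch):
#   ema = EMA(beta=0.995)
#   ema_model = copy.deepcopy(model).eval().requires_grad_(False)
#   ... after every optimizer.step():
#   ema.step_ema(ema_model, model)
```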
Info
Channel: Outlier
Views: 67,598
Id: TBCRlnwJtZU
Length: 22min 26sec (1346 seconds)
Published: Wed Sep 21 2022