Diffusion models from scratch in PyTorch

Captions
Welcome to this tutorial on how to implement a denoising diffusion model in PyTorch. On YouTube I've already watched several great tutorials on these models; however, so far there has been very little hands-on content. Because of that, I created another Colab notebook with an implementation of a simple diffusion model, and throughout this video I explain both the theory and the implementation. I hope you find it useful.

With diffusion models we are in the domain of generative deep learning, which means we want to learn a distribution over the data in order to generate new data. There are already lots of model architectures available for this, for example generative adversarial networks and variational autoencoders. I found a figure in NVIDIA's blog that summarizes the landscape quite nicely: VAEs and also normalizing flows have been shown to produce diverse samples quickly, but the quality is usually not great compared to GANs. As a recap, a VAE compresses an input into a latent distribution and then samples from this distribution to recover the input; after training, we can sample from the latent space to generate new data points. VAEs are usually quite easy to train, but as mentioned, the outputs can be blurry. GANs, on the other hand, produce high-quality outputs but are most of the time difficult to train. This stems from the adversarial setup, which can cause problems such as vanishing gradients or mode collapse. I have to say that I experienced many of these issues myself over the years. Of course, there are lots of improvements nowadays, but it's still not that easy to fit such a model.

As you probably know, the diffusion model is a rather new generative deep learning model that has been shown to produce high-quality samples which are also quite diverse. Diffusion models are part of a lot of modern deep learning architectures and recently showed great success in text-guided image generation, such as in DALL-E 2 or Imagen; I'm pretty sure you have seen this in the media. Diffusion models work by destroying the input until only noise is left and then recovering the input from noise using a neural network; more details follow later in this video. Of course, diffusion models also have their downsides, for example the sampling speed: because of the sequential reverse process, they are much slower than GANs or VAEs. But as these models are still in their infancy, there may be lots of improvements in the future.

Using that as motivation, let's try to build a simple diffusion model ourselves. More specifically, we will fit a diffusion model on an image dataset. The architecture and model design are mainly based on two papers. The paper on the left, from researchers at UC Berkeley, was one of the first publications to use diffusion models for image generation; the researchers demonstrate what this model is capable of and also introduce some nice mathematical properties. The paper on the right, from OpenAI, can be seen as a follow-up with several improvements that make the image quality even better, for example additional normalization layers, residual connections, and more. In my simple example I don't plan to build the latest state-of-the-art architecture, but instead just want to build a solid base model. Finally, I want to point out that this video is focused on the implementation and not on all of the theoretical details. Of course, I will explain everything that is required to follow along, but if you really want to get a deeper understanding, you might want to have a look at one of the great resources collected on this GitHub page.
With that being said, let's continue by answering some immediate questions about implementing diffusion models, in fast-forward mode.

First question: what actually are diffusion models? Diffusion models work by destroying some input, for example an image, by gradually adding noise, and then recovering the input from the noise in a backward process, also called denoising. This is a Markov chain, because it's a sequence of stochastic events where each time step depends only on the previous time step. A special property of diffusion models is that the latent states have the same dimensionality as the input. The task of the model can be described as predicting the noise that was added to each of the images; that's why the backward process is called parametrized: we use a neural network for it. In order to generate new data, one can simply perform the backward process starting from random noise, and new data points are constructed. A typical number of steps in this sequential process is one thousand; of course, the larger this number is, the slower the sampling will be.

Next question: how do you start implementing such a model? We mainly need three things: a scheduler that sequentially adds noise, a model that predicts the noise in an image (which will be a U-Net in our case), and finally a way to encode the current time step. We will dig deeper into each of these components, but first let's have a look at our dataset.

This is the Colab notebook, and the dataset we use is called Stanford Cars, which is included in PyTorch. It consists of around 16,000 images in total, 8,000 train and 8,000 test images. As we build a generative model, we can use all of them, so we have 16,000 images; here are some samples. Later I will crop them to the same size and reduce the resolution a bit to make the learning faster. As you can see, the cars in these images are in many different poses with a variety of backgrounds, so we should expect a lot of diversity in our generated images, which we will also see later. Of course, there is no need to reinvent the wheel: there are already some existing implementations of diffusion models, and this notebook is highly inspired by them, which means I got a lot of the code from these two resources.

Implementing the diffusion model consists of several steps, mainly the forward process, the backward process, and the loss function; then we can already define sampling and training. Before we start implementing the forward process, let's talk about it in more detail. Even though I try to keep it practical, we have to talk about some of the math of diffusion models. The forward process is fairly easy: all we need to do is add noise to the images. This Markov process is usually denoted with q, and the noise that is added to an image depends only on the previous image. x_0 is always the initial input, and all other x_t are increasingly noisy versions of it. The way the noise is sampled is described by a conditional Gaussian distribution with a mean that depends on the previous image and a specific variance. Let's inspect this term more closely. The sequence of betas is a so-called variance schedule that describes how much noise we want to add in each of the time steps. x_{t-1} is the previous, less noisy image, which means the mean of our distribution is exactly the previous image multiplied by a term that depends on the variance schedule beta, while the variance of this normal distribution is fixed to beta multiplied by the identity matrix.
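For readers following along without the video, the forward transition referenced here is the standard DDPM forward kernel; reconstructed in formula form:

$$
q(x_t \mid x_{t-1}) = \mathcal{N}\!\left(x_t;\ \sqrt{1-\beta_t}\,x_{t-1},\ \beta_t \mathbf{I}\right)
$$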
I thought it makes sense to gain some more intuition about beta. As you probably know, each pixel in an image has three channels: red, green, and blue. These values are usually between 0 and 255, or, when normalized, between for example -1 and 1. Let's take a pure red pixel with a color code of (1, -1, -1), so we have a lot of red but no green and no blue. According to the previous formula, the distribution of this pixel in the next image is described by this mean and variance: the values of our red pixel multiplied by the square root of one minus beta form the mean of the distribution for that single pixel, and depending on the noise level beta this factor could for example be 0.99 for the first step; the variance is fixed to beta. If we choose a large beta, the pixel distribution is not only wider but also more shifted, which results in a more corrupted image; when sampling from this distribution, we consequently end up with more noise. Essentially, beta controls how fast we converge towards a mean of zero, which corresponds to a standard Gaussian distribution. I recommend trying out different values to see how the means converge towards zero. The important part is to add the right amount of noise, such that we arrive at an isotropic Gaussian distribution with a mean of 0 and a fixed variance in all directions; otherwise the sampling later will not work. That means we don't want to add too little noise, but also not so much that the image becomes pure noise too early. There are different scheduling strategies for this; in our case we will simply increase the noise linearly, but in the literature there are also choices like quadratic, cosine, sigmoidal, and more.

I said that the noise is added sequentially with this formula, but the researchers actually proposed an even better solution: because the sum of Gaussians is still a Gaussian, we can directly calculate the noisy version of an image for a specific time step t, without sequentially iterating over its predecessors. Based on the initial image x_0, we can sample the noisy version for any arbitrary time step t. For this we need to pre-calculate the closed form of the mean and variance based on the cumulative variance schedule. Let's see an example to better understand it. Say we choose 200 steps in our diffusion process. The variance schedule beta tells us how much noise we want to add in each of the steps; we increase it linearly up to a maximum value of 0.02. If we didn't increase it at all, it would take forever to end up with pure noise. The authors now define the term alpha, which is simply one minus beta; I like to think of it as a number that tells us how much information of the original image we keep when transitioning from one image to the next. By calculating the cumulative product of these alpha terms, we get a new term denoted as alpha bar (alpha with an overline). With this we can specify a new distribution that allows us to sample directly for a specific time step. That's the high-level explanation; for more details I recommend the excellent explanation video by Outlier, which covers all the mathematical derivations I didn't have time for here. A consequence for our training is that we simply sample a time step t and pass the noisified version of the image to the model, which makes the training much easier and smoother compared to sequentially iterating over the same image. That was mostly it regarding the forward process; let's jump into the code and see how we can implement it.
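Written out, the closed-form sampling distribution described here is (again a reconstruction of the on-screen formulas, with $\alpha_t = 1 - \beta_t$ and $\bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s$):

$$
q(x_t \mid x_0) = \mathcal{N}\!\left(x_t;\ \sqrt{\bar{\alpha}_t}\,x_0,\ (1-\bar{\alpha}_t)\mathbf{I}\right),
\qquad
x_t = \sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon,\quad \epsilon \sim \mathcal{N}(0,\mathbf{I}).
$$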
Alright, let's have a look at the noise scheduler; this is mainly inspired by the GitHub implementation. First we have a function that defines a linear beta schedule. We use torch.linspace for that, which interpolates between two values for as many time steps as we define, and these are our betas. Then there is a helper function that extracts a specific index from a list of pre-computed values and also takes the batch size into account, so we can use it later during training. The main function is forward_diffusion_sample, which uses pre-calculated values, exactly the ones I just mentioned: we calculate the alphas, then alpha bar, and some additional terms, so that we can compute the noisy version of an image. This function takes as input x_0, the initial image, and the time step t, and it returns the noisy version of that specific image at time step t. What happens inside is that we sample some noise, fetch the time-step-specific values from the pre-calculated lists (the square root of alpha bar and so on), and then simply calculate the new sample as the scaled original input plus the scaled noise. This gives us the noisified version of an image at a specific time step.

Next we can try that out on our dataset, but in order to do that we first have to put the dataset into a PyTorch DataLoader. Our images are Pillow images at the moment, which means we first have to convert them into tensors. As mentioned before, I resize the images to a shape of 64 by 64, which is relatively small but makes the training much faster. Then I do some data augmentation, such as random horizontal flips, and then I call ToTensor, which scales the data between zero and one. I mentioned that we need the data in a range of minus one to one in order to work with these betas, so the last transformation converts the data into the desired range: if we multiply by two, we have a scale of zero to two, and if we then subtract one, we end up in the desired range of minus one to one. As mentioned before, I load both the train and test splits and merge them into a so-called ConcatDataset, which holds all of the images. I also have a second function that does exactly the reverse direction: it converts a tensor image into a Pillow image, applies the reverse of the transformations we did here, and then simply displays the image; if we have a batch of images, we just take the first one to display it in the notebook. With that, we load the train and test splits, put everything into a DataLoader, and here we can iterate over one batch and watch an image being converted into more and more noise, until we end up with pure noise.

Now, after finishing the forward process, let's have a look at the neural network model used in the backward process. The authors propose to use a U-Net. This is a special neural network with a structure similar to the one used in an autoencoder. U-Nets are a popular model for image segmentation, and their output has the same shape as the input. The input passes through a series of convolutional and downsampling layers until a bottleneck is reached, and then the tensors are upsampled again and pass through more convolutional layers. The input tensors get smaller spatially but also deeper, as more channels are added.
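As a rough sketch of what this scheduler could look like in code (the exact names and defaults in the notebook may differ; the beta range and T = 300 are assumptions, though T = 300 is the value mentioned later in the video):

```python
import torch

def linear_beta_schedule(timesteps, start=0.0001, end=0.02):
    # Linearly increase the noise variance from `start` to `end`.
    return torch.linspace(start, end, timesteps)

def get_index_from_list(vals, t, x_shape):
    # Pick the schedule value for each timestep in the batch and reshape
    # it so it broadcasts over the image dimensions, e.g. (B, 1, 1, 1).
    batch_size = t.shape[0]
    out = vals.gather(-1, t.cpu())
    return out.reshape(batch_size, *((1,) * (len(x_shape) - 1))).to(t.device)

# Pre-computed closed-form terms.
T = 300
betas = linear_beta_schedule(timesteps=T)
alphas = 1.0 - betas
alphas_cumprod = torch.cumprod(alphas, dim=0)
sqrt_alphas_cumprod = torch.sqrt(alphas_cumprod)
sqrt_one_minus_alphas_cumprod = torch.sqrt(1.0 - alphas_cumprod)

def forward_diffusion_sample(x_0, t, device="cpu"):
    # Return the noisy image x_t and the noise that was added to create it.
    noise = torch.randn_like(x_0)
    sqrt_ac_t = get_index_from_list(sqrt_alphas_cumprod, t, x_0.shape)
    sqrt_om_t = get_index_from_list(sqrt_one_minus_alphas_cumprod, t, x_0.shape)
    x_t = sqrt_ac_t.to(device) * x_0.to(device) + sqrt_om_t.to(device) * noise.to(device)
    return x_t, noise.to(device)
```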
Besides that, there are a lot of other components that are typically part of these U-Nets, such as residual connections between the layers, batch or group normalization, and attention modules. As mentioned before, I aim to build a very simple and easy-to-understand model, and that's why I only use the main components of this architecture, such as down- and upsampling as well as some residual connections. One constraint of diffusion models is that the input needs to have the same dimension as the output, and because of that the U-Net is a great fit for image data. Our model will take a noisy image with three color channels as input and predict the noise in the image. Because the variance is fixed, we only output one value per pixel and channel, which means the model learns the mean of the Gaussian distribution of the images; this is also called denoising score matching. Note that there were also experiments with predicting the image mean instead of the noise mean, and both approaches seem to work. One important thing is that we need to tell the model which time step we are in, because the model always uses the same shared weights for each input, no matter whether it's step 45 or step 8. We will talk about this component in a second.

The reverse process can also be formulated mathematically: we start in x_T with Gaussian noise with a mean of 0 and unit variance; then, in a sequence, the transition from one latent to the next is predicted. This makes the model learn the probability density of an earlier time step given the current time step. As mentioned before, during training we just randomly sample time steps and don't go through the whole sequence; at sampling time, however, we need to iterate from pure noise in x_T all the way to x_0, which is the final image. The density p is defined by the Gaussian distribution of the predicted noise in the image. In order to get the image at time step t-1, we have to subtract this predicted noise from the image x_t during sampling. This is just a rough formula to give you an intuition; the exact one in the paper also takes the noise levels beta into account. Again, if you're looking for a deeper explanation of this, check out Outlier's math video.

Okay, now we have covered the two most important parts of diffusion models. Finally, let's talk about how we can consider the time step in our model. The neural network has shared parameters across time, which means it can't distinguish between the different time steps on its own, yet it needs to filter out noise from images with very different noise intensities. To circumvent this, the authors used positional embeddings, which were introduced with the Transformer model. They are a clever way to encode discrete positional information like sequence steps. Positional embeddings assign a different vector to each index; for example, if you have the values 1, 5, and 10, each of them gets its own embedding vector. The embeddings are typically calculated using sine and cosine functions for a specific embedding size, denoted by d in this case. For more details, I recommend the Machine Learning Mastery blog post in the video description. The positional embeddings are then simply added as additional input besides the noisy image and used in several places in the model.

Now that we are familiar with all the building blocks, let's see how we can implement this in practice. This is the backward process and the U-Net implementation. Let me quickly show you the code, so that you see it's not too much, because I really tried to make this model as simple as possible; that's why it's also called SimpleUnet.
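A sketch of such a sinusoidal time embedding module (this is the standard Transformer-style formulation; the module name is an assumption):

```python
import math
import torch
from torch import nn

class SinusoidalPositionEmbeddings(nn.Module):
    """Maps an integer timestep t to a dense vector using sine and cosine
    functions at different frequencies, as in the Transformer paper."""

    def __init__(self, dim):
        super().__init__()
        self.dim = dim

    def forward(self, time):
        device = time.device
        half_dim = self.dim // 2
        # Geometric progression of frequencies.
        freqs = math.log(10000) / (half_dim - 1)
        freqs = torch.exp(torch.arange(half_dim, device=device) * -freqs)
        # Shape (batch, half_dim): each timestep scaled by each frequency.
        args = time[:, None] * freqs[None, :]
        # Concatenate sine and cosine parts -> shape (batch, dim).
        return torch.cat((args.sin(), args.cos()), dim=-1)
```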
Let's go over this step by step. The incoming noisy image has three channels, for red, green, and blue, and we extend these channels by applying learnable filters in the convolutional layers, so the depth of our tensors increases; this happens until we have 1024 channels. In the up path we reduce the number of channels again, and besides that we also decrease and then increase the spatial size of the feature maps; we will see that in a second. All of this happens in these blocks, and in the forward function we simply iterate over each block, pass it the current feature map, and get an updated version with more (or fewer) channels and a smaller or larger size, which we then pass on again. In addition, we also include the time step in the form of a positional vector; as mentioned before, we use positional embeddings for that, calculated with sine and cosine functions, which return a vector describing the position of an index in a sequence. We also have an initial projection layer that converts the image to the first dimension, so to these 64 channels, and a readout layer that takes the last dimension, again 64, and converts it back to 3, which is the number of channels our image has.

Now the only question left is what happens in each of these blocks. Each block applies two convolutional layers to the input and also adds a few other things: we apply ReLU activation functions, we use batch normalization, and we also incorporate the time embedding. For that, we first transform the time embedding to match the number of channels required in this specific block and then simply add it to the current embedded image, so we essentially add the time embedding channel-wise. Then we apply the second convolution, which gives us a representation that contains both the time step and the image information. Finally, we down- or upsample: depending on whether we are in an upsampling or downsampling block, we either use a ConvTranspose2d layer or, if we downsample, a strided convolution with a specific padding, so that we end up with a downsampled version of the feature map. I won't go deeper here, but I can recommend having a look at the post from which I also got some inspiration for this implementation. The last part I want to mention here is the residual connections. This is simply done by storing all of the transformed feature maps in the downsampling steps and then reusing them as additional inputs in the upsampling steps; that means the input of our up blocks contains both the upsampled feature map and the residual connection from the corresponding downsampling step. The final output gives us the predicted noise distribution in the image. These models can get big quite easily: this one has 60 million parameters and is not even large. As said, you can extend this architecture with many things, such as group normalization and attention modules; I recommend reading the second paper I mentioned at the beginning of this video, which implements a lot of these additional changes.
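A minimal sketch of one such block, assuming a forward signature of (feature map, time embedding) as described above; the exact channel handling in the notebook may differ slightly:

```python
from torch import nn

class Block(nn.Module):
    """One down- or up-sampling block of the simple U-Net described above:
    two convolutions, batch norm, ReLU, plus an injected time embedding."""

    def __init__(self, in_ch, out_ch, time_emb_dim, up=False):
        super().__init__()
        self.time_mlp = nn.Linear(time_emb_dim, out_ch)
        if up:
            # In the up path the input is the upsampled feature map
            # concatenated with the residual from the down path (2 * in_ch).
            self.conv1 = nn.Conv2d(2 * in_ch, out_ch, 3, padding=1)
            self.transform = nn.ConvTranspose2d(out_ch, out_ch, 4, 2, 1)
        else:
            self.conv1 = nn.Conv2d(in_ch, out_ch, 3, padding=1)
            self.transform = nn.Conv2d(out_ch, out_ch, 4, 2, 1)
        self.conv2 = nn.Conv2d(out_ch, out_ch, 3, padding=1)
        self.bnorm1 = nn.BatchNorm2d(out_ch)
        self.bnorm2 = nn.BatchNorm2d(out_ch)
        self.relu = nn.ReLU()

    def forward(self, x, t):
        h = self.bnorm1(self.relu(self.conv1(x)))
        # Project the time embedding to the channel dimension and add it
        # to every spatial position (broadcast over height and width).
        time_emb = self.relu(self.time_mlp(t))[(...,) + (None,) * 2]
        h = h + time_emb
        h = self.bnorm2(self.relu(self.conv2(h)))
        # Down- or up-sample the spatial resolution by a factor of 2.
        return self.transform(h)
```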
The last and final part is the loss function. Diffusion models are optimized with the variational lower bound, similar to how it is done in variational autoencoders. However, as the authors draw the connection to denoising score matching, they propose an alternative formulation that is equivalent to using variational inference. To make it short, the final loss function simply means we calculate the L2 distance between the predicted noise and the actual noise in the image. This loss function is pretty simple, but there are quite a few derivations and considerations needed to arrive at this term, which I couldn't include in this hands-on video; it is highly recommended to look into some of the literature to get a better understanding. Based on that, the loss function is straightforward to define: we simply use the L1 or L2 loss (I tried out both) between the sampled noise and the predicted noise. The function get_loss takes an image, a specific time step, and our model, and returns the loss between the predicted noise and the sampled noise.

The final part before training is the sampling, because we need to be able to generate new images, and I wanted to test that during training, so I define these functions first. The first thing to note is that we should put a @torch.no_grad() decorator on each of these functions, so we don't track all of the intermediate images for gradient calculations; if you don't do that, you will quickly run out of memory. Sampling is straightforward: we pass an image and a time step to the model and get the predicted noise in this image, and then we simply subtract this noise from the image. If we do this in a sequence, in a for loop, we iteratively get less and less noisy images. Again, we need our pre-calculated noise levels here, following the sampling algorithm in the paper. If we are at the final time step, we return the image without adding any noise; in all other steps, we add back noise according to the variance schedule of our forward process. Down here I do exactly that: in a reverse sequence from T, which was 300 in this example, down to zero, I iterate over the images and pass them to this sample_timestep function, which returns a less noisy version, and every step_size images I plot the result, which we will see in a second during training.

The training part simply means we iterate over the data points in our DataLoader, call this get_loss function, and then optimize the model; doing this looks more or less like that. At first I was a bit disappointed by the results and thought my U-Net needed some rework. On Colab I trained the model a bit longer, for around 100 epochs, and started to see something like this. I mean, I wouldn't drive these cars, but it certainly goes in the right direction. Unfortunately I hit the usage limits on Colab, so I couldn't try more epochs. Because of that, I decided to fit the model a bit longer on my personal GPU, and after around 500 epochs I got these results. Now these are clearly cars; of course the resolution is still very small, but I was pretty happy that I made it this far. I think if you train even longer and do more refinements of the model architecture, you can clearly achieve high-quality images here.
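To tie the pieces together, here is a compressed sketch of the loss, the reverse sampling step, and the training loop described above. It reuses the schedule tensors and forward_diffusion_sample from the earlier sketch, assumes a model with the signature model(x, t), and uses the simple choice of sigma_t^2 = beta_t for the noise added at each reverse step:

```python
import torch
import torch.nn.functional as F

def get_loss(model, x_0, t, device="cpu"):
    # Noise the image to timestep t, let the model predict the noise,
    # and compare prediction and true noise (L1 here; L2 also works).
    x_noisy, noise = forward_diffusion_sample(x_0, t, device)
    noise_pred = model(x_noisy, t)
    return F.l1_loss(noise, noise_pred)

@torch.no_grad()
def sample_timestep(model, x, t):
    # One reverse step: subtract the scaled predicted noise from x_t and,
    # except for the final step, add back noise with variance beta_t.
    betas_t = get_index_from_list(betas, t, x.shape)
    sqrt_om_t = get_index_from_list(sqrt_one_minus_alphas_cumprod, t, x.shape)
    sqrt_recip_alphas_t = get_index_from_list(torch.sqrt(1.0 / alphas), t, x.shape)
    model_mean = sqrt_recip_alphas_t * (x - betas_t * model(x, t) / sqrt_om_t)
    if t.item() == 0:
        return model_mean
    noise = torch.randn_like(x)
    return model_mean + torch.sqrt(betas_t) * noise

# Training loop (sketch): sample a random timestep per image and
# minimize the noise-prediction loss.
# for epoch in range(epochs):
#     for batch, _ in dataloader:
#         optimizer.zero_grad()
#         t = torch.randint(0, T, (batch.shape[0],), device=device).long()
#         loss = get_loss(model, batch.to(device), t, device)
#         loss.backward()
#         optimizer.step()
```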
Of course, diffusion models are not limited to image data; there is already a lot of work in other domains such as molecular graphs, audio, and more. I also want to point out that there are already very interesting variants of diffusion models, such as diffusion GANs. Overall, I think these models are a very promising family of generative models, and I really look forward to where this is going. That's the end of this introduction to diffusion models; congratulations if you made it to the end. I hope you found it useful and would be happy to see you again in a future video. With that, have a nice day and happy coding.
Info
Channel: DeepFindr
Views: 113,863
Keywords: Diffusion Model, Pytorch, DDPM
Id: a4Yfz2FxXiY
Length: 30min 54sec (1854 seconds)
Published: Sun Jul 17 2022