Diffusion Models | Paper Explanation | Math Explained

Captions
Hey there, and welcome to this video. In the following I'll be doing a paper explanation of diffusion models. This type of model has recently become super popular for image generation and achieves results competitive with the usual state-of-the-art GANs.

Let's see what it is able to do. Especially in the generative-art field, diffusion models have enabled really amazing results for text-to-image; below each image you see the caption the model was prompted with. Just look at this "head of broccoli complaining about the weather", it is just so accurate. Or look at these from DALL·E 2, which is also a diffusion model. You can also generate variations of images: the top left is me doing a handstand, and the rest of the images were generated by DALL·E based on my image. Or you can do inpainting: I generated the following image, then asked to remove the tree, and this is the result. Just amazing. With other models like Disco Diffusion you can even create little animations from a text prompt. So you see that these models are capable of creating incredible results, and it's so much fun to play around with them.

So, without further ado, let's try to understand them. Since diffusion models are still pretty new in the image-synthesis regime and the number of relevant papers is not too big, I want to cover the most fundamental papers about DDPMs, which led to the rapid improvements of this technique. Specifically, I want to talk about the following four papers: first, the original paper from 2015, which introduced this technique, coming originally from statistical physics, to the field of machine learning; then, five years later, the second influential paper, which introduced a few groundbreaking changes that led to a huge jump in quality; and after that, two researchers from OpenAI took on this challenge and introduced even more improvements in two consecutive papers in 2021, which resulted in better performance and faster runtime.

The video is structured as follows. I will first talk about the general idea of diffusion models, how they work and why they work, on a really intuitive level without too much math involved. After that we'll dive into the evil-looking math formulas, and along with the pre-built intuition I will try to explain the math behind DDPMs. I will also mention the different improvements that came from each of the four papers. One side note: please forgive me if I don't go into every single formula and detail of each paper, since that would go beyond the scope of this video and explode its length. Also, feel free to use the timestamps in the description to jump to the parts you're most interested in.

So let's try to understand the idea. The 2015 paper described the diffusion model as follows: "The essential idea is to systematically and slowly destroy structure in a data distribution through an iterative forward diffusion process. We then learn a reverse diffusion process that restores structure in data, yielding a highly flexible and tractable generative model of the data." In short: we apply a lot of noise to an image and then have a neural network remove this noise. The idea is that once the denoising network is learned properly, we can start off with completely random noise and let the model remove noise until we have a new image which could occur in our training data.

To better understand this, let's take the previous statement apart. The authors talk about the forward and the reverse diffusion process. The forward diffusion process applies noise to an image in an iterative fashion: we start off with the original image and step by step add more and more noise. If we repeat this for a sufficient number of steps, the image becomes pure noise. The original paper from 2015 decided to sample the noise from a normal distribution, and this was not changed by any of the follow-up papers. The
authors showed that the image converges to noise which follows the chosen noise distribution; in our case, the image turns into one which could come from a normal distribution. So going from an image to noise is fairly simple, but what about learning the reverse and going from pure noise to an image? This is what the authors call the reverse diffusion process, and it involves a neural network learning to remove noise from an image step by step. That way we can give the model an image which consists purely of noise sampled from our normal distribution and let it gradually remove the noise until we have a clean-looking image. And if you're wondering why we do this stepwise instead of going from noise to a clear image instantly: the 2015 authors argued that a single-step approach would not be tractable and would lead to much worse outcomes.

So what does this neural network look like, what does it take in and predict? The DDPM paper from 2020 laid out three things the network could predict: first, the mean of the noise at each time step; second, the original image directly; and third, the noise in the image directly, which could then simply be subtracted from the noisy image to get a slightly less noisy image. We already said that predicting the original image directly won't work well, so we can ignore this choice, and as you will see later, the first and the third option are the same, just parametrized differently. All authors decided to go with the third option and predict the noise directly.

You may be wondering: why do we predict only the mean and not the variance? After all, in the first option a normal distribution needs both a mean and a variance. Well, the authors of the 2020 paper decided to fix the variance, so there is no need to predict it since it's always readily available; I'll show what they fixed it to in the math part. Later, the choice of fixing the variance was rethought by the OpenAI authors in their first paper, and they eventually decided to learn the variance too, because it leads to improvements in the log-likelihoods; more on that later.

One more thing to note: we don't apply the same amount of noise at each time step of the forward process. This is regulated by a schedule which scales the mean and the variance, ensuring that the variance doesn't explode as we add more and more noise. The paper from 2020 employed a linear schedule; again, more on that later, but to give you some intuition, if we apply this schedule to an image, the transformation looks like the following. Looking at that, the authors from OpenAI found this approach to be sub-optimal, especially at the end of the noising process: the last couple of time steps already seem like complete noise and might be redundant, and they also found that the information is destroyed too fast. As a result they created their own schedule, the cosine schedule. In the following image you can see the difference between the two: the cosine schedule destroys the information more slowly and solves both problems of the linear schedule, which is too rapid in destroying information and too uninformative at the end.

We have talked about what the model takes in and predicts; now let's take a look at its architecture. The authors of the 2020 paper went with a U-Net-like architecture. This kind of architecture has a bottleneck in the middle: it takes an image as input and, using downsample and ResNet blocks, projects it down to a small resolution, and after the bottleneck it projects it back up to the original size, this time using upsample blocks. At certain resolutions the authors also put in attention blocks, and they employed skip connections between layers of the same spatial resolution. The same model is used for every time step, and the way of telling the model which time step we are at is the sinusoidal position embedding from the Transformer paper. This embedding is projected into each residual
block. This is important because the forward diffusion process uses a schedule, as I mentioned earlier, which scales the mean and the variance, so a different amount of noise is applied at different time steps. With this information the model can take care of removing different amounts of noise at different time steps, which greatly benefits the outcome. I'll talk more about the schedules later, in the math part.

In their second paper, the OpenAI authors heavily improved the overall results by improving the architecture. They made the following updates. First of all, they increased the depth of the network and decreased the width. They included more attention blocks than the original proposal and also increased the number of attention heads. They also took the residual blocks from BigGAN and used them for the upsampling and downsampling blocks. Next, they proposed what they call adaptive group normalization; this is just a fancy name for incorporating the time step, and additionally the class label, in a slightly different way: they first apply a group norm after the first convolution in each residual block, multiply the result by a linear projection of the time-step embedding, and add a linear projection of the class label onto it. The last improvement is classifier guidance, which uses a separate classifier to help the diffusion model generate a certain class. However, this is an entire topic on its own, and I plan to talk more about it in another video, so I won't go into it much here.

By now you should have a basic understanding of how diffusion models work. To briefly summarize: we have two processes, one going from a normal image to complete noise which follows a normal distribution, and a reverse process going from complete noise to a real-looking image using a neural network. Both processes are done iteratively, and the architecture of the model follows a U-Net. Great, now it is time to jump into the math and deeply ground this rough intuition.

Okay, let's approach this really gently and start with some basic notation so we're all on the same level. We define our image as x with a number as a subscript; this number determines which time step we are at. So x_0 refers to the original image, and as time passes and we add more and more noise, this number increases; for example, x_42 is our image after applying 42 iterations of noise. The final image, which follows an isotropic Gaussian, is called x_T. T also varied throughout the papers: initially it was set to a thousand, but the follow-up papers were able to decrease this number to just a fraction of that.

Next we define two functions. q(x_t | x_{t-1}) corresponds to the forward process: it takes in an image and returns an image with a little more noise added, so x_{t-1} goes in and out comes x_t, which has more noise. We'll use this notation often, and in order not to get confused, remember that t-1 always corresponds to the less noised image and t to the image with more noise; in short, a smaller number means less noise and a bigger number means more noise. The second function is for the reverse process, and we call it p. p takes in an image x_t and produces a sample x_{t-1} using the neural network: this time x_t, which if you remember has more noise, goes in, and out comes x_{t-1} with less noise.

Great, and with this knowledge we can now dive into the forward process. q(x_t | x_{t-1}) is defined as

q(x_t | x_{t-1}) = N(x_t; sqrt(1 - β_t) · x_{t-1}, β_t · I)

Don't be afraid of this formula, it is fairly simple; just the notation is a bit frightening. Let me explain: N is the normal distribution, x_t is the output, sqrt(1 - β_t) · x_{t-1} is the mean, and β_t · I is the variance. β (beta) refers to the schedule I was talking about before. All betas range between 0 and 1 and ensure that the data is scaled so that
the variance doesn't explode. To better understand this, let's take a closer look at the linear schedule proposed by the 2020 paper. They defined the smallest beta to be 0.0001 and let it grow linearly up to 0.02. Plotting both β_t and sqrt(1 - β_t) against time gives the following two plots: you can see that we scale the image down more and more, which acts as a counterpart to the increasing variance over time and keeps the overall variance in bounds.

Cool, we now understand how to apply one forward step. To apply, say, a thousand forward steps we could simply repeat this formula a thousand times, but there is an easier way of doing it in just one step. How cool is that? For that we need to define a little more notation, so bear with me, it's not much. We define α_t = 1 - β_t, and we also define ᾱ_t (alpha bar) as the cumulative product of all alphas from the start until t, so we just multiply all the alphas together; you'll see in a second how we are going to use this. First of all, we can rewrite the previously established formula for one time step, using the reparameterization trick, as

x_t = sqrt(α_t) · x_{t-1} + sqrt(1 - α_t) · ε

where ε is sampled from a normal distribution with mean 0 and standard deviation of 1.
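To make the forward process concrete, here is a minimal numpy sketch of both the single-step formula and the direct jump via the cumulative alphas. The schedule values follow the linear schedule discussed above (β from 0.0001 to 0.02 over T = 1000 steps); the tiny 8×8 "image" and the random seed are made up purely for illustration.

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)   # linear schedule from the 2020 paper
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)      # cumulative product: alpha-bar_t

def forward_one_step(x_prev, t, rng):
    """One forward step q(x_t | x_{t-1}): scale the image down, add scheduled noise."""
    eps = rng.standard_normal(x_prev.shape)
    return np.sqrt(alphas[t]) * x_prev + np.sqrt(betas[t]) * eps

def forward_jump(x0, t, rng):
    """Jump from x_0 directly to x_t in a single step using alpha-bar_t."""
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps

rng = np.random.default_rng(0)
x0 = rng.uniform(-1, 1, size=(8, 8))  # a toy "image" scaled to [-1, 1]
x999 = forward_jump(x0, T - 1, rng)   # after the full schedule: essentially pure noise
```

Note how `forward_jump` replaces a thousand calls to `forward_one_step`: since ᾱ_T is almost zero, `x999` retains essentially nothing of `x0` and looks like a sample from a standard normal.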
Now we rewrite it using the alphas; pause the video to check that the two forms are actually equal. And now comes the trick: we can extend this to earlier time steps and go from t-2 directly to t just by chaining the alphas, and likewise from t-3 to t, and so on. You see where this is going: we can go all the way back to the beginning and jump in one instant step from x_0 to x_t:

x_t = sqrt(ᾱ_t) · x_0 + sqrt(1 - ᾱ_t) · ε

And now you also understand why we introduced the cumulative alphas: simply to shorten the formula. All of this can also be rewritten in the original distributional form, and that's already it for the forward diffusion process. It isn't that hard, right?

So now let's move to the reverse diffusion process. If you remember, we already established the notation for the reverse process and called it p(x_{t-1} | x_t). We can also express this as a normal distribution whose mean and variance are parameterized by two neural networks, which we can sample from to get x_{t-1}. We already said that the variance is fixed to a certain schedule, so we don't need to predict it and can take it as given. Now, I could just throw the rest of the formulas at your head and say you should take them for granted; then the video could be over in less than a minute. But I believe that's not the point, and it wouldn't be useful. I want to really show how all the pieces play together to arrive at the conclusion which leads to the final objective, where we just have to predict the noise. So that's the end goal: arriving at the formula which uses a neural network to predict the noise in an image between two time steps. Keep that in mind.

In order to have the easiest time, we'll start by looking at the loss function of this whole thing. This is simply the negative log-likelihood, but there's a problem: the probability of x_0 isn't nicely computable, since it depends on all the other time steps coming before x_0, starting at x_T. This would mean keeping track of T-1 other random variables, which is just not possible in practice. As a solution, we can compute the variational lower bound of this objective and arrive at a more nicely computable formula. Note that the overall setup of diffusion models is pretty similar to that of variational autoencoders, so make sure to look those up if some of this is unclear to you; I also put some references in the description.

If you're not familiar with the idea of a lower bound, let me give you some intuition. Say we have some function f(x) which we can't compute, in our case the plain negative log-likelihood, and we can prove that there is some function g(x) which is always smaller than f(x) and which we can compute. Then by maximizing g(x) we can be certain that f(x) will also increase. In our case this is ensured by subtracting the KL divergence, which is a measure of how similar two distributions are and is always non-negative: subtracting something non-negative from a function always results in something less than the original function. But I wrote a plus sign instead of a minus. This is simply because we want to minimize instead of maximize; that's why we add it and thus have something which is always bigger than the plain negative log-likelihood, and then we can minimize this objective.

However, in this form the lower bound is not computable either, since we still have the quantity we can't compute, and that's why we need some clever reformulations to arrive at a nicer-looking expression. At first, we rewrite the KL divergence as the log ratio of its two terms. Now let's apply Bayes' rule to the quantity at the bottom: the upper term can be summarized as a joint probability, and now we see that we can just write this as the probability of the whole trajectory x_0 through x_T. Now we can pull the bottom quantity to the top using the fraction rules and split up the logarithm, and now you already see the magic: these two terms just cancel each other, and thus we get rid of the annoying quantity
which we can't compute. Great, this is now a variational lower bound which we can minimize, and note that we know all of these things: the term at the top is just the forward process starting at some image from our data, and the lower part can be rewritten as the following, where you see that we can calculate p(x_T), and the product of the remaining terms just uses our parameterized model p.

In order to make this analytically computable, the authors applied a couple more steps of reformulation to arrive at a nice-looking lower bound, which we will also do now, and I promise this will be the last math-heavy part. So let's take our current lower bound and write it in the product form I showed before. By applying the log rules we can bring p(x_T) out of the denominator, so that we are left with the fraction of the forward process and the reverse process. Next we apply more log rules to bring out the product, which becomes a sum. Now the authors do something which might appear weird at first but will make sense in a minute: they take the first term out of the sum, which now starts at t = 2 instead of t = 1; so they split the first term off from the entire sum. They then rewrite the denominator inside the sum using Bayes' rule, which can be expressed like this; pause the video and see for yourself, it's just the standard Bayes formula.

Now there is a problem: all of these quantities have a really high variance, since we don't know what we started from. This is why the authors drastically reduce this variance by additionally conditioning on x_0. Just imagine: given this noisy picture, you might be pretty uncertain where it came from, so there's a high variance. But say I also give you the original noise-free picture; now the set of possible candidates shrinks and you can be much more certain. And it gets even better: this conditioned formula has a closed-form solution, which I will talk more about later. Fusing this back into the formula gives the following, and now you might already see the reason why we pulled out the first term: if we hadn't done this, the first term would look like the following after applying Bayes' rule and conditioning on x_0, and you see that we would get a loop and other things which don't really make sense.

Okay, now we've almost made it; just a few more modifications to be done, so bear with me. First we split the summation into two parts, because we can. Now let's take a closer look at the second summation. See something? Let's take an example case and say T = 4. Writing the summation out gives us the following, and you see that most of the terms cancel and only the last term from the top and the first term from the bottom survive. That's why, in general, we can simplify this summation to just this, and now the formula looks like the following. For the last step we take a closer look at the last two terms: by applying the log rules and converting the inner quotient into an outer subtraction, we see that these terms cancel each other. Now the authors bring this term to the front and fuse the logarithms, which leaves us with the final, analytically computable objective. We can now write the first terms in KL-divergence notation, and voilà, there's our objective.

And it gets even better: we can ignore the first term completely, since q has no learnable parameters, it's just the forward process which adds noise over and over, and p(x_T) is just random noise sampled from our Gaussian; also, by the aforementioned theory that q converges to a normal distribution, we can be certain that this KL divergence will be small.

I also want to give one more hint, since with these big math derivations you can often follow every step logically but barely understand why it is being done. The most crucial part of the previous derivation is where we apply Bayes' rule to convert q(x_t | x_{t-1}) into the conditioned form. Obviously one can verify that this is true, but it doesn't seem obvious, at least it didn't to me, what the reason for it is. And to give
some intuition for the reason, we can take a look at the final objective. Look at the second KL divergence: it is a really nice quantity which we can compute, as we will also see later, and arriving at it would not have been possible if we hadn't applied Bayes' rule. This reformulation and the extra conditioning on x_0 gave us the forward process in the same form as the reverse process, and we also saw the additional benefits: a lot of things cancelled or simplified.

So we are left with these two terms; let's first focus on this one. We already showed that we can express p(x_{t-1} | x_t) as a normal distribution, nothing new here: one neural network for the mean and one for the variance, where the variance is a fixed schedule, as discussed before. I also mentioned that q(x_{t-1} | x_t, x_0) has a closed-form solution, which we can write in the same form as p, with mean μ̃_t and variance β̃_t. Deriving μ̃ and β̃ would take quite a bit of time and would make the video even longer; that's why I decided not to do this derivation here, since it is also not required for understanding the algorithm. However, I will put a link in the description where you can find more about it. The final versions of μ̃ and β̃ look like this big formula, I know, but don't be afraid, we'll get rid of it again. β̃ is not relevant, since the variance is fixed, so let's focus on μ̃. We can simplify this formula. How? Remember the closed form of the forward process for generating x_t in one single step: we can rearrange it to express x_0 in terms of x_t and the noise. Plugging this into μ̃(x_t, x_0) gives us a formula which no longer depends on x_0, and through some further simplifications we end up with a tiny formula. You see that essentially we are just subtracting random scaled noise from x_t; that's it. And again, take a look in the video description to see the full derivation.

Now, this is what our neural network for the mean needs to predict, and the authors decided to use a simple mean squared error between the actual μ and the predicted μ. And then they saw the following: x_t is available as input to the model, so why predict this whole quantity and not instead just the noise? That is why we can rewrite μ_θ in the same form, except with ε_θ, the network which predicts the noise. Rewriting the mean squared error accordingly and simplifying, we arrive at a friendly quantity: just the mean squared error between the actual noise at time t and the noise predicted by our neural network given the noised image. And believe it or not, this quantity is about to become even friendlier: the authors found that ignoring the scaling term at the front results in better sampling quality and a generally easier implementation:

L_simple = || ε - ε_θ(x_t, t) ||²

I mean, look at this: all of this heavy math derivation, starting at the negative log-likelihood, going to the lower bound, reformulating and simplifying that lower bound, seeing that we only need to predict the mean, which leads to the insight that we only need to predict the noise, and then simplifying even more because it just works better. Isn't that beautiful?

So, getting back to our starting formula, we can now put in our new findings. The neural network for μ becomes this, and once again, a network for the variance is not needed and can be replaced by β_t. To actually sample x_{t-1}, we apply the reparameterization trick again and arrive at this equation: just the mean plus the standard deviation times noise sampled from a standard normal.

Now there's one more thing which we haven't talked about yet: the last term in our lower bound also needs to be taken care of. First, some clarification: images have pixel values between 0 and 255, and the authors scale the images to lie between -1 and 1 to have them in the same range as the prior, a standard normal distribution centered at 0 with a variance of 1.
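The simplified objective and the reverse sampling step above can be sketched in a few lines of numpy. Note that `eps_model` here is a stand-in stub for the trained noise-prediction network ε_θ (in the papers it is a U-Net); everything else follows the formulas just discussed, with the linear schedule assumed as before.

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

def eps_model(x_t, t):
    """Placeholder for the trained noise-prediction network eps_theta(x_t, t)."""
    return np.zeros_like(x_t)  # a real model would be a U-Net

def simple_loss(x0, t, rng):
    """L_simple: MSE between the true noise and the predicted noise at step t."""
    eps = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1 - alpha_bars[t]) * eps
    return np.mean((eps - eps_model(x_t, t)) ** 2)

def reverse_step(x_t, t, rng):
    """One reverse step: predicted mean, plus scheduled noise except at t == 0."""
    mean = (x_t - betas[t] / np.sqrt(1 - alpha_bars[t]) * eps_model(x_t, t)) \
           / np.sqrt(alphas[t])
    if t == 0:
        return mean                      # final step: no extra noise
    return mean + np.sqrt(betas[t]) * rng.standard_normal(x_t.shape)

rng = np.random.default_rng(0)
x0 = np.zeros((32, 32))                  # toy data for illustration
loss = simple_loss(x0, t=500, rng=rng)
```

Training just repeats: sample an image, sample t uniformly, sample noise, and take a gradient step on `simple_loss`; sampling starts from pure noise and applies `reverse_step` from t = T-1 down to 0.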
Now we can calculate the probability of x_0 given x_1 like this. Okay, that was maybe a bit much at once, but stay with me, I'll explain. D is the data dimensionality, in our case the number of pixels, so we take a product over all pixels in the image. Then we have this evil-looking integral, but it just means that we integrate over the small range around the actual value of the pixel in x_0. If this quantity actually predicts a mean value in the area of the true pixel, then the integral will be high, and if that happens for each pixel, then the product will result in a high number, and thus the probability of x_0 given x_1 is high. However, if we keep predicting pixel values far away from the true ones, then the mass in the true pixel area is low and the overall probability will also be low. Let's take an example for one pixel. Say the true pixel value is 10/255, and also say that x_1 has a value of 14/255 at that specific position. We now apply our network, and it predicts a new mean of 11/255.
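This one-pixel example can be computed directly: the integral over the pixel's bucket is just a difference of two Gaussian CDFs, which the standard library's error function gives us. The variance value here is an arbitrary placeholder (in the papers it comes from the schedule), and the bucket runs from one pixel value below the true value to one above, as in the example.

```python
import math

def normal_cdf(x, mu, sigma):
    """CDF of a normal distribution, expressed via the error function."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def pixel_likelihood(true_px, pred_mu_px, sigma=2.0 / 255):
    """Probability mass the predicted Gaussian puts on the bucket around the
    true pixel value (one pixel value below to one above), in [0, 1] units."""
    lo, hi = (true_px - 1) / 255, (true_px + 1) / 255
    mu = pred_mu_px / 255
    return normal_cdf(hi, mu, sigma) - normal_cdf(lo, mu, sigma)

good = pixel_likelihood(10, 11)  # predicted mean close to the true pixel
bad = pixel_likelihood(10, 20)   # predicted mean far away
```

With the close prediction (mean 11/255) a substantial fraction of the mass lands in the true pixel's bucket, while the far prediction (mean 20/255) leaves almost none there, matching the intuition in the text.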
Now we can plot this normal distribution, centered at 11/255 with some variance given by our schedule, and integrate over the true region where the pixel of x_0 lies, starting one pixel value below and going one pixel value above. Since the network did a good job at prediction, the mass in that area is quite high. If, however, the network had predicted a mean of 20/255, then the mass in that area would have been super low, leading to a low probability. I hope this gives some intuition for understanding the formula. If not, don't worry: the authors also just decided to get rid of it and approximate this term with our neural network as well, the only difference being that at this final sampling step we don't add noise. Yeah, that's it; sorry if you just spent your valuable time trying to understand it.

And now our entire objective fits into one single line and can be written like this:

L_simple = E_{t, x_0, ε} [ || ε - ε_θ( sqrt(ᾱ_t) · x_0 + sqrt(1 - ᾱ_t) · ε, t ) ||² ]

That is what we optimize. t is sampled from a uniform distribution between 1 and T, and the long formula inside the neural network is just x_t, so we could also write it like that, which looks even more peaceful if you imagine where we started.

So now let's take a look at the summarized algorithms for training and sampling. This one is the algorithm for training: we first sample some image from our dataset, then we sample t, we also sample noise from a normal distribution, and then we optimize our objective via gradient descent. Super easy, right? Sampling is also not that hard: we first sample x_T from a normal distribution, that's our starting point, and then we iteratively use the formula I already showed before, the one using the reparameterization trick, to sample x_{t-1}. And here you can see how the case t = 1 is handled: if t is equal to 1, we don't add noise. Why? Let's imagine we did. If t is equal to 1, we are predicting x_0 given x_1: our neural network predicts the noise in x_1, and we subtract it to get x_0. It wouldn't make sense to add more noise at this point, since we are at our final image and the additional noise would only make the quality of x_0 worse. So yeah, that's the simple reason. Eventually we return x_0, and if we trained well, we now have an image which could come from our dataset.

So, you made it: we are done with the fundamental part of diffusion models, great job! All of what I just explained was taken from the 2015 and the 2020 papers. Now, we already said that there were more improvements made by the follow-up papers from OpenAI. I'm quickly going to show the updates they made, but I won't go into the details for time reasons. The first improvement is reconsidering the variance. Before, we just fixed it to either β_t or β̃_t, which are the upper and lower bounds on the variance; now we are also going to learn it. Specifically, we learn an interpolation between β_t and β̃_t using the following formula. The authors found this to work better than learning the variance directly, because these numbers are in a really small range, whereas now we just need to learn an interpolation coefficient between 0 and 1, which works quite well. This also results in a slight change of the loss, since the simplified loss doesn't depend on the variance: we add the variational lower bound to the loss and scale it with λ, which was set to 0.001. Next up, it was also proposed to use a better noise schedule. The new proposal doesn't look as simple as the linear one and is defined by a formula for ᾱ_t, from which we can derive β_t. Here you can also see what it looks like when we plot both schedules: the cosine schedule was designed to have little change at the extreme points, at the beginning and at the end, and to be roughly linear in the middle region. I've shown this figure before, but once again you see that the cosine schedule behaves better and takes more time to destroy the information in the image. Make sure to check out the paper to get more explanation of the improvements and the other things that were tried.

And now, for the last part, let's quickly take a look at the results of the papers. Specifically, I want to look at the FID scores of the different diffusion-model papers and also compare them to other, non-diffusion models. Here I will only look at the scores on ImageNet 256x256, but feel free to check out all the papers to see different datasets and other metrics such as bits per dimension. Unfortunately the 2020 paper didn't evaluate on ImageNet, but it would likely perform worse than the improved diffusion model, which outperformed it in all other categories. The improved DDPM achieved an FID of 12.3 on ImageNet; that's quite a good result, but the improvements from OpenAI completely outperformed it and achieved an FID score of 4.59 (by the way, ADM stands for ablated diffusion model). Later in the paper they also propose a technique for upsampling in the diffusion process, which I didn't talk about in this video; with that they improve even further and achieve a score of 3.94. These results are incredible, but let's take a look at how diffusion models compare to other models. You see that the improved DDPM ranks last among all these state-of-the-art models; however, the approaches from OpenAI place 4th and 6th, and even better, the results from the paper BIGRoC, which also builds on a diffusion model and which I didn't cover in this video, rank even higher, with an even lower FID score. But still, the gold medal is held by a GAN paper, with some margin, although I believe that diffusion models will easily outperform GANs in the near future: we've been working for so long on getting GANs to work, while diffusion models have only recently risen and not much research effort has gone into them yet, so I'm definitely excited to see where all of this is going. Also, let me know if you would like a video about any of the other papers listed here; last time I already did a video about VQGAN, check it out if you want, and also feel free to suggest other papers which aren't listed here.

And now, finally, we are done. Great job if you followed all the way until here, but don't worry if you didn't understand every single part. Let's quickly do a recap so you take away the most important things from the video. Diffusion models are generative models which have become super popular in recent times for beating GANs at image synthesis. The main idea behind diffusion models is that we have two processes. The first one is easy: we just gradually apply noise to an image over a lot of steps until the image has turned into complete noise. The cool thing is that we can also do this in one single step, which saves us a lot of compute and time. Now, the goal of the reverse process is to learn how to remove noise from an image, but as we saw, we don't instantly try to remove all of the noise; we do it step by step, which is just easier for the model to learn. For that we have a neural network which takes in an image and the current time step t and predicts the image at t-1. Well, to be more specific, we saw that we can reformulate this so that we only predict the noise in the image; then we can subtract the predicted noise from the image at time t and get the image at t-1. We do this over and over again until we arrive at t = 0, and if the neural network was trained well, we now have an image which could occur in our training data. Then we also looked at a lot of improvements which were introduced over time: we saw a lot of interesting architecture improvements, but also looked at updates to the training process, for example by also learning the variance. All of this is grounded in a lot of math, but if you got the main points, you are already well on your way.

And that should be it for this video. I really hope you enjoyed it and could learn something new. It took a lot of work, so let me know about things I can improve in the future or other topics which I should cover. And with that being said, I wish you a nice day!
Info
Channel: Outlier
Views: 173,129
Keywords: dalle2, dalle, diffusion, diffusion model, gan, generative model, imagen
Id: HoKDTa5jHvg
Length: 33min 26sec (2006 seconds)
Published: Mon Jun 06 2022