How I Understand Diffusion Models

Captions
Diffusion models excel in many applications, such as generating beautiful images from text, animating a person, and creating videos and 3D models. We are going to learn and understand how diffusion models work. In this video, we'll cover four main topics: training, guidance, resolution, and speed.
Let's start with a simple Gaussian distribution that is easy to sample from. We want to design a generative model that translates these noise samples into high-quality images. More specifically, we want to find the parameters theta of a decoder that can transform a simple Gaussian distribution into a complicated natural image distribution. However, we do not know what the true data distribution is and can only approximate it by collecting lots of data samples. Our training goal is to maximize the probability of these samples, which is known as maximum likelihood.
Let's see what maximum likelihood is trying to do. We first apply a log function to simplify the expression, turning the product into a sum. This helps us rewrite it as an expectation. Here is the definition. If we subtract a constant that has nothing to do with the parameters theta, we find that this is just the negative of the Kullback–Leibler (KL) divergence between the data distribution and the distribution from our generative model. So maximizing the likelihood means minimizing the KL divergence, that is, the dissimilarity, between these two distributions.
Sounds great, but it's hard to compute this log-likelihood value. It involves either integrating out all latent variables z or assuming that we know the ground truth latent encoder. Let's see what we can do. We first introduce an encoder q(z | x) that captures the probability of a latent variable given an observation. This term is one, since we integrate out all latent variables z. Let's move the log-likelihood inside the integral and express this as an expectation. Here we can apply Bayes' rule and multiply by a dummy term. Next, we swap these posterior probability terms and separate them.
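Written out, the steps narrated above amount to the short derivation below. The notation is assumed, since the captions do not include the slides: q_phi(z | x) is the introduced encoder and p_theta(x, z) is the joint distribution of the generative model.

```latex
\begin{aligned}
\log p_\theta(x)
  &= \log p_\theta(x) \int q_\phi(z \mid x)\, dz
     && \text{(the integral equals one)} \\
  &= \mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x)\big] \\
  &= \mathbb{E}_{q_\phi(z \mid x)}\left[\log \frac{p_\theta(x, z)}{p_\theta(z \mid x)}\right]
     && \text{(Bayes' rule)} \\
  &= \mathbb{E}_{q_\phi(z \mid x)}\left[\log \frac{p_\theta(x, z)\, q_\phi(z \mid x)}{p_\theta(z \mid x)\, q_\phi(z \mid x)}\right]
     && \text{(multiply by a dummy term)} \\
  &= \underbrace{\mathbb{E}_{q_\phi(z \mid x)}\left[\log \frac{p_\theta(x, z)}{q_\phi(z \mid x)}\right]}_{\text{first term}}
   + \underbrace{D_{\mathrm{KL}}\big(q_\phi(z \mid x)\,\|\,p_\theta(z \mid x)\big)}_{\text{second term}}
\end{aligned}
```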
We now recognize that the second term is a KL divergence between our encoder q(z | x) and a ground truth encoder p(z | x). We don't know this value because we don't have access to the ground truth encoder p(z | x). But we do know that the KL divergence is non-negative. This means that the first term is a lower bound of the log-likelihood value. Since the log-likelihood measures the statistical evidence for our model, this term is known as the evidence lower bound, or ELBO. One type of latent variable model is the Variational Autoencoder (VAE), which parameterizes both the encoder and the decoder as neural networks. We can train both the encoder and the decoder by maximizing the ELBO.
Diffusion models are also latent variable models. But instead of encoding the observation x in one step, they encode the image in multiple steps by progressively adding more and more noise. Similarly, the decoding process progressively removes the noise to generate a sample. In a VAE, we have an observed variable x and a latent variable z. Similarly, in diffusion models, we call the clean image x_0 and the latents x_1 to x_T. We can train the diffusion model in the same way as we train a VAE, by maximizing the evidence lower bound.
Okay, let's first take a look at what the encoding process looks like. We can write the encoding process as a product of transition probabilities. We define the transition probability at each time step as a Gaussian distribution whose mean is the image from the previous time step, x_{t-1}, scaled by a scalar that's less than one, with some variance. This encoding process ensures that the latent variables become noise after many time steps.
So with some derivation, our objective has three terms: 1) prior matching, 2) reconstruction, and 3) denoising matching. The first term says that the latent distribution should be similar to the Gaussian distribution at the end of the diffusion steps. This is automatically satisfied by our forward diffusion process.
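The claim that the forward process ends up at a standard Gaussian can be checked with a small simulation. This is a minimal sketch on a single scalar "pixel", not the video's code: the linear schedule, its endpoints, the number of steps, and the starting value 3.0 are all assumptions chosen for illustration.

```python
import math
import random
import statistics

def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Per-step noise variances beta_1..beta_T (assumed linear schedule)."""
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]

def forward_step(x_prev, beta, rng):
    """One transition q(x_t | x_{t-1}) = N(sqrt(1 - beta) * x_{t-1}, beta)."""
    return math.sqrt(1.0 - beta) * x_prev + math.sqrt(beta) * rng.gauss(0.0, 1.0)

def diffuse(x0, betas, rng):
    """Run the whole encoding chain x_0 -> x_1 -> ... -> x_T."""
    x = x0
    for beta in betas:
        x = forward_step(x, beta, rng)
    return x

rng = random.Random(0)
betas = linear_beta_schedule(1000)
# Start every chain from the same clean value and look at where it ends up.
finals = [diffuse(3.0, betas, rng) for _ in range(1000)]
# The empirical mean and standard deviation should be close to 0 and 1.
print(statistics.mean(finals), statistics.pstdev(finals))
```

Because the mean is shrunk by a factor below one at every step while fresh noise is added, the starting value is forgotten and the prior matching term needs no training.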
The second term is similar to the reconstruction term in the Variational Autoencoder and is simple to compute. I want to focus on how we can maximize this denoising matching term, or equivalently minimize this one. Here we see three probability distributions.
First, what's the probability of a noisy image at time step t given a clean image x_0? All we know is the transition probability of x_t given x_{t-1}. To answer this, we need the reparameterization trick. This trick rewrites a random variable x as a deterministic function of a noise variable epsilon. Intuitively, we can represent a sample from a Gaussian distribution by scaling epsilon by the standard deviation and shifting it by the mean. With this trick we can express x_1, x_2, and so on. Plugging x_1 into the second equation leads to this expression. Now we can simplify this because the sum of two independent Gaussian variables is also Gaussian. Doing this recursively, we can write the noisy image at time step t as a function of the clean image x_0 and a noise variable epsilon. This means that we can directly sample x_t from this Gaussian distribution.
The second distribution answers the following question: suppose we know the clean image x_0 and the noisy version of it after t forward diffusion steps, x_t. What's the probability of a "less noisy" image x_{t-1}? This tells us how to denoise a noisy image when we know the ground truth clean image x_0. We use this to guide our denoising network, which models the probability of a less noisy image x_{t-1} given a noisy image x_t. Here is an actual photo of what's happening when training a diffusion model. To derive this term, we apply the definition of conditional probability and Bayes' rule. Here we know exactly what these three probabilities are. After some calculations, we find that it is also a Gaussian distribution. The mean lies on the line between the noisy image x_t and the clean image x_0. We can also compute the variance in closed form. Here, the probability from our denoising network is also a Gaussian.
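The closed-form quantities mentioned above can be sketched for a scalar pixel as follows. The formulas are the standard DDPM expressions, written with the assumed symbols beta_t (per-step variance), alpha_t = 1 - beta_t, and abar_t (the running product of the alphas); the three-step schedule and the input values are hypothetical.

```python
import math

def alpha_bars(betas):
    """Cumulative products abar_t = prod over s <= t of (1 - beta_s)."""
    out, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        out.append(running)
    return out

def marginal_params(x0, abar_t):
    """Mean and variance of q(x_t | x_0), obtained by unrolling the chain."""
    return math.sqrt(abar_t) * x0, 1.0 - abar_t

def posterior_params(x_t, x0, beta_t, abar_t, abar_prev):
    """Mean and variance of the ground truth denoising step q(x_{t-1} | x_t, x_0)."""
    alpha_t = 1.0 - beta_t
    coef_xt = math.sqrt(alpha_t) * (1.0 - abar_prev) / (1.0 - abar_t)
    coef_x0 = math.sqrt(abar_prev) * beta_t / (1.0 - abar_t)
    variance = beta_t * (1.0 - abar_prev) / (1.0 - abar_t)
    return coef_xt * x_t + coef_x0 * x0, variance

betas = [0.1, 0.2, 0.3]   # hypothetical toy schedule
abars = alpha_bars(betas)
t = 2                     # 0-based index of the last step
mean, var = posterior_params(x_t=0.5, x0=2.0, beta_t=betas[t],
                             abar_t=abars[t], abar_prev=abars[t - 1])
print(marginal_params(2.0, abars[t]), mean, var)
```

Note that the posterior variance depends only on the schedule, not on the images, which is why the denoising network can share it and only needs to match the mean.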
Since both are Gaussian distributions with the same variance, minimizing the KL divergence term is equivalent to minimizing the distance between the means of the two distributions. The process looks like this: we sample a clean image x_0 from the dataset and a noise sample epsilon from a Gaussian distribution with zero mean and unit variance. We encode the clean image with forward diffusion to get a noisy image x_t. We then compute the L2 loss between the predicted mean and the ground truth mean.
By looking at this training objective, we can have three interpretations. First, in the ground truth mean, we see a linear combination of the noisy image x_t and the clean image x_0. But why should we ask the denoising network to predict the noisy image that we already know from the input? Therefore, we express the estimated mean in the same form as the ground truth one and only ask the model to predict the clean image x_0.
Second, from the forward diffusion process, we know the relationship between the noisy image, the clean image, and the added noise. We can express the clean image x_0 as a function of the noisy image x_t and the noise epsilon. When we plug this in, we arrive at this new form of the ground truth mean. Similarly, we can match the form of the estimated mean with the ground truth one and only ask the denoising network to predict the noise.
To discuss the third interpretation, we need Tweedie's formula. The formula states that if we observe a sample z from a Gaussian distribution, the posterior expectation of the mean is the sample plus a correction term involving the gradient of the log-likelihood, also known as the score. Let's apply the formula to our forward diffusion probability. Replace the mean here and we now have this expression. When we replace the clean image x_0, we arrive at this equation involving the score. Similarly, we can parameterize our mean estimate using the same form. How can we compute this score?
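One way to get a feel for the answer is a numerical sanity check on a scalar pixel. This sketch is an illustration, not the video's derivation: it differentiates the log density of the forward diffusion probability q(x_t | x_0) for one known x_0 (an assumption, since the marginal over all images is not available in closed form) and compares the result with the added noise. The values of x_0, epsilon, and abar_t are hypothetical.

```python
import math

def log_q_xt_given_x0(x_t, x0, abar_t):
    """Log density of q(x_t | x_0) = N(sqrt(abar_t) * x0, 1 - abar_t)."""
    var = 1.0 - abar_t
    diff = x_t - math.sqrt(abar_t) * x0
    return -0.5 * math.log(2.0 * math.pi * var) - 0.5 * diff * diff / var

def x0_from_eps(x_t, eps, abar_t):
    """Second interpretation: recover the clean image from the noise."""
    return (x_t - math.sqrt(1.0 - abar_t) * eps) / math.sqrt(abar_t)

def score_from_eps(eps, abar_t):
    """Third interpretation: the score expressed through the noise."""
    return -eps / math.sqrt(1.0 - abar_t)

x0, eps, abar_t = 2.0, 0.7, 0.6   # hypothetical values
x_t = math.sqrt(abar_t) * x0 + math.sqrt(1.0 - abar_t) * eps

# Central finite difference of the log density with respect to x_t.
h = 1e-5
numeric_score = (log_q_xt_given_x0(x_t + h, x0, abar_t)
                 - log_q_xt_given_x0(x_t - h, x0, abar_t)) / (2.0 * h)
print(numeric_score, score_from_eps(eps, abar_t), x0_from_eps(x_t, eps, abar_t))
```

The two printed scores agree up to floating-point error, and the last value recovers x_0, so predicting the clean image, the noise, or the score carries the same information.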
From a simple derivation, it turns out that it's just the noise multiplied by a negative scalar. Intuitively, since the noisy image x_t comes from adding noise to the clean image x_0, moving in the opposite direction of the noise to denoise an image naturally increases the log-probability. Here's the formulation of score-based models.
Let's visualize the process in a simple 2D plot. We take a clean image x_0 from our training dataset. We select a time step t and scale the clean image. We then sample random noise and scale it. Adding these two up gives the noisy image x_t. We can train the diffusion model to remove all the noise and directly predict a clean image. However, this is challenging, so the predicted clean image may still be of low quality. So instead we only take a small step along this line. Alternatively, we can ask the network to predict what noise has been added to this image and take a small step in the opposite direction. Or we can ask the network to predict the score. In all three cases, we arrive at exactly the same distribution for the less noisy image x_{t-1} given a noisy image x_t. We can sample from this distribution and repeat the process. This allows us to create a path from noise to a clean image.
After training our diffusion model, we can use the denoising network to generate many samples. This is called unconditional generation. But perhaps we want to ask the model to generate specific content, like cat images. Or perhaps we want to see more cats wearing sunglasses using a text prompt. Let's revisit the score estimation interpretation of diffusion models. At time step t, we use our denoising network to predict the log-likelihood gradient, which is the direction that increases the log-probability. For conditional generation, we want to predict the conditional score. By applying Bayes' rule, we see that the conditional score consists of an unconditional score and an adversarial gradient of a classifier. We can scale the adversarial gradient by a positive factor gamma.
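In symbols, the decomposition just described can be written as follows. The notation is assumed, since the captions do not include the slides: y is the condition (a class label or text prompt) and p(y | x_t) is the classifier applied to the noisy image.

```latex
\begin{aligned}
\nabla_{x_t} \log p(x_t \mid y)
  &= \nabla_{x_t} \log \frac{p(x_t)\, p(y \mid x_t)}{p(y)} \\
  &= \underbrace{\nabla_{x_t} \log p(x_t)}_{\text{unconditional score}}
   + \underbrace{\nabla_{x_t} \log p(y \mid x_t)}_{\text{classifier gradient}}
   && \text{(since } \nabla_{x_t} \log p(y) = 0\text{)} \\[4pt]
\text{guided score}
  &= \nabla_{x_t} \log p(x_t) + \gamma\, \nabla_{x_t} \log p(y \mid x_t),
   \qquad \gamma > 0
\end{aligned}
```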
This is great because we can reuse the unconditional model and use an additional classifier to guide the generation. We call this "classifier guidance". But we'll have to train another classifier, because an off-the-shelf classifier is usually trained with clean images. Luckily, we can use the predicted noise to estimate a clean image. This estimated clean image is a bit blurry, but off-the-shelf classifiers are usually fairly robust to this.
This is nice, but do we really need an additional classifier? By applying Bayes' rule to the second term, we see that it consists of an unconditional score and a conditional score. Plugging it back into our equation, we get "classifier-free guidance". But training two denoising networks is expensive, so we train a single conditional denoising network and use a null condition to represent the unconditional model. Here is a comparison between unguided samples and samples generated with classifier-free guidance.
How do we generate high-resolution images? There are roughly three types of methods. The first one is using a cascade. We first use a diffusion model to generate a low-resolution image, say 64 by 64. We then train a separate diffusion model that upscales the low-resolution image to a higher resolution. The second type of approach is Latent Diffusion Models (LDM). We first train a Variational Autoencoder that encodes a high-resolution image into a low-resolution latent code. We train both the encoder and the decoder using a reconstruction loss, sometimes with an adversarial loss to ensure sharp results, and a regularization loss on the latent. Now we can train our diffusion model efficiently in the latent space. Once we get a clean latent, we use the decoder to map it back to a high-resolution image. In both cascade and latent diffusion models, we need to train several models separately. End-to-end methods aim to generate high-resolution images with a single diffusion model.
Several promising ideas have been proposed, such as adjusting the noise schedules for high-resolution image generation, multiscale losses, and progressive training.
Diffusion models are very slow because we need to evaluate the denoising network several hundred or even 1000 times to get a good sample. Here are some methods that accelerate the sampling to make diffusion models more practical. Let's first review the training objective of DDPM. We train our denoising network to predict the noise. If we look at this training objective and don't care about what the weighting term is, we just need to ensure that: 1) the forward diffusion model remains the same, 2) the mean of the ground truth denoising step is a linear combination of the noisy image x_t and the added noise epsilon, and 3) the estimated mean is of the same form. These three constraints do NOT assume the transition probability to be a Markovian process. In the DDIM paper, the authors construct a class of non-Markovian diffusion processes and find a and b that satisfy these constraints. This gives us these two Gaussian distributions. Interestingly, sigma_t can be set to arbitrary values. By setting it to zero, we get a deterministic generative process. The only randomness comes from the initial noise sample. Here is the quantitative evaluation. With 1000 denoising steps, DDPM did perform better. But when we reduce the number of denoising steps, the quality of DDPM quickly degrades, while the results from DDIM remain decent. The best thing is that we don't need to retrain the model. We just need to take the model trained with the DDPM objective and accelerate it with the DDIM sampler.
But even with the DDIM sampler, it still requires quite a few steps. Let's further reduce the number of steps using distillation. We can use a pre-trained model as a teacher and teach a student denoising network to use one sampling step to reproduce the output of the teacher network using two sampling steps.
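The teacher-student setup just described can be sketched on a scalar pixel using the deterministic DDIM update (sigma_t = 0). This is a simplified illustration, not the loss used in any particular paper: `toy_teacher_eps` is a hypothetical stand-in for a trained noise-prediction network (it returns the exact noise when every training image equals `data_point`), and the noise levels abar_t, abar_mid, abar_end are made-up values.

```python
import math

def ddim_step(x_t, eps_pred, abar_t, abar_prev):
    """Deterministic DDIM update (sigma_t = 0) from noise level abar_t to abar_prev."""
    x0_pred = (x_t - math.sqrt(1.0 - abar_t) * eps_pred) / math.sqrt(abar_t)
    return math.sqrt(abar_prev) * x0_pred + math.sqrt(1.0 - abar_prev) * eps_pred

def toy_teacher_eps(x_t, abar_t, data_point=1.5):
    """Hypothetical teacher: exact noise when all data equals data_point."""
    return (x_t - math.sqrt(abar_t) * data_point) / math.sqrt(1.0 - abar_t)

def teacher_two_steps(x_t, abar_t, abar_mid, abar_end):
    """Teacher output after two consecutive DDIM steps."""
    x_mid = ddim_step(x_t, toy_teacher_eps(x_t, abar_t), abar_t, abar_mid)
    return ddim_step(x_mid, toy_teacher_eps(x_mid, abar_mid), abar_mid, abar_end)

def distillation_loss(student_eps, x_t, abar_t, abar_mid, abar_end):
    """Squared error between one student step and two teacher steps."""
    target = teacher_two_steps(x_t, abar_t, abar_mid, abar_end)
    student_out = ddim_step(x_t, student_eps(x_t, abar_t), abar_t, abar_end)
    return (student_out - target) ** 2

abar_t, abar_mid, abar_end = 0.2, 0.5, 0.8   # noise decreases as abar grows
x_t = 0.3
untrained_student = lambda x, abar: 0.0       # hypothetical bad student
print(distillation_loss(untrained_student, x_t, abar_t, abar_mid, abar_end))
print(distillation_loss(toy_teacher_eps, x_t, abar_t, abar_mid, abar_end))
```

In this degenerate toy case one large step with the exact noise already matches two small steps, so the second loss is essentially zero. With a real network the two differ, and the student is trained to close that gap.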
So after this distillation process, we halve the number of sampling steps. We can ask the student model to be a new teacher model and repeat the process until we reach the target number of sampling steps. Another idea is to distill classifier-free guided diffusion, which requires evaluating both conditional and unconditional models, into one single model. We can further apply the previous ideas to make it faster, such as progressive distillation and latent diffusion models.
We can also distill a pre-trained denoising network using consistency models. The main idea is to train a model so that for any point on the path, the model predicts the same origin. This supports single-step generation as well as multi-step generation. Applying consistency distillation in the latent space gives us high-resolution image generation using only a few steps. We can also extend the idea of latent consistency models with LoRA.
But what's LoRA? Given a pre-trained model, sometimes we want to generate content in a particular style, such as pixel art, LEGO, IKEA instructions, and anime. We can achieve these styles by fine-tuning the base model with additional data. However, this is computationally expensive and requires a lot of storage. In many cases, it turns out we only need to fine-tune the cross-attention layers in the denoising network. We freeze the pre-trained weight W_0 and optimize the residual parameters, but there are still a lot of parameters to update. We can reduce the number of parameters using a low-rank approximation of the weight matrix. Therefore, we can accelerate the consistency distillation with LoRA. Here the acceleration vector is the parameter difference between the distilled model and the base model. More interestingly, we can accelerate other fine-tuned models by linearly combining the acceleration and style vectors. Here are some examples.
Another recent distillation method trains the student denoising model using score distillation. However, these predictions are usually blurry.
Their main idea is to apply an adversarial loss by training a discriminator. Here are some results. After this distillation, these models can achieve text-to-image generation at interactive rates.
To sum up, we covered the training objective for diffusion models and its three interpretations; how we can guide the generation with classifier and classifier-free guidance; how we can synthesize high-resolution images using cascade, latent, and end-to-end diffusion models; and how we can speed up sampling using the DDIM sampler and various distillation techniques. Please comment below if you have any questions. Thanks for learning with me. I will see you next time.
Info
Channel: Jia-Bin Huang
Views: 25,131
Keywords: Diffusion models, AI, Computer vision, Generative models, Score-based generative models, AI content creation, Classifier guidance, Variational autoencoder, Evidence lower bound, ELBO, Denoising, Stable Diffusion, Imagen, DALL-E, Classifier-free guidance, Latent Diffusion Models, Cascade Diffusion Models, SDXL-Turbo, Adversarial Score Distillation, Progressive Distillation, Consistency Models, Latent Consistency Models, LoRA
Id: i2qSxMVeVLI
Length: 17min 38sec (1058 seconds)
Published: Mon Jan 08 2024