Diffusion models excel in many applications, such as generating beautiful images from text, animating a person, and creating videos and 3D models. We are going to learn how diffusion models work. In this video, we'll cover four main topics: Training, Guidance, Resolution, and Speed.

Let's start with a simple Gaussian distribution that is easy to sample from. We want to design a generative model that translates these noise samples into high-quality images. More specifically, we want to find the parameters of a decoder that can transform the simple Gaussian distribution into the complicated natural image distribution. However, we do not know what the true data distribution is and can only approximate it by collecting lots of data samples. Our training goal is to maximize the probability of these samples, known as maximum likelihood.

Let's see what maximum likelihood is trying to do. We first apply a log function to simplify the expression, turning the product into a sum. This helps us rewrite it as an expectation. Here is the definition. If we subtract a constant that has nothing to do with the parameter theta, we find that this is just the Kullback-Leibler divergence between the data distribution and the distribution from our generative model. So maximizing the likelihood means minimizing the KL divergence between these two distributions.
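Written out (using p_data for the data distribution and p_theta for our model, names I'm introducing here for clarity), the equivalence is:

```latex
\arg\max_{\theta}\; \mathbb{E}_{x \sim p_{\text{data}}}\!\left[\log p_{\theta}(x)\right]
  \;=\; \arg\min_{\theta}\; D_{\mathrm{KL}}\!\left(p_{\text{data}}(x)\,\|\,p_{\theta}(x)\right)
```

The two objectives differ only by the entropy of the data distribution, which does not depend on theta.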
Sounds great, but it's hard to compute this log-likelihood value directly: either it involves integrating out all latent variables z, or it assumes we know the ground-truth latent encoder. Let's see what we can do.

We first introduce an encoder that captures the latent variable probability given an observation. This term equals one, since we integrate out all latent variables z. Let's move the log-likelihood inside the integral and express it as an expectation. Here we can apply Bayes' rule and multiply by a dummy term. Next, we swap these posterior probability terms and separate them. We now recognize that the second term is a KL divergence between our encoder q(z|x) and the ground-truth encoder p(z|x). We don't know this value, because we don't have access to the ground-truth encoder p(z|x). But we do know that a KL divergence is non-negative. This means that the first term is a lower bound on the log-likelihood. Since the log-likelihood measures the statistical evidence for our model, this term is known as the evidence lower bound, or ELBO.
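Putting the derivation together (writing q(z|x) for the introduced encoder and p_theta for the model), the decomposition is:

```latex
\log p_{\theta}(x)
  = \underbrace{\mathbb{E}_{q(z\mid x)}\!\left[\log \frac{p_{\theta}(x, z)}{q(z\mid x)}\right]}_{\text{ELBO}}
  + \underbrace{D_{\mathrm{KL}}\!\left(q(z\mid x)\,\|\,p_{\theta}(z\mid x)\right)}_{\ge\, 0}
  \;\;\ge\;\; \text{ELBO}
```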
A prototypical example of latent variable models is the Variational Autoencoder (VAE), which parameterizes both the encoder and the decoder as neural networks. We can train both the encoder and the decoder by maximizing the ELBO.

Diffusion models are also latent variable models. But instead of encoding the observation x in one step, they encode the image in multiple steps by progressively adding more and more noise. Similarly, the decoding process progressively removes the noise to generate a sample. In a VAE, we have the observed variable x and the latent variable z. Similarly, in diffusion models, we call the clean image x_0 and the latents x_1 to x_T. We can train the diffusion model in the same way as we train a VAE: by maximizing the evidence lower bound.
Okay, let's first take a look at what the encoding process looks like. We can write the encoding process as a product of transition probabilities. We define the transition probability at each time step as a Gaussian distribution whose mean is the image from the previous time step, x_{t-1}, scaled by a scalar that's less than one, with some variance. This encoding process ensures that the latent variables become pure noise after many time steps.
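In the usual DDPM notation, with a noise schedule beta_t (notation I'm adopting here for concreteness), this transition is:

```latex
q(x_t \mid x_{t-1}) = \mathcal{N}\!\left(x_t;\; \sqrt{1-\beta_t}\, x_{t-1},\; \beta_t \mathbf{I}\right),
\qquad 0 < \beta_t < 1
```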
So with some derivation, our objective has three terms: 1) prior matching, 2) reconstruction, and 3) denoising matching.

The first term says that the latent distribution should be close to a Gaussian distribution at the end of the diffusion steps. This is automatically satisfied by our forward diffusion process. The second term is similar to the reconstruction term in the Variational Autoencoder and is simple to compute. I want to focus on how we can maximize this denoising matching term, or equivalently minimize this KL term. Here we see three probability distributions.

First, what's the probability of a noisy image at time step t given a clean image x_0? All we know is the transition probability of x_t given x_{t-1}. To derive this, we need to know
the reparameterization trick. This trick rewrites a random variable x as a deterministic function of a noise variable epsilon: intuitively, we can represent any Gaussian distribution by scaling epsilon by the standard deviation and shifting by the mean. With this trick we can express x_1, x_2, and so on. Plugging x_1 into the second equation leads to this expression. Now we can simplify it, because the sum of two independent Gaussian variables is also Gaussian. Doing this recursively, we can write the noisy image at time step t as a function of the clean image x_0 and a noise variable epsilon. This means that we can directly sample x_t from this closed-form Gaussian distribution.
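As a minimal sketch in PyTorch (the linear beta schedule, tensor shapes, and variable names are illustrative choices, not taken from a specific codebase), direct sampling of x_t from x_0 looks like this:

```python
import torch

# Illustrative linear noise schedule: beta_t in (0, 1) for t = 0..T-1.
T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)     # \bar{alpha}_t = prod_{s<=t} alpha_s

def q_sample(x0: torch.Tensor, t: torch.Tensor, eps: torch.Tensor) -> torch.Tensor:
    """Sample x_t ~ q(x_t | x_0) in closed form via the reparameterization trick."""
    ab = alpha_bars[t].view(-1, 1, 1, 1)      # \bar{alpha}_t for each image in the batch
    return ab.sqrt() * x0 + (1.0 - ab).sqrt() * eps

# Usage: x0 is a batch of clean images, eps ~ N(0, I), t is a batch of integer time steps.
x0 = torch.randn(4, 3, 64, 64)                # stand-in for real images
t = torch.randint(0, T, (4,))
eps = torch.randn_like(x0)
xt = q_sample(x0, t, eps)
```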
The second distribution says the following: suppose we know the clean image x_0 and the noisy version of it after t forward diffusion steps; what's the probability of a "less noisy" image x_{t-1}? This tells us how to denoise a noisy image when we know the ground-truth clean image x_0. We use it to guide our denoising network, which models the probability of a less noisy
image x_{t-1} given a noisy image x_t. Here is an actual photo of what's happening
when training a diffusion model. To derive this term,
we apply the definition of conditional probability and Bayes' rule. Here we know exactly what these three probabilities are. After some calculation, we find that it is also a Gaussian distribution. The mean lies on the line between the noisy image x_t and the clean image x_0, and we can also compute the variance in closed form. The probability from our denoising network is also a Gaussian. Since both are Gaussian distributions with the same variance, minimizing the KL-divergence term is equivalent to minimizing the distance between the means of the two distributions.

The process looks like this: we sample a clean image x_0 from the dataset and a noise sample from a Gaussian distribution with zero mean and unit variance. We encode the clean image with forward diffusion to get a noisy image x_t. We then compute the L2 loss between the predicted and the ground-truth mean.
By looking at this training objective, we arrive at three interpretations.

First, in the ground-truth mean we see a linear combination of the noisy image x_t and the clean image x_0. But why would we ask the denoising network to predict the noisy image x_t that we already know from the input? Instead, we express the estimated mean in the same form as the ground-truth one and only ask the model to predict the clean image x_0.

Second, from the forward diffusion process we know the relationship between the noisy image, the clean image, and the added noise, so we can express the clean image x_0 as a function of the noisy image x_t and the noise epsilon. When we plug this in, we arrive at a new form of the ground-truth mean. Similarly, we can match the form of the estimated mean with the ground-truth one and only ask the denoising network to predict the noise.
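A minimal sketch of this noise-prediction objective, continuing the schedule and q_sample helper sketched earlier (the denoising network model(x_t, t) is assumed to return a tensor shaped like x_t; this is the commonly used simplified, unweighted loss):

```python
import torch
import torch.nn.functional as F

def training_step(model, x0, optimizer):
    """One training step: predict the noise added by forward diffusion (L2 loss)."""
    t = torch.randint(0, T, (x0.shape[0],), device=x0.device)  # a random time step per image
    eps = torch.randn_like(x0)                                  # the ground-truth noise
    xt = q_sample(x0, t, eps)                                   # forward-diffuse x_0 to x_t
    eps_pred = model(xt, t)                                     # the network predicts the noise
    loss = F.mse_loss(eps_pred, eps)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```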
To discuss the third interpretation, we need Tweedie's formula. The formula states that if we observe a sample z from a Gaussian distribution, the posterior expectation of the mean is the sample plus a correction term involving the gradient of the log-likelihood, also known as the score.
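Concretely, for a Gaussian observation z with mean mu_z and covariance Sigma_z, the formula reads:

```latex
z \sim \mathcal{N}(\mu_z, \Sigma_z)
\quad\Longrightarrow\quad
\mathbb{E}\left[\mu_z \mid z\right] = z + \Sigma_z \,\nabla_{z} \log p(z)
```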
Let's apply the formula to our forward diffusion probability. Replacing the mean here, we get this expression, and when we substitute for the clean image x_0, we arrive at an equation involving the score. Similarly, we can parameterize our mean estimate using the same form.

How can we compute this score? From a simple derivation, it turns out to be just the noise multiplied by a negative scalar. Intuitively, since the noisy image x_t comes from adding noise to the clean image x_0, moving in the opposite direction of the noise to denoise the image naturally increases the log-probability.
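The three parameterizations are interchangeable. A small sketch of the conversions, reusing the alpha_bars schedule from the earlier snippet (the relation is score = -eps / sqrt(1 - alpha_bar_t)):

```python
def eps_to_x0(xt, t, eps):
    """Invert x_t = sqrt(ab)*x_0 + sqrt(1-ab)*eps to get the clean-image estimate."""
    ab = alpha_bars[t].view(-1, 1, 1, 1)
    return (xt - (1.0 - ab).sqrt() * eps) / ab.sqrt()

def eps_to_score(t, eps):
    """Score of the forward marginal q(x_t | x_0): the noise times a negative scalar."""
    ab = alpha_bars[t].view(-1, 1, 1, 1)
    return -eps / (1.0 - ab).sqrt()
```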
Here's the formulation of score-based models. Let's visualize the process in a simple 2-D plot. We take a clean image x_0 from our training dataset, select a time step t, and scale the clean image. We then sample random noise and scale it. Adding these two up gives the noisy image x_t.

We could train the diffusion model to remove all the noise and directly predict the clean image. However, this is challenging, and the predicted image may still be of low quality. So instead we only take a small step along this line. Alternatively, we can ask the network to predict what noise has been added to the image and take a small step in the opposite direction. Or we can ask the network to predict the score. In all three cases, we arrive at exactly the same distribution for the less noisy image x_{t-1} given a noisy image x_t. We can sample from this distribution and repeat the process. This creates a path from noise to a clean image.
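A minimal sketch of this iterative sampling loop with the noise-prediction network and the schedule from before (setting sigma_t^2 = beta_t is one common choice):

```python
import torch

@torch.no_grad()
def ddpm_sample(model, shape=(1, 3, 64, 64)):
    """Ancestral sampling: start from pure noise and denoise one step at a time."""
    x = torch.randn(shape)                                   # x_T ~ N(0, I)
    for t in reversed(range(T)):
        tt = torch.full((shape[0],), t, dtype=torch.long)
        eps_pred = model(x, tt)                              # predicted noise at this step
        a, b, ab = alphas[t], betas[t], alpha_bars[t]
        mean = (x - b / (1.0 - ab).sqrt() * eps_pred) / a.sqrt()
        noise = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        x = mean + b.sqrt() * noise                          # sample x_{t-1}
    return x                                                 # approximately a clean image x_0
```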
After training our diffusion model, we can use the denoising network to generate many samples. This is called unconditional generation. But perhaps we want to ask the model to generate specific content, such as cat images, or we may want to see cats wearing sunglasses by providing a text prompt.

Let's revisit the score-estimation interpretation of diffusion models. At time step t, we use our denoising network to predict the gradient of the log-likelihood, which is the direction that maximizes the log-probability. For conditional generation, we want to predict the conditional score instead. By applying Bayes' rule, the conditional score consists of the unconditional score plus the adversarial gradient of a classifier. We can scale the adversarial gradient by a positive factor gamma. This is great, because we can reuse the unconditional model and use an additional classifier to guide the generation. We call this "classifier guidance".
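In symbols, with y the condition and gamma the guidance scale (applying Bayes' rule and noting that log p(y) does not depend on x_t), the guided score is:

```latex
\nabla_{x_t} \log p(x_t \mid y) = \nabla_{x_t} \log p(x_t) + \nabla_{x_t} \log p(y \mid x_t),
\qquad
\text{guided score: } \nabla_{x_t} \log p(x_t) + \gamma \,\nabla_{x_t} \log p(y \mid x_t)
```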
But we'll have to train another classifier, because an off-the-shelf classifier is usually trained on clean images. Luckily, we can use the predicted noise to estimate a clean image. This estimated clean image is a bit blurry, but off-the-shelf classifiers are usually fairly robust to this.

This is nice, but do we really need an additional classifier? By applying Bayes' rule to the second term, we see that it consists of an unconditional and a conditional score. Plugging this back into our equation, we get "classifier-free guidance". But training two denoising networks is expensive, so we train a single conditional denoising network and use a null condition to represent the unconditional model. Here is a comparison between unguided samples and samples generated with classifier-free guidance.
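A minimal sketch of how the guided noise prediction is formed at each sampling step (assuming a conditional network model(x_t, t, cond) where cond=None stands in for the null condition; names and the guidance scale are illustrative):

```python
def cfg_eps(model, xt, t, cond, guidance_scale=7.5):
    """Classifier-free guidance: blend unconditional and conditional noise predictions."""
    eps_uncond = model(xt, t, cond=None)   # null condition -> unconditional prediction
    eps_cond = model(xt, t, cond=cond)     # e.g. a text-prompt embedding as the condition
    # Push the prediction further in the direction implied by the condition.
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)
```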
How do we generate high-resolution images? There are roughly three types of methods.

The first is cascading. We first use a diffusion model to generate a low-resolution image, say 64 by 64, and then train a separate diffusion model that upscales the low-resolution image to a higher resolution.

The second type of approach is Latent Diffusion Models (LDM). We first train a Variational Autoencoder that encodes a high-resolution image into a low-resolution latent code. We train both the encoder and the decoder using a reconstruction loss, sometimes with an adversarial loss to ensure sharp results, and a regularization loss on the latent. Now we can train our diffusion model efficiently in the latent space. Once we get a clean latent, we use the decoder to map it back to a high-resolution image.
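A rough sketch of what training in the latent space looks like, reusing q_sample, T, and the loss pattern from earlier (vae.encode and vae.decode are placeholder method names, not a specific library's API):

```python
import torch
import torch.nn.functional as F

def ldm_training_step(vae, denoiser, x0, optimizer):
    """Train the denoiser on low-resolution latents instead of full-resolution pixels."""
    with torch.no_grad():
        z0 = vae.encode(x0)                      # high-res image -> low-res latent code
    t = torch.randint(0, T, (z0.shape[0],), device=z0.device)
    eps = torch.randn_like(z0)
    zt = q_sample(z0, t, eps)                    # forward diffusion in latent space
    loss = F.mse_loss(denoiser(zt, t), eps)      # same noise-prediction objective as before
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# At generation time, run the usual sampling loop over latents and finish with vae.decode(z).
```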
In both cascade and latent diffusion models, we need to train several models separately. End-to-end methods aim to generate high-resolution images with a single diffusion model. Several promising ideas have been proposed, such as adjusting the noise schedule for high-resolution image generation, multiscale losses, and progressive training.
Diffusion models are very slow, because we need to evaluate the denoising network several hundred or even a thousand times to get a good sample. Here are some methods that accelerate sampling to make diffusion models more practical.

Let's first review the training objective of DDPM: we train our denoising network to predict the noise. If we look at this training objective and don't care about the weighting term, we just need to ensure that: 1) the forward diffusion marginal q(x_t | x_0) remains the same; 2) the mean of the ground-truth denoising step is a linear combination of the noisy image x_t and the added noise epsilon; and 3) the estimated mean has the same form. These three constraints do NOT require the transition probability to be a Markovian process.
In the DDIM paper, the authors construct a class of non-Markovian diffusion processes and find coefficients a and b that satisfy these constraints. This gives us these two Gaussian distributions. Interestingly, sigma_t can be set to arbitrary values. By setting it to zero, we get a deterministic generative process: the only randomness comes from the initial noise sample.
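A minimal sketch of the deterministic DDIM update (sigma_t = 0), reusing the noise-prediction model and alpha_bars schedule from before; the number of steps and step spacing are illustrative:

```python
import torch

@torch.no_grad()
def ddim_sample(model, shape=(1, 3, 64, 64), num_steps=50):
    """Deterministic DDIM sampling over a shortened sequence of time steps."""
    step_ids = torch.linspace(T - 1, 0, num_steps).long()   # e.g. 50 steps instead of 1000
    x = torch.randn(shape)                                   # the only source of randomness
    for i, t in enumerate(step_ids):
        tt = torch.full((shape[0],), int(t), dtype=torch.long)
        eps = model(x, tt)
        ab_t = alpha_bars[t]
        ab_prev = alpha_bars[step_ids[i + 1]] if i + 1 < num_steps else torch.tensor(1.0)
        x0_pred = (x - (1.0 - ab_t).sqrt() * eps) / ab_t.sqrt()      # current clean-image estimate
        x = ab_prev.sqrt() * x0_pred + (1.0 - ab_prev).sqrt() * eps  # jump to the earlier time step
    return x
```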
Here is the quantitative evaluation. With 1000 denoising steps, DDPM performs a bit better. But when we reduce the number of denoising steps, the quality of DDPM samples quickly degrades, while the results from DDIM remain decent. The best part is that we don't need to retrain the model: we just take a model trained with the DDPM objective and accelerate it with the DDIM sampler. But even with the DDIM sampler, sampling still requires quite a few steps.
Let's further reduce the number of steps using distillation. We can use a pre-trained model as the teacher and train a student denoising network to reproduce, in one sampling step, the output the teacher produces in two sampling steps. After this distillation process, we halve the number of sampling steps. We can then use the student as the new teacher and repeat the process until we reach the target number of sampling steps.
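A very rough schematic of one such progressive-distillation round under the conventions above; this is not the exact objective from the progressive distillation paper, and sample_timestep, two_teacher_steps, and one_student_step are hypothetical helpers standing in for the sampler details:

```python
import torch
import torch.nn.functional as F

def distillation_round(teacher, student, data_iter, optimizer, num_iters):
    """Schematic: the student's single step should match two consecutive teacher steps."""
    for _ in range(num_iters):
        x0 = next(data_iter)
        t = sample_timestep(x0.shape[0])                 # hypothetical helper
        xt = q_sample(x0, t, torch.randn_like(x0))
        with torch.no_grad():
            target = two_teacher_steps(teacher, xt, t)   # hypothetical: two small DDIM steps
        pred = one_student_step(student, xt, t)          # hypothetical: one big step
        loss = F.mse_loss(pred, target)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return student   # the student then becomes the teacher for the next round
```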
Another idea is to distill a classifier-free guided diffusion model, which requires evaluating both a conditional and an unconditional model, into a single model. We can further apply the previous ideas to make it faster, such as progressive distillation and latent diffusion models.

We can also distill a pre-trained denoising network using consistency models. The main idea is to train a model so that, for any point on the path, the model predicts the same origin. This supports single-step generation as well as multi-step generation.
Applying consistency distillation in the latent space gives us high-resolution image generation using only a few steps. We can also extend the idea of latent consistency models with LoRA. But what's LoRA?

Given a pretrained model, sometimes we want to generate content in a particular style, such as pixel art, LEGO, IKEA instructions, or anime. We can achieve these styles by fine-tuning the base model with additional data. However, this is computationally expensive and requires a lot of storage. In many cases, it turns out we only need to fine-tune the cross-attention layers in the denoising network. We freeze the pre-trained weights W_0 and optimize the residual parameters, but there are still a lot of parameters to update. We can reduce the number of parameters by using a low-rank approximation of this weight update.
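A minimal sketch of a LoRA-wrapped linear layer (e.g. a cross-attention projection); the rank, scaling, and initialization follow common practice, and the class name is my own:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen pretrained linear layer plus a trainable low-rank update:
    W = W_0 + (alpha / r) * B @ A, with rank r much smaller than the layer size."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():                 # freeze the pretrained weights W_0
            p.requires_grad_(False)
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero init: no change at start
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.t()) @ self.B.t()
```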
Therefore, we can accelerate consistency distillation with LoRA. Here the acceleration vector is the parameter difference between the distilled model and the base model. More interestingly, we can accelerate other fine-tuned models by linearly combining the acceleration and style vectors. Here are some examples.

Another recent distillation method trains the student denoising model using score distillation. However, these predictions are usually blurry, so their main idea is to add an adversarial loss by training a discriminator. Here are some results. After this distillation, these models can achieve text-to-image generation at interactive rates.
To sum up, we covered the training objective of diffusion models and its three interpretations; how we can guide the generation with classifier and classifier-free guidance; how we can synthesize high-resolution images using cascade, latent, and end-to-end diffusion models; and how we can speed up sampling using the DDIM sampler and various distillation techniques. Please comment below if you have any questions. Thanks for learning with me. I will see you next time.