Hello! If you haven't been living under a rock these past months, then you have surely heard about diffusion models. We mean: OpenAI’s diffusion models beat GANs at image synthesis, and OpenAI’s model GLIDE generates more photorealistic images than DALL-E with… text-guided diffusion models. What are these diffusion models and why do
they keep impressing us with their image generation capabilities? This is what we will explain in this AI Coffee
Break. Ah, here is my alert that we should thank
Weights & Biases for supporting us for this video! Today, we want to highlight the Weights & Biases feature called "Alerts", which comes in very handy when you are keeping track of your machine learning experiments: you can be notified via Slack or email if your W&B run has crashed or if a custom trigger has fired, such as your loss going to NaN or a step in your machine learning pipeline completing. W&B Alerts apply to all projects where you launch runs, including both personal and team projects. Get started with W&B Alerts in two quick steps:
Turn on alerts in your Weights & Biases user settings. You can get notified via Slack and/or email when your run finishes, when it crashes, or on any other custom trigger
you like. For custom triggers, add a wandb.alert call to your code wherever you’d like to be alerted. I think the Alerts feature is extremely useful even in my low-cost projects, and imagine how some Weights & Biases users have saved on large cloud GPU bills by being alerted early to crashed runs while training large, expensive models.
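Here is a minimal sketch of what such a custom alert could look like in code, assuming wandb is installed and you are logged in; the project name and the toy loss are hypothetical placeholders:

```python
import math
import random

import wandb

run = wandb.init(project="diffusion-demo")  # hypothetical project name

for step in range(1000):
    loss = random.random()  # stand-in for the loss of your real training step
    wandb.log({"loss": loss}, step=step)

    # Custom trigger: fire a Slack/email alert if the loss becomes NaN
    if math.isnan(loss):
        wandb.alert(
            title="Loss went to NaN",
            text=f"Loss diverged at step {step}.",
            level=wandb.AlertLevel.WARN,
        )
        break

run.finish()
```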
Now, back to diffusion models. Let’s say we want to generate an image. What generative models are there on the menu
in 2022? Well, we can identify four main types of generative
models. Here, I am using the visualization from the awesome blog post by Lilian Weng; do check it out, especially if you are into all the cool math. The blog post about diffusion models is linked
in the description below. So, we have 4 types of generative models on
the menu. Let’s see what the principles of each model
paradigm are. First, we have generative adversarial networks, or GANs for short. They generate images from noise, much like a diffusion model. But this is where the commonalities stop. So, we have a generator neural network that starts from noise, or from some informative conditioning variable like a class label or a text encoding, and generates something that should look like a realistic image. Its success is rated by the discriminator, which labels the image as either a real image coming from the training set or a fake one synthesized by the generator. If you are curious about how GANs work in more detail, check out our previous video about this.
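To make the generator-versus-discriminator game a bit more concrete, here is a minimal sketch of one adversarial training step, assuming PyTorch and toy vector-shaped "images" instead of real ones (the layer sizes are arbitrary):

```python
import torch
import torch.nn as nn

G = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 32))  # noise -> fake sample
D = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 1))   # sample -> real/fake logit
bce = nn.BCEWithLogitsLoss()
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)

real = torch.randn(8, 32)   # stand-in for a batch of real training images
z = torch.randn(8, 16)      # the noise the generator starts from
fake = G(z)

# Discriminator update: label real samples as 1, generated samples as 0
loss_d = bce(D(real), torch.ones(8, 1)) + bce(D(fake.detach()), torch.zeros(8, 1))
opt_d.zero_grad(); loss_d.backward(); opt_d.step()

# Generator update: try to fool the discriminator into predicting 1 for fakes
loss_g = bce(D(fake), torch.ones(8, 1))
opt_g.zero_grad(); loss_g.backward(); opt_g.step()
```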
Then, there are Variational Autoencoders. Vanilla autoencoders take the input and encode
it, usually by reducing it to a latent space of lower dimensionality. The decoder part then tries to reconstruct the input, with the goal of minimizing the distance between the input and its reconstruction. So far, vanilla autoencoders structure this latent space z however they see fit, as long as they are able to reconstruct the data seen during training. But this is not necessarily a meaningful structure. Meaningful could mean that similar data points are close to each other, and dissimilar points are far away. Variational Autoencoders have their own way of introducing a meaningful structure: they add an extra regularization term on this latent space to make sure that the latent representations are not arranged arbitrarily, but according to a pre-defined distribution, usually a Gaussian. This makes the space around the learned data
points behave better, and one can sample more sensibly from points in between training points. With this regularization, VAEs implicitly learn the data distribution.
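As a rough sketch of that regularization, assuming PyTorch and flattened 784-dimensional inputs (the layer sizes and the plain MSE reconstruction term are simplifications):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

enc = nn.Linear(784, 2 * 32)   # predicts mean and log-variance of q(z|x)
dec = nn.Linear(32, 784)       # reconstructs the input from a latent z

x = torch.rand(8, 784)                                      # stand-in batch
mu, logvar = enc(x).chunk(2, dim=-1)
z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)     # reparameterization trick
x_hat = dec(z)

recon = F.mse_loss(x_hat, x)                                    # reconstruction term
kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())   # pulls q(z|x) toward N(0, I)
loss = recon + kl   # the KL term is the regularizer on the latent space
```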
A class of models that explicitly learn the data distribution are flow-based models. In a nutshell, flow-based models do not learn
just any encoder and decoder, but specific ones: they apply a transformation f, parametrized by a neural network, to the data, much like the encoding step in autoencoders. But the decoder is not a fresh neural net that has to learn the decoding process by itself; it is simply the exact inverse of the function f. Achieving this invertibility of f with neural networks requires quite a few tricks that we will not discuss in this video. We’ll link Lilian Weng’s blog post about
this in the video description. And finally on the generative menu, we get
to the model class we are talking about today: diffusion models. How do they work? We cite from the figure here: “Diffusion models gradually add Gaussian noise and then reverse.” Well, everything is crystal clear. Okay, bye! No, Ms. Coffee Bean, let’s break it down a bit. Let’s talk about the term “diffusion models” first: why are these models called that? “Diffusion” is a term you have maybe heard
in physics classes about thermodynamics. If we have a system with a high concentration
of a substance, like perfume, in one place, then it is not in equilibrium. To transition to equilibrium, the diffusion process happens: the perfume molecules move away from the place of high concentration and spread through the system until the concentration is the same everywhere. Diffusion makes everything homogeneous in the end. So, this is what the blog post means when it says, we cite, “Diffusion models are inspired by non-equilibrium thermodynamics.” And “They define a Markov chain of diffusion steps…” A Markov what? A Markov chain. It’s a sequence of variables in which the state of each variable depends only on the previous one. So, this is Markov, this is… not so Markov! Now it’s Markov again. So, for our diffusion models, we have this
Markov chain where random noise is added to the data. We take the image and, during the forward diffusion process, add a certain amount of noise to it; sequentially, so it's Markov. We store the noisier image and go on to generate the next image in the sequence by adding a little bit more noise. And we do this for a certain number of steps. If we did this infinitely many times, we would get to an image that is just noise. But infinity is something that only mathematics can deal with, right? In reality, we do this only, let’s say, 150 times to get a last image in the sequence that is a good approximation of pure noise.
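As a minimal sketch of this forward chain, assuming PyTorch and a hypothetical linear noise schedule (the exact schedule is a design choice that varies between papers):

```python
import torch

T = 150                                 # number of noising steps, as in the example above
betas = torch.linspace(1e-4, 0.02, T)   # hypothetical noise schedule

def forward_diffusion(x0: torch.Tensor) -> list[torch.Tensor]:
    xs = [x0]
    x = x0
    for t in range(T):
        noise = torch.randn_like(x)
        # q(x_t | x_{t-1}): keep a bit less of the previous image, add a bit of fresh noise
        x = torch.sqrt(1 - betas[t]) * x + torch.sqrt(betas[t]) * noise
        xs.append(x)                    # store the noisier image and keep going
    return xs                           # xs[-1] is approximately pure Gaussian noise

images = forward_diffusion(torch.rand(1, 3, 64, 64))   # a stand-in 64x64 RGB "image"
```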
And here is where it becomes interesting: how can we generate images with this thing? We take a neural network and learn to reverse this diffusion process. So, the backward diffusion process involves the same network, with the same weights, being applied at each step to generate the image at step t-1 from the image at step t. To simplify the problem even further, one can let the network predict the noise that needs to be subtracted from the image at each step, instead of predicting the image itself.
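Here is a sketch of that noise-prediction objective in the DDPM style, assuming PyTorch; model is a placeholder for the denoising network discussed next, and the closed-form jump to step t follows from the Gaussian chain sketched above:

```python
import torch
import torch.nn.functional as F

T = 150
betas = torch.linspace(1e-4, 0.02, T)            # same hypothetical schedule as before
alpha_bar = torch.cumprod(1 - betas, dim=0)      # cumulative product of (1 - beta_t)

def noise_prediction_loss(model, x0: torch.Tensor) -> torch.Tensor:
    t = torch.randint(0, T, (x0.shape[0],))                 # a random step for each image
    eps = torch.randn_like(x0)                              # the noise we add
    a = alpha_bar[t].view(-1, 1, 1, 1)
    x_t = torch.sqrt(a) * x0 + torch.sqrt(1 - a) * eps      # jump straight to the noisy x_t
    eps_pred = model(x_t, t)                                # the network predicts the noise...
    return F.mse_loss(eps_pred, eps)                        # ...not the image itself
```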
In any case, the architecture of the neural network must be chosen such that it preserves the data dimensionality, like a UNet. A UNet is a convolution-based neural network that downsamples an image into a lower-dimensional representation and reconstructs it during upsampling. The downsampling and upsampling stacks communicate
through skip connections. And the UNet used in these two papers also uses global attention in the lower-resolution layers, because, you know, attention is all you need after all.
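To illustrate the shape-preserving UNet idea (this is not the actual GLIDE architecture, which is far larger, conditions on the timestep, and adds attention), here is a toy sketch assuming PyTorch:

```python
import torch
import torch.nn as nn

class TinyUNet(nn.Module):
    def __init__(self, ch=3):
        super().__init__()
        self.down1 = nn.Conv2d(ch, 32, 3, stride=2, padding=1)          # 64 -> 32
        self.down2 = nn.Conv2d(32, 64, 3, stride=2, padding=1)          # 32 -> 16
        self.up1 = nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1)   # 16 -> 32
        self.up2 = nn.ConvTranspose2d(64, ch, 4, stride=2, padding=1)   # 32 -> 64
        self.act = nn.SiLU()

    def forward(self, x):
        d1 = self.act(self.down1(x))         # downsampling stack
        d2 = self.act(self.down2(d1))
        u1 = self.act(self.up1(d2))          # upsampling stack
        u1 = torch.cat([u1, d1], dim=1)      # skip connection at matching resolution
        return self.up2(u1)                  # output has the same shape as the input

out = TinyUNet()(torch.rand(1, 3, 64, 64))   # in and out shape: (1, 3, 64, 64)
```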
But why would we ever want to use such a diffusion model when there are GANs around, and GPT-like models such as DALL-E? The short answer is: DALL-E is great when it comes to generation diversity. What you see here now are generations from DALL-E, OpenAI’s previous model capable of generating images from text. But while the cartoon-like images are great
and diverse, the photorealism of this cat or of these signs
is not exceptional. But DALL-E was not a diffusion model; it was basically a GPT-like model autoregressively generating the image, with a piece of text and the start of an image as input. If we do want the model to go bananas and
generate many versions of a baby daikon radish in a tutu walking a dog, or the avocado armchair, DALL-E is great. But if we care much more about high fidelity
and realism in our generations, we usually turn to GANs,
capable of producing photorealistic images. Except that diffusion models are even better at realism, as we see in this paper, also from OpenAI. Because, you see, diffusion models are more
faithful to the data in a sense. While a GAN gets random noise, or a class-conditioning
variable as input and then BAM, it must produce a realistic sample, diffusion models are a
much slower, iterative, and guided process. When reverting from noise to a real image by going through iteration after iteration of denoising, there is little room for going very far astray. The generation process runs through all these checkpoints and, at each step, more and more details can be added to the image, which started out as pure noise, just like with GANs. Diffusion models are ridiculously faithful to the image data! Exactly how faithful, we can see with GLIDE,
which is OpenAI’s next iteration on diffusion models. Before GLIDE, OpenAI’s work on diffusion
models generated images from class labels, but now, this next iteration has successfully integrated textual information into the generation process, so we can produce images from text with diffusion models. But it is funny how the authors had to hammer the text information into GLIDE to convince it to pay attention to it, so let’s break it down here. GLIDE does the following: we have our
data consisting of images and their captions. After the forward diffusion process, we have
noisier and noisier versions of the images. The diffusion model is trained to reverse this process using a UNet-based architecture, very much like in the previous paper from OpenAI, the belligerent one beating GANs. But now, the backward diffusion takes the text prompt into account too. So, the GLIDE authors took the text, encoded it with a transformer (what a surprise, it's a transformer!), and used the final token embedding as class conditioning in the diffusion model. So now, the neural network needs to generate
an image with less and less noise, but has more guidance from this additional input of a class-specific, or rather text-specific, variable specifying what kind of image to generate. But the authors did not stop there, because seemingly (though the paper is not explicit about this), the text information on its own was not enough. So additionally, each, each! attention layer in the model also attends
to all the text tokens that the transformer produces when encoding the text. Cool. That should be enough to make the textual
information evident for GLIDE, right? Right? No. I mean, at least for the training procedure
it does suffice. But it is still not enough, so the authors hack the text even further into the diffusion model, which is too faithful to the image modality, and they do it at inference time. So, brace yourself, this is not training anymore. The authors tried out CLIP-guided diffusion to make the text more persuasive when it comes to image generation. The idea here is to use an extra model to make the generated image correspond better to the text. The extra model here is CLIP, because CLIP, also from OpenAI, is trained to predict a similarity score between an image and a text. So, to generate an image with CLIP-guided
diffusion, the authors first let GLIDE denoise the image, conditioned on text. But then, they further guide the process by
adding the gradient of CLIP’s image-text similarity with respect to the image. Ms. Coffee Bean, slow down. Conceptually, this takes the initial denoised image and moves it in the direction in which CLIP predicts a high image-text match.
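A conceptual sketch of that nudge, assuming PyTorch; clip_similarity here is a hypothetical stand-in for a CLIP image-text score, not the actual CLIP API, and the scale is arbitrary:

```python
import torch

def clip_guided_nudge(x_denoised: torch.Tensor, text_emb: torch.Tensor,
                      clip_similarity, scale: float = 100.0) -> torch.Tensor:
    x = x_denoised.detach().requires_grad_(True)
    score = clip_similarity(x, text_emb)        # how well does the image match the text?
    grad = torch.autograd.grad(score, x)[0]     # gradient of the match w.r.t. the image
    return (x + scale * grad).detach()          # move the image toward a better match
```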
This is like “deep dream”, if you are familiar with that, but now we dream in the direction of what CLIP thinks is a better image-text match. This is a trick to make the generations match the text better, because, as we previously said, it seems like GLIDE kind of wants to ignore the text if left on its own. Such classifier-guided (here, CLIP-guided) diffusion is one way to
make the text information more obvious to GLIDE during inference,
but there is also another way: the classifier-free guidance the authors used, and it worked better in their case. As the name already says, no extra model is
needed here. It is just a weird trick applied at each diffusion
step to emphasize the text even more. First, GLIDE produces the image twice, once with text and once without access to the text. Then we compute the difference between the diffusion step with text and the one without text; with this difference, we now know in which direction to move if we want to go from “no text” to “text”. So, if we take the text-less generation and add this difference scaled by quite a lot, the output of the model without text information is heavily extrapolated in the direction of the text information.
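In code, the extrapolation trick boils down to roughly one line; this is a sketch assuming the model’s per-step outputs are noise predictions as PyTorch tensors, and the guidance scale is a hyperparameter:

```python
import torch

def classifier_free_guidance(eps_with_text: torch.Tensor,
                             eps_without_text: torch.Tensor,
                             guidance_scale: float = 3.0) -> torch.Tensor:
    # Start from the text-less prediction and extrapolate toward the text-conditioned one.
    return eps_without_text + guidance_scale * (eps_with_text - eps_without_text)
```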
Is this a weird hack? Yes, it is! But it works. The results of GLIDE are just so convincing, look at this. GLIDE is trained on the same data as DALL-E and has almost 4 times fewer parameters than DALL-E, but nails the photorealism aspect. The authors conducted human evaluation experiments in which it is clear that the majority of raters preferred GLIDE’s generations over the blurrier and messier outputs of DALL-E. But it’s not all great news for GLIDE and
diffusion models in general: diffusion models, having to go through all the diffusion steps
sequentially (150 in GLIDE’s case), take much longer than GANs, for example. Further bad news is that OpenAI, being OpenAI, didn’t release the full GLIDE model, but a smaller GLIDE version trained on a smaller, curated dataset. And how should I break this news to you… it… it cannot produce the avocado armchair! I tried a few times, with different noise, to get different generations, but nah. No real avocado armchair. Ms. Coffee Bean is sad now. Is there anything cool YOU can produce with
the released model? Find the link in the description, if you're
interested. And let us know in the comments what you think about OpenAI not releasing the big version of the model. Thanks for watching! We’re so happy we finally got to publish
this video about GLIDE. It has been waiting in the coffee machine
for so long, but Ms. Coffee Bean had other coffees to brew first. Can’t wait to see you next time! Okay, bye.