Hello! If you haven't been living under a rock these past months, then you have surely heard about diffusion models. We mean: OpenAI’s diffusion models beat GANs at image synthesis, and OpenAI’s model GLIDE generates more photorealistic images than DALL-E with… text-guided diffusion models. What are these diffusion models and why do
they keep impressing us with their image generation capabilities? This is what we will explain in this AI Coffee
Break. Ah, here is my alert that we should thank
Weights & Biases for supporting us for this video! Today, we want to highlight the Weights & Biases feature called "Alerts", which comes in very handy when you are keeping track of your machine learning experiments: you can be notified via Slack or email if your W&B run has crashed or if a custom trigger has fired, such as your loss going to NaN or a step in your machine learning pipeline completing. W&B Alerts apply to all projects where you launch runs, including both personal and team projects. Get started with W&B Alerts in two quick steps:
Turn on alerts in your Weights & Biases user settings. You can get notified via Slack and/or email when your run finishes, when it crashes, or on any other custom trigger
you like. For custom triggers, add a wandb.alert call to your code wherever you’d like to be alerted. I think the Alerts feature is extremely useful even in my low-cost projects, and imagine how some Weights & Biases users have saved on large cloud GPU bills by being alerted early to crashed runs while training large, expensive models.
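Here is a minimal sketch of what such a custom alert could look like in code, assuming wandb is installed and you are logged in; the project name and the toy loss are hypothetical placeholders:

```python
import math
import random

import wandb

run = wandb.init(project="diffusion-demo")  # hypothetical project name

for step in range(1000):
    loss = random.random()  # stand-in for the loss of your real training step
    wandb.log({"loss": loss}, step=step)

    # Custom trigger: fire a Slack/email alert if the loss becomes NaN
    if math.isnan(loss):
        wandb.alert(
            title="Loss went to NaN",
            text=f"Loss diverged at step {step}.",
            level=wandb.AlertLevel.WARN,
        )
        break

run.finish()
```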
Now, back to diffusion models. Let’s say we want to generate an image. What generative models are there on the menu
in 2022? Well, we can identify four main types of generative
models. Here, I am using the visualization from the awesome blog post by Lilian Weng; do check it out, especially if you are into all the cool math. The blog post about diffusion models is linked
in the description below. So, we have 4 types of generative models on
the menu. Let’s see what the principles of each model
paradigm are. First, we have generative adversarial networks, or GANs for short. They generate images from noise, much like a diffusion model. But this is where the commonalities stop. So, we have a generator neural network that starts from noise, or from some informative conditioning variable like a class label or a text encoding, and generates something that should look like a realistic image. Its success is rated by the discriminator, which labels the image as either a real image coming from the training set or a fake one synthesized by the generator. If you are curious about how GANs work in more detail, check out our previous video about this.
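To make the generator-versus-discriminator game a bit more concrete, here is a minimal sketch of one adversarial training step, assuming PyTorch and toy vector-shaped "images" instead of real ones (the layer sizes are arbitrary):

```python
import torch
import torch.nn as nn

G = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 32))  # noise -> fake sample
D = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 1))   # sample -> real/fake logit
bce = nn.BCEWithLogitsLoss()
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)

real = torch.randn(8, 32)   # stand-in for a batch of real training images
z = torch.randn(8, 16)      # the noise the generator starts from
fake = G(z)

# Discriminator update: label real samples as 1, generated samples as 0
loss_d = bce(D(real), torch.ones(8, 1)) + bce(D(fake.detach()), torch.zeros(8, 1))
opt_d.zero_grad(); loss_d.backward(); opt_d.step()

# Generator update: try to fool the discriminator into predicting 1 for fakes
loss_g = bce(D(fake), torch.ones(8, 1))
opt_g.zero_grad(); loss_g.backward(); opt_g.step()
```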
Then, there are Variational Autoencoders. Vanilla autoencoders take the input and encode
it, usually by reducing it to a latent space of lower dimensionality. The decoder part then tries to reconstruct the input, with the goal of minimizing the distance between the input and its reconstruction. So far, vanilla autoencoders structure this latent space z however they see fit, as long as they are able to reconstruct the data seen during training. But this is not necessarily a meaningful structure. Meaningful could mean that similar data points are close to each other, and dissimilar points are far away. Variational Autoencoders have their own way of introducing a meaningful structure: they add an extra regularization term on this latent space to make sure that the latent representations are not arranged arbitrarily, but according to a pre-defined distribution, usually a Gaussian. This makes the space around the learned data
points behave better, and one can sample more sensibly from points in between training points. With this regularization, VAEs implicitly learn the data distribution.
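As a rough sketch of that regularization, assuming PyTorch and flattened 784-dimensional inputs (the layer sizes and the plain MSE reconstruction term are simplifications):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

enc = nn.Linear(784, 2 * 32)   # predicts mean and log-variance of q(z|x)
dec = nn.Linear(32, 784)       # reconstructs the input from a latent z

x = torch.rand(8, 784)                                      # stand-in batch
mu, logvar = enc(x).chunk(2, dim=-1)
z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)     # reparameterization trick
x_hat = dec(z)

recon = F.mse_loss(x_hat, x)                                    # reconstruction term
kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())   # pulls q(z|x) toward N(0, I)
loss = recon + kl   # the KL term is the regularizer on the latent space
```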
A class of models that explicitly learn the data distribution are flow-based models. In a nutshell, flow-based models do not learn
just any encoder and decoder, but specific ones: they apply a transformation f, parametrized by a neural network, to the data, much like the encoding step in autoencoders. But the decoder is not a fresh neural net that has to learn the decoding process by itself; it is simply the exact inverse of the function f. Achieving this invertibility of f with neural networks requires quite a few tricks that we will not discuss in this video. We’ll link Lilian Weng’s blog post about
this in the video description. And finally on the generative menu, we get
to the model class we are talking about today: diffusion models. How do they work? We cite from the figure here: “Diffusion models gradually add Gaussian noise and then reverse.” Well, everything is crystal clear. Okay, bye! No, Ms. Coffee Bean, let’s break it down a bit. Let’s talk about the term “diffusion models” first: why are these models called that? “Diffusion” is a term you have maybe heard
in physics classes about thermodynamics. If we have a system with a high concentration
of a substance, like perfume, in one place, then it is not in equilibrium. To transition to equilibrium, the diffusion process happens: the perfume molecules move away from the place of high concentration and spread through the system until the concentration is the same everywhere. Diffusion makes everything homogeneous in the end. So, this is what the blog post means when it says, we cite, “Diffusion models are inspired by non-equilibrium thermodynamics.” And “They define a Markov chain of diffusion steps…” A Markov what? A Markov chain. It’s a sequence of variables in which the state of each variable depends only on the previous one. So, this is Markov, this is… not so Markov! Now it’s Markov again. So, for our diffusion models, we have this
Markov chain where random noise is added to the data. We take the image and, during the forward diffusion process, add a certain amount of noise to it; sequentially, so it's Markov. We store the noisier image and go on to generate the next image in the sequence by adding a little bit more noise. And we do this for a certain number of steps. If we did this infinitely many times, we would get to an image that is just noise. But infinity is something that only mathematics can deal with, right? In reality, we do this only, let’s say, 150 times to get a last image in the sequence that is a good approximation of pure noise.
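As a minimal sketch of this forward chain, assuming PyTorch and a hypothetical linear noise schedule (the exact schedule is a design choice that varies between papers):

```python
import torch

T = 150                                 # number of noising steps, as in the example above
betas = torch.linspace(1e-4, 0.02, T)   # hypothetical noise schedule

def forward_diffusion(x0: torch.Tensor) -> list[torch.Tensor]:
    xs = [x0]
    x = x0
    for t in range(T):
        noise = torch.randn_like(x)
        # q(x_t | x_{t-1}): keep a bit less of the previous image, add a bit of fresh noise
        x = torch.sqrt(1 - betas[t]) * x + torch.sqrt(betas[t]) * noise
        xs.append(x)                    # store the noisier image and keep going
    return xs                           # xs[-1] is approximately pure Gaussian noise

images = forward_diffusion(torch.rand(1, 3, 64, 64))   # a stand-in 64x64 RGB "image"
```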
And here is where it becomes interesting: how can we generate images with this thing? We take a neural network and learn to reverse this diffusion process. So, the backward diffusion process involves the same network, with the same weights, being applied at each step to generate the image at step t-1 from the image at step t. To simplify the problem even further, one can let the network predict the noise that needs to be subtracted from the image at each step, instead of predicting the image itself.
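Here is a sketch of that noise-prediction objective in the DDPM style, assuming PyTorch; model is a placeholder for the denoising network discussed next, and the closed-form jump to step t follows from the Gaussian chain sketched above:

```python
import torch
import torch.nn.functional as F

T = 150
betas = torch.linspace(1e-4, 0.02, T)            # same hypothetical schedule as before
alpha_bar = torch.cumprod(1 - betas, dim=0)      # cumulative product of (1 - beta_t)

def noise_prediction_loss(model, x0: torch.Tensor) -> torch.Tensor:
    t = torch.randint(0, T, (x0.shape[0],))                 # a random step for each image
    eps = torch.randn_like(x0)                              # the noise we add
    a = alpha_bar[t].view(-1, 1, 1, 1)
    x_t = torch.sqrt(a) * x0 + torch.sqrt(1 - a) * eps      # jump straight to the noisy x_t
    eps_pred = model(x_t, t)                                # the network predicts the noise...
    return F.mse_loss(eps_pred, eps)                        # ...not the image itself
```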
In any case, the architecture of the neural network must be chosen such that it preserves the data dimensionality, like a UNet. A UNet is a convolution-based neural network that downsamples an image into a lower-dimensional representation and reconstructs it during upsampling. The downsampling and upsampling stacks communicate
through skip connections. And the UNet used in these two papers also uses global attention in the lower-resolution layers, because, you know, attention is all you need after all.
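To illustrate the shape-preserving UNet idea (this is not the actual GLIDE architecture, which is far larger, conditions on the timestep, and adds attention), here is a toy sketch assuming PyTorch:

```python
import torch
import torch.nn as nn

class TinyUNet(nn.Module):
    def __init__(self, ch=3):
        super().__init__()
        self.down1 = nn.Conv2d(ch, 32, 3, stride=2, padding=1)          # 64 -> 32
        self.down2 = nn.Conv2d(32, 64, 3, stride=2, padding=1)          # 32 -> 16
        self.up1 = nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1)   # 16 -> 32
        self.up2 = nn.ConvTranspose2d(64, ch, 4, stride=2, padding=1)   # 32 -> 64
        self.act = nn.SiLU()

    def forward(self, x):
        d1 = self.act(self.down1(x))         # downsampling stack
        d2 = self.act(self.down2(d1))
        u1 = self.act(self.up1(d2))          # upsampling stack
        u1 = torch.cat([u1, d1], dim=1)      # skip connection at matching resolution
        return self.up2(u1)                  # output has the same shape as the input

out = TinyUNet()(torch.rand(1, 3, 64, 64))   # in and out shape: (1, 3, 64, 64)
```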
But why would we ever want to use such a diffusion model when there are GANs around, and GPT-like models such as DALL-E? The short answer is: DALL-E is great when it comes to generation diversity. What you see here now are generations from DALL-E, OpenAI’s previous model capable of generating images from text. But while the cartoon-like images are great
and diverse, the photorealism of this cat or of these signs
is not exceptional. But DALL-E was not a diffusion model; it was basically a GPT-like model autoregressively generating the image, with a piece of text and the start of an image as input. If we do want the model to go bananas and
generate many versions of a baby daikon radish in a tutu walking a dog, or the avocado armchair, DALL-E is great. But if we care much more about high fidelity
and realism in our generations, we usually turn to GANs,
capable of producing photorealistic images. Except that diffusion models are even better at realism, as we see in this paper, also from OpenAI. Because, you see, diffusion models are more
faithful to the data in a sense. While a GAN gets random noise, or a class-conditioning
variable as input and then BAM, it must produce a realistic sample, diffusion models are a
much slower, iterative, and guided process. When reverting from noise to a real image by going through iteration after iteration of denoising, there is little room for going very far astray. The generation process runs through all these checkpoints and, at each step, more and more details can be added to the image, which started out as pure noise, just like with GANs. Diffusion models are ridiculously faithful to the image data! Exactly how faithful, we can see with GLIDE,
which is OpenAI’s next iteration on diffusion models. Before GLIDE, OpenAI’s work on diffusion
models generated images from class labels, but now, this next iteration has successfully integrated textual information into the generation process, so we can produce images from text with diffusion models. But it is funny how the authors had to hammer the text information into GLIDE to convince it to pay attention to it, so let’s break it down here. GLIDE does the following: we have our
data consisting of images and their captions. After the forward diffusion process, we have
noisier and noisier versions of the images. The diffusion model is trained to reverse this process using a UNet-based architecture, very much like in the previous paper from OpenAI, the belligerent one beating GANs. But now, the backward diffusion takes the text prompt into account too. So, the GLIDE authors took the text, encoded it with a transformer (what a surprise, it's a transformer!), and used the final token embedding as class conditioning in the diffusion model. So now, the neural network needs to generate
an image with less and less noise, but has more guidance from this additional input of a class-specific, or rather text-specific, variable specifying what kind of image to generate. But the authors did not stop there, because seemingly (though the paper is not explicit about this), the text information on its own was not enough. So additionally, each, each! attention layer in the model also attends
to all the text tokens that the transformer produces when encoding the text. Cool. That should be enough to make the textual
information evident for GLIDE, right? Right? No. I mean, at least for the training procedure
it does suffice. But it is still not enough, so the authors hack the text even further into the diffusion model, which is too faithful to the image modality, and they do it at inference time. So, brace yourself, this is not training anymore. The authors tried out CLIP-guided diffusion to make the text more persuasive when it comes to image generation. The idea here is to use an extra model to make the generated image correspond better to the text. The extra model here is CLIP, because CLIP, also from OpenAI, is trained to predict a similarity score between an image and a text. So, to generate an image with CLIP-guided
diffusion, the authors first let GLIDE denoise the image, conditioned on text. But then, they further guide the process by
adding the gradient of CLIP’s image-text similarity with respect to the image. Ms. Coffee Bean, slow down. Conceptually, this takes the initial denoised image and moves it in the direction in which CLIP predicts a high image-text match.
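A conceptual sketch of that nudge, assuming PyTorch; clip_similarity here is a hypothetical stand-in for a CLIP image-text score, not the actual CLIP API, and the scale is arbitrary:

```python
import torch

def clip_guided_nudge(x_denoised: torch.Tensor, text_emb: torch.Tensor,
                      clip_similarity, scale: float = 100.0) -> torch.Tensor:
    x = x_denoised.detach().requires_grad_(True)
    score = clip_similarity(x, text_emb)        # how well does the image match the text?
    grad = torch.autograd.grad(score, x)[0]     # gradient of the match w.r.t. the image
    return (x + scale * grad).detach()          # move the image toward a better match
```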
This is like “deep dream”, if you are familiar with that, but now we dream in the direction of what CLIP thinks is a better image-text match. This is a trick to make the generations match the text better, because, as we previously said, it seems like GLIDE kind of wants to ignore the text if left on its own. Such classifier-guided (here, CLIP-guided) diffusion is one way to
make the text information more obvious to GLIDE during inference,
but there is also another way: the classifier-free guidance the authors used, and it worked better in their case. As the name already says, no extra model is
needed here. It is just a weird trick applied at each diffusion
step to emphasize the text even more. First, GLIDE produces the image twice, once with text and once without access to the text. Then we compute the difference between the diffusion step with text and the one without text; with this difference, we now know in which direction to move if we want to go from “no text” to “text”. So, if we take the text-less generation and add this difference scaled by quite a lot, the output of the model without text information is heavily extrapolated in the direction of the text information.
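In code, the extrapolation trick boils down to roughly one line; this is a sketch assuming the model’s per-step outputs are noise predictions as PyTorch tensors, and the guidance scale is a hyperparameter:

```python
import torch

def classifier_free_guidance(eps_with_text: torch.Tensor,
                             eps_without_text: torch.Tensor,
                             guidance_scale: float = 3.0) -> torch.Tensor:
    # Start from the text-less prediction and extrapolate toward the text-conditioned one.
    return eps_without_text + guidance_scale * (eps_with_text - eps_without_text)
```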
Is this a weird hack? Yes, it is! But it works. The results of GLIDE are just so convincing, look at this. GLIDE is trained on the same data as DALL-E and has almost 4 times fewer parameters than DALL-E, but nails the photorealism aspect. The authors conducted human evaluation experiments in which it is clear that the majority of raters preferred GLIDE’s generations over the blurrier and messier outputs of DALL-E. But it’s not all great news for GLIDE and
diffusion models in general: diffusion models, having to go through all the diffusion steps
sequentially (150 in GLIDE’s case), take much longer than GANs, for example. Further bad news is that OpenAI, being OpenAI, didn’t release the full GLIDE model, but a smaller GLIDE version trained on a smaller, curated dataset. And how should I break this news to you… it… it cannot produce the avocado armchair! I tried a few times, with different noise, to get different generations, but nah. No real avocado armchair. Ms. Coffee Bean is sad now. Is there anything cool YOU can produce with
the released model? Find the link in the description, if you're
interested. And let us know in the comments what you think about OpenAI not releasing the big version of the model. Thanks for watching! We’re so happy we finally got to publish
this video about GLIDE. It has been waiting in the coffee machine
for so long, but Ms. Coffee Bean had other coffees to brew first. Can’t wait to see you next time! Okay, bye.