AI Art Explained: How AI Generates Images (Stable Diffusion, Midjourney, and DALLE)

Video Statistics and Information

Reddit Comments

Thanks for making this, and the rest of your 'illustrated' series, they are very helpful to understanding these topics. :)

👍 2 👤 u/blueSGL 📅 Jan 16 2023 🗫 replies

Thank you for the video. It is a good companion to the piece on your blog.

👍 2 👤 u/Apprehensive_Sky892 📅 Jan 19 2023 🗫 replies

Thank you :).

👍 2 👤 u/Wiskkey 📅 Jan 20 2023 🗫 replies
Captions
AI image generation is taking the world by storm. It's clearly one of the most important developments in AI over the last 12 months. The capability of these models to take a piece of text and generate a really beautiful picture is astounding. It's a huge development in how humans create images and art, and it's unlocking a lot of creativity: not only painting new images from text, but also editing images, or turning children's drawings into higher-fidelity pictures. There are a lot of creative paths people are taking with this, and we're only scratching the surface of what it can do.

If you've been wondering how these models work, this video is for you. My goal is for it to be the most accessible description of how AI image generation works, even if you're not technical. We'll look at the various components that make up the AI image generation model called Stable Diffusion, how they're trained, and how they work together. For this video I've created tens of graphics and visuals that explain the various components. You'll find all of them in the article on my blog that we'll be going over here, so if you want to go back, dig deeper, or recall one of the visuals or concepts, you should definitely go to the article. With that, let's learn about AI image generation.

This is "The Illustrated Stable Diffusion," the article I wrote explaining how this model works. It took me a long time to really understand the various pieces; it required parsing through a lot of technical content, a lot of math formulas, a lot of code. Out of all that research, this is the most gentle approach I can think of to explain how these models work. It starts very general and gets a little more technical towards the end, so depending on how deeply you want to understand these models, there is more technical depth later in the video and the article. But you don't need to be technical at all to start.

The first visual shows one of the main ways to understand these models: you give them a piece of text and they generate an image. They're trained to understand the text and generate an image as a result. This way of using them is referred to as text-to-image. They also operate in a second way called image-to-image, where you give them both an existing image and a piece of text, and they edit that image for you. So if we tell the model "a pirate ship" and hand it this image, it will work out how to edit the image so that a pirate ship appears in it, which is pretty impressive.
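Before we open the box, here is roughly what that black-box usage looks like in code: a minimal sketch using the Hugging Face diffusers library. It assumes diffusers, transformers, and torch are installed; the checkpoint id, the file names, and the exact argument names (which vary a bit across library versions) are illustrative assumptions, not something shown in the video.

```python
# Minimal sketch of the "black box" usage: text-to-image and image-to-image.
# Assumes the diffusers library and a Stable Diffusion v1 checkpoint are available.
import torch
from PIL import Image
from diffusers import StableDiffusionPipeline, StableDiffusionImg2ImgPipeline

device = "cuda" if torch.cuda.is_available() else "cpu"
model_id = "runwayml/stable-diffusion-v1-5"  # assumed checkpoint; use whatever you have

# Text-to-image: a prompt in, a picture out.
pipe = StableDiffusionPipeline.from_pretrained(model_id).to(device)
image = pipe("paradise, cosmic beach", num_inference_steps=50).images[0]
image.save("beach.png")

# Image-to-image: an existing image plus a prompt, and the model edits it.
img2img = StableDiffusionImg2ImgPipeline.from_pretrained(model_id).to(device)
init_image = Image.open("drawing.png").convert("RGB").resize((512, 512))  # hypothetical file
edited = img2img(prompt="a pirate ship", image=init_image, strength=0.75).images[0]
edited.save("pirate_ship.png")
```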
Now that we know how the model operates as a black box, let's go a step deeper into the components that make it up. At a very high level, you can think of Stable Diffusion as having two components: an image understanding (text) component and an image generation component. You see three major colors in the visual because the image generation component is itself made up of two parts, which we'll get to. What happens between these components is that the text encoder, the image understanding component, parses the text it's given and hands over a representation. If you know anything about NLP or language models, these are token embeddings: numeric representations of the input prompt. It hands those over to the image generator, which uses them to inform what kind of image it creates.

Now let's talk about the image generation component. I call its first part the image information creator. This is not a technical term; it's just something I came up with to make it easier to understand what this component does, which is to create information about the image. Then there is an image decoder that paints the final image using the information created in that previous, pink step. So these are the three major components that make up Stable Diffusion: the text understanding component, the component that creates information about the image, and the image decoder that paints the final image. This is a good way to see that Stable Diffusion is not really one model: it is multiple models, three different neural networks that are trained in different ways but, in the end, are trained so they can work together. The rest of the video and the article go into how each of them works, how each was trained, and how they were put together.

The image information creator is where a lot of the image creation happens. The image decoder is an idea that was added in the Stable Diffusion paper specifically to make generation faster. The text encoder is a language model that also has a sense of imagery. To get a little more technical: the text encoder is better called CLIP text. CLIP is a model released by OpenAI, and it actually has two models inside of it, a text encoder and an image encoder, which work together; we'll talk about how CLIP is trained later. For Stable Diffusion version 1, the text encoder's output is 77 token embeddings, so your prompt is only understood up to 77 tokens. You can think of a token as a word, though tokens are usually parts of words, so if you're dealing with Stable Diffusion 1.4 or 1.5, your prompt can't be very long: more than roughly 50 words will probably take you past 77 tokens. Each embedding represents part of the input prompt and has 768 dimensions.

Where a lot of the image creation happens is the piece we started out calling the image information creator. The actual technical term is that it's a convolutional neural network with an architecture called a U-Net, plus another component that works with it and regulates a little of how it works, called a scheduler. The image decoder is one half of a neural network called an autoencoder, which includes an encoder and a decoder; we'll talk about how it's trained at the end. Once you're generating an image, this is the interplay of the components: your input prompt goes into the text encoder, which breaks it down into tokens and encodes them as token embeddings.
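To make those numbers concrete, here is a small sketch of the text-encoder step on its own, using the CLIP text model from the transformers library. Stable Diffusion v1 uses a CLIP ViT-L/14 text encoder; the exact checkpoint name and the prompt are assumptions for illustration.

```python
# Turning a prompt into 77 token embeddings of 768 dimensions each.
import torch
from transformers import CLIPTokenizer, CLIPTextModel

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

prompt = "paradise, cosmic beach"
# Pad or truncate the prompt to the model's maximum of 77 tokens.
tokens = tokenizer(prompt, padding="max_length", max_length=77,
                   truncation=True, return_tensors="pt")
with torch.no_grad():
    token_embeddings = text_encoder(tokens.input_ids).last_hidden_state

print(token_embeddings.shape)  # torch.Size([1, 77, 768])
```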
The encoder hands those token embeddings off to the image information creator, the U-Net and its scheduler, which does a bunch of crunching, packing a lot of visual information into another array with its own dimensions. This happens over a number of steps, called diffusion steps; we'll go over exactly what that means. The end result is an array that is then passed to the image decoder, which outputs the final image. You can see the output is 3 × 512 × 512: 512 × 512 pixels, with three channels for red, green, and blue. The intermediate array, by contrast, is pure information; it can't directly be viewed as an image, although there is some work that interprets it as imagery. So now you know the major components. If you're curious how they're trained and how they work together, stick around.

Okay, so we've heard that these AI image generation models are "diffusion" models. What does that mean? Let's see how diffusion works, not in training (we'll get to training later), but in actual image generation: when a trained model generates an image, what is the process, and what is the diffusion part of it? This visual digs a little deeper into it. Our prompt goes into the text encoder, which, as we said, produces the token embeddings, and then comes the diffusion process. I call it diffusion here, but technically there are two better names for it: the denoising process, also called the reverse diffusion process. What happens is that it starts out with a random array, a random image of noise, and then packs in a lot of relevant information from whatever the model has learned, as informed by the token embeddings. If we want a picture of a beach that looks like paradise, we start with a random image that looks like noise, and over a number of steps this component packs more visual information into it, so that the final array, the final collection of numbers, holds enough information for the image decoder to paint the image of a beach.

We can visualize this process. If we pass the initial tensor (array) through the image decoder, it looks like noise; after it has been processed, it looks like an image. The image decoder is how we translate between this information space of arrays and the visual space. We mentioned that this happens in steps, and if you go to DreamStudio, which is where I do a lot of my Stable Diffusion generations, there's a slider for the number of steps. These are the denoising steps, where the image information is packed together. You have random noise, it goes through one step, then another, and with each successive step there's better and better information to help paint the final image. We can also visualize this reverse diffusion, this denoising process of removing the noise: if we take the results after steps one, two, four, and five and visualize them the same way as before, we see that we start with noise, it's still noisy at first, but then the noise starts to reveal a picture, and the picture successively gets better and more beautiful.
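For readers who want to see that denoising loop spelled out, here is a simplified sketch of what happens inside the pink component during generation, written against diffusers-style U-Net and scheduler objects. It omits details such as classifier-free guidance, and the function and variable names are illustrative, not the exact pipeline code.

```python
# A simplified reverse-diffusion (denoising) loop, diffusers-style.
import torch

def generate_latents(unet, scheduler, text_embeddings, steps=50,
                     height=512, width=512, device="cuda"):
    # Start from a random latent "image" of pure noise (4 x 64 x 64 for SD v1).
    latents = torch.randn(1, unet.config.in_channels, height // 8, width // 8,
                          device=device)
    scheduler.set_timesteps(steps)
    latents = latents * scheduler.init_noise_sigma

    for t in scheduler.timesteps:
        latent_input = scheduler.scale_model_input(latents, t)
        with torch.no_grad():
            # The U-Net predicts the noise present in the latents,
            # conditioned on the prompt's token embeddings.
            noise_pred = unet(latent_input, t,
                              encoder_hidden_states=text_embeddings).sample
        # The scheduler removes a bit of that predicted noise, producing a
        # slightly less noisy latent for the next step.
        latents = scheduler.step(noise_pred, t, latents).prev_sample
    return latents
```

The `steps` argument corresponds to the step slider mentioned above: more steps simply means more rounds of predicting and removing noise before the decoder paints the final image.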
There's another video in the article that shows this process, just the first steps up to 50, and it's quite breathtaking to look at. You can see that after each step the model improves the image a little bit and adds a little more detail. Something special happens between step two and step four: you can see how the noise reveals the imagery. It's as if the model is squinting at this green part of the noise and saying, hmm, this looks like some sea, this is probably part of a beach, and this is probably a tree, and from that the image creation happens, again informed by the input prompt. It's like the model is staring at this slice of noise with the text prompt in its memory and asking: how can this noise be translated into an image of a beach? That's how generation works.

Now that we've seen how image creation works, we can get into how the model is trained. To understand diffusion, this is the forward diffusion process: it's how we actually train these models, or at least the pink part of the model, the image information creator (the U-Net and scheduler). The best way to explain it is this: we can get a lot of images, along with their captions (which will be useful later on), from the web, and that's the data these models are trained on. We take any image from our dataset and we generate noise randomly; we have algorithms to create noise, and it's very easy to do. Then we apply that noise to the image in some amount. Say we have a few example amounts: no noise, all noise, and some levels in between. We add the noise to the image in one of those amounts, and that is a training example. We can do it again with another image, another sample of noise, and another amount, and we have another noisy image. From here we can build a dataset: the inputs are a noisy image and the amount of noise applied to it, and the target is the actual sample of noise. Because we generated the noise ourselves, we know all of these values, so this becomes a supervised learning exercise: we have our inputs, we have our outputs, and we train a model to predict the noise that was added. We have strong computer vision models and learning algorithms that can learn the relationship between this input and this output, and with enough example images the model becomes able to, let's say, paint images by removing noise. This is the dataset the image generation component is trained on.

Now let's look at what a training step is. We take an example from our dataset, a noisy image and its amount of noise, give these inputs to our U-Net, the neural network that does the computer vision, and ask it to predict the noise. In the beginning it will not do this correctly, but we have the actual noise as a label, so we can say: you predicted this, and the real answer is that.
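Here is a sketch of what one such training step might look like in plain PyTorch. The `noise_predictor` stands in for the U-Net, `noise_levels` is a hypothetical tensor of noise amounts, and the simple linear mixing of image and noise is a simplification of the real diffusion noising schedule.

```python
# One simplified forward-diffusion training step: add noise, predict it, update.
import torch
import torch.nn.functional as F

def training_step(noise_predictor, optimizer, image, noise_levels):
    # noise_levels: a 1-D tensor of possible noise amounts between 0 and 1 (assumed).
    t = torch.randint(0, len(noise_levels), (1,))
    amount = noise_levels[t]

    # Generate a sample of random noise and mix it into the image in that amount.
    noise = torch.randn_like(image)
    noisy_image = (1 - amount) * image + amount * noise

    # Ask the model to predict the noise that was added...
    predicted_noise = noise_predictor(noisy_image, t)

    # ...compare it with the actual noise (the label), and update the model.
    loss = F.mse_loss(predicted_noise, noise)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

The loss computed here is the "error" discussed next: the difference between the predicted noise and the noise that was actually added.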
From the prediction and the label we calculate the error, which is essentially the difference between them, and we update the U-Net using that error. This is the classic machine learning update rule: here is the label, here is the loss function, here is how far off you were, and we update the model so that the next time it comes across an image like this it has a better idea of what the noise looks like. That's the basic machine learning step, and this is how it's applied here.

So now we know how the U-Net is trained. Let's see how that trained U-Net paints images, because that is the reverse process. It works like this: we start with a completely noisy image. For this example, say the largest noise amount is three (really the amounts are fractions between zero and one, but I'm just using one, two, and three here). We take the completely noisy image, tell the model this is the most noise we have, and the model predicts the amount of noise that is in the image. Given the original noisy image and the predicted noise, we subtract the two and get a slightly denoised image. Effectively the model has painted something: it has painted a version of the image, but it's not the final image yet, only one step towards it. From there we say, okay, we removed noise amount three, a little bit of noise; now the noise amount is two, let's predict that noise and remove it, and the image gets clearer. Then we're at noise amount one: predict it, subtract it again, and we end up with the final image. So we start with a completely noisy image and end by generating the final image.

Now, one innovation that the Stable Diffusion paper introduces to AI image generation is making it faster by using an autoencoder. The autoencoder is a kind of model that takes an image and compresses it into a much smaller array of information, and has another component, a decoder, that decompresses (decodes) that array and tries to reconstruct the original image. This model can be trained simply by taking lots of images, compressing and then decompressing each one, and comparing the original with the final reconstruction. By applying this compression and decompression to image generation, a lot of the diffusion happens in a much smaller space: the arrays being diffused or denoised while we're generating a picture are much smaller, so the model can process them much faster. This is how Stable Diffusion compares with some earlier models that did not use this compression step: they operated on the much larger arrays of the actual images, and that made them slow. By working in the compressed latent space, Stable Diffusion is faster. Instead of generating and noising the actual images, it works on latent (compressed) versions of them: we pass an image through the encoder to get its compressed version, and the various amounts of noise are added not to the original image but to this latent, compressed version of it.
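To see the compression and decompression in action, here is a small sketch using the AutoencoderKL class from diffusers. The checkpoint name, the `vae` subfolder, and the 0.18215 latent scaling factor are the values commonly used with Stable Diffusion v1, but treat them as assumptions here; the random tensor simply stands in for a real, normalized image.

```python
# Compressing a 3 x 512 x 512 image into a 4 x 64 x 64 latent and back.
import torch
from diffusers import AutoencoderKL

vae = AutoencoderKL.from_pretrained("runwayml/stable-diffusion-v1-5", subfolder="vae")

image = torch.randn(1, 3, 512, 512)  # stand-in for a real, normalized image
with torch.no_grad():
    # Encoder: compress the image into the much smaller latent array.
    latents = vae.encode(image).latent_dist.sample() * 0.18215
    print(latents.shape)             # torch.Size([1, 4, 64, 64])

    # Decoder: reconstruct (paint) a 3 x 512 x 512 image from the latent.
    reconstructed = vae.decode(latents / 0.18215).sample
    print(reconstructed.shape)       # torch.Size([1, 3, 512, 512])
```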
This next visual is important if you've read the paper, or if you want to read it. At the top is more about how the training process is done: we take an original image, encode it, and generate training examples from the compressed latent versions of that image with different amounts of noise applied to them. When we're generating an image, we start with complete noise, denoise it for a number of steps, and then pass the final processed image-information array through the image decoder, which generates the final image. My visual is a version of the corresponding figure from the paper; as research visuals generally do, the paper's figure packs a lot of information and assumes the reader has some background on the various terms used. The one component we haven't gotten to yet is the conditioning: the conditioning is how the text, the input prompt, factors into this denoising process.

Okay, so far we've looked at how Stable Diffusion generates images, from noise to an image, but we haven't talked at all about how it uses text. Right now we only really know how to create random images. If we train this model and deploy it, we won't be able to guide it and say, okay, draw an image of a cat or a dog; it will just generate images that look like the images it was trained on. If it was trained on logos, it will create random logos; if it was trained on cats, it will generate random cats. Now we need to see how language gets involved: how do we train the model so that, first, it can "understand" the prompt (quote-unquote, because it's not complete understanding, just some kind of understanding), and second, it includes that understanding in the image generation process?

To answer that, let's look at the text encoders. Stable Diffusion has a language model, which is its text understanding component. The initial paper used BERT; the final model uses CLIP text. The language model matters a lot; I talk a lot about language models, and you can find more about GPT, BERT, Transformers, and models like them on my channel and blog. It turns out, as we learned from the Imagen paper from Google, that the language model is one of the most important components of an image creation model. There is a graph where they varied the size of the language model, and that had a major effect on how much people liked the final images. In an earlier version of the article I pointed to teams training a new, much larger CLIP model called OpenCLIP, and I anticipated that future versions of Stable Diffusion would use it; true enough, Stable Diffusion 2 uses this model.

Now, how is CLIP trained? Let's start with the kind of data it's trained on: a collection of two things, images and text that describes those images. You can think of the text as captions, although in some of the datasets, like the original one, it's the alt-text from the HTML. The article links to examples from this dataset: an image and its description. What CLIP does is embed them; embeddings are numeric representations of things. CLIP has two models: one that embeds images, representing them numerically, and another that embeds text numerically.
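As a small illustration of those two encoders working together, here is a sketch that embeds one image and a few candidate captions with a public CLIP checkpoint from the transformers library and compares the embeddings. The checkpoint name and the image file are assumptions; this shows the general CLIP idea, not the exact model inside Stable Diffusion.

```python
# Embedding an image and several captions with CLIP and comparing them.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("dog.jpg")  # hypothetical image file
captions = ["a photo of a dog", "a photo of a cat", "a pirate ship"]

inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    image_emb = model.get_image_features(pixel_values=inputs.pixel_values)
    text_emb = model.get_text_features(input_ids=inputs.input_ids,
                                       attention_mask=inputs.attention_mask)

# Cosine similarity: the caption that actually describes the image
# should end up closest to the image embedding.
similarity = torch.nn.functional.cosine_similarity(image_emb, text_emb)
print(similarity)  # highest score expected for "a photo of a dog"
```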
CLIP's goal is to make similar images and text be represented by similar vectors. Here's an example: if I have an image of a dog and a sentence that says "this is an image of a dog," the vectors representing those two should be closer to each other than to, say, the vector for an image of a cat. "Closer" here is measured using similarity metrics; it's basically a bit of linear algebra.

How CLIP is trained can be simplified like this. We've collected the dataset; we take one image and its caption, embed them both, and have the model judge whether the image embedding and the text embedding are similar or not. We know they are similar, because this caption belongs to this image; that's how we chose the example. When we start training, though, the model will say they're not similar, because an untrained model doesn't yet project them to similar places in its embedding space; it will assign them vectors that are not similar. But we know they should be, so we calculate the error, just as we did before. This is again the usual machine learning step of saying: you were wrong by this much, update your various parameters so that the next time you see an example like this you're closer to the right prediction. We keep doing this with all of our examples, and we can generate a lot of them, because in addition to pairing an image with its own caption, we can also pair it with the caption of a different image and force the model to say those are not similar. We need both positive and negative examples, and that is an idea called contrastive learning, a very important idea in machine learning (a small sketch of this objective appears below). With that, we've created a model that embeds images and text into the same space.

Now that we have CLIP, a model that ties language and imagery together in the same space, we can bring it into the picture and train it jointly with the other components we've already looked at. That looks like this: our U-Net, our noise predictor, no longer takes only two inputs; it takes a third one, the token embeddings of the text prompt, and it uses them because they contain information that helps it predict the noise. If the input is complete noise and we say "this is an image of a dog," it will try to squint at the image and work out what noise it can subtract. (There are no conscious thoughts happening inside the model, of course; this is just how we frame the problem, and the learning algorithm nudges the model towards behaving this way.)

With this, we can see the final shape of our dataset. The first column is the step, or, better, the amount of noise (this image is from a previous version of the article that I didn't update, so "amount of noise" is the better term). Then there is the noisy, compressed (latent) version of the image with the various amounts of noise applied to it, and the text prompt used for it. The label is the noise the model should predict.
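As promised above, here is a sketch of that contrastive objective: embed a batch of images and their captions, compute every pairwise similarity, and push matching pairs together while pushing mismatched pairs apart. This is a simplified, illustrative version of a CLIP-style loss, not the exact training code.

```python
# A simplified CLIP-style contrastive loss over a batch of image/caption pairs.
import torch
import torch.nn.functional as F

def contrastive_loss(image_embeddings, text_embeddings, temperature=0.07):
    # Normalize so that dot products become cosine similarities.
    image_embeddings = F.normalize(image_embeddings, dim=-1)
    text_embeddings = F.normalize(text_embeddings, dim=-1)

    # Similarity of every image with every caption in the batch.
    logits = image_embeddings @ text_embeddings.T / temperature

    # The "correct" pairing is the diagonal: image i goes with caption i;
    # every other caption in the batch serves as a negative example.
    targets = torch.arange(len(image_embeddings))
    loss_img = F.cross_entropy(logits, targets)     # match images to captions
    loss_txt = F.cross_entropy(logits.T, targets)   # match captions to images
    return (loss_img + loss_txt) / 2
```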
With this dataset we train our U-Net, our image creation component: the U-Net plus the scheduler. The final idea is about how exactly, inside the model, the text is incorporated: there is a layer of attention in between the U-Net's layers, and that is where the information from the text prompt is added. This is just a final detail I include in the article; if you want to learn more about attention, my previous blog posts talk about its evolution and about how it works in Transformers, and a small sketch of such a cross-attention layer follows below.

If you've made it here, congratulations: you now know the major ideas behind how image generation works, what kinds of datasets these models are trained on, how they're trained on them, and what their various components are. I hope you found this useful. Let me know if you have any questions or comments, or send me messages or tweets on Twitter. Thank you for watching, please subscribe and like, and I'll see you in the next video.
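As mentioned in the transcript, the text information enters the U-Net through attention. Here is a tiny illustrative sketch of such a cross-attention layer: the queries come from the image latents, while the keys and values come from the prompt's token embeddings. The dimensions and class structure are assumptions for illustration, a simplified stand-in rather than the actual Stable Diffusion implementation.

```python
# A minimal cross-attention layer: image latents attend over prompt token embeddings.
import torch
import torch.nn as nn

class CrossAttention(nn.Module):
    def __init__(self, latent_dim=320, text_dim=768):
        super().__init__()
        self.to_q = nn.Linear(latent_dim, latent_dim)  # queries from image latents
        self.to_k = nn.Linear(text_dim, latent_dim)    # keys from token embeddings
        self.to_v = nn.Linear(text_dim, latent_dim)    # values from token embeddings

    def forward(self, latents, token_embeddings):
        q = self.to_q(latents)                 # (batch, latent_positions, latent_dim)
        k = self.to_k(token_embeddings)        # (batch, 77, latent_dim)
        v = self.to_v(token_embeddings)
        scores = q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5)
        attn = scores.softmax(dim=-1)          # which prompt tokens each position attends to
        return attn @ v                        # text-informed update to the latents
```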
Info
Channel: Jay Alammar
Views: 22,558
Id: MXmacOUJUaw
Length: 28min 46sec (1726 seconds)
Published: Mon Jan 16 2023