How Stable Diffusion Works (AI Image Generation)

Captions
We live in a world where artists are losing their jobs because you can generate whatever piece of art you want, from a simple text prompt, within a few seconds, and it looks incredibly good. More than that, you can generate an image of anything, even things that don't exist in real life, just by using the right descriptions. What happened? Just last week I spent two hours trying to connect to a wireless printer. How did computers get here? This video will be highly technical and will try to explain how Stable Diffusion works, which is currently the best method of image generation that we have, beating out older technology like generative adversarial networks, or GANs. Now, you've seen the length of the video; it's a long video. But if you go search up other machine learning videos online, this will probably still be the least technical out of all of them, because I've tried to cut out all the math to make everything conceptually easier to understand while keeping the information mostly accurate. Look, I know you're passionate and curious about this technology, so I want you to try to pay attention, even though, if I'm going to be honest, you probably won't understand a lot of it the first time you watch it through. If you can grasp the intuition and concepts well, then when you decide to look more at the math and derivations, or to pursue a career in this field, everything will be a lot easier to understand, and that's part of why I made this video. AI is the future, and you know what they say: when there's a gold rush, sell shovels. Now, a lot of people are worried about AI safety, thinking AI is going to take over the world. Me personally, I'm not too pressed, because I can't even get ChatGPT to solve a simple math problem correctly. But what I can say you should be worried about is cybersecurity, which our video partner today, NordVPN, wants you to learn about. The process of making this video involved a lot of research and developing neural networks on Google Colab, all of which uses the internet, and sometimes I'd be away from home at a public library or cafe using the free Wi-Fi. Now, you'd be surprised how easy it is to compromise these public networks, or to set up a fake network just to steal your data; this is called a man-in-the-middle attack. So to make sure my bank account information and passwords don't get stolen by a hacker, I used NordVPN to keep my internet connection securely encrypted at all times. Other than that, NordVPN also has a bunch of other features, like their Threat Protection and Dark Web Monitor features, to protect you against phishing attacks, password leaks, malware, ransomware and the like. I mainly use NordVPN for security, but sometimes I also use it to watch shows that are only available in certain other countries, or to get plane tickets for cheaper prices in other areas of the world. For example, if I want to connect to Japan's internet, I just click on Tokyo and I'm there. NordVPN is offering an exclusive deal: if you go to nordvpn.com/gonkee you can get a two-year plan with an extra month for free, with a 30-day money-back guarantee. I'll leave that link in the description and pinned comment below. Again, that's nordvpn.com/gonkee. Now let's get started.

Deep learning is all about neural networks, and of course there are many different ways that neurons can be connected to each other. The most basic type of neural network consists of what are known as fully connected layers, where every neuron in each layer is connected to every neuron in the next layer. But in this video, we'll find that the process of image generation with Stable Diffusion largely depends on two special types of network layers, each serving a very important role. Here I'm going to introduce the first one, called the convolutional layer. Pay attention, because the second type of layer will come along much later in the video, and the way that it relates to convolutional layers is kind of amazing. You see, basic fully connected layers work well for many different types of data, but not images, because images have way too many pixels. Imagine you wanted to do an operation on a 100x100 image which outputs a new 100x100 image. Even if the image was black and white and only had one channel, that's 100 x 100 = 10,000 pixels, so there are 10,000 inputs and 10,000 outputs, which means there are 10,000 x 10,000 = 100 million neuron connections, just for a 100x100 image. In addition, in a fully connected layer each input contributes equally to each output, which means the relative spatial position of each pixel is irrelevant. That doesn't really make sense, because obviously, in an image, pixels that are close to each other are more important in making up features such as an edge, compared to two random pixels that are really far away from each other. So for images, a better type of layer is the convolutional layer, where each output pixel is determined by a grid of the surrounding input pixels. This is done with a 2D grid of numbers called a kernel, usually with a size like 3x3 or 5x5, where each output pixel is computed by multiplying the surrounding input pixels by the corresponding numbers in the kernel and then adding everything up. For example, here's a vertical edge detection kernel, and here's a horizontal edge detection kernel. You should start to see why convolutions work so well for images: if we use a 5x5 kernel instead of a fully connected layer for a 100x100 image, that's only 25 parameters that we can reuse over and over again, instead of 100 million.
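To make the kernel idea concrete, here's a minimal sketch (not from the video) of a 2D convolution in plain NumPy, using a vertical edge detection kernel like the one mentioned above; the function and variable names are just for illustration.

```python
import numpy as np

def convolve2d(image, kernel):
    """Valid 2D convolution (really cross-correlation, as in most ML libraries)."""
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # Multiply the kernel with the patch of surrounding pixels and sum everything up.
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# A vertical edge detection kernel (Sobel-style): responds to left/right contrast.
vertical_edge_kernel = np.array([[-1, 0, 1],
                                 [-2, 0, 2],
                                 [-1, 0, 1]])

image = np.random.rand(100, 100)   # stand-in for a 100x100 grayscale image
edges = convolve2d(image, vertical_edge_kernel)
print(edges.shape)                 # (98, 98) -- computed from only 9 reusable kernel parameters
```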
So now that you understand how convolutions work, it's time to talk about their significance to computer vision, which is basically the field of identifying what's in an image. Level one of computer vision is simple image classification, where the network just labels what is in an image. We have to assume there's only one object in the image, and we don't know where exactly it is, but we know what it is. Level two is classification with localization: we can still only have one object, but the network also gives us a bounding box which tells us where it is in the image. Level three is object detection: now the image can have multiple objects, and we get multiple bounding boxes and labels around each of them, but it's still a very rough estimate of which pixels belong to each object, because the bounding boxes are just rectangles. It's at level four, semantic segmentation, that each pixel in the image gets labeled for what it is. Now we can get the exact shape of whatever we want to identify in the image, which is good for things like background removal. Level five is instance segmentation, where not only does the program classify what thing each pixel belongs to, it can also identify multiple instances of that thing, like if there are multiple people in a picture.
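As a rough way to keep the five levels straight, here's a sketch (my own illustration, with made-up sizes and values) of what each task's output typically looks like for one input image:

```python
import numpy as np

H, W, NUM_CLASSES = 256, 256, 10   # hypothetical image size and class count

# Level 1 - classification: one score per class for the whole image.
class_scores = np.zeros(NUM_CLASSES)

# Level 2 - classification + localization: one label plus one box (x, y, w, h).
label, box = 3, np.array([40, 60, 120, 80])

# Level 3 - object detection: a variable-length list of (box, label) pairs.
detections = [(np.array([40, 60, 120, 80]), 3), (np.array([10, 10, 50, 70]), 7)]

# Level 4 - semantic segmentation: a class label for every pixel.
semantic_mask = np.zeros((H, W), dtype=np.int64)

# Level 5 - instance segmentation: one binary mask per object instance.
instance_masks = [np.zeros((H, W), dtype=bool) for _ in range(2)]
```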
The invention of Stable Diffusion starts with level four, semantic segmentation, and specifically for biomedical images. So we're talking about images of cells, neurons, blood vessels, organs and whatnot. This helps with, uh, diagnosing diseases, researching anatomy, stuff like that. Okay, I'm going to be honest: it's not that important why biomedical image segmentation matters. All you need to know is that people were trying to segment images of cells. And if you're thinking, what on Earth does this have to do with image generation? Well, I promise you it's going to make sense in a bit, and that's when the genius comes in. For a while, image segmentation was inefficient and required thousands and thousands of training samples, and for biomedical image tasks there often weren't enough images. At least, that was the case until 2015, when a group of computer scientists submitted a paper proposing a new network architecture, which would go on to be cited over 60,000 times. This is definitely one of the more influential breakthroughs in machine learning. Let's talk about the U-Net. A U-Net is full of convolutional layers in order to do semantic segmentation on cell images, but it's kind of weird in how it does it, because it first scales the image down to a really low resolution and then scales it back up to its original resolution. That sounds counterintuitive at first, but it's kind of genius, and I'm going to demonstrate with a U-Net that I wrote myself. I wrote this U-Net for a fish dataset on Kaggle, from which I acquired 500 images of different fish at a fish market, along with their corresponding black-and-white masks of which pixels in the image make up the fish. If you don't know, Kaggle is a website dedicated to data science and machine learning, and I'll leave a link to the dataset in the description. The black-and-white masks are what's known as the ground truth, because that's what you compare the network's outputs against and train the network to try to achieve. At first, the U-Net's output when you give it this fish is not so meaningful, but after a bit of training it's able to identify the shape a lot better. The colors look like this because the values are outside the range of 0 to 1, so the image is rendered in what's known as pseudocolor, but if we clamp it to between 0 and 1, the final output is essentially the same as the provided masks. Now that we have a trained network, it's time to open it up, see what's inside, and figure out: how does a U-Net segment images so efficiently? Remember, prior methods required thousands of sample images, but I've only given this one 500 images and it's doing pretty well. When this RGB image of a fish gets input into the U-Net, it's represented in computer memory as a 3D grid of numbers, because it has a width, a height and three channels; in machine learning language, this is a three-dimensional tensor. Now, at the start, the image only has three channels, representing redness, greenness and blueness. But what if it could have more channels to represent more information, like which part of the image corresponds to the body of the fish, which part is the cutting board, which part is the shadow, which part is the highlights, and so on? That's essentially the whole point of convolutions: to extract features from an image based on how the pixels relate to each other. And what makes convolutions even more powerful is when the image has more than one channel, because then the kernel is a 3D grid instead of just a 2D one. The first half of the U-Net has convolutional blocks that take the number of channels in the image from 3 to 64 to 128 to 256 to 512 and finally to 1,024. In the convolution from 64 to 128 channels, for example, each kernel is 64 layers deep, and there are 128 of those kernels. That's how the network can extract more and more complex features from the image.
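As a rough sketch of what that first, contracting half looks like in code (my own simplified PyTorch version, not the author's notebook), each stage is a pair of convolutions followed by a downscaling step, with the channel counts 3 → 64 → 128 → 256 → 512 → 1,024 mentioned above:

```python
import torch
import torch.nn as nn

class DoubleConv(nn.Module):
    """Two 3x3 convolutions, each followed by a ReLU (a simplified U-Net block)."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1), nn.ReLU(),
        )
    def forward(self, x):
        return self.block(x)

class UNetEncoder(nn.Module):
    """Contracting half: channels go 3 -> 64 -> 128 -> 256 -> 512 -> 1024
    while the resolution is halved after every block."""
    def __init__(self):
        super().__init__()
        chs = [3, 64, 128, 256, 512, 1024]
        self.blocks = nn.ModuleList([DoubleConv(chs[i], chs[i + 1]) for i in range(5)])
        self.pool = nn.MaxPool2d(2)

    def forward(self, x):
        features = []            # keep a copy at each resolution (used again in the second half)
        for block in self.blocks[:-1]:
            x = block(x)
            features.append(x)
            x = self.pool(x)     # halve the resolution before the next block
        x = self.blocks[-1](x)   # bottleneck at the lowest resolution
        return x, features

x = torch.randn(1, 3, 64, 64)            # a dummy RGB image
bottleneck, skips = UNetEncoder()(x)
print(bottleneck.shape)                  # torch.Size([1, 1024, 4, 4])
```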
There's a slight issue, though: even though the kernels get deeper and deeper, they still have a fixed field of view on the image, in this case 3x3. In order to better extract features, the kernels obviously need to see more of the image, so how can we make the field of view bigger? Well, just making the kernels bigger rapidly increases the number of parameters, which makes it inefficient. So the U-Net uses a really smart and efficient alternative: if we can't make the kernels bigger, then just make the image smaller. After every two convolutional blocks, the image gets scaled down before it goes into the next two convolutional blocks. This increased field of view is how the network can capture more context within the image and better understand it. So let's see what our fish has turned into in the middle of the U-Net, where the number of channels is largest but the resolution is smallest. Out of the 1,024 channels, we can see that some of them highlight the body of the fish, some of them the background, some highlight the brighter area above the fish, and some the darker area below, just as we said before. At this point the network has learned all the information about what is in the image, but the downscaling has made it lose information about where things are in the image. So in the second half of the U-Net, we start scaling the image back up again and decreasing the number of channels, using these convolutional blocks to consolidate and summarize all the information gathered in the first half. But how do we get back all the detail lost to the downsampling? The answer is what's known as skip connections: every time the resolution is increased, the feature maps from the last time the image was at that resolution are literally just slapped onto the back and combined with it, and then the convolutional layers mix the information back in. If we compare the fish image at its highest resolution at the beginning to its highest resolution at the end, we can see that the different parts of the image are much better segmented, and that's how, through one final convolution, we get this very clean mask. Yeah, so U-Nets are really good at segmenting images. There was an international image segmentation competition where the people who invented the U-Net just went in there and demolished everyone; here's them getting the award for it. What a bunch of nerds, to be honest. No, I'm just kidding, I mean that in an endearing way. But anyway, okay, when are we actually going to get to the image generation? We're getting there, listen up. The U-Net is so good at identifying things within an image that people started using it for other stuff besides semantic segmentation. Specifically, it can be used to denoise an image. If a noisy image is just the sum of the original image plus some noise, then if you can identify the noise in the image, you can just subtract it away to get the original. In fact, that's exactly what we're going to try to do, so allow me to demonstrate with another image of a fish, this time at a resolution of 64x64. This time there is no black-and-white ground truth mask to go with it; instead, we generate a bunch of noise to be our ground truth, because that's what we're training the network to identify. It's important that during training we train on many copies of the image with different amounts of noise added in, so that the network can denoise really noisy images as well as not-so-noisy ones.
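Here's a minimal sketch of that training objective (my own simplification in PyTorch; `unet` stands for any noise-predicting network, and the linear mixing schedule and its signature are assumptions, not the video's exact setup): add a random amount of noise to the image, ask the network to predict the noise, and penalize the difference.

```python
import torch
import torch.nn.functional as F

def training_step(unet, image, optimizer, num_steps=1000):
    """One denoising training step on a batch of images."""
    t = torch.randint(0, num_steps, (1,))        # random noise level for this sample
    alpha = 1.0 - t.float() / num_steps          # crude schedule: how much of the image survives
    noise = torch.randn_like(image)              # the ground truth we want the network to predict
    noisy_image = alpha * image + (1 - alpha) * noise

    predicted_noise = unet(noisy_image, t)       # the network is also told the noise level t
    loss = F.mse_loss(predicted_noise, noise)    # compare the prediction with the actual noise

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```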
And here's where an interesting challenge arises: how do we provide the network with the knowledge of how noisy each image sample is? Because that's obviously going to affect the outcome. If you imagine all the possible noise levels placed in a sequence, then the information about how noisy a sample is is basically a number: that sample's position in the sequence. This is called positional encoding. Let's say that for this particular image, its noise level corresponds to the 10th position in the sequence. Now, we've got a 64x64 image with three channels, meaning there are 12,288 numbers in total. Do we just slap a 10 on the end, making it 12,289 numbers? Is that going to work? No. So here's how positional encoding works. And I get it, you might be thinking, okay, this seems like not such a significant detail, why do we need to go through it? This might be like the fifth time I've said this, but it's going to come up again later; it's going to be important. Positional encoding is a type of embedding, which is when you take discrete variables, like words (hint for later on) or, in this case, positions in a sequence, and turn them into a vector of continuous numbers to feed to the network as a more digestible form of information. The way our 10 gets converted into a vector of continuous numbers is with these sine and cosine equations here, so that the vector always stays within a fixed range, but each position is encoded by a unique combination of numbers, since the different elements of the vector are given by sine and cosine functions of different frequencies. This embedding then gets added onto the image data repeatedly, at every point in the U-Net where the resolution changes, to really drill in the information of how much noise is in the image and help the network get it right. Okay, that was an information overload, but we can finally start the training process. You can see that at first, the noise that the network predicts is obviously off; it looks nothing like the actual noise we gave it. After a while it looks pretty similar to the ground truth noise and we can't really notice any improvements anymore, so let's instead show the denoised version, where we subtract this prediction to get an image, and see how that improves. So, yes, as you can see, we do end up with the original fish image, but it is kind of low-quality and blurry. This is because trying to go from pure noise to the original image in a single step is too hard. So instead, we don't get rid of all the noise at once: we only get rid of some of it, then feed the result back into the network and get rid of a little bit more, and then again, and again. As you can see, removing the noise in small baby steps like this eventually gives us a clear, high-quality original image, and this is why we had to feed the network images with varying degrees of noise during training, because that's what makes the whole denoising process work. Now, it's obvious that the network is going to give us this same fish every time, because we only trained it on one image. So allow me to demonstrate what happens when I take 5,000 32x32 images of ships from the famous CIFAR-10 dataset. CIFAR-10 has 10 classes corresponding to 10 different objects, each with thousands of 32x32 images. I'm not going to lie: at first I accidentally put all 10 classes of images into the network, so it was going 10 times slower than it needed to, and I stopped it early.
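For reference, the sine and cosine equations the video points at are the standard sinusoidal positional encoding; here's a small sketch of how a noise-level index like 10 becomes a vector (the embedding dimension of 128 is my own choice):

```python
import math
import torch

def sinusoidal_embedding(position, dim=128):
    """Encode an integer position (e.g. a noise level) as a vector of sines and cosines
    of different frequencies; every position gets a unique, bounded vector."""
    half = dim // 2
    freqs = torch.exp(-math.log(10000.0) * torch.arange(half) / half)  # decreasing frequencies
    angles = position * freqs
    return torch.cat([torch.sin(angles), torch.cos(angles)])

emb = sinusoidal_embedding(torch.tensor(10.0))         # the "10th position" from the example
print(emb.shape, emb.min().item(), emb.max().item())   # 128 values, all within [-1, 1]
```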
But then I looked at some of the results, and while most of it was nonsense, here is what I believe to be a red panda wearing sunglasses and using a green tent as a turtle shell, and this, I think, is an orange boat wearing ice-skating shoes with a mohawk. I don't know how this happens; maybe that's the magic of AI. Anyway, I started training it again on just the ship images, and at first it was giving us rubbish, but eventually we can see that it comes up with completely new images of ships using the knowledge it has gathered. And that, my friend, is called a diffusion model. Now, maybe it's come to your attention that what we have so far is not very efficient. I mean, it took forever to train on these 32x32 images; imagine how long it would take for an HD or 4K image. The reason is that we're doing the noise prediction and the denoising directly on the pixels, and there are a lot of pixels, meaning a lot of data. So let's think about whether there's a way to reduce the amount of data we have to work with and speed up this process. Imagine this: I show you this image right here, and you have to tell your friend what's in it. Are you going to read out the values of all the individual pixels to transfer the information over? No, you'll give them a description: a blue sofa in a white room, with a cactus to its right and a coffee table in front. Then they can use their life experience and knowledge of different objects to imagine and reconstruct roughly what it's supposed to look like. It won't look exactly the same, but it will be good enough. Let's use another example, more relevant to computers. Usually we don't store images as their raw, uncompressed pixel values; instead we use a file format like JPEG, which can reduce the amount of data many times over. When the file gets decoded to display on your screen, it's a bit lower quality than the original, but again, it's good enough. Notice how in both of these examples there's a process of encoding, which is you coming up with a phrase to describe the image, or the JPEG compression, and a process of decoding, which is your friend imagining what the image is supposed to look like, or the JPEG decompression. So what people invented is a neural network equivalent of this, known as autoencoders, which are trained to encode data into what's known as a latent space and then decode it, as best they can, back to the original data. Here's a demo of a latent space trained on the MNIST digit dataset. In this case, the 28x28 = 784 pixels got encoded into just two numbers, which means we can visualize it as a two-dimensional space and drag this point around to see what the different areas correspond to when decoded. Of course, with five or ten numbers in the latent space instead of two, you can get higher-fidelity reconstructions. In Stable Diffusion, 512x512 RGB images, corresponding to about 786,000 numbers, are encoded into a latent space of 4x64x64 = 16,384 numbers; that's about a 48th of the original amount of data. So instead of directly adding noise to images in their pixel space and then denoising those images, the images are first encoded into this latent space, we noise and denoise that, and when it's decoded we roughly get the original image again. This is called a latent diffusion model, and it's one of the key improvements over the basic diffusion model, because it's so many times faster than running the denoising on the raw, uncompressed data.
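Here's a bare-bones sketch of the autoencoder idea in PyTorch (mine, not Stable Diffusion's actual VAE), compressing an MNIST-sized image down to a two-number latent and decoding it back:

```python
import torch
import torch.nn as nn

class TinyAutoencoder(nn.Module):
    """Compress a 28x28 image to a 2-dimensional latent vector and reconstruct it."""
    def __init__(self, latent_dim=2):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Flatten(), nn.Linear(28 * 28, 256), nn.ReLU(), nn.Linear(256, latent_dim)
        )
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 256), nn.ReLU(), nn.Linear(256, 28 * 28), nn.Sigmoid()
        )

    def forward(self, x):
        z = self.encoder(x)                          # the latent code (the "description")
        recon = self.decoder(z).view(-1, 1, 28, 28)  # the "good enough" reconstruction
        return recon, z

# Training would minimize the reconstruction error, e.g. MSE(recon, x).
# In a latent diffusion model, the noising/denoising from before happens on z
# (or on a small latent grid like 4x64x64), not on the raw pixels.
model = TinyAutoencoder()
x = torch.rand(1, 1, 28, 28)
recon, z = model(x)
print(z.shape, recon.shape)   # torch.Size([1, 2]) torch.Size([1, 1, 28, 28])
```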
Now, up until this point, we still haven't addressed something very important: how do we make it generate images based on a text prompt? Not going to lie, I think this might be where it gets really hard to understand, so if you don't get anything from here on out, don't stress about it, because if you made it this far into the video, that's already pretty impressive. Anyway, here's where we use that embedding concept from earlier. All these words, which are discrete variables, have to be encoded into vectors, just like the sequence-position numbers representing how noisy the images are. The way people found good embeddings for words is a method called word2vec. I won't go into too much detail, but basically, they had a list of vectors, one for each word in the English language, and actually they had two of these lists. Then they used data from all the text ever written by humans, from books and the internet and whatnot, to adjust these two lists of word vectors so that the vector of a word in one list would be similar to the vectors of the words it often appears next to in the other list, where 'similar' means it has a larger dot product. For example, the words 'tall' and 'building' are more likely to appear together than the words 'tall' and 'electricity', so the vector for 'tall' in one list ends up similar to the vector for 'building' in the other list, but not similar to the vector for 'electricity' in the other list. Once it's trained enough, the relationship between word vectors in opposite lists is that the more likely two words are to appear next to each other, the more similar their vectors are. But what does this mean for the relationships between words in the same list? In the same list, the more likely two words are to appear in similar contexts, the more similar their vectors are. So if we just take one of the lists as the embedding vectors for all the words and graph it out, we'll find that words used in similar contexts are grouped closer together.
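A toy sketch of that "two lists scored by dot products" idea (the words, shapes and random vectors are made up; real word2vec training is more involved):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["tall", "building", "electricity"]
dim = 300                                              # word2vec-style embedding size

# The two lists from the explanation: one vector per word in each.
center_vectors = rng.normal(size=(len(vocab), dim))    # used when a word is the center word
context_vectors = rng.normal(size=(len(vocab), dim))   # used when a word is a neighbour

def score(center_word, context_word):
    """Higher dot product = the model thinks these two words co-occur more often."""
    c = center_vectors[vocab.index(center_word)]
    o = context_vectors[vocab.index(context_word)]
    return float(np.dot(c, o))

# Training nudges the vectors so that, after enough text has been seen,
# score("tall", "building") ends up larger than score("tall", "electricity").
print(score("tall", "building"), score("tall", "electricity"))
```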
Here's a cool visualization on projector.tensorflow.org where, if you click a dot representing a word, it shows you the closest words around it. It might look as though they're kind of spread out and not really the closest words, but that's because this only visualizes three dimensions, while the word vectors actually have 300 dimensions, meaning each word is represented by 300 numbers. With all those dimensions, it turns out these word embeddings can capture some of the more nuanced relationships between words. The most famous example: if you take the vector for 'king', subtract the vector for 'man' and then add the vector for 'woman', you end up with the vector for 'queen'. Another example: take 'London', subtract 'England', add 'Japan', and you end up with 'Tokyo'. So I hope you can see how genius this word embedding vector space is. Now, remember how earlier I said there are two types of network layers that are really important to Stable Diffusion, and the first one is the convolutional layer? Well, it's time to introduce the second one, which is called the self-attention layer. Let's think back to convolutions for a second: convolutional layers extract features from an image using relationships between pixels, where the amount that each pixel influences another depends on their relative spatial position. A self-attention layer extracts features from a phrase using the relationships between the words, where the amount of influence words have on each other is determined by their embedding vectors. To build up the simplest possible self-attention layer: it's kind of like a fully connected layer, but each input and output is a vector instead of a single number, and the weights of the connections are not parameters that the network learns; rather, the weight of the connection between A and B is determined by the dot product between A and B. So in this simplest model, the output is entirely dependent on the input, since there are no parameters we can control. But we do want to control it. It's like having a convolution whose kernel is all the same number: sure, it helps us understand how a convolution works, but all it does is blur the image, so it's not that useful. So how can we control this self-attention layer so that, just as we can make a convolution detect edges, we can make it detect, I don't know, words that negate or emphasize certain adjectives? Let's break the attention process down into its components. In our simple attention layer, the amount that A's input influences B's output is determined by the dot product, and the amount that B's input influences A's output is also determined by the dot product, so it's the same. Now let's focus on the part where A influences B. I'm going to characterize this process as a conversation between A and B, which sounds goofy, but I promise it'll make sense. B goes up to A and says, 'Hello, I'm B, here is my ID. Show me your ID so that we can compare them and decide how much you influence my output.' And then A says, 'Yeah, I'm A, here's my ID, and here's my data, which I'll pass over to your output after the comparison.' B's ID is called the query vector, A's ID is called the key vector, the comparison is the dot product between them, and A's data is called the value vector. There, I just explained query, key and value in self-attention. In our simple self-attention, of course, the query is just vector B itself, and both the key and the value are just vector A. In other words, each vector has to serve a total of three purposes over the whole process, even though it's the same vector the whole time.
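Here's that simplest, parameter-free self-attention as a few lines of code (my own sketch, with a softmax added to keep the weights well-behaved, which the video doesn't spell out): every output vector is a weighted mix of all the input vectors, with weights coming straight from dot products.

```python
import torch

def simple_self_attention(x):
    """x: (seq_len, dim) word vectors. No learned parameters: each vector acts as
    its own query, key and value, and the weights come straight from dot products."""
    scores = x @ x.T                          # dot product between every pair of vectors
    weights = torch.softmax(scores, dim=-1)   # how much each input influences each output
    return weights @ x                        # each output is a weighted sum of the inputs

words = torch.randn(5, 300)                   # 5 words, 300-dimensional embeddings
out = simple_self_attention(words)
print(out.shape)                              # torch.Size([5, 300])
```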
So here's how we introduce parameters to control this whole self-attention process. How do we manipulate vectors? Using matrices. We have three matrices, called, can you guess? That's right: the query matrix, the key matrix and the value matrix, and they're applied to each vector before it goes off to serve its purpose as a query, key or value. Now the amount that A influences B is no longer the same as the amount that B influences A, and it's no longer only similar vectors that influence each other, as with the plain dot product. Now we can use any feature in each word's massive 300-dimensional vector, which is what makes self-attention layers so powerful at extracting features from the relationships between words. One more thing, though: right now the output is still not affected by the relative position of each word in a phrase, and that's really important in determining the meaning of the phrase. So in order to encode the position of each word, we once again use the positional encoding method covered earlier and just add those positional embedding vectors onto the word embedding vectors. Wow, that was another information overload. Need a break? Okay, let's take a break. All right, let's resume. Think about this: using convolutional layers, we can encode an image into a small embedding vector, and using attention layers, we can also encode a text phrase into a small embedding vector. But look at the image and the text I've selected to display here: the caption perfectly describes the image. So imagine if the two encoders could come up with the same embedding vector, even though they're dealing with two different types of data. That's exactly what OpenAI did with their CLIP model, where CLIP stands for Contrastive Language-Image Pre-training. This CLIP model has both an image encoder and a text encoder, and they trained it on 400 million images so that images and captions that match come out with very similar embeddings, and ones that don't match come out with very different embeddings. So it kind of makes sense that the text embeddings coming out of this CLIP text encoder, which are already matched to encoded images, are perfect to stick into our denoising U-Net, which also encodes and decodes images as part of how it works, with the convolutions and scaling and all that. So yes, in Stable Diffusion we just take the text embeddings generated by CLIP and inject them into the U-Net multiple times using attention layers. Well, this time it's a slightly different type of attention. It's not self-attention, which operates on just one set of input vectors; we're adding the text information into the image, so obviously there are two sets of input data. Instead, it's a process called cross-attention, which is literally like self-attention except the image provides the queries and the text provides the keys and values. That's it. These cross-attention layers in the middle of the U-Net extract relationships between the image and the text, so that the features in the image can be influenced by the most important and relevant features in the text, and that's how we eventually train the network to generate images based on the text captions we give it. So there we have it: convolutional layers learn images, self-attention layers learn text, and when you combine the two, you can generate images based on text. Pretty interesting, isn't it?
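To tie it together, here's a sketch of a cross-attention layer in PyTorch (my own simplified version with made-up dimensions, not Stable Diffusion's actual code): the image's latent features supply the queries, and the text embeddings supply the keys and values.

```python
import torch
import torch.nn as nn

class CrossAttention(nn.Module):
    """Image features attend to text features: queries from the image,
    keys and values from the text embeddings."""
    def __init__(self, img_dim, txt_dim, attn_dim=64):
        super().__init__()
        self.to_q = nn.Linear(img_dim, attn_dim)   # query matrix (applied to image features)
        self.to_k = nn.Linear(txt_dim, attn_dim)   # key matrix   (applied to text embeddings)
        self.to_v = nn.Linear(txt_dim, attn_dim)   # value matrix (applied to text embeddings)
        self.out = nn.Linear(attn_dim, img_dim)

    def forward(self, img_tokens, txt_tokens):
        # img_tokens: (num_pixels, img_dim), txt_tokens: (num_words, txt_dim)
        q = self.to_q(img_tokens)
        k = self.to_k(txt_tokens)
        v = self.to_v(txt_tokens)
        weights = torch.softmax(q @ k.T / k.shape[-1] ** 0.5, dim=-1)  # image-to-text attention
        return img_tokens + self.out(weights @ v)  # mix text information into the image features

img_tokens = torch.randn(16 * 16, 320)   # flattened latent image features (illustrative sizes)
txt_tokens = torch.randn(77, 768)        # CLIP-style text embeddings for a prompt
layer = CrossAttention(img_dim=320, txt_dim=768)
print(layer(img_tokens, txt_tokens).shape)   # torch.Size([256, 320])
```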
Info
Channel: Gonkee
Views: 82,324
Id: sFztPP9qPRc
Length: 30min 21sec (1821 seconds)
Published: Tue Jun 27 2023