Stable Diffusion explained (in less than 10 minutes)

Video Statistics and Information

Captions
AI-powered image generators have become very popular during the past two years, but although they are so widely used, they are still a mystery to most people. In this video I will give you a deep dive on Stable Diffusion, which has become one of the most popular tools in this field. It often takes just a simple text prompt to get an amazing image, or you upload an image, add a text prompt, and get something completely new. That's what we call generative AI.

Generative AI is not just used for image generation; it's a whole set of technologies used for various purposes. First we have text-to-text tools like ChatGPT, Gemini, Claude, Mistral, Llama, or Grok. Then we have text-to-image tools like Midjourney, DALL·E, Imagen, Muse, and of course Stable Diffusion. Then there's text-to-music with MusicLM, MusicGen, Stable Audio, and Suno. And finally text-to-video with Imagen Video, Lumiere, Emu Video, Stable Video Diffusion, and the famous Sora by OpenAI.

But how does it all work? First of all, it needs a well-trained model. So in our case, we need the Mona Lisa. But in a model trained only on the Mona Lisa, all we can generate is the Mona Lisa, and that's quite boring. So we also need Albert Einstein, and we need an image description, because we want to generate images from text. But how can we generate something completely new? Well, we need more images, and even more: we need all the images we can get. In fact, Stable Diffusion 1.5 was trained on more than 2.3 billion image-text pairs. But how can they fit into a model with a size of just 2.13 GB, and how can we create totally new images?

So what does it take to train a model? We need to get 2.3 billion image-text pairs; we need to merge them into a model that's not too big; we need to map the image description to the image, or to parts of it; we need to identify similar content across all images; and we need to extract the characteristics from images in order to create new ones. The image-text pairs were scraped from the internet using the LAION-5B dataset. There are still pending legal issues, so some newer models try to avoid copyrighted images.

The next big question is: how is the training done? It's done with a method called deep learning, and this is all about neural networks. For Stable Diffusion we need two network layers. The first one is the convolutional layer (more about it later), and the second one is called the self-attention layer. May sound funny, but we'll get into it. Both of these layers are closely connected to each other.

Now some basics about neural networks, to give you a better understanding. Neural networks consist of neurons which are connected to each other across different layers; we could somehow compare it with the human brain. There's an input layer, one or more hidden layers, and an output layer. If all neurons of one layer are connected with each neuron of the next layer, it is called a fully connected neural network. There are also weights and biases: each neural connection has a weight, which determines the strength or influence of the connection, and each hidden layer contains a bias, a constant input of one that is added to the output. This ensures that there aren't any dead, meaning zero, outputs. During training, new values are inserted through the input layer, which changes the weights of the neural connections in the whole network. In easy words: you insert a bunch of numbers into the network, which adapts the values of the neural connections. That's called learning.

Now, we can see images as a grid of pixels with certain color values, so they are also just a bunch of numbers. An image with a size of 512 by 512 pixels and three color layers (red, green, blue) comes to roughly 790,000 input values. So using a fully connected network to create a new image would require more than 600 billion neural connections, and there isn't even any spatial information, as each input connects to every output. So fully connected neural networks aren't suitable for image generation, and we need something different.

So let's build a different kind of network, where only a small number of input neurons connect to each neuron of the hidden layer, so there are far fewer connections. That's called a convolutional neural network, and the connected regions are called local receptive fields. Here the weights and biases are the same for all neurons in a specific hidden layer, which means that all hidden neurons detect the same feature in an image. Convolution can extract features from images by using the relationship between adjacent pixels, so a network trained with Einstein can detect him anywhere in an image.

Next, let's see how computer vision works. It deals with the question: what is in an image? The first level is classification of an image. Level two is classification plus localization. Level three is object detection. And level four is semantic segmentation, where the position of each type of object in an image is exactly determined. Stable Diffusion uses level four.

Let's see how it's done. Let me introduce you to the U-Net. It's called U-Net because it's shaped like a U, and it consists of several encoder and decoder layers, with connections from the encoder to the decoder at each level to add data to the decoder. An image is converted to a stream of small pixel tensors, each capturing fine details of the image. These tensors are fed into the encoders, and each block detects more features in the image. At each encoding step the image is downscaled, to detect a larger part of the image with less detail. So at the end of encoding, the U-Net knows what's in the image, but not where. During decoding, the image gets upscaled again, and to consolidate the information, the data gathered by the encoder is added back in at each decoding layer. That's been a tough one.

Now let's talk about noise. In a step-by-step process, more and more noise is added to an input image until there's only noise left. To reduce the size, the image is downscaled beforehand using an autoencoder; the result lives in what's called latent space. We can say that a noisy image is the original image plus noise. So if we know the noise, we can subtract it and get back the original image. That's called denoising: the noising process is reversed, and it's also fed into the U-Net.

Now, how can we embed the image descriptions? Let's take the painting of the Mona Lisa. The words are converted into a vector, a bunch of numbers. For that we're using a common mapping called word2vec, where English words are already listed, and based on all the available text on the internet, words used in similar contexts get similar vectors. Here's an example for Einstein. And since words are vectors, you can do calculations with them, for example: king minus man plus woman equals queen. And here we have our self-attention layer. But there's an issue. Look at the phrase "the man killed the dog": it gets the same vectors as "the dog killed the man". So we also need to encode the position of the words; that's called positional encoding.

Now let's summarize: an image is encoded to a vector by the convolutional layer, and for text we do the same with the self-attention layer. What if we could use the same vector space for both of them? That's actually done by the CLIP encoder, and the vector is fed into the U-Net. And here we have our trained model. Fortunately, all of that has been done for us by Stability AI.

Now let me quickly show you how to use the model, with a simple ComfyUI workflow. We load our model; we enter a positive and a negative prompt and use a CLIP encoder to turn them into vectors; then we create an empty latent image, which is just noise (or we can load an input image instead, but then we also need a VAE encoder to bring it into latent space); then we need a KSampler for the calculations; then we connect everything, render, and use a VAE decoder to get the image. And we're done. Well, that's it. Hope you enjoyed the video, and see you in the next one!
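The parameter counts quoted above can be checked with a little back-of-envelope arithmetic. This sketch (not from the video) compares a fully connected layer on a 512×512 RGB image against a single shared 3×3 convolution kernel; the filter count of one is an illustrative assumption:

```python
# Input size: one value per pixel and color channel of a 512x512 RGB image.
inputs = 512 * 512 * 3
print(inputs)  # 786432, i.e. roughly 790,000 input values

# Fully connected: every input connects to every output of the same size.
fully_connected = inputs * inputs
print(fully_connected)  # 618475290624 -- over 600 billion connections

# Convolutional: one 3x3 kernel over 3 channels shares its weights (+1 bias)
# across the whole image, so it needs only a handful of parameters.
conv_params = 3 * 3 * 3 + 1
print(conv_params)  # 28 parameters, reused at every image position
```

The gap between 28 parameters and 600+ billion connections is exactly why the video says fully connected networks aren't suitable for images.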
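The claim that weight sharing lets a network "detect Einstein anywhere" can be illustrated with a toy convolution. The following NumPy sketch (my own illustration, with a made-up "plus" pattern standing in for a learned feature) slides one shared kernel over an image and finds the pattern at both places it occurs:

```python
import numpy as np

# A tiny 8x8 "image" containing the same pattern at two different positions.
image = np.zeros((8, 8))
pattern = np.array([[0., 1., 0.],
                    [1., 1., 1.],
                    [0., 1., 0.]])  # a little "plus" shape
image[0:3, 0:3] = pattern           # one copy top-left
image[5:8, 4:7] = pattern           # another copy bottom-right

def cross_correlate(img, kernel):
    """Slide the kernel over the image; same weights at every position."""
    kh, kw = kernel.shape
    h, w = img.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(img[i:i + kh, j:j + kw] * kernel)
    return out

response = cross_correlate(image, pattern)
# The response is maximal exactly where the pattern occurs, regardless of
# position -- that's the translation property of shared convolution weights.
peaks = np.argwhere(response == response.max())
print(peaks)  # peak locations (0, 0) and (5, 4)
```

A real convolutional layer learns its kernels instead of being handed one, but the position-independent detection works the same way.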
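The "a noisy image is the original image plus noise" idea can be sketched in a few lines. This is my own toy illustration, not the video's code: real diffusion models use a noise schedule (a `sqrt(alpha_bar)` scaling, shown below with an arbitrary `alpha_bar`), but the core point survives either way: if the noise is known, the noising step can be undone exactly.

```python
import numpy as np

rng = np.random.default_rng(42)
original = rng.uniform(0.0, 1.0, size=(4, 4))  # stand-in for a tiny image
noise = rng.normal(0.0, 1.0, size=(4, 4))

# Simple version: add noise, then subtract it again.
noisy = original + noise
recovered = noisy - noise

# Scheduled version in the DDPM style:
#   x_t = sqrt(a) * x_0 + sqrt(1 - a) * eps
alpha_bar = 0.6  # arbitrary illustrative value
x_t = np.sqrt(alpha_bar) * original + np.sqrt(1 - alpha_bar) * noise
x_0 = (x_t - np.sqrt(1 - alpha_bar) * noise) / np.sqrt(alpha_bar)

print(np.allclose(recovered, original), np.allclose(x_0, original))  # True True
```

During training the model never knows the noise; it learns to predict it from the noisy image, and generation then subtracts the predicted noise step by step.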
Info
Channel: Render Realm
Views: 5,078
Id: QdRP9pO89MY
Length: 9min 55sec (595 seconds)
Published: Fri Mar 29 2024