How Stable Diffusion Works! Detailed Explanation

Captions
If you've ever wondered how Stable Diffusion and generative AI artwork works, and why it sometimes produces these totally normal, not-weird-at-all-looking hands, you're in the right spot. Let's jump right in.

If you've worked with these systems before, you're used to the idea of providing a text prompt like this one, "realistic detailed chocolate sprinkled donuts on a white plate," and then the system does some work and spits out an image that matches the prompt. But how does this actually work on the inside? What's going on behind the scenes? What the heck is Stable Diffusion?

To start off, you have to understand something known in physics and chemistry as diffusion, an idea that applies to thermodynamics and fluid dynamics. Start with something like a clear beaker of water, add a few drops of dye, and the dye diffuses throughout the liquid until it reaches a state of equilibrium. You know this has happened because when you look at the beaker, it's completely blue, or whatever color of dye you've added. With Stable Diffusion, it's almost like we're starting with that blue dyed water and trying to get back to the clear liquid we started with. That will make a little more sense in a minute.

We start by training a neural network with something called forward diffusion. You take a whole bunch of images from the internet and pass them through a neural network, and each time an image loops through, you add noise to it, the kind of static you see here, called Gaussian noise. You don't do this just once: you loop the images through over and over, each time adding a different sample of that Gaussian noise. And you don't do this to just one image; you do it to billions of images, thousands of times per image.
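As a rough sketch of that forward-noising process: in DDPM-style diffusion, the formulation Stable Diffusion builds on, the noisy image at any step can be computed in closed form. The schedule values and array sizes below are illustrative assumptions, not the exact numbers any particular model uses.

```python
import numpy as np

def forward_diffusion(x0, t, betas):
    """Add Gaussian noise to a clean image x0 to get its noisy
    version at timestep t, via the closed-form DDPM expression:
        x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps
    """
    alphas = 1.0 - betas
    alpha_bar = np.cumprod(alphas)[t]     # product of alphas up to step t
    eps = np.random.randn(*x0.shape)      # the Gaussian noise being added
    xt = np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * eps
    return xt, eps                        # eps is what the network learns to predict

# A linear noise schedule over 1000 steps (the values are illustrative).
betas = np.linspace(1e-4, 0.02, 1000)
x0 = np.random.rand(64, 64)               # stand-in for a 64x64 grayscale image
xt, eps = forward_diffusion(x0, 999, betas)  # near the last step, xt is almost pure noise
```

At the final timestep, the `alpha_bar` factor is tiny, which is why the image looks like pure static by the end of the forward process.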
Eventually, the neural network gets to a point where it can run this process in reverse: it can start with an image that's almost entirely noise, loop back through, and remove noise until it arrives not at the original image, but at an image that resembles one of the originals. But no matter how much you train it or how sophisticated the network is, you can't go straight from a completely noise-filled image to a clear image like the one in this picture.

It turns out that what's actually happening behind the scenes is that we're not training a neural network to predict an image; it's not generating an image directly at all. What we've trained is a noise-prediction neural network. At each step, if you start with an image that's pure noise and loop it through the network, the network identifies the Gaussian noise in that image, which you can see here in this first pass, and then that noise is removed from the image. Loop this through again and again, and at each step it identifies the noise and removes it, until eventually you have an image that's mostly not noise.

That doesn't paint the entire picture, though. It doesn't tell us how the process steers the image from complete noise toward something human-discernible that matches our text prompt. After all, what we typically start with is a text prompt; in this case, I said "macro close-up photo of a bee drinking water on the edge of a hot tub." That's a really specific prompt, and the generated image is really good; you wouldn't know, looking at it, that this wasn't macro photography of an actual bee. How do we get there, and what ties these words to the images the neural network is generating? It turns out there's another key element you have to pay attention to when training these networks.
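Before we get to that, the identify-and-remove loop just described can be sketched like this. `predict_noise` is only a stand-in for the trained network (in Stable Diffusion, a U-Net), and a real sampler also rescales each step, so treat this as a structural sketch, not the actual algorithm.

```python
import numpy as np

def predict_noise(xt, t):
    """Stand-in for the trained noise-prediction network.
    A real model would estimate the Gaussian noise present in xt
    at timestep t; this dummy returns a fraction of the image
    just so the loop below is runnable."""
    return 0.1 * xt

def denoise(xt, num_steps):
    """Repeatedly identify the noise in the image and remove it."""
    for t in reversed(range(num_steps)):
        eps_hat = predict_noise(xt, t)  # 1. identify the noise
        xt = xt - eps_hat               # 2. remove it
    return xt

x_start = np.random.randn(64, 64)   # begin from pure Gaussian noise
result = denoise(x_start, 50)       # after many passes, much less "noise" remains
```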
You're not just training the network on the images; you're also using the alt text associated with those images. If you've spent time on the internet, you've probably seen alt text by right-clicking an image or viewing the page source. It's usually there for search engines crawling the page, so they understand what the image is about and which keywords are associated with it, and it's also used by screen readers, so people who are visually impaired can understand the image's context. When you pair the images used to train the neural network with the text associated with them, you can start to build up a picture of how this all connects together.

On top of that, there's the concept of RLHF, which stands for reinforcement learning from human feedback, a really powerful idea that makes these diffusion models much more capable over time. To see how it works, let's look at Midjourney. We asked Midjourney to create an image of Jim Carrey wearing a blue shirt pole vaulting, and it returns not one but four images. Here's the important bit: if you click upscale on the image you think is best, it returns the result, and if you then click favorite, or download the image, that gives Midjourney a really high-quality signal that the prompt "Jim Carrey wearing a blue shirt pole vaulting" matches this image closely. When they train their next models, they can use these images paired with those prompts, creating a feedback loop that improves the models over time.
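The two kinds of training signal described above boil down to two kinds of records. A minimal sketch of the data shapes involved; every field name here is illustrative, not any real dataset's schema:

```python
# An image paired with the alt text scraped alongside it: the basic
# training example that links words to pictures.
caption_pair = {
    "image_path": "donuts.jpg",
    "alt_text": "realistic detailed chocolate sprinkled donuts on a white plate",
}

# A human-feedback record of the kind a service like Midjourney can
# collect: the user saw four candidates, upscaled one, and favorited
# or downloaded it. That choice is a high-quality prompt-image match
# signal to fold into the next round of training.
feedback_record = {
    "prompt": "Jim Carrey wearing a blue shirt pole vaulting",
    "candidates": 4,
    "chosen_index": 2,      # the image the user upscaled
    "favorited": True,      # strong positive signal for retraining
}
```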
With thousands of people using these systems and creating millions of images, you can gather enough training data to improve these models at a really high rate, and here's why that's so vital. Now that we have a neural network that's really good at removing noise from images, we need a way to steer it, and we do that through something called conditioning. The purpose of conditioning is to steer the noise predictor so that, as it removes noise, it ends up creating the image described by the prompt. We can do that because of those connections between words and images. Because these networks are trained on billions of images, they know concepts: a "macro close-up" is a very tight shot of something; a "photo of a bee" means a winged insect; "drinking water" is something it can produce because it knows the concepts of fluids and water; and "on the edge of a hot tub" draws on its idea of a hot tub and what it means to stand on the edge of one. All of these layers stack together and steer the neural network so that, when it removes noise from the image, it does so in such a way that by the end you get this beautiful macro close-up photo of a bee.

Here's a real-time example. The prompt was "a photo of a cat," and you can see the image starts as complete noise; as the process iterates, it's steered until the noise removed from the image has left behind a picture of an actual cat.

The results can be really stunning: objects you could never have in real life, as well as photorealistic objects you can't tell apart from reality, like this picture of a burger. You could use this in a food ad and nobody would know it's not a real photo.
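One standard way that steering is implemented in Stable Diffusion, though the video doesn't name it, is classifier-free guidance: the noise predictor runs once with the text embedding and once without, and the difference between the two predictions is amplified to push the image toward the prompt. A toy sketch, with placeholder predictors standing in for the real model:

```python
import numpy as np

def predict_noise(xt, text_embedding=None):
    """Placeholder noise predictor. A real model conditions its
    estimate on the text embedding via cross-attention; here the
    embedding just nudges the output so the math is visible."""
    base = 0.1 * xt
    if text_embedding is not None:
        base = base + 0.01 * text_embedding   # toy effect of conditioning
    return base

def guided_noise(xt, text_embedding, guidance_scale=7.5):
    """Classifier-free guidance: amplify the gap between the
    conditioned and unconditioned noise predictions."""
    eps_uncond = predict_noise(xt)                 # no prompt
    eps_cond = predict_noise(xt, text_embedding)   # with prompt
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)

xt = np.random.randn(64, 64)
emb = np.random.randn(64, 64)   # stand-in for a text embedding broadcast to image shape
eps = guided_noise(xt, emb)
```

The `guidance_scale` of 7.5 is a common default in open-source Stable Diffusion tooling: higher values follow the prompt more aggressively at the cost of image variety.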
That's not to say this always works out: sometimes it produces really bad images, and when it goes wrong, it can go really wrong. In these cases, because I scaled the images up, it didn't know what to do, so it just started multiplying different body parts until it produced an image that filled the entire screen space. Sometimes these get really weird, as you can see.

So what happens if you want to train your own neural network, your own checkpoint or safetensors file? You've probably heard those terms thrown around, and to understand them you need to know a little about neural networks. Here's a basic one. On the far left is the input layer, where all the training data goes in: the base images you start with. On the far right is the output, the image that hopefully matches the prompt you fed in. The middle layers are where all the magic happens, and we don't fully understand everything that goes on in there; it's really complex vector-space math, and since some of these networks can have billions, possibly even trillions, of parameters, it becomes very complex very quickly and is hard to dig through. That also makes training these networks incredibly expensive and time-consuming.

Fortunately, there's the idea of a checkpoint, and it's really cool. Think about it: if you're training one of these models and it fails midway through, you don't want to lose all that work. So every so often a checkpoint is saved; you can think of it like the auto-save in one of your document files. It creates a snapshot of all the weights inside the neural network, and those weights are really where the magic lives. You can think of every one of those green dots in the middle as a different knob, twisted and turned to a particular setting.
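A checkpoint, in the simplest terms, is just every weight written to disk so training can resume later. Here's a toy sketch using plain NumPy arrays as stand-ins for the weights; real Stable Diffusion checkpoints use formats like `.ckpt` or `.safetensors`, and the layer names and function names here are purely illustrative.

```python
import numpy as np

# Toy "weights": one array per layer, like the knobs described above.
weights = {
    "down_block": np.random.randn(64, 32),
    "up_block": np.random.randn(32, 64),
}

def save_checkpoint(weights, path):
    """Snapshot every weight to disk, like an auto-save mid-training."""
    np.savez(path, **weights)

def load_checkpoint(path):
    """Restore the snapshot so training can pick up exactly where it
    left off, e.g. to fine-tune a base model on your own photos."""
    data = np.load(path)
    return {name: data[name] for name in data.files}

save_checkpoint(weights, "checkpoint.npz")
restored = load_checkpoint("checkpoint.npz")   # identical knob settings, recovered
```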
By taking a snapshot of where all those knobs are twisted, you capture all the weights of the Stable Diffusion model at a moment in time. The really cool thing is that, because these models have that checkpoint mechanism built in, you can take any base model and start training where it left off. You can grab the base Stable Diffusion models from a site like Hugging Face, fire them up in a cloud instance on something like Colab, take some pictures of yourself, like these super awesome photos of totally the same person, load them into the model, and start training. With around 15 to 30 pictures, you can train a model enough to generate images of yourself, or of any other person, place, or thing. And these are the results: none of these are real images of me, they're all AI-generated with Stable Diffusion, but I've tricked a lot of people with them. Friends and even family members have said, "Yeah, these are great photos of you, I don't get what the point is," and they couldn't pick them out of a lineup next to my real photos.

These same techniques are now being applied to video, so we have AI-generated video too. This is a demo from Nvidia: you can see the prompt and some of the videos generated from it. We're still in the early stages of this technology, but consider that Stable Diffusion has only been around for a few months and the quality has gone from pretty bad to photorealistic; to be at this point with generative AI video already is pretty impressive.

That brings us to the ethics of all this. A couple of weeks ago, I put up some images as a joke of Elon Musk and the CEO of General Motors, Mary Barra. I posted them thinking it would be funny, because obviously these two would never be seen together.
But it caused quite a stir, and even Elon Musk himself responded, saying, "Also, I'd never wear that outfit." A lot of news sources picked it up: it was on the front page of Snopes and CNN Business, and even Reuters reporters reached out to me. This is where things get weird, because we've got elections coming up, those photos had over 30 million views, and most people couldn't tell they weren't real.

So what happens in a world where we can't trust the images, the videos, and even the voices we encounter online? Well, we're already there. Just a few days ago, a "Drake" song dropped and went viral, with tens of millions of listens, and Drake had nothing to do with it; it was generated entirely with an AI model.

I'm not a pessimist about this technology. I think it's really powerful and it's going to be world-changing. In a few years, I think we'll have generative TV shows and even generative blockbuster movies, where you can insert yourself, or your friends and family, into the movie, or have it write a story on the fly for you. But we're also going to have a lot of disinformation and a lot of media mistrust, so we all have to be careful and diligent about how this technology is used and how we apply the future of artificial intelligence to the world around us. My hope is that it actually brings us closer together, and that instead of relying on what we see online, we get more interactive with real humans: we go out, we talk to people, we have discussions and debates, and we do it in person, because that's something we can trust. At least, unless we're in a simulation, at which point, I don't know, this all goes out the window.

Let me know what you think in the comments below. I'm Brian Lovett, this is All Your Tech AI, and I'm looking forward to seeing you next time. Thank you all so much; see you.
Info
Channel: All Your Tech AI
Views: 10,508
Keywords: stable diffusion, deep learning, machine learning, ai art, what is latent diffusion, diffusion model, how to use stable diffusion, artificial intelligence course, how does stable diffusion work, how does stable diffusion training work, how does stable diffusion generate images, how does stable diffusion learn, how does stable diffusion, how does stable diffusion work ai
Id: KVaJKzr4a8c
Length: 12min 10sec (730 seconds)
Published: Tue May 09 2023