Stable Diffusion - How to build amazing images with AI

Captions
Hello, my name is Luis Serrano and this is Serrano Academy, and in this video I'm going to tell you about stable diffusion and how it's used to generate some amazing images.

You've probably seen and played with some of the state-of-the-art image generators like Midjourney, DreamStudio, Firefly, DALL·E, and others. In these, you add a prompt as imaginative as you want, and the model gives you a pretty faithful image described by your prompt. I have been amazed by these models and I play with them a lot. For example, here is a prompt that I added recently: "a penguin captaining a pirate ship in a storm in the Caribbean; the sunset can be seen in the background behind the clouds," and the model drew this, which is pretty accurate. So now I'm asking myself the question: how is it that these models work, and how can they draw such amazing images, in particular images that are definitely not in the data set they were trained on? Because the data set may have pictures of penguins and pirates and sunsets, but the model manages to understand the sentence I gave it and then combine the right images to form the output. That is fascinating.

Now, for these models to work you need a lot of data and a lot of parameters. However, at the end of the day, the architecture consists of three neural networks that process the text, output an image, and then clean the image. Let me show you how all this works.

As I mentioned, stable diffusion uses three neural networks. The first one turns the text into numbers. Here I have the prompt about the penguin on the pirate ship in a sunset, and it gets turned into a list of numbers called a vector. It can be a really, really long list, hundreds or thousands of numbers, but it's a pretty good description of the text. This is called an embedding; I will tell you more about those in a minute. The second neural network takes these numbers and turns them into an image. So the numbers that come out of the first neural network, which are the interpretation of the text, are now turned into an image. Now, this image can be pretty rough, so we use the third neural network to refine it: it takes a rough image and turns it into a nice, crisp image of the stuff the text is describing.

So in this video I'm going to tell you about this process: the one that turns text into numbers, then numbers into a rough image, and then the rough image into a crisp image. After the high-level explanation we'll go through an example. In the example we have a very small data set of words that are going to be turned into some very simple images, and we'll do it with some very small neural networks, so you'll be able to see how it works in detail in a very, very simple setting.

Let's start with the first neural network, and that one has a lot to do with embeddings. If you've seen any of my previous videos, you know embeddings are a big deal, because I always start by talking about embeddings. Embeddings are really where the rubber meets the road, because they take the stuff that's visible to humans and turn it into something visible to computers. Things that are visible to humans are video, images, text, sounds, etc., and computers only talk in numbers. So anything that turns an image or text or anything else into numbers is an embedding, and it's really the first part of each big model, because you have to turn things into numbers for the computer to work with them.

The first thing we're going to see is a word embedding. I like to see word embeddings as ways to locate words in, for example, a plane. If I throw words into this plane, let's say I have the words apple, pear, and watermelon over here, the words car, bicycle, and truck over here, and the words dog and cat over here. What's the peculiarity of this embedding? Well, similar words get located close by: an apple is close to a pear because they're similar, a car is close to a truck because they're similar, and the same for dog and cat. That is one of the main properties of an embedding.

Now, remember that I said an embedding is a way to turn words into numbers. Where are the numbers? They are the horizontal and vertical coordinates of the plane. For example, over here apple gets sent to (5, 7), dog gets sent to (4, 1), cat gets sent to (5, 1), and so on, because to get to cat from the origin you have to take five steps to the right and one step up. This pairing is an embedding. Now, I'm not going to keep it as a list of numbers for each word; I'm actually going to have it encoded as a neural network. A neural network takes every word, processes it, and turns it into a bunch of numbers. It doesn't have to be two numbers: it could be three, it could be hundreds, it could be thousands, and those numbers are a pretty good description of the word. But for now let's think of it as two, so that we can see them visually in the plane.

So the first of our neural networks in stable diffusion is this one over here. It's called an embedding neural network, and it's the one that takes words and turns them into numbers. How do these get built? Well, it's a very complicated process, and if you'd like to learn more, check out the video on my channel called "What are Transformer models and how do they work?" In that video I talk about word2vec, which is a pretty useful way to build embeddings.

Now, we have word embeddings, but we also have image embeddings, and an image embedding is very similar: it's a way to locate images in the plane. Here I have the images of an apple, a pear, and a watermelon; here the images of a car, a bicycle, and a truck; and over here the images of a dog and a cat. Again, every image gets associated with two coordinates in the plane. In real life this would be hundreds or thousands of coordinates, but I'm going to draw just two.
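The toy word embedding above can be sketched in a few lines of code. The coordinates for apple, dog, and cat are the ones from the example ((5, 7), (4, 1), (5, 1)); the rest are made-up values chosen only to keep similar words close together.

```python
import math

# A toy word embedding: each word is located at a point in the plane.
# Coordinates for apple, dog, and cat come from the example above;
# the others are made-up values that keep similar words nearby.
embedding = {
    "apple": (5, 7), "pear": (6, 7), "watermelon": (6, 8),
    "car": (1, 4), "bicycle": (1, 5), "truck": (2, 4),
    "dog": (4, 1), "cat": (5, 1),
}

def distance(w1, w2):
    """Euclidean distance between two embedded words."""
    (x1, y1), (x2, y2) = embedding[w1], embedding[w2]
    return math.hypot(x1 - x2, y1 - y2)

# Similar words end up close together, dissimilar words far apart.
print(distance("dog", "cat"))     # 1.0
print(distance("apple", "pear"))  # 1.0
print(distance("cat", "car"))     # much larger
```

The exact numbers don't matter; what matters is that the distance between points mirrors the similarity between words.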
So what do we have? On the left we have word embeddings, and on the right we have image embeddings, and the idea is to find a way to turn the left embedding, with words, into the right embedding, with images. That way you would be able to take words and turn them into images. In this example it's pretty easy: all we have to do is send the words to the corresponding images, which, by a stroke of luck, are located in the exact same positions as the words. Of course, this is not what happens in general. If we build a text embedding and an image embedding, they have no reason to locate the words and the images in the same places. As a matter of fact, they could have completely different dimensions: the text embedding could have hundreds of coordinates and the image embedding thousands, and they have nothing to do with each other.

So what we want is to find a map: a way to associate the words on the left with their corresponding images on the right. We need a rule that takes a word on the left and sends it to its corresponding image on the right. And how do we do it? Well, you guessed it: with a neural network. We're going to train a neural network to take each word on the left and map it to its corresponding image on the right: the dog goes here, the cat goes here, and so on. Of course, neural networks have numbers as input, which are the coordinates over here, and numbers as output, which are the coordinates over here. So all we have to do is take a huge table of coordinates of words in the text embedding and map them to the coordinates of the corresponding images. In other words, we train a neural network with the input being the coordinates of the words and the output being the coordinates of the corresponding images. As I mentioned, we don't have to have two coordinates; we could have many on the left and many on the right, and they don't even need to be the same number. But the idea is that we train a neural network to map coordinates of words into coordinates of images.

Now, the real picture is a little more complicated, but not very much, because in reality text embeddings are a lot more powerful than word embeddings: they can take entire sentences, or longer pieces of text, and map them into coordinates. For example, I can have a few sentences over here that talk about science, a few over here that are titles of Beatles songs, and a few over here that are greetings, and just like in the word embeddings, similar sentences get located in similar places. Image embeddings can also be much more complicated: I don't need an apple or an orange; I can have much more complicated images that are not described by a single word but by an entire sentence or more. The idea is that I take a text embedding of longer pieces of text and send it to an image embedding of more complicated images. For example, this one has four images: a ship in the sea and an airplane in the sky, which are close to each other, a dog and a cat, and people in a park. And of course you can imagine an entire paragraph describing a very complicated image; those would belong here. Again, we're not going to be lucky enough to have the images match up with the text like here. In reality the image embedding and the text embedding could be very different, and what we need to do is train a neural network to match every single paragraph in the data set with its corresponding image in the image data set. If this neural network gets well trained, then it will be able to take a new sentence it has never seen before, for example "a penguin dressed like a clown," and draw an actual penguin dressed like a clown. This is the image I got when I put "a penguin dressed like a clown" into a stable diffusion model.

The power of this is that text embeddings actually understand what's being said beyond the words; they understand the semantics of the sentence. I like to imagine it like this. Let's say that a neural network knows how to draw a clown, because there's a clown in the data set, and it also knows how to draw a penguin, because there's a penguin in the data set. Now, when you say "a penguin dressed like a clown," maybe it's never seen one, but that's okay, because it can somehow interpolate between the clown and the penguin. If the neural network knows how to turn the sentence "a clown" into the image of a clown, and the sentence "a penguin" into the image of a penguin over here, then "a penguin dressed like a clown" is going to be somewhere in the middle, so the neural network sends it somewhere in the middle, and you get a penguin dressed like a clown. Obviously I have vastly oversimplified this; there are many more coordinates, and these neural networks can be very complicated. But that's more or less the idea: once you have embeddings that turn sentences and images into numbers, the arithmetic that happens between the numbers also turns into arithmetic between the sentences and between the images. So it's not too far-fetched to think of "a penguin dressed like a clown" as the midpoint between a penguin and a clown, in both the text and the image embedding.

Funny enough, I tried to draw a clown dressed like a penguin, and the model gave me the exact same thing: a penguin dressed like a clown. So in some ways these models are still kind of limited. When the model sees "penguin" and "clown," it still imagines a penguin dressed like a clown, because that's just a little more imaginable than a clown dressed like a penguin. Obviously these models are always improving, and soon they will be able to draw a clown dressed like a penguin, and maybe some models already can. But I just want to point out that these models, although they're amazing, still have some visible limitations.
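The "somewhere in the middle" idea can be sketched with plain vector arithmetic. This is a toy illustration, not the real model: the two embedding vectors below are made-up stand-ins for "a penguin" and "a clown."

```python
import numpy as np

# Made-up embedding vectors standing in for two prompts.
# In a real model these would have hundreds or thousands of entries.
penguin = np.array([0.9, 0.1, 0.4, 0.0])
clown   = np.array([0.1, 0.9, 0.2, 0.8])

# "A penguin dressed like a clown" lands roughly at the midpoint:
midpoint = (penguin + clown) / 2
print(midpoint)  # [0.5 0.5 0.3 0.4]

# The midpoint sits at the same distance from both endpoints,
# i.e. "somewhere in the middle" of the two prompts.
d_penguin = np.linalg.norm(midpoint - penguin)
d_clown = np.linalg.norm(midpoint - clown)
print(d_penguin, d_clown)
```

Because the map from text embedding to image embedding preserves this kind of arithmetic, the midpoint in text space lands near the midpoint in image space.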
I encourage you to play with these models, make them draw crazy things, and see what they can and can't do.

So far we've seen two neural networks, right? The first one is the embedding and the second one is the image generator. The embedding takes the sentences and turns them into vectors, lists of many numbers. Here I have four numbers represented by shades, but in reality it would be hundreds or thousands. Then we have the image generator, which takes these vectors, these lists of numbers, and turns them into images. You would think that's it, right? Take the sentences, turn them into numbers; take the numbers, turn them into images. But in reality we need one more step. Why? Because in real life the images that come out are not so nice; they're actually pretty rough. And it's understandable, because images can be pretty complex and we're sending a sentence into an image. Maybe the computer needs a little help: maybe it can figure out that it's drawing a dog here and a cat here, but it's not able to do it very well. So what do we do? We give it a booster. We add a third neural network, called the diffusion model, which takes rough images and turns them into crisp, neat images. As I said, that's another neural network, so that is the third one.

Now let me tell you a bit more about this diffusion model, the one that makes the image sharp. The diffusion model is a neural network that takes the rough images on the left and turns them into the crisp, sharp images on the right. How does it work? Like a lot of neural networks, you have to feed it the right data for it to learn. What we do is take a bunch of images that are crisp and add some noise to them: adding some random numbers here and there to the pixels turns them into something a bit noisier. Then we do the same thing again and again, until we get pretty much full noise. And then we train a neural network to go in the other direction: it takes each noisy image as input and returns the previous one in the chain. In other words, the input to the neural network is an image with some noise, and the output is the image with a little bit less noise, which is exactly the previous image, before we added the noise. So we add the noise, and then we train the neural network to remove it at each step of the process. The input is a bunch of images, and the output is a bunch of images with a little bit less noise. That is the diffusion model, and believe it or not, that is the solution. When we apply these diffusion models, and you can apply them many times, you go from the rough images on the left to the pretty sharp, good-looking images on the right.

And that is the big picture of a stable diffusion model. We have the three steps: embedding, image generation, and diffusion. The embedding takes text and turns it into vectors, which are long lists of numbers; the image generator takes these lists of numbers and turns them into rough images; and the diffusion model takes these rough images and turns them into sharp, nice images. Each one of these is a big neural network trained to do its job. So at a high level this is what we have, but now let me show you an example.

Now that you've seen the big picture of a stable diffusion model, let me show you an example. I have taken a very small set of sentences and a very small set of very simple images, and we're going to build three small neural networks that will help us build new images from sentences that don't appear in the data set. Let's begin. The setting is a city called Bantis, and in Bantis a lot of pretty cool people live.
They have a pastime, which is that they really love to play sports, but they only like sports that have balls and bats: some of them play baseball, some play cricket, and pretty much any sport that has balls and bats. We're going to create a stable diffusion model for the people of Bantis.

Now, the people of Bantis have a data set, and in this data set there are a lot of images of balls (could be baseball, could be cricket, could be anything) and a lot of images of bats. It just so happens that all the balls are in the bottom-left corner of the images, and all the bats are located diagonally, from the top left to the bottom right. That's just how people in Bantis take pictures. Bantis has good technology, as you can see: they have AI models. But the technology is a little basic. As a matter of fact, their computers only have two-by-two screens: two pixels tall by two pixels wide, and monochromatic, so everything is in black, white, and different shades of gray. Somehow, when we take a picture of a ball or a bat, it has to be displayed on these pretty rudimentary screens. How do the pictures of a ball look? They look like a black or dark gray dot in an otherwise clear screen, and the dot is always in the bottom left of the image. And how do bats look? They look like a strong diagonal in an image that is otherwise clear: two dark pixels, in the top left and bottom right, and the other two pixels white or light gray. And that's our entire data set. We only have balls and bats; there are absolutely no images of a ball and a bat together, so if we wanted one, we wouldn't know what to do. However, we can build a model that will be able to draw an image of a ball and a bat. How would it look? Well, if you put a ball and a bat together, then the diagonal is dark, the bottom-left corner is dark, and the top-right corner is light. So now I'm going to show you how to train three neural networks that, trained only with images of balls and bats, will be able to create an image of a ball and a bat together.
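The Bantis data set is small enough to write down directly. Below is a sketch in NumPy, using the convention that 1 means a fully dark pixel and 0 a white one (the array layout and variable names are mine, not from the video):

```python
import numpy as np

# 2x2 monochrome images; rows run top to bottom, columns left to right.
# 1 = dark pixel, 0 = light pixel.
ball = np.array([[0, 0],
                 [1, 0]])          # dark dot in the bottom-left corner

bat = np.array([[1, 0],
                [0, 1]])           # dark diagonal, top-left to bottom-right

# The image we WANT but that is nowhere in the data set:
# dark diagonal AND dark bottom-left, light top-right.
ball_and_bat = np.array([[1, 0],
                         [1, 1]])

# Pixel-wise, the target is just the union of the two training images.
print(np.array_equal(np.maximum(ball, bat), ball_and_bat))  # True
```

This union structure is what lets such a tiny model "combine" concepts it was never shown together.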
Let's start with the simplest of these neural networks, which is the text embedding. For the text embedding, let's look at two coordinates: the coordinate for "ball" and the coordinate for "bat." Where are the sentences going to live? The sentence "ball" has one occurrence of the word ball and zero occurrences of the word bat, so it lives at coordinates (1, 0). The sentence "bat" has zero occurrences of ball and one of bat, so it lives at coordinates (0, 1). And the sentence "ball and bat" has one occurrence of ball and one of bat, so it lives at coordinates (1, 1). Now all we have to do is build an embedding that sends these three sentences, "ball," "bat," and "ball and bat," to the corresponding vectors.

I'm going to be building small neural networks here, and they're pretty self-explanatory because they're made of vertices and edges, but if you want to know more about them, check out the video on my channel about neural networks. To build this neural network we need two inputs and two outputs: the two inputs are ball and bat, and the two outputs are the first coordinate and the second coordinate. All we need to do is connect ball to the first coordinate with an edge of weight one, connect bat to the second coordinate with an edge of weight one, and put zeros on the other two weights. So what happens? When we take the word "ball," that's a one and a zero, and it gets sent to the vector (1, 0). When we take the word "bat," that's (0, 1), and it gets sent to (0, 1). And finally, when we take "ball and bat," that's one times ball and one times bat, and it gets sent to the vector (1, 1). You may be thinking this is a pretty simple neural network; that's because the example is simple. In reality embeddings can be very complicated, but I want to show you that even something very simple can be written as a neural network.
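This counting embedding is easy to sketch in code: the function below simply counts how many times each vocabulary word appears in the sentence (a bag-of-words embedding; the function name is mine).

```python
# Toy text embedding for Bantis: one coordinate per vocabulary word.
VOCAB = ["ball", "bat"]

def embed(sentence):
    """Map a sentence to a vector of word counts, one entry per vocab word."""
    words = sentence.lower().split()
    return [words.count(w) for w in VOCAB]

print(embed("ball"))          # [1, 0]
print(embed("bat"))           # [0, 1]
print(embed("ball and bat"))  # [1, 1]
```

Words outside the vocabulary, like "and," simply contribute nothing, which matches how the example places the three sentences.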
So now we have the first neural network, the embedding; let's go for the more complicated ones. Now that we have the text embedding, let's build the image generator. We already have the text embedding on the left, and I'm putting the images of the ball and the bat there for reference, but in reality it's just a text embedding: the horizontal axis is the word ball and the vertical axis is the word bat. Now let's build a pretty optimistic image embedding on the right. Let's say we have a horizontal axis representing the image of a ball and a vertical axis representing the image of a bat. Remember that these appear in a monochromatic, pixelated image of two-by-two pixels, so the horizontal axis looks like this: at the origin you have an empty, completely white image, and as you move to the right, the bottom-left corner, the one representing the ball, gets darker and darker. The vertical axis is the image of the bat, so as you go up this axis you're drawing a bat, which, remember, is a diagonal, and that diagonal gets darker and darker. Anything close to the bottom right looks like a ball, and anything close to the top left looks like a bat. And what happens at the top right? Those are the images of a ball and a bat.

Now, as I said, this is too optimistic; this is not the image embedding we're going to have. The image embedding looks very different. As a matter of fact, the original image embedding we would have has four dimensions, because our image has four pixels, and we have one dimension per pixel. This axis over here represents the darkness of the top-left pixel; this one over here (I'm going to give them different colors for clarity) represents the top-right pixel. So anything in this plane is an image where the top-left and top-right pixels are colored in some level of grayness or blackness. That's two dimensions. If I add a third dimension (think of it as coming out of the page, in front of you), it represents the color of the bottom-left pixel, which I'm now going to draw as red. So now we have a cube, and anything floating around in this cube has colors in all the pixels except for the bottom right, and the level of intensity tells us where along each axis it's located.

Now I'm in trouble, because I ran out of dimensions: I can't really think in more than three dimensions, because humans can't see in more than three. So I'm going to make an attempt to draw the fourth dimension, and you may have to bear with me; this is how I see it. Imagine that I take this space and make many, many copies of it, let's say infinitely many, and lay them out along a fourth axis. So imagine this fourth dimension here, representing the color, or the intensity, of the bottom-right pixel, which I'm going to draw in green. Now imagine a four-dimensional world: four axes flying around in four-dimensional space, each representing the intensity of one of the four pixels. If you were to imagine how the space of images looks for a screen of hundreds or thousands of pixels, it would just be a thousand-dimensional world, where each axis, each dimension, is the intensity of one of the pixels. Of course, a good image embedding would be able to bring that many dimensions down, but let's work with these two embeddings: on the left a two-dimensional text embedding, and on the right a four-dimensional image embedding.

We know some information: we know that the word ball has to go here, because that's the image of a ball; it's a bottom-left pixel with high intensity, and the rest of the image is white. Now, where do you think the word bat would go? I encourage you to pause the video and think about it. And I'll tell you: look at the black and the green axes. These two form a plane, and I can draw the plane here; that's easy, because it's two-dimensional. In the top-right corner of this plane lies the image of the perfect bat, so the bat goes over here. Now I need you to help me with your imagination: where would "bat and ball" go? Let's forget about the blue axis, and now we have a three-dimensional space. You can think of this cube over here, formed by the black, green, and red axes, and on the opposite corner of this cube lies the bat and the ball; it's here, colored in black, white, red, and green. So I hope I've given you a decent idea of how this four-dimensional image embedding looks.

Now what we need is to find the map between the two-dimensional text embedding and the four-dimensional image embedding, and what we have to do is see where the things we know go. As you've seen before, the ball, which has coordinates (1, 0), goes to the red ball on the right, which has coordinates (0, 0, 1, 0), and the bat, which has coordinates (0, 1), goes to the bat, colored in black and green, which has coordinates (1, 0, 0, 1). If those four coordinates are not clear, they will be pretty soon, because what I'm doing is reading off the intensity of each pixel from left to right and from top to bottom. Let me show you how. On the left we have ball, with coordinates (1, 0), and bat, with coordinates (0, 1). In the image embedding, the ball has coordinates (0, 0, 1, 0), because I'm reading these numbers like a page of letters, and the bat has coordinates (1, 0, 0, 1), reading from left to right and from top to bottom. What I need is a neural network, a map, that sends (1, 0) to (0, 0, 1, 0) and (0, 1) to (1, 0, 0, 1). How do I build this neural network? The input has two nodes, because the embedding on the left has dimension two (the vectors have length two), and because the embedding on the right has dimension four (the vectors have length four), I have four outputs, one for each of the four colors.
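Reading the pixels "like a page of letters" is just flattening the 2x2 image row by row. A quick sketch, with the same 0/1 pixel convention as before (variable names mine):

```python
import numpy as np

# 2x2 images: rows top to bottom, columns left to right; 1 = dark pixel.
ball = np.array([[0, 0],
                 [1, 0]])
bat = np.array([[1, 0],
                [0, 1]])

# Reading pixels left-to-right, top-to-bottom = flattening row by row.
print(ball.flatten())  # [0 0 1 0]  -> the ball's image-embedding coordinates
print(bat.flatten())   # [1 0 0 1]  -> the bat's image-embedding coordinates
```

NumPy's default row-major flattening happens to match exactly the left-to-right, top-to-bottom reading order used in the example.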
Remember that the top-right axis was blue, but I'm actually going to think of it as white, because that coordinate is going to be zero all the time; we really just care about the first, third, and fourth numbers. Now, I want (1, 0) to go to (0, 0, 1, 0), so I need this edge to be one. Why? Because when I take ball, that is a (1, 0), and if I want this one to pass to the right, I multiply it by the value of the edge, which is one, and I get a one here. I have no edges elsewhere (they're all weight zero), so everything else becomes zero, and the image of ball, which is (1, 0), becomes (0, 0, 1, 0), which is precisely the image of a ball in the embedding. So this really simple neural network, where I only have one edge, maps the ball in the text embedding to the ball in the image embedding.

Now let's do the same thing for bat. What do I need for the bat on the left to go to a bat on the right? Well, if I take this edge to be one and this edge to be one, that works. Check it out: the coordinates for bat are (0, 1), so I put the (0, 1) here. The zero on top becomes a zero here, because it gets multiplied by the weight of the red edge, and the one on the bottom becomes two ones on the right, because it gets multiplied by the weights of the black and green edges. There are no edges going into the other node, so that node gets a zero. Therefore bat, which is the vector (0, 1), goes to the vector (1, 0, 0, 1), and that's precisely the image of a bat. So this neural network does the job, and all I need now is to add zeros for all the other edges. Now I have my neural network that takes the ball in the text and sends it to the ball in the image, and the bat in the text and sends it to the bat in the image.

And here's where some magic happens. There's no image of a ball and a bat anywhere, but in the text embedding I have "ball and bat" as the vector (1, 1). What happens if I put the vector (1, 1) through this neural network? Well, there's a one going here, multiplied by the weight of the red edge, which is one, and the bottom one becomes these two ones, multiplied by the weights of the black and green edges, and everything going into the other entry is zero, because those edges have weight zero. So what did I get? I got (1, 0, 1, 1), which is precisely the image of a ball and a bat. This may look simple, but check out what happened: we trained a neural network to draw a ball, then we trained the same neural network to draw a bat, and we never told it what a ball and a bat together is. Yet from the image of the ball and the image of the bat, the neural network was able to draw the image of a ball and a bat. I know that looks simple, but that's where the magic lies: you can have a data set with certain images, and if you train a neural network well, it will be able to combine these images into other images that are not in the data set. That, to me, is the key to stable diffusion.

So this works well as an image generator. However, there's something I'm forgetting. What am I forgetting? Well, neural networks normally have a sigmoid function. The neural network we built over here was purely linear, and that's not going to be the case all the time, so let's go back to reality and add a sigmoid function. What's a sigmoid function? It appears in the neural networks video I've linked in the comments, but let me give you a quick refresher. The formula is this; do not worry about the formula, as I always like to think of a mental picture instead. It's just a function that takes all the numbers, negative and positive, and sends them into the interval (0, 1). Here's the graph of the function: if you have a very large number, it sends it to something close to one; if you have negative numbers, it sends them to something close to zero; if you have, for example, one, it sends it to 0.73 (you can calculate that with the formula); and if you have zero, it sends it to 0.5. In other words, it takes the entire number line and shrinks it into the interval (0, 1).
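Before the sigmoid enters the picture, the whole generator is just a 4x2 weight matrix. A sketch (the matrix layout, one row per output pixel, is my own convention):

```python
import numpy as np

# Weights of the tiny text-to-image network: one row per output pixel
# (top-left, top-right, bottom-left, bottom-right), one column per
# input word (ball, bat).
W = np.array([[0, 1],   # top-left pixel: dark when "bat" is present
              [0, 0],   # top-right pixel: always light
              [1, 0],   # bottom-left pixel: dark when "ball" is present
              [0, 1]])  # bottom-right pixel: dark when "bat" is present

print(W @ [1, 0])  # ball         -> [0 0 1 0]
print(W @ [0, 1])  # bat          -> [1 0 0 1]
print(W @ [1, 1])  # ball and bat -> [1 0 1 1], never seen in training!
```

Because the map is linear, the output for (1, 1) is just the sum of the outputs for ball and bat, which is exactly why the combined image comes out for free.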
It's kind of like looking at the line through a lens. And why is it so useful? Because the outputs of neural networks are between zero and one, but out of these nodes any number can come out, so you use the sigmoid function to shrink those numbers into the zero-to-one range. So we had our neural network here that outputs numbers; let's not forget to put a sigmoid function at the end of each node. What happens? Let's go back again. (1, 0) is going to go to (0, 0, 1, 0), because the ball goes to the image of a ball. But if I pass everything through a sigmoid function, I'm not going to get (0, 0, 1, 0); unfortunately, I'm going to get (0.5, 0.5, 0.73, 0.5). That gives me this image over here, which is kind of a ball, but very fuzzy: you can sort of see the ball in the bottom left, but it's not so clear. What happens to the bat? The vector (0, 1) goes to the vector (1, 0, 0, 1), and when put through a sigmoid function that becomes (0.73, 0.5, 0.5, 0.73), which is this image where you can sort of see the bat, but it's not very clear. And finally, "ball and bat" is (1, 1), which goes to (1, 0, 1, 1), which becomes (0.73, 0.5, 0.73, 0.73), which is again a ball and a bat, but not super clear. In other words, we've managed to build a neural network that can draw a ball, a bat, and a ball and a bat, but unfortunately they're all very fuzzy.

So the question is: can we improve these images, can we make them more crisp? We can, by making the neural network a little bit better. I'm going to improve it; check this out. I'm going to turn those plus ones into plus twos. What happens now? Well, ball, which is (1, 0), now goes to the vector (0, 0, 2, 0), because that one gets multiplied by two, the weight of the red edge. However, (0, 0, 2, 0) is not that much better. But I can subtract one from everything to get (-1, -1, 1, -1). How? By throwing in a bias. The bias unit in a neural network is super important, and it's pretty much a constant that you add to each of the nodes. If my bias unit has edges with weight minus one going to all the nodes on the right, then I'm basically subtracting one from each of the values we get at the end, and now we get (-1, -1, 1, -1). Why is that good? Because now I'm spreading my values apart: the minus one becomes a 0.27 and the one becomes a 0.73, and we've managed to make the image just a little crisper. Now you can see the ball a little bit better; it's not perfect, but it's better than the one we just built. What happens with bat? Very similar: (0, 1) now goes to (1, -1, -1, 1), which becomes (0.73, 0.27, 0.27, 0.73), which goes to this image over here that is now a little bit better. You can see the bat more in the diagonal; it's clearer than the previous one. Still not great, but much better. And finally, ball and bat: that's a (1, 1) that goes to (1, -1, 1, 1), and that's a slightly better image of a ball and a bat. So we've managed to make the images a little bit better, but they're still not perfect. So the question is: can we improve this even more? And the answer is yes, we can continue improving this neural network.
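Doubling the weights and adding a bias of minus one can be checked numerically. A sketch, reusing the same 4x2 weight layout as before (one row per output pixel; layout and names are mine):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Same weights as before, doubled, plus a bias of -1 on every output node.
W = 2 * np.array([[0, 1],
                  [0, 0],
                  [1, 0],
                  [0, 1]])
b = -1

def generate(text_vec):
    """Tiny image generator: linear layer, bias, then sigmoid."""
    return sigmoid(W @ text_vec + b)

print(generate([1, 0]).round(2))  # ball:         [0.27 0.27 0.73 0.27]
print(generate([0, 1]).round(2))  # bat:          [0.73 0.27 0.27 0.73]
print(generate([1, 1]).round(2))  # ball and bat: [0.73 0.27 0.73 0.73]
```

The dark pixels now sit at 0.73 and the light ones at 0.27, instead of everything clustering around 0.5, which is exactly the "spreading the values apart" effect of the bias.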
And I'm going to show you a quick way; I'd love for you to pause this video and actually do the calculations, and notice that if I take the edges and turn the weights into plus 100, and I take the edges coming from the bias unit and turn them into minus 50, then this neural network is actually going to give me crisp images of a ball, a bat, and a ball and a bat. But that's not what we want to do. I don't want to continue messing with this neural network, because I don't want to assume that this neural network is going to do a perfect job. This neural network is very limited; it goes from text to images, and that's really hard. So in some way, I want to do something simpler that takes these images over here and says: you know what, this is good enough. You've done well enough, neural network, don't worry; you've kind of managed to show me where the ball is, and where the bat is, and where the ball and the bat are, so you can go rest, neural network, because you've done a wonderful job; let me take it from here. In other words, the image generator gives us these images that are not that great, but they capture the ball and the bat, and as long as they capture that, it's good enough, because what we're going to do is build a diffusion model that is going to improve these images. It's going to take this rough-looking ball and bat and turn them into clear images of a ball, a bat, and a ball and a bat. Actually, that's kind of optimistic; we may only be able to get somewhere like here, with 0.99s and 0.01s, but the diffusion model is going to clean the images up. And so next, I'm going to tell you how to build this diffusion model for this small example. Okay, so now it's time to build the diffusion model. Of all the ones we built today, this is going to be the most complicated one, but it's still not that complicated. So, to remind you, we want to build a diffusion model that's going to take these rough-looking images of balls and bats and turn them into nice, sharp images.
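The pause-and-check exercise above is quick to confirm in code: with weights of +100 and a bias of -50, a lit pixel feeds 100 - 50 = +50 into the sigmoid and an unlit one feeds -50, so the outputs saturate to essentially 1 and 0:

```python
import math

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

# a lit pixel: 1 * 100 - 50 = +50; an unlit pixel: 0 * 100 - 50 = -50
on, off = sigmoid(100 - 50), sigmoid(0 - 50)
print(round(on, 6), round(off, 6))  # -> 1.0 0.0
```

So cranking up the weights would make this toy network crisp on its own; the point of the video is that for a real generator we can't count on that, which is why the cleanup is delegated to a separate diffusion model.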
In our very simple case, what it's going to do is take these rough-looking, pixelated images and turn them into their better version, where the light grays are turned into white and the dark grays are turned into black; and that's going to be the diffusion model. So first, I'm going to tell you how to train it. If you were to train it, what you would do is take a clean image, for example a clean image of a bat. Then you would add some noise, by adding or subtracting some random tiny numbers, to get an image of a bat that is a little noisy. Then you would add a little more noise to this image, and then a little more, until it's pretty much illegible. And so you have a chain of images, where you start with the crisp one and then go noisier, noisier, noisier until the end. And now what you're going to do is train a neural network to predict the previous one; in other words, the input is the image with noise, and the output is the image with less noise, for each one of the images. Now, this is done for a bat; you also do it for the ball and for every image in the data set. So just like I showed you at the beginning of the video, you take lots and lots of images, you add noise to them, add more noise, add more noise, and then you train the neural network to continually remove this noise at each step. And so then the neural network knows how to clean an image; you may have to apply it many times, but at the end of the day, you have a neural network that is trained to clean up an image. Now, in general these are going to be very complicated, but for our simple example it's actually going to be very easy, and the reason is that the pixels here are not super correlated. In other words, what I'm going to do is basically clean up one pixel at a time. So if I have anything dark gray, anything where the intensity is bigger than 0.5, I'm going to turn it into black; that is, any number strictly bigger than 0.5, I want to turn into a one.
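The noising chain described above is easy to sketch. Here is a minimal version for the 2x2 bat; the noise amount and the number of steps are made up for illustration:

```python
import random

def add_noise(image, amount=0.1, rng=random.Random(0)):
    # add or subtract a small random number to each pixel, clipped to [0, 1]
    return [min(1.0, max(0.0, p + rng.uniform(-amount, amount))) for p in image]

bat = [1.0, 0.0, 0.0, 1.0]       # a clean 2x2 "bat" image
chain = [bat]
for _ in range(5):               # each step gets a little noisier
    chain.append(add_noise(chain[-1]))

# training pairs for the denoiser: input = noisier image, target = previous one
pairs = [(chain[i + 1], chain[i]) for i in range(len(chain) - 1)]
```

Each pair teaches the denoiser one step of cleanup; repeated over many images, that is the training procedure the video describes.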
Any number less than 0.5, which covers the lighter grays all the way to white, I'm going to turn into a zero, because that's the color for white. And let's say that if we have a pixel whose value is exactly 0.5, we'll keep it as 0.5; that doesn't really matter for our examples. Now, I'm just doing this for this particular case, and I'm picking a fairly arbitrary construction; I encourage you to actually think about this, and think of how you would build a diffusion model for this case. Maybe you have a better construction, a better architecture. So how are we going to do this? Let's do some math. I'm going to take the dark grays and turn them into something much closer to black, and the light grays into something much closer to white. So I'm going to do some math with all these numbers. First, I'm going to center them: they're in the interval [0, 1], and I'm going to put them in the interval [-0.5, 0.5], which means I'm going to subtract 0.5. When I subtract 0.5, I get the numbers 0.5, 0.25, 0, -0.25, and -0.5; in other words, I've centered them. Now I'm going to stretch this interval, so I'm going to multiply by 10. When I multiply by 10, I get numbers between -5 and 5: I get 5, 2.5, 0, -2.5, and -5. And now I'm going to apply sigmoid. For high values, sigmoid goes close to one, and for low, negative values, sigmoid goes close to zero; so now I have numbers that are roughly 1, 0.92, 0.5, 0.08, and 0, and that's going to be my neural network. So it takes 1 to 1, and 0.75 goes to 0.92; notice that it really cleaned up that pixel, turning it from dark gray to a much darker gray. It left the 0.5 the same.
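Those three steps (center, stretch, squash) are just one function per pixel, which we can sketch directly:

```python
import math

def clean_pixel(x):
    # center to [-0.5, 0.5], stretch by 10, then squash with sigmoid
    return 1 / (1 + math.exp(-10 * (x - 0.5)))

for v in [1.0, 0.75, 0.5, 0.25, 0.0]:
    print(v, "->", round(clean_pixel(v), 2))
# 1.0 -> 0.99
# 0.75 -> 0.92
# 0.5 -> 0.5
# 0.25 -> 0.08
# 0.0 -> 0.01
```

(The 0.99 and 0.01 round to the 1 and 0 quoted in the video; a bigger stretch factor would push them even closer.)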
Then it took my light gray, 0.25, and turned it into 0.08, which is a lot whiter; and the zero, it left at zero. And now I'm going to show you how to turn these operations into a pretty simple neural network. So let's look at one of them, say the one that takes 0.75 and turns it into 0.92. What did we do here? Let's recall: first we subtracted 0.5, then we multiplied by 10, and then we applied sigmoid. Those are neural network operations; let's organize them a little bit. So first, let's say I multiply by 10: I have 0.75 times 10. And this minus 0.5, since it was done before multiplying by 10, I can actually turn into subtracting five, because 0.5 times 10 is 5; I'm just using the distributive property. So here I have: subtract five, and finally, apply the sigmoid function. This is the operation that I'm going to turn into a neural network. Let's start with my value of 0.75. The multiply-by-10 means that I have an edge with a weight of 10, and now the output is 7.5. How do I subtract five? Well, that's the bias unit over here, which is a one with an edge of weight minus 5; that's how I subtract five. So now my output turns into 7.5 minus 5, which is 2.5. And finally, I'm going to apply sigmoid: I apply sigmoid to this 2.5 and I get 0.92. So that's a neural network; it's a pretty simple neural network, and I can do that for every single one of the four pixels, and I get this neural network over here, which I can clean up by putting the sigmoid inside the nodes. And that is my diffusion neural network. Now let's see how it works. The image of the ball is this image over here, 0.27, 0.27, 0.73, 0.27, and that goes to roughly 0.09, 0.09, 0.91, and 0.09, which is a much crisper image of the ball. The bat goes to this image over here, which is a much crisper image, and the ball and bat go to this image over here, which is much crisper. Now, a small observation, which I already made: this is a very easy case, because I cleaned up every pixel separately. Obviously, in a big data set with lots of images and lots of pixels, you're not going to clean every pixel separately.
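Putting the pieces together, the whole per-pixel diffusion network fits in a few lines; this is a sketch with the weight-10 edge and bias of -5 derived above, applied to the fuzzy generator outputs:

```python
import math

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

def denoise(image, weight=10.0, bias=-5.0):
    # one edge of weight 10 and a bias edge of -5 per pixel, then sigmoid:
    # this is the per-pixel diffusion network sigmoid(10*x - 5)
    return [round(sigmoid(weight * p + bias), 2) for p in image]

fuzzy_ball = [0.27, 0.27, 0.73, 0.27]   # output of the toy image generator
fuzzy_bat = [0.73, 0.27, 0.27, 0.73]
print(denoise(fuzzy_ball))  # -> [0.09, 0.09, 0.91, 0.09]
print(denoise(fuzzy_bat))   # -> [0.91, 0.09, 0.09, 0.91]
```

Running the fuzzy images through this step once already makes them much crisper; applying it again would push the values even closer to 0 and 1.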
The neural network has to figure out correlations between pixels that correspond to the noise. So this can be a pretty complicated neural network, but you can think of it as a neural network that learns how to remove noise from images, and that's the diffusion model. In the more complicated case, it's going to take images that are pretty fuzzy and turn them into much nicer, crisper images; and don't forget that this is a big neural network. So that's it; that is the diffusion model. So let me give you a little summary of what we saw today. The stable diffusion model for our example had a small data set of two words, ball and bat, and it created the embeddings for ball and bat; this was a very small neural network. Notice that this one doesn't have the sigmoid function; it doesn't really matter, the embedding neural network for this case is that simple. But in the big case, imagine a full-fledged neural network that takes the words and turns them into embedding vectors. So this neural network output these vectors over here, (1, 0), (0, 1), and (1, 1), for the three sentences on the left. Then we needed another neural network to actually build the images, and this one built these kind of fuzzy images, where you can sort of see the ball and sort of see the bat, but they're not super clear. But then we built a diffusion neural network that cleaned up these images and left us with some pretty nice images of a ball, a bat, and a ball and a bat. These three are called the embedding neural network, the image generator, and the diffusion model. And in the big case, we're going to have a big data set with lots of sentences, and they're going to be turned into long vectors with a lot of numbers, could be hundreds, could be thousands; this is going to be done by the first neural network, the embedding neural network. Then the second neural network, the image generator, is going to turn these vectors into images that are not great-looking; they're kind of fuzzy,
but they're going to capture what's important: they're going to be able to draw that penguin in a slightly rough way, or the clown, etc. And then the diffusion model is going to turn that into crisp, nice-looking images like the ones at the right. So these three neural networks, the embedding, the image generator, and the diffusion model, are the neural networks that form our stable diffusion model, and that is how images get generated. That's what you see in DALL-E, that's what you see in Dream Studio, and in pretty much any great image generator that you may be working with. Obviously, a full stable diffusion model will have a lot of little moving parts that make this work much better, but at a high level it's formed by these three big components. So that's all, folks! Thank you very much for sticking all the way to the end. I hope you enjoyed the video, and I hope you learned how stable diffusion models work. As usual, I couldn't have done this alone; I got a little help from my friends. In particular, my friend Suraj was the one who gave me the original idea; I was talking to him at a conference, and he gave me a big picture of how stable diffusion works, and that helped me build this video. If you liked this, I want to recommend LLM University; it's a course that we've built with my colleagues at Cohere, Meor Amer and Jay Alammar, who are wonderful content creators. In particular, you should check out their material: Meor has some great books, and Jay has a wonderful blog and YouTube channel, where he actually has a video on stable diffusion that you should check out; I used that video to learn a lot about the material I showed you today. So thank you very much. If you liked this video, please follow me on YouTube if you haven't; I put out videos like this pretty often, and I always try to simplify complicated concepts using fun examples like the ones you saw today. The channel is called Serrano Academy. Please hit like, please comment (I love to read your comments), and share it with your friends.
You can also take a look at my page, Serrano Academy (serrano.academy), where I have a blog, all these videos, and a bunch of other courses, etc. Or you can tweet at me; my Twitter is @SerranoAcademy. And if you'd like to learn more, I actually have a book called Grokking Machine Learning; you can buy it at the link I have in the description, and make sure you use the discount code serranoyt, which is for a 40% discount. So thanks very much for your attention, and see you in the next video. [Music]
Info
Channel: Serrano.Academy
Views: 17,259
Id: JmATtG0yA5E
Length: 44min 59sec (2699 seconds)
Published: Tue Dec 12 2023