How AI Image Generators Work (Stable Diffusion / Dall-E) - Computerphile

Reddit Comments

This explanation really highlights the old quip that computers are just sand that we've tricked into thinking.

In this case, an elaborate multi-part plan to effectively fool the sand into thinking better.

u/diqbghutvcogogpllq, Oct 07 2022 (1 upvote)
Captions
Generating images using diffusion: what is that? Right, so I should probably find out. It's things like DALL-E and DALL-E 2, yeah, Imagen from Google, Stable Diffusion now as well. I've spent quite a long time messing about with Stable Diffusion, and I'm having quite a lot of fun with that. So what I thought I'd do is download the code, read the paper, work out what's going on, and then we can talk about it. I delved into this code and realized there's actually quite a lot to these things, right? It's not so much that they're complicated, it's just that there are a lot of moving parts.

So let's just have a quick reminder of generative adversarial networks, which are, I suppose, before now the standard way of generating images, and then we can talk about how this is different and why we're doing it using diffusion. Having a deep network trained to just produce the same image over and over again is not very interesting, so we have some kind of random noise that we're using to make it different each time. We have some kind of very large generator network, which I'm just going to give as a black box, a big neural network, right, that turns out an image that hopefully looks nice, like the thing we're trying to produce: faces, landscapes, people.

Is that how those anonymous people on This Person Does Not Exist are made?

Yeah, that's exactly how they work. That's using, I think, StyleGAN, right, and it's that exact idea. That's trained on a large corpus of faces and it just generates faces at random, or at least mostly at random. The way we train this is we have millions and millions of pictures of something that we're trying to produce. So we give it noise, we produce an image, and we have to tell it: is that good or is that bad? We need to give this network some instruction on whether this image is actually looking like a face, right, otherwise it's not going to train. So what we do
is we have another network here, which is sort of like the opposite, and this says: is it a real or is it a fake image? Half the time we're giving it fake images and half the time we're giving it real faces. So this trains and gets better at discriminating between the fake images produced here and the real images from the training set, and in doing so this has to get better at faking them, and so on and so forth. The hope is that they just get better and better and better.

Now, that kind of works. The problem is that they're very hard to train, right? You have a lot of problems with things like mode collapse, where it just produces the same face. If it produces a face that fools this every time, there's not a lot of incentive for this network to do anything interesting, because it has solved the problem, right? It's beaten this, let's move on. So if you're not careful with your training process, these kinds of things can happen. And I suppose intuitively it's quite difficult to go from this bit of noise to a really beautiful-looking image in high resolution without there being some oddities, some things that go a bit wrong. So what we're going to do in diffusion models is try to simplify this process into a kind of iterative, small-step situation, where the work that this network has to do is slightly smaller and you just run it a few times to try and make the process better.

We'll start again on the paper so we can clean things up a bit. So we've got an image, right? Let's say it's an image of a rabbit. We add some noise, so we've got a rabbit, which is the same, and you add some noise to it. Now, it's not speckly noise, but I can't draw Gaussian noise. And then we add another bit of noise: it's the same shape rabbit, there's a bit more noise. And then we come over here, and we come over here, and we end up with just noise; it looks like nonsense. And so the question is, like,
how do we craft some kind of training algorithm, and some kind of what we call inference, you know, how do we actually deploy a network that can undo this process? The first question is how much noise do we add. Why don't we just add loads of noise, right? So just delete all these images, we don't really need to worry about them, add loads of noise, and then say, 'give me that', and then you've got a pair of training examples you could use. And the answer is it'll kind of work, but that's a very difficult job, and you're sort of in the same problem as with the GAN: you're trying to do everything in one go. The intuition, perhaps, is that it's maybe slightly easier to go from this one to this one, just remove a little bit of noise, and then from this one to this one, a little bit more noise.

Well, in traditional image processing there are noise removal techniques, right? It's not difficult to do that, is it?

No. I mean, it's difficult in the sense that you don't know what the original image was. So what we're trying to do is train a network to undo this process. That's the idea. And if we can do that, then we can start with random noise, a bit like a GAN, and we can just iterate this process and produce an image. Now, there are a lot of missing parts here, so we'll start building up the complexity a little bit.

So the first thing is, let's go back to our question of how much noise do we add. We could add a small amount of noise, and then the same amount again, and then the same amount again, and we could keep adding it until we have essentially what looks like random noise over here. That would be what we would call a linear schedule: it's the same amount of noise each time, basically. It's not interesting, but it works. The other thing you could do is add very little noise at the beginning and then ramp up the amount of noise you add later. And so there are different strategies, depending on what
paper you read, about the best approach for adding noise, but it's called the schedule. So the idea is you have a schedule that says, right, given this image, so this is an image at time t equals nought, this is t equals one, blah blah blah, t equals some capital T, which is the final number of steps you've got. This one represents essentially all noise, and this represents some amount of noise, and you can change how much each step has. And then the nice thing is, because Gaussians add together very nicely, you can say, well, I want t equals seven, and you don't have to produce all the images: you can just jump straight to t equals seven, add the exact right amount of noise, and then hand that back to the network. So when you train this, you can give it random images from your training set with random amounts of noise added based on this schedule, varying randomly between 1 and T, and you can say, okay, here's a really noisy image, undo it; here's a slightly less noisy image, undo it.

So what you do is you take your noisy image, right, I'm going to keep going with this rabbit, it's taller than it was before. You take your noisy image at some time, let's say t equals five. You have a giant U-Net-shaped network. We've talked about encoder-decoder networks before; there's nothing particularly surprising about this one. And then you also put in the time, because if we're running a funny schedule where different times have different amounts of noise, you need to tell the network where it is, so that it knows, okay, I'm going to have to remove a lot of noise this time, or just a little bit of noise.

What do we produce here? We could go the whole hog and just say we'll produce the original rabbit image, but then you've got a situation where you have to go from here all the way back to the rabbit, and that's a little bit difficult. Mathematically, it works out a little bit easier if we just try and
predict the noise. We want to know what the noise is that was added to this image, which you could use to get back to the original image. So this is all the noise from t equals one, two, three, four and five. So you just get noise, basically, out here, like this, with no rabbit; that's the hope. And then theoretically you could take that away from this and you get the rabbit back. Now, if you did that from here, you would find that it's a little bit iffy, because predicting the noise all the way back to this rabbit is maybe quite difficult, but if you did it from here, it's maybe not quite so difficult.

We want to predict the noise. So what we could do is predict the noise at, let's say, time t equals five, and say, give me the noise that takes us back to t equals four, and then t equals three, and t equals two. The problem if you do that is that you're very stuck doing the exact time steps of the schedule used: if you used a thousand time steps for training, now you've got to use a thousand time steps for inference. You can't speed it up. So what we might try and do instead is say, well, okay, whatever time step you're at, you've got some amount of noise: remove it all. Predict me all the noise in the image and just give me back that noise, which I can take away to get back to the original image. And so that's what we do.

So during training we pick a random source image, we pick a random time step, and we add, based on our schedule, that amount of noise. So we have a noisy image and a time step t, we put that into the network, and we say, what was the noise that we've just added to that image? Now, we haven't given it the original image, so that's what's difficult about this. We have the original image without any noise on it, which we're not showing it, and we added some noise, and we want that noise back. We can set that up very easily: we've got millions of images, or billions of images, in our data set, we can add random bits of noise, and we can say, what was
that noise? And over time it starts to build up a picture of what that noise is.

So it sounds like a really good plug-in for Photoshop or something, right? It's going to be a noise removal plug-in. How does that turn into creating new images?

Yeah, so actually, in some sense that's the clever bit: how we use this network that produces noise to undo the noise. We've got a network which, given an image with some noise added to it and a time step that represents how much noise that is, roughly, or where we are in the noising process, produces an estimate of what that noise is in total. And theoretically, if we take that noise away from this, we get back to the original image. Now, that is not a perfect process, right? This network is not going to be perfect, and so if you give it an incredibly noisy image and you take away what it predicts, you'll get maybe a vague shape. So what we want to do is take it a little bit more slowly.

So we take this noise and we subtract it from our image to get an estimate of what the original image was, t nought. We take this, and we take this, and we do a subtraction, and we get another image, which is our estimate for t equals nought. It's not going to look very good the first time. But then we add a bunch of this noise back again and we get to a t that's slightly less than this one. So maybe this was t equals 10; maybe we add nine tenths of the noise back and we get to roughly t equals nine. So now we have a slightly less noisy image and we can repeat this process. We put the slightly less noisy image in, we predict how to get back to t nought, and we add back most but not all of the noise, and then we repeat the process. And so each time we loop this, we get a little bit closer to the original image. It was very difficult to predict the noise at t equals 10.
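The noising schedule, the training pair, and the "predict all the noise, subtract it, add most of it back" loop described so far can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the code used in the video: the square-root scaling of signal and noise follows the usual diffusion-paper formulation, which the captions do not spell out; the beta values are illustrative; every function name is hypothetical; and the "network" is an oracle that cheats by knowing the clean image, standing in for a trained U-Net.

```python
import math
import random


def make_alpha_bars(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linear schedule. alpha_bars[t] is the share of signal variance left at step t.

    alpha_bars[0] == 1.0 is the clean image. The beta values are assumptions.
    """
    alpha_bars = [1.0]
    for i in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * i / max(num_steps - 1, 1)
        alpha_bars.append(alpha_bars[-1] * (1.0 - beta))
    return alpha_bars


def add_noise(x0, t, alpha_bars, rng):
    """Jump straight to step t without generating the intermediate images."""
    noise = [rng.gauss(0.0, 1.0) for _ in x0]
    signal_scale = math.sqrt(alpha_bars[t])
    noise_scale = math.sqrt(1.0 - alpha_bars[t])
    x_t = [signal_scale * p + noise_scale * n for p, n in zip(x0, noise)]
    return x_t, noise


def make_training_example(x0, alpha_bars, rng):
    """Random time step, noisy input, and the noise the network should predict."""
    t = rng.randint(1, len(alpha_bars) - 1)
    x_t, noise = add_noise(x0, t, alpha_bars, rng)
    return x_t, t, noise


def estimate_x0(x_t, predicted_noise, t, alpha_bars):
    """Take the predicted noise away from x_t to estimate the clean image."""
    signal_scale = math.sqrt(alpha_bars[t])
    noise_scale = math.sqrt(1.0 - alpha_bars[t])
    return [(x - noise_scale * n) / signal_scale
            for x, n in zip(x_t, predicted_noise)]


def denoise_loop(x_t, start_t, alpha_bars, predict_noise):
    """Predict all the noise, estimate x0, add most of the noise back, repeat."""
    x = list(x_t)
    for t in range(start_t, 0, -1):
        eps = predict_noise(x, t)
        x0_hat = estimate_x0(x, eps, t, alpha_bars)
        # Re-noise the estimate to the slightly less noisy level of step t - 1.
        signal_scale = math.sqrt(alpha_bars[t - 1])
        noise_scale = math.sqrt(1.0 - alpha_bars[t - 1])
        x = [signal_scale * p + noise_scale * n for p, n in zip(x0_hat, eps)]
    return x


def make_oracle(x0, alpha_bars):
    """Stand-in for a trained network: it cheats by knowing the clean image x0."""
    def predict_noise(x_t, t):
        signal_scale = math.sqrt(alpha_bars[t])
        noise_scale = math.sqrt(1.0 - alpha_bars[t])
        return [(x - signal_scale * p) / noise_scale for x, p in zip(x_t, x0)]
    return predict_noise
```

With the oracle the loop lands back on the clean "image" exactly, which only checks the bookkeeping. A real network gives an imperfect noise estimate at high t, and that is why the loop steps down gradually instead of jumping to t equals nought in one go.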
It's slightly easier to predict the noise at t equals nine, and very easy at t equals one, because it's mostly the image with a little bit of noise on it. And so if we just sort of feel our way towards it by taking off little bits of noise at a time, we can actually produce an image. So you start off with a noisy image, you predict all the noise and remove it, and then add back most of it. At each step you have an estimate of what the original image was, and you have a next image which is just a little bit less noisy than the one before, and you loop this a number of times. That's basically how the image generation process works: you take your noisy image, you loop it, and you gradually remove noise until you end up back at what the network thinks was the original image. And you're doing this by predicting the noise and taking it away, rather than spitting out an image with less noise. That mathematically works out a lot easier to train, and it's a lot more stable than a GAN.

There's an elephant in the room here.

There is.

You're kind of talking about how to make random images, effectively. How do we direct this?

So that's where the complexity starts ramping up. We've got a structure where we can train a network to produce random images, but it's not guided. There's no way of saying, I want a frog-rabbit hybrid, which I've done, and it's very weird. So how do we do that? The answer is we condition this network, that's the word we would use: we basically give it access to the text as well.

All right, so let's actually run inference on an image on my piece of paper. Bear in mind the output is going to be hand-drawn by me, so it's going to be terrible. You start off with a random noise image, so this is just an image that you've generated by taking random Gaussian noise. Mathematically, this is centered around zero, so you have negative and positive numbers; you don't go from zero to 255, because it's just easier
for the network to train. You put in your time step. Let's say you're going to do 50 iterations, so we put in a time step that's right at the end of our schedule, time step equals 50, which is our most noised image. Then you pass it through the network and say, estimate me the noise. And we also take our string, which is 'frogs on stilts'. I'll have to try that later. Oh look, right, what's this one? Anyway, we could spend, let's say, another 20 or 30 minutes producing frogs on stilts. We embed this using our GPT-style transformer embedding and we stick that in as well, and then it produces an estimate of how much noise it thinks is in that image.

That estimate at t equals 50 is going to be a bit average. It's not going to produce you a frog-on-a-stilt picture; it's going to produce you a gray image or a brown image or something like that, because that is a very, very difficult problem to solve. However, if you subtract this noise from this image, you get your first estimate for what your first image is, and then you add back a bunch of noise and you get to t equals 49. So now we've got slightly less noise, and maybe there's the vaguest outline of a frog on a stilt. This is t equals 49: you take your embedding and you put this in as well, and you get another, maybe slightly better, estimate of the noise in the image. And then we loop. It's a for loop, right, we've done those before. You take this output, you subtract it, you add noise back, and you repeat this process, and you keep adding this text embedding.

Now, there's one final trick that they use to make things a little bit better. If you do this, you will get a picture that maybe looks slightly frog-like, maybe there's a stilt in it, but it won't look anything like the images you see on the internet that have been produced by these tools, because they do another trick to make the output even
more tied to the text. What you do is something called classifier-free guidance. You actually put this image in twice: once you include the embeddings of the text, and once you don't. This network is maybe slightly better at estimating the noise when it has the text. So you actually put in two images: this one's with the embedding and this one's with no embedding. This one is maybe slightly more random noise, and this one's slightly more frog-like, or it's slightly moving towards the right thing. And we can calculate the difference between these two noises and amplify that signal, and then feed that back. So what we essentially do is say, okay, if this network wasn't given any information on what was in the image, and then this version of the network was, what's the difference between those two predictions, and can we amplify that when we loop this to really target this kind of output? The idea is basically that you're really forcing this network, or this loop, to point in the direction of the scene we want. That's called classifier-free guidance, and it is somewhat of a hack at the end of the network, but it does work. If you turn it off, which I've done, it produces you vague sort of structures that kind of look right. It's not terrible. I mean, I think I did a Muppet cooking in the kitchen, and it just produced me a picture of a generic kitchen with no Muppet in it. But if you do this, then you suddenly are really targeting what you want.

Standard question, got to ask it: is this something people can play with without just going to one of these websites and typing some words?

Well, yeah. I mean, the thing is that it costs hundreds of thousands of dollars to train one of these networks, because of how many images they use and how much processing power they use. The good news is that there are ones like Stable Diffusion that are available to
use for free, and you can use them through things like Google Colab. Now, I did this through Google Colab and it works really, really well, and maybe we'll talk about that in another video where we delve into the code and see all of these bits happening within the code. I blew through my free Google allowance very, very quickly. I had to pay my eight pounds for premium Google access. So, you know, eight pounds. Eight pounds, thank you, yeah. So never let it be said that I spare expense. I spare no expense on Computerphile, getting access to proper compute hardware.

Could Beast do something like that?

It could, yeah. Most of our servers could. I'm just a bit lazy and haven't set them up to do so. But actually the code is quite easy to run. With the entry-level version of the code, you literally can just call one Python function and it will produce you an image. I'm using code which is perhaps a little bit more detailed: it's got the full loop in it, and I can go in and inject things and change things so I can understand it better. We'll talk through that perhaps next time.

The only other interesting thing about the current neural networks is that the weights here and here and here are shared, so they are the same, because otherwise this one here would always be the time to make one sandwich, but you've got two people doing it, so they make twice as many sandwiches each time they make a sandwich. Same with the computer: we could either make the computer processor faster or
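Going back to the classifier-free guidance trick described in the captions above: the "calculate the difference between these two noises and amplify that signal" step is a single line of arithmetic once both noise estimates are available. A minimal sketch, assuming the two predictions arrive as flat lists of numbers; the function name and the default `guidance_scale` of 7.5 are illustrative assumptions, since the captions only say the difference is amplified.

```python
def guided_noise(noise_uncond, noise_cond, guidance_scale=7.5):
    """Combine two noise estimates for one denoising step.

    noise_uncond: prediction made without the text embedding.
    noise_cond:   prediction made with the text embedding.
    A scale of 1.0 returns the text-conditioned prediction unchanged;
    larger values amplify the difference between the two.
    """
    return [u + guidance_scale * (c - u)
            for u, c in zip(noise_uncond, noise_cond)]
```

Inside the sampling loop the combined estimate would simply take the place of the single noise prediction before it is subtracted from the image.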
Info
Channel: Computerphile
Views: 644,035
Keywords: computers, computerphile, computer, science
Id: 1CIpzeNxIhU
Length: 17min 50sec (1070 seconds)
Published: Tue Oct 04 2022