Stable Diffusion in Code (AI Image Generation) - Computerphile

Video Statistics and Information

Reddit Comments

lol at that "trending on artstation" bit. we've basically created an AI superstition

👍 3 · u/finnamopthefloor · Oct 21 2022

The latest video from Computerphile, specifically on Stable Diffusion this time!

👍 2 · u/danamir_ · Oct 20 2022
Captions
Last time we talked about how these kinds of networks and image generation systems work, but there are different kinds, aren't there? There are: there's DALL·E 2, there's Imagen, there's Stable Diffusion. I didn't distinguish between them in the last video because, for the sake of understanding diffusion models in general, they're essentially the same, but actually they are quite different underneath. It comes down to the resolution, exactly where you do the embeddings, how you do the embeddings, how you structure your network, and so on; in Stable Diffusion's case it also comes down to where you do the diffusion. So let's look at the Stable Diffusion code, because I've got access to that, and we'll go into it in quite some detail, which I think will be quite interesting. I've really enjoyed using it: first of all it's given me a better understanding of how it works, and also you can do some pretty cool stuff by messing about down there and saying, well, what if I gave it a frog but also a snake? The answer is you get a frog-snake. The snake-giraffe was the stuff of nightmares. There are questions about ethics and questions about how these models are trained; maybe we'll deal with those another time. Let's talk about how they work.

DALL·E 2 is perhaps the biggest one at the moment, but I think it's being rapidly overtaken by Stable Diffusion, primarily because Stable Diffusion is more available: I can download the Stable Diffusion code and run it myself. DALL·E you access via an API; you say "I would like an image, please" and it gives you something back. If you have no interest in the code then sure, just use the API, but if, like me, you're interested in applications of image generation in your own area of research, like plants or medical imaging, maybe you want access to the code so you can train the network yourself.

DALL·E 2 builds on a lot of work that OpenAI has already done. The first piece is something called CLIP embeddings. CLIP embeddings are a way of taking your text tokens and turning them into meaningful numbers. Remember, we're going through a transformer, so we're not just saying the word "the" is a 5 and the word "football" is a 17: we're taking the whole sentence, doing a lot of cross-attention, and saying that this is the overall meaning of the sentence, reflected numerically. So you get some context? That's the idea.
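As a concrete illustration, here is a minimal sketch of that text-to-embedding step using the Hugging Face transformers library. The model name is the CLIP text encoder commonly used with Stable Diffusion v1; the specific code is an assumption for illustration, not the notebook from the video.

```python
# A minimal sketch (an assumption, not the video's notebook) of the
# tokenize -> text-encode step, using the CLIP text encoder that
# Stable Diffusion v1 ships with.
import torch
from transformers import CLIPTokenizer, CLIPTextModel

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

prompt = "frogs on stilts on a stage at the theatre"
tokens = tokenizer(prompt, padding="max_length",
                   max_length=tokenizer.model_max_length,
                   truncation=True, return_tensors="pt")
with torch.no_grad():
    # One contextual embedding per token position: shape (1, 77, 768).
    text_embeddings = text_encoder(tokens.input_ids).last_hidden_state
print(text_embeddings.shape)
```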
CLIP is trained with image and text pairs: you put in an image, you put in the text that describes that image, and you try to align those two embeddings so they make sense together. That way you've got a semantically meaningful text embedding. So it's a bit like a supervised dataset of some sort? It is, yeah, sort of. It's trained using a contrastive loss, which is what the "CL" in CLIP stands for (Contrastive Language-Image Pre-training), and the idea is that you want to make the embeddings of an image and its text description very, very similar, and the embeddings of an image and a different text description very, very different. That's not dissimilar to the Face ID video, where we were trying to put my face near previous shots of my face so you can unlock a phone with your face.

So you've got a CLIP embedding; that's your text embedding. Various other things go on in DALL·E, and I'm going to simplify slightly. You put in a noise image, which I think is at 64 by 64 pixels, you put in your time step, you put in your CLIP text embeddings, and, like we described in the previous video, you have a giant U-Net structure that produces an estimate of the noise, and you loop. That produces a not-bad image, but only at 64 by 64 pixels. This process of repeatedly predicting noise, checking its work and subtracting it takes a long time at high resolution, and the network you would need would be astronomically big, so to make things easier we only run it at 64 by 64. Of course, how do we then make that nice? Because DALL·E 2 outputs 1K by 1K images. The answer is that we have another network that does the same thing, but this time its job is to upsample: you put in a noisy 64 by 64 and say "output me the 256 version", and so on. You put it through, I think, two levels of upsampling to go from 64 to 256 to 1024. Can I ask my dumb question: are we finally at the point where we can say "enhance"? You know what, we are, except it will work exactly like it does on TV, where it just makes up nonsense and calls it a win. It works pretty well. Imagen, Google's version, works in a very similar way: you have a network that's trained to denoise and generate 64 by 64 images guided by text, and then two upsampling networks that go up to 1024.
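To make the contrastive loss mentioned above a bit more concrete, here is a rough sketch of a CLIP-style objective. The embeddings are random stand-ins and the function name is made up for illustration; it is not the training code of CLIP itself.

```python
# A rough sketch of a CLIP-style contrastive loss: matched image/text
# embeddings are pulled together, mismatched pairs pushed apart.
# The embeddings here are random stand-ins, not real model outputs.
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    # Normalise so the dot product is a cosine similarity.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    # (batch, batch) similarity matrix: entry [i, j] compares image i with text j.
    logits = image_emb @ text_emb.t() / temperature
    targets = torch.arange(len(image_emb))  # matching pairs sit on the diagonal
    # Symmetric cross-entropy: pick the right text for each image and vice versa.
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2

loss = clip_contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))
```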
Stable Diffusion does its diffusion process in this middle bit, in some sense. You have what we call an autoencoder, which takes an image and turns it into a lower-resolution but detailed latent representation. You then do the diffusion process there, which denoises that latent space, and then you have the other side of the autoencoder, which expands it back out into an image. So this is a different way of doing it, and the advantage is that the latent space is much lower resolution than the full image. They call it Stable Diffusion, and there's an argument that it's slightly more stable; I don't know to what extent that's true. There are some differences in the way these systems produce images, but in all other regards it's basically the same kind of process: you're still doing guidance from text, you're still putting in t, it's just that you're now doing it in this latent space instead of in the full image space. Think of it like putting the image through a zip: you compress it down, do all the diffusion in that compressed space, and at the end expand it back out again. And actually the autoencoder is very, very good: you can take an image, compress it right down, and it will still reproduce much the same image.
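Here is a small sketch of that compress-and-expand step using the variational autoencoder from the diffusers library. The model name and the 0.18215 scaling factor are the ones commonly used with Stable Diffusion v1; take the details as assumptions rather than the video's own code.

```python
# A sketch of the autoencoder ("zip") step that makes this latent diffusion:
# the encoder squeezes a 512x512 image into a 4x64x64 latent, diffusion
# happens in that latent space, and the decoder expands the result back out.
import torch
from diffusers import AutoencoderKL

vae = AutoencoderKL.from_pretrained("CompVis/stable-diffusion-v1-4", subfolder="vae")

image = torch.randn(1, 3, 512, 512)  # stand-in for a real RGB image scaled to [-1, 1]
with torch.no_grad():
    latents = vae.encode(image).latent_dist.sample() * 0.18215  # -> (1, 4, 64, 64)
    decoded = vae.decode(latents / 0.18215).sample              # -> (1, 3, 512, 512)
```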
Now let's dive into the code and have a look. I'm in Google Colab. For those who don't know, Google Colab is a Jupyter-notebook-style environment that also gives you access to Google's GPUs for running machine learning jobs. I don't tend to use Colab generally, because a lot of our processes run for longer than you're really meant to use it, but for this it's excellent. I got this code from a guy called Jonathan Whitaker, and I've repurposed it and done my own stuff with it, so thanks very much to him for that. I've played around: changed the resolution, toyed with a lot of things, and what I want to do is talk through some of these lines of code so you can see what it's doing. It's the exact process I just described, in just a few lines of code. Obviously there are a lot of deep networks going on behind the scenes, but they end up abstracted away in function calls, so it becomes very straightforward.

I've imported all my libraries already, and what we've got here is our text prompt; we're going to take that text prompt and produce an image. We have various settings: we want it to be 512 pixels tall and 768 pixels wide, we're going to run 50 steps of inference, and there are a few other things we can talk about in a moment. For example, we're going to seed it with the number 4. Why 4? I don't know, I picked it at random; I can set it to 77 if you like. The seed lets us run the exact same code again and produce the exact same image another time. If we just used a random seed and you got an image you liked and then accidentally got rid of it, you'd never get it back. If you change this number you get entirely different images, because the noise you start with is entirely different. So let's put in a prompt. What should we do? Frogs on stilts? I think we need to do frogs on stilts. This may not work. Anything else you want to add, like in a park, or just frogs on stilts? What about on a stage? Okay: frogs on stilts on a stage at the theatre.

Now, the first thing we have to do is embed this into some kind of usable space in which the machine learning can work. So we tokenize: this is the function that tokenizes the text input and turns it into a numerical code for each word, and that goes into the text encoder, which is our CLIP embedding. That's the bit where it works out the context? Yes, that's the transformer that says this word goes with this word, these share meaning, and so on; you go through the transformer and you end up with numbers that are essentially meaningless to us but carry the semantic information about the meaning of the sentence. Remember, we put things through the network twice, once with the text embeddings and once without, so we also have to produce a dummy text embedding with nothing in it; that's what this unconditional input is. We run that through the text encoder as well, and we get two text embeddings, one unconditional and one conditional: this one has "frogs on stilts", this one is just blank.

Next we set our scheduler. Remember, you can choose a scheduler that produces different amounts of noise at each time step; which one you use depends partly on the kind of images you want out, but also on how the network was trained. We're using the standard one that came with Stable Diffusion, and we're going to run for 50 time steps, so it will distribute the amount of noise across steps 0 to 50: at step 50 it produces the maximum amount of noise, and at step 1 only a tiny amount. Then we actually produce the latent noise that we're going to be diffusing: we create a random array of numbers of the right size, call these the latents, and stick them on the graphics card. We also do some scaling to the latents, because the scales of different parts of the network are different, so you have to move values in and out of the right ranges.

Then we're nearly done; this is our loop. How does it work? First we calculate the noise for this particular iteration: we're going through the different time steps, and at each one a different amount of noise applies, which we apply to our latents, so basically we're noising up the image here. Remember, this is an embedded, latent version of the image, but it is noised. Then we predict the noise with our U-Net, which is saying: how much noise do you think was in this image, such that we can get back to the original image, bearing in mind this text? Then we do the actual classifier-free guidance: we take our noise prediction with text and our noise prediction without text, calculate the difference and amplify it, and that gives us our final noise prediction. Finally we use that prediction to compute a slightly less noisy version of the image, which is what this line does, and we repeat the process: calculate the noise at the next time step, predict it, subtract it away, add a bit more noise, and go round again. The idea is that over 50 iterations we go from pure noise to a reasonable image.
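For reference, here is a condensed, diffusers-style sketch of that whole loop: embeddings, scheduler, latents and classifier-free guidance. It follows the steps described above, but the model names, parameters and exact API calls are assumptions about a typical notebook, not a line-for-line copy of the one in the video.

```python
# A condensed sketch of the sampling loop described above (assumed, typical
# diffusers-style code; not the video's exact notebook).
import torch
from diffusers import UNet2DConditionModel, LMSDiscreteScheduler
from transformers import CLIPTokenizer, CLIPTextModel

model_id = "CompVis/stable-diffusion-v1-4"
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")
unet = UNet2DConditionModel.from_pretrained(model_id, subfolder="unet")
scheduler = LMSDiscreteScheduler.from_pretrained(model_id, subfolder="scheduler")

def embed(prompt):
    # Tokenize, then run the CLIP text encoder to get contextual embeddings.
    tok = tokenizer([prompt], padding="max_length",
                    max_length=tokenizer.model_max_length,
                    truncation=True, return_tensors="pt")
    with torch.no_grad():
        return text_encoder(tok.input_ids).last_hidden_state

# Unconditional ("") and conditional embeddings, stacked for one batched U-Net call.
text_embeddings = torch.cat([embed(""), embed("frogs on stilts on a stage at the theatre")])

height, width = 512, 768
num_steps, guidance_scale = 50, 7.5
generator = torch.manual_seed(4)                 # the "why 4?" seed

scheduler.set_timesteps(num_steps)
latents = torch.randn((1, unet.config.in_channels, height // 8, width // 8),
                      generator=generator)
latents = latents * scheduler.init_noise_sigma   # scale to the scheduler's expected range

for t in scheduler.timesteps:
    # Duplicate the latents so the conditioned and unconditioned predictions
    # come out of a single forward pass.
    latent_input = scheduler.scale_model_input(torch.cat([latents] * 2), t)
    with torch.no_grad():
        noise_pred = unet(latent_input, t, encoder_hidden_states=text_embeddings).sample
    noise_uncond, noise_text = noise_pred.chunk(2)
    # Classifier-free guidance: amplify the difference the text made.
    noise_pred = noise_uncond + guidance_scale * (noise_text - noise_uncond)
    # Remove a bit of the predicted noise to get a slightly cleaner latent.
    latents = scheduler.step(noise_pred, t, latents).prev_sample

# "latents" now holds the finished image in latent space; decode it with the
# VAE, as in the earlier sketch, to get the final picture.
```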
Should we see? Okay, let's run it at this resolution; I'm pushing the image size I can fit on this graphics card. So this is running on your graphics card here? No, this is running on Google's graphics card, over at Google, which could be somewhere in London. Can I owe you another eight pounds for this? No, this is covered under the original eight pounds per month, but hopefully this won't take a month to record. We're choosing 50 iterations because that's a decent amount: you'll notice that if you don't do enough iterations, you're trying to remove the noise too quickly, and it becomes a bit unstable and doesn't produce nice results. Of course, I've not done this before, so I don't know how usable the results will be. Will it be frogs on stilts? Will it be bits of wood next to a frog? Will it be something completely different because it's failed horribly? Let's see. Actually, that's not bad. No, I think that is pretty impressive. There's a weird leg coming out of this frog here, but I would say that is a comparatively successful attempt.

This was produced from a noisy image, so we can change the noise seed, say to 128, and that will create completely different noise, which will probably lead to a slightly different image. It's still the same text prompt, so it's still guided in the same way, but this lets us produce a near-infinite number of frogs on stilts, if that's your thing. It is my thing, actually. I've got quite into producing futuristic cityscapes; I think that's where I spend most of my time with this. That one's gone a bit wrong, but it's still not bad: it looks like a kind of stage, they're just not very froggy. Okay, so we could spend another twenty or thirty minutes producing frogs on stilts, but there's loads of other cool stuff you can do. Presumably you could automate it so it just kept giving you loads? Yeah, and in fact I've done that. For example, I created some nice pictures of dystopian, abandoned, futuristic cities overgrown with plants, then put the generation in a for loop and produced 200 of them so I could pick the nice ones. In here I've got a bunch of awesome-looking city vistas with overgrown plants, and they all look really good. I'm quite pleased; I've got no use for this, but it's quite fun.

The other thing you can do is image-to-image guidance. You take a guide image, noise it nearly all the way, and then reconstruct, so the noise hasn't come from a completely random place and you get an image that bears some resemblance to the guide. You can say, I want a building over here and a tree over here, draw them in, and the generated image will have the same rough shapes, so you can control this process even if, like me, you have absolutely zero artistic ability. To give you an example, this is a picture of my colleague's rabbit, a very cute rabbit. I embedded it, added noise, but not enough to totally remove the image, and then reconstructed it with the text "a wooden carving of a rabbit eating a leaf, highly detailed, 4K, artisan, trending on ArtStation". I don't know whether that last bit does anything, I just thought it'd be fun; I see "trending on ArtStation" put on the end of a lot of prompts. Does it make a difference? I don't know.
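Here is a short sketch of that image-to-image idea using the ready-made pipeline in the diffusers library. The video builds it by hand inside the loop, the filenames here are placeholders, and argument names vary a little between diffusers versions, so treat this as an illustrative assumption rather than the notebook's code.

```python
# A sketch of image-to-image guidance with diffusers' ready-made pipeline.
# The guide image is only noised partway (strength < 1.0), so its shapes
# survive and steer the generation, as with the rabbit example above.
import torch
from diffusers import StableDiffusionImg2ImgPipeline
from PIL import Image

pipe = StableDiffusionImg2ImgPipeline.from_pretrained("CompVis/stable-diffusion-v1-4")
init_image = Image.open("rabbit.jpg").convert("RGB").resize((768, 512))  # placeholder guide image

result = pipe(prompt="a wooden carving of a rabbit eating a leaf, highly detailed, "
                     "4k, artisan, trending on artstation",
              image=init_image,
              strength=0.75,            # how far towards pure noise we go before reconstructing
              guidance_scale=7.5,
              generator=torch.manual_seed(128)).images[0]
result.save("wooden_rabbit.png")
```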
Anyway, it produces a wooden carving of a rabbit, and if you look at the original image versus this one, some things have changed, sure, but the shape is roughly the same: it has guided the process using the original image, and that's how image-to-image works. So if you wanted to create an animation, you could create quite a simple animation of a rabbit jumping about with no artistic ability (actually, I was struggling to do even that) and then use this process on each frame. At the moment there's no kind of temporal consistency, so you will see flickering: if you ever see one of these videos someone has produced online, it will look cool but maybe not consistent, because each frame can subtly change things. But that's the idea.

You can also do loads of weird stuff. This mix guidance is one of my favourite things: here we have two text inputs, we embed both of them, and we guide the generation using the midpoint of the two. So I can say I want a rabbit and I want a frog, and I want you to produce me a 50/50 rabbit-frog. It embeds both and does the exact same process, except that now its text prompt is halfway between those two embeddings. You could potentially build a system with sliders: what amount of frog do you want in this image? Again, I'm not sure what the use case is, but it's quite cool. Here we go; I'm running it for 50 steps again.

While this works, there's loads of other stuff you can do. For example, you could generate an image, then take half of it and generate the other half to expand it outwards, slowly growing your image into an even higher-resolution one if you're limited by resolution. There are going to be a lot of people playing around with this and a lot of different ways to use it; I've already seen plugins for GIMP and for Photoshop and so on. Whatever it is, it's a strange one. We'll put links to the code in the description; have a go. You do need to register with Hugging Face to get access to the weights initially, but then you can use something like Google Colab or your own hardware to generate pictures, and people are having a lot of fun. There are websites now where you can find cool images along with the prompts used to generate them, to give you some ideas. So there's lots of cool stuff to do.

...and the rabbit is the same shape of rabbit, there's a bit more noise, and then we come over here and we end up with just noise; it looks like nonsense... It takes the same amount of time to make one sandwich, but you've got two people doing it, so they make twice as many sandwiches each time. Same with the computer: we could either make the computer processor faster, or...
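Circling back to the mix-guidance trick described above, here is a small sketch of blending two prompt embeddings. The prompts, weights and helper function are illustrative assumptions; the blended embedding would then be fed into the same 50-step loop sketched earlier.

```python
# A sketch of mix guidance: embed two prompts and steer the denoising loop
# with a weighted blend of the two embeddings (0.5 gives the 50/50 rabbit-frog).
import torch
from transformers import CLIPTokenizer, CLIPTextModel

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

def embed(prompt):
    tok = tokenizer([prompt], padding="max_length",
                    max_length=tokenizer.model_max_length,
                    truncation=True, return_tensors="pt")
    with torch.no_grad():
        return text_encoder(tok.input_ids).last_hidden_state

mix = 0.5   # a "slider": 1.0 = all rabbit, 0.0 = all frog
cond = mix * embed("a photograph of a rabbit") + (1 - mix) * embed("a photograph of a frog")
text_embeddings = torch.cat([embed(""), cond])   # unconditional + blended conditional
# ...then run the same 50-step classifier-free-guidance loop shown earlier.
```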
Info
Channel: Computerphile
Views: 230,069
Keywords: computers, computerphile, computer, science
Id: -lz30by8-sU
Length: 16min 56sec (1016 seconds)
Published: Thu Oct 20 2022