Stable Diffusion - What, Why, How?

Video Statistics and Information

Captions
Stable Diffusion is a very impressive image generation model, perhaps comparable to DALL·E 2. All of the images you're seeing right now come straight from Stable Diffusion. I'm sure many of you have already seen or heard about it; it has absolutely blown up recently, with plenty of people talking about it, so rather than beat a dead horse I want to do something a little different in this video. I will briefly explain how Stable Diffusion works and why it has gained such a massive following so quickly, but we'll also go quite a bit further. For one, I want to directly compare DALL·E 2 and Stable Diffusion. Lots of people have claimed that Stable Diffusion is about the same as, or at least comparable to, DALL·E 2 in quality, but that claim is never actually addressed in the research paper, so I want to do a direct qualitative comparison here. Beyond that, we'll also go into the code and see how you can use Stable Diffusion yourself. Plenty of people have done that already too, so I want to go a step further and show you how to extend Stable Diffusion beyond what's in the starter notebook. By the end of this video I plan to show you how to recreate something I think is really cool that's been making the rounds on the internet: Stable Diffusion's ability to do image-to-image generation, where you start with an image and transform it into a much better, but still similar, image. That's what you're seeing on screen right now. I know that's a lot, but I have a lot prepared, so if that sounds exciting, stick around.

One last thing before we get started: I cover a lot on this channel, from bigger papers and ideas to keep you up to date with what's going on, to smaller and more niche ideas that I think are pretty cool, so if that interests you, consider subscribing. It really does help out the channel and I appreciate it.

With that out of the way, let's jump into the paper. We're only going to touch on it briefly before the code and examples, because honestly it's not too different from your standard diffusion model paper. There are plenty of details about how the model works, but it's very similar to most previous diffusion methods, with one big difference, which is this part of the diagram on the right. Before I get into that diagram, let me explain how diffusion generally works, using an animation I made for a previous video. The idea, at least for training, is that you start with an image and then add noise to it step by step, so you end up with a sequence of images that get progressively noisier and noisier until there's nothing left but pure noise. Up to this point everything is done programmatically; there's no AI involved, you're just adding noise to an image. Once you have the fully noised image, that's where the machine learning comes in.
The idea is that you take these noisy images and use a model to undo the noise of a single step. Then you can reapply that model for each individual step and hopefully, slowly, recreate the original image by removing the noise step by step. In practice you often won't get back exactly the original image, but this kind of training still works very well.

Going back to the paper, what's different here? Maybe I should point out what's the same first. This is our original image (ignore the E for a second), and it goes through the diffusion process I just described, adding noise to the image step by step. On the way back you see the opposite: this is the U-Net, the model being applied to denoise the image step by step, hopefully ending back at the original image; you try to match the two. So what's different in this model? As I mentioned, it's not too different from everything else; it just adds two extra pieces, an encoder and a decoder. The idea is that you don't feed the observation directly into the diffusion process, because working with giant images, 512 by 512 or, worse, 1024 by 1024, is very expensive; that's a lot of pixels. What they do in this paper is use an autoencoder to first encode the image into a latent space, which is the z in the diagram, do the diffusion entirely in that latent space, and then at the very end use the decoder to get back the predicted image. And that works: it doesn't get you better quality, but it's a lot faster, which is very nice.
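To make that training idea concrete, here is a rough conceptual sketch of one latent-diffusion training step in Python. It is not the paper's actual code: the `encode` and `unet` callables and the beta-schedule values are stand-ins for illustration only.

```python
import torch
import torch.nn.functional as F

# Stand-in DDPM-style noising coefficients (linear beta schedule), for illustration only.
T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

def add_noise(latents, noise, t):
    # Jump straight to step t: sqrt(a_bar_t) * z0 + sqrt(1 - a_bar_t) * noise.
    a_bar = alphas_cumprod[t].view(-1, 1, 1, 1)
    return a_bar.sqrt() * latents + (1.0 - a_bar).sqrt() * noise

def training_step(encode, unet, image, text_emb):
    # One latent-diffusion training step; `encode` and `unet` are stand-in callables.
    with torch.no_grad():
        latents = encode(image)               # e.g. 3x512x512 pixels -> 4x64x64 latents
    t = torch.randint(0, T, (latents.shape[0],))
    noise = torch.randn_like(latents)
    noisy = add_noise(latents, noise, t)      # purely programmatic noising, no model involved
    noise_pred = unet(noisy, t, text_emb)     # the U-Net predicts the noise that was added
    return F.mse_loss(noise_pred, noise)      # train it to match the real noise
```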
That leads into the last thing I want to talk about before we get into the code, which is why this has blown up so much. It really comes down to four core things. First, the paper simply has good results; they're not state of the art for most things, but they're pretty good. Second, and this is a huge one, it's free; it does not cost anything to use. That leads directly into the third point: it's open source, which means you can not only use it but also modify it and build on it. They open-sourced not just the code but also the weights, so you don't have to do your own training, and training is by far the most expensive part of these models. That's a very big plus. Finally, the fourth reason is the relatively low computational requirements. I say low; they're not actually low, but they are low compared to what this would normally take. Vision models like this normally need a massive amount of GPU RAM in particular, which does not come cheap. My own GPU, a 1080 Titan, is a bit dated now but was pretty good when it came out (I guess it's still pretty good), and it has 12 gigabytes of RAM; normally you would need much, much more than that, which means multiple GPUs, or something fancy that slows the process down a lot. Stable Diffusion works on consumer-grade GPUs, which is very nice, because you can just run it on your own computer, at least if you have a decent card. And I'll give you a fifth, bonus reason: I think a lot of people are simply happy that Stable Diffusion is, in a way, showing up OpenAI. People are saying this model is comparable, and it's free, while OpenAI just released DALL·E 2, which is fairly expensive to use and closed source. It's always nice to see someone helping the community by open-sourcing things and providing resources for people to pick up, so kudos to Stability AI.

Now let's finally jump into the code, dig deeper into how this works, generate some of our own samples, and eventually get to that DALL·E 2 comparison. Here we are in a Google Colab, which I'll link in the description along with a lot of other resources if you want to follow along yourself. I have another notebook up on the side as a template, and I'll walk through the code with you step by step. And, from the future: I almost forgot to mention there's another notebook you can follow along with, provided by Stability AI. My notebook covers the same ground; theirs goes into less depth but has more commenting and explains the whole process nicely, while mine has a bit of extra stuff we'll get into. Both are there for you.

The first thing we need to do is install a couple of things: diffusers, a very new library from Hugging Face for diffusion models that makes this very easy, transformers, because we need to encode the text, and a few other miscellaneous packages. One thing I almost forgot: if you're doing this yourself, go to Runtime, Change runtime type, and select GPU, because this will not run fast at all on a CPU. Then we import everything at once so we don't have to come back to it later, set the device to cuda so everything runs on the GPU, and log into Hugging Face. If you don't have an account it's super easy: you get an access token, paste it in here, and log in, and now we have access to the extra Hugging Face features. Hugging Face, by the way, is where we'll be downloading the models from. Next we add a little bit of code to load the model; this alone will load everything we need. We'll come back later and do this more step by step to understand what's going on and customize it, but for now this loads Stable Diffusion v1-4, the main model they recommend using, and puts it on the GPU.
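For reference, here is roughly what that setup cell looks like, assuming the diffusers API as it was around the time of this video; the exact arguments (the fp16 revision, use_auth_token) have changed in newer versions, so treat them as assumptions to check against your install.

```python
# Rough setup sketch (diffusers API circa late 2022; argument names may differ in newer versions).
# In Colab: !pip install diffusers transformers accelerate scipy
import torch
from diffusers import StableDiffusionPipeline
from huggingface_hub import notebook_login

notebook_login()  # paste your Hugging Face access token when prompted

# Load Stable Diffusion v1-4 and move it to the GPU.
pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4",
    revision="fp16",                # half-precision weights to save GPU memory
    torch_dtype=torch.float16,
    use_auth_token=True,            # older diffusers versions gated the weights this way
).to("cuda")
```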
Let's run this really quick; we'll wait a bit and I'll be back once it's done. It looks like I had an issue here, so I'll just restart things; usually that works. Okay, I got it working; I don't know what the issue was, I just ran it again without changing anything and it works now, probably just a bug on the server side.

Now all we need is a prompt. Let me think of one really quick; maybe something like "a cute shiba inu dog". Who doesn't want to see that? OpenAI usually does corgis, so we can show them up with a cute shiba inu. We wrap the call in autocast, which essentially makes this run a bit faster; it uses float16 instead of float32, a minor detail. The pipe is what we just loaded, and it takes our prompt, runs it through everything, and eventually generates a sample image, which we then display. This does take a little time: by default it goes through 50 steps of diffusion. Remember, since we're not training, we don't need an input image; we just start from noise and take 50 steps of trying to go backwards, and out should come an image.

Oh no: "potentially NSFW content was detected". This is one issue if you just use the stock pipeline: it will block images like this. Let's run it one more time; it's random, so we should get a different image, hopefully one it doesn't flag, though I don't know what it detected that would be NSFW. Ah, incredible, this is a little cutie. The colors are a bit bright, but you can see this is pretty good for a model you just downloaded and got running in a couple of minutes. Technically this is running on Google's servers, not my computer, but I could run it on mine, and I have. That's really all it takes; it's super easy to get running.

Now can we dig deeper? What else can we do? The first thing I want to do will help with the DALL·E 2 comparison. If you've used DALL·E 2 before, you'll know that when you put in a prompt you get back several images, I think three, not just one, so you have a few options. So we'll copy in some code from the original notebook I mentioned earlier; I won't explain it, it's just normal Python that takes a list of images and displays them as a grid. Then we need a new prompt. Maybe "alien riding a giant flying beetle in space"; is that how you spell beetle? Let's add an exclamation mark, maybe that will make something happen, and then "digital art", which is a little prompt-engineering trick. I'm not sure I need a comma there, but who cares, let's give it a go. The only difference here is that I'm multiplying the prompt by three, which gives us three prompts and therefore three images, which we then pass into image_grid to get out a grid.
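Here's a sketch of the generation cell plus a grid helper along the lines of the one in the official notebook; `image_grid` here is my own minimal version, and older diffusers versions expose the pipeline output as `["sample"]` rather than `.images`.

```python
from torch import autocast
from PIL import Image

prompt = "a cute shiba inu dog"
with autocast("cuda"):                      # run the forward pass in float16
    image = pipe(prompt).images[0]          # some older diffusers versions use pipe(prompt)["sample"][0]
image.save("shiba.png")

# Simple grid helper, roughly what the official notebook provides.
def image_grid(imgs, rows, cols):
    w, h = imgs[0].size
    grid = Image.new("RGB", (cols * w, rows * h))
    for i, img in enumerate(imgs):
        grid.paste(img, ((i % cols) * w, (i // cols) * h))
    return grid

prompts = ["alien riding a giant flying beetle in space! digital art"] * 3
with autocast("cuda"):
    images = pipe(prompts).images           # three prompts in, three images out
image_grid(images, rows=1, cols=3)
```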
The only downside is that this will take three times as long, so I'll be back as soon as it's done. There will be a lot of waiting when you do this yourself, unfortunately; there are some things you can do to make it faster, which I'll show you in a bit, at the cost of some quality, and that's nice when you're just experimenting and trying to figure things out.

Okay, here we go. Interesting. I'm not sure I'd call this a beetle; it has some beetle-like features, maybe a beetle spaceship, and I guess that could be an alien. The results are mixed: in this one it looks like something is grabbing his hand and pulling him along, like an alien child; I have no clue what's going on in this one, it looks like he's holding a balloon. Is it just me? Anyway, the results are mixed; it's clearly interesting, but it never quite got the prompt.

Now I want to do what I promised at the beginning: a direct comparison with DALL·E 2. We'll come back to the code for some extra spicy stuff in a bit, but since we now have three images per prompt, this is basically DALL·E 2's functionality, so I'll take this prompt exactly as is, go over to DALL·E 2, which I've already got open, paste it in, and wait for our images. Oh my gosh, this is incredible. For one, it's four images, not three; my bad. I'm just going to say it: I think I see a clear winner here. This is only one example so far, but DALL·E 2 is clearly getting the prompt better: we clearly have an alien, and we clearly have beetles in each of these. I will say the quality is very questionable in places, but I love the one on the right; it's so good.

Let's try something else. I like these landscape-style prompts, so let's do "a beautiful ancient forest full of magic and spirits". While that's generating I'll copy the prompt over to Stable Diffusion too and see what we get on both sides. Here's the Stable Diffusion result: pretty good, I'd say. I'm not sure it looks ancient, but it definitely looks old, and it's definitely a forest; I'm not seeing much in the way of magic and spirits, though I could imagine some magical stuff happening here. Let's see if DALL·E 2 did any better. Honestly, it's not that different. The quality maybe looks a little lower on the Stable Diffusion side, though that might just be because we're blowing the images up more. It took "magic and spirits" in a not very literal way, and I wanted something more literal, but I'd say this is about even, maybe slightly favoring DALL·E 2; there seem to be some nicer fine details.

Now let's challenge DALL·E 2, because there's something I know it is absolutely horrendous at, and that's anime characters. It cannot do anime characters to save its life.
Let's try "anime character eating a burrito" and see what happens. By the way, I haven't tried any of these before; I've tried this one on Stable Diffusion but not on DALL·E 2, so I'm very curious what will happen. We'll run it on both sides and be back soon.

Okay, it looks like we're about done. What are we going to get? Oh no, the middle one just got me. There is so much happening here; this is true artwork. Let me save this real quick, I don't want to lose this one. Someone let me know in the comments if you want this too; maybe I can post it publicly. I don't know what's happening in the left one; it looks like we've got a little story going on, which is great. In the middle one, first of all, are these hands? There are so many things wrong: what's happening with that hand, what's going on with the hair, why is the hair turning red and into... I don't know what. And what are these marks in the middle of the face, is this supposed to be Naruto, with the little whiskers Naruto has? The mouth, everything that could possibly have gone wrong went wrong on this one. And this one isn't good either; I wonder what he's trying to say. I wasn't expecting this to be good, honestly; I guess there just isn't much anime training data, so something goes wrong when you ask for anime. To be fair, I don't think DALL·E 2 is great at anime either, so let's see what it came up with. Okay, this is much better than what I've seen it do with anime previously (I have a hilarious image DALL·E 2 once generated; I'll put it up if I still have it). It's kind of anime, not really, and it's not great, but it's clearly better than Stable Diffusion's attempt, definitely clearly better than this.

Let's do one last comparison, quickly, because I want to get back to the code soon, but I have one cool idea: something like "planet-scale halo of water in space". Random, I know. We'll add "digital art" and "trending on ArtStation" (I think that's supposed to be capitalized); these are two little prompt-engineering tricks you can use to make your results better. Generate that on both sides; this will be the last one, then we get back to code, I promise. I have a feeling they're both going to do pretty well on this. Okay, the first Stable Diffusion one, I don't know what's going on there. The second one is kind of beautiful; I don't know what the text is, but it looks like a planet with a ring, a halo of water, which is interesting. And I don't quite know what's going on in the third, but I guess that could be a watery halo of some kind. Now let's see what DALL·E 2 got. Okay, this is really good. DALL·E 2 still amazes me; I got access a while ago and have used it a fair bit, and some of these generations are just so good. I will say it didn't actually get the prompt: unless that's water in the background, and there's clearly a lot of water on the planet, we're not really seeing a halo of water, maybe something down here.
This one on the left doesn't really have a halo of water either; the only one where I can see that kind of going on is this one. So DALL·E 2 doesn't really get the prompt in this case; if anything, Stable Diffusion has been following the prompts a little better, but in terms of pure quality I think DALL·E 2 is knocking it out of the park by quite a bit. I wanted to do this comparison because I've been using Stable Diffusion, I've used DALL·E 2, and I've seen lots of people claim Stable Diffusion is at least comparable to DALL·E 2, and I just don't think that's the case. There is a caveat, though. While DALL·E 2 does seem to have better quality, there's an option you don't see in the pipeline call here, num_inference_steps, which is set to 50 by default; that means 50 denoising steps happen every time we generate a new image. During training the model used, I believe, a thousand of these steps, and in a lot of the comparisons in the paper where they get really good results I think they use around 250 steps, which is a lot more. I've tried that too; the results are usually a little better, though honestly it's hard to tell. So maybe with the right parameters and the right scheduler, which is something we'll get into, you can get better results from Stable Diffusion. But at the very least, if you're running this on a typical GPU or on Google Colab, you're almost certainly going to get significantly better results from DALL·E 2. That isn't to say anything bad about Stable Diffusion; the results are clearly pretty good. I just don't think calling them comparable does DALL·E 2 justice, and it maybe gives Stable Diffusion a bit too much credit. That might be a hot take, but I haven't seen anyone do this comparison yet, so I wanted to do it.

Now that that's out of the way, let's get back to the code. I'll make a few cells to start our new code: we're going to build our own pipeline, one that's customizable so we can do more with it. The first step is loading the models ourselves; instead of loading the whole thing at once, we load each piece separately. First is the VAE, the variational autoencoder, which is the encoder that puts the image into the latent space and converts it back into an image at the end of the whole process. We also need a tokenizer, which converts the text into tokens, and a CLIP text model, the text encoder, because we need to encode our text as well; so we encode the image and we encode the text. Then we need a U-Net. You can look up exactly what a U-Net is if you want; it's essentially a type of model that compresses its input down and then blows it back up, with some extra skip connections, but there's no reason to get into the details now. The U-Net is what does the denoising: once we have the latents from the VAE and the embeddings from the text encoder, we pass them into the U-Net and it predicts what the noise is, so we can remove it.
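Here is roughly what loading the pieces looks like, using the model IDs and subfolder arguments the official example used at the time; treat the exact arguments as assumptions that may differ across diffusers versions.

```python
# Loading the pieces individually (model IDs/arguments as used in the official examples circa 2022).
import torch
from diffusers import AutoencoderKL, UNet2DConditionModel
from transformers import CLIPTextModel, CLIPTokenizer

device = "cuda"

# 1. VAE: image <-> latent space
vae = AutoencoderKL.from_pretrained(
    "CompVis/stable-diffusion-v1-4", subfolder="vae", use_auth_token=True).to(device)

# 2. Tokenizer + text encoder (CLIP)
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14").to(device)

# 3. U-Net: predicts the noise in the latents, conditioned on the text embeddings
unet = UNet2DConditionModel.from_pretrained(
    "CompVis/stable-diffusion-v1-4", subfolder="unet", use_auth_token=True).to(device)
```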
The last thing we need isn't actually another model; it's what's called a scheduler. The reason we need a scheduler is something I'll get into a bit later, but for now just know that it handles the process of stepping through the inference steps, the denoising steps, and keeps that process clean. So we can go ahead and run this; we have to download all of these, so it will take a minute, but they're all pretrained, so once they're downloaded we won't need to do any training and we'll be ready to go. Okay, that took around three or four minutes to download, and we're back, a bit more calmed down now.

Now that we have all the models downloaded, the first step is to take the text from our prompt and turn it into something the model can use, so we write a function that does a few things. First we tokenize our text, which turns it into tokens, and then we run those tokens through the text encoder, which gives us our embeddings. After this something a little strange happens. You'd think, okay, we have our embeddings, we're good to go, we can throw them into the model now, right? Not quite. There's a trick used here, and it's done with lots of diffusion models; I'm not super familiar with it, so heads up, my explanation might have some flaws, but here's what I think is happening. This is for classifier-free guidance: they take what they call unconditional embeddings, which are essentially the embeddings for the empty string, expanded to the same length as the actual prompt, like a base, almost the default output you'd get with no prompt at all, and they concatenate them together with the text embeddings. We'll end up using both, and I'll show you how in a bit. One thing we can do now is test this: for a test prompt we can use something easy like "cute dog". If we run this, we get some text embeddings, and if we print them out you can see they indeed have shape 2 by 77 by 768.
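A sketch of that text-embedding function, following the structure of the official example; it assumes the `tokenizer`, `text_encoder`, and `device` defined above.

```python
import torch

def get_text_embeds(prompt):
    # Tokenize and encode the actual prompt.
    text_input = tokenizer(
        prompt, padding="max_length", max_length=tokenizer.model_max_length,
        truncation=True, return_tensors="pt")
    with torch.no_grad():
        text_embeddings = text_encoder(text_input.input_ids.to(device))[0]

    # "Unconditional" embeddings: the empty prompt, padded to the same length.
    uncond_input = tokenizer(
        [""] * len(prompt), padding="max_length",
        max_length=tokenizer.model_max_length, return_tensors="pt")
    with torch.no_grad():
        uncond_embeddings = text_encoder(uncond_input.input_ids.to(device))[0]

    # Concatenate so one U-Net call can produce both predictions for classifier-free guidance.
    return torch.cat([uncond_embeddings, text_embeddings])

text_embeds = get_text_embeds(["cute dog"])
print(text_embeds.shape)   # torch.Size([2, 77, 768])
```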
The 2 is because we have one set of unconditional embeddings and one set of text embeddings; 77, I believe, is just the max token length; and 768 is the dimensionality of the embeddings. Sorry if you haven't done NLP before and this is confusing; it's a bit much to explain here, but there are plenty of resources out there.

The next thing we want is a produce_latents function, and this is really the bulk of the work. It takes in the text embeddings we just produced, along with a few parameters. Height and width are the output dimensions; I recommend 512 by 512, because the smaller you make it, or really whenever the dimensions aren't 512, you'll probably have quality issues. num_inference_steps is how many times we try to denoise the image; more steps is generally better but takes longer, and 50, the recommended value, works pretty well in my experience. guidance_scale has to do with those unconditional embeddings; 7.5, around 8, is what's generally recommended, and I'll talk about what it actually does in a bit. Then there's a latents argument. Because we're generating purely from a prompt, the output doesn't need to match a specific image, so we just start from a random noisy latent and decode from there; the very first thing we do is call torch.randn to generate a random set of latents. I also add the option to pass in custom latents, so for instance you can pass in the same latents every time to get the same output, since this is the only random part of the whole process; there are some other uses for this that I'll show you later, which are pretty cool. Anyway, we produce the latents, put them on the GPU, and then set the scheduler timesteps.

Now let me try to explain what the scheduler is. I looked at the paper to find out about these schedulers; they seem to be used everywhere in diffusion, but I couldn't find papers or documentation, so I ended up just reading the code, and I think I understand it from that, though my understanding probably isn't perfect, so if someone has a better explanation, I'd love to hear it in the comments. Here's my understanding: when they train the model, they do a thousand diffusion steps. At inference time we don't want to do the full thousand steps, that's a lot; we want to be able to trade off speed versus quality, which is why we're using 50 inference steps here. So each time the U-Net predicts the noise at a step, we need to scale that prediction in a certain way so that at the end of those 50 steps we end up with the finished product.
Essentially, the scheduler handles how to take the predicted noise that the U-Net outputs at each step and figure out how to scale it and apply it to our current latents to get the best results; that seems to be what it does. We're using the default one from the tutorial notebook, and it requires us to do a few things. The first is to scale the latents by this sigma value. I can't explain exactly how this specific scheduler works internally, so when we multiply by sigmas and so on, I'm sorry, I don't know exactly why that works; I'm sure there's a reason, I just couldn't find it, but in general that's what the scheduler is doing, I think.

Next we start the loop over the 50 timesteps; these are our 50 inference steps, undoing the diffusion. We start with the random latents, and inside the loop we first duplicate the latent input (that's the multiply-by-two; it goes along with the classifier-free guidance, which I'll get to in a second), then do some more sigma scaling; again, not all schedulers require this, it's just part of this particular scheduling algorithm. Then the two interesting things happen. First, we use the U-Net: we pass in the latents, the timestep (which comes from the scheduler, by the way), and the text embeddings, and it predicts the noise, not the next image, just the noise. Then that predicted noise gets fed into the scheduler along with the latents, and that gives us the next latents; this is going from x_t to x_{t-1}, essentially undoing one of the noise steps. We repeat this 50 times, or whatever value we set, until we get the final latents, and output those.

You might have noticed I skipped over this part; let me come back to it now. This is what's called classifier-free guidance, hopefully I'm not wrong there. Essentially the U-Net gives you one noise prediction for the text embeddings of the prompt you actually gave, and another for those unconditional embeddings, the empty prompt. You take the difference between the two, scale that difference up a lot by the guidance scale, and add it to the unconditional noise prediction. The idea, as I understand it, is that this helps the model follow the prompt more: you're amplifying the difference between an empty prompt and the prompt you gave, which pushes the output toward the prompt. You can look up classifier-free guidance if you want to know more. Anyway, let's uncomment this and try it out. This will take some time, since it's now doing the full 50 denoising steps.
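Here is a sketch of what that produce_latents function looks like, following the example notebook's structure. The scheduler settings, the sigma scaling, and the exact `step()` signature changed between diffusers versions (older versions pass the loop index and use dict-style outputs), so treat the specific calls as assumptions to check against your installed version.

```python
import torch
from diffusers import LMSDiscreteScheduler
from tqdm import tqdm

# Scheduler settings the example notebook used for Stable Diffusion at the time.
scheduler = LMSDiscreteScheduler(
    beta_start=0.00085, beta_end=0.012, beta_schedule="scaled_linear", num_train_timesteps=1000)

def produce_latents(text_embeddings, height=512, width=512,
                    num_inference_steps=50, guidance_scale=7.5, latents=None):
    if latents is None:
        # The only random part of the whole process: the starting noise in latent space.
        latents = torch.randn(
            (text_embeddings.shape[0] // 2, unet.in_channels, height // 8, width // 8))
    latents = latents.to(device)

    scheduler.set_timesteps(num_inference_steps)
    latents = latents * scheduler.sigmas[0]          # initial scaling required by this scheduler

    with torch.autocast("cuda"):
        for i, t in tqdm(enumerate(scheduler.timesteps)):
            # Duplicate the latents so one U-Net pass covers the unconditional and text branches.
            latent_model_input = torch.cat([latents] * 2)
            sigma = scheduler.sigmas[i]
            latent_model_input = latent_model_input / ((sigma**2 + 1) ** 0.5)

            with torch.no_grad():
                noise_pred = unet(latent_model_input, t, encoder_hidden_states=text_embeddings).sample

            # Classifier-free guidance: push the prediction toward the prompt.
            noise_pred_uncond, noise_pred_text = noise_pred.chunk(2)
            noise_pred = noise_pred_uncond + guidance_scale * (noise_pred_text - noise_pred_uncond)

            # x_t -> x_{t-1}; older diffusers versions pass the loop index i and use ["prev_sample"].
            latents = scheduler.step(noise_pred, t, latents).prev_sample
    return latents

latents = produce_latents(text_embeds)   # shape: (1, 4, 64, 64)
```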
At the end of those 50 steps we don't quite get the image yet; we get the final latents. And there we go: we have our latents, and we can check the shape. We have one latent that is 4 by 64 by 64; that's just the shape of the latent space. So now we can go down a cell (and I should comment that out). With the final latents in hand, the last step, remember, is to pass them through the decoder, and that's what we do here. There's a bit of scaling we need to undo first, simply because of how the latents were scaled during training, so we reverse that; it's just some scaling, there's really not much to say about it. Once that's done we pass the latents through the VAE decoder to get our images, and then we just need to rescale them, permute the dimensions so everything is in the right order, scale up to the 0-to-255 range you always use for images, and convert them into PIL images. And now if we decode, we should hopefully see a cute dog. Yes, because that was our original prompt.

So now we have functions for all the main pieces, and we can put them all together into a pipeline. It takes essentially the same parameters as before and just chains the same steps: get the text embeddings, feed them into the model to get the latents, then decode the latents into images. If you remember the original pipe back up at the top, where we just passed a prompt through it, we've essentially recreated that: we pass a prompt into our new function and it takes it through all the steps we need. We could do the cute shiba inu again, but I want to see another anime character; I loved the one from earlier so much. Let's do "super cool anime character". One thing I'm going to do here, because I don't want to spend forever waiting, is use 20 inference steps instead of 50; that's what the 20 is for. The quality will be a bit worse, but it lets me iterate faster, so heads up, that's why some of these images look a bit rougher from here on. And... incredible. Good stuff. Not quite as good as earlier, but I'll take it; that's pretty cool. We have our own pipeline working.
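Roughly what the decoding function and the combined pipeline look like; the 0.18215 scale factor is the one used in the official example, and older diffusers versions return the decoded tensor directly rather than via `.sample`.

```python
import torch
from PIL import Image

def decode_img_latents(latents):
    latents = 1 / 0.18215 * latents        # undo the scaling used during training
    with torch.no_grad():
        imgs = vae.decode(latents).sample  # older diffusers versions return the tensor directly
    imgs = (imgs / 2 + 0.5).clamp(0, 1)    # [-1, 1] -> [0, 1]
    imgs = imgs.detach().cpu().permute(0, 2, 3, 1).numpy()
    imgs = (imgs * 255).round().astype("uint8")
    return [Image.fromarray(img) for img in imgs]

def prompt_to_img(prompts, height=512, width=512, num_inference_steps=50,
                  guidance_scale=7.5, latents=None):
    if isinstance(prompts, str):
        prompts = [prompts]
    text_embeds = get_text_embeds(prompts)                       # prompt -> CLIP embeddings
    latents = produce_latents(text_embeds, height, width,        # embeddings -> denoised latents
                              num_inference_steps, guidance_scale, latents)
    return decode_img_latents(latents)                           # latents -> PIL images

imgs = prompt_to_img("super cool anime character", num_inference_steps=20)
imgs[0]
```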
So how can we modify this now? How can we get in there? I want to show you some examples. They won't be the craziest things ever, we only have so much time, but I want to get you started on how to dig into this and start modifying the pipeline. What I'm going to do is go back up to produce_latents, copy it down here, and make some modifications. I'll just tell you what we're doing, no reason to keep it a secret: we want to make a video, and the video is going to go from the initial random noise all the way up to the final image. I think that's pretty cool; you've probably seen this with other models, maybe GANs, where they start from pure noise and you can watch the final image slowly take shape. That's the idea here.

So how do we do that? We want to collect all the latents along the way, all 50 outputs of the denoising process, then run all of them through the decoder to get frames, and finally turn those frames into a video. It sounds like a lot, but the general idea isn't too complicated, so I'll go through it quickly. I add a return_all_latents flag, plus a list to track the latents across all the steps; at the end of each step we append that step's latents, and then we change the return so that, when the flag is true, we concatenate and return the latents from every step instead of just the final one. We also need to modify the prompt-to-image function, since it now has to handle multiple latents. There aren't too many changes: we pass through the same return_all_latents argument, and I also add a batch size, set to 2 by default, because if we try to decode all of the latents at once it will probably overload the GPU; it could probably handle more, but better safe than sorry, and I want to avoid having to restart everything. Then, down where the decoding happens, instead of running just the final latents through the decoder, we batch them up, collect everything in a list, and return all of those images at the end. If we run this, the function should be adjusted; hopefully I didn't miss anything.
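A sketch of those two modifications; the function names are mine, and it reuses the `scheduler`, `unet`, `device`, and `decode_img_latents` defined in the earlier cells.

```python
import torch
from tqdm import tqdm

def produce_latents_all(text_embeddings, num_inference_steps=50, guidance_scale=7.5,
                        latents=None, return_all_latents=False):
    if latents is None:
        latents = torch.randn((text_embeddings.shape[0] // 2, unet.in_channels, 64, 64))
    latents = latents.to(device)

    scheduler.set_timesteps(num_inference_steps)
    latents = latents * scheduler.sigmas[0]

    latent_history = [latents]                      # keep every intermediate latent
    with torch.autocast("cuda"):
        for i, t in tqdm(enumerate(scheduler.timesteps)):
            latent_model_input = torch.cat([latents] * 2)
            latent_model_input = latent_model_input / ((scheduler.sigmas[i]**2 + 1) ** 0.5)
            with torch.no_grad():
                noise_pred = unet(latent_model_input, t, encoder_hidden_states=text_embeddings).sample
            uncond, text = noise_pred.chunk(2)
            noise_pred = uncond + guidance_scale * (text - uncond)
            latents = scheduler.step(noise_pred, t, latents).prev_sample
            latent_history.append(latents)

    if not return_all_latents:
        return latents
    return torch.cat(latent_history, dim=0)         # one tensor: initial noise + every step

def decode_in_batches(all_latents, batch_size=2):
    imgs = []
    for i in range(0, len(all_latents), batch_size):  # decode a couple at a time to spare GPU memory
        imgs.extend(decode_img_latents(all_latents[i:i + batch_size]))
    return imgs
```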
Let's try it with a prompt I already prepared, so I don't take too much of your time: "starry night with a violet sky, digital art". We'll call our prompt-to-image function with, let's say, 40 inference steps, which should be good enough, set return_all_latents to true, and we should get back a whole set of images showing the step-by-step process, basically video frames. I'll be back as soon as that's done. While it's running, I'll mention that after the 40 denoising steps there are 20 more steps of progress here, and that's because we now have to decode all 40 images in batches of two instead of just decoding the final image. Okay, that's done, so let's see how many video frames we have: 41, the initial latents plus the 40 diffusion steps. If we look at the first frame, it should hopefully just be weird noise, and there we go, that's noise, which is what we want to see. And if we look at the last one, it should be the final image, and there we go, we get a starry night; I guess I should have expected that from that starting noise.

Now we want to turn these frames into a video. I'm not going to go through this code; I prepared it in advance, and it's nothing machine-learning related, just standard Python using OpenCV to make a video. You can copy these functions if you want and they'll make a video for you. The way you use them is: take the prompt and turn it into a valid file name, add .mp4 to the end, replace the spaces, pass the video frames and the file name into the images-to-video function, and then use the display-video function to show it inline. Hopefully this still works; I only wrote it a couple of days ago, so I don't know why it wouldn't. And there we go, we get a video. Let's try playing it. It's short; you can change the FPS, there are parameters for that sort of thing, and of course you can do this with more or fewer steps; you might get a cooler video with more steps, I'm not sure, but I think this is a good length. You can see that at the very beginning things change slowly, and toward the end I think the scheduler ramps things up to make sure you reach the final image fast enough. So that's one example of a way you can mess with this.
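My OpenCV code isn't shown in the video, but a minimal frames-to-video helper looks something like this; `video_frames` is assumed to be the list of decoded PIL frames from above, and displaying the resulting mp4 inline in Colab is a separate step.

```python
import cv2
import numpy as np

def frames_to_video(frames, path, fps=8):
    """Write a list of PIL images to an mp4 with OpenCV (plain Python, no ML involved)."""
    w, h = frames[0].size
    writer = cv2.VideoWriter(path, cv2.VideoWriter_fourcc(*"mp4v"), fps, (w, h))
    for frame in frames:
        writer.write(cv2.cvtColor(np.array(frame), cv2.COLOR_RGB2BGR))  # PIL is RGB, OpenCV wants BGR
    writer.release()

prompt = "starry night with a violet sky, digital art"
frames_to_video(video_frames, prompt.replace(" ", "_") + ".mp4", fps=8)
```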
The next thing is fairly simple: it's the feature DALL·E 2 has where, if you get an image you like, you can click on it and get similar images. How would we do something like that? I'll run through this one quickly. First I want to generate an image, but this time we're going to fix the latents. Remember I added a latents parameter and said I'd explain why later? Well, here's why: if we can pass in latents, we can reuse them, and since the starting latents are the only random part of the process, reusing the same starting latents always gives the same output. Let's generate something real quick; we should get some kind of steampunk-ish airship. Oh wow, it's coming down from the clouds, that's kind of crazy; I like this one, this is nice. Now, what would happen if we reran this? I can promise you, and I'm not going to run it because I don't want to wait, that we'd get the exact same image, as long as we comment out the random part and reuse the same latents. So the question is: how do we turn this into a slightly different image that's still like this one? Instead of using the exact same latents, we change them up a little and get a different image out. I have a function right here that does just that: it slightly perturbs the latents based on how much we want to change them. You can look through it, it's just standard Python, nothing too complicated. We get new latents by perturbing the previous ones. Let's start with a very small value, like 0.001 on a scale of 0 to 1; this should do almost nothing, so we'd expect almost the exact same image, if not exactly the same. We pass in our new latents and... wow. Even at 0.001 you can already see we get quite a different result, which is pretty neat; I wouldn't have expected such a different result from such a small change. Let's ramp it up to 0.1; this should be a bit more different, though I'm surprised by how different it already was. Yeah, that's a good bit different. And we can keep going, say to 0.4; this might get a little weird because we're not preserving the standardization of the latents. Wow, okay. It's still the same prompt, so we still get an airship, but at this point it's almost completely different; there's some stuff at the bottom, some clouds, but on the whole it's quite different. So that's another way you can play with this to get different results.
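A minimal version of that perturbation idea; the function name and the simple linear blend are my own stand-ins (which is also why larger scales stop preserving the latent statistics), and the prompt here is just illustrative.

```python
import torch

def perturb_latents(latents, scale=0.1):
    """Nudge a fixed set of latents toward fresh noise; scale=0 keeps them, scale=1 replaces them."""
    noise = torch.randn_like(latents)
    return (1 - scale) * latents + scale * noise

base_latents = torch.randn((1, unet.in_channels, 64, 64))
original = prompt_to_img("a steampunk airship, digital art", latents=base_latents)[0]
variation = prompt_to_img("a steampunk airship, digital art",
                          latents=perturb_latents(base_latents, scale=0.001))[0]
```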
Now for the third and final thing: image-to-image. I know some of you have probably been waiting for this, and I think it's pretty cool. The previous two things didn't really change what was happening inside the model; they just used it in a different way. Now we're actually going to change things. The idea, just to remind you: say you draw something very poorly, like me, maybe just a rough outline or sketch, but you want something that looks really cool. You can feed it into something like this along with a prompt describing what you want it to be, run it through the diffusion steps, and get out something that matches the prompt but is still similar to your original image. How do we do that? The first step, of course, is an initial image, so let's generate "a house poorly drawn by a child". You don't actually need a poorly drawn image, you can do this with anything, but I think it really makes the point if we turn this into something much cooler, and this one definitely hits the mark.

The first thing we need to do is re-encode this image, because we're pretending we didn't get it from the diffusion model. We did get this one from the model, but if we had just taken a random image we drew ourselves, the first step would be to convert it into the latent space so we can work with its latents. The way we do that is essentially the reverse of the decoding process: if you go back and look at the decode function, this is almost exactly the same thing in reverse order, using the VAE to encode instead of decode. It takes an image and converts it into image latents, and we can then pass those latents back into our decode function; if it worked, we should hopefully get the exact same image back out. Yes: we encode it, get the latents, decode, and we have the original image again, so we now have a way to go back and forth between an image and its latent space representation.
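A sketch of that encoding step, mirroring the decode function; `house_img` stands in for the image generated above, and the exact return type of `vae.encode` differs a bit between diffusers versions.

```python
import numpy as np
import torch

def encode_img_latents(img):
    """Inverse of decode_img_latents: PIL image -> scaled latents via the VAE encoder."""
    img = torch.from_numpy(np.array(img)).float() / 127.5 - 1.0   # [0, 255] -> [-1, 1]
    img = img.permute(2, 0, 1).unsqueeze(0).to(device)
    with torch.no_grad():
        latent_dist = vae.encode(img).latent_dist                 # the VAE encoder outputs a distribution
    return latent_dist.sample() * 0.18215                         # same scale factor the decoder undoes

img_latents = encode_img_latents(house_img)       # `house_img` = the drawn-house image from above
decode_img_latents(img_latents)[0]                # round trip: should give back (roughly) the same image
```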
Now that we can work with these latents, the next thing we need is a new scheduler. I mentioned that I don't understand exactly how some of these schedulers work internally, but I do understand why we need a different one here. The specific scheduler we were using earlier partly works by storing the last few outputs each time it computes a new step, so it can follow the trajectory you've been on. The problem is that we're no longer starting from a random latent as before; we're starting from the latent of this image, so if we try to resume the denoising process from that latent, there is no previous trajectory up to that point, there's a huge gap, and the previous scheduler won't work well for this case. So we'll use a different, much simpler scheduler that requires much less bookkeeping. You might ask why we weren't using this one from the start, and the answer is simply that it doesn't work as well in general, but for this simple case it happens to be a better fit.

Next we go back up, grab the produce_latents code, and rewrite a small portion of it. The first change is a new parameter, start_step. Let me explain what's going on. Once we have the latents for this image, we're going to add some noise to them, not so much that they're completely overwhelmed, but enough that it's as if the image isn't finished being generated yet; we essentially turn it into a half-done image. Then we give a new prompt, not the old one, and have the model redo the denoising process from roughly partway through, which adapts the image toward whatever the new prompt describes. Hopefully that makes sense; if not, the code should make it clearer. So start_step means we pretend we're already, say, 10 steps into the process and want to generate from there: with 50 inference steps and a start step of 10, there would be 40 steps left, starting from a version of this image's latent state. I say a version, because we need to noise it up a bit first; we don't want to start from a finished image when we're only partway into the denoising process, that would mess things up. It needs to look like the latent state of an image at step 10, so we add noise to achieve that.

For the code changes: first, all the sigma multiplication we needed before goes away; that was only required by the previous scheduler. Then we set the timesteps as before, and if start_step is above zero, meaning we're doing image-to-image, we look up the timestep corresponding to that start step. This is kind of confusing, but we're essentially mapping "start at step 10" onto the training timescale, because what the scheduler does is convert our 50 inference steps onto the timeline of the 1000 training steps; starting at step 10 out of 50, rather than step 0, corresponds to roughly 200 training steps' worth of the process (10 out of 50, scaled to a thousand training steps, is around 200). We need that timestep because the scheduler has a method to add noise to our latents as if they were at that point in the process. Sorry if that's not a great explanation, it maybe went a bit fast, but the TL;DR is that we add noise to the image so it looks like it's at whatever step we pass in instead of step 0. I figured this out just by reading the code, which you can do too; it's all open source in the Hugging Face diffusers library, so go check it out if you're interested. Other than that we do mostly the same thing, with one more difference down at the bottom: this scheduler's step function takes in the timestep itself, on the 0-to-1000 training scale, not the 0-to-50 inference step index. How do I know that, you might ask? Again, I just looked at the code; you can tell if you go take a look.
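Here is a sketch of the image-to-image version of produce_latents. The video never names its replacement scheduler, so DDIM is used here purely as an example of a scheduler with an `add_noise` method that takes training-scale timesteps; device and dtype handling may need adjusting for your diffusers version.

```python
import torch
from diffusers import DDIMScheduler
from tqdm import tqdm

img_scheduler = DDIMScheduler(beta_start=0.00085, beta_end=0.012,
                              beta_schedule="scaled_linear",
                              clip_sample=False, set_alpha_to_one=False)

def produce_latents_img2img(text_embeddings, init_latents, num_inference_steps=50,
                            start_step=10, guidance_scale=7.5):
    img_scheduler.set_timesteps(num_inference_steps)

    # Noise the image latents as if they were already `start_step` steps into generation.
    start_timestep = img_scheduler.timesteps[start_step]
    noise = torch.randn_like(init_latents)
    latents = img_scheduler.add_noise(init_latents, noise, start_timestep)

    with torch.autocast("cuda"):
        for t in tqdm(img_scheduler.timesteps[start_step:]):       # resume partway through
            latent_model_input = torch.cat([latents] * 2)          # no sigma scaling needed here
            with torch.no_grad():
                noise_pred = unet(latent_model_input, t, encoder_hidden_states=text_embeddings).sample
            uncond, text = noise_pred.chunk(2)
            noise_pred = uncond + guidance_scale * (text - uncond)
            # step() wants the training-scale timestep (0-1000), not the loop index.
            latents = img_scheduler.step(noise_pred, t, latents).prev_sample
    return latents

new_embeds = get_text_embeds(["a luxurious house, unreal 3d render"])
new_latents = produce_latents_img2img(new_embeds, img_latents,
                                      num_inference_steps=20, start_step=7)
decode_img_latents(new_latents)[0]
```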
I think that's all we need to do here; hopefully I'm not missing anything. Then we come to the prompt-to-image function, add the same start_step parameter, default it to 0 or something, and pass it through. And I think that's really all we need; it's not too complicated.

Now let's give it a start step, say 7 out of 20, so seven steps into the process: still fairly noisy, but starting to have some shape. Instead of rendering a child's drawing of a house, let's render "a luxurious house, unreal 3d render"; hopefully this transforms the house we had before into this new house. Do we still have the latents? I deleted them, didn't I; let me undo that. These are the image latents from the drawn house, so we pass those in and start at step 7. So what's happening is: we take the latents from the previous image, add some noise so it's as if we're going back to the seventh step, so it's pretty noisy, and then we continue the denoising from there, but with the new prompt, which should hopefully make it look like a luxurious house.

And... it fails on the very last part. This will probably happen to you when you're working with this; I'm not sure if it's a memory leak or just that we've created too many images that are still sitting on the GPU, but either way we need to restart the runtime and re-run everything to get that memory back. If you hit this issue, that's how you fix it, and you don't need to re-download anything, so no big problem. I'll be right back after I've rerun all of it. Okay, we're almost done re-running, and we're regenerating our house poorly drawn by a child; let's see what we get. Oh, that is terrible, even worse than last time. Good, good; we want to turn that hideous atrocity into something nice. This time we just need to make the image latents, set up our scheduler, re-run and redefine our functions, and now we should be able to run this smoothly. Awesome, let's see what we get, hopefully something good. Interesting, but I think there's an issue here: we did 20 diffusion steps, but really we should be starting from diffusion step seven. Let me go back up; it looks like I did indeed miss something: instead of starting the loop from the initial timestep, we want to start it from the start_step we pass in, so that we don't start from the beginning.
I'm not sure how much that actually affected the outcome, but let's try regenerating it, and this time only 13 steps should run, because 20 minus 7 leaves 13. Okay... not the best thing I've ever seen. Here's what I want to do: let's try something else, two things that are kind of similar. Maybe we take a squid, a teal squid, or just "a photo of a squid", and try to turn it into Squidward from SpongeBob, and see what happens; this sounds like an idea that could not possibly go wrong, right? Since this is the last thing we're doing, I'll do 50 steps and we can wait it out; I'll be right back when it's done. Oh jeez, our photo of a squid is not nice. You know what's actually happening: we're using the new scheduler, which, as I mentioned, isn't as good in general, so let me switch back to the old scheduler and redo this until I get a decent image; maybe just once or twice, I'll be right back. Okay, 30 steps this time, and I think I got something. I think we can turn this into Squidward; what could possibly go wrong? Let's rerun this and grab those image latents... oh god, just look at it; this is going to be great if we can actually turn it into Squidward. The issue is that the training set probably has less of this kind of data than DALL·E 2 had; I'm not even sure it knows who Squidward is. So: prompt "Squidward", 30 steps, start from step 7, and let's see what happens, hopefully nothing too atrocious, but you never know; we'll figure out pretty quickly whether it knows what Squidward is. Oh gee. You can see there are similarities here, at least. What if we start from an earlier step, like 2? That adds so much noise that it's almost completely noise, so we should get something that's almost, but not quite, completely different from the original image. Oh, that is terrifying. I'm guessing there's no Squidward, or not enough Squidward, in the training data set, so this is what it thinks Squidward would look like without knowing what Squidward is. If we go the other way and do something like 25, the image is already almost done generating at that point, so we should get something very similar, and yeah, there aren't many differences. We can scale this however we like; at 20 we get something a little more different.

Oh wow, it looks like I've hit my maximum session duration. I think so many people have been using Stable Diffusion recently that Colab has started putting in more restrictions, which is a shame, and I guess I'm contributing to that now. I really don't want to re-run this whole thing, so I'll wrap up here. I'll link all of these resources in the description if you want to use them. I hope this has taught you a lot about how Stable Diffusion works, how to use it, how to extend it, and helped you form your own opinion about it versus DALL·E 2. If you like this sort of thing, consider subscribing; it really does help out the channel and I super appreciate it. Anyway, it's 4 a.m., so I'm going to head to bed. Thank you so much for watching, and I hope to catch you next time.
Info
Channel: Edan Meyer
Views: 193,128
Keywords: stable diffusion, diffusion, machine learning, ai, ml, diffusion model, image generation, art generation, dalle2, dalle 2, dall-e 2, dalle, dall-e, openai, stability.ai, computer vision, deep learning, dl, cv, python, google colab, colab, code, explanation, paper, explained, research, research paper, how to use stable diffusion
Id: ltLNYA3lWAQ
Length: 54min 8sec (3248 seconds)
Published: Mon Sep 05 2022