Stable Diffusion Deep Dive - CFG - Don't Accidentally Fry Your Images

Video Statistics and Information

Captions

Hello, I'm SiliconThaumaturgy, and welcome to my deep dive series. Ironically, generating good images in Stable Diffusion seems to be more of an art than a science. There's a lot of conflicting information floating around the community about what settings to use. This inspired me to design experiments to isolate and test the impact of various settings. Today we're going to learn what CFG is and what it does.

CFG stands for classifier-free guidance, which by itself tells you basically nothing unless you're an AI researcher or aficionado, and that brings us to the nerdy section of this video. If you don't want to hear this part and just want to know how CFG impacts other settings, just skip ahead to the bookmark in the video description. As a preface, this explanation of CFG is based on a general implementation of CFG and diffusion models and may not precisely match the specific implementation of Stable Diffusion itself. It is also simplified to allow those not from a math or computer science background to follow along.

At the most basic level, Stable Diffusion is gradually changing random noise, step by step, into an image we humans can appreciate. To do this, Stable Diffusion attempts to get as close as possible to a mathematical goal, though this goal is very complicated and cannot be intuitively understood by most people. If you've ever performed textual inversion or training, the loss is a quantification of the difference between the goal and the actual result image. To really simplify things, let's say there are two terms that go towards the Stable Diffusion goal function: conditional and unconditional. CFG adjusts the balance between these two terms. If we only use the unconditional part, Stable Diffusion would just generate a random image from the random noise it is given based on the seed. The conditional term comes from the prompt being fed through a trained encoder, which for Stable Diffusion is called CLIP. This is where our prompt comes into play. What we can see is that at a low enough CFG
value, the conditional term goes to zero and the output will just be the unconditional term, AKA a random image. I actually don't think the CFG slider supports values this low, since even at CFG equals 1 we get images with some resemblance to the prompt. As you increase CFG, eventually the unconditional term becomes zero and then negative. In terms of the output image, this gradually decreases the diversity of output, as it needs to be closer and closer to the conditional term and further and further from the unconditional term. Distortions start appearing in the output image because it becomes less and less possible to approach the goal term. Imagine asking the AI to create a picture that's five times doggier than the doggiest dog to ever dog. That is basically what has happened at high CFG, but with maths. And that concludes the nerdy section for this video. For the too-long-didn't-watch summary: higher CFG increases the emphasis on your prompt and decreases randomness.

When testing this, I ended up doing the most testing out of any of my videos so far, because the relationships between CFG and the other parameters ended up being more complicated than I thought. Also, I may or may not have given myself some headaches by looking at super saturated images for extended periods. First, I ran some plots of CFG versus sampler at constant steps to try to find where CFG caused a bad outcome at both the high and low ends of the CFG range. I didn't test every single sampler, just representative samplers from each group. As it turned out, though, the different samplers behaved a bit differently with regard to CFG, so I drilled down a bit further by running plots of CFG versus steps with individual samplers to dial in the number of steps needed to get a good result at a given CFG. As always, I used a variety of prompts to control for subject matter and style, and got these prompts from Lexica, so thanks to whoever put them there.

For the results, let's start at a very general level. At low CFG, the images tended to
look hazy and blurry due to the low contrast. The images also tended to have composition issues and looked amorphous or liquid. To use an analogy, it's like someone was molding a clay pot and stopped midway instead of firing it. High CFG images would become increasingly high contrast and increasingly saturated. This is particularly noticeable for black and white images, which start having colors appear in spite of the prompt at high CFG. Eventually that contrast becomes so high that the image breaks down into patches of color in the vague shape of the prompt. All samplers were better able to handle higher CFG values when the steps were increased, even beyond the typically recommended range of 4 to 15 CFG. In fact, I would say that increasing the steps actually dampens the overall impact of CFG, not just with regard to color and contrast, and this is why my recommendations will end up getting a bit tricky.

In my subjective opinion, there were broad ranges between where the images were typically good and where the images were typically bad. This is further complicated because image quality was not always linear versus CFG: sometimes you could have a bad image at 20 CFG, a good image at 25 CFG, and another bad image at 30 CFG. The style and subject of the image also played a role in this. In general, I found that photorealistic images did not hold up as well as stylized images, because in a photorealistic image you only expect a certain range of contrast and colors.

Next, let's go into the results from specific families of samplers. If you want to know more about why I grouped the samplers this way, I made a video testing and characterizing all the different samplers and identifying which samplers had matching outputs, the number of steps these samplers needed, and whether these samplers converged or not. A link is in the description below.

DPM adaptive is a special case among the samplers. DPM adaptive actually uses CFG instead of steps. As a result, you can see the general trend in image
quality displayed among the other samplers (lower contrast and saturation at low CFG, and high contrast and saturation at high CFG) in the outputs for DPM adaptive. If you don't mind the high contrast and saturation, you can use DPM adaptive at the maximum CFG for stylized images, though it seemed to fall a bit short for photorealistic images. However, like everything else, it also suffers from the same issues at very low CFG of blurriness and losing composition.

Let's talk about the charts I'm about to show you. Like I said earlier, there is a broad range between where things almost always look good and where they almost always look bad. I am presenting this in three colors: green, yellow, and red. Green means you should almost always get a good result regardless of the prompt. Red means you'll almost always get a bad result regardless of the prompt. Yellow means mixed results depending on your prompt and luck. In the upper right-hand corner, with both high steps and high CFG, things can get weird: even though I might show it as green, there'll be a small chance of the image being bad regardless. That is why you should not view these recommendations as absolutes. Only Sith deal in absolutes. View these as a spectrum. For photorealistic things, you probably don't want to go too far into the yellow zone. For stylized stuff, you can explore a bit deeper into the yellow. For example, just inside the yellow zone near the border between it and green, only 10 to 20% of prompts might get bad results. On the other hand, near the border between yellow and red, almost half of prompts might get bad results.

First up, we're going to cover the other oddball, DPM fast. DPM fast is once again doing its own thing, and not in a good way. Even at the baseline CFG value of 7, DPM fast needs over 30 steps to ensure a decent result. At fewer than 30 steps, you need to bump CFG down to four or five to have any hope of getting a decent result. Like the other samplers, it does get better as steps increase, but it still needs more steps than
any other sampler at any given CFG to get a decent result. At CFGs above 22.5, it didn't matter how many steps you used; things get toasted pretty frequently.

On to the group 1 samplers. There are a whopping 10 samplers in this group. I tested four from this group and found two subgroups, but not divided along the same lines as the subgroups in my samplers video. DPM++ 2M and DPM++ 2M Karras are less tolerant of high CFG values among the group 1 samplers I tested. The difference between these and the other group 1 samplers is mainly seen at low steps and moderately high CFG. These samplers usually didn't see any issues at very low steps until CFG 10. At the default 20 steps, you can only get to CFG 12.5. At 50 steps and above, you are pretty much in the clear for any CFG, other than the infrequent issues observed at high CFGs. Heun and DDIM were the other subgroup I found within the group 1 samplers. These samplers are actually the most robust with regard to CFG of all the samplers, except for DPM adaptive of course. I found that you could usually get to CFG 17.5 at very low steps before seeing issues occur frequently.

Moving on to group 2 samplers. I primarily tested with Euler a, though I did verification runs with the other two to make sure they followed the same general trend. At the default 20 steps, you can go as high as CFG 12.5 without seeing frequent issues. Bumping up to 30 steps allows you to go as high as CFG 20. You can get consistently good images at very high CFG, though the number of steps you will need to use increases non-linearly as you increase CFG.

Last but not least, we have the two DPM++ SDE samplers. Like most other samplers, they are robust to CFG 10 at very low steps. At the default 20 steps, they can handle up to CFG 15.
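The per-sampler ceilings quoted up to this point can be gathered into a small lookup. This is only a sketch: the numbers are the ones stated in the captions for "no frequent issues", while the dictionary name, the function name, and the sampler labels used as keys are hypothetical choices made for illustration. Combinations the captions give no figure for are deliberately left out rather than guessed.

```python
from typing import Optional

# CFG ceilings quoted in the captions, keyed by (sampler label, steps).
# The labels and this structure are assumptions for illustration only.
QUOTED_CFG_CEILING = {
    ("DPM++ 2M", 20): 12.5,
    ("DPM++ 2M Karras", 20): 12.5,
    ("Euler a", 20): 12.5,
    ("Euler a", 30): 20.0,
    ("DPM++ SDE", 20): 15.0,
}


def within_quoted_ceiling(sampler: str, steps: int, cfg: float) -> Optional[bool]:
    """Return True/False when a ceiling was quoted, or None when it was not."""
    ceiling = QUOTED_CFG_CEILING.get((sampler, steps))
    if ceiling is None:
        return None  # no figure quoted for this combination
    return cfg <= ceiling


if __name__ == "__main__":
    print(within_quoted_ceiling("Euler a", 20, 15.0))
    print(within_quoted_ceiling("Euler a", 30, 15.0))
    print(within_quoted_ceiling("Heun", 20, 10.0))
```

Going above a quoted ceiling does not mean every image fails; in the video's terms it means leaving the green zone, where results start to depend on the prompt and on luck.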
The SDE samplers do suffer more than the group 1 and 2 samplers at very high CFG. Around 25 CFG is the highest you can go without starting to see issues. That is not to say you can't get decent images at 150 steps and 30 CFG; you just have to be comfortable with a portion of your images turning out badly.

I'm going to talk real quick about hi-res fix with regard to CFG. As you might have realized, hi-res fix fixes things other than the crappy twinning and fractalization that occur at non-default resolutions. It also helps with other stuff too, like improving faces. In the case of CFG, hi-res fix can mitigate some of the damage if you are playing around in the yellow zone. If you push it too far and get a deep-fried mess, well, it can only do so much.

Next we have prompt emphasis versus CFG. Based on some of the results I got from my prompt emphasis video, I expected emphasis to be additive or multiplicative with CFG with regard to getting bad images, since they both appear to produce the same high contrast, high saturation images at high levels. There were a couple of tests I ran that appear to support this. However, they were outnumbered by the ones where it appeared to do nothing, so I will say that emphasis should be safe to use with CFG, though like CFG, you can only emphasize so far before you start getting issues in the output.

And that concludes our deep dive into the hi-res fix. I hope you found this video useful and enjoyable. If you did, please like and subscribe. Don't be afraid to leave a comment if there are any topics around AI image generation you'd really like to see a video on. Thanks and goodbye.
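The balance between the conditional and unconditional terms described in the nerdy section can be sketched in a few lines. This assumes the commonly used formulation of classifier-free guidance, in which the guided prediction is the unconditional prediction plus the scale times the difference between the conditional and unconditional predictions. As the video itself cautions, this may not match Stable Diffusion's implementation exactly. The function names and the toy three-element vectors are illustrative only and come from neither the video nor any library.

```python
from typing import List, Sequence, Tuple


def cfg_combine(uncond: Sequence[float], cond: Sequence[float], scale: float) -> List[float]:
    """Blend two predictions: uncond + scale * (cond - uncond).

    Algebraically this equals (1 - scale) * uncond + scale * cond.
    """
    return [u + scale * (c - u) for u, c in zip(uncond, cond)]


def term_weights(scale: float) -> Tuple[float, float]:
    """Return the (unconditional, conditional) weights implied by the scale."""
    return 1.0 - scale, scale


if __name__ == "__main__":
    # Toy stand-ins for the model's two predictions (assumed values).
    uncond = [0.2, -0.1, 0.4]
    cond = [0.5, 0.3, -0.2]
    for scale in (0.0, 1.0, 7.0):
        w_uncond, w_cond = term_weights(scale)
        print(scale, w_uncond, w_cond, cfg_combine(uncond, cond, scale))
```

Under this formulation, a scale of 0 leaves only the unconditional term, a scale of 1 gives the unconditional term a weight of zero, and a scale of 7 gives it a weight of -6, which matches the description of the unconditional term becoming zero and then negative as CFG increases.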

Info

Channel: SiliconThaumaturgy

Views: 11,058

Id: kuhO9zAzetk

Length: 11min 40sec (700 seconds)

Published: Sun Feb 12 2023