How Much Information Can Stable Diffusion Handle in a Single Prompt? - Stable Diffusion Deep Dive

Captions
Have you ever tried to have Stable Diffusion generate a simple picture, like, say, a Dungeons and Dragons party, then been really disappointed? After all, what is so difficult about showing a half-princess half-ninja dragonkin, an underwater-basket-weaving goblin shaman, and a half-wolf half-Corgi samurai in battle armor standing on top of an Aztec pyramid in the jungle? I mean, I can picture it so clearly in my head. Why does Stable Diffusion keep getting it wrong? Why does it give me Cronenberg abominations, forgotten limbs, or even worse, depict the half-princess half-ninja as a normal human instead of a dragonkin?

While I can't help you generate that perfect picture, what we will do today is quantify exactly how much information Stable Diffusion can handle in a single prompt. User Filter initially asked how to determine the length of an effective prompt and the point at which words in a prompt start to be ignored or misinterpreted. This is a great idea, but I had to ponder how I could test this in a way where I could share concrete findings with you. What I eventually settled on was that I would test the number of subjects that Stable Diffusion can handle at once, with and without details for those subjects. Initially I wanted to make this a Basics video, but of course, me being myself, this kind of spiraled out of control and ended up becoming my longest Deep Dive to date. Not that that's a bad thing, except maybe for my sleep schedule.

On to the design of the experiment. I actually made a chart for this one because it's pretty complicated; it would be hard to understand if I just talked about it. The core of the study is what I call the subject expression. Basically, this is where I'm telling Stable Diffusion what should be in the image. I broke this down by the number of subjects, how these subjects were communicated, how many different types of subjects there were, and whether these subjects had specific details. Altogether there are seven categories.

When making my previous CFG video, I ended up staring at borderline images for a while wondering whether they were good or not. Since the desired outcome is knowing exactly how often Stable Diffusion can succeed under a particular level of image complexity for the prompt, the experiment was designed so that I quickly knew whether something was passing or failing. In each prompt I specify a certain number and certain types of subjects, sometimes accompanied by a description. For me to consider an image passing, it must have the exact number of each subject specified; if it had any more or fewer of those subjects, it was considered a fail. This may seem strict, but by doing this I ensure that the results are more conservative than what the average user would deem acceptable. If a subject other than the ones I specified showed up, I ignored it. For those of you with knowledge of statistics: while a pass/fail scale is easier to evaluate, it has lower power with regards to predicting results, so you need more data to make your conclusions. A lot more data. I did a batch count of nine for these, so between eight subjects and eight prompt endings, that is 576 images at each condition. I tested over 40 conditions, which means I looked at over 15,000 images.

To control for details and styles in the prompt, I made a variety of endings. I used three main chunks: a common neutral description, which was "photorealistic, highly detailed, high resolution"; a description of a style, which was "impressionist painting"; and finally some specific artists, and you guessed it, it's none other than your boys Greg Rutkowski and Alphonse Mucha. I also made every combination of these endings and a blank ending, for a grand total of eight endings.
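As a concrete illustration of how those eight endings and the per-condition image counts come together, here is a minimal Python sketch. The three chunks are quoted from the video; the exact wording of the artist chunk and the comma joining are assumptions on my part.

```python
# Minimal sketch: build the 8 prompt endings (every combination of the three
# chunks, including the blank ending) and count images per condition.
from itertools import combinations

CHUNKS = [
    "photorealistic, highly detailed, high resolution",  # neutral description
    "impressionist painting",                            # style
    "by Greg Rutkowski and Alphonse Mucha",              # artists (wording assumed)
]

# Every subset of the three chunks, from the empty (blank) ending up to all three.
endings = [
    ", ".join(combo)
    for r in range(len(CHUNKS) + 1)
    for combo in combinations(CHUNKS, r)
]
assert len(endings) == 8  # blank + 3 singles + 3 pairs + 1 triple

# Images per condition: 8 subjects x 8 endings x a batch count of 9.
subjects_per_condition = 8
batch_count = 9
images_per_condition = subjects_per_condition * len(endings) * batch_count
print(images_per_condition)  # 576
```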
Finally, I decided to screen some other variables to see if they make a difference. For these I tested CFG, steps, and samplers.

Before I try to quantify what level of complexity Stable Diffusion can handle, let's talk generally about each of the individual variables and what overall trends were found. To the surprise of absolutely no one, Stable Diffusion performs progressively worse and worse as you ask it to do more things. There are several ways this can manifest. When you ask for a specific number of things, you can end up getting more or fewer objects than you wanted. If you don't need a precise number of things, this isn't that bad, but Stable Diffusion evidently has a more relativistic concept of numbers than we humans do. More unfortunately, Stable Diffusion might also chop the image up into pieces and put a subject in each one of those pieces. For my study I did not count this as a fail, even though it is something that would likely make you, as the end user, discard the image. For larger numbers of subjects, I noticed that they would become amalgamated or intertwined more often. Stable Diffusion also really struggled with larger objects like houses and cars: while the subject was nearly always present in the result, it was rare to get the correct number beyond one or two.

When descriptions were added, Stable Diffusion could start confusing the details between subjects. For example, if I asked for a brown dog and a black cat, I might end up with a black dog and a brown cat instead. The descriptions can also get blended together instead of implemented separately. As an example of this, I used "blonde" and "red-haired" a lot in my testing, and together they seemed to generate a lot of people with strawberry-blonde hair instead of individuals that were either blonde or red-headed. Descriptions also tended to pull everything in the image towards that description, not just the intended subject. This is especially noticeable with colors: the color may be strongest on the targeted subject, but the whole image tends to get at least a little bit of that color. As with other elements in the prompt, not all descriptors have equal impact. For example, whenever I had "red" as a descriptor, it would become a prominent color in the output image even if it wasn't the first descriptor listed.

But the subject isn't the entire prompt. Unless you want some generic, ugly images, you also have to add descriptors at the end of the prompt to modify the overall image properties. Unfortunately, my results seem to indicate that any words in the prompt other than the subject will decrease Stable Diffusion's ability to match the prompt precisely. Fortunately, neutral descriptions like "highly detailed" or "high resolution" have the least impact, so don't be afraid to use them if you think they'll enhance your output. On the other hand, the specified style and artists severely impact the ability of Stable Diffusion to meet the prompt, and this was not equal across all subjects I tested. For example, the artist section of the prompt made it very difficult to get an image of a planet, even when asking for only a single planet. When you think about it, this makes sense, because Alphonse and Greg didn't have very many depictions of planets in their works. Or maybe they did; art history isn't my specialty. Similarly, the part of the prompt with "impressionist painting" made it very challenging for the model to display phones by itself. During impressionism's peak there weren't any phones around, which means there are probably very few representations of phones in impressionist artwork, and this is probably why Stable Diffusion struggles to portray them in an impressionist painting. Strangely enough, when paired with the artists portion of the prompt, Stable Diffusion does manage to get phones, but they're always being held in someone's hand.
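To make the strict pass/fail criterion described earlier concrete, here is a small sketch of how an image could be scored: only the subjects that were explicitly requested are checked, extras are ignored, and the count must match exactly. The function names and data layout are hypothetical; in the study itself, the counting of subjects in each image was done by eye.

```python
# Hypothetical sketch of the strict pass/fail criterion: an image passes only if
# every requested subject appears in exactly the requested quantity; subjects
# that were not requested are ignored.
def image_passes(required: dict[str, int], observed: dict[str, int]) -> bool:
    return all(observed.get(subject, 0) == count for subject, count in required.items())

def pass_rate(required: dict[str, int], batch: list[dict[str, int]]) -> float:
    """Fraction of a batch (e.g. 9 images per prompt) that meets the criterion."""
    return sum(image_passes(required, counts) for counts in batch) / len(batch)

# Example: the prompt asked for one dog and one cat.
required = {"dog": 1, "cat": 1}
batch = [
    {"dog": 1, "cat": 1},              # pass
    {"dog": 2, "cat": 1},              # fail: one dog too many
    {"dog": 1, "cat": 1, "bird": 3},   # pass: unrequested birds are ignored
]
print(pass_rate(required, batch))      # prints 0.666...
```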
Due to the practically infinite possible variations on prompts, it is hard to give any absolute rules here. However, I will say that when crafting your prompts, you should be mindful of which styles and artists you invoke: if they would never or rarely depict your subject matter, that will almost certainly make getting the output you want more difficult.

And now, the moment you've all been waiting for: it's time to nerd out over some charts and graphs and really break down what Stable Diffusion can do in concrete terms. In this section I'm mostly going to talk about the number of subjects and details that Stable Diffusion can handle. While I did talk about the influence of the rest of the prompt in the previous section, this study wasn't really designed to quantify that, so I'm not going to go into it here.

First, let's talk about how to interpret the set of charts I'm going to show you. On the x-axis we have the number of subjects, and on the y-axis we have the average percent passing for that particular condition. All these charts are generated with the Euler a sampler at 30 steps and 15 CFG. There are three lines: maximum, minimum, and average. The maximum line is the prompt ending with the highest average pass rate across all the subjects. This is usually going to be the bare subject without any details added, though occasionally it is the neutral ending. Think of this as the performance of Stable Diffusion under the best possible conditions: basically, you write the prompt in a way so that all the details are neutral or support what you want to show. The next line is the average line. This is the average of averages: the average passing rate for all subjects with all eight prompt endings I tested. Think of this as the expected performance of Stable Diffusion: you write a prompt, and maybe it has a detail or two that are incompatible with each other. The final line is the minimum line. This is the lowest pass rate for any single subject across the eight prompt endings. Think of it as the model's capability when you've written a prompt that has multiple or seriously incompatible concepts within it.

So, starting out, we have the number of subjects expressed as a direct number, without any details. As an example, this is stuff like "one cat" or "three planets". For this one, it starts strong at one subject, stays strong at two subjects, and then falls off sharply with three or four subjects. With more subjects we see a gradual decline, but by four most of the damage has already been done. Next we have direct expression of the number of subjects, but with two types of subjects instead of just one, like "one dog and one cat" or "two men and two women". As you would expect, the success rate falls off faster than with a single subject type: just asking for one of each subject yields a lower success rate than asking for three of a single type of subject. Asking for three total is okay, but after that the success rate is in the dirt. Finally, we have direct expression of the number of subjects with three types of subjects, like "one dog, one cat, and one man", and you guessed it, the success rate plummets even faster. Trying to get three different types of subjects has about the same success rate as asking for five of a single subject, and quite frankly, that isn't great.
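For readers who want to see how those three chart lines could be computed, here is a small sketch. The data layout (a dict keyed by subject count, subject, and prompt ending, holding batch pass rates) is hypothetical; only the definitions of the maximum, average, and minimum lines follow the description above.

```python
# Sketch of the three chart lines (maximum, average, minimum) from raw pass rates.
# results[(n_subjects, subject, ending)] = pass rate (0.0-1.0) over the 9-image batch.
from collections import defaultdict
from statistics import mean

def chart_lines(results):
    by_n = defaultdict(list)  # rows grouped by the number of subjects requested
    for (n, subject, ending), rate in results.items():
        by_n[n].append((subject, ending, rate))

    lines = {}
    for n, rows in by_n.items():
        # Maximum line: the single prompt ending with the highest average pass
        # rate across all subjects at this subject count.
        per_ending = defaultdict(list)
        for subject, ending, rate in rows:
            per_ending[ending].append(rate)
        maximum = max(mean(rates) for rates in per_ending.values())

        # Average line: the average of all subject x ending pass rates.
        average = mean(rate for _, _, rate in rows)

        # Minimum line: the worst single subject x ending combination.
        minimum = min(rate for _, _, rate in rows)

        lines[n] = {"max": maximum, "avg": average, "min": minimum}
    return lines
```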
The next way I expressed the number of subjects was indirectly, by denoting each subject with an "a" or an "an". For example, I would say "a man and a man" or "a dog and a cat". Using this method for a single type of subject, things fell off a bit quicker compared to asking directly for the number of subjects. Once I started requesting three or more of a single type, like "a woman and a woman and a woman", it seemed like Stable Diffusion got confused and wasn't able to process it properly. The CLIP encoder probably can't handle this format, and to be fair, why would you write a prompt like this anyway? Which neatly brings us to our next category: multiple types of subjects expressed indirectly, which is something like "a man and a chair and a computer". Surprisingly, this actually did better than indirect expression with a single type of subject at both two and three types of subjects, though the gap had closed by four subjects.

And at long last, we are finally at the part where we start adding descriptions to the subjects. This first chart is when we have different descriptions but only a single type of subject; an example of this would be "a blonde woman and a red-haired woman". It's hard to compare for this one, since repeating the same subject without descriptors didn't work very well. But, bucking the trend, when we specify multiple types of subjects with descriptions, we get basically the same performance as when only a single type of subject with different descriptions is used. Based on this, it seems that the subject and description are combined by Stable Diffusion when going through the encoder, so a blonde woman is considered a distinct entity from a red-haired woman. This makes it easier for Stable Diffusion to handle descriptions attached to subjects.

Finally, let's talk about the other variables I screened: sampler, steps, and CFG. I had high hopes for these. CFG is supposed to make the model pay more attention to the prompt, and steps gives the model more time to do what it needs to do. Unfortunately, CFG just seemed to act like a slider for saturation and contrast, like it normally does. If you want the full details of what CFG does and the right ranges to use for each sampler, check out my CFG Deep Dive video linked in the video description below. I also tested out four different samplers: Euler a, DPM++ 2M, DPM adaptive, and DPM++ SDE Karras. While there is a little difference between the minimums and the maximums, the average success rate was very consistent between the samplers, so I think it's safe to say that they don't make much of a difference. Finally, I tested steps, and once again, even though I cranked the steps up to 150, there wasn't really much difference in the success rate at all. So yeah, it doesn't help. Since none of these parameters seem to change things, I would guess that the limiting factor in Stable Diffusion's ability to portray multiple subjects and details isn't in the image generation stage, but rather in how the prompt is being encoded by the CLIP model. If we were limited by the image generation, I would expect at least one of these variables to improve results.
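As an aside for anyone who wants to reproduce this kind of sampler/steps/CFG screening programmatically, here is a minimal sketch using the Hugging Face diffusers library (recent versions). This is an assumption on my part: the video does not say which toolchain or checkpoint was used, the CFG values below are illustrative, and the A1111-style sampler names are mapped to rough diffusers analogues, with "DPM adaptive" omitted because it has no direct equivalent here.

```python
# Sketch: screening samplers, steps, and CFG with diffusers (toolchain assumed).
import torch
from diffusers import (
    StableDiffusionPipeline,
    EulerAncestralDiscreteScheduler,
    DPMSolverMultistepScheduler,
)

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",  # illustrative SD 1.5 checkpoint
    torch_dtype=torch.float16,
).to("cuda")

samplers = {
    "Euler a": EulerAncestralDiscreteScheduler.from_config(pipe.scheduler.config),
    "DPM++ 2M": DPMSolverMultistepScheduler.from_config(pipe.scheduler.config),
    "DPM++ SDE Karras": DPMSolverMultistepScheduler.from_config(
        pipe.scheduler.config,
        algorithm_type="sde-dpmsolver++",
        use_karras_sigmas=True,
    ),
}

prompt = "one dog and one cat, photorealistic, highly detailed, high resolution"
for name, scheduler in samplers.items():
    pipe.scheduler = scheduler
    for steps in (30, 150):        # baseline and the highest step count mentioned
        for cfg in (7.0, 15.0):    # illustrative values; 15 is the chart baseline
            images = pipe(
                prompt,
                num_inference_steps=steps,
                guidance_scale=cfg,
                num_images_per_prompt=9,  # the batch count of nine per prompt
            ).images
            # images would then be inspected and scored pass/fail by hand
```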
Like many of you, I'm a fan of short, concise summaries, and I'm going to attempt to do that here, but given that there's a lot of nuance and uncertainty in these results, you should take this with a grain of salt instead of as an absolute rule. I decided to create a universal ranking for each condition called complexity. Basically, this is the amount of things I'm asking Stable Diffusion for. To get this, I add up the number of subjects, then either the number of types of subjects or the number of subjects with unique descriptions, then subtract one. I calculated the complexity for each condition tested, then plotted the average of averages for that condition on an XY plot. I used this to create an exponential trend line and generate an equation. The equation can be less accurate at lower complexity because of the nature of exponential equations; if you wanted a truly accurate equation, you would have to use a sigmoid curve like the logistic function. Unfortunately, Excel only supports exponential, and I'm too lazy to fit this manually, but since we care less about quantifying performance at lower complexity, this is fine for our purposes. As the saying goes, all models are wrong, but some models are useful.

What this equation shows is basically a reiteration of what was stated throughout this video: as complexity increases, Stable Diffusion rapidly loses the ability to consistently meet all the prompt's criteria. With each additional level of complexity, the success rate is almost halved. Basically, this is like flipping a coin for every element you want and hoping you get all heads. However, this only applies to the subject specifically. Generic descriptors for the overall image, like styles or "high resolution", are baked into this analysis, though as noted before, if you don't choose these carefully, they can make it much more difficult to get results for particular subjects.

Wow, what a ride. The original question was: how much information can Stable Diffusion handle in a single prompt? The long answer is that you get continually decreasing performance as more elements are added, with some subset of those elements being present in the results. The short answer is, in the words of Patrick Star: three. Take it or leave it.

And that concludes the Deep Dive into prompt complexity. If you found this video useful and enjoyable, please like and subscribe. As always, if you have any topics about Stable Diffusion you'd like to see a video on, don't be afraid to leave a comment asking for one. Thanks, and goodbye.
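To make the complexity rule of thumb from the summary above concrete, here is a small worked example. The exact fitted equation from the video is not given, so the numbers below are only the rough "success rate is almost halved per level of complexity" heuristic anchored to a hypothetical baseline, not the actual trend line.

```python
# Rough illustration of the complexity heuristic described above, NOT the fitted
# trend line from the video: success is assumed to roughly halve per extra level.
def complexity(num_subjects: int, num_types_or_described: int) -> int:
    """Complexity = number of subjects + number of subject types
    (or subjects with unique descriptions) - 1."""
    return num_subjects + num_types_or_described - 1

def rough_pass_rate(level: int, baseline: float = 0.9) -> float:
    """Hypothetical baseline pass rate at complexity 1, halved per extra level."""
    return baseline * 0.5 ** (level - 1)

# "one dog and one cat": 2 subjects, 2 types -> complexity 3.
level = complexity(num_subjects=2, num_types_or_described=2)
print(level, rough_pass_rate(level))  # complexity 3 -> roughly a one-in-four chance
```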
Info
Channel: SiliconThaumaturgy
Views: 5,237
Id: WsWgOpDqJy4
Length: 14min 30sec (870 seconds)
Published: Sun Mar 05 2023