ComfyUI: Advanced Understanding (Part 1)

Captions
Hello everyone, this is Matteo, and today I'm starting a deep dive into ComfyUI and Stable Diffusion. This will be the first in a series of basic tutorials about Comfy, but also about generative machine learning in general. We'll start from the very beginning, but we are going to touch on rather advanced topics, so even if you know your way around Comfy, I bet there will be something for you as well. So let's get into it.

This is the default basic workflow that we all know and love. Let's build it from scratch and analyze each element. A double click on the work area brings up the search dialog; 99% of the time I add nodes from there, but if you don't remember the name of a node you can right-click and look for it in the menu. First of all we need to load a main checkpoint. A checkpoint is a container format that packs three main components. The first is the UNet model, which is the brain of the image generation. The second is the CLIP, or text encoder, which takes care of converting the text prompt into a format the model can actually use. The third is the variational autoencoder, or VAE, which brings the image to and from the latent space. It's an incredibly important element of the image generation, and it's often overlooked. Let me show you what it does.

I made this node called Tensor Shape Debug: it prints to the terminal the dimensional size of various objects used by ComfyUI. The shape of these objects, or tensors, gives us some insight into the information they contain. Let me show you. I'll put the terminal window here so we can see the result. If I open an image and send it to the node, I get as a result 1, 768, 512 and 3. The first is the batch size, the second and third are the height and width of the image, and the last one is the number of channels: red, green, blue. If I make a batch, you'll see that the first number becomes two, as the tensor now contains two images.

As you know, to do any kind of manipulation on this image I'd need to bring it into the latent space. The latent is a smaller representation of the original pixel image that Stable Diffusion can actually use. Let me convert this image to a latent: I need a VAE Encode, I connect the VAE pipeline, and then check the shape with the debug node. Now I get 1, 4, 96 and 64. The dimensions are arranged differently, but what we care about is 96 and 64: they are 768 and 512 divided by 8, so our image has been downscaled eight times per side and can now be used for the generation. This compression is handled by the VAE, which is generally very good at it, but not lossless. When we want to display the latent we need to upscale it back to the pixel space. If I take this latent and decode it, the two images look the same at first sight, but if I pass them through an Image Enhanced Difference node you'll see that there are actually a lot of differences: basically, anywhere that isn't black, something changed. Since VAE encoding/decoding is a lossy, computationally expensive process, it's advisable to stay in the latent space as much as possible and convert to pixels only at the very end. Also remember that when creating an empty latent image, the width and height refer to the final image resolution, but the actual latent size is divided by eight, so it's always a good idea to work with images whose dimensions are multiples of eight.
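If it helps to see those shapes outside of ComfyUI, here is a minimal sketch in plain PyTorch that mimics what the debug node prints. The numbers match the video, but the "round trip" error is just simulated noise, not a real VAE:

```python
import torch

# Pixel-space image tensor as ComfyUI stores it: [batch, height, width, channels]
image = torch.rand(1, 768, 512, 3)
batch = torch.cat([image, image], dim=0)       # batching two images -> [2, 768, 512, 3]

# Latent-space tensor: 4 channels, each spatial side divided by 8
latent = torch.rand(1, 4, 768 // 8, 512 // 8)  # -> [1, 4, 96, 64]
print(image.shape, batch.shape, latent.shape)

# The VAE round trip is lossy; here we fake a decoded image that is slightly off
decoded = (image + 0.02 * torch.randn_like(image)).clamp(0, 1)
difference = (decoded - image).abs()           # non-zero almost everywhere,
print(difference.mean().item())                # which is what the enhanced-difference view reveals
```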
Next is the CLIP. We can drag the output, release the mouse button on the work area and select CLIP Text Encode. I like to change the color of the nodes so I can easily tell which is which, but that's just me. The text encode converts the prompt into embeddings so they can be used by the model to generate something meaningful; we will talk more about embeddings later. For now, let me just connect the prompt to the KSampler.

If the checkpoint is the brain, the KSampler is the heart of the generation, and I could talk about this node alone for hours. But instead of going through all the options one by one, we'll learn by example over the course of this and future tutorials. Let me connect the model, and of course we need the latent: for a new generation we are going to use an empty image. Models are trained at a specific resolution; for SD 1.5 that's usually 512 by 512, so we'll start with that. Then I need to decode the image back to the pixel space, and finally I can display it with a preview.

As prompt I'm going to try something like "close-up illustration of an anthropomorphic panda wearing a full plate armor in an enchanted forest", and some generic negatives. Let's see what we get. Since I want to experiment with various options I need consistent results, so I'm setting a fixed seed. I'm also increasing the batch size so we get more results. The result is good already, but as you can see we have some food, which is something we didn't ask for. This is because the model doesn't treat "full plate armor" as just one token, and if you say "plate" it assumes that you are hungry. So let me remove it and see if it works. Now it's perfect. Let me also remove "illustration", since DreamShaper, the checkpoint that I'm using, already has a strong illustration character. Now it is a bit better. This is just to show you that you don't always need long, overly complicated prompts; the right choice of words is much more important.

Okay, now it's time to open that can of worms that is samplers and schedulers. The thing I'm asked the most is probably "what is the best sampler?" I'm sorry guys, but the answer is: it depends. It depends on the checkpoint, on the CFG scale, on the number of steps, on the complexity of the embeddings, and ultimately on your own personal taste. Let me show you some examples to explain what I mean. I'm copying this KSampler, and with Shift+Ctrl+V I get a new node with all the pipe connections from the previous node. In the second KSampler I set DPM++ 2M Karras, which is regarded as one of the best. I'm also setting the steps very low, to like 10, to see which sampler is able to converge faster; the steps are the number of iterations the sampler has to denoise an image. Let's see the result. As you can see, Euler, even with only 10 steps, was able to denoise the pictures almost completely; here between the eyes we still have some leftover noise, but it's a lot better than DPM++. Now I'm changing the CFG to 2. It means that the model now has a lot more freedom; it's an extreme value that you'd rarely use with standard checkpoints, but it's just for the sake of this experiment. Queue the prompt, and now things have changed completely: the Euler images are very faded out and blurry, while DPM++ was able to generate some very nice results. Even increasing the number of steps to 20, DPM++ still has the upper hand with this generation. We used a very simple prompt, and with a more detailed description things would probably change again. So, as I said: it depends.
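As a quick aside on what the CFG scale actually controls: at every step the sampler makes two noise predictions, one with the positive prompt and one with the negative (or empty) prompt, and the guidance scale decides how far to push the result away from the unconditional prediction. This is a toy sketch of the usual classifier-free guidance formula, not ComfyUI's actual sampler code:

```python
import torch

def guided_prediction(cond: torch.Tensor, uncond: torch.Tensor, cfg: float) -> torch.Tensor:
    # Classifier-free guidance: start from the unconditional prediction and
    # push towards the conditional one, scaled by the CFG value.
    return uncond + cfg * (cond - uncond)

cond = torch.randn(1, 4, 64, 64)    # noise predicted with the positive prompt
uncond = torch.randn(1, 4, 64, 64)  # noise predicted with the negative prompt

loose = guided_prediction(cond, uncond, 2.0)   # low CFG: the model decides more
strict = guided_prediction(cond, uncond, 8.0)  # high CFG: your prompt decides more
```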
There are a few certainties that I can give you, though. First of all, there are two kinds of samplers. I'm not aware of an exact name for the two groups, but let's call them predictable and stochastic: the former are converging, the latter non-converging. Euler, DPM++ 2M, DDIM and UniPC produce predictable noise; all the ancestral and SDE samplers, and a few others, have more randomness, hence stochastic. If I set a converging sampler in the first generation and a non-converging one like DPM++ 2M SDE in the second, you'll notice that Euler keeps the overall composition no matter the number of steps, while the second KSampler keeps making variations even after 100 steps. Why use an unstable sampler, you may ask? Well, sometimes you want to spice up the image generation a little, and a stochastic sampler often gives very interesting results. So in the end you need to experiment, each scenario is different, but if you don't know where to start I'd suggest trying Euler, DPM++ 2M, UniPC, DPM++ 2M SDE, DDPM and DPM adaptive. The same goes for schedulers: Karras is generally good, but some samplers like DPM++ 2M and 3M SDE really like exponential or SGM uniform. There are also some scenarios where normal outperforms the other schedulers, like when working with upscaled latents for some reason, and we'll talk about that soon.

So, to sum up our KSampler: steps is how many times the sampler gets to denoise the image; the CFG, or guidance scale, controls how much the generation follows your directions (lower values, the model decides; higher values, you decide); sampler and scheduler define the denoising strategy and timing; and denoise is another important parameter, but we'll talk about that later.

Nice, with that out of the way we can move on to conditioning. I'm removing the second KSampler and adding a Tensor Shape Debug node so we can see how the embeddings are actually made. If I run the generation now and look at the terminal, I see that I have two tensors, but what I care about is the second value: 77. This is basically the maximum number of tokens per embedding that we can have; even if we use fewer words, the size will still be 77. If I use a very long prompt and check again, the size is now 154, or 77 × 2, and even if I only have 78 tokens the tensor size will be 154. This is important because in a moment it will help us understand various conditioning strategies.

So let's say I want to put a red scarf on our panda. I'm adding "red scarf" to the text prompt to see what happens. It got a red scarf all right, but the color also bled onto the armor, here we've got a red gem, and overall the scene now has a reddish tint, not to mention that for some reason this one panda got boobs. To try to fix that we can use Conditioning (Concat). I add a new text prompt, put the red scarf there, and remove it from the previous prompt. Now I connect them together, and I'm also checking the tensor shape with the debug node. Now we have a lot less red, the scene lost the red tint, and we also got some blue and brown details. Let's have a look at the tensor shape: it's now 154, even though each prompt has far fewer than 77 tokens. What happened is that Comfy concatenated the two full tensors one after the other. This technique makes the tokens in each tensor bleed far less than putting everything into one text prompt.

There's another node that I want to try now: Conditioning (Combine). Let me move everything to the new node and see what happens. Well, now we got something completely different. Let's look at the tensor size to understand what is happening: we now have four tensors instead of two, and the size is 77 for each. This means that the embeddings are not merged but sent directly to the model; the model creates a starting noise for both embeddings, as if they were two different pictures, and then averages them before starting the generation. This also explains why we got some snow in one of the images: without any other context, one of the generations associated the red scarf with winter.
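To make those shapes concrete, here is a small, illustrative sketch in plain PyTorch (padded 77-token embeddings of width 768, as in SD 1.5's CLIP); this mirrors the numbers the debug node shows, it is not ComfyUI's actual implementation:

```python
import torch

# Two prompts, each padded to the 77-token context, with 768-dimensional embeddings
panda_armor = torch.randn(1, 77, 768)   # "anthropomorphic panda wearing plate armor..."
red_scarf   = torch.randn(1, 77, 768)   # "red scarf"

# Conditioning (Concat): the padded tensors are chained along the token axis,
# so the sampler sees one long 154-token context -> far less bleeding between concepts
concat = torch.cat([panda_armor, red_scarf], dim=1)
print(concat.shape)  # torch.Size([1, 154, 768])

# Conditioning (Combine): nothing is merged at the tensor level; both 77-token
# embeddings go to the model separately and the resulting noise predictions are
# averaged, which is why the debug node reports two separate 77-token tensors.
```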
So let's give the second prompt more context. I'm basically replicating the first prompt, but with "red scarf" instead of "full armor", and now the scene is more what we wanted.

There's one more conditioning that I want to explore: Conditioning (Average). Let's say I don't want a panda but a dragon-panda. I can try my luck with just one prompt, "close-up of an anthropomorphic dragon panda". Let's see. Well, it tried: we got some horns and some wings, but we can do better. Now I'm creating two identical prompts, one with "dragon" and one with "panda", connecting them together with a Conditioning (Average), and setting the strength to 0.5. I'm also going to check the debug, and now we got a proper dragon-panda. Since it looks more dragon than panda, I'm going to change the strength a little: I want to give the panda more weight, so I'm setting the strength to 0.4, and now it's much better. If we check the terminal, we see that the tensor size is still 77 even though we used two prompts; that's because both prompts are under 77 tokens, and Comfy averaged them into one tensor.

So, to recap: Conditioning (Concat) takes the two embeddings, puts them one after the other, and sends them together to the sampler. Conditioning (Combine) sends the individual embeddings separately, creates the base noise for both, and then averages the noise to start the generation. Conditioning (Average) takes the two embeddings and averages them before sending a single merged tensor to the sampler.

There's one more conditioning I want to talk about, and it's also probably the most powerful: conditioning timestep. I need one node for each prompt, and then I'm merging them together with Conditioning (Combine). In the first prompt I put "illustration of a fantasy village in spring", and in the second the same but "in winter". Let's check the result. Winter has a stronger weight, so I want to try to lower the strength of winter by setting the start option in the timestep node to 0.2. This means that at the beginning the model will ignore the winter prompt and proceed to create only a spring scene; at 20% of the composition the sampler will start adding winter too. If I increase the start option, the scene gets greener and greener, and if I want to introduce some snow back I can stop adding spring at, say, 70% by setting the end option to 0.7 in the second prompt. This is a very useful tool and I would like to see it used more often. Remember that the initial steps are the most important, so you generally want to start the timestep range at zero for the prompt that should have the higher importance.
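Just to illustrate the idea of timestep ranges, here is a schematic loop using the percentages from the spring/winter example; the dictionaries and print-out are purely illustrative, not how the sampler is actually implemented:

```python
# Each conditioning is active only for a slice of the sampling schedule.
total_steps = 20
prompts = [
    {"text": "fantasy village in spring", "start": 0.0, "end": 0.7},
    {"text": "fantasy village in winter", "start": 0.2, "end": 1.0},
]

for step in range(total_steps):
    progress = step / total_steps
    active = [p["text"] for p in prompts if p["start"] <= progress < p["end"]]
    print(f"step {step:2d} ({progress:4.0%}): conditioning on {active}")
```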
One quick note about textual inversion and word weighting. It's very basic stuff, but I thought it was worth mentioning for the sake of completeness. Once you've downloaded the embeddings and placed them inside the ComfyUI models/embeddings directory, you can access them with the embedding keyword. For example, I have an embedding called bad dream; let's give it a try. And if I want to give it more or less weight, I select it and with Ctrl and the up and down arrows I can increase or decrease the weight. This is of course also true for any other word in the negative or positive prompt. Just remember that embeddings are like words and their position inside the prompt matters, so an embedding at the beginning will have a higher value compared to one at the end.

Great, there's one last thing that we need to cover in this tutorial. We talked about how a checkpoint is actually a container for UNet, CLIP and VAE. This also means that each of these components can be loaded separately with its own node: for the main model you can use the UNet loader, for the text encoder the CLIP loader, and for the VAE the VAE loader. The VAE loader is very important because the checkpoint doesn't always come with the best VAE, so you can load one externally; check the model page on Civitai or Hugging Face, most of the time it says which VAE to use. The UNet and CLIP loaders can be useful when a checkpoint is not available. For example, on Hugging Face I found this very curious model designed to make nail art. If I go into the files and versions tab, I find very familiar directories called vae, unet and text_encoder. From here I can download each model and place them in the ComfyUI unet, clip and vae directories; you can also rename them for easy access. Now the models should show up in the respective nodes. I can delete the Load Checkpoint node and connect the CLIP and model pipelines to the new nodes. This model has a trigger word, which is "nail set", so I start the prompt with that and then I add something like "zombie apocalypse". Let's see what happens. Oh well, that's not too bad. Or let's try dragons. Well! By the way, now you know how to use models even if you can't find them in checkpoint format.

I guess that's all for this first introduction tutorial. It takes me a lot more time to make these videos that cover the basics compared to my standard content, so please let me know if you like it and if you want me to release more. My idea is to alternate advanced stuff and these basic tutorials, but we'll see depending on your reception. That's all for today, see you next time, ciao!
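As a quick reference, this is roughly where the files mentioned in this video live in a standard ComfyUI install; the exact folder names can vary between versions, so treat this as an assumed layout rather than a definitive one:

```
ComfyUI/
└── models/
    ├── checkpoints/   # full checkpoints (UNet + CLIP + VAE in one file)
    ├── unet/          # standalone diffusion models, for the UNet loader
    ├── clip/          # standalone text encoders, for the CLIP loader
    ├── vae/           # standalone VAEs, for the VAE loader
    └── embeddings/    # textual inversions, referenced with the embedding keyword in prompts
```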
Info
Channel: Latent Vision
Views: 58,154
Id: _C7kR2TFIX0
Length: 20min 18sec (1218 seconds)
Published: Fri Jan 12 2024