ComfyUI: Style Aligned via Shared Attention (Tutorial)

Video Statistics and Information

Captions
Hi, Seth here, and welcome to the channel. These four images were generated in a batch using the first image as a reference; I can get consistency with the overall style, and I did this without using ControlNet or any adapters. Take another example: the style is consistent across multiple references. The pose, facial expressions, background, hairstyle and lighting are consistent with the reference, and with ControlNet I can get better control. You can also use multiple ControlNets with Style Aligned. Let me show you the workflows and hacks in ComfyUI. I want to thank all the channel members who continue to support the channel.

Style Aligned image generation via shared attention was developed by Google. A big shout-out to Brian Fitzgerald and Jason Phillips from GitHub; they are responsible for bringing this custom node to ComfyUI. Style Aligned is quite different from IPAdapter: it generates a series of images with a consistent style, while IPAdapter incorporates and interprets visual information from an image in image generation tasks. Style Aligned is potent and consistent because it employs shared attention mechanisms within the diffusion process to achieve style consistency in image generation. ControlNet is not required for Style Aligned to function; however, I will show you how to use ControlNet with Style Aligned for better control.

ComfyUI Manager must be installed, and some know-how on how ControlNet functions is required. Go to the ComfyUI Manager and install the ControlNet Auxiliary Preprocessors and the WAS Node Suite. pythongosssss is also required. I will use the Batch Prompt Schedule from FizzNodes, and the seed node from rgthree is used in all workflows. Lastly, install the Style Aligned nodes. After installation, close the command prompt and the browser windows, open a new window and go to this GitHub page; the link is in the description. Click on the __init__.py link, download the raw file, then copy and paste the raw file into this folder. This is required because the code has not yet been merged into the master branch on GitHub as of the date of recording this video; the Style Aligned reference nodes are still a work in progress.

The first workflow is quite basic. Add the loader for the checkpoint. For the positive prompt, add the Batch Prompt Schedule; this node will allow you to send multiple prompts in batches in one flow. Use a normal CLIP Text Encode for the negative. Add the StyleAligned Batch Align node; this node is specifically for generating images with the same style in batches (the reference image workflow comes a bit later in the tutorial). A custom node for the seed is very helpful here, so add the rgthree seed node. Add an Empty Latent node and two KSamplers; I am adding two for comparison. One of the KSamplers will connect with Style Aligned: the checkpoint model connects with the Style Aligned input, which further connects to one of the KSamplers. Connect the CLIP and the VAE. Right-click on each of the KSamplers and convert the seed to an input, then connect the samplers with the custom seed node. The second KSampler connects directly with the checkpoint model, the positive and negative conditioning, and the empty latent.

For the entire tutorial I will stick to the RealVisXL V3.0 checkpoint with the baked VAE; this checkpoint works very well with all the workflows. The turbo models will not work correctly at all resolutions. To minimize errors and inconsistency, I suggest you first try the workflow with this specific checkpoint before your preferred one.
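As background for the shared-attention node used in this workflow, here is a minimal sketch of the idea from the Style Aligned paper: each image in the batch attends not only to its own keys and values but also to those of the reference (first) image. This is an illustration only, not the custom node's code; the tensor shapes are assumptions, and the paper's additional AdaIN step on queries and keys is omitted.

```python
import torch
import torch.nn.functional as F

def shared_attention(q, k, v, k_ref, v_ref):
    """Scaled dot-product attention where every image in the batch also
    attends to the reference image's keys and values, so the reference
    style bleeds into all generations."""
    # Broadcast the reference keys/values across the batch and prepend them.
    k_all = torch.cat([k_ref.expand(k.shape[0], -1, -1), k], dim=1)
    v_all = torch.cat([v_ref.expand(v.shape[0], -1, -1), v], dim=1)
    scores = q @ k_all.transpose(-2, -1) / (q.shape[-1] ** 0.5)
    return F.softmax(scores, dim=-1) @ v_all

# Toy usage: the first image in the batch acts as the reference.
B, T, D = 4, 64, 32
q, k, v = (torch.randn(B, T, D) for _ in range(3))
out = shared_attention(q, k, v, k[:1], v[:1])
print(out.shape)  # torch.Size([4, 64, 32])
```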
If you are not getting the desired results even with random seeds, the problem probably lies with the settings; check the guidance and steps values recommended for the checkpoint. For this specific workflow I am using 15 steps and a guidance of 5 with DPM++ 2M SDE Karras. Make sure both KSamplers have the same settings for comparison.

This is where you enter the dynamic text in the prompt, for example "white cow". The number on the left is the frame at which the prompt changes; for this workflow it will be in increments of one, so the first prompt will be zero, the second will be number one, and so on. The max frames value should equal the number of lines entered in the dynamic prompt; since I have a single line, the value should be one. Whatever you enter here comes before the dynamic keywords entered above, let's say "a cute plushy toy". Anything you want after the dynamic text should be input here, for example "furry, on the moon, depth of field, space, cartoon, open mouth". If I show you the command prompt, you can see it compiles the prompt into a single line. When you add more dynamic text, everything else will remain the same and it will generate multiple prompts and schedule them one by one. Let's add another line, say "brown horse", change the max frames value to two, and the batch size as well.

Style Aligned takes the first image in the batch and uses it as a reference, the plushy cow in this instance. You will be surprised that it copies more than what we usually term a style. Look at the way the horse is positioned: it accurately reproduced the cow's posture and changed the standing horse into a sitting one. Also notice the similarities in the background, and it even manages to change the toy's expression. Another fascinating thing is that it copies the lighting and the shadows as well.

In image generation using neural networks, the difference between group and layer normalization lies in how they adjust the data. Keep this value at "both"; so far I have not seen any difference between selecting group or layer. QKV stands for queries, keys and values. Imagine you're at a large party with many guests, trying to find people who share your interests, let's say skydiving. The query is you asking around, trying to find people interested in skydiving. Keys are like name tags that every guest at the party wears; some guests would have their interest on their name tag. Values represent the story or info that each guest has about skydiving. So you, the query, look for the guests, the keys, who share your interest, and once you find them you exchange stories and tips, also known as the values. Similarly, in neural networks, queries seek out relevant keys, and based on this the corresponding values are used to perform the model's task, processing images in this case. Typically, for batch prompting, q+k is recommended; q+k+v is overkill, though I do use that option further in the tutorial. The scale value is recommended at one. Reducing the scale value reduces the consistency with the first image, thus affecting all the images in the batch; as you can see, most of the details are lost.

Let's check out how reliable and consistent Style Aligned is. I am adding a total of eight animals with different colors. Beautiful, every one of them is consistent. I want to try changing the shark to a blue jay, and it works beautifully. Now changing the prompt to try a flat logo illustration with a shadow. The best way to go about this process is to pick a subject and keep generating until you are satisfied with the style you want; once you choose a style, fix the seed and add more dynamic lines. It can be limiting if you want a huge batch of the same style. One way to overcome that is to create a batch supported by your GPU, then create another batch keeping the first image the same, and keep repeating until you get all your desired subjects.
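To make the dynamic-prompt format described above concrete, here is a small, hypothetical Python illustration of how the scheduler assembles one prompt per frame from the numbered lines plus the pre and append text; the variable names are assumptions, not the FizzNodes source code.

```python
# Hypothetical illustration of how the Batch Prompt Schedule combines the
# numbered dynamic lines with the text before and after them.
schedule = {0: "white cow", 1: "brown horse"}
pre_text = "a cute plushy toy,"
app_text = "furry, on the moon, depth of field, space, cartoon, open mouth"

for frame, subject in sorted(schedule.items()):
    print(frame, f"{pre_text} {subject}, {app_text}")

# The max frames value (and the empty-latent batch size) should equal
# len(schedule), i.e. two in this example.
```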
Without Style Aligned, the design and shadows of the logos are completely different; with Style Aligned you have perfect consistency. I have tested the batch Style Aligned extensively and it works. Another thing is that the subjects do not have to be related; for example, adding a lion and a tree also works. Without changing the conditioning, I am changing the checkpoint, and this one cannot do anything correctly except the burger and the pizza; with Style Aligned it makes a huge difference. Notice how it replicated the blue hue behind the burger. For the first image, focus on the style rather than the subject, and then start the process.

The reason to use ControlNet is to control the reference image; beyond that reference image, Style Aligned does a better job at control than ControlNet itself. I will use the OpenPose XL model. Right-click and add the Upscale Image node and the OpenPose preprocessor. I will use the reference pose from this image. Connect the positive conditioning through the ControlNet; negative conditioning for ControlNet is not required. The Upscale Image node is required here to resize the input; use the crop option if needed. Ensure that the resolution of the image matches the ControlNet and the latent image, because any discrepancy in resolution will give an error.

I am alternating between a male and a female subject. Also, the hair of the first subject is curly, whereas the third and fourth subjects wear different clothing styles. Take note of the two males: even with ControlNet, the images that are not style aligned do not have the exact same pose, whereas with Style Aligned it is identical. Not only that, it even changes the face of the male to be exactly like the reference. It does not change the hairstyle because of the prompt; if I removed the hairstyle from the first prompt, they would be the same. The red color is a bit dark; this is because the reference image's shirt is blue. By changing the color in the prompt, it tries to color over the blue instead of doing a new generation, so lighter colors fade towards blue. I am sure there is a way to overcome this issue, but I need to do further testing.

In this workflow I will first generate an image, which will be the target image. Then I will load up a reference image and try to influence the target style as per the reference while preserving key features of the original generation. Pass the reference image via a VAE Encode; this step is crucial for Style Aligned, as it ensures that the generated images match the textual description and consistently reflect the style of the reference image. Now add the Sample Reference Latent node, drag and connect the sampler input to a KSampler Select node, and add the Basic Scheduler for the sigmas. We will need separate conditioning for the reference latent: the negative conditioning will be the same, but the positive will differ. When you want to randomize generations, change the seed from the latent, not the sampler; this will ensure that the target image remains the same. Also keep the guidance scale here between three and four; I did not get good results beyond this range. Instead of a KSampler, the reference sampler node gets added, which further connects to the VAE Decode. The positive from the target gets connected to the positive input of the reference sampler, and the positive reference conditioning connects with the reference positive input. The negative conditioning is the same for both.
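One practical point from the ControlNet section above is that the input image has to match the latent resolution. The snippet below is a rough, standalone Pillow equivalent of what the Upscale Image node with the crop option does; the file name and target size are placeholders, not values from the workflow.

```python
from PIL import Image

def fit_to_latent(path, width=1024, height=1024):
    """Scale and center-crop an input image so it matches the empty-latent
    (and ControlNet) resolution; mismatched sizes are a common source of errors."""
    img = Image.open(path).convert("RGB")
    scale = max(width / img.width, height / img.height)
    img = img.resize((round(img.width * scale), round(img.height * scale)),
                     Image.LANCZOS)
    left, top = (img.width - width) // 2, (img.height - height) // 2
    return img.crop((left, top, left + width, top + height))

# Example (placeholder file names):
# fit_to_latent("pose_reference.png", 832, 1216).save("pose_reference_fit.png")
```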
The prompt here is significant and sensitive; it should be a close approximation of the reference image. To make things easier, I will add the BLIP Analyzer nodes for the image analysis. Via the Text to Conditioning node, you can connect the BLIP Analyzer to the reference positive input. When using a reference image, keep the scale value at 0.85; as per the released paper, that is the default. Okay, this is probably because I forgot to resize: remember to stick to near-SDXL resolutions only, and don't go too tall or too wide. The ability of Style Aligned to maintain facial consistency is incredible. This reference image is a bit of a stretch due to the sideways pose; however, it manages to do reasonably fine. I tried some random seeds, and quite a few had distortions around the neck or the back, but that's expected. Also note that I purposely generated the target as just a close-up face without a body to test its performance against several reference images.

Adding ControlNet to the workflow is straightforward. The reference positive conditioning connects with the reference ControlNet, which goes into the reference latent and sampler nodes at the same time. The target positive conditioning connects with the target ControlNet, which goes to the reference sampler node. For the reference image I am using Canny, and for the target I am using a depth map. Both ControlNet values are at 0.5; that's a good value to start with. Let's generate the target image first. Since conditioning plays a key role here, let's look at the BLIP analysis. You don't want to put this conditioning in the latent; in this case I would prefer to have it say just "a colorful mask", so copy and paste that manually. Whenever the results are underwhelming, change the share attention to q+k+v. There you go, it worked.

Here are some additional tips when using ControlNet with reference images. Euler Ancestral with the scheduler at normal gives far better results than Karras. I also used double the steps usually required for this checkpoint, and I increased the guidance scale from five to seven; all of these help with the desired outcome. In this example I am taking a sports car as the target generation, and the reference image is of a dragon made out of blue and gold threads. This is quite extreme but manageable. Ultimately I changed the target's positive prompt to say "an orange sports car made of gold and blue threads"; this helped Style Aligned focus its attention where I wanted it to. However, the checkpoint alone could not generate a proper car made of threads, and even after Style Aligned the car shape was a little loose, so I increased the ControlNet value for the car to 0.7, which did the trick. Also, I used Canny for both ControlNets here. I hope you learned something new and the tutorial was helpful. Until next time.
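For readers curious what the Canny preprocessor mentioned above actually produces, here is a rough stand-in using OpenCV; the file name and thresholds are placeholders rather than settings from the video.

```python
import cv2

# A rough stand-in for the Canny preprocessor node: extract edges from the
# reference image so ControlNet can lock down structure.
gray = cv2.imread("reference.png", cv2.IMREAD_GRAYSCALE)
edges = cv2.Canny(gray, 100, 200)
cv2.imwrite("reference_canny.png", edges)
```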
Info
Channel: ControlAltAI
Views: 8,172
Keywords: style align comfyui, style align google, style align image generation, style align ai, stable diffusion, comfyui, style aligned generation, comfyui tutorial, comfyui workflow, style aligned by google, custom nodes, comfyui node workflow, learn comfyui, style aligned, workflow component, comfyui consistent, comfyui nodes, comfyui guide, comfyui easy workflow, comfy ui, ai comfyui, nodes comfyui
Id: 7usmv4L8WQ8
Length: 18min 21sec (1101 seconds)
Published: Mon Jan 15 2024