ComfyUI IPAdapter Advanced Features

Captions
Hello everybody, this is Matteo and I am the developer of the ComfyUI IPAdapter Plus extension. This is a follow-up to my previous video that covered the basics; if you are new to IPAdapter, I suggest you watch my previous video first and then come back here for the advanced stuff.

Let's start with the new features. This is the base workflow that we all know and love, nothing special about it. If I want to use two images, I need another Load Image node, and then we can connect them with a Batch Images node, like so, and it works great. But what if I want to give one image more weight than the other? In that case I have to use another node called IPAdapter Encoder. I can connect the images to the new node instead of the Batch Images node, and if you look closely you'll see that I can set a weight for each image. Let's try to give the first image more weight than the other. As you can see, this node only outputs embeds, and we don't have an embeds input on the Apply IPAdapter node, so we actually need a new node: you can drag the embed into an empty area of the workflow and select IPAdapter Apply Encoded. This node entirely replaces our old one, so we can delete that, connect the IPAdapter back to the model, and this time connect the CLIP Vision to the Encode IPAdapter Image node. Now we can give it a try. First I'm going to add some noise, because it helps with the generation, and as you can see the image is now slightly closer to the first reference than to the second. I can try to lower the weight further, or do the opposite and give the second image more weight than the first, and it works pretty well.

You probably noticed that there are only four slots, but you can actually send a batch of images to each slot. So I can add another image and batch two of the images that will share the same weight, then connect the batch back to the encoder. In this case these two images will have a weight of 1, while the first image will have a weight of 0.6. So you can actually have as many images as you want, but you are limited to four weights, which I think should be plenty (a rough sketch of how weighted embeds can be combined follows below). One thing to remember: if you select any plus model, you also have to set the IPAdapter Plus option to true.

Before I go any further, I believe it's worth reiterating the importance of the Prep Image For ClipVision node. In this case the reference image is a square, so prepping it wouldn't technically be needed, but the CLIP Vision encoder uses bicubic interpolation by default, while Lanczos is a better algorithm. Let me show you the difference: I'm going to connect the reference image directly, without the prepare node, and check the result; then I duplicate the node and go through the prep node, so the image is scaled before going into the encoder. As you can see the difference is small, but the prepped image is slightly more defined; you can see it in the eyebrows, in the eyes, and in other small details. You can also add a little bit of sharpening if you want; very little usually makes a huge difference, and now the result is even sharper (a small resize-and-sharpen sketch is also shown below).

We haven't really talked about all the IPAdapter models yet, so let's go through them. The base one uses 4 tokens to describe your image (actually 8, because it's 4 for the positive and 4 for the negative) and it's good for catching the main characteristics of an image. If I want to get any closer, I have to use a plus model, which uses 16 tokens to describe the image. Let's see the difference: now the result is a lot closer to our reference. The actual style of the image is dictated by the main checkpoint.
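To build intuition for what the encoder's per-image weights do, here is a minimal conceptual sketch in PyTorch of one plausible way to fold weighted reference images into a single embedding. This is not the extension's actual code: the real encoder may combine the tensors differently, and the tensor shapes are purely illustrative.

```python
# Conceptual sketch (NOT the extension's real implementation): fold
# several per-image CLIP Vision embeds into one weighted conditioning.
import torch

def combine_weighted_embeds(embeds: list[torch.Tensor],
                            weights: list[float]) -> torch.Tensor:
    """embeds: one [num_tokens, dim] tensor per reference image."""
    total = sum(weights)
    stacked = torch.stack([e * w for e, w in zip(embeds, weights)])
    return stacked.sum(dim=0) / total  # weighted average across images

# Example mirroring the video: two batched images at weight 1.0,
# one image at 0.6. 16 tokens per image as with a "plus" model.
e1, e2, e3 = (torch.randn(16, 768) for _ in range(3))
combined = combine_weighted_embeds([e1, e2, e3], [1.0, 1.0, 0.6])
print(combined.shape)  # torch.Size([16, 768])
```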
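And here is the small resize-and-sharpen sketch mentioned above, showing with Pillow what the prep node achieves: scaling the reference with Lanczos instead of the encoder's default bicubic, plus an optional light sharpening pass. The file names are illustrative; 224x224 is the usual CLIP Vision input size.

```python
# A minimal sketch of the prep step: Lanczos scaling plus a touch of
# sharpening before the image reaches the CLIP Vision encoder.
from PIL import Image, ImageFilter

img = Image.open("reference.png")

bicubic = img.resize((224, 224), Image.Resampling.BICUBIC)  # encoder default
lanczos = img.resize((224, 224), Image.Resampling.LANCZOS)  # crisper detail

# Very light unsharp mask; small values usually make a big difference.
sharpened = lanczos.filter(ImageFilter.UnsharpMask(radius=2, percent=50))
sharpened.save("prepped_reference.png")
```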
So if I change the checkpoint, I should get a different style, and depending on the subject one model might be better than another.

Another interesting model is the light version of the base model. The light model is useful when your prompt is much more important than the reference image, and it is different from using the base model with a lower weight like 0.2: as you can see, that image is still a little bit burnt, and it still tries to keep the original composition, while with the light model I can set the weight back to 1 and get just a hint of the reference. You'll also notice that the image is much less burnt, so I can set the CFG scale a lot higher, which is always nice.

Another interesting model is the plus face. It is a model trained to describe faces; it is not a face swap, and sending two pictures of the same person does no good. All it does is describe the face you give it as closely as possible. Let's see a few examples, starting with this one. This is not a bad reference. We have to crop the image to the left, and since the model only describes the face, we have to lower the weight and describe a scene in the prompt. In this case I want to keep it simple, so I'll just put "portrait of a man" and add some safety terms to the negative prompt. Okay, let's see how it goes. The result is decent but not stunning. Let's see how the image is actually cropped by previewing the prep node: this is what is actually sent to the encoder, and we want to keep the attention only on the face. To do that, we can try to crop the image better. I have a Crop Image node; let me try to center the face better. Now the face is all we have in our reference. I'll pass the new image to the preprocessor, and as you can see the result is much better and a lot closer to our reference (a tiny cropping sketch follows at the end of this section).

Let's try another image. This is a very bad one, because the face is covered by the hair and there's also a hand in the way, so I'm sure the result won't be great. We have to crop to the top. That was terrible; we also have to tell it that this is a woman. Okay, well, it is better than what I was expecting: it's basically a black-and-white picture of a woman with long hair. Now let's see if we can do any better by sending this to the encoder. It is certainly better, but the reference image is not good to start with, so let's consider this a failure. One last example, and this is what I consider a perfect reference. We just need to crop to the top, and this will probably be great. We can lower the CFG, as always try another seed, and this is a good example of what kind of image you should send to the face model.

Everything we said about the SD1.5 models is also true for SDXL. There's only one thing to be aware of: the base SDXL model needs the SDXL CLIP Vision encoder. The difference is basically the size the CLIP Vision model is trained at, but it doesn't necessarily grant better results, so depending on the subject the SDXL model is either extremely good or terrible; you've got to experiment a little. All the other SDXL models, the ones ending with ViT, require the SD1.5 CLIP Vision encoder, and as you can see, even with the lower-resolution encoder the result is pretty good. Then of course we have the plus SDXL; we have to enable the plus option in the encode image node. The difference between SDXL base and plus is not as marked as between SD1.5 base and SD1.5 plus, but as always you'll have to experiment.

Next we are going to talk about timestepping. Technically there's no timestepping for IPAdapter, but we can kind of simulate it with a KSampler Advanced.
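Before moving on, here is the tiny cropping sketch promised above for the face-model prep. It is a hypothetical helper, not part of the extension; in ComfyUI you do this visually with a crop node, and the coordinates below are illustrative.

```python
# A hypothetical helper for face-model prep: crop a square region around
# the face so the encoder attends mostly to the face itself.
from PIL import Image

def crop_square(img: Image.Image, cx: int, cy: int, size: int) -> Image.Image:
    """Crop a size x size square centered on (cx, cy), clamped to the image."""
    left = max(0, min(cx - size // 2, img.width - size))
    top = max(0, min(cy - size // 2, img.height - size))
    return img.crop((left, top, left + size, top + size))

# Illustrative coordinates; pick them so the face fills the crop.
face = crop_square(Image.open("reference.png"), cx=300, cy=180, size=256)
face.save("face_cropped.png")
```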
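Before walking through the node setup, here is the idea of the two-pass trick as a runnable Python sketch. `sample`, `ipadapter_model`, and `base_model` are stand-ins, not a real ComfyUI API; what matters is how the step range is split between the two samplers.

```python
# Conceptual sketch of the two-pass "timestepping" trick, mirroring two
# KSampler (Advanced) nodes. `sample` is a stub, NOT a real sampler call.
def sample(model, latent, start, end, add_noise, return_leftover_noise):
    print(f"{model}: steps {start}->{end}, add_noise={add_noise}, "
          f"leftover_noise={return_leftover_noise}")
    return latent  # a real sampler would denoise the latent here

TOTAL_STEPS, SPLIT = 20, 6
latent = "noisy_latent"

# Pass 1: checkpoint patched with IPAdapter at a high weight; it shapes
# only the first SPLIT steps, then hands off a partially denoised latent.
latent = sample("ipadapter_model", latent, 0, SPLIT,
                add_noise=True, return_leftover_noise=True)

# Pass 2: the plain checkpoint finishes the remaining steps; noise is
# NOT re-added because the latent already carries the leftover noise.
latent = sample("base_model", latent, SPLIT, TOTAL_STEPS,
                add_noise=False, return_leftover_noise=False)
```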
Let's say you want to make a cyberpunk woman based on this fantasy image. I can try to lower the weight quite a bit and use "cyberpunk woman in full armor" as the prompt. Let's see the result. This is not very cyberpunk. We can try to lower the weight even more, but we are going to lose a lot of the source image: now it is cyberpunk, but there's very little of the reference left. So instead we try a second KSampler, and we connect this one directly to the Load Checkpoint node instead of the IPAdapter. Then we connect the first KSampler to the second, enable the return with leftover noise option, disable add noise in the second KSampler, stop the generation at the sixth step, and start from the sixth step in the second sampler. We set the weight pretty high, because we are using this weight only for the first six steps. Let's see what we've got. Now this is what I'm talking about: the image is very close to the reference, but now it is cyberpunk. It is very burnt, so we have to play a little with the CFG, and we can probably lower the weight a bit more. Let's try another seed. Okay, let me try with another reference image. I have this drawing and I still want a cyberpunk woman, but in this style. I'm going to increase the weight, and let's see what happens. Let's try another seed. Anyway, this time we were able to keep the style of the reference and do something completely different with it. You can get pretty crazy results with this technique, so I encourage you to give it a try.

As the last thing, I want to talk about AnimateDiff, because IPAdapter can be very important in keeping your animations stable. Here I have a very standard AnimateDiff workflow: the yellow nodes are the ControlNet, where I'm loading a series of images that I resize and then pass through a lineart preprocessor; the purple ones are the AnimateDiff nodes. As a first pass I limit the animation to 16 frames to be sure that everything is fine, and then I will increase the batch. Now that I have this cheering frost giant, I extract one frame out of the animation and use it as a reference in the IPAdapter nodes. So this is my frame: I connect the AnimateDiff to the IPAdapter and the IPAdapter to the KSampler. I will increase the frames to 32, and now I can run the final animation. Of course I have already rendered the animation, and this is the result. Here on the left I have the original animation without the IPAdapter, and here on the right the animation redone with the IPAdapter enabled. As you can see, in the chest area there's a lot of noise compared to the IPAdapter version, which is a lot more stable; the hair is also very noisy on the left while it's pretty stable here; and especially the background is full of noise while it's rock solid on the right. This is a very simple way to add stability to an AnimateDiff animation.

Now, as you know, animations take a lot of VRAM, and there's a little trick you can do with IPAdapter to spare about 1.2-1.4 GB of it. Instead of using the Apply IPAdapter node, we are going to encode the embeds: I connect the image to the Encode IPAdapter Image node, and the CLIP Vision as well. I'm also adding the noise, and since I'm using a plus model I'm selecting IPAdapter plus. Then I use the IPAdapter Save Embeds node; the embeds are saved under the embeds directory inside your output directory. I'll call this one "frost giant". I'm not executing anything else from this workflow, so I disable the KSampler, and all this workflow will do is save the embeds into the output directory.
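To make the save/load round trip concrete, here is a conceptual sketch. The real nodes handle the file format and naming themselves, so `torch.save`/`torch.load` below merely stand in for the idea: precompute the embeds once, then generate without ever loading the CLIP Vision model. Paths and tensor shapes are illustrative.

```python
# Conceptual sketch of the embeds round trip; NOT the node's actual
# file format. The point: CLIP Vision runs once at save time, never
# again at generation time.
import os
import torch

os.makedirs("embeds_demo", exist_ok=True)
embeds = torch.randn(2, 16, 768)  # illustrative: pos/neg x 16 tokens
torch.save(embeds, "embeds_demo/frost_giant.pt")  # "save embeds" step

# In ComfyUI you would now move the saved file from output/ to input/;
# the load-embeds node then reads it back without touching CLIP Vision.
loaded = torch.load("embeds_demo/frost_giant.pt")
assert torch.equal(embeds, loaded)
```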
Now that the embeds are saved, I can remove the CLIP Vision, the Load Image, the prepare node, the encode image, and the save embeds nodes, and use the IPAdapter Load Embeds node instead. This node expects the embeds to be in the input directory, so move the file that we just saved into the input directory, and if you refresh you should find the embeds in the dropdown. Now we need the IPAdapter Apply Encoded node, so we connect the embeds, the IPAdapter, and the main model, and we can reactivate the KSampler. By doing this you are never loading the CLIP Vision model, which should help with the VRAM a little. Before executing this workflow, remember to stop the ComfyUI server and restart it so all the VRAM is freed, then run the workflow again.

That's all I wanted to cover in this video; I hope you found it interesting. I know there's some interest in the IPAdapter training script. As far as I know there are no UIs for training IPAdapter, so a tutorial about that would just be me typing on the command line; I don't know if a video is the best platform for that, and maybe a written article would be better. I'll think about it a bit and see what I can do. In the meantime, if you have other topics you want me to cover, just let me know. See you next time, ciao!
Info
Channel: Latent Vision
Views: 12,177
Id: mJQ62ly7jrg
Length: 16min 26sec (986 seconds)
Published: Sun Oct 22 2023