How to use IPAdapter models in ComfyUI

Captions
Hello everyone, my name is Matteo and I am the developer of an implementation of the IPAdapter for ComfyUI. IPAdapter is basically an image prompter: it takes an image as input, which is encoded and converted into tokens that are then mixed together with your standard text prompt to generate a new image.

There are actually two extensions for the IPAdapter on ComfyUI. One is mine, called ComfyUI IPAdapter Plus, and the other is called IPAdapter-ComfyUI. I believe mine has a couple of benefits. The first is that it closely follows the way ComfyUI does things, so it is more efficient and it shouldn't break when ComfyUI gets updated. It also introduces a couple of important features: one is the noise option, which arguably grants a better result, and the other, which I just implemented, is the option to import and export pre-encoded images.

So let's start from the beginning. This is a very basic workflow; all the magic happens in these nodes. First of all we need to load the IPAdapter model. There are many available, for SD1.5 and for SDXL; we'll talk about all of them later. Then we need the CLIP Vision encoder. There are two of them, one called SD1.5 and one SDXL. It is not as straightforward as it may seem, because the SD1.5 encoder is sometimes used for SDXL models, and we'll talk about that too.

I loaded the SD1.5 image encoder and the IPAdapter SD1.5 model, which is the simplest of the models. I then loaded a reference image and linked everything to the Apply IPAdapter node. The weight is of course the strength; we'll leave it at 1. We do not have any text prompt, so we can use a very strong weight for this image. I am just adding "blurry" in the negative, because that really helps. By default the KSampler is at CFG 8 and 20 steps, so let's see how it goes.

With just one reference image and no text at all we are already able to achieve a pretty decent result, but as you can see the image looks a little burned. The IPAdapter models tend to burn the image a little; we can solve that by lowering the CFG scale, and we also need a few more steps, just to give the model more time to generate the image. Let's try again, and this is already better.

So let's see what we can do to improve this already very decent image. We talked about the noise option. This is an exploitation of the IPAdapter model: by default the IPAdapter sends two images, one being the reference image and the other a black image. What I'm doing is sending a very noisy image instead of a black one, and the noise option is basically the amount of noise I am going to send. Let's try with 0.33 and see the result. It is very pretty.

Of course we can improve even more with some text. Since we are using a text prompt, we also want to lower the weight of the image itself, so the text has higher relevance. We can try again, and this is already a very decent result with just a few words, one word in the negative, and one reference image. To reach the same result with text alone would take a lot of prompt engineering. We can try to add something in the negative; I like to add "horror", sometimes it helps. We can generate some more images just to see how it goes, and you will see that they are all very nice; it's very hard to find a bad picture.
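To make the wiring concrete, here is a minimal sketch of this basic workflow expressed in ComfyUI's HTTP API (JSON) format and queued from Python. This is an illustration under assumptions: the extension's node class names ("IPAdapterModelLoader", "IPAdapterApply") and their inputs match the 2023 version of ComfyUI IPAdapter Plus and may differ in yours, and all model and image file names are placeholders to be replaced with whatever you actually have installed.

```python
# Minimal sketch of the basic IPAdapter workflow in ComfyUI's API format.
# Assumes a local ComfyUI instance on port 8188 with the IPAdapter Plus
# extension installed; file names below are placeholders.
import json
import urllib.request

workflow = {
    # Base checkpoint: outputs MODEL (0), CLIP (1), VAE (2)
    "1": {"class_type": "CheckpointLoaderSimple",
          "inputs": {"ckpt_name": "sd15_checkpoint.safetensors"}},
    # The IPAdapter model and the matching CLIP Vision (image) encoder
    "2": {"class_type": "IPAdapterModelLoader",
          "inputs": {"ipadapter_file": "ip-adapter_sd15.bin"}},
    "3": {"class_type": "CLIPVisionLoader",
          "inputs": {"clip_name": "sd15_clip_vision.safetensors"}},
    "4": {"class_type": "LoadImage", "inputs": {"image": "reference.png"}},
    # Apply IPAdapter: patches the MODEL with the image tokens.
    # weight=1.0 because there is no text prompt; noise=0.0 for now.
    "5": {"class_type": "IPAdapterApply",
          "inputs": {"ipadapter": ["2", 0], "clip_vision": ["3", 0],
                     "image": ["4", 0], "model": ["1", 0],
                     "weight": 1.0, "noise": 0.0}},
    "6": {"class_type": "CLIPTextEncode",
          "inputs": {"text": "", "clip": ["1", 1]}},
    "7": {"class_type": "CLIPTextEncode",
          "inputs": {"text": "blurry", "clip": ["1", 1]}},
    "8": {"class_type": "EmptyLatentImage",
          "inputs": {"width": 512, "height": 512, "batch_size": 1}},
    # IPAdapter tends to "burn" the image at the default CFG 8:
    # lower the CFG and raise the steps a little, as in the video.
    "9": {"class_type": "KSampler",
          "inputs": {"model": ["5", 0], "positive": ["6", 0],
                     "negative": ["7", 0], "latent_image": ["8", 0],
                     "seed": 42, "steps": 30, "cfg": 5.0,
                     "sampler_name": "euler", "scheduler": "normal",
                     "denoise": 1.0}},
    "10": {"class_type": "VAEDecode",
           "inputs": {"samples": ["9", 0], "vae": ["1", 2]}},
    "11": {"class_type": "SaveImage",
           "inputs": {"images": ["10", 0], "filename_prefix": "ipadapter"}},
}

# Queue the prompt on the locally running ComfyUI instance.
req = urllib.request.Request(
    "http://127.0.0.1:8188/prompt",
    data=json.dumps({"prompt": workflow}).encode("utf-8"),
    headers={"Content-Type": "application/json"})
urllib.request.urlopen(req)
```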
Okay, now we are going to try other models. The most interesting is probably the IPAdapter SD1.5 Plus. The difference is essentially the number of tokens these models create for the image: the base model uses only four tokens per image, while the Plus uses 16, so a lot more. Let's see how it goes. We are going to go back to a weight of 1, remove the noise, and try with a vanilla configuration. The image is maybe not as good as before, but it is certainly closer to the reference image, and that is of course important. We can try to improve the image again by adding some noise, and put the text back with a lower weight for the reference image. Again we reach a very decent result, and this time it is a little closer to the reference. Let's try another one.

Now let's talk about how to prepare the reference image for the generation. So far we've used this image, which is a square, so it's very easy for the models to work with. But you may have a portrait image, for example like this one. Let me show you what happens if I send this image without any preparation: let me remove the text, set the weight back to 1, and also remove the noise. As you can see, the image is not centered. This is because the CLIP encoder resizes and crops the image to the center; any image in portrait or landscape mode is cropped to a square in the middle. In this case the center of attention is of course the face of the girl, so we need to prepare the image better for the encoding.

For this purpose I have a node called Prep Image For ClipVision. Instead of sending the image directly to the IPAdapter, I send it to the prep node. As you can see, the crop position is set to top; we can select the best position for the crop. There are other options, but at the moment we don't need them. We send the prepped image to the IPAdapter and generate again, and as you can see the girl is not cropped anymore and fits nicely into the frame.

Next we want to send more than one image to the IPAdapter, so I'm loading a new one; let's use this one. To do that I'm using the Batch Image node: we can simply merge the two images together, like so, and send them both to the IPAdapter. Generate again, and I get a nice merge of these two images. Of course I can add more; let's see what I've got here. I need another Batch Image node: I put this image here and the previous batch there, then send everything to the IPAdapter. Generate again, and now the generated image gets all the features from three pictures.

These two images are already square, so they don't technically need to be prepped for the encoding, but we can try anyway, because sometimes, for example by using a better interpolation algorithm, you can get better results. Let's try, and now the face is a little more detailed, which is nice. We can do the same for the other image with another prep node. It's usually a good idea to prepare the images before sending them to the IPAdapter, but your mileage may vary, so it's always better to experiment. There's one last thing I want to show you: the option to sharpen the prepped images. By just adding a little sharpening to all three images you can sometimes get nicer results. Let's see: it's a lot more defined, you can see there's a black line in this jewel and so on. It's up to you; it's an option that you have.
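What the prep node does can be approximated in a few lines of Pillow: scale the image so the short side matches the encoder's square input (224 px for the ViT-H CLIP Vision model), crop a square at a chosen position, and optionally sharpen. This is an illustrative approximation of the idea, not the node's exact code:

```python
# Rough PIL equivalent of "Prep Image For ClipVision": resize, positional
# crop, optional sharpening. The 224px target matches the ViT-H CLIP
# Vision encoder; the actual node may differ in details.
from PIL import Image, ImageEnhance

def prep_for_clip_vision(path, crop_position="center", sharpening=0.0, size=224):
    img = Image.open(path).convert("RGB")
    # Scale so the shorter side equals the target size; LANCZOS is the
    # kind of "better interpolation" mentioned above.
    scale = size / min(img.size)
    img = img.resize((round(img.width * scale), round(img.height * scale)),
                     Image.LANCZOS)
    # Crop a size x size square; "top" keeps the head in a portrait shot.
    left = (img.width - size) // 2
    top = {"top": 0,
           "center": (img.height - size) // 2,
           "bottom": img.height - size}[crop_position]
    img = img.crop((left, top, left + size, top + size))
    if sharpening > 0:
        # Factor 1.0 leaves the image unchanged; >1.0 sharpens.
        img = ImageEnhance.Sharpness(img).enhance(1.0 + sharpening)
    return img

face = prep_for_clip_vision("portrait.png", crop_position="top", sharpening=0.5)
face.save("portrait_prepped.png")
```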
A quick note about selecting and using multiple images for your generation. Say I want to create Street Fighter characters of famous actors, so I picked a series of six actual Street Fighter characters and I put "Rosario Dawson as a Street Fighter character" in the prompt, with a few negatives. Let's see how it goes. This is pretty good already, and I'm happy with the result. But what happens if I remove four of the images? The image is of course different, but it is not worse: the main elements are all there, and I reached more or less the same result with just two images instead of six. This is because my reference images, despite being all different from one another, all reiterate the same concept: they are all video game characters, they are all female, they are all fighters. So it is pointless to add more of the same element. You are not training a LoRA; you are adding tokens to a composition. When you are adding a new image, ask yourself: what is this image going to add? Does it really add new elements? If it doesn't, you don't need that picture. Shall we try with just one picture? As you can see we lost some elements in the hair, but overall the composition is still good. Since adding new pictures is actually expensive, let's try not to overdo it. Let me try to add something actually new, like some icebergs. Now I have some real change: the setting is on ice, at least.

Okay, let's move on to the next topic. Another very interesting model is called IPAdapter Plus Face. It is a model specifically trained for describing faces. It is not a LoRA, it is not a face swap; it just tries to describe as closely as possible a face that you give it: the ethnicity, the eyebrow shape, the expression, the hair color, and so on. Being just for the face, we need to give a context to our scene, something like "a superhero woman wearing a high-tech costume, cinematic hero pose, dramatic lighting, closeup". We of course need to lower the weight to give the text more relevance. Let's see what we get. As you can see, we have our superhero with a face that is very close to what we gave as reference.

So let's talk about SDXL now. You need of course the IPAdapter SDXL model, the SDXL image encoder, and the checkpoint of your preference. Let's see how it goes. The SDXL base model in my opinion is not great, but again, with the noise option you can get some nice results; you can see it's a lot better. Fortunately we have other models we can use. One of them is the IPAdapter Plus SDXL ViT-H. It is important to remember that all ViT-H models actually need the SD1.5 encoder, even if they are for SDXL checkpoints, so I select the SD1.5 encoder but the SDXL ViT-H model. Let's see how it performs. The composition is very close to the original; the image itself is not great, but we haven't given any prompt. We can try with noise, and now it's very pleasant and also very close to the original. There are other models you can experiment with; just remember that the "vit-h" models require the SD1.5 encoder even if they are for SDXL. The other interesting one is the SD1.5 Light: you can use it when the text prompt is much more important than the image reference; otherwise it is identical to the base SD1.5.
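Since the model/encoder pairing is easy to get wrong, here is the rule of thumb as a small lookup table. The file names are the common ones from the IP-Adapter releases; treat them as examples and check them against your own models directory:

```python
# Which CLIP Vision encoder goes with which IPAdapter model.
# Rule from the video: every "vit-h" model wants the SD1.5 encoder,
# even when it is made for SDXL checkpoints.
REQUIRED_ENCODER = {
    "ip-adapter_sd15.bin":            "sd15",  # base: 4 tokens per image
    "ip-adapter_sd15_light.bin":      "sd15",  # use when text > image
    "ip-adapter-plus_sd15.bin":       "sd15",  # 16 tokens, closer to reference
    "ip-adapter-plus-face_sd15.bin":  "sd15",  # describes faces only
    "ip-adapter_sdxl.bin":            "sdxl",  # the one SDXL-encoder model
    "ip-adapter_sdxl_vit-h.bin":      "sd15",  # vit-h => SD1.5 encoder
    "ip-adapter-plus_sdxl_vit-h.bin": "sd15",  # vit-h => SD1.5 encoder
}

def encoder_for(model_file: str) -> str:
    # Fallback mirrors the rule of thumb: "vit-h" in the name means SD1.5.
    return REQUIRED_ENCODER.get(
        model_file, "sd15" if "vit-h" in model_file else "sdxl")
```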
Now let's see how to add more conditioning to our image, using image-to-image, inpainting, and ControlNets. Image-to-image is straightforward. Let's say I already generated this image over here, but I want it in the style of this reference. All I have to do is encode the image from pixel space to latent space and feed it to the KSampler. I set a denoise of 0.35, so the important features of my original image should stay the same while still receiving quite a bit of conditioning from the reference. Let's see how it goes, and look how pretty she is. Of course, as always, we can improve our composition by adding some text; if I add text, I have to lower the weight slightly. Already better. And as always it is worth a try to add some noise, as it generally grants prettier results. With very little effort we added the style of Botticelli's Venus to our astronaut.

What if I want to change only the face instead of the whole image? That can be done with inpainting. I need the inpainting encoder; we connect the image, then I create a mask (we can roughly select the face), connect it to the KSampler, and set the denoise back to 1. For better results I should select a checkpoint made for inpainting. We of course need to connect the mask, and go. The reference image remains more or less the same, but the face has been updated by the IPAdapter model.

The best way to interact with the IPAdapter is probably through ControlNets. Let's say I want a portrait of a woman in this style, but the head position should be like this. There are many ControlNets I could use; the easiest is probably Canny. It's an extremely fast preprocessor and it's relatively light on the image generation. I connect the Canny preprocessor to the Apply ControlNet node, reducing the strength a little to give the model some leeway; I also lowered the weight of the IPAdapter for the same reason. We can try and check the result. As you can see, we have the general look and feel of the reference image that we sent to the IPAdapter, and the head position of the ControlNet image. The result is already very nice, but as always we can add some noise, and look at that. Just as a reminder: the noise option is exclusive to my ComfyUI extension for the IPAdapter. It is something that not even the IPAdapter developers thought was possible; it's a kind of exploitation of the system, but it works really well, so I decided to make it public.
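For the curious, the Canny preprocessor boils down to classic OpenCV edge detection, which is why it is so fast. A minimal sketch of that step (the thresholds here are illustrative, and the output file name is just an example):

```python
# What the Canny preprocessor amounts to: plain edge detection.
# The resulting edge map is what gets fed to Apply ControlNet.
import cv2

img = cv2.imread("head_pose_reference.png", cv2.IMREAD_GRAYSCALE)
edges = cv2.Canny(img, 100, 200)     # low/high hysteresis thresholds
cv2.imwrite("canny_map.png", edges)
```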
We've already talked about image-to-image, and of course the same concept can be used for upscaling. The IPAdapter is so good at describing an image that it is very effective when upscaling with a pretty high denoise. In this case I'm using a 0.5 denoise for the upscale, and a very simple upscale without a model. I'm going to generate two images, the first using the IPAdapter and the second without, and let's see how it goes. As you can see, the woman upscaled without the IPAdapter is basically another person compared to the reference image, while the girl upscaled with the IPAdapter keeps most of the original features. Also, if you look at the patches on the space suit: in this image they are basically completely made up, while the IPAdapter version kept something of the original. Of course we can lower the denoise a little for a result closer to the original image, but we are going to lose a little bit of sharpness. If we used model upscaling, the result would be even better. This can be useful especially for SDXL, since we do not yet have a tile ControlNet for it.

One last thing before I wrap up. Let's say I have a set of four images that I like as reference for the IPAdapter. I am also preprocessing them: I am cropping this one to the top because it is in portrait mode, and I am adding a little sharpening to this one because it is a little blurry. I really like the result, and I now want to create like a hundred pictures with this reference. It is a little wasteful to encode these images every time I need this workflow; the images are actually encoded only once per run, and then we hope that the 4 GB encoder gets unloaded. What we can do instead is pre-encode our references with the IPAdapter Encoder: instead of sending the images directly to the IPAdapter node, we send them to the encoder together with the CLIP Vision encoder. What we need at this point is a new Apply IPAdapter node that takes already encoded images. Now we can connect the embeds to the embeds input, the IPAdapter model to the ipadapter input, and then the checkpoint. We don't need the old node anymore, so we can remove it. If I generate the image again, the result will be the same. It is important to note that the noise also has to be set here.

The cool thing is that I can take these embeds and save them with IPAdapter Save Embeds. Now they are saved on disk: you can find them under the "embeds" subdirectory in the output directory of your ComfyUI installation, and we need to copy the file into the input directory. At this point we don't need any of these nodes, and of course we don't need the image encoder. All we need is IPAdapter Load Embeds. We probably need to refresh, and here we find all our saved embeds. We select the file that we moved into the input directory, connect it to the embeds input, and generate again: the result is exactly the same. At this point your images are pre-encoded and you can reuse them without wasting any resources; from my tests this configuration saved me about 1.5 GB of VRAM (a rough sketch of this caching idea in plain Python follows at the end of the transcript). And of course you can send this file to your friends or post it on Civitai and have others create the same kind of images with your embeds.

This pretty much covers the basics of my extension for ComfyUI and the IPAdapter. Be sure to download the ComfyUI IPAdapter Plus extension, and remember that this is not a model that needs training, so it is pointless to send 100 reference images to the IPAdapter; it is only going to waste resources. You really have to cherry-pick your reference images, and the fewer the better. There's actually a training script in the IP-Adapter repository that can be very helpful if you have very specific needs, and it is also very simple to use. If you already have experience training, for example, a LoRA, this is something you can try; maybe we can do another tutorial specifically about that. That's all for now, see you next time. Ciao!
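As promised, here is the caching idea behind the encode/save/load embeds nodes, as a generic sketch in plain Python: run the heavy CLIP Vision encoder once, persist the resulting tensor, and reload it on later runs. This is not the extension's actual file format or API (use the Save/Load Embeds nodes for anything you want to share); the file name and `encode_fn` are hypothetical.

```python
# Generic embed-caching pattern: encode once, reuse many times.
# NOT the extension's file format -- a conceptual sketch only.
import os
import torch

EMBEDS_PATH = "reference_embeds.pt"      # hypothetical cache file

def get_embeds(encode_fn, images):
    if os.path.exists(EMBEDS_PATH):
        return torch.load(EMBEDS_PATH)   # skip loading the ~4 GB encoder
    embeds = encode_fn(images)           # expensive: runs CLIP Vision once
    torch.save(embeds, EMBEDS_PATH)
    return embeds
```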
Info
Channel: Latent Vision
Views: 47,567
Id: 7m9ZZFU3HWo
Length: 27min 39sec (1659 seconds)
Published: Sat Sep 30 2023