ComfyUI: Yolo World, Inpainting, Outpainting (Workflow Tutorial)

Video Statistics and Information

Captions
Hi, I am Mali, welcome to the channel. Using the latest zero-shot instance segmentation tool, I was able to get 104 segment masks in this low-resolution image. I then tested the tool on this AI image and was able to adjust the colors in post-processing. Combining YOLO World with LaMa, MAT, and Fooocus's own inpaint model, you can easily segment and inpaint a subject with near perfection. The workflow is designed to handle all types of images and inpaint with intricate details, and you can even change the subject. Today's tutorial focuses on advanced segmentation, image processing, inpainting, and outpainting techniques. Let me show you my workflows, tips, and some hacks in ComfyUI. I want to thank all the paid channel members for their continued support.

Except for the YOLO World custom node, all other installs are straightforward and can be done directly from the ComfyUI Manager. This tutorial covers seven workflows; basic ComfyUI know-how is required. The YOLO World and EfficientSAM ComfyUI implementation is made possible by ZHO-ZHO-ZHO. YOLO World is a real-time open-vocabulary object detection tool with instance segmentation capabilities. Although its core competency lies in video surveillance and analysis, self-driving cars, and robotics, I will be using it for image manipulation via segmentation. The second most important custom node I am using is ComfyUI Inpaint Nodes by Acly. The inpaint nodes use the Fooocus inpaint model, which allows you to convert any SDXL checkpoint to an inpaint model. The Fooocus inpaint or outpaint method is one of the best in the industry, as it uses its own algorithms and is partially inspired by diffusion-based semantic image editing with mask guidance. You can also pre-fill the mask area using the Telea algorithm or the Navier-Stokes equation. In addition, the LaMa and MAT inpaint models are used for object removal. In this tutorial I use Differential Diffusion and Self-Attention Guidance, and combine all of them to get the desired output.

I use the Impact Pack for the PreviewBridge node required for inpainting. Some of the WAS, Masquerade, and Vextra nodes are used in the image processing workflow. The WD14 Tagger node is used to get tags from images, and the graphics nodes are from Comfyroll Studio. A node from ArtVenture is used for color correction. The inpaint node pack provides some of the main nodes for the workflow. The seed and image compare nodes are from rgthree, and I use ComfyUI Essentials to get the image resolution. Finally, install the YOLO World EfficientSAM custom node. The requirement for the custom node is set to a lower version of inference; ComfyUI, however, requires the latest version, and the node does work on it. After you install the node and restart ComfyUI, you might get this warning. When this node was released, the latest version of inference was 0.9.13. This custom node works until version 0.9.15; however, it breaks with version 0.9.16, released on March 11th. I have informed the dev about the issue. Inference version 0.9.13 should not interfere with ComfyUI functions. Roboflow Inference is a set of tools that simplifies deploying computer vision models, particularly those trained on the Roboflow platform; these include CLIPSeg, Segment Anything, and YOLO World. If for some reason your inference gets updated to the latest version via a ComfyUI update, I have mentioned the steps to downgrade to the required version in the description, and any future updates for this will be in the pinned comments.
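As a reference for that downgrade, here is a minimal sketch, assuming ComfyUI runs in the same Python environment you invoke this from (the 0.9.13 pin is the version cited above):

```python
# Pin Roboflow's inference package to the version the node was tested with.
# Equivalent to running `pip install inference==0.9.13` in ComfyUI's venv.
import subprocess
import sys

subprocess.check_call([sys.executable, "-m", "pip", "install", "inference==0.9.13"])
```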
For the YOLO node, download these two files; they go into the yolo folder in custom_nodes. Next, open this web page and download the Fooocus inpaint head and the version 26 patch files; they go in the models/inpaint folder. Then, from the GitHub page, also download the LaMa and MAT models; they go in the same models/inpaint folder.

Let's start with the Load Image node. Add the main YOLO World node, the model loader, and the EfficientSAM loader. There are large, medium, and small models. I would not recommend the small model, as it fails to detect some objects; the medium model is acceptable, and I will be using the large one for the tutorial. For the device, you have CUDA for an NVIDIA GPU, or you can just use the CPU. Connect the image, and add a preview and a Mask to Image node. I am also going to connect the image with a WD14 Tagger node; this is very useful, as it can list some tags in the image.

Let's try out an image. The YOLO World model can detect objects that it hasn't been explicitly trained on. You do have to play around with this, and I will show you some unique examples. Just type in what you naturally see, let's say "car" for example. To select multiple subjects, separate the keywords with a comma followed by a space. This is a combined mask; if I turn off mask_combined, I can get separate masks for each subject. Separating the masks allows you to select a specific mask to pass on in the workflow. The first mask is labeled 0, the second is 1, and so on. So if I want to pass on only the car, I will enable mask_extracted and select 0 as the mask_extracted_index value; to select the bike, change the index value to 1.

The system has a slight learning curve. For example, if I add "person" to the prompt, the selection dynamics change: the confidence values shift, and it now selects another car out of frame. This behavior is normal with the YOLO World node. Here, if I want a combined mask of only the car, the bike, and the person, I should try increasing the confidence threshold to 0.3. Another quirky behavior: if it cannot detect the specified subject or object, it gives an error. The tagger is helpful here, as it identifies some keywords; I can then use these in the YOLO prompt. You can increase the threshold and character threshold to expand this list.

Let's try a very low-resolution, poorly shot image. This was shot on a mobile phone in low light; the poor quality and blurring are evident. This will obviously confuse the YOLO model for segmentation. Let's try a simple keyword, "car", with the confidence at its default. It missed the blurred car, which is what you would call a false negative in detection. There is a way to correct this. YOLO uses bounding boxes for detection: it creates a box that contains the entire subject or object. Then there is something called the ground truth: the dataset used in training has ground-truth bounding boxes, and these boxes define the exact location of each object in the image. The intersection-over-union (IoU) threshold is the percentage of overlap between the predicted box and the ground-truth box. A higher IoU indicates a good prediction, while a lower value signifies a poor one. In simpler terms, the IoU threshold is like your minimum acceptable accuracy. The default for the node is 0.1, which I find acceptable; the research paper uses 0.5. However, a higher value may sometimes cause false positives, and vice versa. I recommend starting with 0.1 and then increasing or decreasing it in case of a false positive or negative. Let's increase the value to 0.5 and see if that helps detect the center car.
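To make the IoU threshold concrete, here is a small worked example (a generic illustration, not the node's internal code), with boxes given as (x1, y1, x2, y2) corners:

```python
# Intersection over union of two axis-aligned bounding boxes.
def iou(box_a, box_b):
    # Corners of the overlap rectangle.
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0

# A prediction shifted half a box-width off the ground truth:
print(iou((0, 0, 100, 100), (50, 0, 150, 100)))  # ~0.33, rejected at a 0.5 threshold
```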
This is another low-resolution image, and it has a lot of objects and subjects. Let's say you want to mask this text: the error is because it could not detect it, so reduce the threshold to 0.01. The EfficientSAM model works very fast and is highly accurate. Let's try another keyword, "goggles". Because of the low threshold, it has mistaken some objects for goggles: here it could not distinguish between the goggles and the bull, and these two objects are not goggles at all. Keep in mind this is stressing the system, as the image is very low resolution; you could call these the model's limitations. If you remember, earlier I said the whole detection dynamic changes if you change the prompt. Let's try that. This is better, but it still struggles with masking some objects; for example, it has now masked the entire head.

I want to show you one last example. Say you want to select this person along with the shadow. Later in the tutorial I will show you how to completely erase objects from images with high accuracy; the important thing is knowing how to mask them using YOLO. Let's try "shadow" with the default threshold value; you can see how accurately it has been masked. Now I will also add "woman" to the prompt to select the subject. I won't be able to tune the threshold so that it only selects the center subject and shadow: the shadow has a lower confidence value than the other selected women. I can change the prompt to "woman with orange bag", and this works beautifully. This works far better than anything we currently have. Comparing it with GroundingDINO is a good measure: firstly, you can't separate the masks directly in GroundingDINO; secondly, it is slower and highly inaccurate in comparison. Whatever settings I tried, I could not get a well-defined mask for the center subject and the shadow. I have tried and tested a lot of images, and in many instances GroundingDINO fails.

I am specifically choosing this image because it is a bit difficult to post-process. First I will apply a color or gradient to the entire image, then segment individual elements, and finally cut each element out and paste it over the main image. The color or gradient should match the resolution of the source image, so first get the resolution of the input image with the Get Image Size node. Add a Color Gradient node from Comfyroll Studio, convert the height and width into inputs, and make the connections. For a static color, the start and end colors should be the same; for the tutorial, I will choose a gradient. You can also select custom and input any hex color code. Using an image blending node, you can overlay this gradient over the original image: image A is the source and image B is the color or gradient. Change the mode to color and the percentage to about 80. Add a Color Correct node at this stage; it allows you to fine-tune the color grading further by editing the temperature, hue, brightness, contrast, saturation, and gamma.

Add the YOLO World nodes as shown previously and connect them with the source image. I am going to segment the top clothing first. I probably need to reduce the confidence threshold, and let's increase the IoU to 1 as well. I am not able to segment the top clothing of the first subject correctly; as I said, this is a problematic image. Trying a different prompt: I tried many variations, and this is basically the best I could get, but I can work with it. What you need to do now is duplicate the YOLO node and subtract the other mask from it. Let's try the keywords "short pants" and "arm", and simply combine the masks; it doesn't matter if there are extra elements here. Add a Bitwise Subtract Mask node and connect it with the YOLO mask outputs. We can get rid of the extra artifacts by adding a Grow Mask node: reconnect the mask subtraction with the grow mask output and set a value of 2; that eliminates all the artifacts.
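The subtract-and-grow combination can be pictured with plain arrays (a minimal NumPy/SciPy sketch, assuming masks are 2-D floats in [0, 1] as ComfyUI passes them between nodes; SciPy's binary dilation stands in for the Grow Mask node):

```python
import numpy as np
from scipy.ndimage import binary_dilation

def grow_mask(mask, pixels=2):
    # Dilate the mask outward by `pixels`, like the Grow Mask node.
    return binary_dilation(mask > 0.5, iterations=pixels).astype(np.float32)

def subtract_masks(mask_a, mask_b):
    # Bitwise subtract: keep mask_a only where mask_b is absent.
    return np.clip(mask_a - mask_b, 0.0, 1.0)

# Toy example: remove the arm region (grown by 2 px) from the shirt mask.
shirt = np.zeros((64, 64), np.float32); shirt[8:56, 8:56] = 1.0
arm = np.zeros((64, 64), np.float32); arm[8:56, 48:56] = 1.0
clean = subtract_masks(shirt, grow_mask(arm, pixels=2))
```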
The Cut By Mask node lets you cut the masked part from the source image; connect the source image and the subtracted mask. Okay, I made a mistake here: ensure you connect the color-blended image output. Perfect. Add a Paste By Mask node: the Cut By Mask output connects with the image-to-paste input, and the subtracted mask output is used as the mask input. The image source comes from the Load Image node, not the blended output. Change the resize behavior to keep_ratio_fill. The color has changed to blue, and this is good enough. I am dragging this node next to the Load Image node; you can right-click and copy the output, then select the Load Image node and paste it using the keyboard shortcut.

I had trouble getting the mask because the color was too blended with the background. Mute the Bitwise Subtract and the second YOLO node and try the "shirt" prompt again. As you can see, YOLO World can now perfectly select all the individual clothing, so I am going to use mask_extracted to select each individual element. Bypass the Bitwise Subtract node and repeat this process. I am fast-forwarding here as it is repetitive; I will slow down to show you some other tricks I used. There is no limit to this, and it does not take any sampling processing, so it's quite fast. You can repeat this process for each subject or object that YOLO World can segment; that's the only limiting factor. A note about bottom clothing: "pant" as a keyword doesn't work for this image, but "short pants" does. I will keep the settings the same, select each mask, change the color, and repeat the entire process. This is the output I got after doing this for about 10 minutes; once the workflow is set up, it's pretty easy and straightforward. For the background, I just entered the keyword "background" and it segmented it correctly; if for some reason it cannot, select the subjects and invert the mask.

Before I proceed with the integrated YOLO and inpaint workflow, you should know how to convert any SDXL checkpoint using the Fooocus patch. For the tutorial, I am using the Juggernaut V9 checkpoint. The Self-Attention Guidance node was added to ComfyUI in December last year; it improves the quality and detail of generated images. Self-attention guidance analyzes the image during the creation process and helps the model focus on important relationships within the image. Differential Diffusion is now a standard when inpainting in ComfyUI: instead of replacing the masked region in one shot, Differential Diffusion works through multiple smaller diffusion steps. With each step, the model reduces the noise in the masked area while also being influenced by the surrounding image and your prompt, which leads to a smooth, gradual transition. The main advantage of using Differential Diffusion is that you get a seamless blend, and in the process it manages to preserve the details as well.
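A conceptual sketch of that idea (an illustration under stated assumptions, not ComfyUI's actual implementation): the mask acts as a per-pixel schedule that decides how long each pixel stays editable.

```python
import numpy as np

def differential_step(denoised, original_renoised, mask, step, total_steps):
    """One denoising step with a per-pixel change map.

    denoised          -- the model's prediction at this step
    original_renoised -- the source image re-noised to this step's noise level
    mask              -- per-pixel change strength in [0, 1]
    """
    # The threshold starts at 1 and falls to 0 as sampling progresses: a pixel
    # becomes editable once its mask value exceeds the remaining-noise
    # fraction, so strongly masked pixels are edited from the first steps
    # while weakly masked pixels stay anchored to the source until the end.
    threshold = 1.0 - step / total_steps
    editable = mask >= threshold
    return np.where(editable, denoised, original_renoised)
```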
Here I am setting up a standard VAE Encode (for Inpainting) workflow. I am using a non-inpainting checkpoint; typically, checkpoints trained for inpainting offer a smoother blend. I will use the same prompt for consistency. Let's also look at the inpaint conditioning method before comparing it with the Fooocus patch. This is way better. Duplicating the sampler again, add the Apply Fooocus Inpaint node and connect it with the Differential Diffusion node and the sampler. Add the Load Fooocus Inpaint node and set the head and patch models. Add another VAE Encode (for Inpainting) node, then connect the latent output to the Fooocus inpaint and sampler inputs. Remember that the VAE encode method with the patch is used when replacing a subject or object within the mask. See the difference; this patch method will work with any SDXL checkpoint.

I cleaned up the previous workflow to only include the Fooocus patch. Let's change the prompt to "a large polar bear on the street". As you can see, this leaves a very clear masking line; we will clean this up. Add a PreviewBridge; this is for manual masking without the saving and loading. The next node is a two-in-one node from the inpaint custom node pack: it's the VAE Encode and inpaint conditioning combined. This node only works when you need to refine the image or make smaller, refined changes, and you typically use a denoise value lower than 1. I will use it to blend the masking outline with the background. The latent inpaint output connects with the Apply Fooocus Inpaint node, and the latent samples output connects with the KSampler input. This method blends the inpaint perfectly. To recap: you use the VAE Encode (for Inpainting) node with the Apply Fooocus patch to replace a subject or object; then you extend the workflow by adding the VAE Encode & Inpaint Conditioning node, masking the visible mask outline from the previous inpaint, and reducing the denoise to less than 1.

In this workflow I will showcase some techniques for perfectly removing a subject or object from an existing image. Let's try this image of a person sitting on a bench; the gaps in the bench should be challenging. Add the three YOLO nodes as shown previously. We'll use the prompt "person" with the default settings, and first try using the selected mask to see the results before doing anything else. Add the Grow Mask node, then search for and add the Inpaint (using Model) node and connect it with the Load Inpaint Model node. This model is called MAT, which stands for Mask-Aware Transformer for Large Hole Image Inpainting; it's quite impressive and is trained on a larger dataset. The second model, trained on a smaller dataset, is called LaMa, which stands for Resolution-robust Large Mask Inpainting with Fourier Convolutions. I will use MAT for this image. The performance of the models depends on the image; I try both to see which does a better job in the initial stage of removing the subject and continue from there. This is the area filled in by MAT. Notice that it is a bit low resolution, and you can clearly see that something has been removed. I will fix this with near-perfect resolution, comparable to paid apps like Topaz AI, and for some images it's even better. For this specific image, LaMa performs very poorly; I will show some examples of LaMa outperforming MAT later.

Add the Load Checkpoint, conditioning, and VAE Encode & Inpaint Conditioning nodes. Ensure you connect the image from the inpaint model output to the VAE Encode & Inpaint Conditioning node; the Grow Mask output connects with the mask input. Here it's best to put in a positive prompt, "bench park": if you don't, the model usually generates a person sitting on the bench like before. Add the Self-Attention Guidance and Differential Diffusion nodes and connect them with the Apply Fooocus Inpaint node. Add an Image Compare node and connect it with the output and the original Load Image. This is the main issue here, and I can improve the other details as well. Since the inpaint technique requires surrounding pixels, I will mask the bench backrest area manually. Now add a Bitwise Mask Addition node; you want to combine the YOLO and manual masks. Make sure that the YOLO mask output is singular, either by mask_combined or mask_extracted. This wider mask area should help. Another thing that might help is the Blur Masked Area node: it blurs the mask area, which you would normally use to preserve the colors. The falloff is the edge of the mask, where the blurring effect gradually fades from its maximum intensity to no effect. Connect the Blur Masked Area output to the VAE Encode input and queue the prompt. This is very good, but we can add another step to improve it further: duplicate the KSampler and run a second pass with the same settings. After this step, the output is very refined.

Here are the results with different images. This one was done using the LaMa model; MAT performed poorly here. Here's another one using the LaMa model; the details in this are insane. These are all done using the same settings; the only changes are either the prompt or the grow mask value. And for this last one, the prompt was empty; both models perform equally on this image. You can combine everything shown in the tutorial so far: I start with YOLO World to select the subject, then use the LaMa model to remove it, and refine the image with a positive prompt via VAE Encode and the Apply Fooocus patch; I did two passes for this. I further connected the output with a PreviewBridge and made a manual mask for the mammoth. Here I experimented with a refinement pass through the VAE Encode & Inpaint Conditioning node. Remember, if you use just the VAE Encode, you should refine further with another pass via this node.

For outpainting, create the basic refined inpaint workflow and use the VAE Encode & Inpaint Conditioning node with the Fooocus inpaint. Add the Pad Image for Outpainting node and change the left value to outpaint on the left side, say 128 pixels; remember that the model will use some adjoining pixels as a reference and try to outpaint. Add a Fill Masked Area node after the conditioning and connect the pad image to it. This node has three settings for outpainting. Don't use neutral; it is just a plain fill. Telea is an algorithm, while Navier-Stokes is an equation; both work, but they give slightly different results.
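Both pre-fill methods exist in OpenCV's classical inpainting API, so you can preview what each does to a padded strip outside ComfyUI (a standalone sketch; the file names and pad size are placeholders):

```python
import cv2
import numpy as np

image = cv2.imread("source.png")  # placeholder input

# Pad 128 px on the left and mark the new strip as the area to fill.
pad = 128
padded = cv2.copyMakeBorder(image, 0, 0, pad, 0, cv2.BORDER_CONSTANT, value=0)
mask = np.zeros(padded.shape[:2], np.uint8)
mask[:, :pad] = 255

# Telea's fast-marching algorithm vs. the Navier-Stokes based method.
telea = cv2.inpaint(padded, mask, 3, cv2.INPAINT_TELEA)
navier = cv2.inpaint(padded, mask, 3, cv2.INPAINT_NS)
cv2.imwrite("prefill_telea.png", telea)
cv2.imwrite("prefill_ns.png", navier)
```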
It is recommended to add a Blur Masked Area node for outpainting; connect it with the Pad Image and Fill Masked Area nodes, then complete the remaining connections. For outpainting, go either horizontally or vertically: once you have done it horizontally, for example, proceed vertically. Start with 128 pixels; you can go up to 1024 pixels at a time on either side. Let's change the value to 768 and try out Telea. There is just a bright light on the right, and there is nothing wrong with the generation; the same is evident from the preview. Let's revert, and then I will show you how to outpaint vertically. Duplicate everything from the Pad Image node right up to the preview and save image nodes; the VAE Decode output will connect to the second Pad Image for Outpainting node. The benefit of doing this is that you can use different Fill Masked Area styles for horizontal and vertical. Also, it's preferable for the whole image to be generated before changing directions, as it yields more accurate results. Using the Fooocus apply patch, you get very good results for outpainting. Here you can also try using a prompt, let's say "grass, faded, out of focus". I hope this tutorial was helpful and you learned something new in ComfyUI. Until next time.
Info
Channel: ControlAltAI
Views: 9,805
Keywords: comfyui, comfyui tutorial, comfyui workflow, stable diffusion, comfyui explained, yolo world, comfyui inpainting, comfyui inpainting workflow, comfyui inpaint mask, comfyui inpaint only masked, comfyui outpaint, comfyui fooocus, comfyui yolo world, comfyui workshop, custom nodes, mask painting comfyui, inpainting, masking comfyui, outpainting, segment anything, lama image inpainting, stable diffusion inpaint remove object, comfyui node workflow
Id: wEd1wPlCBaQ
Length: 37min 46sec (2266 seconds)
Published: Wed Mar 13 2024