ComfyUI: Stable Video Diffusion (Workflow Tutorial)

Video Statistics and Information

Captions
Hi, I am Mali, and welcome to the channel. Stability AI released its first model for stable video diffusion. You can get frame control by having only the candle and surrounding objects animate, take a portrait (AI generated or a photo) and add a subtle animation for the hair and eyes, or complicate things by taking your DSLR photo, for example, and creating a short video using latent noise composition. I made six Comfy graphs which showcase different examples of how you can fine-tune your image-to-video output. Let me show you my hacks and the Comfy workflow explanation to get such results with Stable Video Diffusion.

Firstly, I would like to thank the channel members who joined last week; we really appreciate your support. All the relevant links will be in the description. There are a total of eight Comfy graphs; the JSON files and the MP4 videos will be available to all YouTube channel members.

ComfyUI supports both of the Stable Video Diffusion models released by Stability AI, and you can run it locally. I had no issues running it on a 4090; it uses 100% GPU during the KSampler processing. I did try more than 25 frames, and the 4090 could do up to about 120 frames, beyond which it gave a memory error. However, the model is not trained for more than 25 frames, and I will stick to that throughout the tutorial. Both models can do video at a resolution of 1024x576, in both portrait and landscape, though I would say it does better in landscape. The first model is trained to generate 14 frames; the second model, SVD XT, is trained to generate 25 frames, and I will be using this model for the tutorial. Download the model into the following folder in your ComfyUI directory.

ComfyUI Manager is needed for the tutorial. The Manager installation has been covered several times in the past; please refer to the Stable Diffusion playlist on the channel. Open the ComfyUI Manager. ComfyUI and all custom nodes must be updated, so update before proceeding. Some custom nodes are required: install the WAS Node Suite (I will use two of its nodes in the last workflow); the Video Helper Suite and Image Resize custom nodes will be required for all workflows. After you install the nodes, close the browser and the command prompt, then restart Comfy for the custom nodes to work.

You also need to install FFmpeg. Go to the website and click on Download, then click on the Windows builds from gyan.dev. Scroll down to the release builds and download the FFmpeg release full archive; you will need WinRAR or 7-Zip to extract it. I have extracted it to the same location as Comfy, but you can put it anywhere in your system. Go inside the bin folder and copy the location. From the Start menu, search for Advanced System Settings, click on Environment Variables, then select Path and click on Edit. Here, click on New and paste the location you copied earlier, then confirm all the dialogs and exit.

The workflow will be pretty simple in the beginning; I will keep building it up and explain the settings as we progress, and ultimately it gets to a very advanced level. The video model is loaded with a single checkpoint node, so let's start with that and select the SVD checkpoint. Now add a node called Video Linear CFG Guidance. Search for SVD and add the Image to Video Conditioning node. Add the KSampler and the VAE Decode node. The last node is the custom node called VHS Video Combine; I use this instead of the default one, Save Animated WEBP, because the default node only supports WebP output, and converting WebP to MP4 is a bit of a headache that needs additional software. This custom node lets you export GIF, WebP, MP4, and so on within Comfy.
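Since MP4 export depends on FFmpeg being reachable on the system PATH, it can save some head-scratching to verify that the PATH edit actually took effect. The following is a minimal, hypothetical Python check (not part of the workflow itself, just a convenience):

import shutil
import subprocess

# Look up the ffmpeg executable on the current PATH.
ffmpeg_path = shutil.which("ffmpeg")

if ffmpeg_path is None:
    print("ffmpeg not found on PATH; re-check the environment variable entry")
else:
    # Run `ffmpeg -version` and print the first line to confirm it actually runs.
    result = subprocess.run([ffmpeg_path, "-version"], capture_output=True, text=True)
    print(result.stdout.splitlines()[0])

Note that PATH changes only apply to newly started processes, so any terminal and ComfyUI itself must be restarted after editing the environment variable, as mentioned above.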
Connect all the nodes except for the image input on the conditioning node. I am creating a separate group for the image loader: add the Load Image node, search under image for Image Resize and add that node, then drag out and add a Preview node.

Let's start with a candle image. This is an AI-generated image, and I am choosing it first because I want to show you how to control the motion and make only certain elements animate in the video, primarily the flame. Before I change the settings, let me explain why you need this node. Its purpose is to maintain the ratio and crop the image: you can take any image and align it precisely the way you want, so it does not matter if the image is square, portrait, or landscape. Change the action setting to crop to ratio. Since the max height or width can be 1024, put that value on the large side; there is no need to define the smaller side, as for video generation we only care that the resolution does not exceed 1024 in width or height. For now, keep the resize mode on reduce size only. Reduce size only is useful for larger images, increase size only is useful for smaller images, and any does it automatically, but I prefer to set it manually. As for the ratio, always keep it 16:9 for landscape or 9:16 for portrait. Let's hit Queue Prompt and see the preview. For this image the crop pad position will not make much difference, but to center the image set it to zero; putting it to one will shift the candle slightly towards the left. I will explain this setting further with other workflows where it really becomes useful. To use this resized and cropped image for the video output, connect this node to the SVD conditioning node.

I am using a fixed seed throughout the tutorial, and the steps stay at 20. The CFG value, however, is significant and is relative to the minimum CFG value set on the Video Linear CFG Guidance node: the minimum CFG is where the video starts, and the value you input in the KSampler is where the video ends, ramping across the 25 frames. The CFG setting is also relative to the motion bucket ID; basically, the motion bucket ID and the sampler CFG together determine the camera and motion movement. The sampler and scheduler also make a lot of difference in the output, which I will explain further in the tutorial. The denoise will always be at one, although I will show you at the end where and why I reduce it. For now, leave it at one, as we need the model to add the motion; reducing the denoise would just keep the image still. You have to change the height and width to match portrait or landscape, and always make sure the maximum is 1024. I do use a value lower than 576 and it works, but that is for the last workflow. Since the checkpoint model is SVD XT, I will change the video frames to 25. The augmentation level is the amount of noise added to the generation, which affects the level of detail it adds. Be careful here, as it is very sensitive, and higher values sometimes give poorer motion details than lower ones; it depends on the image along with the other settings.

Let's queue the prompt with these settings and see what we get. Okay, there is a lot of camera movement and panning. I also recommend changing the frame rate to 10 from the default 8; higher frame rates are not recommended, as the total frame count is only 25. GIF supports limited colors; WebP is a much better alternative to GIF and supports 24-bit RGB with an 8-bit alpha channel.
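Before moving on, it may help to make the CFG relationship described above concrete: the guidance ramps linearly from the minimum CFG on the Video Linear CFG Guidance node at the first frame up to the KSampler CFG at the last frame. The sketch below only illustrates that ramp under that assumption; it is not ComfyUI's actual implementation:

def linear_cfg_schedule(min_cfg, sampler_cfg, num_frames):
    """Per-frame guidance values ramping from min_cfg up to sampler_cfg."""
    if num_frames == 1:
        return [sampler_cfg]
    step = (sampler_cfg - min_cfg) / (num_frames - 1)
    return [round(min_cfg + step * i, 3) for i in range(num_frames)]

# Example: minimum CFG 1.0 on the guidance node, CFG 2.5 on the KSampler, 25 frames.
print(linear_cfg_schedule(1.0, 2.5, 25))

The gap between the two values is what the sampler CFG experiments below are really adjusting: a wider gap means the guidance, and with it the motion, changes more aggressively over the clip.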
However, I would recommend going for the H.264 MP4 format: it is a standard video format and can be upscaled via third-party software. Let's regenerate selecting MP4, and you can see the quality difference. The CRF value here primarily applies when the output is MP4; a higher value reduces the quality, and a lower value increases the quality and file size.

Keeping the motion bucket ID the same, a higher KSampler CFG increases the movement of different elements within the image. Let's try a value of 3.5. Okay, so it looks like the whole floor around the candle is moving in a different perspective than the background, and the overall movement is slower. Let's try a value of 4.5: the movement seems more stable, with less panning. Reducing the CFG to 1.5, however, reduces the background movement and increases the panning; mind you, the motion bucket is still at 127, so there is fast panning, and because of that there is a loss of detail on the candle as well. Let's change the scheduler to simple and reduce the motion bucket value right down to five. Now we have a still image with only certain elements in motion, including the water reflection. This gives you a rough idea of how to manipulate these values to get the desired effect.

Let's select a portrait image this time. This is also an AI-generated image, of a woman waving. It is difficult to add motion here with the usual settings, as the whole image gets distorted. First change the ratio to 9:16, then disconnect the node and queue the prompt to check the cropped image alignment. Since the image is horizontal, changing the crop pad position will shift the image horizontally only; a value of 0.5 will center-align it. Change the scheduler to Karras and then swap the width and height values in the conditioning node. The motion bucket is set to a low value. Remember that for most images the model detects the motion based on the image itself, so here the hand will be the first thing it adds motion to; however, it is all distorted. You can now use the augmentation level to fix this distortion: increase it to a value of 0.1. That is much better. If you want the fingers to stop moving completely, reduce the motion bucket to one; this will just move the hand. You can further reduce the augmentation level to 0.05, which is better still. Since the animation has a cut, you can try the ping-pong effect; it reverses the animation and plays it in a loop, which works for these kinds of animations. Another trick I learned for any close-up facial image is that changing the sampler to any ancestral sampler will make the model try to animate something in the face, most likely the eyes; for this to happen, a motion bucket level of 5 to 25 generally works, depending on the image. Let's try it with 10 and see. Cool trick, ain't it? Let's explore this further in the next workflow.

This workflow has a trick with which you can make the AI perform specific, subtle animations. To showcase this I will be using a close-up portrait image; all images in this workflow are AI generated as well. Since this is a square image, the crop pad position value changes the cropping from top to bottom: a value of one crops at the bottom and a value of 0.5 is centered. Note that in close-ups any ancestral sampler may animate the eyes, but Euler causes more motion than DPM2 or DPM++ 2S Ancestral. Okay, so there is no eye blinking. When this happens, try reducing the sampler CFG, bringing it closer to the minimum CFG value; let's try 1.1 and see what happens. Excellent. The blink is not perfect, but you can see the camera panning has stopped.
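To illustrate what the crop-to-ratio action and the crop pad position do geometrically, here is a small sketch. It assumes a pad position of 0.5 means centered and 0/1 mean the two edges; the exact convention of the Image Resize custom node may differ, so treat this purely as an illustration of the idea:

def crop_to_ratio(width, height, ratio_w, ratio_h, pad_position=0.5):
    """Return a (left, top, right, bottom) crop box for the target ratio.

    pad_position slides the crop window along the axis with excess material:
    0.0 = one edge, 0.5 = centered, 1.0 = the opposite edge.
    """
    target = ratio_w / ratio_h
    if width / height > target:
        # Image is too wide for the ratio: crop the width, slide horizontally.
        new_w = round(height * target)
        left = round((width - new_w) * pad_position)
        return (left, 0, left + new_w, height)
    # Image is too tall for the ratio: crop the height, slide vertically.
    new_h = round(width / target)
    top = round((height - new_h) * pad_position)
    return (0, top, width, top + new_h)

# Example: a square 1024x1024 close-up cropped to 16:9, keeping the middle band.
print(crop_to_ratio(1024, 1024, 16, 9, pad_position=0.5))  # (0, 224, 1024, 800)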
Now all I need to do is make the eyes blink fully. I am going to disable this group and create a new one. We will need two image loaders and a set of two images: the same image once with the eyes open and once with the eyes closed, which can easily be done with inpainting. Search for and add a Repeat Image Batch node, twice. Now add two Image Batch nodes. I am going to mix and match the images: connect the first image to the image 1 input of the first batch and the image 2 input of the second batch, and reverse the same for the second image. Duplicate another Image Batch node and connect both of these nodes to it. The Repeat Image Batch has a value of one, which goes into two Image Batches, so there will be a total of four images. Let me change the image; the second image is the one with the eyes closed. So the first image will be with eyes open, then two images with eyes closed, and the last image with eyes open again. This is how it should be set up for a blinking animation. You must also set up the Image Resize node here before connecting it to the SVD conditioning node. Let's connect and test. It worked.

Let me pause here and explain why this worked. I did not use frame interpolation, which is basically adding frames; all I did was give the AI a set of four images in a specific order instead of one reference image. I am just influencing what the AI is already doing. This method adds more pixel noise for the AI to work with, but notice how the eye color has changed. This is because there are images of closed eyelids instead of the iris in between, which messes up the color interpretation. One way to fix this is to increase the number of open-eye images compared to closed-eye ones. Just increasing the value to two will not work; let's keep the ratio at 2:1 and increase both images accordingly, raising the first repeat to eight and the second image repeat to four. See the difference. This is a proven method and not a random occurrence.

I will show you two more variants of this workflow. Keeping the same settings as before, I will select another random portrait, one with eyes open and one with eyes closed. I will disable the multi-image group first, because I want to show you the results with just one image and precisely the same settings: there is hardly any motion here. Let's use the multi-image method and try again; 24 images will get fed into the SVD conditioning input. Absolutely nothing. Again, the behavior depends on the image. If you get this, you need to change the ratio so that there is more of the second image for every nth repeat of the first image. For this I am increasing the repeat batch to 12 for the first image and eight for the second. There is a blink, but with a slight distortion. If I use a 1:1 ratio I guess it would be better, so let's try that. That is slightly better; I don't think it will get any better than this. The last thing I want to try is to see if this method works for some facial movement, like a lip movement. Just to note, I did try this with the same sampler and scheduler and it did not work; however, changing to Euler Ancestral and the simple scheduler did something. That is just a tiny bit of a smile. Another thing to note: if the second image's smile is too broad, you will get overlapping images. The blinking is easy and works, and as I said before, depending on the image the motion is predictable. The AI understands the image, and for this one it should make it seem like a forward motion; however, you can get a smooth or a bizarre animation depending on the settings.
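As a side note, here is a rough tensor-level sketch of what the Repeat Image Batch and Image Batch nodes in the blinking setup above end up assembling. It assumes ComfyUI-style image batches shaped (batch, height, width, channels) and is only meant to show the frame ordering, not to replicate the nodes exactly:

import torch

def blink_batch(eyes_open, eyes_closed, open_repeats, closed_repeats):
    """Assemble the order used above: open block, closed block, closed block, open block.

    Both inputs are single images shaped (1, H, W, C).
    """
    open_block = eyes_open.repeat(open_repeats, 1, 1, 1)        # Repeat Image Batch
    closed_block = eyes_closed.repeat(closed_repeats, 1, 1, 1)  # Repeat Image Batch
    first = torch.cat([open_block, closed_block], dim=0)        # first Image Batch
    second = torch.cat([closed_block, open_block], dim=0)       # second Image Batch
    return torch.cat([first, second], dim=0)                    # final combined batch

# Example with the 2:1 ratio from above (8 open / 4 closed repeats gives 24 images total).
open_img = torch.rand(1, 576, 576, 3)
closed_img = torch.rand(1, 576, 576, 3)
print(blink_batch(open_img, closed_img, 8, 4).shape)  # torch.Size([24, 576, 576, 3])

Feeding this whole batch into the SVD conditioning image input is what nudges the model toward the blink.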
Don't try to animate the pedal movement; it won't work in the current state of the model. The settings should be clear by now: increasing the motion bucket will animate the legs, head, and so on, while increasing the CFG relative to the minimum CFG will increase the camera panning and background movement. I don't want any camera panning, just a forward-moving motion, so let's try that. This is going back and forth; oh, I forgot to turn off the ping-pong effect. Perfect.

In this workflow I want to take a motorbike example and show you the effect of having the right value for the augmentation level; by the right value I mean not too high and not too low. As said earlier, this setting is very sensitive, and 0.05 or 0.1 is a good place to start, then increase until you get the desired effect. Sometimes having it a little too high makes the motion lose detail. This may vary from image to image, but the basic range and effect stay consistent across images. Focus here: you can see how the wheel is moving, and notice the overall motion effect of the image. Let's see what happens when I change the augmentation level to zero: everything stopped moving, and that fantastic effect is not there anymore. Now see what happens when I increase it to 0.4: there is more smoke, but the most obvious movement, the front wheel, is motionless.

This workflow is a complicated one. I will generate two videos and combine the effects, using a technique called noisy latent composition. This is a DSLR photo taken by my colleague with his Fuji. I want to move the waves slowly and add clouds in the sky with a time-lapse motion effect. Nice and smooth. Let's create a group for the second video. I am going to use image-to-video again; however, you could use text-to-image or a ready-made video. This is a cloud image I generated using Photoshop Generative Fill; any image of clouds would do here, and don't worry about the ratio or dimensions. I will turn off this group while I work on the other image. I am going to add a node called Image Size to Number, which will show me the resolution after cropping within Comfy itself. I need that because the second video cannot be the same height as the first; the second video only needs to fill the sky. Now I will add two Number to Text nodes and connect the width and height; the Show Text nodes will display the width and height after cropping. The image size is 704 by 396. Replicate the entire video processing group for the second video and connect the image to the second SVD conditioning node. The height cannot be 576, as it would cover the first video; since the resized height of the second image is 396, we need a height that approximates that value, and 392 is good enough, as it can be either that or 400 (the height has to be a multiple of 8).

The next group is important, as all the prompts and the outputs from the KSamplers will be combined here. For the prompts I will use Conditioning Combine nodes, two of them. The order of the positive and negative here does not matter, because I am going to use a latent composition node, and it is the order in that node that matters. Connect the positives to the first Conditioning Combine node and the negatives to the second, from both SVD conditioning nodes. Add a Latent Composite node. The second KSampler connects to the samples_from input, and the first KSampler connects to the samples_to input; this node copies samples_from and pastes it on top of samples_to. I want the clouds pasted over the main boat image, so remember this order for the latent composite.
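The samples_from / samples_to direction is easy to mix up, so here is a simplified sketch of what a latent composite does: the samples_from latent is pasted over the samples_to latent. Feathering is omitted and the coordinates are assumed to be in latent space, so this is only an illustration of the idea, not the node's exact code:

import torch

def latent_composite(samples_to, samples_from, x=0, y=0):
    """Paste samples_from on top of samples_to at latent coordinates (x, y)."""
    out = samples_to.clone()
    _, _, h, w = samples_from.shape
    out[:, :, y:y + h, x:x + w] = samples_from
    return out

# Example: a 25-frame 1024x576 sea latent with a 1024x392 cloud latent pasted at the top.
sea_latent = torch.randn(25, 4, 576 // 8, 1024 // 8)    # from the first KSampler (samples_to)
cloud_latent = torch.randn(25, 4, 392 // 8, 1024 // 8)  # from the second KSampler (samples_from)
print(latent_composite(sea_latent, cloud_latent).shape)  # torch.Size([25, 4, 72, 128])

The feather setting discussed next softens the seam of this pasted region instead of leaving a hard edge.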
Add the third KSampler and connect the model from either of the CFG guidance nodes. Connect the Conditioning Combine nodes and the Latent Composite node to this KSampler. Here the denoise value should not be one. In the previous two samplers we used one because motion had to be added on top of the pixel noise in latent space; now we already have the pixel output and do not need to change much of the video outputs except to blend them, hence the denoise here should be 0.2 to 0.3. This ensures that the output from the third KSampler stays very close to the first two, just combined. Add the VAE Decode node and connect it to the final video output. Keep the feather value at zero at first to see how it combines; we can then alter the value to blend it. Let's try changing the feather value to about 256 and see what happens. Beautiful. This is the entire workflow to get this output.

The only thing remaining is the text-to-image group, which connects to the video processing group. Add a standard text-to-image workflow and ensure the height and width are near a 16:9 or 9:16 ratio, whatever works best with the selected checkpoint. Connect the VAE Decode node to the Image Resize node; this connects to the SVD conditioning, and that's about it. All JSON files will be available separately for YouTube members. I hope you found this tutorial helpful. Until next time!
Info
Channel: Control+Alt+AI
Views: 10,632
Keywords: stable video diffusion, stable diffusion video generation, ai video, stable diffusion video, image to video, stability ai stable video diffusion, how to use stable video diffusion in comfyui text to video, stable diffusion video animation, ai video generator, text to video, stable diffusion video install, stable video diffusion free, stable video diffusion install, stable diffusion video free, stable diffusion video consistency, comfyui video generation
Id: m-ZoxcYNWFg
Length: 44min 9sec (2649 seconds)
Published: Sun Dec 03 2023