ComfyUI: Stable Video Diffusion | Stable Diffusion | German | English Subtitles

Video Statistics and Information

Captions
Hello and welcome to this video, in which I would like to trade some lifetime for knowledge again. Stable Video Diffusion has been out for a short while now, and there was a ComfyUI update that allows us to use it here too. So I wanted to show you briefly and quickly, although we will see whether the video really stays short, how you can work with it in ComfyUI.

Very important, of course: you have to update your ComfyUI first. Do this via the update .bat in the update directory of your ComfyUI, or here in the Manager with Update All. Then you get the latest version of ComfyUI and you can get started. We don't need any custom nodes for this; everything comes with ComfyUI itself. It is also worth taking a look at the ComfyUI page. This is the ComfyUI GitHub page, and if you scroll down here, you might already know it, otherwise you can see it now, there are the ComfyUI examples. Click on that, and further down we have Video Models. That's the explanation we get from the ComfyUI developers themselves: a short and clear page on how to use it, and it's really not complicated. Don't be alarmed, and I think you get very good results out of it, better than AnimateDiff, and it is also easier to use. It is definitely very well suited for playing around.

To get started, up here we see two links. These are the checkpoints that we need, our own models, and we have to download them. The first link leads to a model that is trained for 14-frame videos and the second to a model that was trained for 25-frame videos. If you click on one, you come to its Hugging Face page; click on Download and save it in your ComfyUI models/checkpoints folder. Here I already have both: the SVD, which is the small one for 14 frames, and the SVD XT, which is the large one for 25 frames. I'll break off here at this point. So, back here again: when you have downloaded them, start your ComfyUI, then they appear and you can begin.

How do you get the easiest entry point? The simplest option is definitely here on the ComfyUI examples page: take the picture here, click on it, drag it over to your ComfyUI tab, let go, and that was the whole story. Here we see an image-to-video example. Up here our recently downloaded checkpoints are inserted directly. The SVD is for 14 frames; I'll switch it to the SVD XT, and then we'll see what we've got new here.

A few notes about it. First there is this Image Only Checkpoint Loader (img2vid model), which is important for this. But let me fix this for a moment, that's my auto snapping, I don't like it when it's so crooked. The Image Only Checkpoint Loader can be found in ComfyUI under loaders; there is now a video_models folder, and that's where it hides. The Video Linear CFG Guidance node has also been added. It's in conditioning, I think. No, that's the other one down here: the SVD img2vid Conditioning is in conditioning, and the guidance node is in the sampling area. There is also a video_models entry there: Video Linear CFG Guidance. These are the new nodes that have been added, and we have already seen in the example how they are all linked. So we have a CLIP Vision here that comes from the new checkpoint loader node, an init image, which for us is a Load Image for now, and the VAE, also from the checkpoint loader node. The model goes up here into the Video Linear CFG Guidance node and from there simply into the sampler. After the sampler we have a VAE Decode and another new node.
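As an aside, here is a minimal sketch of that image-to-video graph expressed in ComfyUI's API ("prompt") format and submitted over the local HTTP API. This is not the author's exact workflow file: the node class names, input keys and default values follow my reading of the ComfyUI source around the time of this video and may differ between versions, so treat them as assumptions and check the node definitions in your own install.

```python
# Sketch: submit an SVD image-to-video graph to a locally running ComfyUI instance.
import json
import urllib.request

graph = {
    # Loads the SVD checkpoint; returns MODEL (0), CLIP_VISION (1), VAE (2).
    "1": {"class_type": "ImageOnlyCheckpointLoader",
          "inputs": {"ckpt_name": "svd.safetensors"}},
    # The still image that the video will be animated from (placeholder file name).
    "2": {"class_type": "LoadImage", "inputs": {"image": "example.png"}},
    # Builds positive/negative conditioning and the latent batch for the video.
    "3": {"class_type": "SVD_img2vid_Conditioning",
          "inputs": {"clip_vision": ["1", 1], "init_image": ["2", 0], "vae": ["1", 2],
                     "width": 1024, "height": 576, "video_frames": 14,
                     "motion_bucket_id": 127, "fps": 7, "augmentation_level": 0.0}},
    # Ramps the CFG linearly from min_cfg (first frame) to the sampler CFG (last frame).
    "4": {"class_type": "VideoLinearCFGGuidance",
          "inputs": {"model": ["1", 0], "min_cfg": 1.0}},
    "5": {"class_type": "KSampler",
          "inputs": {"model": ["4", 0], "positive": ["3", 0], "negative": ["3", 1],
                     "latent_image": ["3", 2], "seed": 42, "steps": 20, "cfg": 2.5,
                     "sampler_name": "euler", "scheduler": "karras", "denoise": 1.0}},
    "6": {"class_type": "VAEDecode", "inputs": {"samples": ["5", 0], "vae": ["1", 2]}},
    # Writes the decoded frame batch as an animated WebP.
    "7": {"class_type": "SaveAnimatedWEBP",
          "inputs": {"images": ["6", 0], "filename_prefix": "svd", "fps": 7.0,
                     "lossless": False, "quality": 90, "method": "default"}},
}

req = urllib.request.Request("http://127.0.0.1:8188/prompt",
                             data=json.dumps({"prompt": graph}).encode("utf-8"),
                             headers={"Content-Type": "application/json"})
print(urllib.request.urlopen(req).read().decode("utf-8"))
```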
This is the Save Animated WebP node, because we get these videos in WebP format. They can be played in the browser. If you want GIFs or something like that, you can convert the WebP into GIFs somehow via online tools, or locally installed tools, I don't know; I just left it that way. In any case, it can be found in the... wait a minute, I have to take a quick look. I think it was in the for-testing area. Right, down here: for testing, Save Animated WebP, there we find this node. So apparently it's not quite mature yet; that's my guess, because it's in the for-testing folder. Let's wait and see what comes next.

Good. We now have the option to make different settings here, and I think we'll go through them once we've rendered a picture. I have a nice example here; I'll just load it from my input folder, as soon as I find it. Exactly. I just pulled this photo out of Google image search. Let's load that in there. And now we can specify different things here, for example the width and height of the video that should be rendered. So we take this image as input, but we can still specify the size of our latents separately here; it does not automatically take over the size of the image we have in front. That can be practical or impractical, but we'll see in a moment. Video Frames means how many frames it should render. We'll stay with the small one, the SVD safetensors, because that's trained for 14 frames and we'll stay there; it's a little easier to calculate, that's why I'll stay with the small version, but in principle you can use either. So we want to have 14 frames generated, and with an FPS of 6, so 6 pictures per second.

Here's one point I don't quite understand. At the front here we say we want to generate this video at 6 FPS, but back here we say we want to save it at 10 FPS. That doesn't quite fit into my world; I don't know why that's the case. I would like to say from the beginning: we want to have 7 FPS here, because if you divide 14 by 7, we get a two-second video. And at the same time I'll also turn the export back here down to 7 FPS. That makes it a little more jerky, but for me that somehow makes sense. Maybe we'll get more information at some point; let's see.

For the other things that can be set here, there's actually a little documentation down on the video page; at the very bottom are some explanations of the parameters. Video Frames: number of video frames to generate. That's what I just said, the number of pictures we want to generate. We have a Motion Bucket ID, that's this value here. With it we can specify how strong we want the movement in the picture. It's set to 127 for the example and I'll leave it that way, but the higher you turn it, the more movement you get in your little video. I had it at 500 or so once; things really moved, and that was too much. We'll leave it at 127, but now we know: if the movement is too slow or too weak, you have to turn this value up so that more movement comes out in the end. FPS: the higher the FPS, the less choppy the video will be. Yes, that makes sense; I just said ours will probably be a little rougher. We increased it from 6 to 7, but here at the back, when saving, we also turned it down to 7. I still don't quite get why we should generate a video at 6 FPS here and then save it at 10 FPS; if we save it at 10 FPS, we'd have duplicate frames.
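For the GIF conversion mentioned above, here is a small local alternative to online converters: a sketch that turns the animated WebP written by the Save Animated WebP node into a GIF with Pillow, computing the per-frame duration from the chosen FPS (14 frames at 7 fps gives a 2-second clip). It assumes Pillow is installed with WebP support, and the file name is just a placeholder matching the sketch further up.

```python
# Convert an animated WebP from the ComfyUI output folder into a GIF.
from PIL import Image, ImageSequence

FPS = 7                      # 14 frames / 7 fps = a 2 second clip
frame_ms = int(1000 / FPS)   # per-frame display time in milliseconds

with Image.open("output/svd_00001_.webp") as webp:
    frames = [f.convert("RGB") for f in ImageSequence.Iterator(webp)]

frames[0].save("output/svd_00001_.gif",
               save_all=True,              # write an animated GIF, not a single frame
               append_images=frames[1:],
               duration=frame_ms,
               loop=0)                     # loop forever
print(f"{len(frames)} frames -> {len(frames) / FPS:.1f} s at {FPS} fps")
```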
That's still a bit of a mystery; I haven't figured it out yet. The Augmentation Level, this setting here, determines how far the pictures in the video that we generate should be removed from the original image, because it says how much noise is to be added to the pictures. And through that noise, of course, we create a distance from the original image. Let's leave it at zero for the example. I'll start that now, and then we'll look at the last node, because until that's done it takes a while; in my case, at this size, I don't know, 40 or 50 seconds or so. We can already see that the sampler is rattling along. It works, but it always depends on the hardware.

The Video Linear CFG Guidance sounds pretty interesting, because according to the description it does the following: it takes a CFG of 1 for the first picture we make and then increases it over time. Here we see 1.0 on the first frame, 1.75 on the middle frame and 2.5 on the last frame. I think it's also meant to improve sampling with these video models a bit. I guess the idea is to stay as close to the input as possible at the beginning, and the further the video gets from the original image, the more creativity the AI is allowed to let flow in at that point.

So, good, that has already finished. Here we see the result that came out. I think it's very good; between the initial image and the end result it's what you'd expect: someone who is currently driving on the motorway. I think it's pretty cool that it renders the vehicles on the other lane well, so that you can clearly see them. It turned out to be a pretty realistic-looking little video. I find it very, very impressive. Really. Very, very good.

I can now turn the size down to 512, then it goes a little faster. I don't think it's that choppy with the 7 FPS. And here we see very well: if I turn that down, we get a 512x512 output video, independent of the input image. We'll take a look at that again in a moment, but it just cropped it. You can already see on the ventilation slit that it was cut off here, and the left hand was cut off. It doesn't work that well this way. What you get at this point, by the way, are these WebP files, and you can actually view them in the browser. Unfortunately, double-clicking doesn't work, and if we right-click here and say open image, then it wants to save it, for whatever reason, instead of just showing it, and you have to take the detour via the output folder. Maybe it's a browser setting on my end or something, I don't know; I don't have time for that now.

Also interesting to observe here is that the CFG is very low, so that the result stays as close as possible to the original image. Which also makes sense; we probably don't want that much interpretation in the video by the AI itself, it's supposed to focus on the input. In any case, that's something you can play around with, but the higher you set these values, I've already noticed, the uglier the result will be. Play around with it. We'll leave it like this for the example, and I'll just go back a step to the original resolution. We'll take another picture, which also achieved very good results. That one. I got them all from Google image search; they were intended for the IP-Adapter, but we'll just use them now. We'll let that run through.
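Here is a worked example of what the Video Linear CFG Guidance described above appears to do: the effective guidance scale is interpolated linearly from min_cfg on the first frame up to the sampler's CFG on the last frame. With min_cfg = 1.0, cfg = 2.5 and 14 frames this reproduces the 1.0 / roughly 1.75 / 2.5 values mentioned for the first, middle and last frame; the node's exact internals may differ, so this is only an illustration of the idea.

```python
def per_frame_cfg(min_cfg: float, cfg: float, video_frames: int) -> list[float]:
    """Linearly ramp the guidance scale across the frame batch."""
    if video_frames == 1:
        return [cfg]
    step = (cfg - min_cfg) / (video_frames - 1)
    return [min_cfg + i * step for i in range(video_frames)]

scales = per_frame_cfg(1.0, 2.5, 14)
print([round(s, 2) for s in scales])
# first frame 1.0, middle frames ~1.75, last frame 2.5
```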
And there it is. Here, too, we get a pretty good result: a video generated from a still image. I think that's very, very cool, how the background changes perspective correctly and so on; everything works perfectly.

What we can do up here now, of course: we've only loaded an image so far, but we can take any image as the source image, and that already points to the second workflow. So far we've just loaded the picture in; let's load that one into our ComfyUI as well. And here we can already see that it was done the same way: up here a normal picture is simply created with samplers, and back here it is then converted into a video. So instead of the Load Image node we now have the output from our sampling, and then basically the same thing happens. Let's go back a step. By the way, there is a view history here, so you can jump back again. We'll do that now: we take the Load Image out, pull it away, and build a sampling area in front of it. For this I'll just use our TinyTerra nodes again and say I want a sampler here, and we enter something like: cat sneaking through high grass, high detail, intricate, masterpiece, 8K UHD. Just a little something like that. I'll leave the sampler as it is and do a preview here.

What we can still do now, or what probably makes sense, I'll do back here, because we want to convert our image into a video: we can certainly wire the image over here, then it works, but of course we haven't set the width and height yet; we're going to generate a 512x512 image. And we can use a Get Image Size node here. Where does that come from? I don't remember; the modded nodes, but there are definitely several packs with one. Get Image Size, let's just take this one; whether that's from the core ComfyUI, I don't know, there are several packages for this. And from it we now get the width and height; I've done that in several videos before. So now we have the image from our sampler wired into the video node here, and at the same time we get the size of the source image for our video, so the image size that our sampler created also fits. And now we throw away the Load Image node.

Now let's let the whole thing rattle. No, no, no, no. I always forget to change the model. I don't want that one, I just want Absolute Reality, and at the same time we can also set a VAE. Now it should fit; now it has to load the model once. Now we have our cat sneaking through the tall grass, and it goes on back here: the image was taken over, we got the right size, and everything rattles off like before, only that we no longer have a Load Image but instead generate the image ourselves. The area up here doesn't matter; if you have an SDXL workflow, that doesn't matter at all. What's important is that we use the video models when creating the video. Now we got this out of it. It's a bit of a scruffy cat, I don't know why, but it works; you get the principle. So you can basically create pictures up here however you want them and then throw them into this video processing chain.

There are still a few other aspects. If you are in SDXL territory, make sure the image is not too big. If you make it too big, you run into out-of-memory errors; that means there is simply not enough graphics card memory to calculate all these pictures, because we are already working with an image batch here. It keeps all the pictures in memory and then works with them, so that can happen quickly.
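The Get Image Size idea above, expressed in plain Python as a sketch: instead of hard-coding the latent size for the video, read the width and height from whatever image the first sampler produced and feed those numbers into the SVD conditioning. The file name is hypothetical; inside ComfyUI the equivalent is simply a get-image-size node wired into the width and height inputs.

```python
from PIL import Image

# Image produced by the txt2img stage (placeholder path).
with Image.open("output/cat_00001_.png") as img:
    width, height = img.size

print(f"SVD_img2vid_Conditioning width={width}, height={height}")
```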
It also depends on your hardware, how much you can afford. Or rather how much your hardware can afford. Not you; of course you can afford everything. But your hardware. You know what I mean. If it's too much, turn the size down a bit, or the number of frames that should be created. Although I think with 14 frames, and now at 7 frames per second, you definitely want to have 2-3 seconds at the end. But it's a nice toy and relatively easy to use.

What is also very interesting: if we put a Preview Image here, I don't want to save it now, I just want the preview image here, and run it again... it's a bit annoying that we always have to go through the whole thing again. In any case, we get the entire image batch here, that is, the video frames we created, 14 of them now, as individual images from the batch. That might be interesting to know. I have to bridge the time until it's done, that's why I'm rambling a bit now. And now we have it. I just took a Preview Image node, but if you use a Save Image node here, you get all these individual images saved to disk as well. That means we see here it is 1 of 14, 2 of 14, and here we see the animation; all of these individual images get saved. That might be interesting if you later want to take the individual images that Stable Video Diffusion created here into an image-to-image stage again, to convert them into other styles or something else.

Because I also noticed: if you go into a conversion process with Canny or something at that point, it doesn't really work, at least in my attempts. Canny then always takes just the first picture, not the batch of pictures, and then tries to apply the first picture to all the others. That didn't work at all. I can't rule out that I did something wrong; it's a relatively new technology that we're looking at, although Canny, the ControlNet Canny, is old technology. In any case, if you save the images individually, you can also load them individually again, with the Batch Image Load from the WAS Node Suite for example, and then you have the option of not working in the batch but loading the images individually, sequentially one after the other, and then the ControlNets work. I wanted to show you that because here we only get the video out; this Save Animated WebP node takes the batch and puts it together into a video, but at this point we can also nicely extract the individual pictures from the batch again.

I also looked at upscaling a bit, and for upscaling I found out the following: it didn't work for me to say I want to use another model and upscale with that afterwards, at least not when we stay in the latent space; it will probably work from the image space. But I'll show you one of the options that I use for upscaling. I took the NN Latent Upscale node, I already showed it in the last video, and said I want to double it. It doesn't go higher than two, I noticed; good to know. But for the video that's enough. And what I did at that point: I mean, we have our VAE Decode down here, but we can also feed the latents directly in here. So I basically took another KSampler, copied this one, and threw the latents in there. We are still in the batch area here, so all generated latents are in there. And here, too, I said I want the same model again, the same positive from here, and the same negative from here.
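The same save-and-reload trick described above, sketched outside of ComfyUI: once the 14 frames exist as individual files on disk, you can walk them one after the other and run whatever per-image processing you like (for example a ControlNet or img2img pass per frame) instead of handing the whole batch to a single node. The directory and name pattern are assumptions; adjust them to your output folder.

```python
from pathlib import Path
from PIL import Image

# Individual frames written by a Save Image node, e.g. "1 of 14", "2 of 14", ...
frames = sorted(Path("output").glob("svd_frames_*.png"))
for i, path in enumerate(frames, start=1):
    with Image.open(path) as img:
        # Placeholder for per-frame work (Canny preprocessing, restyling, upscaling, ...).
        print(f"frame {i}/{len(frames)}: {path.name}, {img.size[0]}x{img.size[1]}")
```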
The only other thing I change here is that I take the denoise down to 0.5; otherwise everything stays the same. And then back here we can use another VAE Decode. We need the VAE again, which comes from the model loader, and we take another Save Animated WebP at that point. Let me push this down a bit; the two overlap a little.

And so that the whole thing goes faster, because now we'd be calculating 14 pictures with 20 steps at double the size (remember, I have 512x512 back here, so now we want to pump that up to 1024x1024), I went in again and said I'd like a LoRA loader here, and I just load the LCM LoRA in. Where is it? In the meantime, since the last video, I've downloaded a lot of trained LoRAs for faces because I liked it so much. LCM 1.5, there it is; I also introduced it in the LCM video. Here we can pull in the model and route it down here, and then we say we want to use the LCM sampler in this sampler. I usually use sgm_uniform as the scheduler for it, and with that we can turn the steps down to 4, which accelerates the whole thing again. So we have the NN Latent Upscale here, which is also much faster at upscaling than normal VAE upscaling, and we only need 4 steps per picture.

Now the node up here of course still needs a CLIP, and we don't have a CLIP at this point. But we can work around that by adding a Load Checkpoint here and putting our SVD in it again. Which one did I take? The normal SVD, down here, the same one we have in the video checkpoint loader node. We load that up here again in the Load Checkpoint, get the CLIP from it, and can hang it in there.

And if I start the whole thing now, we get a picture of a cat again. A very nice picture, by the way; I like it, very cool. Now at this point our video is generated with 14 frames; we are still on the model trained for 14 frames. Remember, up here: the small model is for 14-frame videos and the larger one for 25 frames. For this video I'm staying with 14, and there's another reason: 14 divides nicely by 2, so I can enter a 7 here. If I had taken the 25-frame model, I would probably have made 50 frames, because 25 sometimes doesn't divide so well. I had an example earlier where I ended up at 12.5; so if you want one second, you'd have 12.5 frames back here, but you can only enter 12. That was a bit too inaccurate. Anyway.

Well, now we have our cat that really sneaks through the grass, that's pretty cool, and here we have our cat that has been scaled up. It has become a bit blurry, but it was relatively fast, I find. Of course you can also leave the LCM in up here; I tried other things too, comic styles and so on, and it worked very well. I can do it again for comparison; I actually tried it with a comic picture. Let's try the Save WebP node and just rewire the output here, so that our previously generated picture stays and we generate a new one. I didn't fix the seed, though, so... yes, we can't compare directly, but we can look at the quality we get at that point. We take the LCM out here and go directly into the model. Then of course we have to rearrange things again: I'll switch here to dpmpp_2m with karras and say 20 steps. And now I let it generate a picture again, and that will take a while, unfortunately.
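For orientation, here is a rough API-format fragment of the refinement pass described above, as I understand the setup in the video: the LCM LoRA is hung onto the video model (the CLIP input comes from loading the same SVD checkpoint once more with a regular checkpoint loader), and the upscaled latents are re-sampled with the lcm sampler, sgm_uniform scheduler, 4 steps and denoise 0.5. The node ids, the LoRA file name and the cfg value are my assumptions, and the NN latent-upscale node itself comes from a custom node pack, so it is only referenced here rather than spelled out.

```python
import json

refine_pass = {
    "20": {"class_type": "LoraLoader",
           "inputs": {"model": ["4", 0],    # video model from the graph sketch further up
                      "clip": ["15", 1],    # CLIP from a plain CheckpointLoaderSimple on the SVD file (assumed id)
                      "lora_name": "lcm-lora-sdv1-5.safetensors",
                      "strength_model": 1.0, "strength_clip": 1.0}},
    "21": {"class_type": "KSampler",
           "inputs": {"model": ["20", 0],
                      "positive": ["3", 0], "negative": ["3", 1],  # same conditioning as the first pass
                      "latent_image": ["19", 0],                   # output of the NN latent upscale (assumed id)
                      "seed": 42, "steps": 4, "cfg": 1.0,
                      "sampler_name": "lcm", "scheduler": "sgm_uniform",
                      "denoise": 0.5}},
    "22": {"class_type": "VAEDecode",
           "inputs": {"samples": ["21", 0], "vae": ["1", 2]}},
}
print(json.dumps(refine_pass, indent=2))
```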
That's why I'll have to cut here. But we get a similar picture, that fits, and then we can look at the quality loss from the LCM once the whole thing has rattled through.

So, it has rattled through, and it took a long time. Let me have a look: that took 143 seconds now, so over two minutes; the prompt before was done in just over a minute. So it makes a difference whether you take 20 steps or just four. And I have to say, did something go wrong here? The LCM gave much better results here. Interesting; good to know. We can try it again: if I load Absolute Reality again, send that model directly in here and queue a prompt... the difference in quality, I think, is really, really crazy right now. But here I would then expect an error; it always bailed out on me. But again: we can save these pictures in between, as I just showed, with a Save Image out here; then you have every single picture, and that makes it possible to upscale every single picture again. What happened to the poor animal here? My goodness. And there is the expected error, so it doesn't work to mix the models at this point. Upscaling here does work if you use the same model combined with the LCM. It's good to know that it technically works, that we have that option. The result was a bit blurry, but still better than the result without LCM, although I can't explain to myself why that got so bad despite the NN Latent Upscale.

With normal upscaling at this point, I also noticed that I'm already pushing hard against my memory limits; only with the NN Latent Upscale did I get a bit further. But of course that all depends on how you define the chain beforehand, how many pictures you create, and what original size the pictures have, with 12 GB of VRAM in my case. The more VRAM you have, the bigger you can go, or the more frames you can calculate.

But yes, it's a new technology, we're just getting started. I'm still curious whether there will be trained variants of the Stable Video Diffusion models at some point; I actually think so. But you don't need to ask me about that yet, I haven't dealt with it, though I see it again and again on Discord; there are specialists who know what they're doing. So let's just trust the future on that point. It's definitely very, very interesting that with relatively little effort you get good results. And okay, this one isn't a good result, I admit, but put a little more love into creating your picture and you get results like the ones I just showed, the one with the car and the one with the woman from the cinema; then you get really, really good things.

So then, as I said: update your ComfyUI, start playing around, experiment. We'll probably see more of these small animated pictures, because for me they are now much easier to produce. You have a little less control compared to AnimateDiff, where you can still set camera directions and so on, but very good results come out nonetheless, and we'll see them more often in the future. And you know how it is with AI: think about how the DALL-E pictures looked a few years ago and what pictures we can create in the meantime. And now, with Stable Video Diffusion, it's possible, you have to keep saying it, to run this on home machines and create things like this. And there will be a lot now, I hope at least.
That will be an exciting future, people. I wish you a lot of fun experimenting, and I say goodbye. See you in the next video. Until then, take care and bye.
Info
Channel: A Latent Place
Views: 898
Keywords: ComfyUI, Stable Diffusion, AI, Artificial Intelligence, KI, Künstliche Intelligenz, Image Generation, Bildgenerierung, LoRA, Textual Inversion, Control Net, Upscaling, Custom Nodes, Tutorial, How to, Prompting, Stable Video Diffusion
Id: V8vhzlJpJ3c
Length: 34min 14sec (2054 seconds)
Published: Sat Nov 25 2023