"NOT SORA" - Zeroscope In Depth - full Workflow & Tutorial

Video Statistics and Information

Captions
In this video I'm going to show you how you can generate this type of stock footage on your local machine. We're going to build a workflow together and explore a really, really cool technique to upscale the output to beyond 1080p using AnimateDiff, LCM and IPAdapter encoding. There's a good chance you're going to improve your ComfyUI skills in the process and, more importantly, understand what's going on beneath the hood. So let's go.

The videos you're looking at were generated with Zeroscope. Zeroscope is a ModelScope-based video model that's unique in the sense that it's watermark-free. It uses a two-step system to generate high-quality videos at 1024x576, which we'll then upscale further: the first model generates 320p video, and the second step upscales it. Which is very clever, because unless you have a Grace Hopper cluster sitting in your basement, you're going to want to explore both the prompt and the seed space by generating loads of quick test videos, gauge the output, and then, and only then, upscale the better ones to larger resolutions, preferably while you sleep.

As usual, I've made the whole thing available online. And because I can already read the comments: yes, evidently this is not Sora, and neither does it claim to be. It is, however, outputting some of the better video content you can generate locally right now, today. Given that it was trained on less than 10,000 clips with 30,000 frames tagged, I really wish the team had more financial means to train it even further, because I can see the vision. Now, this video does have chapters, but because I'm a new channel YouTube is being very difficult, so check the description for links and timestamps. Let's get noodling.

All right, so we're going to need a bunch of stuff. First we need the nodes, obviously. Don't worry about the models themselves, because they're built to download automatically. I still recommend you check out both of cerspense's Hugging Face repos for the first and the second model, and it also wouldn't hurt if you read the docs, because there's a lot of very interesting stuff in there which is probably going to get implemented a little bit later. I talked to the author, he's super nice, and I really want to give him credit here for the hard work he's done, especially when we get to the upscaler. You're going to absolutely love it; it's pure genius.

All right, so let's get noodling. Very straightforward stuff: we clear the workflow so we have a blank slate, and we need to load the first node, which is going to be our model, t2v, which stands for text-to-video (the other one being the upscaler, v2v), but we need t2v. The node looks like this, and immediately we see a little problem: the prompt is way too small. So we're going to fix this; we're going to go and add a prompt node. Now, depending on what you have installed there's a ton of options. Personally I like to use a specific one from a developer called Searge. I'm going to use that prompt text node, and I'm going to go and recolor it, because I know you guys love it and I want to make you happy: red for the negative and green for the positive. Very straightforward stuff, right?

So now I need a prompt. I'm going to type "a field of flowers viewed from above", like a drone shot. You can put whatever you want in there, by the way, and you can play with it, knock yourself out; that's all part of the fun, it's there to explore. Next I need to go and convert my prompt to input and my negative to input so that I can connect them to my existing prompt node. That's really also very easy; this is the easy part of the tutorial.
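The rest of this tutorial builds the pipeline in ComfyUI, but for reference, here is a minimal sketch of the same two-stage idea using the diffusers recipe from cerspense's model cards. This is an assumption-laden sketch, not the video's workflow; on recent diffusers versions `.frames` is a nested list of PIL images, so use `.frames[0]` and skip the `Image.fromarray` conversion.

```python
import torch
from diffusers import DiffusionPipeline, DPMSolverMultistepScheduler
from diffusers.utils import export_to_video
from PIL import Image

prompt = "a field of flowers viewed from above, drone shot"

# Stage 1: text-to-video at the base model's native 576x320 resolution.
pipe = DiffusionPipeline.from_pretrained(
    "cerspense/zeroscope_v2_576w", torch_dtype=torch.float16
)
pipe.enable_model_cpu_offload()
video_frames = pipe(prompt, num_frames=24, height=320, width=576).frames

# Stage 2: video-to-video refinement at 1024x576 with the XL checkpoint.
xl = DiffusionPipeline.from_pretrained(
    "cerspense/zeroscope_v2_XL", torch_dtype=torch.float16
)
xl.scheduler = DPMSolverMultistepScheduler.from_config(xl.scheduler.config)
xl.enable_model_cpu_offload()
xl.enable_vae_slicing()  # keeps VRAM manageable while decoding 24 frames

video = [Image.fromarray(f).resize((1024, 576)) for f in video_frames]
upscaled = xl(prompt, video=video, strength=0.6).frames
export_to_video(upscaled, "zeroscope_stage2.mp4", fps=24)
```

The key design point is the same one the video makes: stage 1 is cheap, so you explore there, and stage 2 only ever sees the clips worth refining.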
And I'm going to drag them correctly onto the right inputs. Very good. Now I need to put some negative text; I'm going to choose "bad quality, illustration, cartoon, CGI", and we're going to put, let's see, "illustration", let's put "painting", let's stick in "drawing" as well. I like photorealistic, I just like it; if you want something else, use something else. Next I obviously need to combine the output of my frames, so basically all the images that form the video, into one MP4. I'm going to use Video Combine for that; connect the two, and now we can look at the options.

So, number of inference steps: think of it as the steps in a KSampler, it's pretty much the same. If you put a high number it's going to look better sometimes, but it's also going to be a lot slower to process. I tend to go for something like 100, and I've also used 200 at times, but the default is 25, so I'll go and use, say, 30. Let's go for that. Next is guidance scale: think of it as your CFG. Again, the higher, the more it will respect your prompt, but 12 is a good number.

For the seed, I'm going to pick something like, well, not 42, because that's the answer to everything; I'm going to put my lucky seed of 77. But here you have to be a little bit careful, because if you use a prompt that says, for example, "Will Smith eating spaghetti", and then you use, say, "Björk eating spaghetti" with the same seed, it's going to look like the same scene: they'll be sitting in the same chair in the same environment. This, I discovered, has to do with the low number of clips this was trained on, so be mindful of it. For now we'll use a random seed.

When it comes to width and height, you don't want to change those parameters, because evidently the first model was trained at that resolution, so let's leave it as such. Next is the number of frames, and here again there are certain parameters you need to respect. Above 24 you might see the image become strange and artifacts being created, and at 48 or something like that it's just noise being returned; look, it's very artistic, but that's not what we want. Below 24 you're not going to get great results either, so stick to 24. Think of it a bit like SVD, for example, where you know you have a certain number of frames you're going to have to work with, in order to then expand by, say, creating some sort of slow-motion effect. It works the same.

Next we're going to look at our Video Combine parameters, because sometimes people struggle a little bit with those. The frame rate is going to be 24; we want it to be smooth, we want that 24 fps movie feel. For the compression we're going to use h264-mp4. The reason for that is it gives us access to the CRF parameter, which makes things much easier than bitrate: 4 is basically lossless or near-lossless, and below that it's kind of a placebo effect, if you will. And of course the filename prefix: let's name it something meaningful, "zeroscope stage one", because we want to keep those files; you'll see why in a minute.

Now we just need to hit Q and it's going to return something. I'm going to speed it up. There we go, so we get results, but the flowers are a little bit bland. Let's change it; let's put "a field of multicolored flowers", something like that. And look, I'm doing this because I want you to understand: this is the exploration stage. It's very important for you to do these modifications now, while this is fast; it will not stay that fast, especially when we get to the upscaling part of things. So explore, explore, explore; change the seed until you're happy with the composition.
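If you prefer to script that exploration outside ComfyUI, here's a minimal sketch reusing the stage-1 `pipe` from the earlier diffusers example: render cheap previews across a handful of seeds, and only push the keepers through the expensive stage-2 upscale. The seed list and file names are arbitrary, not from the video.

```python
import torch
from diffusers.utils import export_to_video

# Sweep a few seeds at the cheap 576x320 stage; keep one file per seed
# so you can pick the compositions worth upscaling later.
for seed in [7, 77, 777, 1234, 99999]:
    generator = torch.Generator(device="cuda").manual_seed(seed)
    video_frames = pipe(
        "a field of multicolored flowers viewed from above, drone shot",
        negative_prompt="bad quality, illustration, cartoon, CGI, painting, drawing",
        num_inference_steps=30,  # the video uses 30; the default is 25
        guidance_scale=12,       # "think of it as your CFG"
        num_frames=24,           # the sweet spot the video recommends
        height=320,
        width=576,
        generator=generator,
    ).frames
    export_to_video(video_frames, f"zeroscope_stage1_seed{seed:06d}.mp4", fps=24)
```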
That's how it works. I'm going to hit Q again, and sure enough, our flowers are now beautiful and wonderfully colored; I'm very happy with these results. The file is saved to disk because we chose the "true" option for the save.

But now we need to go and add the second part, the second model, and therefore the second node: that's obviously the upscaler designed specifically for Zeroscope videos. In there we immediately see another problem: the prompt is asked for again. So here I'm going to do a little bit of video editing magic. We're going to create our first reroute; some people have asked for this. You just type "reroute" (you need rgthree installed), you select it, you right-click; there are keyboard shortcuts, but you can follow what's on screen to see how it's done. You retitle it, you right-click it again to make it resizable, you resize it to the proper size, and you have your first node. Great. Now we can build a bus: I drag a line, a noodle I should say, between those two, I'm going to put them over there, and we're going to fast-forward this and have three of them, identical. Great.

So now I can grab my prompt and drag it to the bus, do the exact same thing with the negative prompt, and my image, because it's required of course, and port them over to the second node. Here I'm going to convert the prompt to an input, the negative to an input, and of course eventually at one point I'll do the seed, but let's just do this for now. I'm going to drag these two lines, and we need the frames there. Okay, so now I'm going to copy the VHS Video Combine and paste it (which is why the video is repeated, if you were wondering). I check all the settings; they're correct. Great.

Now let's go and move our prompts around so it looks a little bit better, and we're going to add a seed. I'm going to use the rgthree seed node because, as you know, I'm a big fan of it; it's going to help me have repeatable seeds. Let's hit "new fixed random". There's a little catch here; I'll take you through it later. And of course, as you can tell, we now also need the exact same inference steps parameter and the exact same guidance scale for both models. To make it easier (but it's not required), what I'm going to do is create a primitive integer, which can come from any node pack you may have (there are tons of them, so pick whichever you like), and a float, that's for the CFG of course. That way I'm going to have a bus for that data that goes from the first to the second model, and they're going to be in sync. I find that to be a really useful thing, especially when you generate hundreds of videos. So again, here I'm going to do some editing magic, because this is a little bit boring, and I'll see you when it's finished.

Okay, I'm going to quickly change all these items to inputs so that I can drag my noodles, and now we have a proper bus. Let's check that our VHS Video Combine settings are okay; they're okay, everything looks great. And of course, the final step: we add our upscaler. Let's move this over there, and we're going to add a standard upscaler: we're going to use Load Upscale Model / Upscale Image (Using Model). It's pretty straightforward. There are tons of choices, as many as you've downloaded from OpenModelDB. There's a new one that came out called 4x Realistic Rescaler 100000G, which is a pretty funny name, but I'm partial to 4x-UltraSharp v1.0, so I'm going to use that in this example. Let's add an Upscale Image (Using Model) node and we're almost done; we just need to drag this noodle over there, there you go, almost there, and this one over here.
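The video drives this through ComfyUI's Load Upscale Model node; under the hood, ComfyUI loads these .pth files with the spandrel library. Here's a minimal standalone sketch of that step, plus the Lanczos downscale discussed next, assuming you've downloaded 4x-UltraSharp.pth and have a CUDA GPU; the path and helper name are mine, not from the video.

```python
import numpy as np
import torch
from PIL import Image
from spandrel import ModelLoader

# Load an ESRGAN-style upscaler the same way ComfyUI does internally.
model = ModelLoader().load_from_file("models/upscale_models/4x-UltraSharp.pth")
model.cuda().eval()

def upscale_frame(frame: Image.Image) -> Image.Image:
    # PIL -> BCHW float tensor in [0, 1], which is what spandrel expects.
    x = torch.from_numpy(np.array(frame)).float().div(255)
    x = x.permute(2, 0, 1).unsqueeze(0).cuda()
    with torch.no_grad():
        y = model(x)  # a 1024x576 frame becomes 4096x2304 with a 4x model
    y = y.squeeze(0).permute(1, 2, 0).clamp(0, 1).mul(255).byte().cpu().numpy()
    # Downscale back to what the pipeline expects downstream.
    return Image.fromarray(y).resize((1024, 576), Image.LANCZOS)
```

Running a 4x model and then downscaling is a classic detail-enhancement trick: the model invents high-frequency detail, and the Lanczos downscale keeps the sharpness without the resolution cost.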
The issue, of course, is that we can't just connect the image straight to the video frames, because we're using a 4x model; and again, this depends on what setup you have, but you need to do a downscale first if you're using a 4x model, as most of you do. So let's go and use Upscale Image; we're going to use the most bog-standard downscale possible to make it easy. We'll drag this noodle back to video frames, and we're going to change the settings to match what the model expects: 1024x576 (there you go; if I typed properly it'd be even better). And we're going to switch it to Lanczos, because the quality is slightly better, and we're good to go.

So now we just need to hit Q. I'm going to fast-forward time for you, and now we get this output that looks quite interesting, actually; I think quite different from the first. So let's go and change our Load Upscale Model; I want to show you what that looks like. Hey, why don't we use the new one, why not, right? I'm going to hit Q again, fast-forward time again, and look, the output is different again. So just keep that in mind: you're going to get a different effect, and you'll have to choose the right upscaler for the job depending on your subject. If it's a face, you need to use a face model, and so on and so forth.

So now we have some pretty cool flowers; there's even one floating in the wind, which is pretty neat. But obviously now we need to upscale the output, and this is unfortunately where we hit a bit of a brick wall. Why a brick wall? Well, if you watch my other videos, you know you have basically two options at this point. One would be to encode the frames into latents and then upscale those latents, maybe in multiple passes to save on VRAM; but the issue, of course, is that you're going to lose temporal consistency, because each frame is going to be reimagined individually from the others. It's not going to look good, guys. The other solution is to use a pixel-based upscaler, something that does not change the frame; but the problem here, of course, is that it doesn't look very good, because it's just that, a basic pixel upscaler.

And of course I can hear some of you saying, "why didn't you try SUPIR?", etc. Well, of course SUPIR was the first thing I tried, because I like the way it reinvents the image, but not that much here. Unfortunately it didn't work: first, if I used a very high ControlNet strength, it would give me an image that quite frankly didn't look as good as what I'm about to show you; and second, if I loosened the ControlNet, then I would lose temporal consistency, because SUPIR internally is using a Stable Diffusion super-resolution upscaler, which is going to modify the image between frames. So you get the point: we're stuck in a loop. It's horrible. But don't worry, there is a solution that the author of these nodes came up with, and it's pretty genius if you ask me. I've implemented it into this workflow, and I want to take you through it so you understand exactly how it works, step by step, because we're going to do several things in parallel. It's really exciting. Let's go.

Okay, so let me take you through the basics of the workflow. It works exactly like my usual tutorial workflow, where we have a model pipeline: we load a model (in this case I chose epiCRealism, but you can pick anything you want), a LoRA loader (we'll get back to that in a sec), and every possible option you'd want on your model pipeline. Everything is controlled with bookmarks; you just hit 2 and it gives you access to the switches. So, for example, let's imagine we wanted to test something:
I would enable Zeroscope stage one, and you can click here to go see what it looks like. I left you some notes, because I love you guys, and they give you all the information you need to get started. So here we have exactly what we had prior, except it's a little bit better organized, with a nice little bus here using rgthree nodes; you get the drift, it's pretty straightforward to understand. And that's our second step. Now let's go and re-enable that, and the third step, which is our upscaler.

I'll start with the second step. In the second step nothing has changed; we just added a preview, which allows you to test the various types of upscalers and their creative output, if you will. We're talking about things like 4x UltraSharp; foolhardy Remacri is pretty good as well, so give it a shot. It really depends on your image. Okay.

And then here we have the famous upscaler. So how does it work? Well, to start, it takes a prompt, but not just any prompt: it takes the prompt that we passed at first, and it adds a little bit to it. In this case it requests "smooth motion", "high quality"; I put "4K" in there for luck, but you know, guys, seriously, that doesn't work, okay, this is just for fun. And don't forget your negative prompt: "watermark, text, signature, blurry". Again, nothing to worry about here: this dataset doesn't have any watermarks, so it's kind of redundant. I've left some for you here so you can play with that as well and try new things; it's all about exploring. And if you wanted to just upscale a video from stage one or two (preferably stage two, of course), you could select it here. Now, this clip is from a little short movie I'm making; it's about a certain person called Imad Al-Musk, who has a son, and that son takes over the world by selling AIs that scare people as a marketing strategy, you know, fear something. I'm not going to say the word. Anyway, that's what I'm working on right now. Let's get back to being serious.

Okay, so in the pipeline we have a context node; it contains everything you need: the model, CLIP, VAE, positive, negative, etc. But of course, here we override it with our own conditioning, which stems from the post-prompt and the negative prompt added for good measure. Now, what's the magic? Well, the magic is using AnimateDiff. This is not a tutorial on AnimateDiff, but just so you understand, I'm using the Gen 2 nodes, meaning they look a little bit different from the Gen 1 nodes, but they work exactly the same. So if you want to use the Gen 1 nodes, because you find them easier or you're more used to them, knock yourself out; it's the same thing. We're going to disable FreeInit, because it does multiple passes, as you know, and we don't need to waste the CPU, or I should say GPU, cycles in this case. And I'm not doing anything fancy with the samplers; they're straight samplers, no custom samplers, nothing like that, because the genius of it is that it all goes into an IPAdapter, which does all the encoding of the image. I think that's really cool: instead of a latent, here we use embeds. Imagine, if you will: the naive version is like doing a VAE encode, passing it the pixels of the image and the VAE from the model, and outputting a latent to a sampler.
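As a concrete sketch of that pixels-to-latent step, here's "VAE encode" in a few lines, using the public stabilityai/sd-vae-ft-mse VAE as a stand-in; the random tensor is just a placeholder frame, not real video data.

```python
import torch
from diffusers import AutoencoderKL

# "VAE encode" in one step: pixels in, latent out. A sampler then denoises
# that latent frame by frame, and that per-frame re-imagining is exactly
# what destroys temporal consistency in the naive version.
vae = AutoencoderKL.from_pretrained(
    "stabilityai/sd-vae-ft-mse", torch_dtype=torch.float16
).to("cuda")
pixels = torch.rand(1, 3, 576, 1024, dtype=torch.float16, device="cuda") * 2 - 1
with torch.no_grad():
    latent = vae.encode(pixels).latent_dist.sample() * vae.config.scaling_factor
print(latent.shape)  # torch.Size([1, 4, 72, 128]): 8x smaller spatially
```

The IPAdapter route skips this entirely: the image is encoded into conditioning embeds rather than a latent that gets resampled.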
If that's still abstract, let me show you one I made earlier. So here's one I made earlier, and just to be clear: do not do this, okay? This is the bad way. It's the naive way I had, because I was young, I was inexperienced. It works exactly the same way, but the difference is that I had put in this SamplerCustom, because I was trying to be clever and use an LCM scheduler to alternate between two different samplers so that it would look better, because, hey, in AnimateDiff LCM that's how it works, right, if you guys are used to it. So we've got the usual beta schedule set to LCM, everything's happy, and we're switching between samplers here, between Euler and LCM, because, again, the output is better. But then the problem is that I need to VAE decode, because I need pixels; I need pixels for Ultimate SD Upscale. And in Ultimate SD Upscale, because I wanted to get the best quality, you see, I thought I'd be smart and pass it a model using, say, DPM++ 2M SDE GPU Karras, because, hey, quality, right? But that doesn't work, guys, because, as you can imagine, what happens is this becomes pixels, and then Ultimate SD Upscale crunches through it and generates a different image. So not only do we lose temporal consistency, but in addition it takes forever, because yes, it's using LCM, sure, but it's still going back to pixels.

So what's the right approach? Well, we have to go back to the real workflow, the one I give you, this one. What does it do differently? Well, no sampler. No sampler; instead of a VAE encode, what we do is encode embeds for the IPAdapter, which we then pass, boom, straight into Ultimate SD Upscale. No sampler; well, technically speaking, Ultimate SD Upscale is the sampler, but you get my drift. And it's using LCM with sgm_uniform. So what does that mean? It means fast. How fast? Just about 30 times faster. So why iterate, right, why do all this? Well, I have a video about this: it's a numbers game, guys, so you want to iterate, you want to do as many images as possible. And then, of course, you have the usual Video Combine, and maybe a little something different here: I use FILM VFI. Again, that was recommended by the author of the Zeroscope nodes; RIFE 4.9 works just as well, in frankness, it depends on your video.

So let's recap. We have AnimateDiff being loaded; again, be careful, because AnimateDiff needs its motion model loaded. We're going to use the SD 1.5 t2v checkpoint because we have an SD 1.5 model. I haven't tried this with SDXL; if you want to try it, knock yourself out, but it might not work, because obviously there's the story about the ControlNets. And of course we have the AnimateLCM LoRA, because without that we can't use LCM; this is your standard LCM workflow, really. The important part is that instead of going through a sampler, we're not. I think you get it by now.

So what else could I tell you about this? Well, it's a 2x upscale, so you could risk it at 4x if you have the VRAM; more importantly, I think it's a question of whether you have the VRAM and the processing power to go with it. The seed you can leave fixed; technically that's not going to change much. The steps, obviously, six: it's LCM. It would be fun if you guys want to try lightning, right? Try a lightning model, see if it works; there are a lot of things we could try. (Hey, it's Stephan from the future here. So yes, someone had the same idea just at the same time as I did, so that's cool. I'll go and implement this in the next version of this workflow; you can find it on my Discord. Thank you.) We do need to fix the tiling, and that's why I'm using the half-tile seam fix. Again, this is not an upscaler tutorial, but it works exactly how you would expect the upscaler to work: you need to fix the seams, because this is tiled, and if you don't, you're going to have lines in your image, and we don't want some ugly lines; we want a beautiful picture.
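FILM VFI and RIFE mentioned above are ComfyUI nodes. If you just want to preview what doubling the frame rate feels like outside ComfyUI, ffmpeg's classical minterpolate filter is a rough stand-in; it is not FILM or RIFE, and learned interpolators handle complex motion much better. This sketch assumes ffmpeg is on your PATH, and the file names are hypothetical.

```python
import subprocess

# Motion-compensated interpolation from 24 to 48 fps with ffmpeg.
# A classical stand-in for FILM/RIFE; expect artifacts on fast or
# occluded motion. CRF 4 keeps the re-encode near-lossless.
subprocess.run(
    [
        "ffmpeg", "-y",
        "-i", "zeroscope_upscaled.mp4",
        "-vf", "minterpolate=fps=48:mi_mode=mci",
        "-c:v", "libx264", "-crf", "4",
        "zeroscope_upscaled_48fps.mp4",
    ],
    check=True,
)
```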
The other thing I need to touch on is the ControlNet. We apply a ControlNet, but it's not just any old ControlNet: it's ControlGIF for Zeroscope. I've included the link in the description, like everything else, and I've also included the recommended values. Don't lower the strength too much, because then the image loses consistency; although, look, sometimes that's great for some special effects, and I've done some cool stuff with that, but it's up to you how you use it and what value you put in.

Another tip I can give you: you need to test your models. What I do for this is I have a workflow which does just that. It's super simple: it's got an SDXL prompt, it's got an SD 1.5 prompt, and then what it does is simply run the image through a KSampler. I can modify the parameters, and it tests three upscalers. Now, why do I do this? Because on some images you see those ugly lines; yeah, it's not that great, but another upscaler might do a better job. This one looks a little bit organic but it's got weaknesses, maybe the background's not as good. We could go into masking and all this, but it becomes complete overkill at this point. In any case, it's a good way to test your models, and I would highly recommend that before you choose any settings in the real workflow, you use a little workflow like this one to test your CFG, your steps, your sampler, your scheduler, in order to make sure that the output from the prompt corresponds to what you expect, because whatever that output is will be what the upscaler uses to improve your video. So you want to get it as close as possible to the results you expect.

As usual, join us on Discord if you want to share what you've created with this workflow, or how you've improved it, specifically how you've improved it, because I know it can be done, especially with this new upscaling technique for videos. Please consider leaving this video a like if you found it useful or if you learned something; it really helps with the algorithm. Talking about the algorithm, there are some great videos on your screen you should go check out, because the algorithm is never wrong. I'll see you on Discord, guys. Have a good one, take care, bye-bye.
Info
Channel: Stephan Tual
Views: 5,821
Keywords: AI, comfyUI, svd, sdxl, sd15, ipadapter, controlnet, animatediff, loras, models, checkpoints, tutorial, stable diffusion, sora, open ai, ArtificialIntelligence, MachineLearning, #technology, zeroscope, ai videos, SORA, sora opensource, ai generated videos, ai art, animated diff LCM, samplers, schedulers
Id: 7Q-KAV0MY3A
Length: 22min 13sec (1333 seconds)
Published: Thu Mar 07 2024