Stable Video AI Watched 600,000,000 Videos!

Video Statistics and Information

Captions
Finally, it is here. From today, we can all become film directors! Yes! Text to video and image to video that is open source and free for all of us. And it can even make images of memes come alive. This is Stable Video, which has studied 600 million videos. And we have this too, and this too. Oh my, three amazing papers. Yummy! So what is going on here? Well, simple: you just write a piece of text, and Stable Video can generate a video for you in about 2-3 minutes. It compares favorably against the competition at this moment. I say "at this moment" because this result was recorded at a particular point in time, and these systems improve so rapidly that, for instance, Runway may be way better by the time you see this comparison. But it doesn't end there. Not even close. There is another amazing text to video AI that you can kind of try right now. And there is even more. In fact, there is so much going on, I don't even know where to start.

Dear Fellow Scholars, this is Two Minute Papers with Dr. Károly Zsolnai-Fehér.

So first, Stable Video. This was trained on about 600 million videos and can now generate new ones for you. It is free and open source; however, you still need some computational resources to run it. I'll put potential places that can run it for you in the video description. If you find some other place where Fellow Scholars can run it for free, please leave a comment about it. Thank you!

It takes approximately two to three minutes to create a video. And there is a lot to like here. Finally, an open source solution. This means that you will soon be able to run this on the phone in your pocket as freely as you wish. Glorious.

However, it is not perfect. Not even close. First, sometimes you get no real animation, just a camera panning around. Second, you probably already inferred this from these results, but it cannot really generate longer videos, and its generated videos also typically don't show much motion. Third, you know the deal, don't expect good text outputs from it. Not yet anyway. And fourth, it is a bit of a chonker. What does that mean? Well, you need a lot of video memory to run this. I am hearing 40 gigabytes, although there is already a guide to get it down to under 20, or maybe even 10 gigabytes. The link is in the description. From the nature of these limitations, my guess is that the memory requirements will be cut down substantially very soon.
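For those who want a rough idea of what running Stable Video locally can look like, here is a minimal sketch. It assumes the Hugging Face diffusers library's StableVideoDiffusionPipeline and the publicly released stable-video-diffusion-img2vid-xt image-to-video checkpoint; neither is named in the video, so treat the exact model ID, parameters, and file names as assumptions. Half precision and model CPU offload are the usual knobs for pushing the memory requirement well below the 40 gigabytes mentioned above.

```python
import torch
from diffusers import StableVideoDiffusionPipeline
from diffusers.utils import load_image, export_to_video

# Load the released image-to-video checkpoint in half precision (assumed model ID).
pipe = StableVideoDiffusionPipeline.from_pretrained(
    "stabilityai/stable-video-diffusion-img2vid-xt",
    torch_dtype=torch.float16,
    variant="fp16",
)
pipe.enable_model_cpu_offload()  # keeps only the active sub-model on the GPU to reduce VRAM use

# Start from any still image, e.g. one made with a text-to-image model (hypothetical file name).
image = load_image("starting_frame.png").resize((1024, 576))

# A smaller decode_chunk_size trades speed for a lower memory peak while decoding frames.
frames = pipe(image, decode_chunk_size=4).frames[0]
export_to_video(frames, "generated.mp4", fps=7)
```

Settings like these are the kind of thing the memory-reduction guides mentioned in the video rely on to fit generation into under 20, or maybe even 10, gigabytes of video memory.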
However, there are more tools coming up in the meantime. Here is Emu Video. This is incredible. Look. It is so good at generating natural phenomena, and it even has a hint of creativity. Wow. Fantastic results.

And the paper showcases this, which is a sight to behold. Goodness. Are you seeing what I am seeing? This is a user study where humans look at the results, and whatever other technique you see it compared against, it often has a win rate in the 80% region. Even against Imagen Video: here is what Imagen Video looked like, and that is definitely one of the best ones out there. And now, this new one is still better. Wow. But it gets better. Just creating high-quality results is not enough. Just consider a technique that always gives you a high-quality video, perhaps always the same one, and ignores your prompts. That is a high-quality result; however, faithfulness to the prompts also needs to be measured. And on that, this new technique has no equal. Nothing is even close. Wow.

And, fantastic news, you can kind of try this technique out for free on a website right now. The link is, of course, in the description waiting for you, Fellow Scholars. You can assemble these text prompts and see immediately what the system will do with them. I love the creativity here; every solution is at least pretty good in my opinion, and some of the solutions are just excellent. For instance, this one. You can also look at some images here and perform image to video. These really came to life here, so good. Or, search for a gallery of text to video results. I loved the robots here, but when I looked for scholarly content… nothing. We need more scholarly content. Well, maybe next time.

And this is a great paper, so it contains a user study that is much more detailed than this: they also look at sharpness, smoothness, amount of motion (yes, you remember from the Stable Video project that this is super important), and object consistency as well.

Now, not even this one is perfect. The resolution of these videos is 512x512. Not huge. But this is almost guaranteed to be improved just one more paper down the line. Also, this is not open source, not at the moment anyway.

Now, why does it matter whether something is free and open source, like the previous Stable Video? Well, have a look at this. I love this image. So why is this interesting? Well, have a look and you see here that the best performing large language models are all proprietary. These are closed models. However, there are other language models that are nearly as good, just a step or two behind, and these are free and open source. So this means that intelligence is not in the hands of just one company; a nearly-as-good intelligence is something you can run yourself on your laptop, and soon on your smartphone too. Just imagine: if the best model out there is unwilling to help you or starts hallucinating, you would have no other choice. But with open source models, this will never happen. There is always going to be a kind little robot helping you. And this is the importance of open source models.

And if you think that we are done, well, Fellow Scholars, hold on to your papers for the third amazing paper for today, Emu Edit. This helps us edit images iteratively. The iterative part is key here. This means that we can start from an image that we like, something that we got from a text to image AI perhaps, and then, since it is rarely the case that everything comes out exactly how we envisioned it, from now on that is not a problem at all. We just add subsequent instructions, and much of the image will remain; only the parts that we wish to change will be replaced. A small code sketch of this instruction-based editing workflow follows after these captions.

So if you need the same emu and the same background, but make it a fireman, there we go, and… oh my. Look! Finally, scholarly content. Good!

And when compared to the competition, this one is also far ahead of them. Goodness, look at that. InstructPix2Pix is from just a year ago, and MagicBrush is from less than 6 months ago, and both of them are outperformed significantly here. There are other cases which are a bit closer, but I still prefer the new one here.

So, three amazing use cases, three amazing papers. I hope that you share my feeling that this is an incredible time to be alive. Research breakthroughs are happening every week. What a time to be alive! Subscribe and hit the bell icon if you wish to see more.
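Emu Edit itself is not publicly available, but the iterative, instruction-based workflow described above can be illustrated with the open InstructPix2Pix model that the paper compares against. The sketch below is not Emu Edit; it assumes the Hugging Face diffusers StableDiffusionInstructPix2PixPipeline and the timbrooks/instruct-pix2pix checkpoint, and the file names and prompts are made up for illustration.

```python
import torch
from diffusers import StableDiffusionInstructPix2PixPipeline
from diffusers.utils import load_image

# Load the open InstructPix2Pix checkpoint (a stand-in for the closed Emu Edit model).
pipe = StableDiffusionInstructPix2PixPipeline.from_pretrained(
    "timbrooks/instruct-pix2pix", torch_dtype=torch.float16
).to("cuda")

# Start from an image we like, e.g. one from a text-to-image model (hypothetical file name).
image = load_image("emu.png")

# First instruction: image_guidance_scale controls how much of the input image is preserved.
step1 = pipe("turn the emu into a fireman", image=image,
             num_inference_steps=20, image_guidance_scale=1.5).images[0]

# Second instruction is applied to the previous result, which is the iterative part.
step2 = pipe("add a fire truck in the background", image=step1,
             num_inference_steps=20, image_guidance_scale=1.5).images[0]
step2.save("edited_emu.png")
```

Emu Edit's whole point is that it follows such chained instructions far more faithfully than InstructPix2Pix or MagicBrush, as the comparison in the video shows.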
Info
Channel: Two Minute Papers
Views: 149,961
Keywords: ai, stable video, emu video, emu edit, stable diffusion video, stable video diffusion, text to video, runway
Id: XwDaQKOxgFY
Length: 9min 50sec (590 seconds)
Published: Sun Dec 03 2023