Finally, it is here. From today, we can all become
film directors! Yes! Text to video and image to video that are open source and free for all of us.
And it can even make images of memes come alive. This is Stable Video, which has studied
600 million videos. And we have this too, and this too. Oh my, three amazing papers. Yummy!
So what is going on here? Well, simple - you just write a piece of text, and Stable Video can
generate a video for you in about 2-3 minutes. It compares favorably against the competition
at this moment. I say “at this moment”, because this result was recorded at a particular point
in time, and these systems improve so rapidly, that for instance, Runway may be way better by the
time you see this comparison. But, it doesn’t end there. Not even close. There is another amazing
text to video AI that you can kind of try right now. And there is even more. In fact, there is so
much going on, I don’t even know where to start. Dear Fellow Scholars, this is Two Minute
Papers with Dr. Károly Zsolnai-Fehér. So first, Stable Video. This was trained on about
600 million videos, and now it can generate new ones for you. It is free and open source; however, you
still need some computational resources to run it. I’ll put potential places that can run it for you
in the video description. If you have found some other place where other Fellow Scholars can run it for
free, please leave a comment about it. Thank you! It takes approximately two to three
minutes to create a video. And there is a lot to like here. Finally, an open source
solution. This means that you will soon be able to run this on the phone in your
pocket as freely as you wish. Glorious. However, it is not perfect. Not even
close. Sometimes you get no real animation, but instead a camera panning around. Also, as you
probably already inferred from these results, it cannot really generate longer
videos. And that’s not all: its generated videos also typically don’t showcase
much motion. Third, you know the deal, don’t expect good text outputs from it. Not yet
anyway. And fourth, it is a bit of a chonker. What does that mean? Well, you need a lot of video
memory to run this. I am hearing 40 gigabytes, although there is already a guide to
get it down to under 20, or maybe even 10 gigabytes. The link is in the description.
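If you would like to try running it yourself, here is a minimal sketch of what that can look like. Note that the details are my assumption, not something from the paper: it uses the Hugging Face diffusers library with the released image-to-video checkpoint, and the memory-saving calls shown here, offloading idle parts to the CPU and decoding the frames in smaller chunks, are the kind of tricks those guides rely on to bring the requirement down.

```python
# A minimal sketch (assumption on my part): running Stable Video Diffusion's
# released image-to-video checkpoint through the Hugging Face diffusers library.
import torch
from diffusers import StableVideoDiffusionPipeline
from diffusers.utils import load_image, export_to_video

pipe = StableVideoDiffusionPipeline.from_pretrained(
    "stabilityai/stable-video-diffusion-img2vid-xt",
    torch_dtype=torch.float16,
    variant="fp16",
)

# Memory savers: keep idle submodules on the CPU and chunk the U-Net's
# feed-forward pass, so the model needs far less than 40 GB of video memory.
pipe.enable_model_cpu_offload()
pipe.unet.enable_forward_chunking()

# Start from a single conditioning image (the model animates it).
# "input.png" is a placeholder filename for your own image.
image = load_image("input.png").resize((1024, 576))

# Decoding fewer frames at a time trades speed for memory as well.
frames = pipe(image, decode_chunk_size=2).frames[0]
export_to_video(frames, "generated.mp4", fps=7)
```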
Seeing the nature of these limitations, my guess is that the memory requirements
will be cut down substantially very soon. However, there are more tools coming up
in the meantime. Here is Emu Video. This is incredible. Look. It is so good
at generating natural phenomena, and it even has a hint of
creativity. Wow. Fantastic results. And the paper showcases this, which is
a sight to behold. Goodness. Are you seeing what I am seeing? This is a user
study where humans look at the results, and whatever other technique you see this
compared against, it often has a win rate in the 80% region. Even against Imagen Video.
Here is what Imagen Video looked like, and that is definitely one of the best ones out
there. And now, this new one is still better. Wow. But it gets better. Just creating high-quality
results is not enough. Just consider a technique that always gives you a high-quality video,
perhaps always the same one, and ignores your prompts. That is a high-quality result; however,
faithfulness to the prompt also needs to be measured. And on that, this new technique
has no equal. Nothing is even close. Wow. And, fantastic news, you can kind of try
this technique out for free on a website right now. The link is, obviously, in the
description, waiting for you, Fellow Scholars. You can assemble these text prompts and see
immediately what the system will do with them. I love the creativity here; every solution
is at least pretty good in my opinion, and some of the solutions are just excellent.
For instance, this one. You can also look at some images here and perform image to video.
These really came to life here, so good. Or, search for a gallery of text to video results.
I loved the robots here, but when I looked for scholarly content… nothing. We need more
scholarly content. Well, maybe next time. And this is a great paper, so it contains a user
study that is so much more detailed than this: they also look at sharpness, smoothness,
amount of motion (yes, you remember from the Stable Video project that this is super
important), and object consistency as well. Now, not even this one is perfect. The
resolution of these videos is 512x512. Not huge. But this is almost guaranteed to be
improved just one more paper down the line. Also, this is not open source, not at the moment anyway. Now, why does it matter whether a model is free and open
source, like the previous Stable Video? Well, have a look at this. I love this image. So why is this
interesting? You see here that the best performing large language models are
all proprietary. These are closed models. However, there are other language models that are
nearly as good, just a step or two behind, and these are free and open source. So
this means that intelligence is not in the hands of just one company; a nearly as good
intelligence is something you can run yourself on your laptop, and soon on your smartphone too. Just imagine
if the best model out there were unwilling to help you or started hallucinating; in that
case, you would have no other choice. But with open source models, this will never
happen. There is always going to be a kind little robot helping you. And this is
the importance of open source models. And if you think that we are done, well, Fellow
Scholars, hold on to your papers for the third amazing paper for today, Emu Edit. This helps
us edit images iteratively. The iterative part is key here. This means that we can start from an
image that we like, perhaps something that we got from a text to image AI. It is rarely
the case that everything comes out exactly how we envisioned it, but from now on, that is not a problem
at all. We just add subsequent instructions, and much of the image will remain; only the
parts that we wish to change will be replaced. So if you need the same emu and the same background, but you want to make it a fireman, there we go, and… oh
my. Look! Finally, scholarly content. Good! And when compared to the competition,
this one is also so far ahead of them. Goodness, look at that. InstructPix2Pix
is from just a year ago, and MagicBrush is from less than 6 months ago, and both
of them are outperformed significantly here. There are other cases which are a bit
closer, but I still prefer the new one here. So three amazing use cases, three amazing
papers. I hope that you share my feeling that this is an incredible time to be alive.
Research breakthroughs are happening every week. What a time to be alive! Subscribe and
hit the bell icon if you wish to see more.