Here, we always talk about these amazing results
of recent AI techniques like this. This is Sora, but it is currently unreleased. That means we can
marvel at the results, but we cannot try them yet. However, oh my. The first results of
Stable Diffusion 3 are now available for us to look at. What is that? Stable
Diffusion is a free and open source/open model text to image AI that we can
all use for free. And interestingly, I also hear that version 3 builds on
Sora’s architecture. I’d love to see that. Previously, we talked about a version
called Stable Diffusion XL Turbo, and it was extremely fast. So fast that we don’t even measure it in
frames per second. Frames per second? No, sir! Cats per second is where it's at. And this
could generate a hundred cats per second. That is fantastic. However, the quality of the cats
was not as good as what I saw in other systems, like DALL-E 3. So, can we finally
get a free and open system that creates super high quality images?
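Before that, a small aside: SDXL Turbo is already publicly available, so if you would like to run your own cats-per-second experiment at home, a minimal sketch with the Hugging Face diffusers library might look something like this. The model id is the published one, but the prompt and the single-step settings are just my assumptions for a quick local test:

import torch
from diffusers import AutoPipelineForText2Image

# Load the publicly released SDXL Turbo checkpoint in half precision.
pipe = AutoPipelineForText2Image.from_pretrained(
    "stabilityai/sdxl-turbo", torch_dtype=torch.float16, variant="fp16"
)
pipe = pipe.to("cuda")

# Turbo models are distilled for very few sampling steps, which is where the speed comes from.
image = pipe(
    prompt="a photo of a fluffy cat sitting on a windowsill",
    num_inference_steps=1,   # single-step sampling
    guidance_scale=0.0,      # Turbo is meant to run without classifier-free guidance
).images[0]
image.save("cat.png")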
Well, let’s have a look together! Dear Fellow Scholars, this is Two Minute
Papers with Dr. Károly Zsolnai-Fehér. Well, first, the quality and the
amount of detail in these images are absolutely incredible. But it gets better in three different ways. One, text. Remember the times when we told DALL-E
that we need a sign that says Deep Learning, and we got this? Well, those days are still
not over. We are still not out of the woods. Current systems do much better on text, but they can only handle short, rudimentary prompts, and we often have to run them 10 or more times to get something meaningful. This is DALL-E version 3 trying
the same, and we are still not there. But here, would you look at that. We get some text
on the chalkboard, or look at this. This is not just text slapped on top of an image; it is an integral part of the image itself. It also knows styles quite well: this could easily be a desktop background for many, and graffiti styles are also appearing. Now, not all the text in this image seems
perfect, and we don’t know how much cherry-picking was necessary to get these, but we will soon be
able to try it ourselves, and then, we will know. Two, understanding prompt structure. This is
going to be really tough. The prompt is “Three transparent glass bottles on a wooden table.
The one on the left has red liquid and the number 1. The one in the middle has blue liquid
and the number 2. The one on the right has green liquid and the number 3.”
And…there we go! Now, wait, I also ran this 10 times in
DALL-E 3 and I was really strict with it, and still it did extremely well. It was able to do
it 8 times out of 10. These were the good cases, and these were the failure cases.
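If you would like to repeat this little stress test yourself, a rough sketch with the official OpenAI Python client might look like this; the model name and image size follow the public API, but the loop and settings are just my own choices:

from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

# The exact prompt from above.
prompt = (
    "Three transparent glass bottles on a wooden table. "
    "The one on the left has red liquid and the number 1. "
    "The one in the middle has blue liquid and the number 2. "
    "The one on the right has green liquid and the number 3."
)

# Run the same prompt ten times and collect the image URLs for inspection.
for attempt in range(10):
    result = client.images.generate(
        model="dall-e-3", prompt=prompt, n=1, size="1024x1024"
    )
    print(f"attempt {attempt + 1}: {result.data[0].url}")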
Even those failure cases are not so bad: it just switched up the colors or added some extra text. So why is this interesting? Well, Stable Diffusion 3 can also do this, but it
is an open system that is free for all of us. And three, creativity. I love how it is
able to imagine new scenes that we’ve likely never seen before. It can
use its knowledge about existing things and extend that knowledge
into new situations. Loving it. If everything goes well, the paper will appear
in the next few days and I am also hoping to get access to the models soon. You know, images of
Fellow Scholars holding on to their papers need to be done. Subscribe and hit the bell icon if you
are interested in a deeper look when it arrives. However, we know some details. For instance,
the earlier Stable Diffusion 1.5 has about 1 billion parameters, and SDXL has about 3.5 billion. This new one ranges from 0.8 billion to 8 billion. So even the heavier version of this will still likely generate images in a matter of seconds, and the lighter version will, I think, easily run on the phone in your pocket. And to have this capability right in your pocket, my goodness. What a time to be alive!
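Why do I think the lighter version fits in your pocket? Here is the back-of-the-envelope arithmetic, assuming 16-bit weights; the real footprint also depends on the text encoders, activations and any quantization, so treat it only as a rough bound:

def weight_memory_gb(num_params, bytes_per_param=2):
    # Memory needed just to hold the weights, in gigabytes, at 2 bytes per parameter (fp16).
    return num_params * bytes_per_param / 1e9

print(weight_memory_gb(0.8e9))  # smallest variant: ~1.6 GB, plausibly phone territory
print(weight_memory_gb(8e9))    # largest variant:  ~16 GB, more of a workstation model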
And in the meantime, you can do a heck of a lot more with already existing tools. For instance, the Stability API can now help you with a great deal more than just text to image: you can get it to reimagine parts of a scene as well. And don't forget, StableLM also exists. That's free too. If everything goes well, we will soon talk about how you can run these free large language models privately at home. And we will talk about more amazing models: DeepMind's Gemini 1.5 Pro and, get this, a smaller, free version of it called Gemma that you can run at home for free. That video is coming soon too.
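And if you are curious what running one of these free language models privately at home can look like today, here is a minimal sketch using the Hugging Face transformers library. The checkpoint name is just one publicly released StableLM model that I am assuming here, and it may need a recent version of the library; a Gemma checkpoint can be swapped in the same way once you have access to it:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed open checkpoint; replace with another open model id if you prefer.
model_id = "stabilityai/stablelm-zephyr-3b"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

# Generate a short completion entirely on your own machine.
prompt = "Explain, in two sentences, what a diffusion model does."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))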