The idea that we have good control over text-to-image models has probably crossed our minds once or twice, given how well we can generate images now.
And ever since Stability AI released Stable Diffusion 2.1, we were like, yay, depth-to-image is going to give us one more way to control image generations other than image-to-image and text-to-image.
Yes, that was pretty amazing, but have you ever thought about accurate human pose to image, precise normal map to image, coherent semantic map to image, or even line art to image?
Maybe something that can generalize the idea of whatever-to-image; that would be game changing.
Let me introduce you to ControlNet, a neural network structure that controls large diffusion models in a way that supports additional input conditions much better than any existing method.
This may sound like every other scribble-to-image or semantic-to-image model, but it is actually something much more generalizable, and it is definitely going to improve people's workflow by a lot.
It comes from the same author as Style2Paints, a five-year-old project that Lvmin Zhang developed to help artists colorize line art with AI. He explains that ControlNet copies the weights of the neural network blocks into a locked copy and a trainable copy.
And while the trainable one learns your
condition, the locked one preserves your model.
With this, training on a small dataset of image pairs will not destroy the production-ready diffusion model, and it can handle basically any input condition you train it on while generating images at the quality of the original model.
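If it helps to picture the locked copy and trainable copy in code, here is a rough PyTorch-style sketch; the class, names, and shapes are illustrative placeholders based on the description above, not Lvmin's actual implementation.

```python
# Minimal sketch of the locked-copy / trainable-copy idea with zero convolutions.
# Illustrative only -- names and shapes are placeholders, not the official ControlNet code.
import copy
import torch
import torch.nn as nn

def zero_conv(channels: int) -> nn.Conv2d:
    """1x1 convolution initialized to zero, so the trainable branch
    contributes nothing at the very start of training."""
    conv = nn.Conv2d(channels, channels, kernel_size=1)
    nn.init.zeros_(conv.weight)
    nn.init.zeros_(conv.bias)
    return conv

class ControlledBlock(nn.Module):
    def __init__(self, pretrained_block: nn.Module, channels: int):
        super().__init__()
        # Trainable copy: starts as an exact clone and learns the new condition.
        self.trainable = copy.deepcopy(pretrained_block)
        # Locked copy: the original pretrained weights, frozen to preserve the model.
        self.locked = pretrained_block
        for p in self.locked.parameters():
            p.requires_grad = False
        self.zero_in = zero_conv(channels)
        self.zero_out = zero_conv(channels)

    def forward(self, x: torch.Tensor, condition: torch.Tensor) -> torch.Tensor:
        # The frozen path keeps the production model's behavior intact,
        # while the trainable path injects the extra condition through
        # zero convolutions, so training starts from "no change at all".
        return self.locked(x) + self.zero_out(self.trainable(x + self.zero_in(condition)))
```

Because the zero convolutions output zeros before training, the whole block behaves exactly like the original model on step one, which is part of why a small dataset is enough to add a new condition without wrecking the base weights.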
With more control, higher quality images can then be generated by the same models.
To make this more understandable, Stable Diffusion's new depth-to-image only takes in a 64x64 depth map, while the 2.1 Stable Diffusion model itself is capable of generating a raw 512 or even a 768 image. But with ControlNet, you can now input a 512x512 depth map, so the diffusion model can follow the depth map more accurately since it's at a higher resolution, and a better generated image can then be made.
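Just to make the resolution point concrete, here is a tiny sketch of prepping a full 512x512 depth map as a condition image; the file path is a placeholder, and the exact preprocessing a given ControlNet pipeline expects may differ.

```python
# Hedged sketch: resize and normalize a depth map to 512x512 so the
# condition matches the generation resolution instead of a tiny 64x64 map.
# "depth.png" is a placeholder path for any single-channel depth image.
import numpy as np
from PIL import Image

depth = Image.open("depth.png").convert("L")        # grayscale depth map
depth = depth.resize((512, 512), Image.BICUBIC)     # match the SD output size

cond = np.asarray(depth, dtype=np.float32) / 255.0  # normalize to [0, 1]
cond = np.stack([cond, cond, cond], axis=-1)        # (512, 512, 3) image-like condition
print(cond.shape, cond.min(), cond.max())
```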
ControlNet was built on the idea that text cannot fully handle all the conditioning problems in image generation, because text and image are ideas that live in completely different dimensions. And with text being the hard carry as the interface for us to communicate with diffusion models, I think you can relate that sometimes your ideas are hard to express efficiently in text too, right?
And if only the AI could understand your pose image a bit better, it could save you so much time.
Let me just show you the results and you would understand.
Just keep in mind that the official demos from ControlNet all run on Stable Diffusion 1.5, so the quality may differ significantly from what Stable Diffusion 2.1 can generate. But that's not because of ControlNet.
Just look at the depth clarity compared to SD 2.0's official result. Even though ControlNet is controlling SD 1.5, the generated images are just a lot clearer, especially the background or the jaw of the old man. And I would say I would not be able to tell the difference if they were unlabeled.
What's even better is that ControlNet cuts the training requirements from 2000 GPU hours with more than 12 million training samples down to a single 3090 Ti for less than a week with only 200k training samples. This can save so much money.
Human pose to image looks so clean too: everything is synthesized around it perfectly, the anatomy makes sense, and the art generated is coherent too, even though Michael Jackson is in mid-air in this one.
The author specified that these are not cherry-picked results, and if you want to verify, you can run the code yourself. It's open-sourced by a college student.
Even with poses where all the limbs are folded or not included, the resulting images do not go haywire at all, and they obey the human pose input faithfully even in different contexts. The arms will be posed correctly, and it just feels so satisfying.
There are even more tests that the author has made just to show you how generalizable ControlNet is, and they are all actually pretty amazing.
Like using HED boundary as an input reference. HED boundary is one of the edge detection methods, and it preserves the edges that are highly contrasted in the input image, making it pretty suitable for recoloring and stylizing.
And there's using M-LSD lines, another edge detection method that does line segment detection and can be used as a reference to generate scenery realistically, with layouts that make sense and details that are coherent.
Or use Canny edge, which will extract very detailed, complex edges for you, so the generated art will have those detailed attributes that normal text-to-image or image-to-image would not be able to achieve or preserve.
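If you have never played with edge maps before, this is roughly what getting a Canny condition image looks like with OpenCV; the file name and thresholds here are just placeholder values to tweak per image.

```python
# Hedged sketch: extract a Canny edge map to use as an edge-style condition image.
import cv2

image = cv2.imread("photo.jpg")                    # placeholder input photo
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

# The two thresholds control how many edges survive; tune them per image.
edges = cv2.Canny(gray, 100, 200)

cv2.imwrite("canny_condition.png", edges)          # white edges on black background
```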
You are probably fed up with seeing the amount of scribble-to-image and semantic segmentation demos on the internet, but those work pretty well too, so I'll just put them here as a quick mention.
But normal map to image is going to be interesting. Imagine using a normal map that you generated from ECON, which is the latest and, I think, the best image-to-mesh AI that I didn't have time to cover, and being able to use that as a reference input.
This could be a very useful tool, similar to depth-to-image. Normal map to image can focus on the subject's coherence instead of the surroundings and the depth, so it can make edits to the subject more directly and maybe even give you more control to edit the background too.
But to be honest, the highlight of this is definitely the line art colorizing method that the author originally proposed for Style2Paints V5.
The reason we have not seen any method like this before is that current image-to-image methods struggle to preserve line art details, so they do not work as a viable colorizing tool for black-and-white artwork, where you have to faithfully follow the outlines.
ControlNet is probably what Style2Paints V5 is based on, which would do exactly that: accurately preserve the details, just like the other edge-detection-to-image inputs do.
However, he has not released the colorization tool yet due to technical issues and ethical concerns, but it will probably come out once he finishes improving the tool and finds ways to tackle the ethical aspects.
Then maybe I'll make a video about it again.
This research is definitely going to change how the Big 5 train and control their large diffusion models, and with its GitHub page getting 300 stars in just under 24 hours without any promotion, it is safe to say that Lvmin's work is going to be worth millions of dollars to these companies.
I'll link his paper down in the description, and to quote one of my Discord members: "I read this paper and it was insane, and Lvmin is too good for Stanford." Which is pretty funny. And join my Discord if you haven't.
This opens up realistic possibilities for artistic usage, architectural rendering, design brainstorming, storyboarding, and so much more. Even black-and-white image colorization may be possible with diffusion now, with extreme accuracy, because you can specify the day and age of the image so that it colors it very precisely.
Not to mention image restoration; that is probably going to be possible with diffusion now too, thanks to ControlNet.
He also made a page for training your own model and use cases with ControlNet, so check it out if you're interested.
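And if you would rather just try an existing control than train one, here is roughly what a Canny-conditioned generation looks like with the Hugging Face diffusers library; the model IDs and arguments are my assumptions based on the publicly available checkpoints, so treat this as a sketch rather than the official workflow.

```python
# Hedged sketch: running an existing Canny ControlNet through diffusers.
# The checkpoint names below are assumptions; swap in whatever you actually use.
import torch
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline
from diffusers.utils import load_image

# Load a Canny-conditioned ControlNet and attach it to SD 1.5.
controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16
).to("cuda")

# "canny_condition.png" is the edge map from the earlier sketch.
condition = load_image("canny_condition.png")
image = pipe("an old man standing in a garden", image=condition).images[0]
image.save("controlled_output.png")
```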
Or check out today's sponsor OpenCV if you are also interested in generating AI art.
Yes, you heard it right.
The computer vision org OpenCV decided to sponsor this video to promote their first-ever in-depth AI art course, which will cover the basic and advanced topics related to generating AI art.
Not gonna lie, it took me by surprise too, but OpenCV has a really good track record of coding courses that range from a few hours to a few months and teach you how to master computer vision, PyTorch, TensorFlow, and even an advanced course in real-world CV applications.
If you haven't seen them, even the free ones are pretty well taught, especially in how they cover pretty much everything OpenCV has to offer.
So if they have an AI art course, I
think it'll be pretty high quality too.
Right now, they are launching a Kickstarter on February 14th to fund their AI art course so that they can spend time developing the best AI course they can. Previously, they were able to raise a total of $3 million for various courses and projects, and this AI art course is the next one they are planning to venture into.
The pricing of the course will be relatively lower than their other OpenCV courses, as it is a course that can be completed in a few weekends.
To also celebrate their Kickstarter launch, they are hosting an AI art generation contest
with the prize of one iPad Air.
So definitely join the contest if you are
interested in getting a free iPad and check
out their Kickstarter page for more
information about their AI course.
Thank you so much for watching. As usual, a big shout-out to Andrew Leschevias, Chris Ladoo, and many others who support me through Patreon or YouTube.
Follow my Twitter if you haven't and
I'll see you all in the next one.