ControlNet Revolutionized How We Use AI To Generate Images

Captions
The idea that we have good control over text-to-image models has probably crossed our minds once or twice, given how well we can generate images now. And ever since Stability AI released Stable Diffusion 2.1, we were like, yay, depth-to-image is going to give us one more way to control image generation besides image-to-image and text-to-image. Yes, that was pretty amazing, but have you ever thought about accurate human pose to image, precise normal map to image, coherent semantic map to image, or even line art to image? Maybe something that can generalize the idea of anything-to-image; that would be game changing.

Let me introduce you to ControlNet, a neural network structure that controls large diffusion models in a way that supports additional input conditions much better than any existing method. This may sound like every other scribble-to-image or semantic-to-image model, but it is actually something much more generalizable, and it is definitely going to improve people's workflows by a lot.

It comes from the same author as Style2Paints, a five-year-old project that Lvmin Zhang developed to help artists colorize line art with AI. He explains that ControlNet copies the weights of the neural network blocks into a locked copy and a trainable copy. While the trainable copy learns your condition, the locked copy preserves your model. With this, training on a small dataset of image pairs will not destroy the production-ready diffusion model, and it can handle basically any input condition you train it on while generating images with the quality of the original model. With more control, higher-quality images can then be generated by the same models.

To make this more concrete, Stable Diffusion's new depth-to-image model only takes in a 64x64 depth map, while the 2.1 model itself is capable of generating a raw 512 or even 768 image. But with ControlNet, you can now input a 512x512 depth map, so the diffusion model can follow the depth map more accurately since it is higher resolution, and a better image can be generated.

ControlNet was built on the idea that text cannot fully handle every conditioning problem in image generation, because text and images are ideas that live in completely different dimensions, and with text doing the heavy lifting as our interface to diffusion models, I think you can relate that sometimes your ideas are hard to express efficiently in text too, right? And if only the AI could understand your pose image a bit better, it could save you so much time.

Let me just show you the results and you will understand. Just keep in mind that the official demos from ControlNet are all on Stable Diffusion 1.5, so the quality may differ significantly from what Stable Diffusion 2.1 can generate. But that is not because of ControlNet. Just look at the depth clarity compared to SD 2.0's official result. Even though ControlNet is controlling SD 1.5, the generated images are just a lot clearer, especially the background and the jaw of the old man, and I would not be able to tell the difference if they were unlabeled.

What's even better is that ControlNet reduces the training cost from 2,000 GPU hours with more than 12 million images down to a single RTX 3090 Ti for less than a week with only 200k training images. This can save so much money.
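To make the locked-copy/trainable-copy idea a bit more concrete, here is a minimal, illustrative sketch in PyTorch of how such a structure could be wired up. The class and function names (ControlledBlock, zero_conv) and the toy block are my own placeholders, not the actual ControlNet code; the key ideas are the frozen original weights, the trainable clone, and the zero-initialized convolutions that keep the locked model's behavior intact at the start of training.

    import copy
    import torch
    import torch.nn as nn

    def zero_conv(channels):
        # 1x1 convolution initialized to zero, so the control branch
        # contributes nothing at the start of training and the locked
        # model's behavior is preserved.
        conv = nn.Conv2d(channels, channels, kernel_size=1)
        nn.init.zeros_(conv.weight)
        nn.init.zeros_(conv.bias)
        return conv

    class ControlledBlock(nn.Module):
        """Illustrative ControlNet-style wrapper (names are hypothetical)."""
        def __init__(self, pretrained_block: nn.Module, channels: int):
            super().__init__()
            # Locked copy: the original pretrained weights, frozen.
            self.locked = pretrained_block
            for p in self.locked.parameters():
                p.requires_grad = False
            # Trainable copy: a clone that learns the new condition.
            self.trainable = copy.deepcopy(pretrained_block)
            # Zero convolutions on the condition input and the control output.
            self.zero_in = zero_conv(channels)
            self.zero_out = zero_conv(channels)

        def forward(self, x, condition):
            # Locked path runs untouched; trainable path sees x plus the condition.
            locked_out = self.locked(x)
            ctrl_out = self.trainable(x + self.zero_in(condition))
            return locked_out + self.zero_out(ctrl_out)

    # Toy usage: a single conv block standing in for a diffusion encoder block.
    block = nn.Sequential(nn.Conv2d(4, 4, 3, padding=1), nn.SiLU())
    controlled = ControlledBlock(block, channels=4)
    x = torch.randn(1, 4, 64, 64)      # latent feature map
    cond = torch.randn(1, 4, 64, 64)   # encoded depth/pose/edge condition
    out = controlled(x, cond)          # identical to block(x) at initialization

Because the zero convolutions output nothing at initialization, the wrapped block behaves exactly like the original pretrained block until training starts shaping the control branch, which is why small paired datasets do not wreck the production model.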
Human pose to image looks so clean too; everything is synthesized around it perfectly, the anatomy makes sense, and the art generated is coherent, even though Michael Jackson is in mid-air in this one. The author specified that these are not cherry-picked results, and if you want to verify that, you can run the code yourself; it is open-sourced by a college student. Even with poses where all the limbs are folded or not included, the resulting images do not fall apart at all, and they obey the human pose input faithfully even in different contexts. The arms will be posed correctly, and it just feels so satisfying.

There are even more tests the author made just to show how generalizable ControlNet is, and they are all actually pretty amazing. Like using an HED boundary as the input reference: HED is one of the edge detection methods and preserves the edges that are highly contrasted in the input image, making it pretty suitable for recoloring and stylizing. There is also MLSD lines, another edge detection method that does line segment detection and can be used as a reference to generate scenery realistically, with layouts that make sense and details that are coherent. Or use Canny edges, which extract very detailed, complex edges for you (a quick preprocessing sketch follows below), so the generated art keeps those detailed attributes that normal text-to-image or image-to-image would not be able to achieve or preserve.

You are probably fed up with the amount of scribble-to-image and semantic segmentation demos on the internet, but those work pretty well too, so I'll just put them here as a quick mention. Normal map to image, though, is going to be interesting. Imagine using a normal map that you generated from ECON, which is the latest and, I think, the best image-to-mesh AI that I didn't have time to cover, and being able to use that as a reference input. This could be a very useful tool, similar to depth-to-image. Normal map to image can focus on the subject's coherency instead of the surroundings and the depth, so it can make edits to the subject more directly and maybe even give more control for editing the background too.

But to be honest, the highlight of this is definitely the line art colorizing method that the author originally proposed for Style2Paints V5. The reason we have not seen any method like this before is that current image-to-image methods struggle to preserve line art details and would not work as a viable colorizing tool for black-and-white artwork, where you have to follow the outlines faithfully. ControlNet is probably what Style2Paints V5 is based on, and it would do exactly that: accurately preserve the details, just like the other edge-detection-to-image inputs do. However, he has not released the colorization tool yet due to technical issues and ethical concerns, but it will probably be released once he finishes improving the tools and has ways to tackle the ethical aspects. Then maybe I'll make a video about it again.

This research is definitely going to change how the big five train and control their large diffusion models, and with its GitHub page getting 300 stars in just under 24 hours without any promotion, it is safe to say that Lvmin's work is going to be worth millions of dollars to these companies. I'll link his paper down in the description, and to quote one of my Discord members: "I read this paper and it was insane, and Lvmin is too good for Stanford."
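As a side note on the Canny edge condition mentioned above, here is a minimal sketch of how you might extract such an edge map with OpenCV before feeding it to a ControlNet-style model. The file names and threshold values are placeholders, not anything from the paper.

    import cv2
    import numpy as np

    # Load the reference image (the path is a placeholder).
    image = cv2.imread("reference.png")
    image = cv2.resize(image, (512, 512))

    # Canny extracts thin, detailed edges; the two thresholds control how
    # much fine structure is kept (these values are just a starting point).
    edges = cv2.Canny(image, threshold1=100, threshold2=200)

    # Conditioning images are usually expected as 3-channel inputs,
    # so stack the single-channel edge map before saving it.
    edges_rgb = np.stack([edges] * 3, axis=-1)
    cv2.imwrite("canny_condition.png", edges_rgb)

The resulting edge map can then be used as the condition image, which is what lets the generated art keep fine structural details that plain text-to-image would lose.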
That quote is pretty funny, and join my Discord if you haven't. This opens up realistic possibilities for artistic use, architectural rendering, design brainstorming, storyboarding, and so much more. Even black-and-white image colorization may now be possible with diffusion with extreme accuracy, because you can specify the day and age of the image so that it gets colored very precisely. Not to mention image restoration, which is probably going to be possible with diffusion now too, thanks to ControlNet. He also made a page on training your own model and use case with ControlNet, so check it out if you're interested.

Or check out today's sponsor, OpenCV, if you are also interested in generating AI art. Yes, you heard that right: the computer vision organization OpenCV decided to sponsor this video to promote their first ever in-depth AI art course, which will cover the basic and advanced topics related to generating AI art. Not gonna lie, it took me by surprise too, but OpenCV has a really good track record of coding courses, ranging from a few hours to a few months, that teach you how to master computer vision, PyTorch, TensorFlow, and even an advanced course on real-world CV applications. If you haven't seen them, even the free ones are pretty well taught, especially in how they cover pretty much everything OpenCV has to offer. So if they make an AI art course, I think it'll be pretty high quality too.

Right now, they are launching a Kickstarter on February 14th to fund their AI art course so that they can spend time developing the best AI course they can. Previously, they were able to raise a total of $3 million for various courses and projects, and this AI art course is the next one they are planning to venture into. The pricing, of course, will be relatively lower than their OpenCV courses, as it will be a course that can be completed in a few weekends. To celebrate their Kickstarter launch, they are also hosting an AI art generation contest with the prize of an iPad Air. So definitely join the contest if you are interested in getting a free iPad, and check out their Kickstarter page for more information about their AI course.

Thank you so much for watching as usual. A big shout-out to Andrew Leschevias, Chris Ladoo, and many others who support me through Patreon or YouTube. Follow my Twitter if you haven't, and I'll see you all in the next one.
Info
Channel: bycloud
Views: 98,980
Keywords: bycloud, bycloudai, controlnet, style2paints v5, line art to image, ai art colorization, anything to image, depth to image, edge to image, pose to image, ai pose to image, pose to image ai, normal map to image, line art colorization ai, line art colorization, control net, controlnet + diffusion, controlnet ai, controlnet diffusion, text to image depth model, diffusion model, ai colorization, text to image colorization, style2paints, lvmin zhang, controlnet stable diffusion
Id: rCygkyMuSQo
Length: 8min 8sec (488 seconds)
Published: Tue Feb 14 2023