DiT: The Secret Sauce of OpenAI's Sora & Stable Diffusion 3

Captions
I think we can pretty much all agree that we are near the top of the sigmoid curve in the development of AI image generation right now. Half a year of progress used to look like this, but for the last 6 months there's been no comparable change like what we have experienced before. Now, can you even tell which one is a real image and which one is a fake here? And it is even harder to tell without any fingers or text to nitpick. However, we are also not completely on top of the sigmoid curve of progress either; AI image generation still has fingers and words to generate and details to perfect, which is why nitpicking these parts is still the easiest way to identify AI-generated images. While people can easily cover up these faults after the initial generation with techniques like Hires Fix or image inpainting, researchers still can't really call it a day if they can't do it in a single pass, because I guess it's just not as satisfying. A simpler solution might just be needed too, because there are like a billion workflows and workarounds to configure and generate images right now, and this is definitely not it.

So we need a new yet simple backbone, but we cannot get rid of diffusion models, since they are the best AI architecture at generating images. Hmm, let's just take whatever else is working and combine them together, like AI chatbots with diffusion models. Maybe that will work. Well, that worked. More specifically, the attention mechanism within the large language models that power AI chatbots is actually super useful for language modeling. We use the attention mechanism in the first place because it lets the model attend to multiple locations when generating a word. This is important, as it can encode information about the relations between words. For example, the "it" within the sentence "the chicken didn't cross the road because it was too tired" is obviously referring to the chicken, but without the attention mechanism the AI wouldn't be able to tell that the "it" is referring to the chicken, since referring to the road is also grammatically correct.

So for generating images, if we can have the AI pay attention to other specific locations, it would be much easier to synthesize the small details like the text or fingers within an image consistently, since there needs to be a very strong relational connection to generate coherently, something a convolutional neural network, or more specifically the U-Net used in diffusion models, could not provide. So maybe attention is actually all we need, because if you look at all the current state-of-the-art models like Stable Diffusion 3, or even Sora, which is a text-to-video generation model by the way, you would realize that we are pivoting towards diffusion transformers (DiT), which are introduced in both of them, actually, but with slight modifications. However, the whole idea of combining transformers that have the attention mechanism with diffusion models was actually introduced a while ago, and researchers have kind of already known for some time that this would be the state of the art, but no one was willing to take the step of spending millions to train it from the ground up just to confirm it. Luckily, a year later, we finally got to see the results, and not just in text-to-image but also in text-to-video.

For Stable Diffusion 3, while it is still not officially out yet and we only have the technical paper along with the results from Emad's tweets to look at, we can kind of already tell that it is on another level we have not seen before. Keep in mind that this is a base model, and its performance has already surpassed a lot of fine-tunes and other pre-existing generation methods. The overall proposed structure for Stable Diffusion 3 is pretty complex too; shout out to Stability AI for this super detailed diagram, by the way. With the introduction of other new techniques like bidirectional information flow and rectified flow that may have improved its capabilities at generating text within images, the diffusion transformer still probably plays a key role. And damn, just look at it generating in 1024x1024; it is such an eye bleach with how well it generates the details, especially for synthesizing complex scenes. With the addition of text, SD3 had no problem generating words even in cursive; the only few mistakes it made in the official examples were adding an extra "s" or missing an "f" in the word "diffusion" while written in cursive. I cannot confirm it since it's not out, but hopefully it's that good. Emad has also claimed that it is the best at understanding complex scene compositions, like this one of a red sphere on top of a blue cube; behind them is a green triangle, on the right is a dog, and on the left is a cat. No other models have been able to accurately generate this before. By the way, we have also been teased in the technical paper that SD3's DiT is also a multimodal DiT, which means that image generation with SD3 can directly be conditioned on images, which means we would not need ControlNet anymore. But not much information has been shared about this, so I guess we'll see when SD3 is out.

DiT's capability in composition and consistency can also be seen in the latest AI sensation Sora, a text-to-video AI published by OpenAI; you can check out my old video for some more context. Recently the key researchers have published some newer results, and oh God, they look way too real. In a recent interview they had with MKBHD, they said that it wouldn't be available anytime soon, but I guess that is pretty reasonable, as the general public is definitely not prepared for it. If you don't believe me, just look at how some Facebook users react to AI-generated images. While Sora is super impressive, it is probably not the research marvel that people thought it was; actually, it might have been an engineering one. You see, diffusion transformers have existed for quite some time, and looking at their technical paper, the most unique part Sora added is the space-time relation between visual patches that were extracted from the individual frames. So the only new thing is adding the space-time relation, because extracting visual patches is already something DiT does, and not much else about the architecture was changed or added, which may give people the idea that it is not as complicated under the hood. So for Sora to generate videos with such high fidelity and coherency, it might have been a work of scaling the compute over tens of thousands of GPUs for just the training, which is kind of crazy if you think about it. However, in a recent interview they did share that it only takes several minutes to generate a video from Sora: "Those videos are 720p, 20 seconds long. How long does it take to generate those?" "It could take a few minutes, depending on the complexity of the prompt." So I might be really wrong here; maybe the DiT architecture made a big difference, but the compute probably made even more of a difference, judging by how big of a leap in quality they had compared to previous state of the art like Stable Video Diffusion or Pika Labs. So besides safety issues, maybe the amount of compute required for inference is also one of the reasons why Sora is not available for public use, which has resulted in only a handful of demos. While these are all speculations, it is still a hard fact that DiT may be the next pivotal architecture for media generation, because not only is image generation being perfected by this architecture, but video generation too. Sora is kind of like a stamp of approval for DiT, which made other DiT-based research, like DiffiT from NVIDIA and HDiT from Stability AI, hold bigger promises for the future. If you do want me to dive into DiffiT and HDiT, let me know down in the comments.

And if you're excited to try out some text-to-video like Sora but couldn't, today's sponsor Domo AI might actually be a great alternative for you to check out. Domo AI is a Discord-based service that lets you generate videos, edit videos, animate images, and stylize images really easily. I personally have actually been following Domo AI for a while, and they are really good at generating video-to-video or image-to-image conditioned on text. What this means is that if you give it a video or an image, you can prompt some sort of style for it to reference, and it can generate the video or the image in the style of your prompt. Domo AI is especially good at generating in the style of animations, and that is how I actually found them in the first place. They have a range of customized models for you to pick from, each with different anime or illustration styles you can use and generate with. They are, from what I've seen so far, the ones with the best results while needing the least effort, because if you remember from my old video of how people created AI videos, they had to suffer through a billion workflows, while Domo AI can just do it for you all in a few simple steps. Their highlight, though, is definitely the image animate feature, where you can turn images into videos; all you have to do is provide a starting image, then it'll use that initial image to create a moving sequence for you. Very cool if you want to make a real or AI-generated image move. So if you do want to try out Domo AI, you can get started with the link down in the description to join their Discord and start generating. Thank you Domo AI for sponsoring this video. A big shout out to Andrew, lelz, Chris, Leo, Alex J, Dean, Alex, marce, mulim, fifal, and many others that support me through Patreon or YouTube. Follow my Twitter if you haven't, and I will see you all in the next one.
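The attention mechanism the video describes can be sketched in a few lines. This is a minimal, framework-free illustration of scaled dot-product attention, not code from any of the models discussed; the function names and array shapes here are my own assumptions. Each query (e.g. the word "it") gets a weighted view over every position in the sequence, which is what lets the model resolve what "it" refers to.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)        # (n_q, n_k): relevance of each position to each query
    weights = softmax(scores, axis=-1)   # each query's weights over all positions sum to 1
    return weights @ V, weights

# Toy usage: 4 query tokens attending over 6 positions, embedding dim 8
rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(4, 8)), rng.normal(size=(6, 8)), rng.normal(size=(6, 8))
out, weights = attention(Q, K, V)   # out: (4, 8), weights: (4, 6)
```

In a real transformer, Q, K, and V are learned linear projections of the same token embeddings, and many such attention "heads" run in parallel, but the core relational lookup is exactly this weighted sum.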
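The "visual patches" idea behind DiT and Sora can also be sketched: each frame is cut into non-overlapping patches, and each patch is flattened into one token of the sequence the transformer attends over. This is a simplified illustration under my own assumptions (the `patchify` helper and patch size are not Sora's or SD3's actual code, and real space-time patches may also group pixels along the time axis), but it shows how a video tensor becomes a token sequence.

```python
import numpy as np

def patchify(frames, p):
    # frames: (T, H, W, C) video; p: spatial patch size.
    # Splits every frame into non-overlapping p x p patches and flattens each
    # patch into one token, giving a (num_tokens, token_dim) sequence.
    T, H, W, C = frames.shape
    assert H % p == 0 and W % p == 0, "frame size must be divisible by patch size"
    x = frames.reshape(T, H // p, p, W // p, p, C)
    x = x.transpose(0, 1, 3, 2, 4, 5)          # (T, H/p, W/p, p, p, C)
    return x.reshape(T * (H // p) * (W // p), p * p * C)

# Toy usage: a 2-frame, 4x4, 3-channel "video" with patch size 2
video = np.arange(2 * 4 * 4 * 3, dtype=float).reshape(2, 4, 4, 3)
tokens = patchify(video, 2)   # 2 frames * 4 patches each = 8 tokens of dim 12
```

The transformer then attends across all of these tokens at once, within and across frames, which is the "space-time relation" the video attributes to Sora.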
Info
Channel: bycloud
Views: 51,614
Keywords: bycloud, bycloudai
Id: OPqFpm3wksY
Length: 8min 26sec (506 seconds)
Published: Thu Mar 28 2024