SDXL 1.0 Released! Stability AI Shares Secret Stable Diffusion Weights!

Video Statistics and Information

Captions
Hello friends, welcome to AI Flux. Stability AI just released SDXL 1.0, and it's looking better than ever. This actually came as a huge surprise: I saw it come up on a Hacker News aggregator I wrote, and Stability AI had basically said, "We're going to talk about this live on our Discord and go over the differences and the modifications we made in light of the SDXL 0.9 leak," and then a few hours later all the weights were available on Hugging Face (linked below). My initial take is that SDXL is roughly in the same ballpark as Midjourney v5 quality-wise, but the main value is the array of tooling that's immediately available for it, plus a license you can now use commercially without any problems. You can fine-tune it on your own pictures, use higher-order inputs (not just text), and daisy-chain various non-imaging models and algorithms to get pretty incredible output: object and feature segmentation, depth detection and processing, subject control, and so on. An example of subject control would be something like ControlNet. You can end up with really complex, really nice output, either procedural or one-off. It's all experimental and very improvised, but it's a lot of fun to mess around with and incredibly powerful, especially for anyone who does CGI or has worked with 3D tools. I believe Automatic1111 support has already landed; I'm not entirely sure about ComfyUI, but I'm sure we'll get there soon enough.

I also love that the performance of this model seems better. It reminds me of how snappy and responsive Stable Diffusion 1.x and 2.x were, and in this case the resolution is much higher, so you're getting the best of both worlds plus the latest tooling from Stability. What's really interesting about SDXL 1.0 is that it's actually less censored than other models. You can of course still flip the toggle we all know about, but from what I've gathered, at least for anatomy and the kinds of prompts that might otherwise send you toward something like Unstable Diffusion, it doesn't look heavily filtered. What they tried for 2.0 and 2.1 was way overdone, and I think they realized it was too restrictive for people using Stable Diffusion day to day. It's also nice that it doesn't lean too far the other way, because otherwise these models get annoying to use.

For those of you who aren't aware, the biggest departure with SDXL 1.0, compared to the scaling methods used with GPT-3, 3.5, and 4 (and maybe even the next version), or with large language models like LLaMA 65B, is that the mantra of "more is better" is not what's being applied here. Stability AI has used some clever tricks, combining two different CLIP text encoders and two different specialized models, to extract more detail and give you higher-fidelity, larger image output. The result is roughly eight to ten times as much detail as prior versions of Stable Diffusion, without a dataset that's ten times larger. And what's cool is that right now it seems entirely possible to run a fairly full version of SDXL on a GPU with only 10 GB of video memory.
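As a concrete example, here's a minimal, untested sketch of how you might squeeze the SDXL 1.0 base model onto a card in that range using Hugging Face's diffusers library; the checkpoint name is the public SDXL 1.0 base repo, while the prompt, step count, and output filename are just illustrative assumptions:

# A rough sketch, assuming the diffusers library plus its usual companions
# (pip install diffusers transformers accelerate safetensors).
import torch
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,   # half-precision weights roughly halve VRAM use
    variant="fp16",
    use_safetensors=True,
)
pipe.enable_model_cpu_offload()  # keeps only the active sub-module on the GPU

image = pipe(
    "a photo of poutine on a rustic wooden table, studio lighting",  # illustrative prompt
    num_inference_steps=30,
).images[0]
image.save("sdxl_test.png")

This is just one way to keep memory down; dropping to fp16 and letting idle components sit in system RAM is usually enough to keep the base model within roughly 10 GB of VRAM.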
Some people even managed to run SDXL 0.9 on an iPad Pro with just 8 GB of RAM, which is kind of crazy. I've pulled out some pretty interesting insights here, but first let's go to the release page from Stability AI to see what they had to say. In a prior talk, Emad was very apprehensive and careful about what they were going to do before releasing SDXL 1.0, and a lot of that had to do with tuning based on feedback from researchers. Curiously, since the weights were leaked, I think that process may actually have sped up. Hopefully the next iteration can avoid a leak, but a curious side effect of these leaks, just as with LLaMA and some of Meta's models, is that they seem to have accelerated the pace of improvement and development at Stability AI, at least in the near term.

Stability says they're excited to release SDXL 1.0, which they call the next iteration and evolution of text-to-image generation models. What I really wanted to see is whether they mention some of the more fringe features, which included animation, video, and adaptive forms of outpainting similar to what we see in Midjourney right now. They're calling this their best image model yet, and I'm really curious how they define that. They say SDXL 1.0 is the flagship image model from Stability AI and the best open model for image generation: they've tested it against various other models and the results are conclusive, people prefer images generated by SDXL. Basically, that's similar to what happens when anyone uses Midjourney: you present images and people pick the best one. I'm not sure how they conducted this study; they say it's from external testing, and there's no way to validate how accurate it is, but the claim is that compared to other versions of Stable Diffusion (not necessarily other image models generally), people prefer SDXL.

Emad has been really big on emphasizing that SDXL is not just about generating images: it's about transferring styles, applying LoRAs, and doing all of that in a way that is as cohesive as possible and realistic to the human eye. Photorealism is no longer a pipe dream with these models; it's now a competitive feature that gets better day after day. Interestingly, they're really specific here: they say SDXL can generate concepts that are notoriously difficult for image models to render, such as hands, text, or spatially arranged compositions. Spatially arranged compositions are something I've talked about on this channel before: a model that is aware of depth of field, of a focal point or even multiple focal points (a difficult problem to solve even with real physical cameras), and, from a lighting perspective, of things like atmospheric scattering. It's crazy to see these models reaching a point where they handle those effects better, and more efficiently, than trying to model them with physics via ray tracing, path tracing, and so on.
One thing that has been highly contested about the changes made to the CLIP models in Stable Diffusion XL is that people wanted to maintain a high degree of control and not have it become similar to Midjourney, where you can give it three words and always get something incredible-looking, or at least intriguing to the human eye, but you lose degrees of control over getting what you really want. That control is where Stable Diffusion excelled: you could give it a three-word prompt and get something pretty good (probably not as good as Midjourney, at least relative to what you were looking for), but if you gave it thirty words in a highly tuned way, you had a much greater chance of getting exactly what you wanted and, more importantly, excluding what you didn't want. Midjourney still only gives you okay tools for that, so for careful tuning, Stable Diffusion has long been your best option.

Now, ironically, we were just talking about how this model gained a lot of performance without getting massive, but it is, generally speaking, the largest open image model that we can actually look at and pull apart to see how it's built. Stability says SDXL 1.0 has one of the largest parameter counts of any open-access image model; we don't really know what Replicate or Midjourney run on, but the new architecture is composed of a 3.5-billion-parameter base model and a 6.6-billion-parameter ensemble pipeline (base plus refiner). Those were numbers we didn't know as of SDXL 0.9. Importantly, there is a base model and a refiner model: the full system is a mixture-of-experts-style pipeline, which is similar to how GPT-4 is said to be provisioned. In the first step, the base model generates noisy latents, which are then further processed by a refinement model specialized for the final denoising steps. The base model can also be used as a standalone module, so you can use one or both, but generally the model is intended to be used with both for the best results (a rough sketch of that two-stage flow follows below).

Here is where it gets really interesting: the two-stage architecture allows for robustness in image generation without compromising speed or requiring excess compute resources. The reason you can run this on, say, an RTX 3080 with only 10 GB of VRAM is that you only have to run one of those models at a time; you don't have to load both and push everything through at once. That's why they say SDXL should work relatively well on consumer GPUs with only 8 GB of VRAM.

Fine-tuning is what a lot of people think about when they hear Stable Diffusion, because it's something you can't easily do with other models. You can provide an input image or input references elsewhere, but in terms of LoRAs and similar approaches, Stable Diffusion is really the leader. They say that fine-tuning with custom data is easier than ever: custom LoRAs or checkpoints can be generated with less need for data wrangling. Frankly, as a software engineer, I find it ironic that most people don't realize most of fine-tuning is data wrangling, arranging weights and data rather than writing wildly complex code. A lot of it is massaging things until they produce what you're looking for, then crossing your fingers that you don't screw it up and hoping your process is adaptable enough that you don't have to change it entirely once you have something you like.
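To make that base-plus-refiner flow concrete, here's a rough, untested sketch using the diffusers library. The two checkpoint names are the public SDXL 1.0 repos; the 0.8 hand-off point, step count, and prompt are illustrative assumptions rather than anything from the announcement:

# Sketch of the two-stage SDXL pipeline: base model for most of the denoising,
# refiner model for the final steps. Assumes a recent diffusers release with SDXL support.
import torch
from diffusers import DiffusionPipeline

base = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16, variant="fp16", use_safetensors=True,
).to("cuda")

refiner = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-refiner-1.0",
    text_encoder_2=base.text_encoder_2,  # share components between the two pipelines
    vae=base.vae,
    torch_dtype=torch.float16, variant="fp16", use_safetensors=True,
).to("cuda")

prompt = "line-art cyberpunk harbor at dusk, two-tone palette"  # illustrative prompt

# Step 1: the base model runs the first ~80% of the denoising schedule and
# returns a still-noisy latent instead of a finished image.
latents = base(
    prompt, num_inference_steps=40, denoising_end=0.8, output_type="latent"
).images

# Step 2: the refiner, specialized for the last denoising steps, finishes the image.
image = refiner(
    prompt, num_inference_steps=40, denoising_start=0.8, image=latents
).images[0]
image.save("sdxl_refined.png")

Because the refiner only picks up where the base leaves off, the two models never have to be active at the same time; on a smaller card you could load and run them one after the other (or use diffusers' CPU-offload helpers) rather than keeping both on the GPU as this sketch does.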
They appear to be saying that fine-tuning Stable Diffusion, even with messy input images, is actually quite easy (a small sketch of loading a custom LoRA into the pipeline follows the image walkthrough below). This announcement is pretty short and sweet, I think because they got really technical in the 0.9 release. It's cool that you can use it on Clipdrop right now, and the weights of SDXL 1.0 were officially released just hours after the announcement; there have been times where Stability has released models but not the weights, or slow-rolled them. What's also cool is that we're starting to see the benefits of Stability AI working with Amazon: SDXL 1.0 is now available on Amazon infrastructure through SageMaker and Bedrock, meaning if you run on AWS they're making it very easy to use Stable Diffusion. And as always, you can use it on their Discord, and DreamStudio has it available too.

Now let's look at some images. One thing they've been really excited to show is text rendered onto actual subjects in the image. What's cool here is that there are clearly multiple focal points: one on this man's face and one on the subject behind the text, which previously would have been hard to do. Something else that's interesting: supposedly part of the delay was a licensing delay rather than a technical one, and the leak of SDXL 0.9 made the licensing part more complex.

One thing I like here: a lot of times with characters or portraits in AI, you can have either the portrait look great or the subject look great, and this is an example where both are quite good right out of the box. You have these metal features with illuminated portions clearly affixed to whatever the character is wearing, and metallic elements clashing with skin tones is a common failure; it's sometimes hard for these models to separate skin tone from clothing tone. There's even some refraction going on: the hair is clearly white, but you get this blue diffraction coming through. Cohesion is everything, and although we don't see much of the character below her face, we can tell it's cohesive, and to your eye it looks right. Distortion is something that's genuinely hard to suss out of these models.

Here we have something really abstract, in a line-art style. I've always liked the cyberpunk look you can get with these models, but what I like here is that there's much more cohesion about what the shapes are: clearly these are boats, there's a water and wave texture, you can make out people, and there's a lot of detail built from only two or three primary colors with the rest all line work. Line work is something these models sometimes overdo to the point that everything becomes a scribble, but here the buildings have a clear lateral delineation, and then there's this building that I guess is supposed to be glass-and-metal clad, or just much more reflective without any texture, and you can tell the line work changes entirely there. You still get the sense that this is supposed to be Hong Kong, with a hillside getting darker behind buildings in the foreground. Even when you give it very little context, the awareness of what is in the foreground, of what it wants your eyes to look at or scatter across, is better than any model I've seen from Stable Diffusion before, even their beta model.
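As promised above, here's a minimal, untested sketch of loading a fine-tuned LoRA into the SDXL pipeline with diffusers; "my_style_lora.safetensors" is a hypothetical placeholder for weights you trained yourself, and the prompt is illustrative:

# Loading a custom LoRA on top of the SDXL 1.0 base model.
# Assumes a recent diffusers release with SDXL LoRA support.
import torch
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16, variant="fp16", use_safetensors=True,
).to("cuda")

# Accepts a local .safetensors file or a Hugging Face Hub repo id.
pipe.load_lora_weights("my_style_lora.safetensors")

image = pipe("a portrait in my custom style", num_inference_steps=30).images[0]
image.save("lora_sample.png")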
To close it out, here are four images of poutine (I've been spending some time in Montreal lately), and they look pretty cool. Food is always weird with AI, because you can never tell if it's making things too greasy, and cheese textures are sometimes odd, but it's a great benchmark because there are so many different textures in one shot. The poutine is clearly the subject and the beer in the background is not; the fries have that greasy texture, and the gravy and the cheese look distinctly different. There's a sort of candy-corn drip here that makes me a little uncomfortable, but interestingly we get two different plates, and while I think coffee with poutine would maybe not be very good, there's again an interesting mixture of the cheese texture with the drooping, more liquid gravy, and the fries look nearly perfect. Some of them get a little weird, but a random pile of French fries is a pretty hard shape to get right. And here we have something you could imagine as an Instagram ad for a restaurant. I don't know what's in these shot glasses, which is a little concerning, but we have a single subject (this is clearly a food portrait), the texture of the table, the reflection off the plate, and a sort of studio lighting, and it's really cool. I know one person who runs a ghost kitchen on DoorDash; they haven't used AI to generate their images outright, but they have used AI to improve them, and when they make menu changes they've started using AI to generate certain images of the food from multiple inputs. As always, I hope you learned something from this video. Please like and subscribe if you like our content, and we'll see you in the next one.
Info
Channel: Ai Flux
Views: 18,709
Id: JuE347R6MdQ
Length: 14min 53sec (893 seconds)
Published: Wed Jul 26 2023