Stable Diffusion and Generative AI with Emad Mostaque - 604

Video Statistics and Information

Captions
All right everyone, this is Sam Charrington, host of the TWIML AI Podcast, and today I'm coming to you live from the Future Frequency podcast studio at the AWS re:Invent conference here in Las Vegas, and I am joined by Emad Mostaque. Emad is founder and CEO of Stability AI. Emad, welcome to the podcast.

Thanks so much for having me, super excited to talk to you.

You are of course the founder and CEO of Stability. Stability is the company behind Stable Diffusion, which is a large multimodal model that has been getting a lot of fanfare, and I'd love to jump in by having you share a little bit about your background.

Yeah, it's been super interesting. Stable Diffusion is a specific text-to-image model, and I'm not sure it's that large, but we can talk about that a bit later, which is one of the fun parts for me. I started off doing maths and computer science at uni, was an enterprise developer, and then became a hedge fund manager and one of the largest video game and artificial intelligence investors in the world. I was doing that, it was a lot of fun, and then my son was diagnosed with autism and they said there was no cure or treatment. So I quit, switched to advising hedge funds, and built an AI team to do literature review of all the autism literature and then molecular pathway analysis of neurotransmitters to repurpose drugs to help him out. And it kind of worked: he went to mainstream school and was super happy.

That's awesome, that's kind of cool.

Good trade, good trade. Then I went back to the hedge fund world for a bit, but it was boring, so I decided to try to make the world a better place. First off I took on the Global Learning XPRIZE, the 15 million dollar prize from Elon Musk and Tony Robbins for the first app to teach kids literacy and numeracy without internet. My co-founder and I have been deploying that around the world, and now we're teaching kids in refugee camps literacy and numeracy in 13 months at one hour a day, and we're about to AI the crap out of that. In 2020-21 I designed and led the United Nations' AI initiative against COVID-19, CAIAC, Collective and Augmented Intelligence Against COVID-19, launched at Stanford and backed by the WHO, UNESCO, and the World Bank. That was really interesting because we were trying to make the world's knowledge on COVID-19 free with CORD-19, a 500,000-paper dataset freely available to everyone, and use AI to organize it, because it's really confusing. During that, lots and lots of interesting tech came through, but I realized these foundation models are super powerful and you can't have them controlled by any one company; it's bad business and it's not the correct thing ethically. So I thought, let's widen this and create open-source foundation models for everyone, because I think it can really advance humanity, and it'll be great to release these things openly so we can have an open discussion about them, and also have the value created from these brand new experiences.

That's awesome. And when did you get started down that part of the journey?

About two years ago. Stability has been going for about 13 months now.

When I think about it, a lot of Stable Diffusion goes back to the latent diffusion paper, which was not even a year ago.

It's not even a year ago. I think the whole thing kicked off with CLIP, released by OpenAI in January of last year. I actually had COVID during that time.
While I was doing my COVID work, my daughter came to me and said, Dad, you know all that stuff you do, taking all that knowledge and squishing it down to make it useful for everyone, can you do that with images? I said, well, we can, and built a bit of a system for her based on VQGAN and CLIP, an image-generating model, where CLIP is an image-to-text model. She created a vision board of everything she wanted, a description of what she wanted to make, and generated 16 different images, then said how each one of those was different, and it changed the latent and generated another 16, another 16, another 16. Eight hours later she had made an image that she went on to sell as an NFT for three and a half thousand dollars, and she donated the proceeds to India COVID relief. I thought it was awesome; she's seven years old.

Wow.

And then I was like, this is transformative technology, and image is the one. At language we're already at 85 percent and we're going to go to 95; at image we're at 10. We're a visual species: the easiest way for us to communicate is what we're doing right now, having a nice chat; text is the next hardest; and images, be they pictures or PowerPoints, are impossible. Let's make it easy, and this tech can do that. So we started funding the entire sector: Google Colab notebooks, models, all these kinds of things. Latent diffusion was done by the CompVis lab at the University of Munich, who led on Stable Diffusion 1 as well, an amazing lab led by Björn Ommer, with Robin Rombach, who is one of our lead developers here at Stability. Then there was work by Katherine Crowson, whose Twitter handle is RiversHaveWings, on CLIP-conditioned models and things like that, and the whole community just came together and built really cool stuff. Then you had entities like Midjourney, where we just gave grants for the beta, that started operationalizing it, and it all came together in the finality of Stable Diffusion, which was released on August 23rd. That was led by the CompVis lab, and then we ourselves at Stability, Runway ML, the EleutherAI community that we help run, and LAION all came together to put it out: 100,000 gigabytes of image-text pairs, two billion images, turned into a two-gigabyte file that runs natively on your MacBook and can create anything.

It's kind of insane, and the speed with which it all came together is mind-boggling.

Yeah. Our model was to have a core team, then contributors and partners from academia, and then these communities that we built and accelerated, from OpenBioML doing protein folding work, to EleutherAI with language models, to Harmonai with audio. It turned out that's a really good system: iterate and experiment with these things at exactly the right time. And now it's progressed. When we started with Stable Diffusion and launched it in August, it was 5.8 seconds for a generation on an A100; as of yesterday, 0.86 seconds; as of two weeks from now, it'll be 20 times faster with our new distilled models. So you get to 24-frames-a-second, high-resolution image creation, starting from basic blobs a year ago. I don't think we've ever seen anything move that fast, and the uptake has been crazy. I believe on Monday the number of GitHub stars for Stable Diffusion overtook Ethereum and Bitcoin; it's overtaken Kafka and everything else. I think it'll overtake PyTorch and TensorFlow in a month or two, and that's since inception.
Over the last month I think Mastodon has had 6,000 GitHub stars; over the last week Stable Diffusion 2 has had six thousand.

And Stable Diffusion 2 was just released, right?

Yeah, about a week ago. With Stable Diffusion 1 we used the LAION dataset to create the image model, and then we used OpenAI's CLIP ViT-L/14 to condition it, so we combined the text model and the image model. With Stable Diffusion 2 we instead used something called OpenCLIP, trained with the LAION charity, so that we had an open dataset for both. OpenAI did amazing work open-sourcing CLIP, but we didn't know what data was inside it; it learned all these concepts and we were like, how does it know that? So when we launched Stable Diffusion as a collaboration, we had all these questions about attribution, about what's in the dataset, safe for work versus not safe for work, but you can't control that if you don't control half the dataset. Stable Diffusion 2 addressed that, and it also had a better text encoder model, so now it's basically heading towards photorealism; you can get photorealistic outputs from it. And again it's kind of insane: you see these things generated in a second and they can be completely artistic or completely photorealistic; these people do not exist, this landscape or this interior does not exist. I don't think we've ever actually seen anything like this, because the majority of humanity doesn't believe they can visually create, just like before the Gutenberg press you couldn't write or read. But now hundreds of thousands of developers, I think we've had 308,000 developers sign up on Hugging Face, are using this to create ridiculous things. And when it gets to real time, what does that even look like, when people can just communicate visually? In a few months, a year definitely, you could generate a live video of this podcast covering all the topics that we're talking about, which is insane.

One of the examples that you like to use is killing PowerPoint: we've got the text, which is where you usually start, and then you go through this long process to make it pretty, or engaging, aesthetic.

Right, and that's because of what these attention-based models do. It's interesting: with my son and his autism, autism is a social interaction disorder caused, in my opinion, largely by a GABA-glutamate imbalance in the brain. GABA calms you down, like when you pop a Valium; glutamate excites you. Obviously in our industry a lot of people know someone on the spectrum, or are there themselves, because the field sometimes lends itself to that; it's a double-edged thing. What happens is that there's too much glutamate. You know when you're tapping your leg because there's too much going on in your brain? Imagine it was like that all the time; you couldn't think straight. So you can't form the connections, that a cup means cupping your hands, or a cup, or a World Cup, in your brain, which is why there are a lot of cases where they can't communicate properly. Addressing those factors can calm it down, and then you basically start reteaching them, just like when you have a stroke: a cup means that, and that, and that, and they can start talking and progress. With these attention-based models, you've moved from a kind of giant extrapolation of data to paying attention to the most important parts, between words and pixels, which is kind of crazy.
In the denoising process of diffusion, the latents that are built up have all the concepts of what a cup means, so if you have a cup in a sentence, it understands what that is in that context, a World Cup or cupping your hands, and then can do these images, which is kind of insane. It works like that part of the human brain, and I think that's what's so exciting; that's what lets you have this compression of knowledge. A hundred thousand gigabytes into two gigabytes is like Pied Piper from that Silicon Valley HBO show; it doesn't make sense.

Yeah.

A hundred thousand gigabytes, a hundred terabytes, was our input data, and the output file is two gigs, and it's not optimized yet. We reckon we can get that to 400 megabytes.

Oh wow.

A 400-megabyte file that works on an iPhone and can generate any image in seconds from a description. And you can go the other way as well: you can take an image and turn it into text, and that text encoding is only a few lines, from which you can generate a high-resolution masterpiece. It's insane.

That's nuts.

I think we were a bit misguided, maybe not misguided, but the focus was on "scale is all you need," 540-billion-parameter, trillion-parameter large language models. Stable Diffusion is 890 million parameters.

How does that work? What about "large"?

Yeah, exactly: it's not large, it's actually quite small. And this points to something about the future, because OpenAI took GPT-3, 175 billion parameters, and instructed it, reinforcement learning from human feedback, by getting annotators to use it and then seeing which parts of the latent space lit up, and InstructGPT had equivalent performance at around 1.3 billion parameters, because you don't need all the information in the world to do things, you just need some of it. Image models, though, are surprisingly small. The largest we'd seen was the 12-billion-parameter ruDALL-E model, but like I said we're at 900 million parameters and we've had great success with our 400-million-parameter models; our four-billion-parameter models are better. Actually the largest is Parti, from Google, at about 20 billion. We don't know what the optimal dataset or optimal parameter size is for these particular non-text models. Text itself is quite a dense encoding, so I think text models will tend larger, but combining these models is going to be super interesting as we move forward.

A lot of your efforts thus far have been on shrinking the model, making the performance better, making it smaller and faster. Do you see a pull towards large models, or do you think it's a different paradigm altogether, where there's not going to be that drive to make the model bigger and bigger?

I think there'll be a mixture of things. What we saw with the DeepMind Chinchilla paper was that the scaling laws weren't necessarily appropriate: it showed that a roughly 70-billion-parameter model trained on much more data would outperform a 175-billion-parameter model. But what it really showed, if you dig into the details, is that data is what you need, and what does that data look like? We haven't done the proper data augmentation and other studies yet. You can also think of these models in a certain way.
Stable Diffusion 1 was a precocious kindergartner that we taught about the whole internet, so it was occasionally a little bit off in some of the outputs. With Stable Diffusion 2 you get into grade school, still super precocious: we made it safe for work, deduplicated the datasets, and a whole bunch of other things. We're still not feeding it the right information; once we know what information to feed it, we'll make it even better. So I don't think the trend is towards large; I think it tends towards more efficient. And one part of that is accessibility, because we optimized Stable Diffusion as a group and a collective to be available on low-energy devices, not just 3090s or A100s. You can download it on your MacBook right now: an M2 MacBook as of today can generate an image of any type in 18 seconds, and in a couple of weeks it'll be less than a second. You can have PyTorch, you can have JAX or whatever, and you can just start coding, and that opens it up to so many people. It's a new type of programming primitive, this one file that can create anything.

Dive into the connection between programming and Stable Diffusion.

If you think about it, you're creating an experience when you're programming, and if you use the diffusers library from Hugging Face, it's a couple of lines and you can be using Stable Diffusion in a code base, and again it can run on a MacBook with no internet. So what type of experiences can you build when you have this verifiable file where words go in and images come out? It opens up a whole world of possibilities. It's like an entire library condensed into an AI model, and we're not really used to that: we've had BERTs and some of these other things, but nothing that has this massive range, shall we say, two billion images, a snapshot of the internet, compressed down.

You're thinking more broadly, though. A lot of the conversation about Stable Diffusion today is about art and the creation part of that process. Thinking more broadly about practical applications, and this is maybe getting into something I wanted to speak about later, where you see the company going: talk about some of the other things that are disrupted beyond just making pretty pictures, arts and crafts.

Yeah. With art we think, oh man, artists never make money, right? Unless they do. My seven-year-old daughter is obviously one of the OGs now in generative art. I actually asked her, why don't you make any more art? And she said, well Dad, there's this thing called supply and demand: if I reduce the supply while you build up this whole industry, the value of my stuff will go up. She's basically paying for her own university. The creative industries are worth hundreds of billions of dollars a year: video games are around 170 billion, movies around 80 billion, and this will all be disrupted by this technology. Think about the creation process: one of our directors was arranging a shoot with a famous actress, and it was going to be a hundred and thirteen thousand dollars to fly her out and get all the other people, just for three days. Instead he fine-tuned a Stable Diffusion model and did it in three hours: two thousand photorealistic shots, meaning the entire shoot was generated, all the shots that were going to put her in different looks for the movie process.
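Earlier in this answer Emad notes that with Hugging Face's diffusers library "it's a couple of lines" to use Stable Diffusion in a code base. As a rough illustration of that claim, here is a minimal text-to-image sketch; the checkpoint id, device choice, and prompt are illustrative assumptions rather than anything specified in the conversation.

```python
# pip install diffusers transformers accelerate torch
# Minimal text-to-image sketch with Hugging Face diffusers.
# The checkpoint id, device, and prompt below are assumptions for illustration.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4",   # assumed Stable Diffusion 1.x checkpoint
    torch_dtype=torch.float16,         # half precision keeps the ~2 GB weights small
)
pipe = pipe.to("cuda")                 # "mps" also works on Apple Silicon MacBooks

# Words go in, an image comes out.
image = pipe("a photograph of an astronaut riding a horse").images[0]
image.save("astronaut.png")
```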
So concept artists are using it to become more efficient. There's a group, Corridor Digital, that created a Spider-Man: No Way Home trailer, about two and a half minutes, in the Into the Spider-Verse style, by training a Spider-Verse model on about 100 images. And you can't tell. You think, wow, this is amazing animation; no it isn't, they just interpolated every single frame and used Stable Diffusion to do image-to-image. It's the craziest thing: it would have cost millions of dollars before, and they did it in a few days. So I think media is going to be the first to be disrupted here, because that creation process was hard and now it's easy.

I would think industrial design, for example, wouldn't be too far behind, like Autodesk.

Yeah, they've got amazing datasets, and you've got the Canvas of the world with every single click on a design. It can make all of that easier, because the system learns; it's a foundational model in some ways, a base foundation that you can then train on your own things, and it learns physics and all sorts of other stuff, which is a bit creepy, but it can learn about the specific type of design you might want to do. We're working with car manufacturers right now who want custom models based on their entire back catalog; they want to iterate and combine different concepts, and it automatically stitches these cars together and combines them. We also didn't just release the base model: we released an inpainting model, so you can delete parts of a picture and have seamless edits based on your text conditioning; you've got an image-to-image model that can redefine an image into any style; and we have a four-times, soon to be eight-times, upscaler, which is like "enhance, enhance, enhance" on a TV show. And all of these are going sub-second now in terms of the speed of iteration. So I think creative is first, then some of these design areas, and then it goes into more visual communication. Like I said, slides: if you've got an image model combined with a language model combined with a code model, you never need to make a presentation again, and it understands what aesthetics are. One of the things we did with Stable Diffusion is that we created a Discord bot where everyone rated the outputs of Stable Diffusion 1.1, and then we used that to filter our giant two-billion-image dataset down to the most aesthetically pleasing things, using CLIP conditioning, and then we trained on that and it became more aesthetic and pleasing, a bit weird in some ways. These feedback loops become very, very interesting, because to get a wide range of viability on these image models, language models, audio models, and others, the human-in-the-loop factor is essential: your typical training data is quite diverse, but you want to customize it to the needs and wants of the humans, or the sector, or the specifics of the task.
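For reference, the inpainting model mentioned above, where you delete part of a picture and fill it back in from a text prompt, is exposed through the same diffusers library. A hedged sketch, where the checkpoint id and the image and mask file names are assumptions:

```python
# Inpainting sketch: repaint a masked region of an image from a text prompt.
# Checkpoint id and input/mask file names are assumptions for illustration.
import torch
from PIL import Image
from diffusers import StableDiffusionInpaintPipeline

pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "runwayml/stable-diffusion-inpainting", torch_dtype=torch.float16
).to("cuda")

init_image = Image.open("car.png").convert("RGB").resize((512, 512))
mask_image = Image.open("mask.png").convert("RGB").resize((512, 512))  # white = repaint

result = pipe(
    prompt="the same car with a retro chrome grille",
    image=init_image,
    mask_image=mask_image,
).images[0]
result.save("car_edited.png")
```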
There are other models out there. You've mentioned Midjourney a few times, you've mentioned DALL-E. We've talked about performance as a kind of target differentiator; what are some of the other ways that you see Stable Diffusion defining itself relative to the other things that are popping up?

Open source will always lag closed source, because closed source can always just take open source and upgrade it, especially with foundation models. But I think data is a key thing.

There's been a recurring theme in our conversation, this idea of the human in the loop and refining the data versus evolving the model, the whole data-centric AI idea.

Yeah, and it is a data-centric thing. If you look at how people adapt these models right now, they're doing few-shot learning or basic fine-tuning. There's no point in training your own model, because it's moving so fast: we'll have Stable Diffusion version 3 in a few months, and we had a 20-times speedup on the model just yesterday. This is insane; I don't think we've ever seen anything quite this exponential. But if you go through an API, there's only so much you can do, and that's what a lot of these companies offer, going via an interface like Midjourney or DALL-E. If you've got the model yourself, you can play, you can experiment, you can adapt it. The language models from the EleutherAI community, GPT-Neo, GPT-J, and GPT-NeoX, are GPT-3-level models, though only up to 20 billion parameters, and they've been downloaded 20 million times by developers, but those developers don't need to tell anyone, they just get on with things. So one of the interesting things for me is that the positioning is the tooling around this, because once you've got those primitives you can build stuff around them, just like you've seen loads of community web UIs and other interfaces for interacting with Stable Diffusion. And for our own company it's a very simple thing: this is like a database on steroids.

You think about it like a database that comes pre-filled with interesting stuff.

That's how most people are using it right now, but soon we'll upgrade it quite a few bits.

It's a kind of magic-box database of images, and your query is your prompt.

Exactly. It's a data store, except it's a super-efficient data store, 100,000 gigs down to two, and it can do all sorts of wonderful things. Right now everyone's using the pre-baked version, the lorem ipsum version, but in a few years everyone will want their own custom ones. So our business model is very simple: take the exabytes of data from content companies, convert them into these models, and make them useful, because we think content is turning intelligent, and it goes beyond media companies to bio, pharma, and others. We're probably the only foundation model company building cutting-edge AI that's willing to work with people and go to the data. Models to the data, I think, is a very interesting approach, based on open frameworks, so you don't have the lock-in of some of these other ecosystems that say, I'll train the model for you, but you have to be locked into my stack.

One of the things that you've mentioned in the past is that you've seen the model learn physics. What does that mean?

If you type in "a lady looking across a still lake," it will do her reflection in the water. Raindrops it gets correct, and things like that, and as you train it more, it learns more and more concepts of how things interact, which again is a bit insane. You can show it the sides of something, train it on an experimental car like a Cybertruck.

How much effort has gone into the visualization community trying to get that stuff right?

Exactly. So you can show it parts of a Cybertruck, for instance, something it doesn't have in its data,
and then you can ask it what the back of the Cybertruck looks like, and it will guess, and it'll probably get it right; it knows the essence of truckness. So rather than having very specific models that each learn one thing, you can now have something that can do just about anything in terms of lighting and so on. And you've got prompt-to-prompt editing, where you can say "make this picture sadder" or "turn him into a clown" or "a stormtrooper," and it automatically does that, because it understands the nature of these things, the physics and the balancing of them, which again is kind of insane. This has big implications for the rendering industry and other things, because this is a far more efficient renderer that can do image-to-image and transform something into something else. Nobody's quite sure how it works; I've got theories. This is one of those things with foundation models: they're just an alien type of experience when you first really start pushing them. Most people stay at the surface level, but when you start pushing you think, it's really curious that it can do this. It doesn't have agency; it's a two-gigabyte file. But the fact that you can have that compression of knowledge, that it understands concepts, is really interesting.

Will that always be a fundamental limiter, meaning if you want a quick and dirty approximation, use something like Stable Diffusion, but if you want a precise rendering, you have to turn to traditional techniques?

I always say it's part of a process, an architecture. You shouldn't try to zero-shot everything; that's the trap people tend to fall into. Have kNNs or knowledge graphs or retrieval-augmented systems or whatever, and put it as part of a process pipeline. But for quick and dirty, it does very, very well, better than anything. And this is also why we have our inpainting and all these other models: it's going to be multiple models doing multiple things for multiple purposes. Sometimes there might be a giant model once you get to a certain stage, and at other times you might just want a quick and dirty 256-by-256 iteration loop. What we've seen as well, with Stable Diffusion 2, is that we actually flattened the latent space through de-duping and a bunch of other things, so it's more difficult to prompt. Stable Diffusion 1 was quite easy to prompt; version 2 is more difficult, but it gives much more fine-grained control. But where we're going, we're not going to use prompts. I think it will just be a case of having your own embedding store that points to points in the latent space and pulls up the things you like most commonly, so it learns, and there's that interaction between the two, with embeddings being a multi-factor representation of what's in there. People's own context is important, and AI models haven't really understood people's personal context, or a company's, or anything like that. Again, this is where fine-tuning comes in: with a two-gigabyte file you can actually have your own model, and then why do you need to prompt "trending on ArtStation, 3D, Octane render" and all those things, when it has learned that that's the style you like? Having said that, I think prompting is just very difficult: my wife's been trying to prompt me for 16 years and she hasn't quite managed.
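Prompt-to-prompt editing as described here is its own research technique, but the simpler image-to-image route Emad also mentions can be sketched with diffusers: start from an existing picture and push it toward a new description, with `strength` controlling how far the result drifts from the original. The checkpoint id, file names, and parameter values are assumptions for illustration.

```python
# Image-to-image sketch: transform an existing picture toward a text description.
# Checkpoint id, file name, and parameter values are assumptions for illustration.
import torch
from PIL import Image
from diffusers import StableDiffusionImg2ImgPipeline

pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16
).to("cuda")

init_image = Image.open("portrait.png").convert("RGB").resize((512, 512))

result = pipe(
    prompt="the same portrait, dressed as a stormtrooper",
    image=init_image,
    strength=0.6,        # ~0.3 keeps the composition, ~0.8 repaints almost everything
    guidance_scale=7.5,
).images[0]
result.save("portrait_stormtrooper.png")
```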
You've touched on a couple of things: open source versus API, and, very briefly, this idea of customization. Based on things I've heard you talk about in the past, you're very strongly opinionated about the model through which you're delivering the technology, beyond just the technology itself. Can you talk a little bit about your thoughts there and what's driving that?

So I think this is incredibly powerful technology; I think it's one of the big epoch changes in humanity, because you have a model that can do just about anything and approximate. There are two types of thinking, type one and type two, logical thinking and then principle-based thinking, and this gets at principle-based thinking: we still don't have AI that does good old-fashioned reasoning with logic, but this can take leaps. Like we said, quick and dirty approximation: you type something in and you get a hundred different images of, say, a book or whatever, and then you can iterate and improve on it, a very different experience. So I think we should put these out as foundation models, benchmark models that people can develop around, because the pace of innovation will outpace anything that's closed, but also because it addresses things like the digital divide and minorities. With OpenAI and DALL-E 2, they introduced an anti-bias filter which, for non-gendered words, automatically added a gender and a race, so when you typed in "sumo wrestler" it would do "Indian female sumo wrestler," which I suppose could exist, but there probably aren't many, and probably not what you wanted, because they're limited in what they can do centrally. Whereas with our model, we released it and then a team in Japan created an alternative Japanese text encoder, so "salaryman," rather than meaning a man with lots of salary, means a very sad man. These local contexts, these local fine-tunes, I think will be essential, and so will widening the discussion. Because a lot of what happens with these big powerful models is, "we won't release them because we're scared about what's going to happen." That's fair, that's an opinion, but it shouldn't mean the technology isn't available to other people, because otherwise it just won't be available to the rest of the world despite the fact it could uplift them creatively, communicatively, and in other ways. One way I put it, if I'm being a bit mean in some discussions: why don't you want Indians or Africans to have this technology? There's no comeback. You can't say more education is needed, or it's too dangerous and they're not responsible, because the reality is this is technology for humanity. It's an echo of what happened with cryptography: "we can't let cryptography be open," and the government classified it as a weapon here in the US because bad guys might use it, but we use it now to protect ourselves. Open source will always be more secure than closed source if the community rallies together. What do we run our infrastructure on here at AWS? It's not really Windows servers, is it; it's Linux, and our databases are MySQL and things like that, because the community can come together and build stronger, more effective systems. But it's crazy how fast this is going, and so it's a difficult line to draw.
You've mentioned that more recent versions of Stable Diffusion include safe-for-work filters and that kind of thing. It sounds like something that you're thinking about and care about, not putting this out without any kind of controls.

Yeah. The original version, again, was led by the CompVis lab, and we said very specifically, you get to decide and we will advise, because it was an academic endeavor, even though one of the authors works for us and another works for Runway and so on. That's the nature of the thing, and we're very respectful of the entities we collaborate with, because it can be a minefield; we're not trying to whitewash anything. So it was released under a CreativeML Open RAIL license, a new type of license from Hugging Face, which says you have to use it ethically and add a safety filter, because the decision was made by the developers not to filter the data, so that it could be a baseline from which we could then figure out biases and other things; the filter removed a lot of nudity and other content. Stable Diffusion 2 is trained on a highly safe-for-work dataset, so it's massively safer, and it doesn't ship a filter because it doesn't need one. It has some drawbacks, though. One of the things we saw with fine-tuning after Stable Diffusion 1 is that people trained it on not-safe-for-work images, the internet being what it is: they took lots of not-safe-for-work images and fine-tuned it. They're free to do that as a community, but the side effect is that when you then used those models for safe-for-work prompts, they were amazing at photorealistic humans, because they learned about anatomy from the not-safe-for-work images, which is quite funny. Stable Diffusion 2 out of the box is a bit less good at anatomy, because we removed a lot of those images, though not much, and we're adding that capability back in safely; we really care about that. The other thing we care about a lot is that we see this community as big; we're creating millions, hundreds of millions of artists, and artists are part of the community. They were asking, can we opt out of the datasets, and some were actually asking, can we opt in, because we're not in the dataset. So we worked with Spawning and LAION and others on opt-in and opt-out mechanisms, because I think that's the right thing to do. I believe it's ethical to use web scrapes to create models like this, especially because the diffusion process doesn't create copies or photo mashes, it actually learns principles, much like a human; but at the same time, if people don't want their data in the dataset, they should be able to opt out, and if they want to, opt in. In fact, we've had thousands of artists sign up for the system, and it's been 50-50 opt-in and opt-out, which I think is really interesting, and maybe not what some people would expect.
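As context for the safety filter discussion, the Stable Diffusion 1 release pairs the model weights with a separate checker component, and the diffusers pipeline exposes it directly rather than baking it into the diffusion model. A small sketch of where it lives, assuming a 1.x checkpoint:

```python
# Sketch of the safety filter shipped alongside the Stable Diffusion 1.x weights.
# The checkpoint id is an assumption; the checker is a separate module, not part
# of the diffusion model itself.
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4")

print(type(pipe.safety_checker).__name__)     # the NSFW checker model
print(type(pipe.feature_extractor).__name__)  # CLIP image processor feeding the checker

out = pipe("a landscape painting of a quiet lake at dawn")
print(out.nsfw_content_detected)              # per-image flags; flagged images are blanked
```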
Maybe shifting gears a little bit: Stability AI as a company, as an organization. I've heard it described variously as an art studio; it kind of looks and feels a little bit like a research lab; it feels a little bit like a funder of things, a provider of GPUs and instances. How do you describe what it is?

Stability AI is a platform company. We're trying to build the layer one for foundation model AI, and we think the future of this will be open source. Our research lab has researchers with loads of freedom: in their contracts they can open source anything they create, and there's a revenue share when we run the models on the API, so even if the researchers no longer work at Stability, they still get cut checks, which I think is a very interesting way of doing things. We've got a product team that takes the open source work, just like anyone can, and productizes it into things like DreamStudio; we have DreamStudio Pro coming up, a full enterprise-level piece of software with 3D, keyframing, animation, video, audio, everything. We've got a forward-deploy team, whereby for our top customers with the most content to be transformed into foundation models, we're basically embedding teams inside them and saying, you don't need to build a foundation model team, we're your team, because we do all the modalities, from text to language to audio, and that's something that's super appealing to people. Then we've got the team supporting our five to six thousand A100s and the infrastructure to scale APIs to billions, supported by Amazon and others.

Can you talk a little bit about some of the ways that you engage with enterprises? What are the kinds of things they want help doing with these models?

The pace of ML research is literally exponential; it looks crazy, so they can't keep on top of it, and there are very few polished papers on arXiv.

It's always nice when you see something that's actually exponential. They could use AI to help with that.

Yeah. But they're realizing they need to be on top of this technology now, and they come to us almost as consultants; it's a Palantir-type model, where we say, we'll fine-tune some models for you and make them usable through DreamStudio, but you shouldn't train your own models now, because the models aren't going to mature for another year. When that time comes, we will train the models for you, we'll fine-tune them for you, we'll create custom models for you. That's our highest-touch engagement, with a couple of dozen entities.

And when you're telling them they shouldn't train models, are you talking about training from scratch?

From scratch. They will be able to eventually, but right now it's not a sensible thing to train a model from scratch. Stable Diffusion took 200,000 A100 hours.

Which is like the 600k you spent on it?

Yeah, about 600k. Well, we actually spent less because of discounts, but I can't say what our discounts are, you know what I mean; you can figure out the retail. Stable Diffusion 2 was about a hundred thousand hours retail, and OpenCLIP, because we had to make the CLIP model, about five million dollars. These things add up; it's quite a large bill. So I think when you look at all of this, now is not the right time to do big trainings for big companies, because the model architectures are just improving at a ridiculous rate.
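A quick back-of-the-envelope check of the compute figures quoted here; the per-hour A100 rate below is an assumption used only to show how the quoted hours line up with the roughly $600k retail figure.

```python
# Back-of-the-envelope check of the quoted training costs.
# The $3/A100-hour retail rate is an illustrative assumption, not a quoted price.
a100_hours_sd1 = 200_000   # quoted for Stable Diffusion 1
a100_hours_sd2 = 100_000   # quoted (retail) for Stable Diffusion 2
assumed_rate = 3.0         # USD per A100-hour

print(f"SD1: ~${a100_hours_sd1 * assumed_rate:,.0f}")  # ~$600,000, matching the quoted figure
print(f"SD2: ~${a100_hours_sd2 * assumed_rate:,.0f}")  # ~$300,000 at the same assumed rate
```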
But then it's going to level off; you can't keep improving forever, and that's the right time to train up your own models. They'll be better than these fine-tuned models, but then you'll have multiples of multiple modalities. This is part of the reason we've partnered around SageMaker: people need to get used to this technology now, and they'll have all these different primitives, these different models they can mix and match to create brand new things going forward.

And SageMaker makes it kind of easy to do that.

Yes, and it makes it easy to address the tail, because apart from the top couple of dozen companies, we just want a SaaS solution for everyone else to be able to access, use, and modify these models.

Following up a little bit on the SageMaker and AWS announcement, I read it as you having selected AWS, but from my understanding you've been using AWS to some extent all along.

Yeah, AWS built the core cluster, and then we reached this point where we had to decide what's next. It was originally a 4,000-A100 cluster, which on the public TOP500 list would be about the number 10 supercomputer in the world, which is kind of insane, so they did a great job building that. But then we had to think about managing resilience and some of these other things if we built our own next cluster. Amazon came and said, let's use the SageMaker service to offer a higher level of resilience and optimization. The SageMaker crew, for example, took our language model GPT-NeoX, again 20 million downloads of this family, and took the efficiency of a 512-A100 training run from 103 teraflops per GPU to 163, by optimizing it for the Amazon EFA fabric, pipeline parallelism, attention optimizations, and all those things, which was amazing. So they're helping us optimize our entire stack, from inference to training, through to having resilience so that when GPUs fail, they come back up. And the final part was how to make this accessible through SageMaker and the services and ecosystem they've built around it. Now we're going to make our models available on everything: like I said, today they became available on the MacBook M1 with native Neural Engine support, one of the first models ever to have that, which massively sped it up; we've got it working on Qualcomm; we're going to work on iPhones, all these things. But Amazon's a really great partner because they're infrastructure players, one of the biggest cloud providers in the world, and that's why we picked them as our preferred partner. Also, we're super grateful: we wouldn't be here if they hadn't built us a freaking enormous cluster and really believed in us, because we're only a 13-month-old startup.

So everything's been in the cloud the entire time.

The entire time. We had a machine learning ops team of four people managing four thousand A100s; now we're up to nearly six thousand.

Was that team managing the cluster bare metal with your own tooling, or how much of the Amazon tooling have you used?

It was EC2, and Amazon has a system called ParallelCluster with Slurm that was used to manage it, and we've been working for the last four, five, six months constantly improving it together. And again, it's open source: if you go to the Stability AI GitHub you can literally download all our configurations to run your own ParallelCluster. This is part of what we really like, the fact that the stack is open source and anyone can take it and build their own clusters, maybe not quite at the size that we did unless you're feeling really punchy.
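For a sense of what the quoted GPT-NeoX optimization means in aggregate, a small arithmetic sketch; the 312 TFLOPS A100 peak used for the utilization estimate is the published BF16 spec and is an added assumption, not a number from the interview.

```python
# Rough arithmetic for the quoted 103 -> 163 TFLOPS-per-GPU improvement on 512 A100s.
# The 312 TFLOPS BF16 peak per A100 is an assumption added for context.
before_tflops, after_tflops = 103.0, 163.0
peak_bf16_tflops = 312.0
gpus = 512

print(f"per-GPU speedup: {after_tflops / before_tflops:.2f}x")           # ~1.58x
print(f"sustained cluster throughput: {after_tflops * gpus / 1000:.1f} PFLOPS")
print(f"utilisation: {before_tflops / peak_bf16_tflops:.0%} -> "
      f"{after_tflops / peak_bf16_tflops:.0%} of peak")
```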
But still, I think this knowledge should be shared, because you find that large model training isn't really a science, it's more of an art. One of the most interesting reads you can do is the Facebook OPT-175B logbook for the 175-billion-parameter model: they just try stuff and it often fails, and there's the occasional weird thing, like, I believe it was the Azure customer support that on the 23rd of December deleted the entire cluster, and you're like, man, I feel for you guys. But like I said, this is not an easy click-and-play thing; these models are difficult to train. The smallest hardware issue can throw it out, there can be just weird stuff, and we're making it up and figuring it out as we go along, because remember, Transformer architectures are literally only five years old.

Thinking about open source and that direction broadly, the direction the company is taking, one of the challenges that comes up in open source as it matures is this idea of governance. Maybe it's early to be talking about governing a community that's just months old, but do you have thoughts on how the community governs itself over time?

Yeah. It's a complicated one: AI governance, is it policy-led, is it community-led, who are the voices at the table? There are some important questions there, because it's such powerful technology; it's going to be essential, I believe, to the future of humanity. So, for example, EleutherAI is two years old; that's our language model community, 15,000 people and developers, which we're incubating at the moment. We're going to spin it out into its own separate 501(c)(3), because it shouldn't be us influencing the direction of open-source large language models; it should be a collective effort. And now we're really going through the governance question and looking at different examples, and the Linux Foundation is an excellent example of that: PyTorch has just been given to the Linux Foundation, so we're in talks with them and a whole bunch of others to ask, what are the best practices here, and what should we look like, given the power of these things and some of the decisions you need to make? Stability itself we're setting up with subsidiaries in every country, such that, first off, ten percent of our equity in those goes to the kids using our tablets, because I think they should influence it; that's the next generation this AI will matter to. But then we want these to be independent entities that run the AI for India, or Vietnam, or Malawi, and so on, because we need to train up a next generation of people to make those decisions for their own country. Right now what we have is a situation where a few people in San Francisco are making decisions about the most powerful infrastructure in the world for everyone. This AI is infrastructure; it's essential for where we're going, and it shouldn't be controlled by any one person or entity. I'm very supportive of the whole ecosystem; the one time I was very direct was when I spoke out against OpenAI, because for DALL-E 2 they banned Ukrainians from using it, and they removed any Ukrainian entities from it as well, during the time when those people are being oppressed. I said, basically:
You have excluded and removed and deleted people who are being oppressed, and that is ethically and morally wrong. But it's their prerogative as a private company, and if it wasn't for us there would be no alternative. So I literally took Ukrainian developers whose houses were destroyed and brought them to the UK. This is part of the thing as well: if control of this artificial intelligence is given to unregulated entities like these big companies, they can't help themselves but behave in certain ways, because they can't release it, and more than that, they tend to optimize. I did a lot of counter-extremism work, advising multiple governments; the YouTube algorithm got hijacked by extremists because the most engaging content was the extreme content. That's not YouTube's fault, it's full of great people. Ad-driven AI companies will use this technology to create the most amazing, manipulative ads; I guess it's not their fault, it's what they are. So regulation needs to come in appropriately, governance needs to come in appropriately, but we need to educate and widen the discussion on this, and the only way to do that is open source, otherwise it will never happen, and you will have AI basically being a colonial tool in some ways, with very Western norms, when this is, as I said, essential infrastructure for everyone.

I think the common retort to that is that it needs to be controlled because it's so powerful, so dangerous.

So who are you to control it? I've heard it likened to a nuclear weapon. Oh, like this nuclear weapon that allows humans to create visually, and you're restricting it? Again it comes down to that question, and I've never had an answer to it: why don't you want Indians or Africans to have this? The only answer is, because they need more education. So educate them more. Saying they can't use it responsibly and you can is racist. I think fundamentally, if you think about the digital divide, we've seen this with technology being restricted from minority groups and from the rest of the world frequently, and it's fundamentally racist, because we think we know better in the West when in reality we don't, because people can take this and extend it. People are generally good, people are not bad, and if people are bad, as a society we build systems to regulate that. Even if they create deepfakes, we adjust our social networks and their creation mechanisms, and we build authenticity schemes like the Content Authenticity Initiative, contentauthenticity.org, which we back.

It sounds like the core of your answer is that the ecosystem will solve the problem: the bad actors come in, they use these tools to cause whatever havoc they'll cause, and then we'll find fixes.

Bad actors have the tools already; they have tens of thousands of A100s.

You're the proof point of this, right? OpenAI was keeping DALL-E closed behind APIs and waitlists and things like that, and you came up out of nowhere and released something.

And look, 4chan has had this technology for three months; what have they created? Nothing, right? This isn't going to topple humanity, and the more people know about it, the more we can bring this discussion into the open. We took a lot of flak, and we had a lot of benefits, but we brought this discussion into the open, into policy and other fields as well. It's my hope that others now react to this forcing function, so I reckon DALL-E 3 will be open sourced, just like they open-sourced Whisper,
and I think that would be fantastic. Let's bring it out into the open, because again, this is foundational infrastructure for extending our abilities; it should not be closed.

Yeah.

Not that I believe it should be free forever. Actually, it's not even open source in a pure sense, because the CreativeML license doesn't conform with rule zero of open source: we say you must use it ethically. Do we hope to move it to open source, under a CC-BY or MIT license, just like our other models, like our Korean language model, the Polyglot one from EleutherAI, or OpenCLIP, or things like that? Yes. But again, this needs to be an open discussion, I think, rather than us deciding who decides. I don't know if regulators want to come in and regulate it; again, that's a democratic decision, and I'm a big supporter of democracy and those processes, but let's use our institutions and our processes rather than trying to make these decisions ourselves in closed rooms.

Awesome, awesome. Well, Emad, thanks so much for taking the time to chat. It's been wonderful speaking with you and learning a bit more about what you're up to.

It's a pleasure. I hope you have a good rest of the event as well. Nearly done, nearly done. Thanks so much, take care.
Info
Channel: The TWIML AI Podcast with Sam Charrington
Views: 5,912
Keywords: TWiML & AI, Podcast, Tech, Technology, ML, AI, Machine Learning, Artificial Intelligence, Sam Charrington, data, science, computer science, deep learning, amazon web services, aws, generative ai, stable diffusion, ai generated art, dall-e, clip, open ai, latent diffusion models, emad mostaque, open-source, API, stability ai, huggingface, re:invent
Id: 63Y1sMmidj4
Length: 45min 27sec (2727 seconds)
Published: Mon Dec 12 2022