LoRA vs Dreambooth vs Textual Inversion vs Hypernetworks

Captions
At the moment there are literally five different ways to train a Stable Diffusion model to understand a specific concept, like an object or a style: we've got DreamBooth, Textual Inversion, LoRA, hypernetworks, and aesthetic embeddings. I've done videos on all of them, and today we're going to work out which one you should use.

Okay, so to answer that question properly, what I went ahead and did is I read all the papers, I trawled through the AUTOMATIC1111 codebase, I scraped a whole bunch of data off Civitai to see which models people are liking and which they're not liking, I compiled a big spreadsheet, and I made a nice diagram. Today we're going to answer that question. At the end of this, you will know which one you need to use. By the way, if you like the art there, it was made by someone called Dreaming Torpo, who seems to be very good at AI art, so following that person on Twitter would probably be a good idea. But I digress.

This is how we're going to structure this video. The first question we're going to answer: what are the methods and how do they work under the hood? The second question: what are the trade-offs? According to the data, which ones seem better, and which ones have which kinds of benefits and downsides?

Okay, so there are four methods: we have DreamBooth, Textual Inversion, LoRA, and hypernetworks. There's actually a fifth method out there called aesthetic embeddings. I did a whole video on aesthetic embeddings, and here's the thing about them: they're not good. They don't give you good results. They're bad. So don't use aesthetic embeddings, that would be my suggestion, and because of how little respect I have for them, I have taken them out of this video. Hopefully this won't get clipped in five years when aesthetic embeddings are ruling the world.

All four methods work in very similar ways, but we're going to start with DreamBooth because it's probably the most straightforward. The way DreamBooth works is by actually altering the structure of the model itself. I made another video on what happens when you do this badly, which is kind of fun to check out, I think.

In DreamBooth you have two inputs to be concerned with. The first input is the concept that you want to train; in this case we have this picture of this Corgi. In reality you might have five different pictures of the same Corgi, because you're trying to train the model to recognize that Corgi. On the other hand, you have this sentence, which contains something called a unique identifier; in this case, this "sks" thing is the unique identifier. The whole idea of DreamBooth is that you teach the model to associate the unique identifier with the concept of the Corgi. We're trying to get an association between this and this. That's the plan.

Drilling into this a little more deeply, the way it works is that you take your sentence and convert it into a text embedding, where each word is represented by a vector. A vector is basically just a list of numbers, an array of numbers, and the numbers are usually floating point numbers, zero point something something something. The idea is that each word has its own unique vector, and the vector contains some semantic information about the word. We're not going to delve into embeddings at the moment, but the idea is that this vector contains some information about the word "a", this vector contains something about "photo", and the one associated with "sks" will be quite random and won't have any meaning, because of course "sks" doesn't really have any meaning in the English language. The idea is to associate this concept with that new vector that doesn't really have any meaning.
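To make that concrete, here's a minimal sketch of what "converting a sentence into a text embedding" looks like in code, using the Hugging Face CLIP text encoder (the encoder Stable Diffusion v1 uses). The exact prompt is just illustrative:

```python
from transformers import CLIPTokenizer, CLIPTextModel

# The text encoder used by Stable Diffusion v1.x.
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

# "sks" is the rare token being used as the unique identifier.
tokens = tokenizer("a photo of sks dog", padding="max_length", return_tensors="pt")

# One 768-dimensional vector per token: this is the text embedding
# that gets fed into the diffusion model.
embeddings = text_encoder(tokens.input_ids).last_hidden_state
print(embeddings.shape)  # torch.Size([1, 77, 768])
```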
Okay, and the way we do this is that we pass in the text embedding, and then we apply a whole bunch of noise to the sample image until it's quite noisy, and we also apply a little bit less noise to another copy. So we might first apply 10 steps of noise to the one we're going to pass in, and then we also create another one which has, say, nine steps of noise applied to it, and we're going to try to get the model to take the 10-noise one and output the nine-noise one. That's the idea. We put in the 10-noise one along with the text embedding and we say: Stable Diffusion, denoise this image and return it to what it used to be.

Now, originally, because the model doesn't know what "a photo of sks" is, the model looks at that and goes, "well, I don't know what an sks is", and it'll probably do a pretty bad job. It probably won't output a slightly denoised photo of the Corgi; instead it'll output something crazy. For "sks" it might put in a sign or some text or something. Then you compare what it outputs to the nine-noise image, the one it was supposed to create. You've given it a very noised image, you've got a slightly less noised one, and you expect it to produce the slightly less noised version of the same image. So you compare them and you compute a loss depending on how far apart they are: the loss will be really high if they're very different and don't look like each other at all, and very low if they look exactly alike. Then you perform something called a gradient update, which is a whole can of worms, but basically you punish the model if the loss is high and reward the model if the loss is low.

Having done this a few times, the model starts to get an idea: "okay, when that sks thing comes through, the output has to look a bit like that Corgi they keep passing me, so I'll try to make this image look a bit like the Corgi". And of course you have several different images, so it can't just narrow in on one single image, but eventually you get a model where you can pass in noisy images and this "sks" thing, and it'll try to denoise them and turn them back into a nice clean image of a Corgi. That's what you'll eventually get.

So that's how DreamBooth works. That took a long time to explain, but hopefully we're all on the same page about how it works, and we can use this model to understand how the other techniques work.
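Before moving on, here's a minimal PyTorch sketch of a single DreamBooth-style training step, following the video's ten-steps-of-noise / nine-steps framing. ToyDenoiser, the noise amounts, and the learning rate are all illustrative stand-ins; a real implementation noises latents with a scheduler and trains the Stable Diffusion UNet, usually to predict the added noise rather than the less-noised image:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy stand-in for the diffusion model; the real thing is the SD UNet,
# conditioned on the text embedding via cross-attention.
class ToyDenoiser(nn.Module):
    def __init__(self, channels=3, embed_dim=768):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, 3, padding=1)
        self.cond = nn.Linear(embed_dim, channels)

    def forward(self, noisy_image, text_embedding):
        # Inject the (pooled) text embedding as a per-channel bias.
        bias = self.cond(text_embedding.mean(dim=1))[:, :, None, None]
        return self.conv(noisy_image) + bias

model = ToyDenoiser()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

image = torch.rand(1, 3, 64, 64)          # a training photo of the Corgi
text_embedding = torch.randn(1, 77, 768)  # embedding of "a photo of sks dog"

# The "10 steps of noise" input and the "9 steps" target from the video,
# approximated here as two levels of the same noise sample.
noise = torch.randn_like(image)
noisy_10 = image + 1.0 * noise
target_9 = image + 0.9 * noise

pred = model(noisy_10, text_embedding)    # model tries to denoise one step
loss = F.mse_loss(pred, target_9)         # high if the prediction is far off

loss.backward()                           # gradients w.r.t. ALL model weights
optimizer.step()                          # DreamBooth: the model itself changes
optimizer.zero_grad()
```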
The important thing about DreamBooth is that what you're actually doing is creating a whole new model. You have your initial model, and you make updates and change its internal structure until it understands the concept. Because of this, and for reasons I'll get to later in the video, this is probably the most effective means of training a particular concept into Stable Diffusion. DreamBooth is probably the best, and there is some evidence for that. But it is pretty storage-inefficient, because every time you do DreamBooth you have a whole new model just sitting out there. You train your Corgi, great, now you have two gigabytes of model to deal with, and you have to remember that's the Corgi model. Then maybe you also want to train your cat, and that'll be a cat model, another two gigabytes. Of course it is possible to train multiple concepts into the same model, but sometimes that'll actually make the model a little bit confused. So these big models you have to carry around, these two-gigabyte files you have to share, are the downside of DreamBooth. Other than that, it's probably the most effective method.

Next we have Textual Inversion, and the setup is basically identical; there's not much difference. We still have this "sks" thing, we still have our Corgi, and we're still trying to teach this setup to output the Corgi at the end of it. We still do exactly the same thing with the denoising from ten steps to nine, then comparing and computing a loss. The difference with Textual Inversion, and to me this is the coolest one, is that rather than performing a gradient update that punishes the model when it doesn't get the right output, you update the "sks" vector when it doesn't get the right output. So you feed in the embedding, you feed in the noise, the model does its thing and outputs an image. At the beginning it'll look nothing like the Corgi, and you penalize it for not matching the slightly less noised, nine-step image. With that penalty you perform a gradient update on the vector instead: you don't update the model, you update the vector. Slowly, slowly, the vector gets closer and closer to the visual phenomenon you're looking for, which is this Corgi, and eventually you'll have almost a Corgi vector, which knows exactly what this Corgi looks like and can be used to describe that Corgi perfectly to the model.

The interesting thing is that it doesn't seem like this should work. With DreamBooth, you can tell how it would work: the model is very complex, it can understand thousands and thousands of concepts, and we're just introducing one more, associating this "sks" thing with the Corgi. Easy. That kind of model is very clever; it can do that in its sleep. With Textual Inversion, it's really not obvious that it should be possible to just create a vector so special and so perfect that it tells the model about a Corgi it's never seen before in its life. But it turns out that Textual Inversion does actually work, like, really well. So Textual Inversion is very, very cool. When you're training, you don't update the model at all; you just update this vector. And the big benefit is that the output isn't a whole new two-gigabyte model; instead you end up with a tiny little embedding, which is like 12 kilobytes or something. You can share that embedding all over the internet, and someone can just download it, plug it into their model, and they'll also get that Corgi. Again, I cannot express how cool this is. It means these models have such a nuanced understanding of visual phenomena that just by creating the perfect vector, you can generate arbitrary visual phenomena that still make sense to humans. So Textual Inversion is the coolest, in my opinion, although DreamBooth is probably slightly more effective.
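To make the contrast with DreamBooth concrete, here's a sketch in the same toy style as before: the model's weights are frozen, and the only thing the optimizer touches is the one vector for the new pseudo-word. In the real implementation the vector lives in the text encoder's token-embedding table; the slot index, dimensions, and learning rate here are illustrative:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Frozen stand-in for the diffusion model (the real one is the SD UNet).
class FrozenDenoiser(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(3, 3, 3, padding=1)
        self.cond = nn.Linear(768, 3)

    def forward(self, noisy_image, text_embedding):
        bias = self.cond(text_embedding.mean(dim=1))[:, :, None, None]
        return self.conv(noisy_image) + bias

model = FrozenDenoiser()
for p in model.parameters():
    p.requires_grad_(False)                 # the model itself is never touched

# The vector for the new pseudo-word "sks": the ONLY trainable thing.
sks_vector = torch.randn(768, requires_grad=True)
optimizer = torch.optim.AdamW([sks_vector], lr=5e-3)

prompt_embedding = torch.randn(1, 77, 768)  # vectors of the other words (fixed)
slot = 4                                    # position of "sks" in the prompt (illustrative)

image = torch.rand(1, 3, 64, 64)            # a training photo of the Corgi
noise = torch.randn_like(image)
noisy_10, target_9 = image + 1.0 * noise, image + 0.9 * noise

embedding = prompt_embedding.clone()
embedding[:, slot, :] = sks_vector          # splice the learnable vector in

pred = model(noisy_10, embedding)
loss = F.mse_loss(pred, target_9)
loss.backward()                             # gradient flows only into sks_vector
optimizer.step()                            # the vector inches toward "Corgi-ness"

torch.save(sks_vector, "sks_embedding.pt")  # the entire output: kilobytes, not gigabytes
```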
Next we have LoRA, which stands for Low-Rank Adaptation. It's also actually really cool, but to understand exactly what's going on here, we have to look inside the diffusion model itself. The way neural networks are set up today is as a bunch of consecutive layers; in this case we have three, but often there'll be hundreds. You take your input, which is usually a big matrix of numbers, and pass it to the first layer. The first layer does some sort of calculation on that big matrix and spits out another matrix, which I've drawn small just to make it visually appealing, but it would usually also be huge. That intermediate matrix is passed to the next layer, which does a bunch of calculations and spits out yet another transformed matrix, and finally you pass it to the last layer and get your output. The idea is that as these outputs get passed through the model, the model learns more and more about the structure of what was in the input, and by the end it knows exactly what was in the input and precisely what to give you in the output. That's basically how neural networks work.

So where does LoRA fit into all of this? Well, LoRA is actually trying to solve the DreamBooth problem. Remember, with DreamBooth you have this new concept that you're trying to teach the model, you train and update the model until it learns it, and then you're left with a new model, which is big. That's the problem LoRA is trying to tackle: teaching the model a concept without having to make a whole copy of the model. Because Stable Diffusion isn't a very large model, DreamBooth is actually okay, you can still kind of use it; but originally LoRA was used for large language models, which are immense, with billions and billions of parameters, and it's really not feasible to copy those every time you want to train a new little subtask.

What LoRA does is simply insert new layers into the model. So originally the model looked like this: it had three layers and two intermediate states. Now the model looks different: it has two extra layers in there and four intermediate states. The input goes to the first layer, the layer makes an output, but the output doesn't go directly to the second layer; it goes to the LoRA layer, which outputs the second intermediate state, which then goes to the middle layer, and so on. You basically just insert new layers, and these layers are tiny. By default they're set up so they won't impact the model at all; when LoRA training starts, you have blank layers that output exactly the same thing as their input. Then, as training proceeds, you use the loss to update those inserted layers so they start outputting slightly different values as the information passes through them. This happens to all the LoRA layers, so slowly, slowly, as these layers become more and more opinionated, the intermediate states actually running through the model end up quite different from what they were originally. Once you do this enough, the model with these new layers in it will eventually understand the new concept, just like normal DreamBooth training can achieve. So that's how LoRA works: it's kind of like DreamBooth, but rather than updating the weights that are already there, you add in new weights and update those new weights to achieve exactly the same effect.
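Here's a sketch of the idea in PyTorch. One note: in the LoRA paper itself, the "new weights" are usually added as a small low-rank branch in parallel with an existing layer rather than as a literal extra layer in sequence, but the effect is the same: the original weights stay frozen, the adapter starts as a no-op, and only the tiny new matrices get trained. The dimensions and rank here are illustrative:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wraps a frozen linear layer and adds a trainable low-rank update."""
    def __init__(self, base: nn.Linear, rank: int = 4):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)         # original weights stay frozen
        # Low-rank pair: project down to `rank` dims, then back up.
        self.down = nn.Linear(base.in_features, rank, bias=False)
        self.up = nn.Linear(rank, base.out_features, bias=False)
        nn.init.normal_(self.down.weight, std=0.01)
        nn.init.zeros_(self.up.weight)      # the up-projection starts at zero, so the
                                            # adapter initially adds nothing: a "blank layer"

    def forward(self, x):
        return self.base(x) + self.up(self.down(x))

layer = nn.Linear(768, 768)
lora_layer = LoRALinear(layer, rank=4)

x = torch.randn(1, 768)
assert torch.allclose(lora_layer(x), layer(x))  # starts as a no-op

# Only the tiny low-rank matrices are trained:
trainable = sum(p.numel() for p in lora_layer.parameters() if p.requires_grad)
print(trainable)  # 6144, versus ~590,000 parameters in the full layer
```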
And again, you're using the same gradient update process: you pass in the prompt, you pass in the noisy image, we compare the output to the output we expect, and we get a loss. If the loss is huge, the LoRA layers get updated a lot, they get punished; if the loss is small, a good score, the LoRA layers get rewarded; and slowly, slowly, the LoRA layers learn how to make the model produce the correct output.

Okay, so what's the upshot of LoRA? Why wouldn't we just use DreamBooth? Well, firstly, LoRA training is a lot faster and takes less memory, so that's a big benefit. Also, LoRA layers are very small, so you can pass them around, add them into different models, and share them much more easily than a full model. Usually the LoRA layers end up being about 150 megabytes, compared to two gigabytes for the whole model, so they're a lot more compact. There's also some really good math behind why LoRA layers work so well, but we're going to leave that to the side for now. So that's the benefit of LoRA: as far as we're concerned, it's just quick to train.

And finally we have arrived at hypernetworks. These work basically the same as LoRA. There's no official paper for these hypernetworks, as far as I'm aware, at least not one that talks about them in any context similar to Stable Diffusion, but this is information I got from reading the AUTOMATIC1111 codebase, and this seems to be how it works. Again, you have these inserted layers inside the model, but instead of directly updating and optimizing them, you have a model called the hypernetwork which outputs the inserted layers. Just like the Stable Diffusion model outputs a matrix of numbers that we interpret as an image, the hypernetwork outputs several matrices of numbers that are then used as the inserted layers inside the diffusion model. It's the same idea as LoRA: you're inserting additional layers into the network, and they keep getting better and better at giving you the output you want. But rather than updating the layers directly based on the loss, you're updating a network that learns how to create those layers; the network is the thing that gets updated and becomes cleverer.

That's hypernetworks. They haven't been extensively studied in the context of Stable Diffusion, but my intuition, which could be wrong, is that hypernetworks are just a worse version of LoRA. There's a whole bunch of clever math in LoRA that makes those inserted layers really easy to optimize, easy to train, and just good, and I suspect that doing it indirectly through a hypernetwork, another model that learns to build those layers, is probably a bit less efficient and less good. But it does have the same benefit in the end: you end up with about 150 megabytes' worth of stuff to pass around, as opposed to the two gigabytes of a DreamBooth model, so it's quite similar to LoRA in that respect.
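Since there's no official paper for this variant, take this with a grain of salt, but here's a toy version of the idea as described above: instead of training the inserted layer's weights directly, you train a small network that generates them, and the loss backpropagates through the generated weights into that network. All sizes and the placeholder loss are illustrative:

```python
import torch
import torch.nn as nn

# Toy hypernetwork: a small net whose OUTPUT is the weight matrix of the
# layer inserted into the diffusion model.
class HyperNetwork(nn.Module):
    def __init__(self, z_dim=64, out_features=768, in_features=768):
        super().__init__()
        self.shape = (out_features, in_features)
        self.z = nn.Parameter(torch.randn(z_dim))   # learned "seed" vector
        self.generator = nn.Linear(z_dim, out_features * in_features)

    def forward(self):
        return self.generator(self.z).view(self.shape)

hyper = HyperNetwork()
optimizer = torch.optim.AdamW(hyper.parameters(), lr=1e-4)

x = torch.randn(1, 768)   # an intermediate activation inside the model
weight = hyper()          # generate the inserted layer's weights...
out = x @ weight.T        # ...and apply that layer to the activation

# In real training, the loss on the final image backpropagates through
# `weight` into the hypernetwork's parameters: the generator is what gets
# cleverer, not the inserted layer itself.
loss = out.pow(2).mean()  # placeholder loss, just to show the update
loss.backward()
optimizer.step()
```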
Okay, so that took a long time, and I don't know about you, but my brain is kind of fried and my voice box isn't working anymore. But I get to put a tick here now, because we've seen the underlying information. Now let's get to that juicy, juicy data: we've done the qualitative, now we're going to do the quantitative.

This spreadsheet contains a whole bunch of information, based on my own research, about important facts about each training technique: how much RAM it takes to train, how many minutes it takes to train, et cetera, et cetera. Go have a look at those things. The main takeaways are that they all take about the same amount of training RAM, which was very surprising to me, but apparently they do; they take very different amounts of time to train; and the output sizes vary a lot, with Textual Inversion being really, really tiny compared to the others.

The other thing this spreadsheet contains is a whole chunk of data downloaded from Civitai about how much people like and use different models. Quick shout-out to Civitai, by the way: they're amazing, use Civitai, do it now. It has a whole bunch of models and embeddings and checkpoints and stuff that you can download and try, and a lot of them are really good. So I went ahead and generated a bunch of summary statistics about which models are the most popular, et cetera, and made them available in the spreadsheet. All this data is based on the top 100 models in each category on Civitai. So, for instance, "average downloads" for the top 100 DreamBooth models means: how many times, on average, have they been downloaded by users? And on average it's five thousand times, compared to 427 for hypernetworks, et cetera, et cetera.

The main conclusion we get from this data is that DreamBooth is by far the most popular: it has by far the most downloads, ratings, and favorites. That doesn't necessarily mean DreamBooth is a really good method, but it does mean lots of people are using it, and that's a big benefit in and of itself, because it means there are going to be more resources on it and more models out there for you to use. If I really needed to teach a model a concept right now, I would use DreamBooth just on that basis, because I know there'd be better tutorials and less hunting around in forums for stuff; it'll probably just work pretty well, because people seem to be using it a lot.

When it comes to the actual rating, though, DreamBooth and Textual Inversion seem to be about the same, which is interesting, because when I talk to people I tend to hear that DreamBooth works a bit better than Textual Inversion. According to the Civitai statistics, people seem to like those models about the same. The other two have a much lower average rating, which is quite bad news for hypernetworks in my opinion. Just on the basis of how poor that average rating is and how few downloads there are, I would say that hypernetworks are probably a strategy to avoid, and only use them if you have no other option. It also looks a bit damning for LoRA, but the thing about LoRA is that it's really new; there were actually only 11 LoRA models in the data set I took the statistics from, so these numbers may not be super indicative of how well LoRA can perform.

So, in conclusion: probably just use DreamBooth. It's really popular and people like it a lot. There are only two trade-offs to keep in mind.
Firstly, the large size of the DreamBooth output. If you're really concerned about size, if you want to make tons of embeddings, like hundreds of them, then maybe Textual Inversion is a bit better: again, it's quite popular and very well liked, but the outputs are very small and easy to swap around. So that's one reason you might use Textual Inversion: the flexibility. And there is also a reason to use LoRA, in my opinion, and that's the short training time. I don't know exactly what your workflow is, but I know that when I'm trying to train one of these embeddings, there are often a lot of iterations before I get it right, and when your iterations take 15 minutes as opposed to 45 minutes, that can be a huge benefit.

Okay, so those are the trade-offs. I can now put my nice big tick in that box, and I am done. Tomorrow there's a live stream; you can join it if you want. There are thousands of links in the description, literally millions of links, about different places you can go if you want to learn more about any of the things I've said here today, and there are great people talking about all of them. There's also a Discord if you want to join, and if you have any questions, you can go to the YouTube comments. That's all we have. Good luck! There's a lot of information here, and I hope you do something with it.
Info
Channel: koiboi
Views: 133,436
Id: dVjMiJsuR5o
Length: 21min 33sec (1293 seconds)
Published: Sun Jan 15 2023