🤔 Ok, but what IS ControlNet?

Captions
Okay, so what actually IS ControlNet? You've got these eight models, you've got this and that and the other, but what's actually going on here? There are a lot of videos out there explaining how to use ControlNet, but that's not what we're doing today. This video is about the underlying mechanism, so we're going to answer three questions. Firstly, what is ControlNet? Secondly, how does ControlNet work under the hood? (I love staying under the hood.) And thirdly, why should we care? Is ControlNet actually important from an AI research standpoint? If we answer all three of these questions, then we've done a good job.

So, what is ControlNet? It is a research paper (a very good research paper, I might add) and an associated GitHub repository. The paper is actually very readable, so if you're a paper person, you're going to have a lot of fun reading it. In this research paper, two excellent researchers present a new means of fine-tuning a Stable Diffusion model. In fact, the method could be applied to any number of models, but they applied it to Stable Diffusion. In the paper, the researchers claim that this method is super effective, and to prove the point they trained eight different models to perform specific subtasks related to Stable Diffusion, then spent ages giving us all these nice image comparisons proving the models were really good. So that's the view from a thousand feet: ControlNet is a method for training Stable Diffusion to complete specialized subtasks. ControlNet isn't a single model; it's a model training mechanism.

But how does this training technique work, exactly? The best way to explain it is to start with the motivation. At the moment, Stable Diffusion is a really good, really powerful model, but there are some things it simply can't do that many would like it to do. Imagine you wanted Stable Diffusion to give you an image of Jesus dabbing. You open up whatever program you use for Stable Diffusion, put in the text, and out comes an image like this, which is pretty close. You can definitely see that Stable Diffusion is a very clever program: it knows all these things about compositional elements, it knows what a human looks like, it knows about lighting, and the image looks pretty realistic and good. But the thing we're missing is the actual dabbing. We're not able to be as specific as we would like.

So ideally, we'd want to be able to feed more inputs into Stable Diffusion. We'd want some other model (not Stable Diffusion; Stable Diffusion can't do this) where we can give it text and also some other kind of input. In this case, a depth map would be a really good example. It turns out it's quite easy to extract depth maps at this point; there are very good models to do that. So we could extract a depth map from this image and pass it in, and then, along with the text, this hypothetical model should be able to do better than Stable Diffusion did at generating the image we want.
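As an aside on that extraction step: the video doesn't name a specific depth estimator, but MiDaS is one common choice. A minimal sketch, assuming PyTorch and OpenCV are installed; the image filename is a placeholder:

```python
# Hedged sketch: extract a depth map with MiDaS via torch.hub.
# One common approach; not necessarily what the video used.
import cv2
import torch

midas = torch.hub.load("intel-isl/MiDaS", "DPT_Large")
midas.eval()
transform = torch.hub.load("intel-isl/MiDaS", "transforms").dpt_transform

img = cv2.cvtColor(cv2.imread("dabbing.png"), cv2.COLOR_BGR2RGB)  # placeholder path
with torch.no_grad():
    depth = midas(transform(img))  # relative depth map, shape (1, H', W')
```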
By the way, it's conventional to refer to extra information you provide to a latent diffusion model as "conditioning". This comes from probability: the model is spitting out some kind of image, and if you pass in text, you're conditioning the output on that text, or you can say the output is conditional on the text you give it. So ideally, what we want is a model that can take other forms of conditioning, like, say, a depth map. That's what we're looking for, and again, Stable Diffusion can't do that; its architecture is only built to take in text. Of course, you can do image-to-image: you can take the initial white noise and replace it with an actual image, which helps the process get closer to what you want. But nevertheless, there's only one source of external conditioning, which is the prompt.

We also don't want to have to train this model from scratch, because we already know that Stable Diffusion is really close to what we want. Considering the huge, immense space of possible images, this image is so close to what we want that it's tantalizing us at this stage, so starting over and training a model from scratch seems silly. We want to somehow leverage all the knowledge, the wealth of weights and training that's gone into this model, and put it into a new model that can also take this additional conditioning input. And that's exactly the question the authors raise: what is the most efficient way to train a model to take in additional conditioning inputs? This is exactly what they say in the first sentence of the paper: "We present a neural network structure, ControlNet, to control pretrained large diffusion models to support additional input conditions." That's what this whole paper is about. These researchers had to create a new architecture that leverages the Stable Diffusion weights that already exist to quickly build a model that can take an additional input.

And how do they do this? In a pretty straightforward way, actually. Now, if you read the paper, there's a lot to it: they have a complex architecture where they make a bunch of very opinionated design choices, they invent something called the zero convolution, and they have a bunch of math to back it up. But boiling it down to the biggest thing they do, in the highest, broadest possible sense, this is how ControlNet works. You take your Stable Diffusion model and you freeze it: you prevent any gradient updates to it and make sure it doesn't change at all, so you just have this locked, frozen Stable Diffusion model. In addition, you create an external network (in the paper they refer to this as the "trainable copy"), and you give the external network the responsibility for handling whatever new conditioning input you're giving it, in this case a depth map. You allow this external model to take in a depth map, and you also create a mechanism that lets the external network flow information into the main model. The idea is that the external network learns what the depth map is, how to understand depth maps and use them properly, and then pushes, sort of injects, that information into the main model; we'll get to exactly how that works in a second. In this way, the main Stable Diffusion model is able to take the new information into account without actually updating its weights at all or changing in any way.
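A minimal sketch of that locking step in PyTorch; this is my paraphrase of the idea, not the authors' code, and the `.encoder` attribute is an assumed stand-in for the UNet's down blocks:

```python
# Hedged sketch: freeze the pretrained model, clone part of it as the
# trainable external network. Attribute names here are assumptions.
import copy
import torch.nn as nn

def make_controlnet(unet: nn.Module):
    for p in unet.parameters():
        p.requires_grad_(False)              # lock the frozen model: no gradient updates
    external = copy.deepcopy(unet.encoder)   # the paper's "trainable copy"
    for p in external.parameters():
        p.requires_grad_(True)               # only the copy will ever be trained
    return unet, external
```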
This idea is very closely linked to something called a hypernetwork, a term you may have heard floating around a few times. It seems to be a pretty broad term; people use it to refer to things that are a bit different. But basically, the idea of a hypernetwork is that you have a big, powerful model that is really, really clever, and to make it perform on a subtask or a new task, you create an external network like this one to assist that model, rather than training the model itself. To give a slightly bad analogy: it's like you have a law firm that specializes in tax law, and one of your lawyers is really excellent, the best tax lawyer in the country. Then suddenly the market changes, a bunch of contract law cases come up, and you need to do contract law now, but the tax lawyer doesn't know anything about contract law. Rather than sending your absolute best lawyer back to uni to learn contract law, you hire a contract lawyer, maybe a junior, who handles all the contract law work and passes the information on to the tax lawyer in a really palatable way. She's then able to understand what's going on and make the right calls (I don't know what lawyers do) without having to learn contract law herself.

So that's the idea behind ControlNets. The actual ControlNet itself is this whole thing: the whole thing can be thought of as one model, the external model plus the SD model. For instance, if you go to the Hugging Face page the authors set up, you'll see all these big five-gigabyte models. These are the full ControlNets: a Stable Diffusion 1.5 model plus the external model tacked onto it, which is why each one is a bit larger than a normal Stable Diffusion model. Whereas if you go to, say, this web UI repo, there are a whole bunch of smaller ControlNets. It turns out these are only the external network; that's all they are. The expectation is that you'll download the external network, plug it into, say, your AUTOMATIC1111 web UI, and the web UI will pair that external network with whatever Stable Diffusion network you're already using locally. So "the ControlNet" is really both networks together, but you can also just download these external networks and throw them around. Obviously, if you train the external network with a particular Stable Diffusion model, it might work really badly with a different model it hasn't been trained on, because there's probably quite a tight bond between the external network and the model it was trained with. And obviously, if you try to use a different version of Stable Diffusion, it's not going to work at all, because compatibility is fixed to 1.5, which is what the authors trained on.

So hopefully everyone now has a pretty good conceptual understanding of what ControlNet is and how it works, but I want to get a little more technical, because I find that really helps my personal understanding. So buckle the heck up. A single forward pass of a ControlNet model consists of taking the input image, which is usually noise (but if you're doing image-to-image it could be an actual image), and passing that image to both the external and the frozen model. You also pass the prompt to both models. And finally, you pass the extra conditioning information only to the external network.
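Pausing here, that routing can be summarized in a few lines of pseudo-PyTorch. Every name below is a placeholder of mine, not the paper's API; it just shows who sees what:

```python
# Hedged sketch of the routing in a single forward pass.
# `external_net`, `frozen_sd`, and all argument names are placeholders.
def controlnet_forward(latents, prompt_embedding, conditioning):
    # The external network is the only part that ever sees the depth map.
    injected = external_net(latents, prompt_embedding, conditioning)
    # The frozen model receives the same latents and prompt, plus the
    # activations the external network wants to inject along the way.
    return frozen_sd(latents, prompt_embedding, injected_activations=injected)
```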
The information flows through both models at the same time, and then, about halfway through, the external network starts shoving information into the main model. If we zoom in a little, it ends up looking like this: the external network has a bunch of encoder layers, which are very similar to the encoder layers of the frozen network, and you pass the conditioning information in. Each layer is linked to a particular layer in the internal network, and the layer in the external network passes the knowledge it's gained, the information it's extracted from the conditioning, into one of the internal layers.

Let's zoom in a little more. If you look at the internal workings of a Stable Diffusion model (and by the way, this is a pretty reasonable representation of what a Stable Diffusion model looks like inside), it's got these four encoder layers, a middle layer, and then decoder layers. You start off with a big image, the image gets smaller and smaller and smaller, and then it gets bigger and bigger and bigger on the way out. There are a whole bunch of other connections and things in between these layers, but this is pretty reasonable. If you just look at these three layers, what happens is that the fourth encoder layer passes out a bit of data; to be more specific, we'd call this a tensor, which is basically just a big blob of numbers. That tensor gets passed into the middle layer, which passes out a new tensor, which goes into the first decoder layer, and so on and so forth. The way researchers refer to these blobs is as "activations": activations are what gets output from a layer once you put an input into it. They're just intermediate outputs.

When we add the external network, rather than passing the activations from the fourth encoder directly into the middle layer, we instead add those activations together with the activations that come out of the corresponding layer in the external network. So we have one tensor coming out of the external network and one tensor flowing through the locked network as usual; we add them together (just a plus, a normal mathematical plus operation) and end up with a new tensor, which is a mix, a blend, of these two pieces of information. The idea is that since the external network has taken the new conditioning information into account, that conditioning information is now sitting inside the main model, and the main model can use it to its advantage. Then we pass this blended tensor into the middle layer, the middle layer produces activations, we mix those with the outputs of the last encoder layer in the external network, produce another blend which goes into the next layer, and so on and so forth. If you zoom out, that's kind of what this looks like: arrows of information going from the layers of the external network into the layers of the Stable Diffusion network. The idea is that, with that additional information, the locked network is able to produce an output that takes the conditioning into account. So hopefully this really abstract picture, with this arrow just pointing here, now makes a bit more sense, and it's easier to understand exactly what's going on.
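The paper adds one refinement to this plain addition: each connection passes through a "zero convolution", a 1×1 convolution whose weights and biases start at zero, so at the beginning of training the blend is exactly the frozen model's activations. A sketch of one injection point (my paraphrase, not the authors' code):

```python
import torch
import torch.nn as nn

class ZeroConvInject(nn.Module):
    """One injection point: blend frozen and external activations."""
    def __init__(self, channels: int):
        super().__init__()
        self.zero_conv = nn.Conv2d(channels, channels, kernel_size=1)
        nn.init.zeros_(self.zero_conv.weight)   # all-zero at initialization...
        nn.init.zeros_(self.zero_conv.bias)     # ...so the external branch starts silent

    def forward(self, frozen_act: torch.Tensor, external_act: torch.Tensor) -> torch.Tensor:
        # At step zero this returns frozen_act unchanged; the external
        # network's influence grows as the zero conv's weights are trained.
        return frozen_act + self.zero_conv(external_act)
```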
One more thing I really have to mention about ControlNet: you'll notice that the external network, as depicted here, looks a lot like the first five layers of the regular Stable Diffusion model. The reason is that the paper authors were really keen on leveraging the knowledge and information already in Stable Diffusion as much as possible; they were very much against training things from scratch. You could make any kind of external network: you could train it from scratch, you could do whatever you wanted, you could give it any structure you liked. But the authors said: since we already have all this information in Stable Diffusion, we'll just copy out the first four layers of Stable Diffusion, unfreeze them, and allow them to be trained, and that will be our external network. That's what the paper authors did, and that decision probably had a big impact on their results, but in theory you could have any kind of external network you liked.

And one last thing to make completely clear: you actually do end up training this external network. You get your huge dataset of image-caption pairs and pass them into the model; each time you do, you also generate a depth map for the image. Then you just do the normal Stable Diffusion training routine, where you get it to create an output and penalize it for producing outputs that are wrong, and you do that for as many iterations as you need. But you never update any weights inside the main model; you only update weights inside the external model. So again, I want to emphasize that ControlNet is a training strategy where you do the usual, regular Stable Diffusion training stuff to train this external network instead of the main network.
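That training routine, as described, looks like the standard latent-diffusion noise-prediction loop with the optimizer pointed only at the external network. A hedged sketch; `encode`, `add_noise`, `depth_model`, and `controlnet_unet` are placeholder names of mine, not a real API:

```python
import torch
import torch.nn.functional as F

# Only the external network's parameters go to the optimizer; the frozen
# model's weights are never registered with it, so they can never change.
optimizer = torch.optim.AdamW(external.parameters(), lr=1e-5)

for image, caption in dataloader:
    depth = depth_model(image)                    # conditioning generated per image
    latents = encode(image)                       # VAE-encode to latent space
    noise = torch.randn_like(latents)
    t = torch.randint(0, num_timesteps, (latents.shape[0],))
    noisy_latents = add_noise(latents, noise, t)  # forward diffusion step
    pred = controlnet_unet(noisy_latents, t, caption, depth)
    loss = F.mse_loss(pred, noise)                # penalize wrong noise predictions
    loss.backward()                               # gradients only reach the external net
    optimizer.step()
    optimizer.zero_grad()
```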
So now we can go ahead and do the really fun thing, where we draw a tick in this box and say we now understand how ControlNet works. Again, the authors made a whole bunch of really interesting decisions, a lot of which seem really cool, and to really understand them you'll probably have to read the paper, but this is the very high level. (Okay, last one. I have actually tried to make this video like five times, and I think I'm just going insane; I'm finding it really hard to condense my thoughts properly. But we're almost at the end.)

Why should we care about ControlNet? Is ControlNet actually important? Obviously it's seeing a lot of uptake: if you go to the GitHub repo, it's at something like 12,000 stars. The reason I think ControlNet is interesting, though, is that the authors made a bit of a weird decision. Normally, when you're training a big model to perform a subtask, or a task very similar to the one it already does, you just go ahead and train the model itself, right? ControlNet does this weird thing where it locks the model and creates an external model (hires a contract lawyer, whatever). Usually, what you do is take the really knowledgeable, powerful model, add some new inputs to it, and maybe a few weights so it can handle those inputs a little better, and then proceed with normal training on the new inputs and the new expected outputs. After a while, you get a model that performs your new task. That's the standard paradigm for fine-tuning: take the existing weights and train them a bit until they're new weights. For instance, that's exactly how DreamBooth and LoRA work (I've done a video on those if you want to know what they are): you take your normal Stable Diffusion model, give it a few images of your face, keep training it as usual, and update the weights inside the model until it understands your face. That's the normal thing to do. But the ControlNet guys said: no, no, no, we're going to freeze it and use an external network, and for some reason we think that's going to work better. And it kind of does actually work better.

I say that because there's a really nice comparison out there. A few months ago, Stability AI, along with their Stable Diffusion 2.1 release, released a model called Stable Diffusion 2 Depth, more colloquially referred to as depth-to-image. This model is highly cool. It was trained by the same researchers who did the original latent diffusion models paper. It takes an image, a depth map, and a prompt, and it allows Stable Diffusion to create images that match that depth map, which is obviously really cool. And the method the researchers used when training depth-to-image was exactly the one I described before: all they did was unfreeze the model and add a tiny extra sliver of weights onto it so that it could take this new input into account. In fact, the way they passed the new information in was just as an extra channel. Images have three channels (red, green, blue); they added an extra one, the depth map layer, added a few more weights so Stable Diffusion could process that new channel, and then just kept training as normal. They ended up with this depth-to-image model, which works actually fairly well.
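That "extra sliver of weights" trick can be pictured as widening the UNet's first convolution. The attribute name `conv_in` and the channel counts below are assumptions for illustration (SD's latent input has 4 channels, so adding depth would make 5):

```python
import torch
import torch.nn as nn

# Hedged sketch: widen the input convolution to accept one extra channel.
# `unet.conv_in` and the 4-channel latent input are assumptions here.
old = unet.conv_in                                   # e.g. Conv2d(4, C, ...)
new = nn.Conv2d(old.in_channels + 1, old.out_channels,
                old.kernel_size, old.stride, old.padding)
with torch.no_grad():
    new.weight[:, :old.in_channels] = old.weight     # reuse the pretrained weights
    new.weight[:, old.in_channels:].zero_()          # new depth channel starts at zero
    new.bias.copy_(old.bias)
unet.conv_in = new                                   # then resume normal training
```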
To give themselves a nice point of comparison, the ControlNet guys went ahead and trained exactly the same kind of model, but using the ControlNet strategy, and here are the results they got. This is a little comparison the authors did: these are the images that come out when you use depth-to-image, and these are the images that come out when you use ControlNet. The authors claim their results are a bit better. I'm always really skeptical when authors do that, so I decided to do my own little study. I downloaded this image of a gentleman dabbing from the internet, created a prompt describing what I wanted, and passed that into regular image-to-image Stable Diffusion; I also tried the same thing with depth-to-image, and then exactly the same thing with ControlNet. Here are the results. Judge for yourself which of these are better, but from what I can see, the image-to-image results are pretty bad, because none of them are actually dabbing. The depth-to-image results are a lot better; they're way closer, because the arms seem to be in the right position, but none of them really look like they're dabbing. Whereas with the ControlNet images, every single one actually looks like it's dabbing, and some of them give a really sharp dab; like, that's a good dab, you know what I mean?

ControlNet also seems to have significantly better fine-grained control. If you look at these images, you'll see they all have these sort of perpendicular lines: a little sleeve here, and the shadows on the garment, the shading, the way the fabric lies, follow a pattern very similar to the one in the original image. You can also see these perpendicular lines in his shirt, whereas most of the depth-to-image outputs don't have that. I thought it was very impressive that ControlNet captures these fabric details from just a depth map. This is by no means a conclusive study, but as far as I'm concerned, ControlNet seems to be a bit better.

That's not the important bit, though. The important bit is that depth-to-image was trained for more than 2,000 GPU hours, or at least the authors estimate that's how long it took, and that seems about right, on A100 GPUs, which are really chunky, industrial-grade GPUs; 2,000 hours is nearly three months. ControlNet, by contrast, was trained for less than one week, which is at most 168 hours, on a consumer-grade GPU. That's big. That's an order-of-magnitude improvement in training efficiency, and it's kind of rare in research that you see things that are orders of magnitude better, so that's actually exciting. If we're really lucky, it means this method, this external, hypernetwork-style approach to fine-tuning, really is a lot more efficient, and little guys like you and me, who don't have a house built out of A100s, will actually be able to do really cool stuff.

Of course, now there's the obvious question: why is it better, and why is it so much faster? We can kind of answer the "faster" question: when you're training a ControlNet, you're only training the external network, which is smaller, in this case about half the total size. Because it's smaller, it's faster to train; you're only updating N weights as opposed to 2N weights. Why it's better, though, is harder to answer. The authors say the reason they went with the ControlNet strategy in the first place was "to avoid overfitting when the dataset is small, and to preserve the production-ready quality of large models." Basically, they're worried that if you fine-tune the big model on a small dataset, or I suppose a dataset that's kind of complex, you risk ruining the weights the model began with. I guess the theory is that if you give a completely new input to this very specialized model, it's going to wreak a bit of havoc for a while: if you expect that really good model to score well on the new input immediately, and you start punishing it for not performing well immediately, that might cause issues, and it might mean you have to train for a long time before you get good results. Whereas if you lock the model and say, no, the weights in this model are good, don't touch them, do all the training in a different network, then the training process goes faster: you're sort of locked into this very good baseline, and you can't stray too far from the effective regime you're locked into. Something along those lines. And the nice thing with ControlNet, which the authors point out, is that once you've finished training and you're getting good outputs using the external network, you can go ahead and unlock the main network, do a bit more training, and allow the main network to take some of this information into account as well. In that case you don't really risk doing damage to the model, because it's already so close to good outputs anyway.
But that's pure conjecture, so forget I said any of that: we don't actually know why it trains so much faster to good results, or why the images seem to be better. Those things we have yet to determine. Okay, looks like we're done. Happy times. Let us summarize. What is ControlNet? It's a new way of doing fine-tuning: teaching powerful models to perform subtasks. How does it work? You lock the original model, create an external model, train that external model to understand the new inputs, and flow information from the external model into the main model. And why should we care? Because it seems to give better results and take an order of magnitude less time to train than the orthodox approach.
Info
Channel: koiboi
Views: 35,978
Id: fhIGt7QGg4w
Length: 25min 31sec (1531 seconds)
Published: Wed Mar 08 2023