Creating Embeddings and Concept Models with Invoke Training - Textual Inversion & LoRAs

Captions
Hey everybody, today we're going to talk a little bit about training your own custom models. This will specifically focus on the open-source scripts that we make available to everyone for free to run locally on your machine. We'll talk about some of the high-level concepts that you'll need to understand, and then we'll dive in with examples.

Let's break down the different ways that you can train things using the Invoke training scripts. Currently we make it easy to train two different types of tools that you can use in the generation process: embeddings and concept models. We'll break down what these mean, but you'll notice that the names underneath each of these likely reflect how they might be referred to in other tools, which really just focus on the technique that's used to train them rather than what they are: textual inversion is used to train embeddings, and LoRA training and DoRA training are used to train concept models.

Now, I'm going to get a little technical here, but then I'll explain it in a more accessible way. Don't get intimidated; I promise it'll be easier to understand than you think. In the generation process we have two key aspects that we have a lot of control over: the prompt and text encoding, and the model weights and their interpretation of our prompt. When our prompt is passed into the system, it goes through a process called tokenization and text encoding. Tokenization takes our prompt and breaks it down into smaller parts and pieces that can be mathematically analyzed by the system. This specific process is something that may change over time, but as it stands right now, this is the common way that you'll be managing and manipulating prompt information to pass into a model. Now, while that breakdown is something we can interpret, what is actually passed into the model looks a little more like a list of numbers: each token has an ID, and the model looks at those numbers, sees where they sit in that list, and has developed an understanding of those relationships during its training. So when we think about what's happening in the generation process, we have the conditioning that's generated by our prompt being passed into a very complicated set of model weights that determine the relationship between each of those numbers and the visual content they relate to.
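To make the tokenization step above concrete, here's a minimal sketch using the CLIP tokenizer that Stable Diffusion's text encoders are built on (this is purely illustrative and not part of the Invoke training scripts; the exact tokenizer, token splits, and IDs vary by model):

```python
# Requires the `transformers` package; downloads the standard CLIP tokenizer.
from transformers import CLIPTokenizer

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")

prompt = "a neo watercolor painting of a library"
tokens = tokenizer.tokenize(prompt)        # human-readable sub-word pieces
token_ids = tokenizer(prompt).input_ids    # the numeric IDs the model actually sees

print(tokens)     # sub-word strings, e.g. ['a</w>', 'neo</w>', ...]
print(token_ids)  # integers, including special start/end tokens
```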
Now I know that's technical, so let's use an analogy to explain how this might be more easily understood. You can think of the prompt and text encoding as creating a set of light sources that are passed through a lens. The shape of that lens dictates how each light is refracted and, ultimately, what the resulting picture is going to look like. This is an analogy; it's simplistic and doesn't capture all of the nuance of what's happening in the generation process, but it's a good mental model because it helps you understand that there are two parts to the process: the prompt and text encoding, and then the shape of the model itself and how those weights have been structured to interpret what's coming in. This is why, if you use the same prompt on a different model, you're going to get different outputs, and it shows how these two parts that you can control relate to one another.

One thing that I like to call out is that the model weights really define what is possible to be generated. They serve as the world that has been seen before, so the model may know common terms and references, but it likely isn't going to know very specific things. For example, if you are developing a new character, or you have a new vehicle and you want it to be understood by the model, creating an embedding that lets you manipulate the prompt layer more effectively is still not going to be able to inject new content into the model. To do that, you would need to actually restructure the model itself so that it can interpret the thing that you want to generate. As a concrete example, let's say I was making a futuristic sci-fi game, and there was this really cool drone vehicle that we had come up with: it was triangular in shape, it had these big rotors that propelled it up into the air, it had lasers, all this cool stuff. If I wanted to be able to prompt for that specifically, I would really need to create a lot of that understanding in the model itself. I might be able to get a prompt that somewhat captures it, but it won't be reliable or consistent.

So to recap: embeddings allow us to manipulate our prompt layer more efficiently. That means we can consolidate a lot of what we're trying to prompt for by creating a new tool, called an embedding, that lets us prompt more efficiently for the thing we're thinking about, but it relies on the content already existing in the model. What concept models do is inject into, or extend, the base model that we're working with to include new information and concepts. Understanding this is key in designing effective tools for us to use in the generation process. One advanced technique that we won't dive into too deeply in this video is called pivotal tuning; what that does is allow you to train a new embedding that works with the specific concept being trained in a concept model, effectively creating the entire structure here to use as a tool.

Let's touch a little bit more on how we would create a dataset for each of these, and then we'll dig into the interface of the open-source script. When we're creating an embedding, we're creating a dataset of images and training a tool that gives us a really strong conditioning reference to either a subject or a style, which we can then use in a prompt. So in this case, if I really wanted to be able to prompt for this specific concept very efficiently, I could train it as a new token and then use it in a prompt with any model. It would probably rely on that model having a common understanding of, you know, watercolor and cats and fire hydrants, but that would be my effective way of capturing this as a tool. However, when I'm creating a concept model, I actually need to go in and caption each of the images, because I'm redefining how prompts are interpreted by the model at a foundational level. Typically, when you're creating a concept model, you want to understand whether the concept you're training is a new style or a new subject, because that will define how you should caption your images as well as how you should structure your dataset; see the illustrative captions below. In this example, I'm training a new style called the "neo watercolor" style into the model, and what I'm trying to do is provide variations of the subject matter so that it has enough variation to see what a neo watercolor style might look like in a bunch of different contexts.

Now, as far as dataset size: in all of these cases, more data is going to be better, but typically for a textual inversion a relatively small dataset is fine. Ten to twenty images is going to work well enough, because again, it's just crafting the prompt information into a single token.
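To make the style-versus-subject captioning point concrete, here are some purely illustrative captions (hypothetical examples, not the actual dataset used in the video):

```python
# Style training: vary the subject, keep the style wording consistent,
# so what stays the same across captions is the style being learned.
style_captions = [
    "a tall building, architectural drawing, in a neo watercolor style",
    "a cat next to a fire hydrant, in a neo watercolor style",
    "a sailboat at sunset, in a neo watercolor style",
]

# Subject training: vary the style and context, keep the subject wording
# consistent, so what stays the same across captions is the subject.
subject_captions = [
    "a photo of the tri-rotor drone flying over a city",
    "a pencil sketch of the tri-rotor drone on a workbench",
    "concept art of the tri-rotor drone in a neon-lit hangar",
]
```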
When you're training a concept model, on the other hand, more data is better; the more variation you can show it to really inject new understanding into the model, the better. Twenty to forty images is probably the lowest that you'd want to go, and higher-quality models will use much more; you might see something like 100 to 200 images for capturing a more useful and practical tool for your toolkit.

Now that we've covered the basics, let's dive into the UI. When we start the Invoke training app, we'll realize pretty quickly that it's a no-frills way of getting started with training. It's a pretty simple application designed to help you get in, prepare a dataset, and train a model. We mentioned earlier that you're going to want to caption certain datasets for the purposes of training a LoRA or concept model; that's what the dataset tab is particularly useful for, but you'll want to use it to organize your images even if you aren't going to caption them. So let's get started with that. On this page we can either load up an existing dataset, which is organized as a JSONL file, or we can create a new one. In this case I'll create a new one. I've got a folder that I've organized a couple of images in, so I'm going to paste that in and type in the new JSONL path; I'll just call it neo_watercolor.jsonl. This particular dataset is just a handful of generated images that all use this kind of watercolory style. I created these using Invoke, and I'm now going to train an embedding using them. Because I'm training an embedding rather than a LoRA, I don't really need to caption them; I can just use the automated tooling in the training script itself. If I did want to caption them, all I would do is type in my caption here — in this case, say, "a building, tall, architectural drawing, in a neo watercolor style" — and obviously I can include as much detail about the subject as I think is useful. Then, when I'm ready, I can just hit save and go to next, and that will actually save it into the JSONL file. You can see it's already been saved, and if I come back to it, that caption is there for me to use. Again, like I mentioned, I'm not going to caption everything, but you get the point; hopefully this gives you enough context for how you'd use this. It's a pretty simple tool that just helps you create the file. You will want to note this specific path, because this is what we're going to use in our training script to reference the dataset that we're going to train on.
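For reference, a JSONL dataset is just one JSON object per line, each pairing an image path with its caption. Here's a rough sketch of building one by hand (the "image"/"text" key names are an assumption — check the file the dataset tool actually writes for the real field names):

```python
import json

# Hypothetical entries; paths and captions are illustrative only.
examples = [
    {"image": "images/building.png", "text": "a tall building, architectural drawing, in a neo watercolor style"},
    {"image": "images/cat.png", "text": "a cat next to a fire hydrant, in a neo watercolor style"},
]

with open("neo_watercolor.jsonl", "w") as f:
    for example in examples:
        f.write(json.dumps(example) + "\n")   # one JSON object per line
```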
So let's go back to the home tab and jump into the training. At the top of our training configuration panel we have tabs for training an SD LoRA, an SDXL LoRA, an SD textual inversion embedding, an SDXL textual inversion embedding, and the beta of our LoRA and textual inversion pivotal tuning; again, that trains both a LoRA and an embedding at the same time. In this case we're just going to train a textual inversion embedding for SDXL, and we'll run through how to update our configs. The reference config here is just the sample default; if you go through this process and save a config file that you like, you can always copy that path in and reload it, and it will update all of your settings. But we'll start from the default and update it, so that we can go through exactly what you would change the first time you're using the training script. Let's go section by section and explain what we're looking at.

Our basic configs allow us to set what our base training model is, as well as where our training outputs are going to go. They also include settings like how long of a training run we're doing, how often it saves a copy of the model for us to evaluate, and how often it validates the quality of that model's outputs by generating an image. For the base model you can leave the defaults here; it's just the SDXL base model with the fixed VAE. But if you don't want to download another copy of the SDXL base, or if you want to use a different model that you've downloaded, you can update the model path here. If you have models that you've already loaded into Invoke, you can simply copy that folder and paste the path, without any quotation marks, into this model field. You can do the same for the VAE; again, this is primarily just to avoid downloading another copy. You can either use the Hugging Face Hub model name, a path to a different local model, or leave it as the default.

For our training outputs, we've got an output directory; I'm going to update this to match our training name, which is neo watercolor. I'm going to leave the other settings as the defaults; they just allow us to extend the length of the training run or change how often a model is being saved and validated. You do have the option to choose training by epochs if you prefer. Another way to understand that: each step taken in the training process processes a single image, and an epoch is when your entire dataset has been processed one full time. So in the case of our neo watercolor dataset, I have about 25 or 26 images, which means 26 steps is one epoch. If we leave it as is, we'll get about 80 epochs of training, which might be pretty high, but that's okay; we'll monitor the validations and keep an eye on it as it trains. We can also update our seed if we want to. Setting a seed allows the training process to be deterministic: if we train something that we really like and want to retrain it on the same dataset, using the same seed will get us to the same resulting model.
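To make the step/epoch relationship above concrete, here's the arithmetic as a tiny sketch (the total-step count is a hypothetical value for illustration, not the actual default):

```python
import math

num_images = 26          # roughly the size of the neo watercolor dataset
batch_size = 1           # each step processes a single image
max_train_steps = 2000   # hypothetical total-step setting

steps_per_epoch = math.ceil(num_images / batch_size)   # 26 steps = 1 epoch
epochs = max_train_steps / steps_per_epoch              # roughly 77 epochs of training

print(steps_per_epoch, round(epochs))
```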
Now we'll scroll down to the data configs, and this is where we'll talk about some of the data loading options. Our data source can be a number of things. It might be an HF Hub image-caption dataset, so if you're pointing to Hugging Face, you might be able to find an image-caption dataset there that you can use. The one we just created in our dataset tool was a JSONL dataset; that's the one I'm going to select, but I'll go through the rest of these options just to explain them. A directory dataset is when we have a directory of images with captions stored in text files; the text files have to have the same name as the image, and if we have a folder with that structure, we can import it that way. And if we just have a dataset of images without any captions, we can use the image directory dataset type, and that will simply pull in all of the images it finds in the folder. We could use that for our neo watercolor dataset, but because I did create a JSONL dataset, I'm going to use that; I'll copy that path in and use it here. I've got enough memory to use the keep-in-memory option, but if you have a smaller GPU, you might want to turn that off.

In the data loading section we have some options for our captions. We can choose to either use a caption preset or create our own caption template. This will replace all of the captions for whatever dataset we have with a custom template, and we can use curly braces to indicate where we want our placeholder token to be injected; I'll explain that in just a second if you're not familiar with how textual inversion training works. In this case I'm going to use the style caption preset, primarily because that's what I'm looking to get out of this training. I'm not going to select "keep original captions," but that is what I would select if I wanted to go in and manually update my captions, have them crafted by hand using the dataset tool, and then have this caption preset or template prepended automatically when I'm running the training. Another tool that can help ensure there's a little bit of diversity in the captions being processed by the system as it goes from epoch to epoch is the shuffle caption delimiter. If you have a caption that is separated by commas or periods, you can type that delimiter into the caption shuffler, and it will reorganize the caption randomly based on that delimiter; there's a small sketch of the idea below. What that can do is help make the model's understanding of individual concepts more resilient, because it becomes a little less dependent on where in the prompt each piece appears. We'll leave that empty for now.

I'm going to leave my resolution at 1024, because that's what an SDXL model typically wants. I'll leave my data loading workers at four; you can decrease this if your system doesn't have enough resources. We can choose to either randomly crop to the target resolution or center crop; depending on what you're trying to do and what your objective is, you might want this on or off. You can also set it to randomly flip the image; this goes back to the shuffle caption piece — flipping the image can sometimes provide a little more diversity and help the system understand different orientations of the content you're using. In this case, because I have a small dataset, I figure I might as well turn that on. Aspect ratio bucketing is really for use when you have a large amount of data in different aspect ratios; it's particularly useful when you're training a LoRA. You would turn on aspect ratio bucketing and update your start and end dimensions to be around whatever target resolution you're going for. In this case, if we were doing SDXL at 1024, I might use 768 and 1280. I don't think I've got too many images that need to be bucketed, and I'm just doing a textual inversion, so I'm going to leave that off for now.
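To picture what the shuffle-caption delimiter does, here's a minimal sketch of the idea (not the actual invoke-training implementation):

```python
import random

def shuffle_caption(caption: str, delimiter: str = ",") -> str:
    """Randomly reorder the delimiter-separated chunks of a caption so the
    model sees concepts in different prompt positions from epoch to epoch."""
    chunks = [chunk.strip() for chunk in caption.split(delimiter) if chunk.strip()]
    random.shuffle(chunks)
    return (delimiter + " ").join(chunks)

print(shuffle_caption("a tall building, architectural drawing, neo watercolor style"))
# e.g. "neo watercolor style, a tall building, architectural drawing"
```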
When we get down into the textual inversion configurations, we've got a couple of things we need to update. I won't go too far into the advanced options; I recommend leaving the things you don't understand alone. The number of TI vectors we'll just leave as the default. There are two things here that you want to update for an embedding: the placeholder token and the initializer token. The placeholder token is a special word to associate the learned content with. You want this to be a very unique token; you don't want to use a normal word. It isn't what you'll end up typing in Invoke — there we just use the name of the embedding — so this should be a unique combination that is unlikely to already be used or understood by the system. I'm just going to use "neo watercolor" with a 3 added to it, just to really emphasize that this is unique. My initializer token, or phrase, is a word or phrase that is a good starting point in the grand understanding of what this new word means. So if I want, for example, to train in my new interpretation of watercolor — my neo watercolor — a reasonable place to start might be "watercolor." Now, this might not work as a single token, because as we saw earlier, "watercolor" is actually multiple tokens, so we'll go ahead and use it as the initializer phrase.

When we get down into the optimizer configurations, this is really just controlling how the system updates its understanding of what it's training on over the course of the training process. The optimizer is effectively deciding how much its understanding of the concept should change at different parts of the learning process, and the learning rate is the initial setting for how aggressively it should learn new content. A higher learning rate means a more aggressive learning process, which means it could learn the concept more quickly, but it will likely have a lot of volatility, because it might make errors along the way. A lower learning rate might not learn the concept as quickly, but it won't have catastrophic results if it makes a mistake. Too high a learning rate can make it really hard to land on the proper understanding, because the model is constantly over-indexing on mistaken information, and too low a learning rate means you're just going to have to train forever to get to anything meaningful or substantive. A great place to start is always the default: try it, see where it's at, see if it's able to pick up what it needs to, and adjust from there. The advanced settings change other hyperparameters of the learning process; I'm not going to go into each of these, because we would be spending a lot of time on math. The general takeaway is that unless you understand them, or are following a guide that tells you to update them, just leave them at the defaults; you can typically get a good training run completed without touching these advanced parameters.

The speed and memory configuration section allows us to tune the training process to leverage more resources if we have them, or to use techniques that reduce the amount of memory needed for training at the cost of speed. I'm going to leave these as the defaults, but if you need to, you can increase your gradient accumulation steps to reduce your VRAM requirements, and you can also use some of the other options here by following the instructions. Our general training configurations allow us to update things like the learning rate scheduler — again, this just determines how the learning rate is adjusted over the course of the training process — and we can introduce warm-up steps, which can ease the training process into updating the weights. We also have some advanced training configurations which, as I mentioned before, you may not want to touch unless you know what you're doing. If you do have the VRAM for it, the batch size configuration is here; higher values will increase speed at the expense of more resources. We'll always want to update our validation prompts to match our placeholder token, or the trigger words if we've captioned a LoRA dataset with them. In this case I'm going to use "a neo watercolor painting of the beach" and "a neo watercolor painting of a library." When I'm satisfied with my configuration, I can generate it, which shows me all of the configuration variables. I can save that by hitting the save button at the top right, which will let me reload it in the future if I like my settings and want to reuse them. Then, when I'm ready, I can hit start training, and I'll check back in a little bit once we've got a new embedding.
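As a side note on the gradient accumulation option mentioned above, here's the general idea as a toy PyTorch sketch (not invoke-training's actual code): several small batches are run, and the optimizer only steps after their gradients have been accumulated, trading training speed for lower VRAM use.

```python
import torch

model = torch.nn.Linear(8, 1)                      # stand-in for the real model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
data = [(torch.randn(2, 8), torch.randn(2, 1)) for _ in range(8)]  # toy mini-batches

accumulation_steps = 4  # effective batch size = 2 * 4 = 8

for i, (x, y) in enumerate(data):
    loss = torch.nn.functional.mse_loss(model(x), y)
    (loss / accumulation_steps).backward()          # gradients add up across mini-batches
    if (i + 1) % accumulation_steps == 0:
        optimizer.step()                            # one weight update per 4 mini-batches
        optimizer.zero_grad()
```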
So the training's done, and where we're at right now is inside the output folder that I had created and set up in the configuration file. Within the training run we see three folders: checkpoints, logs, and validation. Checkpoints are the actual files that we can then import into Invoke; logs are just for evaluating the training run itself; and the validation folder is where we can see our validation images. We'll open up the validation folder so we can see what we're looking at. We had set up the training run to output a validation image every 200 steps, and an embedding is saved at each of those points as well. What that means is that if we look through these steps, we should see the progression of training from start to finish, and how the embedding changed the prompt output. We can figure out which of these steps is most useful to us and then import that one directly into Invoke.

So let's take a look: I'm going to open up tabs for step 400, step 1,000, and step 1,600. Basically, 400 is the early part of the training, 1,600 is the late part, and we've got one in the middle to look at as well. This is 400; we'll take a look at our first prompt. We've got a watercolor image here — again, this is step 400, and this is our library prompt — and these are the three images that were generated. Let's move to the step 1,000 version and take a look at those. Okay, so we've got more of the interior here on all three of these images, and again, these validation images are generated with the exact same seed, so we're looking at a pretty significant change: the third image has gone from the exterior of the library to an interior, and I think I like the interior a little bit better. That was step 1,000; let's take a look at 1,600. We can see that this one has picked up a lot more of the white outline and a centrally placed watercolor image rather than a full-bleed one, and this is that interior of the library. So if I'm looking at this, I like somewhere between 1,000 and 1,200 or 1,600. Let's check out what 1,200 looks like — this is 1,200 steps, and I think I might like this a little bit better than 1,600; here's another of our 1,200 outputs. Let's also take a peek at 1,400 to see if that one's better or not. Looking at it, I think 1,200 is the one I like the most.

So what I can do now, back inside my output folder, is go to checkpoints, and you'll see that we have the safetensors outputs for each of those steps. Again, these are embedding files, so they're relatively small; if you were training a LoRA, they would be a lot bigger, and you'd actually want to make sure you delete anything you don't like. In this case I probably won't worry about deleting too many of these; it's fine to have them around. What I can do is copy the path here and import it directly into Invoke as an embedding. Now, I might want to rename it; I can either do that after I import it, in the model manager, or I can do it right here.
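If you're curious what's inside one of those embedding checkpoints, here's a small sketch for peeking at it (requires the `safetensors` package; the file name is hypothetical, and the tensor key names depend on the trainer — for an SDXL textual inversion you'd typically expect one small tensor per text encoder):

```python
from safetensors.torch import load_file

state = load_file("checkpoints/neo_watercolor-1200.safetensors")  # hypothetical path
for name, tensor in state.items():
    print(name, tuple(tensor.shape))  # a few small tensors of the embedding dimension
```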
I'm just going to rename the file to what I want the embedding name to be — that is, what I'll type when I'm prompting — so we'll call it "neo watercolor," and when I import this into Invoke I should be able to just use that as part of my prompt. So let's dive into that.

Okay, so now we're in Invoke. I'm going to copy the path to this embedding and import it, and now when I open up my embeddings I have the neo watercolor embedding to choose. I'm going to run a couple of quick prompts here: we'll do "a neo watercolor painting of a sea slug," just for fun, and then try another one just to see how it works across domains. Very nice. Now let's compare and see how this has changed the model's definition of watercolor — remember, we used "watercolor" as our initializer phrase — so what we want to do is measure the difference: how useful is this versus just prompting "watercolor"? Does it give a different kind of output? So I fixed my seed, and yeah, I think we definitely get more of the light-hearted feel with the neo watercolor version, which is really what I was going for. The plain watercolor result is a little too detailed, a little too realistic; the neo watercolor one is a lot more artsy, and that was really what I was going for with the dataset I pulled in. So even with the same seed, we've really changed that definition, and we've ended up with a better prompt word that targets exactly what we're looking to use.

We'll go ahead and wrap this up as a basics video, but what I'm going to ask you to do, if you're interested in learning more and training something a little more advanced like a LoRA or a pivotal-tuned LoRA, is to go ahead and like, subscribe, and leave a comment. Share a little bit more about what you're looking to do, the project you're trying to run, and what you're struggling with in going through the user interface. That type of feedback is going to be really helpful for us in understanding where we need to evolve the interface, and it'll also really help us ensure that we're building content that's useful for you. I'll also share that we're going to continue to evolve the training scripts, as well as the tools we make available to professionals for training, in the near future, and this type of feedback gives us a lot of really useful insight into how we approach that. So keep an eye out, shoot us a note here or on Discord, and we can talk a little bit more. We'll cover more about training things and improving the quality of your training in the future. Take care.
Info
Channel: Invoke
Views: 2,436
Keywords: stable diffusion, invokeai, ai art
Id: OZIz2vvtlM4
Length: 30min 41sec (1841 seconds)
Published: Sat Mar 30 2024