How I trained a simple Text to Image Diffusion Model on my laptop from scratch!

Video Statistics and Information

Captions
Today we will explore the core ideas behind conditional latent diffusion models, one of the most popular generative AI algorithms used to create AI art. In this video I'm going to tell you everything you need to know, from concepts to code to math to some random little nuggets of awesomeness, in 15 simple-to-understand points that will step by step build your understanding of diffusion models. To see the concepts in action, we are also going to train our very own, very simple, very small text-to-image diffusion model on my laptop to generate human faces. Keep in mind that the models I'm training and the generated images I'll be showing are not going to blow your mind: they were all trained in an hour or two on my laptop, on a very small dataset of about 200,000 images, so there's no point comparing them to something like Midjourney, which is probably trained on billions of images with a $100,000 budget. So, Midjourney people, relax, I'm not trying to compete with you. My goal with these models was just to learn the implementation myself and to show some code and practical demonstrations as I explain the concepts. So feel free to have a chuckle if you see a really nasty and horrific AI-generated human face, or just marvel that my weak little neural net somehow figured out that a human face contains two eyes and a nose, that men tend to have facial hair, that eyeglasses are a real thing in the human world, and that when people smile their mouth opens up and reveals white teeth underneath. Either way, I hope you enjoy the video. Welcome to Neural Breakdown, let's start diffusing with point one of 15.

Almost all generative algorithms, like variational autoencoders (VAEs), generative adversarial networks (GANs), and yes, diffusion as well, input some random noise and convert it into an image. If you're simply generating images without any prompts or conditions, it is called unconditional image generation: the algorithm just inputs the noise and outputs any image according to that noise. When you want to generate specific images, however, we train with prompts and label information, and these are called conditional generative models. In this video we're going to start with unconditional diffusion models and then slowly ramp up to conditional, prompt-driven diffusion modeling.

Diffusion models generate images by taking a fully noisy image and iteratively removing noise from it through a neural network until we get a clean image, so technically speaking, diffusion neural networks are really noise-removal, or denoising, neural networks. To see how these denoising networks are trained, we first have to understand the concepts of forward diffusion and reverse diffusion. Imagine adding some milk to water: the milk slowly diffuses, or spreads, into the water as time goes by. At frame t = 0 you had clear water, but as time progresses the liquid gradually turns and becomes fully white. The same concept can be applied to images: you take an image and keep adding small amounts of Gaussian noise to it, gradually destroying its perceptual features, and after some time the image just disappears and all that remains is noise. This act of adding noise to gradually convert any image into complete noise over a series of time steps is called forward diffusion. The DDPM scheduler is a type of forward diffusion algorithm, and it's the one we're going to use for our project. There's an implementation of it in the diffusers Python library by Hugging Face, and as you can see, to initialize this module we need to pass the number of steps along with the beta_start and beta_end parameters (I'll explain the beta values in a bit). The number of steps determines the total number of steps the scheduler takes to go from the clean image to the fully noised image. To actually add noise to an input image, we generate a random noise tensor from a known distribution, say a Gaussian, and then pass the input image, the generated noise, and the time step to the add_noise function, which returns the noisy image at that time step.
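To make that concrete, here is a minimal sketch of what this looks like with the diffusers library; it is not the exact code from the video, and the beta values shown are the common DDPM defaults rather than necessarily the ones used here.

```python
import torch
from diffusers import DDPMScheduler

# Forward-diffusion scheduler: number of steps plus the beta_start / beta_end
# parameters of the variance schedule.
scheduler = DDPMScheduler(num_train_timesteps=1000, beta_start=1e-4, beta_end=0.02)

# A batch of (already normalized) images and a matching Gaussian noise tensor.
clean_images = torch.randn(8, 3, 32, 32)   # stand-in for real image batches
noise = torch.randn_like(clean_images)     # unit Gaussian noise

# Pick a random time step for each image in the batch.
timesteps = torch.randint(0, scheduler.config.num_train_timesteps, (clean_images.shape[0],))

# add_noise returns the noisy image at the requested time step in one shot.
noisy_images = scheduler.add_noise(clean_images, noise, timesteps)
```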
Internally, one single step of noise is added following the formula x_t = √(1 − β_t)·x_{t−1} + √(β_t)·ε. It may look a little daunting, but it's actually pretty straightforward. Here x_t is the image at time step t, x_{t−1} is the image at the previous time step, and ε is our randomly generated unit Gaussian noise. Because it is a unit Gaussian its variance is one, and when we multiply it by the √(β_t) term its variance becomes β_t; effectively, this whole term controls the amount of Gaussian noise we are adding to the image. We also scale down x_{t−1} by √(1 − β_t) so that the variance of x_t does not grow as we add noise; it's basically a balancing term for the noise we added through the √(β_t)·ε term. Also note that the beta value itself depends on t. The authors call this the variance schedule, and it ramps up at higher values of t, which means that as the time step progresses we incrementally add more and more noise to the image.

This formula tells us how to go from one image to the noisy image at the next time step, but that can get really inefficient if we, say, wanted the noisy image at time step 100. Instead of repeating the step 100 times, there are some mathematical tricks that let us reformulate it and write x_t directly as a function of x_0 rather than x_{t−1}: x_t = √(ᾱ_t)·x_0 + √(1 − ᾱ_t)·ε, where ᾱ_t is the cumulative product of α_t = 1 − β_t and x_0 is the image at time 0, meaning our original unperturbed image. This lets us jump to any time step of the forward diffusion process instead of following it step by step, making the algorithm much faster. I'll leave a link in the description that explains the math behind this.

At this point you may be thinking: dude, you just showed how to corrupt a perfectly good image into pure noise, who cares about that? To train a generative model we need to do the exact reverse, converting noise into images, or the white liquid back into water. Yes, and that is what reverse diffusion is all about. The entire point of forward diffusion is to generate a dataset that we can then use to train a neural network to learn how to reverse this very process. Coming back to our forward diffusion equation, the network now inputs the image x_t at time t along with the value of the time step t, and it is trained to output the noise ε_t. We then scale and subtract this predicted noise to get back the image at the previous time step: x_{t−1} = (1/√α_t)·(x_t − (β_t/√(1 − ᾱ_t))·ε_θ(x_t, t)), with the full DDPM sampler also adding a small amount of fresh noise at every step except the last. Here ε_θ is the network's prediction, and β_t and ᾱ_t come from the variance schedule parameters discussed in the forward diffusion section. Scaling the predicted noise before subtracting it from the noisy image adjusts the contribution of the predicted noise to the image at the current time step. In general, the neural network architecture we use to train these noise-removal models is the U-Net, and I will cover the U-Net in the next point.
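For readers who want to see those two formulas as code, here is a small PyTorch sketch of the closed-form forward jump and of a single reverse step, assuming a simple linear variance schedule; the variable names and the choice of σ_t = √β_t for the fresh noise are my own, not taken from the video.

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)      # linear variance schedule beta_t
alphas = 1.0 - betas                       # alpha_t = 1 - beta_t
alpha_bars = torch.cumprod(alphas, dim=0)  # cumulative product, alpha-bar_t

def forward_diffuse(x0, t, eps):
    """Jump straight from the clean image x0 to the noisy image x_t."""
    a_bar = alpha_bars[t]
    return a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * eps

def reverse_step(xt, t, eps_pred):
    """One DDPM denoising step: estimate x_{t-1} from x_t and the predicted noise."""
    mean = (xt - betas[t] / (1 - alpha_bars[t]).sqrt() * eps_pred) / alphas[t].sqrt()
    if t > 0:  # add a little fresh noise on every step except the last
        mean = mean + betas[t].sqrt() * torch.randn_like(xt)
    return mean
```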
For now, just imagine it as a black-box architecture that inputs a noisy image and predicts the noise that must be removed from that image. Using our DDPM noise-removal formula, we remove that noise to get the new image x_{t−1}, and we run this algorithm in a loop for N steps until we slowly denoise our way back to a clean image. With that understanding of how reverse diffusion works, let's talk about the U-Net architecture in more detail. But before that, if you're enjoying this video, the best and easiest way to support the channel is to hit the like button right now and subscribe. You can also support us through our Patreon page, where members get some nice perks, so please check out the link below to learn more. Back to the video.

A U-Net is a type of fully convolutional neural network that takes an input image, passes it through a series of convolutional layers, and outputs another image of the same size as the input. If you want a deeper dive on convolutional nets, check out this video on visualizing how trained CNNs pick up patterns in images. Basically, CNNs learn feature kernels that capture specific patterns, and these kernels scan across the input image to measure the degree of overlap between the pattern and the image at various locations, a technique known as convolution. The U-Net in particular passes the input image through multiple convolutional layers to first downscale it to a lower dimension and then upscale it back to its original shape. While upscaling, the U-Net also adds skip connections that reuse the features learned along the downscaling path. We also train separate positional encodings, high-dimensional sinusoidal embeddings, to encode the time-step input; the input time step's positional encoding is then added to the image embeddings at different layers. Adding the time-step information gives the U-Net extra hints about how far or close the algorithm is from the final image, and therefore how aggressively it should be removing noise from the input. The U-Net introduces a lot of useful inductive biases into the training, such as feature reuse, where the upscaling stage can build new features by making additive changes to the features already learned during downscaling, and it also combines the low-level features captured in the shallow layers of the network with the high-level global patterns picked up in the deeper layers.

So here is what the entire training loop looks like. We load in a batch of images, apply some augmentation like random rotations, flipping, and brightness/contrast adjustments, and resize them to 32×32×3, which is our target image-generation size for this experiment (later in the video we will train much larger 128×128 RGB image generators, but we still need to go through a few more steps for that). We generate a random noise tensor of the same 32×32×3 shape for each image in the batch, run forward diffusion at randomly sampled time steps with our DDPM scheduler to create the noisy images, forward-pass the noisy images and the time steps through the U-Net, compute the mean squared error loss between the network's prediction and the randomly generated noise tensor, backpropagate, and update the weights. Bam. Repeat this training loop a bunch of times over the dataset and we have our denoising neural network: the U-Net effectively learns to predict how much noise must be removed from a noisy input image to reverse the forward diffusion process by one time step.
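Here is a condensed sketch of that training loop, written with the diffusers UNet2DModel; the model configuration, optimizer settings, and the stand-in data loader below are placeholders rather than the exact setup used in the video.

```python
import torch
import torch.nn.functional as F
from diffusers import DDPMScheduler, UNet2DModel

# A small 32x32 RGB U-Net and the DDPM forward-diffusion scheduler.
model = UNet2DModel(sample_size=32, in_channels=3, out_channels=3)
scheduler = DDPMScheduler(num_train_timesteps=1000)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

# Stand-in for batches of augmented, normalized 32x32x3 training images.
dataloader = [torch.randn(8, 3, 32, 32) for _ in range(10)]

for images in dataloader:
    noise = torch.randn_like(images)
    timesteps = torch.randint(0, scheduler.config.num_train_timesteps, (images.shape[0],))
    noisy = scheduler.add_noise(images, noise, timesteps)   # forward diffusion

    pred = model(noisy, timesteps).sample                   # U-Net predicts the added noise
    loss = F.mse_loss(pred, noise)                          # mean squared error objective

    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```

At inference time, the same scheduler can undo one step at a time via scheduler.step(model_output, t, sample).prev_sample, which is the sampling loop described in the next point.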
After training is complete, let's see how image generation works during inference. We initialize a 32×32×3 image with random Gaussian noise; this is our image at t = N. We pass it through the U-Net to get the predicted noise, which is scaled and subtracted from the noisy image to give us the image at t = N−1. We pass this new noisy image back through the network to get the image at t = N−2, and we iterate until we get our clean image. And double bam, we have generated an image out of pure noise. Here are some results from the diffusion model that was trained.

Compared to other generative models like GANs, diffusion uses a very simple and well-defined objective: the mean squared error between the predicted noise and the actual noise. That leads to much more stable training signals than GANs, which are notorious for their unstable-gradient issues. Also, because diffusion generates images through this iterative noise-removal process, it can take multiple small steps to generate an image, compared to GANs or VAEs, which have to generate the image in one forward pass. We don't make the network directly predict the final image, because at the early stages of noise removal the input is pure noise and there could be multiple correct images that could be generated from it, so the network would just end up outputting some aggregate of all those images and produce a blurry, subpar result. We also don't make the network directly predict the noisy image at t−1, because then it would have to learn the harder task of generating noisy images at all N different noise levels. Instead, we just make the network predict the noise to be removed at the current time step and plug it into the reverse diffusion equation to generate the next image. Although this makes diffusion more computationally intensive than those other approaches, it does help it cover the data distribution more thoroughly, and empirically these models are known to generate more diverse images and not succumb to mode collapse the way GANs do. In fact, you can rerun the diffusion process from any point of the reverse diffusion and get a slightly different image each time, and depending on how close to the final image you choose this reset point, the resulting image will be either slightly or extremely different from the original.

There's still one issue, though. Right now I'm just generating 32×32 images; if I want to generate bigger images, say 128×128, the algorithm is going to run a lot slower because of the high computational cost of these models, and something that used to train in 2 hours will now take 16 hours. That's a huge no-no for this project, because I'm going to train everything on my laptop. The solution to this problem is latent diffusion modeling, and that is the next big topic of the video. Latent diffusion models divide the diffusion training task into two separate phases: instead of training the model directly to generate a 128×128 grid of pixels, we first separately train an autoencoder that learns to compress these images into smaller 32×32 latent representations. If you don't know about autoencoders, don't worry, the next point is all about training them; basically, an autoencoder consists of two convolutional neural networks, an encoder and a decoder.
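As a rough illustration of that idea (the video doesn't show the exact architecture, so the layer sizes below are my own), an autoencoder that maps 128×128×3 images down to a 32×32×3 latent and back might look like this:

```python
import torch
import torch.nn as nn

class ConvAutoencoder(nn.Module):
    """Illustrative encoder/decoder pair: 128x128x3 image <-> 32x32x3 latent."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(                                   # 128 -> 64 -> 32
            nn.Conv2d(3, 32, kernel_size=4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 3, kernel_size=3, padding=1),                 # 3-channel latent
        )
        self.decoder = nn.Sequential(                                   # 32 -> 64 -> 128
            nn.ConvTranspose2d(3, 64, kernel_size=4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(64, 32, kernel_size=4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 3, kernel_size=3, padding=1), nn.Tanh(),
        )

    def forward(self, x):
        z = self.encoder(x)           # compressed latent representation
        return self.decoder(z), z     # reconstruction and the latent

x = torch.randn(4, 3, 128, 128)
recon, latent = ConvAutoencoder()(x)  # recon: (4, 3, 128, 128), latent: (4, 3, 32, 32)
```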
The encoder compresses the images into a lower-dimensional space through a series of convolutional layers, and this compressed space is also called the latent space; the decoder then reconstructs the original image from this latent space. Because the network has to reconstruct the full-dimensional 128×128 image from that smaller bottleneck, the encoder learns to pack as much key information as possible into the latent representation so that the decoder can successfully decompress it. Why is this important for diffusion? Well, instead of training our diffusion model on the full 128×128 images, which again would be extremely computationally heavy and not scalable, we use the autoencoder's encoder to compress the entire dataset into 32×32×3 latent representations and train our diffusion model on those. During inference, we initialize randomly generated noise of the same 32×32×3 shape, run reverse diffusion through the U-Net to generate the final 32×32×3 latent embedding, and then pass it through the autoencoder's decoder to convert it back into our desired 128×128 image. I already have an entire video on how autoencoders work and the many cool things latent spaces can do, so go watch that if you're really interested in this topic, but here are some useful bullets that are specific to training LDMs.

Autoencoders compress and reconstruct images using a series of convolutional layers, and we train the network using a reconstruction loss between the input and the output. That could be as simple as the pixel-wise mean squared error between output and input, but for our case we will train it with something called a perceptual loss function. Basically, we take an already-trained computer vision model like VGG or AlexNet and reduce the error between the activation maps of our input and our output at different layers of this pretrained model; I used the activation maps of the first three layers of a VGG model for the perceptual loss. This incentivizes the network to learn higher-level feature representations than the pixel-level ones we get from a simple mean squared error, and since each layer captures different low-level abstractions in the image, say the hairstyle, the eyes, the nose, the skin, and so on, our network learns to focus on reproducing the facial features accurately rather than the background pixels. I also use a KL divergence term, as recommended in the LDM paper; this ensures that the learned latent space doesn't take arbitrarily large values and stays somewhat bounded, which makes it easier for the latent diffusion model to learn the latent-space patterns. The paper also recommends a discriminator loss, so I trained a separate neural network that inputs the real images and the autoencoder's reconstructed images and outputs a similarity score between the two; the higher the similarity score, the lower the discriminator loss. Optimizing these three losses, we train our final autoencoder, and here are some of the image-reconstruction results it produced. Also note that the KL divergence weight is kept very small compared to the perceptual loss and the discriminator loss, which means the trained autoencoder can't really be used for generative purposes on its own. The goal of the autoencoder is to learn a compressed representation, and the only reason we even have the KL divergence term in the loss function is to keep the latent representations from growing arbitrarily large.
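Here is a rough sketch of what such a perceptual loss can look like using torchvision's pretrained VGG16; the specific layers tapped below are an illustrative choice, not necessarily the "first three layers" used in the video.

```python
import torch
import torch.nn.functional as F
from torchvision.models import vgg16, VGG16_Weights

# Frozen, pretrained VGG16 feature extractor.
vgg = vgg16(weights=VGG16_Weights.DEFAULT).features.eval()
for p in vgg.parameters():
    p.requires_grad_(False)

# Indices of a few early layers whose activation maps we compare (illustrative choice).
TAP_LAYERS = {3, 8, 15}

def perceptual_loss(recon, target):
    """MSE between VGG activation maps of the reconstruction and the original image."""
    loss, x, y = 0.0, recon, target
    for i, layer in enumerate(vgg):
        x, y = layer(x), layer(y)
        if i in TAP_LAYERS:
            loss = loss + F.mse_loss(x, y)
        if i >= max(TAP_LAYERS):
            break
    return loss
```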
By the way, I will upload the code used in this project for supporters of the channel, so do check out our Patreon page if you want access to it; it also includes a code walkthrough video and other things like the slides, animations, and narrations from this video and all the other videos on the channel, and of course it supports the channel in a big way and helps keep it going, so do check out the perks.

After the autoencoder is trained, training an LDM, a latent diffusion model, with it is pretty straightforward. We load our images in batches and augment them as usual, but this time we first pass them through the autoencoder's encoder to derive their 32×32×3 latent representations, and then train the diffusion network on these latent features instead of the full-dimensional images. During inference we start, as usual, with random noise, run reverse diffusion for N time steps, and the final output we get is a latent image, which we then pass through the autoencoder's decoder to generate the final image. Here are some images generated by the model as it was training; not bad after just one hour of training. In the next section we'll talk about conditional latent diffusion models, discussing how to go from this basic LDM to a text-to-image generation model where we can control the generated images with an input text prompt.

To do text-to-image, we first need a dataset with pairs of images and text captions. The CelebA dataset I'm using comes with a table of facial attributes for each image, like gender, hair color, age, and a bunch of other features, and I wrote a script that converts these attributes into multiple valid textual captions; we're going to use these to train our conditional diffusion model. Of course, to train with text we first need to embed it, that is, convert it into a numerical representation our network understands. One easy way is to use a text-embedding model that generates a single flat embedding for the entire prompt and then use it just like we use the time-step embeddings, by adding or appending it at different layers of the U-Net. That would work, but using a single flat embedding to represent text is generally not advisable, because it severely constrains how much detail can be encoded into that one vector. A better approach is to embed the text as a sequence of embeddings and then use, say, attention mechanisms to condition the model. Let's break down what I mean by that in the next two points.

Given an input text, we first tokenize it into a sequence of tokens and then embed it as a sequence of embedding vectors. There are multiple ways of doing this, but the LDM paper uses a Transformer-based CLIP model, and so will we. Given a dataset of image and text pairs, the CLIP model uses a Transformer to embed the text and our pretrained autoencoder to embed the image, and then we take the dot products between the two groups to form an N×N correlation matrix of the two multimodal inputs. To train the network we do contrastive learning: we increase the alignment between the correct text-image pairs, the diagonal elements, while simultaneously decreasing the similarity between incorrect pairs by pushing down the off-diagonal values. This helps CLIP learn a joint embedding space where both text and images can be represented in the same latent space.
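In practice, a pretrained CLIP text encoder from the transformers library already gives us this per-token sequence of embeddings. A minimal sketch, assuming the publicly available openai/clip-vit-base-patch32 checkpoint (the video doesn't specify which CLIP variant it uses):

```python
import torch
from transformers import CLIPTokenizer, CLIPTextModel

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-base-patch32").eval()

prompts = ["a smiling young woman with blond hair",
           "a man with eyeglasses and a beard"]
tokens = tokenizer(prompts, padding="max_length", truncation=True, return_tensors="pt")

with torch.no_grad():
    # One embedding per token rather than one flat vector per prompt:
    # shape (batch, sequence_length, hidden_size), e.g. (2, 77, 512) for this checkpoint.
    text_embeddings = text_encoder(**tokens).last_hidden_state
```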
And with that, we are approaching the final boss of the video: the text-to-image conditioning part. To do this, we introduce a new layer into our U-Net architecture, an attention-based conditioning layer, whose job is to contextualize the input feature maps with the text encodings from the input prompt. If the following section is a bit difficult to follow, you may want to watch my attention series, where I explain neural attention from the ground up with all the math and intuition you might need for this next part. First, we flatten the image feature maps from a U-Net layer and pass them through a neural network to obtain the query embeddings; the text embeddings are passed through two other neural networks to obtain the key and value embeddings. Finally, we do a dot-product attention computation, softmax(QKᵀ)·V, to generate contextualized image embeddings. What this does is contextualize the image embeddings with the text embeddings, basically merging the two worlds into one fused embedding state, and just like that we have a conditional U-Net layer. Now we can replace a handful of the downscaling or upscaling layers in the U-Net with these attention layers, making some of the layer activations conditional on the input prompt. Using the diffusers library, initializing a conditional U-Net is as simple as specifying where to use the plain downsampling/upsampling blocks and where to use the attention-based ones. Inside the training loop, we first call our CLIP model to encode the text and pass the text encodings into the conditional U-Net model through the encoder_hidden_states argument, and then we let the model train. This one took a while, around 5 to 6 hours, but the results were pretty encouraging: I can see that the model definitely understands how to generate different faces according to the prompt. I'm sure that with a lot more training data, a larger model, and a better pretrained VAE, the generated images would be much better, but I'm happy to see the concepts working in action.

So, we started with a simple unconditional diffusion model, graduated to latent diffusion models, and finished with conditional models that generate from text using CLIP embeddings. Hope you had fun watching this video; do hit the like button. I'll leave some related video links and articles in the description below. You're magnificent, goodbye, and a huge shout-out to this month's supporters. Thank you.
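To make the attention-based conditioning layer from the final points above a bit more concrete, here is a minimal, self-contained sketch of that cross-attention computation; the projection sizes and module names are illustrative, not the code from the video. In diffusers, the equivalent computation happens inside UNet2DConditionModel when the text encodings are passed through the encoder_hidden_states argument.

```python
import torch
import torch.nn as nn

class CrossAttentionBlock(nn.Module):
    """Conditions flattened U-Net feature maps on a sequence of text embeddings."""
    def __init__(self, img_dim, txt_dim, attn_dim=256):
        super().__init__()
        self.to_q = nn.Linear(img_dim, attn_dim)   # queries from image features
        self.to_k = nn.Linear(txt_dim, attn_dim)   # keys from text embeddings
        self.to_v = nn.Linear(txt_dim, attn_dim)   # values from text embeddings
        self.out = nn.Linear(attn_dim, img_dim)

    def forward(self, img_feats, text_embeds):
        # img_feats: (batch, h*w, img_dim); text_embeds: (batch, seq_len, txt_dim)
        q, k, v = self.to_q(img_feats), self.to_k(text_embeds), self.to_v(text_embeds)
        attn = torch.softmax(q @ k.transpose(1, 2) / q.shape[-1] ** 0.5, dim=-1)
        return img_feats + self.out(attn @ v)      # fused, text-conditioned image features

# Toy usage: a 16x16 feature map with 128 channels, conditioned on 77 CLIP token embeddings.
feats = torch.randn(2, 16 * 16, 128)
text = torch.randn(2, 77, 512)
conditioned = CrossAttentionBlock(img_dim=128, txt_dim=512)(feats, text)
```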
Info
Channel: Neural Breakdown with AVB
Views: 2,662
Keywords: machine learning, ai, deep learning
Id: w8YQcEd77_o
Length: 24min 58sec (1498 seconds)
Published: Sun May 26 2024