Diffusion Models - Live Coding Tutorial

Captions
Hey guys, my name is Damian and I've been working in machine learning and deep learning for a while now. I've worked in the field of self-driving cars, I've taught robots how to assemble car components, and right now I'm working on the specification of neural networks. In this tutorial I would like to show you how to code a diffusion model from scratch.

As many of you probably know, the field of AI is right now pretty much overwhelmed by new discoveries in vision, where we are able to generate high-resolution, high-fidelity images, and this is due to advances in diffusion models. Stable Diffusion is a very good example of a method that can do astonishing things. But before we dive deeper into diffusion models, I would like to first talk about diffusion in the context of your high-school chemistry class. So what is diffusion? Diffusion is the movement of particles from high to low concentration. If you put dye into water, over time the dye molecules spread out and mix evenly with the water molecules. This is basically the second law of thermodynamics: entropy increases over time, and all systems tend toward higher entropy. So we have some particular structure at time step t0, which gets destroyed over time — it diffuses — and this notion of diffusion is the driving idea behind diffusion models.

The goal is to teach the network, given some training set drawn from a distribution p(x), to autonomously generate new, novel image samples that look as if they were coming from that distribution p(x). How does the network learn this? It uses the diffusion process as a kind of proxy. We have two diffusion processes: the forward diffusion process, which is fixed and does not involve the neural network, and the reverse (or backward) process, which is actually driven by the neural network. In the forward diffusion process we take a clean image and gradually, over T time steps, apply Gaussian noise. Effectively we destroy the structure of the image, in the same way that the dye diffuses among the water molecules. Then we ask our network to reverse this process, to repair the damaged structure. By learning the denoising process — learning, for example, to take a noisy image and remove the noise to recover the original image — the network can at some point receive just white noise, just a Gaussian sample, and from that infer what a clean image would look like. That is the idea.

As I have said, the forward process goes from x0 (the clean image) to xT (which is just noise), and this happens in T steps. Those steps follow a Markov chain: a Markov chain is a sequence of transitions where every transition depends only on the previous state. In the case of forward diffusion, each transition just applies Gaussian noise with some noise variance to the previous image, which itself can already be noisy if we are not at the first step. And, as I mentioned, there is no learning and no neural network involved here: this is just adding noise.
If we want to go backwards, we want to sample from the distribution p(x_{t-1} | x_t): given a noisy image, give me something less noisy. This distribution is very hard to learn naively. In the machine learning world there is a family of methods that fall under the umbrella term of variational inference, whose goal is to let us learn very complex, often intractable distributions by approximating them with simple distributions, for example multi-dimensional Gaussians. And yes, Stable Diffusion does use variational inference. I will not go into the details, but our goal is to find this reverse distribution. The forward transition q(x_t | x_{t-1}) is simple — just applying Gaussian noise — but going backwards, from noise to structure, is difficult. This is why we employ a neural network: the reverse process will be learned by the network.

The funny thing is that what I am going to show you is conceptually not very different from the state of the art. The main reason why diffusion models are so powerful right now is that we have understood that we can run this diffusion not in the image (pixel) space but in some latent space, which gives us much more capability. So after you are done with this tutorial, I recommend you read the paper on latent diffusion. When you understand that paper — and it is a very cool read, not difficult I would say — you will actually understand how we are now able to generate those incredible images the whole world is talking about.

There are three sources I will be using for the tutorial: the great blog post by Lilian Weng, where she describes the fundamentals behind diffusion models; one of the classical papers on diffusion models — by no means the only paper, but the one I read initially and very much like, because it does a very good job of explaining what is going on, and we will be implementing its two algorithms; and I also recommend a review paper that summarizes the basics of diffusion models for the visual domain — a nice wrap-up of what has happened over the past several years that lays out the technical fundamentals very well.

All right, let's do some coding. The first thing we are going to implement is forward diffusion. The forward diffusion process is a Markov chain where we start with a clean image, without any noise, and then gradually apply more noise, so that at the end we get a fully noisy image: pure noise, an isotropic Gaussian. We go from step to step by adding more noise to the previous step: getting from x_{t-1} to x_t is sampling from a Gaussian whose mean is the previous image scaled by the factor sqrt(1 - beta_t) and whose variance is beta_t. So beta_t is the amount of noise we want to inject at this transition.
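To keep the notation straight, here is the forward transition we have just described and the reverse step that the network will have to approximate, written in the convention of the DDPM paper referenced above:

```latex
% Forward transition: scale the previous image and add a little Gaussian noise
q(x_t \mid x_{t-1}) = \mathcal{N}\!\left(x_t;\ \sqrt{1 - \beta_t}\, x_{t-1},\ \beta_t \mathbf{I}\right)

% Reverse step, approximated by a neural network with parameters \theta
p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\!\left(x_{t-1};\ \mu_\theta(x_t, t),\ \Sigma_\theta(x_t, t)\right)
```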
The bigger the beta, the smaller the factor sqrt(1 - beta_t), so the mean shifts from the original image towards zero; over many such steps we go from the original image to something that is essentially a Gaussian with mean zero and unit variance.

Here the authors of the paper use a very smart trick. We know how to get from a less noisy image to a more noisy one — that is the Markov-chain equation above — but there is also a closed-form solution for getting from x0, our original image, to any degree of noisiness in one pass. We don't have to compute x_{t-1} to get x_t; we can go directly from x0 to x_t. How? We can rewrite x_t in terms of the previous state, and if we do this expansion several times we can see a pattern, which lets us define a very nice closed-form expression for x_t that depends only on the initial image. This is the thing that actually governs our forward diffusion process, and it is the thing we want to implement, so let's get to it.

Cool, we have a clean slate, so let's first state what we need to implement. We want the following: our mean is sqrt(alpha_hat_t) times x0, and then — watch out, this is a standard deviation, so we need the square root — the standard deviation is sqrt(1 - alpha_hat_t). We multiply that by random noise, and that is how you go from x0, the clean input image, to any given time step t in our Markov chain.

Okay, so let's import torch, and first define our x0. It is going to be an image — actually just a random tensor, let's say with batch size two, three channels, and 32 by 32.
It is basically a dummy image. Now let's also define our betas, our variance schedule. The betas form a 1D vector — say a dummy linear schedule of five values going from 0.05 up to 0.25. We can also specify our time steps: let's say we want x_t at time step 1 and at time step 3, so t = torch.tensor([1, 3]). What's important is that the batch sizes of x0 and t match: x0 holds two images, so we also want t to have batch size two.

Now we want to get the value alpha_hat_t. As we said before, the betas are the amounts of noise applied at each time step of the diffusion process. The alphas are 1 minus the betas — the amount of original image information that is preserved after a diffusion step. And alpha_hat_t is the cumulative product of the first t alphas: alpha_hat_t = alpha_1 * alpha_2 * ... * alpha_t. A cumulative product is very easy to get in PyTorch: alpha_hat is just torch.cumprod(alphas, axis=0). We can verify it: the first entry of alpha_hat equals the first entry of alphas, the second entry is the first two alphas multiplied, the third is the first three alphas multiplied, and so on.

What else do we need? One additional step. Take a look at the shapes of alpha_hat and x0: alpha_hat is a tensor with five values, but at some point we are going to multiply information from alpha_hat with x0, and x0 has a different shape. So we need a way to extract the right values from alpha_hat and then be able to multiply them with x0.
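As a small sanity check, the bookkeeping described above could be sketched like this (the concrete beta values are just the dummy schedule used in this part):

```python
import torch

# Dummy variance schedule with five diffusion steps
betas = torch.tensor([0.05, 0.10, 0.15, 0.20, 0.25])

alphas = 1.0 - betas                      # fraction of the signal preserved per step
alpha_hat = torch.cumprod(alphas, dim=0)  # cumulative product: alpha_1 * ... * alpha_t

# Pick the alpha_hat values for time steps 1 and 3 (one per image in the batch)
t = torch.tensor([1, 3])
alpha_hat_t = alpha_hat.gather(0, t)      # tensor([alpha_hat[1], alpha_hat[3]])
print(alpha_hat_t)
```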
So first of all, as shown here, we need to extract from alpha_hat the values that correspond to the given time steps. For example, alpha_hat contains five values, and we only want to fetch those that correspond to the specified time steps — the alpha_hat value at index 1 and the one at index 3. For this we use the gather function: alpha_hat.gather(0, t), using the first axis and the t vector. What gather does is pick from alpha_hat the values specified by t, so t acts as a list of indices, and given those indices we pluck the values out of alpha_hat. Let's verify: yes, we get the value at index 1 and the value at index 3, exactly what we wanted. Perfect.

One more thing: at this point the result has size 2, the same as our batch size, and we need to reshape it so that it has four dimensions, so that we can multiply it with our original input image. At this point we can already write: mean = sqrt(result) * x0, and the noise term is sqrt(1 - result) * noise, where noise is a tensor of the same shape as x0 whose values are sampled from a Gaussian with mean zero and standard deviation one, i.e. torch.randn_like(x0). Our x_t is then the mean plus the noise term, and x_t holds two images, as we set up here: one image at time step 1 with less noise and one image at time step 3 with more noise. And indeed, x_t contains those two images.

What I'll do now is clean up this code, turn it into a function, and then we can test that function on a real input.
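A cleaned-up forward-diffusion function along those lines might look roughly like this — a sketch, not the exact code from the video, and the name forward_diffusion is my own:

```python
import torch

def forward_diffusion(x0, t, alpha_hat):
    """Jump from the clean image x0 directly to the noisy image x_t.

    x0:        clean images, shape (B, C, H, W)
    t:         time steps, shape (B,), dtype long
    alpha_hat: cumulative products of (1 - beta), shape (T,)
    Returns (x_t, noise) so the ground-truth noise can later be reused as a training target.
    """
    alpha_hat_t = alpha_hat.gather(0, t).reshape(-1, 1, 1, 1)  # broadcastable with x0
    noise = torch.randn_like(x0)                               # epsilon ~ N(0, I)
    mean = torch.sqrt(alpha_hat_t) * x0
    std = torch.sqrt(1.0 - alpha_hat_t)
    x_t = mean + std * noise
    return x_t, noise

# Usage with the dummy tensors from above:
# x0 = torch.randn(2, 3, 32, 32); t = torch.tensor([1, 3])
# x_t, noise = forward_diffusion(x0, t, alpha_hat)
```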
Okay, I've cleaned up the code and we have our forward diffusion function — I took everything I showed you before and made it a bit neater. So let's test this function on a real input. I have a URL here with an image of a raccoon; I'll download it and save it to disk. Of course I need to do some imports first (urllib and PIL's Image). Now let me open the image — that's our test image, and as you can see it's a raccoon. This is a PIL image, so we need a transformation that takes it from PIL to a PyTorch tensor; we can import torchvision, or just its transforms module. Our transform will first resize the image — we'll keep operating at 32 by 32, which sounds good to me — then convert it to a PyTorch tensor, and finally apply one more small trick, because we will be adding noise to this image.

The noise comes from an isotropic Gaussian distribution, sampled with mean zero and some standard deviation, so the pixel values of the noise are centered around zero: there will be negative and positive values. Our image, at this point, would by default be scaled from zero to one — this is how torchvision does it — so we need a transformation that scales it from -1 to 1, so that the noise and the image pixel values live in the same range, the same domain if you like. We'll use a Lambda transform that first multiplies every pixel value by two and then subtracts one. This maps the two extreme values: zero becomes 0 * 2 - 1 = -1, and one becomes 1 * 2 - 1 = 1. This way we have transformed our pixel values from the range [0, 1] to [-1, 1], as intended.

And maybe just for the fun of it, let's define a reverse transform that takes us back from torch to PIL, reversing the whole process. First a Lambda transform that adds one and then divides by two. Then we need a permute: by convention in PyTorch, the channel dimension of an image follows the batch dimension — batch size, then channels, then height and width — whereas in numpy the channel dimension is the last one: height, width, then channels. So we need to make sure the channel dimension moves from the front to the end. There is no batch dimension here, so we just say permute(1, 2, 0): the channel dimension (index 0) goes to the end, and effectively we go from (channels, height, width) to (height, width, channels). Then we multiply every pixel value by 255 to be in the valid range for numpy and PIL, convert the torch tensor to a numpy array, cast it to unsigned 8-bit integers, and finally convert to PIL. Looking good: we now have one transformation that takes us from the PIL image to the torch tensor — the form in which we will actually feed images to the network — and a transformation from PyTorch back to PIL, which we'll use mainly for plotting, to see what happens to the image as we add noise.

Exciting. So the torch image is transform(image) — all right, a couple of bugs to fix quickly — okay, the torch image looks good. Now let's plot what we have computed: show reverse_transform(torch_image). Nice. What we have done to the original image is resize it to 32 by 32, and that's it; all the other transformations are undone by the reverse function. But this is basically how our input to the network will look.
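The two transforms described above could be sketched like this (assuming torchvision; the 32x32 size and the [-1, 1] scaling follow the steps above):

```python
from torchvision import transforms
import numpy as np

# PIL image -> tensor in [-1, 1], resized to the working resolution
transform = transforms.Compose([
    transforms.Resize((32, 32)),
    transforms.ToTensor(),                       # scales pixel values to [0, 1]
    transforms.Lambda(lambda t: t * 2.0 - 1.0),  # rescale to [-1, 1], same range as the noise
])

# tensor in [-1, 1] -> PIL image, mainly for plotting
reverse_transform = transforms.Compose([
    transforms.Lambda(lambda t: (t + 1.0) / 2.0),                        # back to [0, 1]
    transforms.Lambda(lambda t: t.permute(1, 2, 0)),                     # (C, H, W) -> (H, W, C)
    transforms.Lambda(lambda t: (t * 255.0).numpy().astype(np.uint8)),   # to uint8 numpy array
    transforms.ToPILImage(),
])
```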
And now the fun part. We don't have a model yet, but we do have a forward diffusion function, so here is what we can do. Our betas form a linear schedule with five values, so our Markov chain has five time steps. For every time step, let's see what the image looks like as we progressively add more noise. We can say t = torch.tensor([0, 1, 2, 3, 4]), and then compute the noisy images by performing forward diffusion, with x0 being our torch image and t being t. Ah, one thing we need to fix: the batch of images should be torch.stack of the torch image repeated five times, so we give the function a batch of five images together with a batch of five t values. Those are our noisy images, and we can plot them: for each image in noisy_images, show the reverse-transformed image. Oh right — the forward diffusion function returns x_t but also returns the noise, so it returns a tuple of objects (this will be important later on), so we take just the noisy images. A few more fixes — ah, we were also missing the reverse_transform function — and there we go, looking good.

So this is how it goes: we have our image at time step zero, the original image that has not been altered, and then we gradually apply more and more noise to it, thereby destroying its structure. This is how we are going to generate the training data for the diffusion model to train on. That's it for now; I will clean up the code, and in the next part I'll show you how to start training a diffusion model.

All right, I have cleaned up the code a bit, and this is the current state: some imports (so far I'm working on the CPU), the function to get the raccoon image, and I actually created a class, DiffusionModel, where one of the methods is forward, which implements the forward diffusion. I also have my two transforms here, and a nicer implementation of the small visualization I showed you last time, where we show the initial clean image and then display it again with progressively more noise added. So this is just a cleaned-up version of what I presented last time.
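That small visualization amounts to roughly the following sketch (matplotlib assumed; forward_diffusion, reverse_transform, alpha_hat and torch_image are the helpers and tensors built above):

```python
import torch
import matplotlib.pyplot as plt

t = torch.tensor([0, 1, 2, 3, 4])          # one time step per copy of the image
batch = torch.stack([torch_image] * 5)     # five copies of the raccoon image

# forward_diffusion returns (x_t, noise); here we only need the noisy images
noisy_images, _ = forward_diffusion(batch, t, alpha_hat)

for img in noisy_images:
    plt.imshow(reverse_transform(img))
    plt.show()
```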
Now we are going to implement the training routine, the training of the neural network. First let's think about what we really want to have here. The thing is, we want to learn the structure of our dataset. The forward diffusion is basically destroying the data, so that the network can learn to undo the destruction — and by undoing the destruction it is actually adding structure to noise. How does the network generate images? It takes a noisy image — noise, chaos, entropy — and adds structure to it. So we are training a function that can create structure from noise, and this network will be a U-Net: a U-shaped network that takes an image as input, downsamples it, then upsamples it again, and gives you something back. In our case, the U-Net will take the noisy image and learn to output the noise. This ability to extract the noise from a noisy image will be used in the backward diffusion process, which goes from pure noise to structure, in opposition to the forward process we have seen, which goes from structure to entropy, to noise. To reiterate: we want to teach the network to take a noisy image and extract just the noise from it. This is how diffusion models learn to generate images that look great to us, that have structure: they capture the distribution p(x) of the inputs, the distribution of images, exactly by learning how to go from noise, chaos, entropy to a meaningful, structured distribution.

The way we do it — we can take a look at the paper — the training is fairly simple. First, from all the images in our dataset we sample a mini-batch. Then, from all the possible time steps in our Markov chain, we sample random time steps: so we can have a slightly noisy image at time step 20 or a very noisy image at time step 200. Then we sample random noise, and we apply the equation we have seen before — the forward diffusion, which we implemented last time. Our network, parameterized by theta, takes the noisy input and outputs noise, and we minimize that prediction with respect to the ground-truth noise. That is basically a mean-squared loss between the ground-truth noise — the real noise that was applied to the image — and the noise predicted by the network (written out below). That is what we are going to implement.

Now, the U-Net: the architecture of the U-Net is not in the scope of this tutorial, but you can trust me, this U-Net is nothing serious, nothing fancy. There are some bells and whistles in there that enable the training of a diffusion model, but I'd rather talk about them in an addendum to this video, because there is nothing really special going on. For now, let's treat the U-Net as a black box.

Okay, so the U-Net is the network I'm going to train. First we are going to do something very characteristic of kicking off a new project: we'll overfit on one batch of data, just to see whether our network works properly. If the network fails to solve a very simple problem, we probably should not advance and give it a more difficult task; we should go back to square one, back to the drawing board, and revisit our assumptions and our architecture, because something must have gone wrong if we cannot solve a very simple task.
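In equation form, this is the simplified training objective from the DDPM paper (Algorithm 1), where epsilon_theta is the noise-predicting U-Net and alpha-bar_t is the cumulative product from the forward process:

```latex
L_{\text{simple}}(\theta) =
\mathbb{E}_{x_0,\; t,\; \epsilon \sim \mathcal{N}(0,\mathbf{I})}
\left[ \left\| \epsilon - \epsilon_\theta\!\left( \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon,\; t \right) \right\|^2 \right]
```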
So let's first specify the hyperparameters: we are going to train for 100 epochs; the print frequency is 10, i.e. how often we print the training progress; the learning rate is 0.001, which I think is fair; and the batch size is 128. The optimizer is just plain Adam.

Then, for each epoch we want to do the following. As mentioned, our forward step returns x_t — the noisy sample — plus the ground-truth noise, so we can write: noisy_image, ground_truth_noise = diffusion_model.forward(x0, t), where t is sampled uniformly from the number of possible diffusion steps and x0 is our input batch. Given all that information, we can employ our U-Net to predict the noise from the noisy image — of course given the time step t, because the network has to know at which time step this image was created: the higher the t, the more noise has been injected, and the network needs to know this. Then we change the weights of the U-Net so as to minimize the L2 loss between the predicted noise and the ground-truth noise. This is roughly the pseudocode we are going to follow.

First let's create the batch we want to overfit on: the batch size is 128, so we stack the image accordingly. Maybe for now let's set the number of epochs to one. Our t will be a random integer from the range between zero and the maximum number of time steps, which is defined in our diffusion model — by default we have 300 time steps. This tensor needs to have the batch size as its size, and it needs to be a long: we set t as an integer, not as a float. Okay, now we can create our noisy image and ground-truth noise by calling the forward model with x0 and t, and the predicted noise is unet(noisy_image, t). And now the rudimentary optimization code: first we zero out all the gradients, then we compute the loss — okay, this error comes from my U-Net, let me quickly fix it and come back to you — now we can compute the loss between the ground-truth noise and the predicted noise, and maybe just for fun track the MSE loss per epoch. Then we do the backprop, and finally the gradient step. Let's make it run a couple of times.

Now we can also say: if the epoch is divisible by the print frequency, we first plot the noise prediction — comparing the ground-truth noise and the predicted noise; let's just take the first noise sample from every batch, it's going to be pretty interesting to see, and I'll explain it in a second — and then also plot the noise distributions. I had to deal quickly with a few bugs in the helper functions — this is what usually happens when you just paste code — and I also decided to migrate to a machine where I have access to a GPU, basically not to waste time while training. And this is pretty important, because I really want to overfit hard to this one raccoon batch: if I do, I'll be able to generate new images, and of course my expectation is that if I overfit to one image of a raccoon, my network will consistently generate that image of a raccoon from noise, because that is the whole distribution it has ever learned.
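To recap, the overfitting loop we just put together amounts to roughly the following sketch. Treat it as an outline rather than the exact notebook code: diffusion_model, unet and torch_image stand in for the objects built earlier, and the call signature unet(x, t) is an assumption.

```python
import torch
import torch.nn.functional as F

NUM_EPOCHS = 100
PRINT_FREQUENCY = 10
LEARNING_RATE = 1e-3
BATCH_SIZE = 128
NUM_TIMESTEPS = 300            # length of the Markov chain in the diffusion model

optimizer = torch.optim.Adam(unet.parameters(), lr=LEARNING_RATE)

# One fixed batch: BATCH_SIZE copies of the raccoon image we want to overfit on
x0 = torch.stack([torch_image] * BATCH_SIZE)

for epoch in range(NUM_EPOCHS):
    # Random time step per image; integer (long) and matching the batch size
    t = torch.randint(0, NUM_TIMESTEPS, (BATCH_SIZE,)).long()

    # Forward diffusion: noisy images plus the ground-truth noise that produced them
    noisy_images, ground_truth_noise = diffusion_model.forward(x0, t)

    # The U-Net predicts the noise, conditioned on the time step
    predicted_noise = unet(noisy_images, t)

    optimizer.zero_grad()
    loss = F.mse_loss(ground_truth_noise, predicted_noise)
    loss.backward()
    optimizer.step()

    if epoch % PRINT_FREQUENCY == 0:
        print(f"epoch {epoch}: mse = {loss.item():.4f}")
```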
So let's rerun the whole notebook, and let's actually look at the helper functions I have implemented. There are two of them. One helper function shows, side by side, the ground-truth noise and the predicted noise: right now, early in training, the predicted noise is very far off from the ground truth. This can also be seen here: our ground-truth distribution is an isotropic Gaussian, and we would like the predicted noise to mimic that distribution. Right now the orange predicted-noise distribution does not overlap with the blue one, because those values come from the initial weight initialization. But over time, as you can see here — bam — the predicted noise and the ground-truth noise start to really overlap, which means the network is slowly learning the structure it has seen in the training set. The same goes for the noise images: you can see that the pixel values are very much the same in the predicted noise and in the ground-truth noise. So we are on the right track. I will keep the network training for more epochs, and then we will see whether we can also implement the backward process and start generating images — which is, after all, the goal of diffusion models.

We are done overfitting to our single batch, so now let's get to the backward process, which is actually image generation from noise. Before I start, I should probably talk about the theory. The forward process went from the clean image to the noisy one, destroying the structure; the backward process goes from noise to a clean image, from chaos to structure. Now, the derivation of the backward process is pretty complex: it basically involves reversing the formulation of the forward process using Bayes' rule. Because it is not conceptually difficult, just mathematically involved, I'll skip the derivation and go straight to the pseudocode.

Before, we implemented training; now we'll implement sampling, so we go backwards. First we sample pure noise, and then we have a for-loop where we iterate backwards over the diffusion steps: if our forward process involved 10 diffusion steps, here we go from the 10th step down to the zeroth — ten, nine, eight, and so on. At every diffusion step we sample some noise (this is the noise for the variance term); if we are at the last time step, where we expect to get a clean image, we do not add any noise; otherwise we sample the noise and generate our new image x_{t-1}, a slightly cleaner image, by plugging the previous image x_t into the equation. As I mentioned, this equation comes from Bayes' theorem applied to our problem; it looks a bit involved, but it is not. One important thing to notice is that here we have our network, parameterized by the weights theta, which takes x_t and t as inputs. Also, the standard deviation (or variance) term can take different values; for the sake of the tutorial I took the simplest choice, which is just beta_t, and it works very well for me. So that is what happens: you just go backwards through the chain.
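Written out, the sampling update described above (Algorithm 2 of the DDPM paper, with the variance choice sigma_t^2 = beta_t used in this tutorial) is:

```latex
x_{t-1} =
\frac{1}{\sqrt{\alpha_t}}
\left( x_t - \frac{1 - \alpha_t}{\sqrt{1 - \bar{\alpha}_t}}\, \epsilon_\theta(x_t, t) \right)
+ \sigma_t z,
\qquad
z \sim \mathcal{N}(0, \mathbf{I}) \ \text{for}\ t > 0, \quad z = 0 \ \text{at the final step},
\qquad
\sigma_t = \sqrt{\beta_t}
```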
When it comes to the implementation, I took the liberty of implementing this directly in our model, in a backward method. I very much followed the same recipe as with the forward method, so I don't want to repeat the whole coding process; I think it is pretty self-explanatory. As I mentioned, my posterior variance at step t is just beta_t, so we're fine — pretty simple. If t equals zero we just return the mean; otherwise we remove the predicted noise and add the scaled random noise. Of course this method needs the decorator torch.no_grad, which disables the collection of gradients — we will not do any backprop here — and computing n steps, where n is 300 in our case, could be heavy on your memory, because PyTorch would otherwise store all those gradients from the computation graph. Not a good idea.

So what we can do now is start from the noise: image = torch.randn with one batch element, three channels, 32 by 32. We should also reinitialize our diffusion model. And now, for i in reversed(range(diffusion_model.timesteps)) — we go from 300 down to zero — for every i we create a tensor t that holds a single t value; in other words, we cast our i variable into a tensor t. Then image = diffusion_model.backward(image, t, unet) — with the U-Net in eval mode. This image starts as our x at step 300, and it is the variable we keep overwriting in this for-loop. Then we would like to see how the noise gets removed at every time step — but we don't want to print 300 images, one for every step, so let's only plot a handful of them, showing reverse_transform(image[0]) every so often. Hmm, something is not displaying properly — the result is correct, but I think it is even cooler to show it like this: that's our noise at time step 240, then at 180, 120, 60, and finally zero. This is the proof, the evidence, that we have implemented the forward and backward processes correctly, because from pure noise we recover the mode of the distribution — we only have one image in our dataset here, and we get it back correctly.
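A sketch of the backward step and of the sampling loop might look like this. Here backward is my shorthand for the reverse-step method added to the model in the video, and betas, alphas, alpha_hat, unet and NUM_TIMESTEPS are assumed to describe the full 300-step schedule built earlier:

```python
import torch

@torch.no_grad()  # no gradients needed while sampling; saves memory over 300 steps
def backward(x_t, t, model, betas, alphas, alpha_hat):
    """One reverse-diffusion step: from x_t to a slightly cleaner x_{t-1}."""
    beta_t = betas.gather(0, t).reshape(-1, 1, 1, 1)
    alpha_t = alphas.gather(0, t).reshape(-1, 1, 1, 1)
    alpha_hat_t = alpha_hat.gather(0, t).reshape(-1, 1, 1, 1)

    predicted_noise = model(x_t, t)
    mean = (1.0 / torch.sqrt(alpha_t)) * (
        x_t - (1.0 - alpha_t) / torch.sqrt(1.0 - alpha_hat_t) * predicted_noise
    )
    if t[0] == 0:
        return mean                               # final step: no extra noise
    noise = torch.randn_like(x_t)
    return mean + torch.sqrt(beta_t) * noise      # posterior variance taken as beta_t

# Sampling loop: start from pure noise and walk the chain backwards
image = torch.randn(1, 3, 32, 32)
unet.eval()
for i in reversed(range(NUM_TIMESTEPS)):
    t = torch.full((1,), i, dtype=torch.long)
    image = backward(image, t, unet, betas, alphas, alpha_hat)
```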
So I think at this point I will finish this part. The next part will be training the U-Net on — sorry, not ImageNet — the CIFAR-10 dataset, and seeing whether we can notice something interesting there, maybe playing around with the capabilities of the diffusion model.

All right, I took the code we have seen before, and now I've applied the model and the training process to a bigger dataset: CIFAR-10. CIFAR-10 is an image classification dataset containing 10 classes — cat, dog, deer, truck, ship, and so on. I've set up my hyperparameters here: the learning rate stays at 0.001, I train for 100 epochs with a pretty large batch size, and I also set up the training loader and the test loader. Here I basically rewrote the training routine — I copied it from the previous part — so in this section we do the training and in this one the testing. One important thing: just for fun, I've added the capability for the network to also take labels as inputs, hoping the network would learn to generate images conditioned on the class label. Nothing fancy there — there is no attention module, no group norm — just a very basic embedding of the label into some latent space, which is then concatenated with the input (a minimal sketch of this idea follows below). As we've seen before, we sample our time step t from the uniform distribution, run the forward diffusion process to get a noisy batch and the noise, and then predict the noise using the U-Net conditioned on the class label; here is where the backpropagation happens, and here we just have our validation run.

I've trained for roughly 90 epochs. To spare you the waiting time, I saved those models, and I chose to look at the one saved at epoch 80 — I liked that one the most visually. So here I have a grid with the images generated by my network: for every class (just to remind you, we have 10 classes in CIFAR-10) I display five images. What happens here is that for every class, the network generates an image from noise; this is the showcase of what the network has learned using our code. As you can see, the results are not fabulous, but I think they still prove a few things, even though the U-Net architecture has not been adapted much to handle conditioning. First, the network learns the distribution of CIFAR-10 pretty well — these samples do look like real CIFAR-10 images. And the conditioning is not too shabby either. If we look at the plane class: this could be a plane, I think. The car: this looks like a car to me, and this one as well. The birds: I can see a bird here, a bird on a branch there — so pretty good. The cats are a bit weird, but the images do look consistent: the shades of gray dominate, which looks right. The deer: a green background, very characteristic of the deer class, and I could see a deer here, for example. The dogs are rubbish, the frogs are also a bit weird, but the horses — I can see a horse here and here, definitely. The ships: I can see a ship here, here, and here. And the trucks — well, this is a truck, obviously.

So, all in all, in this tutorial I've shown you how to implement, from scratch, a simple diffusion model that is able to generate new images given a training-set distribution, and that also has some notion of label conditioning.
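As referenced above, the class conditioning used here is deliberately simple. A minimal sketch of the idea — the module name, layer sizes, and the way the extra channel is fed to the U-Net are my assumptions; the video only states that the label is embedded into a latent space and concatenated with the input:

```python
import torch
import torch.nn as nn

class LabelConditioning(nn.Module):
    """Embed the class label and concatenate it with the image as one extra channel."""

    def __init__(self, num_classes=10, embedding_dim=32, image_size=32):
        super().__init__()
        self.image_size = image_size
        self.embedding = nn.Embedding(num_classes, embedding_dim)
        # Project the label embedding to one extra "image plane" per sample
        self.to_plane = nn.Linear(embedding_dim, image_size * image_size)

    def forward(self, x, labels):
        b = x.shape[0]
        plane = self.to_plane(self.embedding(labels))               # (B, H*W)
        plane = plane.view(b, 1, self.image_size, self.image_size)  # (B, 1, H, W)
        return torch.cat([x, plane], dim=1)                         # (B, C+1, H, W)

# During training the conditioned prediction would then look something like:
#   x_cond = label_conditioning(noisy_images, labels)
#   predicted_noise = unet(x_cond, t)
# Note the U-Net's first convolution must accept the extra input channel.
```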
I hope it was useful — see you next time. And finally, as promised, I'll just speak briefly about the U-Net. Don't get me wrong, this architecture is not really tweaked a lot; it is basically a generic U-Net that borrows from the ResNet architecture. It is an hourglass network: it first takes an image and downsamples it down to a bottleneck, and after the bottleneck it upsamples the image again. This is the actual U-Net class. Its inputs are the image and the time indices, and you could also pass labels here if you decide to condition on the class labels. First there is a process of downsampling — and just to remind you, since this is based on ResNet-50, we also have the notion of adding residuals to the output of the convolution after every downsampling step — and conversely, after we hit the bottleneck, we start to upsample.

The block itself is also pretty generic. The only special thing is that we condition on two things. Our x here is a feature map derived from the input image, which we process with a convolutional layer, a non-linearity, and batch norm. But then we also add an embedding derived from the time index. The time index is embedded using sinusoidal position embeddings — a method known from the Transformers paper, "Attention Is All You Need" — which is a pretty clever way to encode the notion of order into a network. In Transformers, the input sequence (for example, the sentence fed to the Transformer) was never a list; it was effectively a set, with no notion of order between the tokens, so the sinusoidal position embeddings in that paper carried auxiliary information about the order of those tokens. Similarly, in this implementation we use those embeddings to tell the network that time step zero comes before time step one, which comes before time step two: we inject the notion of order, of temporal dependency. Then we concatenate the information from the image feature maps with the information from the time-step embedding, and if we have labels, we can do a similar thing: map the label information into some latent space and add it to, or concatenate it with, the information so far.
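To make the time-embedding idea concrete, here is a minimal sketch of sinusoidal position embeddings as introduced in "Attention Is All You Need", applied to the scalar time step; the class name and default dimension are my own choices, not the exact module from the video:

```python
import math
import torch
import torch.nn as nn

class SinusoidalPositionEmbeddings(nn.Module):
    """Map an integer time step t to a fixed-dimensional embedding vector."""

    def __init__(self, dim=128):
        super().__init__()
        self.dim = dim

    def forward(self, t):
        # t: shape (B,), long tensor of time steps
        half_dim = self.dim // 2
        freqs = torch.exp(
            -math.log(10000) * torch.arange(half_dim, dtype=torch.float32) / (half_dim - 1)
        )
        args = t.float()[:, None] * freqs[None, :]           # (B, half_dim)
        return torch.cat([args.sin(), args.cos()], dim=-1)   # (B, dim)

# Inside a U-Net block, this embedding is typically passed through a small MLP
# and added to (or concatenated with) the convolutional feature maps.
```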
Info
Channel: dtransposed
Views: 14,737
Id: S_il77Ttrmg
Length: 84min 6sec (5046 seconds)
Published: Mon Feb 06 2023