Ultimate Guide to Diffusion Models | ML Coding Series | Denoising Diffusion Probabilistic Models

Captions
What's up guys, in this video I'm focusing on diffusion models, the family of models powering some of the most famous AI systems of the last couple of months, such as DALL·E 2, Imagen from Google, GLIDE from OpenAI, and many others. The video is going to be quite ambitious: I'll first walk you through two of the seminal papers behind these models, and then I'm actually going to go through the code base. Skimming the papers just serves the purpose of showing you the mathematical formulas, which we can then map and relate to the actual code, so it's not going to be a deep dive into the papers per se, but hopefully it gives you the necessary context to cope with the code later on. Having said that, I'm covering two papers. One is Denoising Diffusion Probabilistic Models, which is pretty much the paper that made diffusion models practical, and the other is Improved Denoising Diffusion Probabilistic Models from OpenAI, and it's the code base behind that second paper that I'll actually be walking through. Let's start with the first one, since it introduces the necessary basics; if you haven't learned anything about diffusion models yet, hopefully this gives you some context. I did cover the GLIDE paper, where I already covered some diffusion background, so do check that out, I'll link it somewhere here, but this video is fairly self-contained, so you can just keep watching. Okay, here is how a diffusion model looks at a high level. You start from an image and then slowly, gradually add Gaussian noise on top of it; that's called the forward diffusion process, and ultimately you end up with an image that is, as you can see, complete noise. Now, if you learn how to reverse this process, the reverse process denoted p_theta, and you run that training procedure over all of the images in your data set, you eventually learn the underlying data distribution; then you can start from a random noise sample and keep denoising it until you end up with a hallucinated new image from that distribution, a novel image, obviously. That's the very high-level explanation; let me now walk you through the formulas. Again, this is how the forward process looks, this is the joint distribution, and if we want to do one step of the reverse process, here is how we do it: we learn a neural network that predicts mu_theta and Sigma_theta, the mean and covariance of a Gaussian. There is a bunch of theory for why this is possible: they showed that if the steps are small enough, so if your forward process adds Gaussian noise in small increments, then the reverse process can also be approximated by sampling from a Gaussian. It's not completely obvious why that's true, so we'll have to take it for granted. The idea, then, is to learn those two quantities.
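To make the forward process concrete, here is a tiny sketch (my own illustration, not code from the repo) of a single noising step, q(x_t | x_{t-1}) = N(sqrt(1 - beta_t) * x_{t-1}, beta_t * I):

```python
import torch

def forward_diffusion_step(x_prev, beta_t):
    """One forward step: downscale the previous image to form the mean, then add scaled Gaussian noise."""
    mean = torch.sqrt(1.0 - beta_t) * x_prev      # sqrt(1 - beta_t) * x_{t-1}
    noise = torch.randn_like(x_prev)              # epsilon ~ N(0, I)
    return mean + torch.sqrt(beta_t) * noise      # sample from N(mean, beta_t * I)

x_prev = torch.randn(2, 3, 64, 64)                                # a dummy image batch
x_t = forward_diffusion_step(x_prev, beta_t=torch.tensor(1e-4))   # one noising step
```

Repeating this for t = 1 through T with a schedule of betas is exactly the forward chain described above.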
Let's continue. Here is the actual forward process and how we sample from it: you downscale the current image, that's how you form the mean, then you have this covariance matrix, and you just sample from this Gaussian to end up with x_t. So we condition on x_{t-1}, and by sampling from this distribution we end up with x_t; that's one step of the chain. What's next? They showed that you can train these models by optimizing this variational bound. The idea is this: we have the log-likelihood of our data, which we obviously want to maximize; we want to tweak the model so that all of the data points from our data set are highly likely under it. That's how we train pretty much all generative models, not only diffusion models. And this is the standard thing with variational bounds: you find a surrogate loss, which is a lower bound on the log-likelihood, and by maximizing it you're certain that the likelihood of your data is going to be at least as big. That's the main idea. I'm not going to dig into the formulas here; we'll later see how they decompose this into an actual expression that gets leveraged in the code. Now, this here is a super important finding. Instead of having to sample step by step during the forward process (and in practice they use 1,000 steps), they show you can sample an arbitrary x_t directly from x_0, where x_0 is your original image, by sampling from this single Gaussian. These are some important coefficients we'll be seeing in the code: we have the alpha_t's, which are 1 minus beta_t, and beta_t, as you can see here, is just the covariance at step t. In practice the original DDPM paper used fixed schedules, whereas later papers, such as the improved one from OpenAI, used learnable schedules; by a schedule I just mean how beta varies as we go through the forward process. Then there is alpha_t bar, which is just the product of the alphas from 1 up to t; t = 1 usually denotes the start of the process, before the image has been deteriorated, and as t grows we move towards the image becoming pure Gaussian noise. So here is the expression: you multiply the original image by the square root of alpha_t bar, that's how you form the mean, this is how you form the variance, and then you just sample from this distribution and you get x_t. That means we can immediately get an arbitrarily noisy image, which is nice. I mentioned the loss we'll be using; here it is, just reshaped into a different form, the form that's going to be actionable and that we'll use in the code. We have L_0, the L_{t-1} terms, and L_T, so three classes of similar components.
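That nice property is worth writing down as code. Here is a minimal sketch (my own, assuming a torch tensor of precomputed alpha bar values) of jumping straight from x_0 to x_t:

```python
import torch

def q_sample(x_start, t, alphas_cumprod, noise=None):
    """x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps, sampled in a single jump."""
    if noise is None:
        noise = torch.randn_like(x_start)
    a_bar = alphas_cumprod[t].view(-1, 1, 1, 1)        # per-example alpha_bar_t, broadcast over C, H, W
    return torch.sqrt(a_bar) * x_start + torch.sqrt(1.0 - a_bar) * noise

betas = torch.linspace(1e-4, 0.02, 1000)               # an illustrative linear schedule
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)     # alpha_bar_t for every t
x_0 = torch.randn(2, 3, 64, 64)
x_t = q_sample(x_0, torch.tensor([999, 10]), alphas_cumprod)
```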
Of those three, the L_{t-1} terms are the super important ones: each is basically a KL divergence between one step of the learned reverse process, as you can see here, and this forward-process posterior. Then we have the L_T component, which in this paper they can ignore, because it's pure Gaussian and, since they don't learn the variances, it's a constant term. And this L_0 term here is the negative log-likelihood of the image conditioned on the previous step. Anyway, there are a lot of details and I'll have to hand-wave over some of this; what matters is that you see the formulas. Okay, so what's next? A cool thing is that we can calculate this posterior of the forward process analytically: you can see how mu_t tilde is computed and how beta_t tilde is computed here. These expressions are all going to appear in the code, so take mental notes, although I'll be comparing the formulas with the code side by side, which should be useful for you guys. And because this is a Gaussian and this is a Gaussian, when you take the KL divergence between two Gaussians you end up with simple analytical expressions; we'll see those a bit later. So what's next? Here is what the L_{t-1} terms simplify to: we just compute the MSE, the mean squared error, between the means, where this one is the learnable one and this one comes from the forward posterior. That can be simplified further with some simple algebra, because we know that x_t equals this expression, which follows from the so-called nice property, this property here. You can think of sampling from that Gaussian as equivalent to computing the following: to get x_t you take the square root of alpha_t bar times x_0 and add the square root of one minus alpha_t bar (square root because this is a variance and we want a standard deviation) times epsilon, where epsilon is a sample from a standard normal distribution, a Gaussian with mean zero and variance one. So we go back here, solve that expression for x_0, plug it into the posterior mean, and we get the next expression, which simplifies further because we know how this term up here is computed. It's just a bunch of symbol manipulation and I'm skimming over it, but you end up with this expression, and this is what you want your neural network to learn: your diffusion model is going to learn how to predict the noise here. So what does that simplify to? We want the learnable mean to be equal to this term here, because then the loss will obviously go to zero; that's what's shown here.
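Since the slides aren't visible in text form, here are the forward-process posterior formulas being referenced; these are the standard DDPM expressions (equations 6 and 7 in that paper), reproduced for reference:

```latex
q(x_{t-1} \mid x_t, x_0) = \mathcal{N}\!\left(x_{t-1};\; \tilde{\mu}_t(x_t, x_0),\; \tilde{\beta}_t I\right)

\tilde{\mu}_t(x_t, x_0) = \frac{\sqrt{\bar{\alpha}_{t-1}}\,\beta_t}{1 - \bar{\alpha}_t}\, x_0
  + \frac{\sqrt{\alpha_t}\,(1 - \bar{\alpha}_{t-1})}{1 - \bar{\alpha}_t}\, x_t,
\qquad
\tilde{\beta}_t = \frac{1 - \bar{\alpha}_{t-1}}{1 - \bar{\alpha}_t}\, \beta_t
```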
And if we learn that mean, then sampling one step of the reverse process is again a simple computation: here's the mean, the same expression as the one above, and then we just add the standard deviation times z, where z is a sample from a normal distribution. Ultimately, when you simplify that expression even further, you end up with this parameterization: you can either learn a neural network that predicts the mean, or you can instead learn just this term here, the actual noise, and that's what they end up doing. So this expression, as you can see, is just a mean squared error between the sampled noise and the predicted noise; we're learning what noise was added to our image. And then we have these weights for each of the L_{t-1} terms (those are the loss components we saw), and we'll see that this weight is a function of the step in the diffusion process. I know this is a lot of formulas; bear with me, it gets much easier as the video progresses, because I'm going to start introducing code as well. But let me quickly explain what this means. We start from an image, some image, let me just draw a human being inside of it, and then we add some noise on top of it; let me change the color, so imagine we added this green noise here. Now your diffusion model is learning this green stuff, and if it learns the green stuff, then you know how to go backwards: you learn the noise, the epsilon here, and you know how to denoise your images. It's kind of magical, and if it doesn't click immediately, that's fine; it's not completely straightforward to understand why this works, I'm still struggling with it myself, to be honest, but the formulas are here and we're going to follow them for the time being. Okay, I'm going to skip this part. Starting from this L_{t-1}, they derive, basically empirically, this simplified objective, where they drop the weighting term, the one that depends on the time step of the diffusion process, and we end up with this expression here. So, as you can see, we're doing a simple MSE loss on the noise: we're trying to predict the noise that was added on top of the image, and we do that for various time steps of the diffusion process. Again, what we're doing is this: we start from an image, add some noise on top of it, end up with a new image, keep doing that for, say, a thousand steps, and end up here; and at each step of the way we try to figure out which noise was added here, which noise was added here, and which noise was added here. By doing that we learn the reverse process of the diffusion, and that's going to give us a powerful generative model.
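Pulling the last few formulas together, here is a minimal sketch of what one training step with the simplified objective looks like; model here is just a stand-in for the noise-predicting network, not the repo's actual class:

```python
import torch

def l_simple(model, x_start, t, alphas_cumprod):
    """Simplified DDPM objective: noise the image in one jump, then regress the added noise with MSE."""
    noise = torch.randn_like(x_start)
    a_bar = alphas_cumprod[t].view(-1, 1, 1, 1)
    x_t = torch.sqrt(a_bar) * x_start + torch.sqrt(1.0 - a_bar) * noise   # the "nice property" jump
    eps_pred = model(x_t, t)                                              # network predicts the noise
    return ((noise - eps_pred) ** 2).mean()                               # plain MSE, no per-step weights
```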
Let me just show you some images they get from the model; not that important, because you already know these models are super powerful. Here you can see how the sampling process looks in practice: you start with a noisy image and gradually keep denoising it until you get the final sample, an image from the underlying data distribution. They also show how the result depends on which latent you start from: if you start from a latent like x_1000, meaning a thousand steps of diffusion, and then run three independent reverse processes (they are stochastic), you end up with three different images; but as soon as you start from latents taken later in the reverse process, say x_500, the three independent reverse processes lead to images that are quite similar. Having seen all of that, let me now quickly walk you through the innovations the second paper brought; it builds directly on top of Denoising Diffusion Probabilistic Models, or DDPM for short. Let me show you the main contributions. First, they have a learnable variance. This is how they do it (remember this formula, we're going to see it a bit later): they predict this vector v and then interpolate between beta_t and beta_t tilde, where beta_t tilde is the posterior variance and beta_t is the forward-process variance. So that's one thing, this formula here. The second thing is that they use this hybrid loss: L_simple, which we just saw, is the one where you drop the terms that depend on the time step, and L_vlb is the variational lower bound, the original loss with all of those more complex terms. By forming this weighted average of the two, using L_vlb to learn the variance and L_simple to learn the mean, they show this was the best trade-off. There's a lot of experimentation going on here, a lot of hacks put on top of DDPM to make this work, just as DDPM itself had a bunch of hacks, such as using constant variances instead of learning them, and so on; there are a lot of hacks in diffusion models, at least in these earlier papers. They say here: along the same line of reasoning, we also apply a stop-gradient to the mu_theta output for the L_vlb term, which translates to: this component is only going to train the variance expression here. So that's the second thing, actually. The third thing is that instead of using a linear noise schedule, where those betas form a simple linear sequence, they propose this cosine schedule. What that brings is that alpha_bar_t drops much more gradually compared to the linear schedule, and since alpha_bar_t directly determines the amount of noise, using the linear schedule leads to noisier images earlier on in the forward diffusion process. That's the third thing they do, and it helps a lot.
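Two of those contributions are easy to write down. A short sketch (my own condensation of the formulas in the improved DDPM paper) of the cosine beta schedule and the learned-variance interpolation; v here is assumed to already lie in [0, 1]:

```python
import numpy as np

def cosine_betas(num_steps, s=0.008, max_beta=0.999):
    """Cosine schedule: define alpha_bar(t) via a squared cosine, then back out beta_t from its ratios."""
    alpha_bar = lambda t: np.cos((t / num_steps + s) / (1 + s) * np.pi / 2) ** 2
    return np.array([min(1 - alpha_bar(t + 1) / alpha_bar(t), max_beta) for t in range(num_steps)])

def learned_sigma(v, beta_t, beta_tilde_t):
    """Improved-DDPM variance: interpolate between beta_t and beta_tilde_t in log space using v."""
    return np.exp(v * np.log(beta_t) + (1 - v) * np.log(beta_tilde_t))
```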
Then there's one more very cool thing: using the values of the loss at each time step to decide how much weight to put on that time step. It's a kind of middle ground between the simple objective, where all of the constants in front of the L_{t-1} terms are just equal to one, and L_vlb, which has those complex weighting expressions. With this importance sampling the weight depends on the loss: if, for example, the model is struggling with one particular point in the process, say the i-th one, you increase the weight for that particular x_t, so you put additional focus on predicting the noise that was put on top of that image. That's the idea behind this expression. Let me continue; we're almost done. Finally, expression 19. It's not that important, but what they show is that during training and during sampling you don't have to use the same length of diffusion process. For example, your training chain has 4,000 steps, whereas at sampling time you can use just 100 latents; they show how to remap the betas and the beta tildes, the posterior variances, so that this works out nicely, and they still get high-quality images while saving a lot of computation, which is super important because you don't want to run 4,000 steps every time you generate an image; that would be super expensive. Okay guys, that's pretty much it; a couple more things and we're done with the papers. They show, and this is super interesting, scaling laws for diffusion models, and this was back in 2020, so if you read this paper you could have expected that they were going to do the same thing as with GPT-3, which is to scale these models up, and that eventually led to GLIDE and then DALL·E 2. You can see here we again have a power law: FID, the metric that tells you how high quality your samples are, keeps getting smaller and smaller as the model size increases, whereas the NLL, the negative log-likelihood, does not exactly follow a power law but still keeps going down. In any case, these two charts indicate that scaling up diffusion models will probably be a good avenue for future research. Let's see the conclusion: the likelihood is improved by learning the variances using our parameterization and hybrid objective; furthermore, we have investigated how DDPMs scale with the amount of available training compute, and found that more training compute trivially leads to better sample quality and log-likelihood. Okay guys, hopefully you got the gist of diffusion models. We have the forward process of noising the images, then the reverse process of denoising them, which we learn, plus a bunch of hacks on top to get the whole thing to work. Ultimately, these are now the best generative models we have: compared to GANs they are much better at covering all the modes of the data distribution, whereas we know GANs suffer from mode collapse, and diffusion models are also much more stable to train. Basically, diffusion models are now where GANs were a couple of years ago. Cool, guys, let's now switch to the code; I'm going to show you how this thing is actually trained and how we sample from diffusion models. Before that, let me just show you the architecture they use to learn the epsilon, the noise, during diffusion model training: it's a simple U-Net. We're going to see this in the code; it's arguably less important, you could use other architectures as well, they just stuck with the U-Net. Okay, let's start with the training. I went ahead and downloaded the GitHub repo
and had to make some modifications to get it to work on my single-GPU Windows machine. I can push those changes to my GitHub; do let me know if you want me to do that, and if so I'll just publish the modified code to my repo. So I created this launch script, where I basically used the recommended settings they show in the readme of the repo we'll be using, the improved-diffusion repo, so you can take a look at that readme yourself. That's what I've set in my launch config, and now we can start training. The config itself is less important; we'll see what's going on here in a minute. Okay, let me start training. We first get a bunch of arguments; here they are, I'll print them out, but don't try to understand them all right now, there are too many of them and we'll gradually analyze what's going on. We have the CIFAR-10 data set, which I downloaded using their CIFAR script, then a bunch of other stuff, learning rate and so on; we'll see all of that later. I'm going to ignore the parts that are not crucial to understanding diffusion models, which means ignoring the distributed training, the loggers, all of that, and focusing only on the diffusion part. So here is the first important step: we take these arguments, convert them into a dictionary, and start creating the actual U-Net model and then the diffusion object. Let's see how the model is constructed. We pass the image size, which is going to be 64. They usually work at lower resolutions because, remember, the latents have the same dimensionality as x_0, the original image; contrast that with VAEs or GANs, where the latents are always of much smaller dimensionality, so diffusion is a bit more computationally intensive. Because of that, what they do in practice is train models for smaller images and then additionally train super-resolution models; they have scripts for that in the repo, the super-resolution training and sampling scripts, but I'm going to skip those because they work very similarly to the base diffusion training, and focus on these two scripts: image_train and image_sample. Anyway: number of channels is the internal dimension of the U-Net; number of residual blocks, nothing interesting there; learn_sigma set to true means we are learning the variances instead of using the fixed ones, which is the innovation from the improved-diffusion paper. Class conditioning we will not be using, but it's very easy to add; we'll see how they do temporal conditioning, that is, how do we pass t, which is a scalar, into a neural network? We're going to see that they just use simple sinusoidal embeddings, the same as the original Transformer paper, and I'll show you how they fuse that into the U-Net model; class conditioning would be done the same way, if we were using it, which we're not. Okay, next is gradient checkpointing, which is just an optimization technique that I'm mostly going to ignore.
What checkpointing does is the following: during the forward pass, instead of storing the activations for every single layer, you skip that, and because of that you save a bunch of memory; the downside is that when you do backprop, in order to calculate the gradients, you have to redo those computations, so you're trading time for memory. You'll spend more time but save memory. We will not be using checkpointing, so that's just FYI. Certain layers of the U-Net have a ViT-style attention, meaning each image token attends to every other token, and that's what this setting specifies: 16 means that at the 16 by 16 resolution they'll be doing this attention, and the same at the 8 by 8 resolution; when I say 8 by 8, that's inside the U-Net, because you know the U-Net has that characteristic shape. Number of heads is again just a parameter of that attention layer, nothing important. We'll also see how this flag is used: depending on it, there are two different ways of conditioning the image features on the time steps; we'll see how that plays out a bit later. Okay, so here is the create_model function: because the image size is 64, this is how they specify how the U-Net will be constructed. Then we have this attention_ds, which converts the attention resolutions into how many downsampling layers we have to wait for inside the U-Net before we start using those attention layers; so it's just another way of specifying where the attention layers get inserted. Not that vital, you can ignore it if that didn't land. The number of input channels is obviously 3, since we're dealing with RGB images; we specify the number of internal channels here; and because we are learning sigma, we end up with six output channels, where the first three channels predict the epsilon, the noise, and the second three predict the actual variances. That's why we have six. Number of blocks, dropout, and so on, nothing special; we're not using class conditioning, so the number of classes ends up being None; we're not using checkpointing; and we specify the attention details. Okay, let's enter the constructor. I'm going to quickly walk you through the U-Net, step by step: we'll first see how the U-Net works and then how it fits into the whole training loop later on. So here we just store all of these parameters into internal fields, nothing fancy. Then we create this sequential layer, an inverse-bottleneck-shaped MLP; it's a simple MLP that's going to transform the sinusoidal embeddings before we use them to condition the model, and we'll see that a bit later. We can ignore the class-conditioning part because we're not using classes, and now we start adding blocks to form the U-Net. There are three groups of blocks: the input blocks, the middle block, and the output blocks, which corresponds to what we saw in the diagram (whoops, my OneNote is glitching, you can see it here): we have the input blocks in the first part, then the middle block, and then the output blocks. That's roughly how this code is structured.
Let me now get back to the code and quickly walk you through it. I'm not going to dig into every detail; there are a couple of important things I want to show you, mainly how they fuse the temporal information with the image information, which is the vital thing here. So there's this wrapper object called TimestepEmbedSequential, which we're going to see a lot. What it does is the following: if a layer inherits from TimestepBlock, which is just a simple dummy interface whose forward function supports both x, the image representation, and the temporal embeddings, then the layer gets called with both arguments; if it doesn't inherit from TimestepBlock, we just pass x and ignore the embeddings. As I said, a simple wrapper, nothing too interesting. So the first thing we do is create this conv2d layer. Why conv2d? Because conv_nd is a generic layer they created, and since the number of dimensions is two we end up with a conv2d; that's how the U-Net starts. Next up, let me show you what's going on here: we have the channel multipliers, we iterate through this array, and then, and this is the interesting part, we start adding residual blocks. The interesting part of a residual block is really only its forward function, so I'll put a breakpoint there, and later on I'll show you how this temporal fusion happens; for now let me just quickly step through the constructor of the ResBlock. We store the number of channels, the number of channels for the temporal vectors, and so on, nothing fancy. Let me see whether there's something interesting to focus on here: we specify these in_layers, we specify the emb_layers, and we'll see how these are used a bit later, so bear with me; finally we have the out_layers with normalization and SiLU, which is just an activation function (there are a zillion of these activation units and it's not even worth dwelling on). Basically all of the fun happens later during the forward prop, and that's when I'll step into the ResBlock and show you how the temporal information is mixed into the network. After that, sometimes we add an attention block, as I said, if we've downsampled enough times, which we haven't at this step, so we skip that part for now; and then, as you can see, we add the layers we accumulated during the loop and wrap them into this TimestepEmbedSequential, which is again that useful wrapper we saw a couple of minutes ago. That's it, guys, nothing fancy there. So I'm going to skip to the middle block; let me ignore this thing here. The middle block consists of a ResBlock, an attention block, and an additional ResBlock, and I can skip all of that. Let's continue: finally, the output blocks are pretty much the same list of objects we just saw in the input blocks, and you can see that by reversing here we create a symmetric construction for the output blocks. Because of that, I'm just going to skip all of this and end up here,
and here we add the normalization layer, which is I think group normalization, but again not that important, and finally we end up with a conv layer. This zero_module just zeroes out the weights of those kernels; I'm not sure why they do that, so if anyone knows why they initialize some of the layers with all-zero weights, feel free to comment down below. Cool. The main takeaway from the U-Net model is that it has this very particular shape, with the input blocks, middle block, and output blocks, and the most important thing I want you to remember is the part where the temporal information gets mixed in; we're going to see that a bit later, but keep it in mind for now. Okay, that was the creation of the model; we do the same thing for the diffusion. As I said, I'm using 50 steps, whereas it's 4,000 by default; that's just so I can train a bit quicker on my machine, otherwise it's very slow to actually train this. We have learn_sigma, nothing special there. The noise schedule is going to be linear, meaning the betas for the forward process are sampled from a linear function instead of the cosine one. We don't use the KL loss; if we set that to true we'd be using the variational lower bound loss on its own, but instead we're going to use the hybrid loss, and we'll see all of that a bit later. We are not predicting x_start (x_start is just the starting image, the original image from your data set), and we're not rescaling time steps; those parameters aren't that important, we'll see what they do later. Okay, so here we first form the beta schedule: we pass the name of the schedule and the number of steps, and this just returns the betas we saw in the paper at the beginning of the video. Let me quickly show you the two schedules they support. One is linear, as mentioned; it's just a simple linspace between beta_start and beta_end, linearly interpolated over the number of diffusion steps. For the cosine schedule you have this slightly more complex expression; we saw those equations in the paper, so I'll ignore them for now. Okay, let's see which loss we pick: we're going to use something called rescaled MSE, rescaled mean squared error, because we're using the hybrid loss, and we'll see how that plays out a bit later. Timestep respacing is again only important when you want to reduce the number of steps during the sampling process, which we won't be using, so we can ignore all of that; this is the function that implements that logic, and in our case we can skip it. Let me show you what the steps look like: it's basically just np.arange, so you end up with, as you can see, 0 through 49, because we have 50 steps, and there's no subsampling happening in this training run. We pass the betas. The model mean type is going to be epsilon, meaning we'll be predicting epsilon instead of predicting x_start. In some of the ablations they actually tried predicting x_start, the original image; so instead of trying to predict the noise that was added on top of the original image, why not try to predict the original image itself?
But empirically, they showed it's better to predict the epsilon. Okay, let's continue. What's the model variance type? Because learn_sigma is set to true, we'll be using the learned range, so we'll be learning the variances. The loss type is, as we saw, rescaled MSE, and that's pretty much it. Now let's step into the SpacedDiffusion object; the important piece here is the construction of this GaussianDiffusion object, so let's see how that thing is going to look. So this is the GaussianDiffusion class, and this is the file where all of the magic happens: gaussian_diffusion.py is the most important file in this whole code base. Again, we specify the mean type, which is epsilon, the variance type, which is learned, and the loss type, rescaled MSE, and so on. Next we convert the betas into a numpy array and do some error checking, asserting that it's a 1-D array and that all of the betas are greater than zero and less than or equal to one. We grab the number of steps, and here we start computing those coefficients we saw in the paper. For these formulas I'm just going to put things side by side, so it's going to be easier for you to follow what's going on. Okay, let's start. Here are the alphas, the coefficients we saw: basically one minus the betas, that's how we form the alphas. Then we have the alpha bars, which are just the cumulative products of the alphas, and you can see how we form those: it's called cumprod, a cumulative product, so we just call np.cumprod and end up with an array whose length is still going to be 50. That means that by indexing into this array with t we get the product of the first t alphas; so it's not that we just calculated one single product over all of the alphas, we actually have an array that contains alpha_t bar for every t, which means 0 through 49.
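To keep track of what's being precomputed, here is a short sketch of those arrays; the beta endpoints are illustrative (the repo rescales the linear range depending on the number of steps):

```python
import numpy as np

num_steps = 50
betas = np.linspace(1e-4, 0.02, num_steps)                      # linear schedule (endpoints illustrative)
alphas = 1.0 - betas                                            # alpha_t = 1 - beta_t
alphas_cumprod = np.cumprod(alphas)                             # alpha_bar_t for every t = 0..49
sqrt_alphas_cumprod = np.sqrt(alphas_cumprod)                   # multiplies x_0 in the "nice property"
sqrt_one_minus_alphas_cumprod = np.sqrt(1.0 - alphas_cumprod)   # multiplies the noise
sqrt_recip_alphas_cumprod = np.sqrt(1.0 / alphas_cumprod)       # used when recovering x_0 from epsilon
```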
Okay, next up we have this coefficient called alphas_cumprod_prev. What we do here is a right shift: we prepend 1 as the first element and then append the first n minus 1 elements of the array; we also have alphas_cumprod_next. Where are those used? Let me just check; I'm fairly sure it's for the posterior. Yes, it's used for the posterior of the forward process, for its mean and, as you can see here, for its variance, and we'll see how those are constructed in a second. Next we construct sqrt_alphas_cumprod; those are used here, so let me change the color: this is the term we just constructed. Then we create sqrt_one_minus_alphas_cumprod; let me see whether we have that one somewhere here, I don't think so, but it's going to be used somewhere later on. We do the same thing for the log version, and there are a lot of these. This one is sqrt_recip_alphas_cumprod, the square root of one over alpha bar; that expression is actually used for this component in the loss, as you can see here. As I said, for each of these they calculate all of the coefficients. We have the posterior variance, which is again this expression here; let me show you that this is indeed the case. We have the betas, here they are, we have one minus alphas_cumprod_prev, which is the numerator of our expression, and then the denominator, which is one minus alphas_cumprod. So, as I said, they're literally going through the formulas in this paper and calculating all of the necessary coefficients. We do the same thing here, and then, let me quickly show you these two, we have the posterior mean coefficients: the first one is equal to the betas, as you can see here, times the square root of alphas_cumprod_prev, which is that square root of alpha bar t minus one, divided by one minus alpha bar t. Hopefully this convinced you that they're just computing all of these coefficients from the paper. I'm going to go back to full screen and continue with the code. Okay, so this next part is again not interesting, because it only matters when we're subsampling the steps during sampling, that is, using fewer steps during the sampling procedure, which we will not be doing, so I can safely ignore all of it. Finally, they use those betas to construct this GaussianDiffusion model; the earlier one was just used for this respacing logic, and for all practical purposes this init function constructs the same Gaussian diffusion object we just saw, so I'll skip everything here, and we end up with a GaussianDiffusion object.
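Here is a condensed sketch of the posterior coefficients that were just walked through, written the way the constructor effectively computes them (my own shortened version, not a copy of the repo):

```python
import numpy as np

betas = np.linspace(1e-4, 0.02, 50)
alphas_cumprod = np.cumprod(1.0 - betas)
alphas_cumprod_prev = np.append(1.0, alphas_cumprod[:-1])          # alpha_bar_{t-1}, right-shifted

# posterior variance: beta_tilde_t = (1 - alpha_bar_{t-1}) / (1 - alpha_bar_t) * beta_t
posterior_variance = betas * (1.0 - alphas_cumprod_prev) / (1.0 - alphas_cumprod)

# coefficients of x_0 and x_t in the posterior mean mu_tilde_t(x_t, x_0)
posterior_mean_coef1 = betas * np.sqrt(alphas_cumprod_prev) / (1.0 - alphas_cumprod)
posterior_mean_coef2 = (1.0 - alphas_cumprod_prev) * np.sqrt(1.0 - betas) / (1.0 - alphas_cumprod)
```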
Okay guys, so we have the U-Net, and we saw how they literally go formula by formula and compute all of those coefficients inside the constructor of the GaussianDiffusion object. Now let's go further; we exit all of these functions and we're back in the main function. They just push the model to the GPU, in case you have one, and then they create this schedule sampler. This one is fairly interesting, so I'm going to show you how it looks. It basically specifies how we want to sample time steps during training, so which of those L_{t-1} loss components we want to train on. By default they use a uniform sampler, which means all of the steps are equally likely, but they also have that other sampler I mentioned, where depending on the loss of a particular L_{t-1} they'll upscale that weight, so that we focus more on learning that particular part of the diffusion process. Let me just step through here. Here is the UniformSampler: the weights of the uniform sampler are obviously all ones, so with 50 steps we have 50 ones, and then during sampling this sample function gets called (it's defined on the base schedule sampler that the UniformSampler inherits from). Here's what happens: we take the weights, which are all ones, divide by their sum so that we normalize and actually get a probability distribution, and then we just call np.random.choice. So that's how we weight the 50 time steps, and we sample batch-size many of them; then there's some conversion, nothing fancy. The more interesting sampler is that second one, the loss-second-moment resampler, which I'll just briefly mention but not step into, maybe later if we have enough time. That's the one where they use the loss history to decide the weights: as you can see, the components where the loss is bigger get bigger weights. But we won't be using it, so I'll skip it for the sake of time. Okay, let's continue. We have our uniform sampler, and now we create the data set. I'll just be using CIFAR-10, but that doesn't have to be the case; they also used ImageNet 64 by 64 in the paper, and you can use whatever data set you want. And that's it; now we have the training loop, which is where the actual training happens. We pass in the model, that's the U-Net, we pass in the diffusion object on top of it, we feed in the data, the batch size, and the micro-batch size; we'll be chunking the batch into micro-batches because, as I said, this is fairly memory intensive, which is why they have all of these optimization methods, like checkpointing and micro-batching, to cope with the excess memory. Then EMA, the exponential moving average, because they'll actually be using the EMA weights for sampling later on, nothing fancy there; logging, saving, resuming from checkpoints, and so on; FP16, but we don't care about mixed-precision training, I just want to focus on the diffusion; we pass in our uniform sampler, and some optimization details like weight decay for AdamW. Okay, we can enter this train loop. Let's see what's going on. We pass in all of these arguments and store them in internal fields; I'm going to skip through all of this, nothing interesting. We have our model parameters and the master parameters, which is again a consequence of training this in a distributed fashion across
multiple machines; for all practical purposes we don't care about that, we just have a single set of parameters. The next bit is again a consequence of distributed training, so we don't care about it. They create the AdamW optimizer. Then, for each of the EMA rates specified, they create a deep copy of the parameters; because we only have one EMA rate, we'll just have a single copy of the parameters, again not that important for you to understand. Then there's DistributedDataParallel, a wrapper in PyTorch, which we also won't be using, so for all practical purposes we can ignore it. Let's jump into the actual meat of the training code. We first sample a single batch of images, and potentially the classes that correspond to those images. Let me quickly walk you through how this load_data function works: it collects all of the image paths from my data set (I did go and download those, as I told you, so you can see them here), and if I step over that, it collects all of my images. Let me show you: whoops, if I take the all_files list and print the first two paths, you can see the first two PNG files, and those are exactly these first two images here. Here is the first one; well, it's super small, so you won't see much, it is CIFAR after all. Okay, then they go on to form this image dataset, where they just do some resizing and cropping, nothing too interesting, then we have the data loader with a batch size of 128, and we finally yield a batch from the data set. That ends up giving us a batch which I expect to be 128, 3, 64, 64, because it's CIFAR, so RGB images, and 128 is the batch size; let's see whether that's the case, and indeed it is. For conditioning we don't have anything, so it's just an empty dictionary, as you can see here. Cool, so here's the first step, and this is where all the magic happens; I'm going to ignore everything that comes afterwards because it's just your common machine learning boilerplate code. So we have this forward-backward call here. We first zero the gradients of our U-Net; we want to clear them before we recompute them and do the update. Then we load the first micro-batch: you can see we just subsample the batch and end up with 2, 3, 64, 64, because our micro-batch size is two, and we do the same for the conditioning, which, being an empty dictionary, we don't really care about; we also check whether this is the last chunk, which it's not, because we just started the loop. And now this is where we do the uniform sampling of the diffusion time steps: we sample the t's, the time steps, one for each image in our micro-batch. So let's do that; we end up with two random t's, 44 and 3. That means for the first image we'll add noise 44 times (in practice it's a single step thanks to that nice property we saw in the paper), the second one gets three steps of noise, and then we try to predict that noise back; that's going to be the goal. So this is the main function in this training code, this training_losses function, and we'll see it in a moment.
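For reference, the uniform timestep sampling that just produced t = [44, 3] boils down to roughly the following; this is a sketch of the logic, with the importance weights included for completeness (they are all ones in the uniform case):

```python
import numpy as np

def sample_timesteps_uniform(num_diffusion_steps, batch_size):
    """Sample one timestep per example with equal probability, plus importance weights (all ones here)."""
    w = np.ones(num_diffusion_steps)
    p = w / w.sum()                                               # uniform probability over steps
    t = np.random.choice(num_diffusion_steps, size=(batch_size,), p=p)
    weights = 1.0 / (num_diffusion_steps * p[t])                  # importance-sampling correction
    return t, weights

t, weights = sample_timesteps_uniform(50, batch_size=2)           # e.g. t = [44, 3]
```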
We wrap that training_losses function using functools.partial so that we don't have to pass these parameters every single time, just a convenience wrapper; and again you can ignore all of this other stuff, because it's just distributed-training code, nothing interesting. Let's focus on compute_losses, the most important function here. Okay, so what's going on? First of all, we generate noise with the same shape as our input micro-batch: because x_start is again 2, 3, 64, 64, we end up with a standard Gaussian tensor of the same shape. So that's our noise, with the same shape as the input images, and then we do the q_sample. Let me remind you what that is: here's the formula we're going to use, the computation I showed you earlier. Let's confirm that's indeed the case; let me check whether I have a breakpoint here, and I do. So here we are, inside that function, and you can see this is everything it does. We have this _extract_into_tensor call (we'll see what it does in a moment) on sqrt_alphas_cumprod, which is this part here, let me change the color; we multiply that element-wise with x_start, which is x_0, this thing here, and then we add sqrt_one_minus_alphas_cumprod, this part here, multiplied by the noise, as you can see. So we are literally just computing the formula from the paper, nothing fancy there. Now, let me show you just this once how this _extract_into_tensor function works, for the sake of understanding the code; I'll add a breakpoint there. What it does is the following: we take the array, say the alpha bar values, and we index into it using the time steps. The time steps, if you recall, are 44 and 3, so we end up taking alpha_44 bar and alpha_3 bar, which contain, respectively, the product of 44 terms and the product of three terms. Then the only thing we do is add dummy dimensions until we have the same number of dimensions as the image we're doing element-wise multiplication with, and then we just expand the tensor. Why do we do this? Because we can't directly multiply a per-example scalar with a 4-D tensor, so we broadcast, basically copy-paste, that scalar into a tensor whose elements all contain that particular alpha_44 bar or whatever it is. Hopefully that was clear enough. Now let me ignore this; I won't be stepping into this function anymore. And that's it, guys: we have x_t, the noisy version of the image, right here. Let's see the next steps. First of all, because our loss is the hybrid one, we ignore this branch, and we'll see that we'll actually be computing the same thing a bit later down below, because the hybrid loss contains the variational lower bound term as well. So here we are: our loss type is, if you recall, rescaled MSE, and we end up here.
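That broadcasting helper is worth a tiny sketch of its own. Roughly what _extract_into_tensor does, assuming the coefficient array is already a torch tensor:

```python
import torch

def extract_into_tensor(arr, timesteps, broadcast_shape):
    """Pick arr[t] for each example, then reshape and expand so it broadcasts against a (B, C, H, W) batch."""
    res = arr.to(timesteps.device)[timesteps].float()   # e.g. [alpha_bar_44, alpha_bar_3]
    while res.dim() < len(broadcast_shape):
        res = res[..., None]                            # append dummy dims, ending up with (B, 1, 1, 1)
    return res.expand(broadcast_shape)                  # broadcast to the full image shape

alphas_cumprod = torch.cumprod(1.0 - torch.linspace(1e-4, 0.02, 50), dim=0)
coef = extract_into_tensor(alphas_cumprod, torch.tensor([44, 3]), (2, 3, 64, 64))
```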
This step is supposed to output the epsilon: we need to learn to predict the noise which, if you recall, we just used to form the x_t's, and now we're trying to predict that particular noise; those are the green dots I drew in the OneNote sketch. That's going to be the first three output channels; the last three channels are going to be the variance, and we'll see those in a bit. For now, let me show you how this forward prop works. This is again the U-Net: we're going to do something to the time steps and somehow merge them with the image representation. And let me quickly show you why we're passing x and t; this is the particular expression we're now dealing with. We have our U-Net, this epsilon_theta, so (let me change the color again) this thing here is our U-Net; this expression is the x_t, the thing we just calculated; and we also pass t, the time steps. That's the expression we're computing by executing this line of code. Okay, let me go back and step through this. We can ignore this wrapped model; it only serves a purpose if you're subsampling steps during the sampling procedure, which we won't be doing, so none of that really matters. Then we do some rescaling, nothing super interesting, just some legacy stuff so that the results are comparable with the original DDPM paper. So here is the actual U-Net forward pass: we have the new_ts, which are just rescaled versions, but still scalars, of our original 44 and 3, those time steps from before, and now we pass those along with the images; the keyword arguments are just empty, we don't have anything there. Let me show you the shape: it's again 2, 3, 64, 64.
Okay, now let's step through the forward pass of the U-Net model; here is where the magic happens. First of all, we have this timestep_embedding function, which maps the scalars into vectors using a heuristic: just sinusoids, the same thing as in the original Transformer paper. Let's see how that looks. Here it is; I'm not going to dig into the details, you can see a bunch of sines and cosines, and what ultimately matters is that instead of two scalars we end up with two vectors of dimension 128, so shape 2 by 128. Now we can work with vectors, which is much easier for a neural network than raw scalars. After that (okay, I stepped into something, whatever) we hit the activation function of this time_embed module. If you recall, time_embed is that inverse-bottleneck MLP; let me quickly find it. Here it is: just a combination of two linear layers, where the innermost layer has 4x the dimensionality of the input and output layers, hence why I call it an inverse bottleneck. That times-four has been stuck with us since 2017 and the original Transformer paper; people just keep using it. Again, a simple transformation, and we end up with a vector of a different dimensionality; let me step over here, and the embedding ends up with shape 2 by 512. So that's it, that's our time information.
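Here is a compact sketch of that sinusoidal embedding; it's close in spirit to the repo's timestep_embedding helper, though the exact ordering of the sin and cos halves there may differ:

```python
import math
import torch

def timestep_embedding(timesteps, dim, max_period=10000):
    """Map integer timesteps to dim-dimensional sinusoidal vectors, as in the original Transformer paper."""
    half = dim // 2
    freqs = torch.exp(-math.log(max_period) * torch.arange(half, dtype=torch.float32) / half)
    args = timesteps[:, None].float() * freqs[None]                 # (batch, half)
    return torch.cat([torch.cos(args), torch.sin(args)], dim=-1)    # (batch, dim)

emb = timestep_embedding(torch.tensor([44, 3]), dim=128)            # shape (2, 128), then fed to the MLP
```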
Because those trailing dimensions are 1, 1, the temporal vectors will simply be broadcast — effectively copy-pasted over the spatial dimensions — when we combine them with the image features. And here is where the magic happens: this is the boolean parameter we saw before, use_scale_shift_norm. That is one way to do it; you could also simply add the temporal information to the image features, and that is a perfectly valid way to condition the model, but I guess they empirically found that this approach works a bit better. So instead of just adding the temporal representations to the image features, we chunk the tensor: we start with shape 2, 256, 1, 1, and by chunking along the channel dimension we split the 256 channels into two halves of 128. Let me step over that — scale and shift both end up with shape 2, 128, 1, 1 — and then this is how they combine image and temporal features: normalize the image features, multiply by (1 + scale), add the shift, and then do some additional processing. That's it, guys — that is everything you really need to know about how this U-Net works. I'm going to remove that breakpoint and continue. We keep doing the same thing through the input blocks, middle blocks, and output blocks, merging in the temporal information at every residual block, so I'll just step over all of that. Okay, the forward prop is done and we are exiting the function; the output shape should be, I guess, 2, 6, 64, 64 — and indeed it is. Why six channels? Because the first three channels are again the epsilon, the noise we are trying to predict, and the next three channels are the variance. Let's see how those are used. Because we use the learned-range variance, we enter this branch here; we extract some dimensions — x_t is again our tensor of noisy images, 2, 3, 64, 64 — and we split the model output into two groups of three channels each, as I just explained. model_output will now contain the epsilon, i.e. the noise, whereas the other half contains the variance; each has shape 2, 3, 64, 64, whereas previously we had 2, 6, 64, 64.
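A compact sketch of that scale-shift conditioning, with illustrative layer names and the shapes from the walkthrough:

```python
import torch
import torch.nn as nn

# A sketch of the scale-shift (FiLM-style) conditioning inside a residual block,
# as described above. Layer names and shapes are illustrative, not the repo's exact ones.
def condition_on_time(h, emb_out, norm):
    # h:       (B, C, H, W) image features
    # emb_out: (B, 2*C) projected time embedding
    while emb_out.dim() < h.dim():
        emb_out = emb_out[..., None]                  # add dummy dims -> (B, 2*C, 1, 1)
    scale, shift = torch.chunk(emb_out, 2, dim=1)     # each (B, C, 1, 1)
    return norm(h) * (1 + scale) + shift              # broadcast over the spatial dims

h = torch.randn(2, 128, 64, 64)
emb_out = torch.randn(2, 256)
h = condition_on_time(h, emb_out, nn.GroupNorm(32, 128))
```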
Okay, now what we do is take the epsilon and detach it from the PyTorch computational graph — which means this term will not update the epsilon — whereas we pass the variance through as is. Why? If you recall from the paper (and I'll show you this in a second), in the hybrid loss we use the variational lower bound loss to train the variances, whereas we use the simplified objective to train the mean — or, in this case, to train the epsilon, which is equivalent because those are just different parameterizations. Let me show you the paper side by side. Here it is: as I mentioned, the hybrid loss consists of L_simple and L_vlb, the variational lower bound objective, and they mention that, along the same line of reasoning, they also apply a stop-gradient to the mu_theta output — or equivalently to the epsilon_theta output — for the L_vlb term. That means this term is frozen, detached from the computational graph, and that is exactly why we are doing this part here. Now let's go on and compute the actual L_vlb terms — we are literally computing this thing here, and in a couple of seconds we'll also be computing L_simple; we are just executing the formulas from the paper. Do let me know whether this side-by-side comparison between formulas and code helps you, because these videos are super long and take a lot of time to create, so any feedback is very much appreciated. Continuing on: we concatenate those two halves and then call this _vb_terms_bpd function — let me just check that I have a breakpoint there; yes, I do. So what we do here is pass a dummy model. Why dummy? Because that function will later call the model, but we will not actually do a forward prop through the U-Net: as you can see, the model is defined as a lambda that, no matter what you pass in, just returns r, which is the frozen_out — the thing we already computed. So you'll see what looks like a forward prop, but it is really just this dummy return of what was already computed; keep that in mind. We also pass the x_start, which are the original images, and the x_t, the images that were noised using t steps. Now let's step into the function and see how it is computed. First we compute the posterior mean and variance — the forward-process posterior — and then we take a KL divergence between that and our learned reverse process. Again, let me show you the paper to make this concrete before we dig into the code. Okay guys, here is the expression we are calculating: we have, as I said, the posterior of the forward process, and we have the learned reverse process; the q part — the posterior of the forward process — corresponds to the line q_posterior_mean_variance, and then we take the KL divergence between those two distributions.
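To ground the stop-gradient part, here is a hedged sketch of what the walkthrough describes; the tensor names are illustrative and the shapes follow the 2, 6, 64, 64 example above.

```python
import torch

# A hedged sketch of the stop-gradient trick: the variational-bound term should only
# train the variance channels, so epsilon is detached before being handed to the
# vb-term computation via a "dummy" model that just returns the cached output.
model_output = torch.randn(2, 6, 64, 64, requires_grad=True)    # U-Net output (illustrative)
eps_pred, var_values = torch.split(model_output, 3, dim=1)       # 3 channels each
frozen_out = torch.cat([eps_pred.detach(), var_values], dim=1)   # no gradient through epsilon here
dummy_model = lambda *args, r=frozen_out: r                      # ignores x_t and t, returns frozen_out
```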
Those are the L_{t-1} terms, the variational lower bound terms — nothing too fancy. Now let's dig in and see how that is computed. We pass x_start, x_t, and t — the original images, the noised images, and the time steps — so let me enter this function. It even says in a comment what it is computing, which is very nice; that kind of comment helps a lot, to be honest. Here is how we compute it: we take the posterior mean coefficient 1 and multiply it with x_start, we take the posterior mean coefficient 2 and multiply it with x_t, and that gives us the posterior mean. Again, side by side with the paper: posterior_mean_coef1 is just this term here — we actually computed it in the constructor of the GaussianDiffusion object — and we multiply it with x_0 (x_start, just a different notation); posterior_mean_coef2 is this other term, and we multiply it with x_t. That is how the formula corresponds to this line of code. Let me step over the rest: we end up with the posterior mean, and we do the same thing for the posterior variance — those are just the tilde betas we computed earlier — and the posterior log variance; I'm not sure exactly where that one gets used, we'll maybe see a bit later. Then just some assertions, and that's it: because the posterior is a Gaussian, returning its mean and variance describes it perfectly. Next we do p_mean_variance — that is the other step, the learned reverse process we are trying to compute. Let me check whether I have a breakpoint inside; yes, I do. We pass the model — remember, the model here is the dummy that just returns what we already computed, the epsilon and the variances — together with x_t and t, the noised images and the time steps. Let's enter. Stepping over, nothing fancy: the "forward pass" does nothing, it just returns the tensor we previously computed, 2, 6, 64, 64. Because our model learns the variances, we enter this branch, split the model output into the epsilon and the variance part, and because the variance type is learned-range rather than learned, we enter the else branch, where — as the comment says — we calculate equation 15 from the improved DDPM paper.
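Before going through equation 15, here is a hedged sketch of the forward-process posterior q(x_{t-1} | x_t, x_0) that was just computed; the schedule tensors are assumed to be precomputed 1-D buffers and the names are illustrative.

```python
import torch

# A sketch of the forward-process posterior q(x_{t-1} | x_t, x_0), using the
# coefficients described above (a hedged re-implementation, not the repo verbatim).
def q_posterior_mean_variance(x_start, x_t, t, betas, alphas_cumprod, alphas_cumprod_prev):
    alphas = 1.0 - betas
    coef1 = betas[t] * torch.sqrt(alphas_cumprod_prev[t]) / (1.0 - alphas_cumprod[t])
    coef2 = (1.0 - alphas_cumprod_prev[t]) * torch.sqrt(alphas[t]) / (1.0 - alphas_cumprod[t])
    posterior_mean = (coef1.view(-1, 1, 1, 1) * x_start
                      + coef2.view(-1, 1, 1, 1) * x_t)
    posterior_variance = (betas[t] * (1.0 - alphas_cumprod_prev[t])
                          / (1.0 - alphas_cumprod[t])).view(-1, 1, 1, 1)   # the "tilde betas"
    return posterior_mean, posterior_variance
```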
Let me show you equation 15 side by side. When I said we have the variance, I was kind of lying, because that's not quite the case: the model outputs a component that, when combined like this, gives you the variance. Let me remove this panel and show you what they do. They take the min log — the posterior log variance clipped, which is this part here, and that's where the log comes into play — and they take the max log, which is this other part; it's actually the other way around because this one is the betas, but it doesn't matter that much, you get the point. Then there is a normalization of the model output: the output lies in minus one to one, so adding one gives zero to two and dividing by two brings it into the zero-to-one range. And then we literally evaluate the equation — v times this plus (1 minus v) times that, so this line is exactly that equation — and finally we exponentiate, and we have successfully computed equation 15. Let's continue. This next bit is just some clipping, we can ignore it for now; let me zoom in. We are not predicting previous-x, we are predicting epsilon, so we enter this branch: we first want to predict the x_start given the epsilon — that is, given the noise, the current noisy images, and the time step, we want to predict x_0 — and we'll see why we use that in a second. Let me step in and show you this part: here we predict x_start from epsilon, and let me put the equation on the side so we can understand what's going on. Recall that x_t is computed from x_0 and epsilon; if we just rearrange the terms, x_0 turns out to be this expression, and that is what we compute here. So sqrt_recip_alphas_cumprod is literally this square-root-reciprocal term, and we multiply it with x_t; then minus sqrt_recipm1_alphas_cumprod — you can read it almost word by word and see it's this other term — and we multiply that by epsilon. That is how we calculate x_0, the x_start. Let's go back, and now we use that to find the model mean — and by model mean I literally mean this expression here, no pun intended. Let's see how that works: the posterior mean-variance expression is the one we already computed before, so I can skip through it, and we end up with our mean. That is what we return: the mean, the variance, the log variance, and the predicted x_start. After that we pass the true mean and true log variance — the ground truth — together with our mean and log variance, and we compute the KL divergence between them, because we want to be as close as possible to that distribution; then some normalization. Finally there is this step, the L_0 term: in case the time step was zero, we do a different computation.
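Here is a hedged sketch of both pieces just described — equation 15 (the learned variance interpolation) and recovering x_0 from the predicted epsilon; the buffer names are assumptions.

```python
import torch

# A hedged sketch of Eq. 15 from the improved DDPM paper (learned variance interpolation)
# and of recovering x_0 from the predicted epsilon, as walked through above.
def learned_log_variance(var_values, log_betas, posterior_log_variance_clipped, t):
    min_log = posterior_log_variance_clipped[t].view(-1, 1, 1, 1)   # log(beta_tilde_t)
    max_log = log_betas[t].view(-1, 1, 1, 1)                        # log(beta_t)
    v = (var_values + 1) / 2                                        # map model output from [-1, 1] to [0, 1]
    return v * max_log + (1 - v) * min_log                          # Eq. 15 (in log space)

def predict_xstart_from_eps(x_t, t, eps, alphas_cumprod):
    sqrt_recip = torch.sqrt(1.0 / alphas_cumprod[t]).view(-1, 1, 1, 1)
    sqrt_recipm1 = torch.sqrt(1.0 / alphas_cumprod[t] - 1.0).view(-1, 1, 1, 1)
    return sqrt_recip * x_t - sqrt_recipm1 * eps                    # rearranged forward-process equation
```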
If you recall the variational lower bound loss, let me show it side by side. Here it is: these are the L_{t-1} terms we were computing, the KL divergences, and here we compute the other expression, minus log p_theta(x_0 | x_1). That is what this discretized Gaussian likelihood gives us: the decoder negative log likelihood. Then, depending on whether t equals zero, if it's zero we use the decoder NLL as the loss, otherwise we use the KL — fairly simple, fair enough. Let me zoom in: that is our first term, the vb — variational lower bound — term, and we do some rescaling, which is why the loss is called rescaled MSE. Let's continue; I'll ignore this part because we'll be extracting epsilon, i.e. the noise, so we can skip everything else — let me just make sure there is no breakpoint there, and there isn't. We step over this, the target is simply the noise, and here we do an MSE between the model output — the epsilons predicted by the U-Net — and the target, which is the noise; and that noise, if you look for where it was created, is exactly the noise we generated at the very beginning, before we even started forming the x_t's. It is actually fairly simple. To be super frank, it is not easy to understand why exactly this works — it is still somewhat magical to me — but I am slowly learning more and more about these diffusion models; I think the first time I took the time to properly understand them was while preparing the GLIDE paper video, so do check that one out, it has a short introduction that may even be better than the one in this video. In any case, because we use the hybrid loss, we just combine the MSE and the variational lower bound loss (a small recap sketch follows below), and that is our loss. After this we apply a weight to each of the time steps — because we use uniform sampling, the weights are just ones everywhere — do some logging, and finally compute the gradients by calling backward on the hybrid loss. That's pretty much it; after that we just keep repeating micro-batches, so I'm going to stop here — this is everything you need to know to understand how the training procedure works. Hopefully that was useful. Now let's quickly dig into the sampling code and then we are done. I'm going to stop the training and enter the sampling script — that is the image_sample script — and start from its main function. I will select the sample configuration from my launch.py in VS Code; by the way, I love VS Code, it has a beautiful design and an amazing debugging experience, and I don't know why you would use anything else unless you don't have a choice, which is a reasonable excuse, I guess. Let's step over this: again a bunch of arguments, we'll ignore the distributed-training and logging parts, and I'll skip the model creation as well, so I'm just going to disable all of the breakpoints.
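As promised, a hedged recap sketch of that hybrid objective; the lambda of 0.001 is the value from the improved DDPM paper, and the per-timestep weighting and rescaling details are simplified away.

```python
import torch
import torch.nn.functional as F

# A hedged sketch of the hybrid objective described above: the simple MSE on epsilon
# plus a weighted variational-lower-bound term for the learned variances.
# `vb_term` is assumed to come from the KL / NLL computation sketched earlier.
def hybrid_loss(eps_pred, noise, vb_term, vb_weight=0.001):
    mse = F.mse_loss(eps_pred, noise)          # L_simple
    return mse + vb_weight * vb_term.mean()    # L_hybrid = L_simple + lambda * L_vlb
```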
I'm going to step over all of this because we already saw it, so I'll ignore it. Now we load the actual model — they do provide the checkpoints in their GitHub repo, so do check it out; I downloaded one and just set the path here. It is an ImageNet 64x64 unconditional model, trained for 100 million steps, I guess. We load it, put it onto the GPU, set it into eval mode, and we now start creating the samples. Because this is not a class-conditional model we can ignore this part, and now this is where the magic happens: the sample function. First of all, there is this paper called DDIM — I might cover it in the next video, but for now we can ignore it because this flag is false, so we will be using p_sample_loop, this function here, not the DDIM one. In a nutshell, DDIM works better when you only have roughly 50 or fewer time steps during sampling; as soon as you pass that threshold, this method here — the one from the improved DDPM paper I showed you at the beginning of the video — works better. Okay, here is the sampling function: this is the desired shape, we want a 1, 3, 64, 64 image, and this kwargs dict is empty. Let's enter the function and see how it works — but before that, let me re-enable the breakpoints. Here we are: there is this p_sample_loop_progressive generator which keeps producing the samples; we pass in the model and the desired shape, noise is None since we don't pass anything there, and the rest is boilerplate, so I'll ignore it. Inside p_sample_loop_progressive we again just pick a device — the GPU in my case, because I have one — and then we generate the initial image: we literally start from Gaussian noise, a normal distribution with mean zero and variance one, of the desired shape on the desired device. Then we have the number of time steps, which is set to 100 — I think I reduced it for the sake of time; the default was bigger, like four thousand. We generate the indices and, as you can see, we reverse them, because now we are doing the reverse process: we start with index 99 and go all the way down to zero, which gives the "original" image — and when I say original image here, I mean a generated image from the underlying data distribution we learned during training. Let's step over: we iterate over the indices, starting with 99. Because the batch dimension is one, nothing fancy happens here, but if you wanted to generate, say, five images in a batch, you would just copy 99 five times, because all of the noisy images follow the same schedule. So we start with the last time step, and this p_sample function is where all the magic actually happens — everything else was boilerplate — so let's focus on it: we pass in the image, which at this initial step is just pure noise.
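A hedged sketch of that loop, with `p_sample` standing in for the per-step function (a sketch of which follows after the next part):

```python
import torch

# A hedged sketch of the ancestral sampling loop described above: start from pure
# Gaussian noise and walk the reversed timestep indices back to 0.
@torch.no_grad()
def p_sample_loop_sketch(p_sample, shape, num_timesteps, device="cuda"):
    img = torch.randn(*shape, device=device)                       # x_T ~ N(0, I)
    for i in reversed(range(num_timesteps)):                        # 99, 98, ..., 0
        t = torch.full((shape[0],), i, device=device, dtype=torch.long)
        img = p_sample(img, t)                                      # one reverse-diffusion step
    return img                                                      # the generated sample
```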
We also pass the time step — nothing fancy there — so let's enter p_sample and see how the sampling works; actually, we already saw this during training, so you pretty much know how it's going to go. Let me enter: this is the part of the code where we split the model output into the epsilon and the model variance, compute those min-log/max-log terms — the equation 15 from the paper again — get the variance, and then, further down, predict x_0 and use it to compute the model mean. We already stepped through that code, so I won't do it again; let me disable all of the breakpoints, step to here, and hit F5 to step over. Okay, we end up here, and I'll enable the breakpoints again. We generate noise — again from a normal distribution — and we generate this mask, which is one whenever t is different from zero, because we treat the zeroth step differently: recall that even during training we had the NLL loss instead of the variational lower bound loss for that step, and similarly here the sampling behaves a bit differently depending on the time step. Then we just take the mean we computed, take the log variance — the exp and the log cancel out, giving us the standard-deviation factor — multiply that with the noise, add it to the mean, and we end up with a sample. So now we are at step 98, and we keep doing this until we get a completely denoised image; at the very end, once we reach the zeroth time step, the mask is zero, which means the only thing we return is the final mean — that's just a detail of how this works. We return the sample, and that sample becomes the image fed into the next iteration of this p_sample function; that is pretty much how it works. I'm going to stop here because you've seen everything; let me go to launch.py, go to the sample configuration, increase the number of diffusion steps to, say, 500, and run the script again with all breakpoints disabled, just to show you the results. I added an imshow/plt.show() line so we can visualize the actual sample, and I'll get back to you as soon as the image is generated. Okay guys, here is the image we got — some dog image. Do keep in mind that I only used 500 steps; it would obviously be much better with more, like a thousand. There is actually a bug with this code when you use 4000 steps — you can check out the issue in their repo — where you get a completely saturated image, so the optimum image quality is maybe around one to two thousand steps; at four thousand, the images hit a saturation point and you get pretty much junk out of the model.
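As a recap of the step we just walked through, here is a hedged sketch of the per-step arithmetic; `mean` and `log_variance` are assumed to come from the p_mean_variance computation above.

```python
import torch

# A hedged sketch of one reverse step, matching the arithmetic described above:
# x_{t-1} = mean + exp(0.5 * log_variance) * noise, except at t == 0 where only the
# mean is returned.
def p_sample_step(mean, log_variance, t):
    noise = torch.randn_like(mean)
    nonzero_mask = (t != 0).float().view(-1, 1, 1, 1)    # 0 at the final (t == 0) step
    return mean + nonzero_mask * torch.exp(0.5 * log_variance) * noise
```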
Guys, this is pretty much it. Hopefully you found this format of a video useful — do let me know, and if so I'll keep making videos such as this one, combining papers and code, putting code and paper side by side, and trying to make these abstract mathematical ideas a bit more grounded; hopefully I've done a decent job at that. In any case, if you found this video useful, do share it with others who want to learn more about diffusion processes, subscribe to this channel, and join the Discord community — you can find the link down below in the video description. Until next time, bye bye!
Info
Channel: Aleksa Gordić - The AI Epiphany
Views: 35,493
Keywords: arxiv, paper explained, the ai epiphany, ai, deep learning, machine learning, aleksa gordic, artificial intelligence, code walk-through, diffusion models, ddpm, denosing diffusion probabilistic models, improved ddpm, glide, dalle-2, midjourney, google imagen, openai
Id: y7J6sSO1k50
Length: 88min 55sec (5335 seconds)
Published: Thu Jul 07 2022