Miika Aittala: Elucidating the Design Space of Diffusion-Based Generative Models

Captions
All right, thanks for the intro. Indeed, the title of the paper is "Elucidating the Design Space of Diffusion-Based Generative Models". This is joint work with Tero, Timo, and Samuli from NVIDIA. The agenda here is to try and make sense of this zoo of recently emerged diffusion models, to really dig into the fundamentals, and with that understanding ask what the best practices are for designing and running these methods.

For a brief background on generative modeling: there are many ways to do it, but the idea is usually that you have a data set of something, for example face photos in this case, though it could be anything, even non-images, and you want to train some kind of neural method for converting random numbers into random novel instances from that data distribution. Until recently GANs were the leading contender in this space, and these examples are from there, but now the denoising diffusion methods have really emerged as the leading contender. I'm sure we've all seen the super impressive results from models like Stable Diffusion; everything I'm going to say is basically about the machinery that runs at the bottom of these systems, and it is in some way directly applicable to anything like this.

All of these denoising diffusion methods implement the idea as follows: you start from pure random noise, you feed it to a neural denoiser, and you keep feeding it back and reducing the noise level until it reveals a random image that was hiding underneath the noise. Now you've generated a random image, so this is a generative model. One concern with these methods is efficiency: you need to call the denoiser tens or even thousands of times in some methods to get the best quality. It's a trade-off with the quality of the individual generated images and with the distribution as a whole, and these trade-offs are not really well understood in previous work. Some methods simply work better than others, and it's a bit of folklore that this one seems to be good and that one not, and so on.

There are many ways to formulate the theory of these methods: you can approach it with Markov chains, stochastic differential equations, and some more exotic ways, but when you strip away all the fancy theory, in the end they all do something like this. They differ vastly, however, in practical design choices: at what rate you reduce the noise level at different stages of the generation, whether you do this deterministically or stochastically (we'll see the difference soon), how you deal with the vastly different signal magnitudes at different stages of this process, whether you predict the signal or the noise, and so on. Given that ultimately these are the only differences between the existing methods, they must also be the explanation for their vastly different performance characteristics, and this is something we wanted to understand in this project.

We'll be building on the differential equation formalism by Yang Song and colleagues in their paper from a couple of years back, where the images evolve according to a stochastic or an ordinary differential equation. In principle it's known that this framework generalizes all of those other methods.
You can express them all in this framework, but nobody has really gone through the work of getting their hands dirty and sorting everything into a common framework where you could compare and understand the impact of these design choices. So that's the first thing we are going to do here. Armed with that knowledge, we'll then ask what the best practices are for running the sampling process, namely how you manage this chain of denoising steps in the best possible way, first the deterministic version and then the stochastic version. Finally, we'll come to best practices for training these neural networks: how you precondition them and what the loss functions are. One thing we won't be looking at is the actual neural architectures; we leave that for future work.

Okay, so let's start with the common framework. We'll be studying a few key works in this field. There's the paper that presents the so-called VP and VE methods (variance preserving and variance exploding), and then there's DDIM, the denoising diffusion implicit model. It's not really that important for us what the differences between these are, but on the face of it they look like packages that you have to take as a whole, where you cannot mix and match their properties. This is not really true, though. The running theme throughout this paper is that we identify a complete and exhaustive set of design choices that completely characterizes and reproduces any given method, or at least these three methods and many others in the space. This gives us a sort of x-ray view into the internals of these methods: we can ask what exact design choices they made about this or that aspect. Don't worry, we won't be looking at slides like this; I'll try to keep it visual and intuitive to the extent possible, but the important point is that this can be done. With this knowledge we can then ask what the best choice is for any given design point, and that gives us our method, which we'll be building piece by piece and which then yields significantly improved results. We'll be measuring our progress with the FID metric, which is the current gold standard for evaluating any kind of generative model.

So let's start by looking at how Song and colleagues formulate this denoising diffusion problem using differential equations. Throughout this talk I'll be using a running toy example, which is actually in many ways completely representative of the real thing that goes on with images. In a way this is a one-dimensional version of images; with actual high-dimensional images you would have more dimensions on the vertical axis (a one-megapixel image is a million numbers, so that would be a million-dimensional space), but this describes the essential characteristics. The point is that we have some distribution of data, let's imagine cat and dog photos or something, and it happens to be this bimodal thing, so certain pixel values are more probable than others. We want to learn to produce novel samples from this distribution. We have a handful of examples, or let's say millions of examples, which is our data set, and based on those we want to learn to do this. In this analogy, one of the samples we have might be this dog photo.
On the other axis we have increasing time, which is essentially an increasing noise level. That's what we eventually want to reduce, but before we do that, let's look at the easier direction of adding noise, that is, destroying an image. If I take this image from the training data set and gradually start adding noise onto it, I end up doing a random walk in this pixel-value space until the image is completely drowned under white noise. If I have a population of images, in the end they all become indistinguishable white noise. If I plot the density that these trajectories make, it looks like this: the density of the data on the left edge becomes diffused over time until it's completely normally distributed at the end. This is really nice, because we can sample from that normal distribution at the right edge; we just call randn in Python and that gives us a sample from that edge. The magic is that there exists a way to reverse the path we took earlier, to go backward in time, and that lands us on the left edge where we have the density of the actual data, which of course generates an image. If I had started from many different random noises, I would have gotten different instances of images.

What makes this tick is that the process can be seen as a stochastic differential equation, and in this example it's about the simplest one there is: when we go forward in time over a very short time period, the change in the image dx equals d-omega, which is white noise. That's just the mathematical expression of doing a cumulative sum of random noise. The magic is that to this forward equation corresponds a backward version that has the same stochastic random-walk component, but on top of that it has a term that attracts the samples towards the data density; you see some kind of gradient of the data density p in there. The problem, of course, is that this p is unknown, and here is the actual magic: this gradient is a well-known quantity from previous literature, called the score function, and it has the property that you do not need to know p if you have a least-squares optimal denoiser for this data set. You can directly evaluate the formula above using the formula below, and this is an opportunity: we train a neural network to be such a denoiser, which means we can run this kind of backward evolution using that learned denoiser.

Song and colleagues also present a deterministic variant of this, where you don't have the stochastic term; you only have the score term scaled in some appropriate way. It has a somewhat different visual character: it kind of fades in and out instead of jittering around, and it actually provides a much cleaner view into the sampling dynamics, so we'll look at this first and then return to the stochastic case later. With this I can now draw these flow lines of the ODE, and the idea is that we are trying to somehow follow these lines to do the generation.
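To make the score-denoiser connection concrete: for a least-squares optimal denoiser D(x; sigma), trained to remove Gaussian noise of standard deviation sigma, the score of the smoothed data density can be read off directly as

$$\nabla_{\mathbf{x}}\log p(\mathbf{x};\sigma)\;=\;\frac{D(\mathbf{x};\sigma)-\mathbf{x}}{\sigma^{2}},$$

which is exactly the gradient term the backward equation needs at every step. (This is the standard identity the talk alludes to, written as I recall it from the paper, so treat the exact notation as an assumption.)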
The way that happens is by discretization: I take small but macroscopic steps in this space. I reduce the time, and for any change in time the ODE formula tells me how much the image changes, and that ODE formula is evaluated with the neural network. So the network tells us where to go on the next step; that's the general idea. That gives me a step, and I keep stepping until I hit time zero, and that's my generated sample. With the SDE we would have some kind of noise addition on top of this, which would jitter things around, but as I said, we'll leave that for later. And now we have exactly reproduced the intuitive picture from the beginning, using differential equations.

Okay, so that was Song and colleagues, for our purposes. Let's now identify some design choices involved in making this kind of an ODE or SDE. But before we do that, we should understand what can go wrong in this process, what the error sources are, because I might end up in a different place than I should have when I run this sampling chain. The obvious one is that if the neural network gives me an incorrect direction, I move in the incorrect direction, and in the end I land somewhere in the wrong place; it's more subtle than this, but that's the cartoon version. The other source of error is that we are trying to approximate this continuous trajectory, in green here, using linear segments, and if I try to jump too far at once, the curve kind of moves away from under my feet and I end up veering off the path. This is of course familiar to anyone who has done physical simulation with ODEs. The brute-force solution is to take more steps, but that's exactly what we want to avoid, because it directly means more compute to generate an image. What we argue, and what is underappreciated in previous work, is that these two effects can and should be analyzed in isolation: you don't have to sample in a certain way just because you trained your neural network in a certain way, and so on. You can decouple these, and indeed we'll be looking at sampling first and then coming back to training later.
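As a concrete picture of the stepping procedure described above, here is a minimal sketch of a plain Euler sampling loop. The `ode_dxdt` function, the step grid, and the shapes in the usage comment are placeholders assumed for illustration, not the exact formulation from the talk:

```python
import numpy as np

def euler_sample(ode_dxdt, x_init, t_steps):
    # Follow the ODE flow with plain Euler (tangent) steps.
    # ode_dxdt(x, t) is assumed to return the step direction at (x, t),
    # evaluated through the trained denoiser network.
    x = x_init
    for t_cur, t_next in zip(t_steps[:-1], t_steps[1:]):
        d = ode_dxdt(x, t_cur)         # the network tells us which way to go
        x = x + (t_next - t_cur) * d   # too long a step and we fall off the curve
    return x

# Usage sketch: start from pure noise at the highest noise level and step down.
# x = sigma_max * np.random.randn(*image_shape)
# image = euler_sample(ode_dxdt, x, np.linspace(t_max, 0.0, num_steps + 1))
```

Both error sources mentioned above show up directly in this loop: the quality of `ode_dxdt` (the network) and the length of each linear step.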
Okay, so I promised to show you some design choices, and here is one, finally. When I built this example, I added noise at a constant rate over every time step, and that implicitly gives me a schedule where the noise level increases as the square root of time, because the variance grows linearly, so the standard deviation grows as the square root; that's what you get if you call randn and then do a cumsum on top of it. Had I added the noise at a different rate, I might have arrived at a schedule like this one, for example, where the standard deviation instead grows linearly, and indeed I could make any kind of choice here; I could even do something crazy like this wavy schedule in the middle if I wanted to for some reason. In the paper we generalize, or rather reparameterize, the ODE in such a way that we get a clear view into these effects, so we can explicitly parameterize the noise level we want to have reached by a function sigma of t. But the real question is why you would want to do something like this. One reason is that if you look at this picture, you see almost nothing happens until, at almost zero noise level, the trajectories suddenly curve rapidly into one of the two basins. There's high curvature there, so we'd probably want to be more careful in sampling that region and less careful in the bulk. There are two ideas for how you might do that. First, you might take shorter steps at the more difficult parts, which are usually the low noise levels, because that's where the image details are built. The alternative is to warp the noise schedule in such a way that you simply end up spending more time at these difficult parts. It's tempting to think that these two approaches are equivalent, and I think many previous works make this implicit assumption, but it is simply not true, because the error characteristics can be vastly different between these choices, like the error that comes from tracking this continuous curve; we'll see the effect of that later. So now we've identified the first pair of design choices: the time steps and the noise schedule.

Let's introduce a couple more. Next there is signal scaling, which addresses the following problem. I'll zoom out a little, because in reality we add a ton of noise, so at the other extreme the noise level is very large; I've been showing a zoomed-in view so we can see what's going on, but now I zoom out. The issue, if you don't do anything, is that the signal magnitude grows as the noise level grows: you keep piling on noise, so the signal is quite simply bigger numerically. The values in your tensor are much larger at high noise levels than at low noise levels, and this is known to be really bad for neural network training dynamics; these kinds of effects are actually critical to deal with to get good performance. The way many previous works approach this is by using so-called variance preserving schedules, where you effectively introduce an additional scale schedule that squeezes the signal magnitude into a tube of constant standard deviation, constant variance. That's one way to make the network happy. We generalize this idea again by formulating an ODE that allows you to directly specify any arbitrary scale schedule, and viewed on this slide, it becomes apparent that the only thing the scale schedule does is distort these flow lines in some way; you're just doing a coordinate transform on this x-t plane. Now, there is an alternative way to deal with the scaling issue, and it is quite simply this: instead of changing the ODE at all, you could change your neural network so that it has an initial scaling layer that uses the known signal scale to bring the input to something that's palatable for the network. Again, you might think that this is completely equivalent to handling it in the ODE, but it is simply not true, because the error characteristics are vastly different between the two cases; I'll come back soon to how these choices play out in practice. So now we've identified a second pair of design choices: the scaling schedule, and the scaling that happens inside the neural network itself, which we count as part of the so-called preconditioning of the network.

Okay, so now we have quite a few choices collected here, and at this point we can get our hands dirty: go look at the appendices of these papers, read their code for the final ground truth, and ask what formulas exactly reproduce their approaches. They are these; again, don't even try to read them.
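To make the two schedule concepts concrete, here is a small sketch of how a noise schedule sigma(t) and a scale schedule s(t) together determine a noisy sample at time t. The example schedule functions are illustrative only, not the exact choices of any of the methods in the table, and the variance-preserving flavour assumes unit data variance:

```python
import numpy as np

def noisy_sample(y, t, sigma, s):
    # Marginal sample at time t: the clean signal y plus noise of standard
    # deviation sigma(t), the whole thing scaled by the scale schedule s(t).
    return s(t) * (y + sigma(t) * np.random.randn(*np.shape(y)))

# Illustrative schedule choices:
sigma_sqrt = lambda t: np.sqrt(t)   # constant-rate noise addition: std ~ sqrt(t)
sigma_lin = lambda t: t             # standard deviation grows linearly in time
s_identity = lambda t: 1.0          # no scaling ("variance exploding" flavour)

def s_vp_like(t):
    # Squeeze the signal into a roughly constant-variance tube
    # ("variance preserving" flavour, assuming unit data variance).
    return 1.0 / np.sqrt(1.0 + sigma_lin(t) ** 2)
```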
But the question now is what choices we should actually make: which of these are good and which are suboptimal? That's the topic of the next section. For now we will be ignoring the neural network training aspects; we'll be using pretrained networks from previous work, we won't be retraining anything yet, we'll just try to improve the sampling.

So now we move on to deterministic sampling and actual prescriptions for what you might want to do. First, the noise schedule: why would some of them be better than others? For example, the wavy one must be terrible for some reason, but why? Parameterizing things in this way gives us quite a clear view of this. Let's zoom out again and consider the fact that we are trying to follow these curving trajectories using linear tangent steps, and that's probably going to be more successful when the tangents happen to coincide with the curved trajectory, in other words when the trajectory is as straight as possible. If I use a bad schedule like this one, you see there's already a visible gap between the tangent and the curve, so you easily fall off if you try to step too far. And indeed, if I show you this random family of different schedules, we see that some of them seem to be better in this regard than others, in particular this one. This is actually the same schedule used in the previous work DDIM, which is known to be quite good, and this in a way explains it. So this is the schedule where the standard deviation grows linearly and we do not use any scaling; indeed, we'll be leaving the scaling to the neural network parameterization. The reason for that is that the scaling also introduces unwanted curvature into these lines; it just turns them unnecessarily at some point, so from this perspective it's actually better to let the signal in the ODE grow. With this, the ODE becomes very simple. As a further demonstration, here is an actual mathematical fact about this schedule and why it allows us to take long steps: if I took a step directly to time zero, then with this schedule, and only this schedule, the tangent points directly at the output of the denoiser. That's very nice, because the denoiser output changes only very slowly during the sampling process, which means the direction you are going in hardly changes at all, so you can take long, bold steps, and consequently you need many fewer steps than with the alternatives.

Then, as I said, we want to direct our efforts to the difficult places. Now that we've tied our hands with the noise schedule, the remaining tool is to take steps of different lengths at different stages of the generation. Indeed, when you go look at the possibly implicit choices the previous methods have made, all of them take shorter steps at low noise levels, because again that's where the detail is built. We formulate a family of these discretizations, a polynomial step-length growth, and we find that there is a broad optimum of good schedules there; you can read the details in the paper.
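As a concrete example of such a discretization, here is a sketch of polynomially spaced noise levels in the spirit of the paper; the default constants are the paper's defaults as I recall them, so treat them as assumptions:

```python
import numpy as np

def time_steps(n, sigma_min=0.002, sigma_max=80.0, rho=7.0):
    # Polynomially warped spacing between sigma_max and sigma_min: the steps
    # get shorter towards low noise levels, where the image detail is formed.
    i = np.arange(n)
    t = (sigma_max ** (1 / rho)
         + i / (n - 1) * (sigma_min ** (1 / rho) - sigma_max ** (1 / rho))) ** rho
    return np.append(t, 0.0)  # append an exact zero so the last step lands at t = 0
```

Larger values of rho concentrate more of the step budget near the low noise levels.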
There's one more thing that this ODE framework allows you to do, which is not so obvious with, for example, the Markov chain formulations: use higher-order solvers. Again, there is going to be curvature, and it can be quite rapid in places, so you can still fall off the track if you just naively follow the tangent. That method of following the tangent is of course called the Euler method, but there are higher-order schemes. For example, in the Heun scheme you take a second tentative step, then move back to where you started from, and your actual step is the average of that and the initial one. This makes you follow the trajectories much better. It of course has a cost, since you need to take these substeps, and what we find in the paper, by studying this extensively, is that the Heun method strikes the best balance among the higher-order methods in terms of extra bang for the buck, and the improvement is actually quite substantial.

So those are the choices we made, and now we can evaluate. We'll be evaluating results throughout the talk on a couple of very competitive generation categories: CIFAR-10 at resolution 32 and ImageNet at resolution 64. I want to say a couple of words on this, because it might sound like a useless toy example if you're used to seeing output from Stable Diffusion or something, but the way those methods work is that they also first generate something like a 64 by 64 image and then upsample it sequentially, and it turns out that generating the 64 by 64 image is the difficult part; the upsampling just kind of works. So this is highly indicative of the improvements we would get in very relevant classes of models.

If we look at the performance of the original samplers from a few previous methods on these data sets, with quality on the y-axis (FID, lower is better) and the number of steps, or function evaluations, on the x-axis, we see that we need something like hundreds or even thousands of steps to get saturated quality, the best quality the model gives you. Introducing the Heun sampler and our discretization schedule vastly improves this; notice that the x-axis is logarithmic, so we've gone from hundreds to dozens of evaluations. Further introducing the noise schedule and scaling schedule improves the results by a large amount, except for DDIM, which was already using those schedules. So we've already made it quite far here, and using some super fancy even higher-order solver is not worth the effort. Okay, so now we have covered deterministic sampling.
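Putting the deterministic pieces together, here is a minimal sketch of a Heun-based sampler under the sigma(t) = t, s(t) = 1 parameterization discussed above. `denoise(x, t)` stands in for the trained denoiser network (an assumption for illustration), and `t_steps` could come from the `time_steps` sketch earlier:

```python
def heun_sample(denoise, x, t_steps):
    # Deterministic sampler for the ODE dx/dt = (x - D(x; t)) / t.
    for t_cur, t_next in zip(t_steps[:-1], t_steps[1:]):
        d_cur = (x - denoise(x, t_cur)) / t_cur     # Euler (tangent) direction
        x_next = x + (t_next - t_cur) * d_cur       # tentative Euler step
        if t_next > 0:                              # Heun correction: average the slopes
            d_next = (x_next - denoise(x_next, t_next)) / t_next
            x_next = x + (t_next - t_cur) * 0.5 * (d_cur + d_next)
        x = x_next
    return x
```

Each full step costs roughly two network evaluations instead of one, which is the trade-off the talk refers to.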
Next, let's return to the question of the SDE, which we put on the back burner a few slides ago. Remember, instead of following these nice smooth flow trajectories, the SDE jitters around, doing a kind of exploration around that baseline, and it can be interpreted as replacing the noise as you go, on top of reducing it. The reason people care about the SDE is, well, partly that this is where the whole framework is derived from, but the other reason is that in practice you tend to get better results with the SDE than with the ODE, at least in previous work. The reason for that will become apparent soon, but let's first generalize the idea a little. In the paper we present a generalized version of the SDE which allows you to specify the strength of this exploration by a sort of noise replacement schedule. In particular, when you set it to zero you get just the ODE, but you can also do more exploration by boosting this factor, or you can do more exotic schedules, like something that behaves like an SDE in the middle and like the ODE elsewhere, and the samples would look like this. But again, the question is whether this is just a nice trick, or what the point is. As I said, empirically this improves the results, and looking at this SDE, the reason becomes somewhat apparent. Don't try to read it unless you're an expert in SDEs, but we can recognize a couple of familiar parts. The first term in the SDE is actually just the ODE from the previous section, which means we still have the force that drives us towards the distribution and makes us follow the flow lines. In the remainder we can identify something called a Langevin diffusion stochastic differential equation, which is a well-known thing from long ago. It has the property that it makes the samples explore the distribution, and if the samples are not distributed correctly, it reduces that error. It has this healing property, and because we do make errors during sampling, it actively corrects for them.

Here is how that looks. Let's take an extreme situation: we have our samples, the blue dots, and let's say they are really bad, not following the underlying distribution at all; they are skewed to one side. If we just keep following the ODE, it does nothing to correct that skew, and we completely miss the other basin of the data, for example. When I introduce stochasticity into the process, it starts looking like this: the samples do this kind of random exploration and gradually forget where they came from, forget the error in their initial position, and now we've covered both modes in the generated images on the left edge. So that's the reason why stochasticity is helpful. Now, arguably this is the only benefit of the SDE over the ODE, and there are also downsides to using SDEs. For example, we would technically have to use complicated solvers that are arguably designed for much more general SDEs. So we asked: could we instead directly combine the ODE solving with this idea of churning the noise, adding and removing it? The answer is the stochastic sampler we propose in the paper. We have our current noisy image at noise level t_i, and remember that in our parameterization the noise level is now completely equivalent to time. There are two substeps in one step: first we add noise, which represents the Langevin exploration, so we land at some randomly noisier image and the time increases; then we solve the ODE down to where we actually wanted to go, which is a lower noise level, and that simply follows the flow line there. In practice we do this with a single Heun step. We keep alternating between this noise addition and the Heun step, and that brings us as close to time zero as we want, but underneath it is the ODE running the show and guiding us along these flow lines, while on top of that we now have this jittering which corrects errors. This all sounds really nice, you get free error correction, but it's not actually free, because the Langevin part is also an approximation of a continuous thing, and you introduce new error when you discretize it. So it's actually quite a delicate balance how much of this you should do.
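Here is a minimal sketch of one such "add noise, then step down" iteration. The `churn` parameter and the single Euler step are simplifications assumed for brevity; as stated above, in practice a Heun step (as in `heun_sample`) is used for the downward part:

```python
import numpy as np

def churn_step(denoise, x, t_cur, t_next, churn=0.3):
    # 1) Raise the noise level from t_cur to t_hat by adding fresh noise
    #    (the Langevin-like exploration).
    t_hat = (1.0 + churn) * t_cur
    x_hat = x + np.sqrt(t_hat ** 2 - t_cur ** 2) * np.random.randn(*x.shape)
    # 2) Follow the ODE back down from t_hat to the desired lower level t_next.
    d = (x_hat - denoise(x_hat, t_hat)) / t_hat
    return x_hat + (t_next - t_hat) * d
```

Setting `churn` to zero recovers the purely deterministic step, which is the knob the per-data-set tuning discussed next is about.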
Now, with this clear view into the dynamics, we actually find that it is really finicky: you need to tune the amount of stochasticity on a per-data-set, per-architecture basis. You get the benefits, but it's really annoying, so it's a mixed bag; nonetheless, it is very useful. If we compare the ODEs from the previous section against the original SDE samplers from the respective works, we see that the SDE solvers are simply better in the end, but they are also very slow. Applying all of these improvements with our method, with the optimally tuned settings for this data set, we reach both much better quality and a much faster rate; there has also been previous work that applied higher-order solvers. I want to highlight one result here, on the highly competitive ImageNet 64: just with this change of schedule we went from a pretty mediocre FID of 2.07 to 1.55, which at the time of getting this result was state of the art. That record was broken before publication, but we'll have our revenge in a few slides. The point is that just taking the existing network and using it better already yields huge improvements.

Okay, so that's it for the sampling. At this point I have to say I'm going to go a bit over time because of the hassle in the beginning, and because this material is kind of incompressible anyway, so if you need to leave, no problem. That's it for stochastic sampling and for sampling as a whole. As I promised, we are now going to look at how to train these networks, how to parameterize them in such a way that they give reliable estimates of where these steps should be pointing; again, we won't be looking at the architecture itself. Just a brief recap: the role of the ODE is to give us the step direction, which is given by the score function, which can be evaluated using a denoiser, which can be approximated using a neural network. That is the role of the neural network here: it tells you where to go in a single step, or in what direction. The theory says that as long as the denoiser minimizes the L2 denoising loss, the theory will be happy, and you can do this separately at every noise level, so you can also weight these losses according to the noise level.
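The L2 denoising loss being referred to is, roughly, the following objective, with lambda(sigma) the per-noise-level weight discussed below (written as I recall it from the paper; take the exact notation as an assumption):

$$\mathbb{E}_{\sigma,\;\mathbf{y}\sim p_{\mathrm{data}},\;\mathbf{n}\sim\mathcal{N}(\mathbf{0},\,\sigma^{2}\mathbf{I})}\Big[\,\lambda(\sigma)\,\big\lVert D_{\theta}(\mathbf{y}+\mathbf{n};\,\sigma)-\mathbf{y}\big\rVert_{2}^{2}\,\Big].$$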
But before we get to the loss weightings, let's look at the denoiser itself. I've drawn the CNN in a bit of a hazy way, because it's actually a bad idea to directly connect the noisy image to the input of the network, or to read the denoised image from its output layer. Rather, we want to wrap it between some kind of signal-management layers that standardize the signal scales of both the input and the output. Also, in this case we can often recycle information from the input: if the input image is almost noise free, we don't really need to denoise much, we should just copy what we already know and only fix the remainder; we'll come to that soon. This is super critical here. It might sound like boring technical detail, but these kinds of things really are critical for the success of neural network training, we've seen this over and over again over the years, and in this case the noise levels vary so widely that it's extra critical.

So, without further ado, here is how one of the previous methods, the VE method, implements the denoiser. The idea of this setup is that they are learning to predict the noise instead of the signal with the CNN layers, and the way that works (I'll explain why soon) is of course that the loss will be happy as long as the denoiser can produce the clean image. We can interpret this model as having a kind of skip connection: the noisy input goes through it, and implicitly the task of the CNN becomes to predict the negative of the noise component in that image. Then there's an explicit layer that scales that prediction to the known noise level, and when you add whatever came through the skip connection to it, you get an estimate of the clean image. This way they turn things around so that the CNN itself is concerned with the noise instead of the signal; I'll explain soon why that is relevant.

But first, let's do the thing I promised to do long ago. I said there are huge variations in the numerical magnitude of these input signals, and this architecture fails to account for that, which is problematic. So we quite simply introduce an input scaling layer here that uses the known standard deviation of the noise to scale the image down. I want to highlight that this is not batch normalization or anything like that: we know what the noise level is and what the signal magnitude should be, and we divide by an appropriate formula. That deals with one of the wishes we had on that orientation slide. On the output side we actually have something nice already: the network only needs to produce a unit-standard-deviation output, and the explicit scaling to the noise level takes care of applying the actual magnitude. This again makes it much easier for the neural network, which can always work with standard-sized signals, and that deals with the second wish we had there.

Now to the question of whether we should predict the noise or the signal, and why. It turns out this is actually a good idea at small noise levels but a bad idea at high noise levels. Let me show what happens at low noise levels. If we have low noise, the content that goes through the skip connection is almost noise free already, and the CNN predicts the negative noise component, which gets scaled down by this very low noise level. This is great, because the neural network is actually the only source of error in this process, so if the network made errors, we've now downscaled them; it doesn't matter much whether the network is good or bad, we introduced very little error in this case. We are recycling what we already knew instead of trying to learn the identity function with the neural network, and that deals with the third wish we had on the slide. But at high noise levels the situation is reversed. Whatever comes through the skip connection is completely useless: it's a huge noise with no signal at all. Now the CNN predicts what the noise is, and that prediction is massively boosted at this stage, so if the network made any errors, they become huge errors afterwards, and those are passed directly out of the denoiser. Now we've introduced a huge error into our stepping procedure in the ODE. It's also a bit of an absurd task, because you're trying to subtract two massive signals from each other to get a normal-sized signal, which is kind of like trying to draw with a two-meter-long pencil: not optimal.
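All of these wrappers can be collected into a preconditioned denoiser of roughly the following form, where F_theta is the raw network, c_in standardizes the input magnitude, c_out sets the output magnitude, c_noise conditions the network on the noise level, and c_skip is the skip-connection weight discussed next (this is the general form I recall from the paper; take the exact notation as an assumption):

$$D_{\theta}(\mathbf{x};\sigma)\;=\;c_{\mathrm{skip}}(\sigma)\,\mathbf{x}\;+\;c_{\mathrm{out}}(\sigma)\,F_{\theta}\!\big(c_{\mathrm{in}}(\sigma)\,\mathbf{x};\;c_{\mathrm{noise}}(\sigma)\big).$$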
So instead, what we'd like to do is somehow disable the skip connection when the noise level is high. In that case, the task of the CNN effectively becomes to predict the signal directly; there's no need to scale it up, so we don't end up boosting errors. The way we implement that is by adding this sort of switch, but in a continuous way. We have a so-called skip scale which, when set to zero, effectively disables the skip connection, and when set to one, gives you the noise prediction. Furthermore, we make it a continuous value between zero and one that depends on the noise level, and when it's somewhere in between, we are predicting some kind of mixture of noise and signal instead of just one of them. There is a principled way of calculating what the optimal skip weight is, but I won't go into it in the interest of time; it's in the paper's appendix. That deals with the remaining wishes we had on the slide, and now we can look at what the previous works did and what we did; these are the actual formulas that implement those ideas.

Then there are a couple of training details: how you should weight the loss based on the noise level, and how often you should show samples of different noise levels. The general problem, if you don't deal with these issues, is that you might get a highly lopsided distribution of gradient feedback. If you're not careful, on most iterations you might be prodding the weights gently in one direction or the other, and then every few iterations you get a massive gradient smash on the weights, and that's probably very bad for your training dynamics. The role of the loss weighting, the numerical scale in front of the loss term, should be to equalize the magnitude of the loss, or equivalently, to equalize the magnitude of the gradient feedback it gives. Then there's the noise level distribution, that is, how often you show images of any given noise level; its role is to direct your training effort to the levels where you know you can make an impact. For that, we have a sort of importance sampling argument in the paper: whatever we do, we end up with this kind of loss curve, where we don't make much progress at very low and very high noise levels, but we do make a lot of progress in the middle. For example, at the very low noise end you're trying to predict the noise in an essentially noise-free image, which is impossible, but it also doesn't matter that you can't do it. Based on this, we find that it suffices to have a very broad distribution of noise levels targeted at the levels where you know you can make progress, and since the x-axis here is logarithmic, it's a log-normal distribution.

So those are those choices, and the slide is starting to look pretty full. There's one more thing, which I'll only mention in the interest of time: we present a mechanism in the paper for dealing with very small data sets, where your network starts overfitting, via an augmentation mechanism. You can look at it there, but let's not go into it; it's really only relevant for very small data sets like CIFAR. With ImageNet I don't think we found a benefit from it.
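As a small illustration of those last two choices, here is a sketch of how a training iteration might draw noise levels and weight the loss. The log-normal parameters and the weighting formula are the ones I believe the paper uses, so treat the specific constants as assumptions:

```python
import numpy as np

def sample_training_sigma(batch_size, p_mean=-1.2, p_std=1.2):
    # Log-normal distribution over noise levels: most of the training effort
    # goes to the intermediate noise levels where progress can actually be made.
    return np.exp(p_mean + p_std * np.random.randn(batch_size))

def loss_weight(sigma, sigma_data=0.5):
    # Per-noise-level weight chosen so that the loss (and hence the gradient
    # feedback) has roughly equal magnitude across noise levels.
    return (sigma ** 2 + sigma_data ** 2) / (sigma * sigma_data) ** 2
```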
Okay, so with all these improvements, which we can stack one by one (these are the lines here), in the end we get state-of-the-art results in various competitive categories. With deterministic sampling we get FIDs of 1.79 and 1.97 in the CIFAR-10 categories, which might still be state of the art, and this at very low sample counts compared to most previous work, which is the more interesting part. That was with deterministic sampling; when we enable the stochastic sampling, tailor it for these architectures, for ImageNet, and use networks we retrained ourselves using these principles, we get an FID of 1.36, which was state of the art when this paper came out. It has been overtaken, I think in the last few weeks, possibly earlier. But all in all, we've taken a model that was okay-ish in the beginning and, by stacking all of these improvements, turned it into basically the best model in the world at that time for generating ImageNet images. Interestingly, the stochasticity is no longer helpful with CIFAR in this regime, after the training improvements. It appears that the network has become so good that it just doesn't make that many errors, and any Langevin exploration you do introduces more error than it actually fixes. But this is still not the case with ImageNet; there it still pays to use stochasticity.

Okay, so that was mostly it; just a brief conclusion. We've exposed the completely modular design of these diffusion models: instead of viewing them as tightly coupled packages where you can't change anything without breaking something, we show that you can change pretty much everything and still get a valid method, as long as you follow some loose guidelines. With that knowledge we get a clear view into what we should actually be doing with those choices, and doing so pays off in a big way: we get much improved quality from much more efficient models. One takeaway about stochasticity: it's a bit of a double-edged sword. As I said, it does help, but it requires that annoying per-case tuning, and there are no clear principles for how to do that tuning. There is also a danger that you could have bugs in your code and the stochasticity will just fix them to an extent, which is of course not what you want if you're trying to understand what your potential improvements are and what their effect is. So ideally you'd be able to work in the completely deterministic setting, and then, if you want, reintroduce the stochasticity in the end as the final cherry on top.

We haven't talked about all the fancy stuff like higher resolutions, network architectures, classifier-free guidance, and so on, but many of these would probably be ripe for a similar principled analysis; we hope this inspires you to also think about those kinds of things, and certainly we are inspired. With that, the code and everything is of course available. I would argue this is probably one of the better places to copy-paste code from if you want to experiment with this stuff; it's a very clean codebase that directly implements these ideas. So, thank you for your attention.

Do we have time? Yeah, I have time for a question. All right, an explanation for why stochasticity only helps there: it probably has to do with the data complexity. CIFAR is maybe a bit too simplistic in the end, it's kind of learnable entirely, but it seems like something like ImageNet is still so extremely
Info
Channel: Finnish Center for Artificial Intelligence FCAI
Views: 3,551
Id: T0Qxzf0eaio
Length: 52min 45sec (3165 seconds)
Published: Sun Oct 22 2023