Diffusion Models Explained: From DDPM to Stable Diffusion

Captions
As you can tell from the title of this talk, I'm going to be talking about diffusion models. This talk was originally given as part of the Machine Learning Singapore meetup group about two weeks ago. If you're interested in this kind of content, please subscribe, and also feel free to join the Machine Learning Singapore meetup, where we have this stuff live and you can participate in real time.

The approach I'm going to take is to look at the history of the ideas that have arisen within diffusion models. While the diffusion idea itself is quite old, this whole sudden evolution of ideas has occurred since the beginning of covid, so we're talking about relatively recent times, even in deep learning terms. I'm also going to talk about some of the extras, like inpainting or deriving nice pictures from drawings, and how this might be applied to other fields; we're already seeing results of that, including, within the last two weeks, movies.

In terms of history, I'm going to cover a bunch of different models: the original DDPM (Denoising Diffusion Probabilistic Models), then DALL-E and CLIP, which weren't exactly diffusion models but which carry important messages about how we're going to deal with text. I'll talk about OpenAI's diffusion work and classifier-free guidance, then move on to latent diffusion, GLIDE and the current state of play. After that we can talk more specifically about how Stable Diffusion works, and then I'll move on to the next steps.

The first of these models is DDPM, from 2020, after the beginning of covid for real. Along came a nice paper which described the process. The diagram is one you may have seen a few times before, and I'm not going to delve into the mathematics too deeply, but let's at least lay some groundwork with the notation. What we normally do is look at the right-hand side of the diagram, where you have a genuine picture: that's x_0, the real picture. Then we successively apply this q process to make the picture more and more noisy, so going from x_0 towards x_T is the noising direction. Making the picture noisier is actually an easy process, because all we have to do is add some noise to the pixels; the tricky part is how to denoise it. But given that we can make a noisy picture, and we know all of the steps in between, we've set ourselves up for success in training a model to do the denoising. We can see the same kind of setup in many other U-Net tasks: take a colour picture, make it black and white, and then ask "how do I make a black-and-white picture colour?" You've given yourself a task where you know both the beginning and the end, and diffusion sets itself up in the same way. (Please, someone ask me for the slides; all of the links are clickable, and you can find the papers and code from them.)

In terms of noising, at each noising step we essentially reduce the magnitude of the image pixels a little bit and replace what we took away with Gaussian noise. This N is just a normal distribution, the typical bell-shaped curve. As you keep adding noise and shrinking the pixel magnitudes, the noise gradually comes to dominate, until it has completely swamped the original picture.
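To make the noising step concrete, here is a minimal sketch, my own illustration rather than the paper's code, of one forward step: shrink the image slightly and mix in Gaussian noise. The beta schedule values are placeholders.

```python
import torch

def forward_noise_step(x_prev: torch.Tensor, beta_t: float) -> torch.Tensor:
    """One step of the forward (noising) process q(x_t | x_{t-1}):
    shrink the previous image a little and add Gaussian noise,
    so that after many steps only noise remains."""
    noise = torch.randn_like(x_prev)
    return (1.0 - beta_t) ** 0.5 * x_prev + beta_t ** 0.5 * noise

# Example: a toy 3x64x64 "image" pushed through a few noising steps.
x = torch.rand(3, 64, 64) * 2 - 1          # pretend image scaled to [-1, 1]
betas = torch.linspace(1e-4, 0.02, 1000)   # a common linear schedule (assumed values)
for t in range(10):
    x = forward_noise_step(x, betas[t].item())
```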
So when you get to the full x_T, you time it so that at that point you have a pure-noise distribution. One of the nice things about adding a normal distribution at each step is that if I add a bell-shaped curve to another bell-shaped curve, I get yet another bell-shaped curve. So we can compute exactly the statistics at every step, because we're using these nice Gaussian distributions: the accumulated noise over all of these time steps is also normal. That means we can calculate what the distribution at any time step should be without going through all of the intermediate steps; in one leap we know exactly what noise to add. We don't have to run the noising process expensively, step by step; we can compute any transition we want, and then just train the model to handle all the transitions. Within a batch you would want lots of different images, all at different time steps, so the model learns about the distribution of everything rather than seeing one image at every time step in a single batch, where it would tend not to learn as well; you want to mix within a batch.

In terms of denoising, what we're going to do is use this function p with a little theta, where theta are the parameters of a U-Net (there's a diagram of that a little later). A U-Net is the kind of architecture typically used for something like turning a black-and-white picture into a colour picture, or working out the image segmentation of a scene for self-driving: where is the road, where are the trees. So a U-Net maps pixels onto pixels, but because of the U shape of the thing (we'll see what that means), it also takes into account a hierarchy of different scales within the image. When we're doing this denoising, we're going to try to guess what the cleaner image's pixels should be. In fact, we don't specifically want to guess the actual image: what we really care about is the distribution of what a good image would look like, which is the statistical quantity people actually want. So rather than guessing the image itself, we try to guess the distribution of images which could have given rise to this noisy image. Because all of these changes are Gaussian, it's sufficient to guess the mean of that distribution; we essentially know what the variance is going to be, and we can hand-wave it away, because people have found that just getting the mean is good enough. What people have also found is that it's actually more stable to predict the change in the pixels, i.e. how much do I have to change this fuzzy image to make a good image, rather than predicting the exact pixel values directly. Clearly I could take my noisy image, add the predicted changes, and get the final image; but it's more stable to predict the errors that I'm trying to fix up.
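Because a sum of Gaussians is Gaussian, you can jump straight to any timestep in one leap, and the standard DDPM training loss is just a mean-squared error on the predicted noise. A minimal sketch of both ideas; any `model` mapping (x_t, t) to a noise estimate would do, and the schedule values are placeholders:

```python
import torch
import torch.nn.functional as F

betas = torch.linspace(1e-4, 0.02, 1000)          # noise schedule
alpha_bars = torch.cumprod(1.0 - betas, dim=0)    # cumulative products

def q_sample(x0, t, noise):
    """Closed-form q(x_t | x_0): one leap to timestep t, no loop needed."""
    ab = alpha_bars[t].view(-1, 1, 1, 1)
    return ab.sqrt() * x0 + (1 - ab).sqrt() * noise

def ddpm_loss(model, x0):
    """Each image in the batch gets its own random timestep, and the model
    is trained to predict the noise that was mixed in (more stable than
    predicting the clean pixels directly)."""
    b = x0.shape[0]
    t = torch.randint(0, len(betas), (b,))        # mixed timesteps within one batch
    noise = torch.randn_like(x0)
    x_t = q_sample(x0, t, noise)
    return F.mse_loss(model(x_t, t), noise)
```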
Here's the diagram of the U-Net. Essentially we take in the original image, which is a noisy version, and use a CNN of some kind to map it onto the output image. But we also downscale it to maybe half the resolution, map that downscaled version in the same way, and then upscale it and add it back on, so that information gathered from more of the whole image gets mixed in. We do this several times, and you can see why it's called a U-Net: these skip connections and the repeated downscaling and upscaling allow information from the whole image to flow through, so the output is a coherent image.

Another feature of this denoising process is that, because we're running it at lots of different time steps, we need to inform our single U-Net what time it is: am I trying to deal with an extremely noisy image and just make it a little bit less noisy, am I at the other end trying to make a pretty good image even better, or is this the final step where I should make the image perfect? You tell it the time by adding in a time embedding, which is basically the value t expanded into a bunch of different dimensions, just like we would do with the positional encoding for a Transformer. If that doesn't make sense, just think of it as adding some time information so the model knows where it is. Later, we're also going to add other information, like what am I trying to draw, what direction am I heading: am I trying to produce a picture of a cat or a dog, or what sentence am I trying to represent? So there's an opportunity to add in extra conditioning information, which these layers can then use to inform how they do the denoising.
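The time information is typically injected as a sinusoidal embedding, very much like a Transformer's positional encoding; a sketch of what that might look like (my own, not taken from any particular codebase):

```python
import math
import torch

def timestep_embedding(t: torch.Tensor, dim: int = 128) -> torch.Tensor:
    """Expand a batch of integer timesteps into `dim`-dimensional sinusoidal
    features, positional-encoding style. The U-Net blocks can then add or
    concatenate this so every layer knows how noisy the input should be."""
    half = dim // 2
    freqs = torch.exp(-math.log(10000.0) * torch.arange(half) / half)
    args = t.float()[:, None] * freqs[None, :]
    return torch.cat([torch.sin(args), torch.cos(args)], dim=-1)

emb = timestep_embedding(torch.tensor([0, 250, 999]))   # shape (3, 128)
```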
If you're interested in actually following up on the maths, I would recommend a very nice explanation that's out there (I have links in the slides and I'll put links in the description), which goes through all of these different steps; I'm not going to replicate that here. It also delves into the shortcuts being made and the terms that get thrown away because, in practice, people have found they don't matter. There's a bit of hand-waving going on, but people have found techniques which really work, and in many ways that's what matters.

In terms of the original output that the DDPM paper had, back in the day, basically two years ago, the thing to do for this kind of image generation, as with GANs, was to run it on something like the LSUN bedrooms dataset. This was a big dataset of pictures of bedrooms, and by training one of these models you could get it to produce new pictures of bedrooms. If you've seen sites like "this person does not exist", "this bedroom does not exist" or "this cat does not exist", these are the kinds of datasets they're trained on. But because it's trained on a bedrooms dataset, you know it's going to produce a bedroom; you can't really direct what kind of bedroom it's going to be. So this is unconditional, uncontrollable generation, and while the images look pretty impressive, they weren't as good as what GANs could do at that point. These were the early days, when diffusion was a bit of an oddity to study, but at least it was getting somewhere, producing bedrooms which don't exist, and was slightly competitive.

Next up, probably at the beginning of 2021, DALL-E and CLIP were announced by OpenAI. The typical example we all saw from that awesome blog post is the armchair in the shape of an avocado, which was truly impressive at the time. The way DALL-E worked was not a diffusion process: it represented the images in a Vision-Transformer-like way, where patches of the image become tokens, and given those tokens you can produce an image. They make it so that you can take in an image, convert it to tokens, and use the same tokens to reproduce the same image, so it's basically an autoencoder-style model. Once you've got image tokens, you can say: suppose I had some tokens for some text, I can feed those into a language model which takes texty tokens and produces imagey tokens, and I then treat that output as a new image. This is heading towards what OpenAI had been doing really well at that point: treating everything as a language model. So DALL-E is not a diffusion thing, it's very much a language-model-type approach.

One of the things they built alongside it is the CLIP model, a text-orientated model trained to tell whether an image is a good depiction of a given text. They trained it using the captions of a very large dataset: you have a caption with a known-good image, and you train contrastively. What that means is, if this caption says "a picture of a beautiful tabby cat" and there's a bunch of other images in the same batch, the model is told that these two things correspond, this caption and this picture, while all the other pairings are counter-examples, effectively the opposite of correct. This pushes the representations that CLIP has for all the different texts and images into a shared space which relates images to text, so CLIP can be thought of as a text-to-image matching tool, or really a text-to-representation tool.
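The contrastive objective is a symmetric cross-entropy over the matrix of image-text similarities within a batch; a minimal sketch of the idea, my own simplification of what the CLIP paper describes:

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """img_emb, txt_emb: (batch, d) embeddings from the image and text towers.
    Matching pairs sit on the diagonal of the similarity matrix; every other
    pairing in the batch is treated as a negative example."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature        # (batch, batch) similarities
    targets = torch.arange(len(logits), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```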
What they used CLIP for in the DALL-E work was essentially to automatically cherry-pick which generations were good images of an avocado-shaped armchair: they would generate maybe 50 images and then pick the ones that CLIP thought were most like the text. So this is auto cherry-picking, a very smart idea. Nothing against having a model generate huge numbers of samples and then pruning the list; that works very well in many settings. But it's slightly unfair to say that DALL-E was producing great images straight off: it was producing a lot of images which were then pruned. They're still awesome images, but CLIP was a big element of this.

There's a diagram of the two sides of what CLIP is doing: on the top the text goes in and is mapped through some encoder, on the bottom the images come in and are also mapped, and the contrastive objective says the pairs that match should score "yes" and the pairs that don't should score "no". When it comes to actually using this, you can use it in a variety of ways, and one which was super cool is that they could use it to classify ImageNet images, just by saying the caption should be "this is an image of a red panda" or "this is an image of a cheetah". For all of the thousand classes in ImageNet they would make a very small caption and then ask which caption CLIP prefers for the image, and they found they could do ImageNet classification to quite a good degree of accuracy just using the CLIP model, which is a remarkable result. What's more, they could use not just the original ImageNet images but also cartoons and sketches of the ImageNet classes, and this robustness far exceeded what an ImageNet-trained model could do. You could see that the CLIP model had extracted some kind of more semantic understanding of what was going on in images and text.

Just after OpenAI came out with DALL-E, which was not a diffusion model, you can see that they started to dip their toes into diffusion models, perhaps thinking this was an interesting thing but not yet seeing it as the future. My guess is that these authors built systems to make the whole diffusion pipeline work, and then started to chip away at results to make it a bit better, which they could then write up. What they identified is that the schedule, which was typically run over thousands of steps, spent a lot of time fixing up very, very noisy images, with the same-sized intervals also spent on the final steps. They reasoned it would be worth spending more of the steps on making the image close to perfect, because when it's super noisy it doesn't make much difference whether each step is exactly right. So by re-timing the steps within the schedule, focusing more steps on the last stages and comparatively fewer at the beginning, you can reduce the total number of steps required. This is a nice result: it means you can move from thousands of steps down to hundreds of steps.

That then led to another paper, "Diffusion Models Beat GANs". Here they looked at whether you can produce ImageNet images to match the class labels using these diffusion methods, and indeed you can: they show, using a machine-learned statistic (the FID score), that the quality of these images starts to beat BigGAN, the leading GAN at the time, with as few as 25 forward steps, which is quite impressive performance. In particular, they do it using an ImageNet classifier: you take a trained classifier and, at each diffusion step, ask what would make this ImageNet class stronger, i.e. move the denoising in the direction of the gradients for that class. This requires taking gradients through a bunch of machinery, but it enables the model to take better denoising steps, because you've got a signal pointing towards the class. One twist is that you can't train the classifier only on clean images; you have to train it so that it also works on very noisy images of, say, a red panda. It's not your standard classifier, but that signal, "is this getting closer to red panda or further away", was very valuable and could be used to make good images, whether you're generating any old image from ImageNet or something directed at a particular class.
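A sketch of the classifier-guidance idea, with hypothetical function names; the real guided-diffusion implementation is more involved. At each denoising step you ask a noise-aware classifier for the gradient of the target-class log-probability and nudge the predicted mean in that direction:

```python
import torch

def classifier_guided_mean(mean, x_t, t, classifier, target_class, scale=1.0):
    """Shift the denoising mean towards images the classifier thinks look more
    like `target_class`. The classifier must itself have been trained on noisy
    images, otherwise its gradients at high noise levels are meaningless."""
    x = x_t.detach().requires_grad_(True)
    log_probs = torch.log_softmax(classifier(x, t), dim=-1)
    selected = log_probs[torch.arange(x.shape[0]), target_class].sum()
    grad = torch.autograd.grad(selected, x)[0]     # d log p(class | x_t) / d x_t
    return mean + scale * grad
```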
Using this gradient idea does work, but then something new came along, this time from Google; clearly all of the major players were playing with these models at this point, and now we're about a year ago. What Google did, my guess is after warming up on these models, was to ask: how do we train this without needing a classifier? The classifier is a little bit embarrassing, because we have to know all about our images before we can push them in a direction, and maybe we want guidance that is difficult to differentiate through. So they developed a technique called classifier-free diffusion guidance. What they do is take some guiding information, such as the class label, and during training either give the diffusion model no hint about what is wanted, in which case it should produce any old (good) picture from ImageNet, or tell it "we want the red panda". Because they're training on an image of a red panda either way, and they do this with lots and lots of images, the model learns both what "no information" feels like and what it means to be given a big clue: "ah, red panda, that's where I should be going."

The neat thing is that we haven't had to compute any gradients through a classifier; we just either give the model the conditioning information or we don't. Then at test time you give the model two goes: on the first go you don't tell it what you want, on the second go you tell it what you're aiming at. Now you've got two different denoising predictions, one without guidance and one with, and you can take the difference between them. The only reason there is a difference is the guidance, so that difference represents the value of the prompt, or of the class. You don't have to use just the guided prediction: you can amplify the difference, effectively taking a twice- or three-times-guided version by doubling or tripling it before adding it back on. So classifier-free guidance also lets you amplify the guidance signal, up to around 10x. You have to be a bit careful here, because you might end up with pixels that are "whiter than white", outside the allowed range of pixel values, and you may need to cope with that; suddenly we're doing something which isn't strictly mathematical at all. We've taken two things whose difference makes sense, but multiplying that difference by 10 and adding it on doesn't strictly make sense, apart from the fact that it works really well. That's where the technique of classifier-free guidance comes from, and what you see is that it tends to produce much more varied and much more distinctive examples.
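In code, classifier-free guidance is just a weighted extrapolation between the unconditional and conditional noise predictions; a sketch with names of my own choosing (the `cond=None` convention is an assumption about the model's interface):

```python
def cfg_noise_prediction(model, x_t, t, cond, guidance_scale=7.5):
    """Run the denoiser twice -- once with no conditioning, once with the
    prompt/class embedding -- and amplify the difference. guidance_scale=1
    recovers the plain conditional prediction; larger values push harder
    towards the conditioning signal (and can push pixels out of range)."""
    eps_uncond = model(x_t, t, cond=None)   # "I have no idea what you want"
    eps_cond = model(x_t, t, cond=cond)     # "you asked for a red panda"
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)
```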
Comparing the samples, on the left there are examples with weak guidance, and on the right a strongly amplified classifier-free guidance. You can actually see it in the eyes of the cat: they're super-saturated, because the model has been guided to be even more "Siamese cat" than it would be on its own.

Now we move to just before Christmas 2021, when two papers were released on the same day; both of these teams must have been kicking themselves, because in the rush to get things out for Christmas they overlapped significantly. These were independent efforts, simultaneously released. We've got GLIDE from OpenAI, who produced code but unfortunately released filtered models: the models kind of work, but they don't work for faces or for any sensitive subjects. And we have a new lab here, the Ommer Lab in Europe, who also released code, doing something called latent diffusion. This latent diffusion is going to form the basis of Stable Diffusion, so this is the one to watch.

So what is GLIDE? GLIDE is a diffusion model, but instead of using classifier-free guidance with an ImageNet class, they take the text embedding, in the CLIP style, and use that as the representation that gets stuffed into every step of the denoising, with an attention mechanism inside the U-Net. It's a more sophisticated kind of conditioning, and it can drive the image generation via prompting: now we've got to the stage where we're actually using text to say where we want this image to end up. As I said, OpenAI released a filtered model, and from what I can tell from outside OpenAI, this is really what morphed into DALL-E 2, which was released in April this year. "DALL-E" is clearly a better marketing name; "GLIDE", or "unCLIP", which is roughly what it's called in the paper, just aren't as sexy as DALL-E 2. One of the pictures everyone loves is the hedgehog with the calculator, which was clearly remarkable at the time.

Let's also look at what latent diffusion is doing. Latent diffusion had some other qualities; in particular it started to get a bit better at representing things in the image, even words in the image. One of the nice things here is that they open-sourced the code, they credited the OpenAI code base, and lucidrains had a nice PyTorch implementation, so it has good open-source credentials. This is the model which was then morphed into Stable Diffusion, trained by Stability AI and the people associated with it. This is the big model that's causing all the recent explosion of content and innovation in this field, because the Google models and the OpenAI models are basically hidden, either fully within Google or behind APIs with huge waiting lists and various other issues. So latent diffusion is a very interesting model, and I'll get into why it's interesting and specifically different just now.
There are a couple of models involved in this. The key part, the diffusion process with all the steps and the U-Net, is basically the same as before. You can also see "QKV" in their diagram, which is shorthand for doing some Transformer attention on the prompt, so the steps within the diffusion process are pretty much what other people are doing. The twist is the E and D, the encoder and decoder. Instead of running the diffusion process on raw pixels, which is what other people were doing (and they had a lot of compute to be able to do it), this group said: let's train up an encoder-decoder, so we can take an image, encode it into a much smaller, very rich representation, such that the representation can get back to the same image. It's an autoencoder-style model that passes through a small "image" which is not pixels but some kind of representation space. You might think of it as converting a PNG image into a JPEG and back to a PNG: PNG is lossless, JPEG is lossy, but it's a pretty good correspondence. Here the thing in the middle is definitely not a JPEG, but it is a much smaller representation, which means we can train our diffusion model to noise up and denoise in this small latent space, and then translate back into a nice image at the end. You can take a good image, encode it, noise it up, denoise it, decode it, and out comes a new nice image. That's why it's called latent diffusion: we're doing the diffusion not in pixels but in this latent space. As I said, the nice thing is that the latent space can be much smaller than the full pixel grid; going from, say, 256 by 256 pixels down to a 64 by 64 latent saves a huge amount of compute, just because you're not dealing with such large areas.
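Putting the pieces together, the latent-diffusion recipe looks roughly like this; a schematic sketch with hypothetical encoder, decoder and denoiser objects, not the actual latent-diffusion codebase:

```python
def latent_diffusion_sample(encoder, decoder, denoiser, prompt_emb, steps, z_T):
    """Diffusion runs entirely in the autoencoder's latent space (e.g. a 64x64
    latent instead of a 256x256 pixel grid), which is where the compute saving
    comes from; the decoder maps back to pixels only at the very end."""
    z = z_T                                  # start from pure latent noise
    for t in reversed(range(steps)):
        z = denoiser(z, t, prompt_emb)       # one text-conditioned denoising step
    return decoder(z)                        # back to pixel space

# Image-to-image instead starts from encoder(image), noised only part of the way.
```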
Okay, so now to the current state of play; the date here is as of our meetup, which is now about two weeks ago. We've got three big models on the scene. There's DALL-E 2, where OpenAI is trying to be a bit more open, opening up the API and adding new things to it as they see the Stable Diffusion people handing out code for free. There's Imagen from Google, a newer model which I'll explain a bit more about. And there's Stable Diffusion, which is a fully trained-up version of the latent diffusion model, released open source.

DALL-E 2 came out in April 2022; there's a nice paper and a beautiful website, and there's an API for which you can pay for credits, which you then spend on making images, and you have rights to use those images, so in a commercial setting this would be just fine. There are some interesting results, and also all sorts of interesting negative results: if you ask for a sign that says "Deep Learning", it tries, and it's remarkable that it has any success at all, but it isn't as good as it could be. It's also interesting that OpenAI has been shown to manipulate prompts so that the model doesn't take on the biases it has from the dataset. A dataset of stock images of "a doctor" is probably going to produce quite a biased range of images, and OpenAI wanted to avoid that, as is reasonable. But you can prove that they're doing something to your prompt, without your asking, by getting the doctor to hold up a sign: if you ask for "a picture of a doctor holding a sign" and the sign says the word "black" on it, you know something has happened to your prompt, because otherwise it might have the word "sign", or "doctor", on it. Clearly something has been appended to the prompt because the system identified that it's looking for a person, and OpenAI, reasonably, wants the output to be, or at least appear, less biased than the dataset. I'm fully in support of the idea of trying to de-bias the outputs, that's probably a good thing, but it's questionable to me whether they should do it by changing what the customer is asking for. It's definitely a difficult subject, and this is one approach, but it's amusing that you can discover what they're doing by probing it in various ways, and people enjoy poking holes in this.

Let's move on to what Imagen is about. Imagen has some great images, and I have to say I do like the quality of these Google images; they have a kind of googly innocence about them in some ways, which is kind of weird, like the sushi-house corgi. One of the nice things is that Google has done this by using a language model, rather than CLIP, as the source of the text embedding. They use a T5 model, a properly trained language model with an encoder-decoder structure, and they use a very large pre-trained one, trained on huge amounts of text. Text models can be trained on much larger datasets than text-image models, because text-image is a specialist niche with only billions of examples, whereas for pure text there are trillions of tokens available; the text training sets are enormous. What they find, and kind of prove by example, is that scaling up the language model is worth more than scaling up the image side, because if you describe a complex scene, the T5 model captures the relationships between the things you're describing better than CLIP does. In a way this is understandable: CLIP learned the English language by reading lots of captions, so if you're talking about things that appear in images, CLIP will be fine, but if you're talking about the relationships between lots of different things in an image, CLIP may not understand language well enough to communicate that, whereas the T5 model knows a lot about language and can pick up these nuances much better; the image side then has to learn to pick out the various elements. Another key piece of Google's system is an adjustment for when the pixel values go outside their regular ranges, which is exactly what happens if you apply classifier-free guidance strongly: they have a nice rescaling-and-remapping method for coping when things get out of bounds, which means they can use much higher classifier-free guidance weights, and that boosts performance.
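Imagen's fix for the out-of-range pixels produced by high guidance weights is called dynamic thresholding; roughly, and this is my paraphrase of the paper's description rather than their code:

```python
import torch

def dynamic_threshold(x0_pred: torch.Tensor, percentile: float = 0.995) -> torch.Tensor:
    """Instead of hard-clipping the predicted image to [-1, 1], find the given
    percentile of absolute pixel values per image and, if it exceeds 1, rescale
    the whole image by it before clipping. This tames the saturation that
    strong classifier-free guidance produces."""
    s = torch.quantile(x0_pred.abs().flatten(1), percentile, dim=1)
    s = torch.clamp(s, min=1.0).view(-1, 1, 1, 1)
    return torch.clamp(x0_pred, -s, s) / s
```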
Another thing to point out with Imagen is that they have three models: one works at fairly low resolution, and then there are upsampler models which do diffusion-based upsampling, also guided by the prompt. So there's a whole bunch of models to train here, and Google spent a lot on compute. Somehow I doubt they're going to come out with an API for this, but it definitely proves that Google can also play this game.

And now to the final one, which is in some ways the big result, because this is the open model. Stable Diffusion was released in August 2022, and the code and the models were all released as open source. The caveat is that there's a license which says, essentially, don't make images we don't approve of: they want to discourage people from, for instance, making images of politicians saying things which misrepresent their positions. There's a line they don't really want people to cross; of course people will immediately cross it, but at least the sentiment is there that this is not a fully "do anything you like" model, and people should observe some restraint. On the other hand, people have been experimenting, and that experimentation has proved extremely fruitful; we're going to see some of the ways people have expanded what can be done with these models, because they're willing to play with them and press them to do unexpected things.

So let's talk a little about some of the other things people have come up with recently, because they've been able to fiddle with the Stable Diffusion model and basically expand the repertoire of what's available. I've just picked out a few things here; more stuff is arriving all the time. In the future-work section I would have mentioned the "let's make a movie" idea, but that has actually come out within the last few days, so it's no longer just future work; there is a lot going on right now.

First, inpainting and outpainting. The task here is: suppose there were someone sitting on this bench and I wanted to remove them. I can mask them out and get the Stable Diffusion model to paint in whatever it thinks is most likely to go in that space, which is a fairly normal task, removing a person from a picture. The way this is done uses the fact that we already know what most of the image has to be: outside the mask, the image is fixed. So we noise up the whole image, just as if it were part of the normal process, and let the pixels denoise towards whatever they want to be, but the pixels we actually know are guided, for sure, towards the known truth at every step; only the pixels we don't know are allowed to denoise in the normal way. Everything then falls into place: because the outside is being pinned to the right pixels, the inside falls into place and meshes perfectly, since that's the most reasonable way of inpainting the picture. With all these distributional methods we're looking for the most reasonable way, or at least a good way, of explaining everything we see.
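The inpainting trick can be written as one extra line per denoising step: wherever the mask says the answer is known, overwrite the sample with a correctly-noised version of the known image. A sketch with hypothetical helper names, in the spirit of the approach people use with Stable Diffusion:

```python
def inpaint_step(x_t, t, known_image, mask, denoise_step, q_sample):
    """mask == 1 where pixels are known (keep), 0 where new content is wanted.
    `denoise_step` is one reverse-diffusion update; `q_sample` noises the known
    image to the matching noise level so the two regions mesh seamlessly."""
    x_prev = denoise_step(x_t, t)                       # free-running denoise everywhere
    known_noised = q_sample(known_image, t - 1)         # known pixels at the right noise level
    return mask * known_noised + (1 - mask) * x_prev    # stitch the two together
```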
The same can be said for outpainting. Here we've got an image which we know is a good image, and some surrounds which we want to fill in; we keep iterating forwards, holding the known part fixed, while essentially dragging the surrounds in to be a nice image as well, so that everything meshes perfectly.

That ties into another constraint people want: the ability to tile these images. It's quite difficult to see from a single sample, but if you play around with these images, one side of the image perfectly matches the other side, and the same with top and bottom. So here we've got a diffusion process with the additional constraint that whatever is at the top should match whatever is at the bottom, and that constraint is actually fairly easy to implement, because every time you move forward a step you can re-impose it so it holds true. This tiling thing is something which people involved in wallpaper, and maybe Minecraft, really care about: tileability of textures, and maybe printing for fabrics as well. It's quite a niche application, but it's an example of the kinds of things people innovate for.

Here's a nice trick, and there are Colab notebooks out there if you want to try it; it's very impressive for children, so if you have kids I highly recommend it. You take a drawing: kids' drawings can be a bit abstract, but they kind of know what they want, even if they haven't got the facility to make it. With Stable Diffusion, you put the drawing into the latent space and noise it there, but not fully, so it's not completely noise; you noise it maybe halfway, or only a quarter of the way, so it's a somewhat noisy image in latent space. Then you give it a prompt describing whatever the child really thinks the image is of: "my drawing is of a pixie, and she should have this kind of hair", and so on. Because you're not working from pure noise all the way to an image, but from what is essentially a sketch of the image in latent space, there are relatively few things the model can do to match the prompt, so the result will turn out looking very much like the original drawing. And if you ask for it to be "high resolution, trending on ArtStation" or "taken with a Canon 5D", you can make photorealistic images from children's drawings: the overall structure is set by the drawing, the genre by the prompt, and you can then have a nice discussion about what the child wanted and how much they like the new image. It makes for a very nice experience.
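The children's-drawing trick is just image-to-image with a "strength" knob: encode the drawing, noise it only part of the way, then denoise with the child's description as the prompt. A sketch using the same hypothetical components as before:

```python
def img2img(encoder, decoder, denoiser, q_sample, drawing, prompt_emb,
            total_steps=50, strength=0.5):
    """strength=0 returns (roughly) the original drawing; strength=1 ignores it
    and generates from pure noise. Somewhere in between keeps the child's
    composition while letting the prompt redraw the details."""
    z = encoder(drawing)                          # into the latent space
    start = int(total_steps * strength)           # how far to noise it up
    z = q_sample(z, start)                        # partially noised latent
    for t in reversed(range(start)):
        z = denoiser(z, t, prompt_emb)            # guided denoising from there
    return decoder(z)
```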
Okay, so I've pulled out a few papers for the future, a little coming-soon list. There's the idea of "let's make a movie": Facebook has come out with a movie-generation system which produces variations of an image over time, and there have been some others. I suspect it's all to do with the ICLR deadlines; people are trying to get their work out before going anonymous while the review period is on, so there's been a flurry of new stuff in the last week. My guess is that the big labs will go quiet for a bit until the next major deadline, but people have been working on other things too, so I've picked a few interesting directions. This isn't a claim that these are the greatest papers around, just that these people are doing interesting work. In particular I point out the MotionDiffuse paper because it's from a Singapore university; as I mentioned earlier, we're part of the Singapore meetup, and we'd be glad to talk to the authors if they were to reach out to us.

First, cold diffusion. This is an interesting idea, and there's another work, I think called soft diffusion, in the same direction: it's all very well doing Gaussian noise, but there are lots of other ways to degrade an image, and maybe those other corruption methods give a different approach than just pure Gaussian noise. Relaxing the Gaussian assumption introduces other issues, because you can't just add the distributions together in the same way, but with a few correction terms you can get some pretty nice results. This paper corrupts images in a whole variety of ways, fading out, masking within a restricted circle, or pixelating, and it's interesting that the process doesn't have to be purely Gaussian; you can make almost any old degradation process work, which makes it an interesting direction to watch.

Then there's the motion-generation paper I mentioned. They're interested in text-guided motion for a 3D skeletal form: they take a whole bunch of skeleton poses, make noisy arrangements of them, and then have them come together into nice skeletons again, with sensible joint angles, based on the text prompt. Essentially they take what would be a random field and denoise it into a motion, and it comes together pretty nicely. There have been other examples of this just recently, I guess also for the ICLR deadline; people are interested in generating motion, and you can imagine this in lots of other fields. There has been a robotics one as well, I think from the BAIR lab, on motion planning by diffusion: the criterion there is, I have random motions, I want them to satisfy these constraints, so let's iterate until they get closer and closer to satisfying them.

Something I'm keen on is TTS, text-to-speech. One of the key problems there is producing good spectrograms which you can then invert: a good mel spectrogram can be passed into a GAN-like vocoder which produces a very nice speech waveform, but can you make a sufficiently satisfactory mel spectrogram to feed it? The answer is yes, the diffusion approach works well, and it can even be done in rather few steps. Another neat thing they do is take the diffusion process, which could be a long process of many steps, and ask: can I do two steps in one by distilling the model? They've trained a U-Net to go one step at a time; now they use that as a teacher to train a student that goes two steps at a time, and then use that as a teacher for a student that goes four steps at a time. So they can compress the number of steps by training successively bolder students which understand how to take bigger leaps.
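The step-halving idea can be sketched as: the student is trained to reproduce, in one jump, what the teacher does in two. This is a rough caricature of progressive distillation, with hypothetical sampler and step functions, not the actual method's parameterisation:

```python
import torch
import torch.nn.functional as F

def distill_batch_loss(teacher_step, student, x_t, t):
    """teacher_step(x, t) performs one deterministic denoising step with the
    frozen teacher; the student is asked to land directly on the point the
    teacher reaches after two such steps, so a trained student needs half as
    many steps. Repeating the whole procedure halves the step count again."""
    with torch.no_grad():
        target = teacher_step(teacher_step(x_t, t), t - 1)   # two teacher steps
    return F.mse_loss(student(x_t, t), target)               # one student step
```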
This means we can now get towards real-time generation, and because we can constrain these diffusion processes in lots of different ways, it's much more promising for prosody and all the other things we might want to control in speech.

Finally, text generation. We know how to produce text one word at a time, the GPT kind of model, but suppose I want the text to satisfy some constraints. The particular constraint they talk about is wanting the text to have a certain format in terms of its parse tree: not just subject-verb-object but all switched around, with various clauses, mentioning the time first, and so on. If I generate one token at a time, it's very unlikely I'll end up with that parse tree, and I don't want to generate huge numbers of rollouts; that would be impractical. What they do instead is start with a bunch of text-token-like embeddings, roll them forwards through a diffusion process, and at each point measure how likely this is to lead towards the parse structure that's required, guiding it towards that target. It's the same kind of idea applied to a very different field, and by nudging in this way we can satisfy all sorts of non-differentiable constraints, because with this classifier-style guidance, and with constraints imposed at each step, we can work without having to differentiate through anything. So there are a whole bunch of other interesting directions which are definitely things to watch.

In the description below I'll link to some other resources. The first is from Lilian Weng, who has an excellent blog with lots of good mathematical detail and background on many deep learning topics, including "What are diffusion models?". Google Brain has done an overview paper unifying a bunch of these different threads, a 23-page overview, very nice. There was a tutorial at CVPR; the YouTube video is over three hours long and it's an academic approach, but very worthwhile. And if you're interested in a different perspective on some of the things I've described, there's also Ms. Coffee Bean, who has a video on how Stable Diffusion works; thumbs up for that too.

As a wrap-up: these advances are being made at a huge rate now, and the steps which have all happened during covid have been enormous (hopefully everyone else has had covid projects as successful as this). Now everyone can get inspired, because everyone has access to the Stable Diffusion models: you can run them on a commodity GPU, or you can run them on Google Colab very successfully. People have been training bigger models, running all sorts of different configurations, doing the inpainting and outpainting, building UIs; there's an explosion of people and an explosion of ideas here.
All of this is caused by the power of open source. These are super exciting times, and it kind of proves that actual openness in AI is a very worthwhile thing, and that this is true democratization of what's going on. Of course OpenAI, Google, DeepMind and Facebook are doing great work, but there's nothing like releasing something for everyone to try. As a final piece of advertising, let me reiterate that this content came from the Machine Learning Singapore meetup; please join that group if you want. I'll also be putting some of my content on my YouTube channel, which has had quite a lot of Minecraft on it recently because of the MineRL competition that's currently running, but I'll also try to put up some of my content from the meetup, and other interesting things that I see, whether that's cloud GPUs or other deep learning stuff. Hopefully there'll be more to come; see you next time.
Info
Channel: Martin Andrews
Views: 4,950
Keywords: AI, Transformers, Singapore, Diffusion, StableDiffusion
Id: hVk7Py1c24Q
Length: 55min 35sec (3335 seconds)
Published: Mon Oct 03 2022