OpenAI DALL·E: Creating Images from Text (Blog Post Explained)

Captions
A sphere made of Swiss cheese. A sphere with the texture of Swiss cheese. And there you have it: beautiful, very appetizing Swiss cheese balls. My Swiss heart just skipped a beat at this monstrosity. What's even cooler than a sphere made of Swiss cheese is a torus made of denim. These images are so cool, a torus made of denim, and the point is that they aren't photoshopped or human-created in any way. They are AI-generated, produced by a new model that OpenAI has released a blog post about, called DALL·E. The model can take a piece of text, such as the one on top here (the fact that I can only select from a list is simply because they don't give you access to the model, just to a bunch of prompts they've tried), and output a picture that matches that text. So here you get a torus made of toothpaste, and the quality of these images is astounding.

What's even more astounding is the range of capabilities this model has. Here, for example, the input is "an illustration of a baby daikon radish in a tutu walking a dog", and you see exactly that; the outputs are just adorable, and they are generated by the AI. The same goes for "an armchair in the shape of an avocado" and "a storefront that has the word 'openai' written on it". I've tried reverse-image-searching some of these images and could not find them on the internet, so it's definitely not a model just outputting an image it found somewhere; these are actually generated. And the astounding thing is that it's the same model producing all of these different images. It's not one model trained on illustrations and another trained on chairs; it's a single model that takes in a piece of text, and optionally part of an image (or no image at all), and outputs an image, either continuing the partial image you gave it or generating one entirely by itself.

The model is called DALL·E, and for now there is only a blog post; OpenAI says they'll follow up with a paper, and if the paper brings substantially new things I think I'll make a video on it. Today we're just going to look at what the model can do, how it probably works, and take some guesses at what we'll read in the paper once it's out. In fact OpenAI has brought out two new models: along with DALL·E they've also released a blog post and a paper about a model called CLIP, which is more of a classifier (not exactly a classifier; it connects text and images in a different way and is not a generative model), and we're going to look at that in a different video. The clear trend here is that OpenAI is looking into connecting text and images.

They say DALL·E, which I think is an homage to Salvador Dalí mixed with the character WALL·E, is a 12-billion-parameter version of GPT-3. So it's not quite GPT-3, which was more than ten times larger, but a 12-billion-parameter version of it, trained to generate images from text descriptions using a dataset of text-image pairs. They found it has a diverse set of capabilities, including creating anthropomorphized versions of animals and objects, combining unrelated concepts in plausible ways, rendering text, and applying transformations to existing images.
There is a lot they don't tell us here, especially about the dataset: how did they get it? Nobody knows; they simply say it's a dataset of text-image pairs, and they allude (especially in the CLIP post) to the fact that you can find data connecting text and images on the internet. That's true: if you scrape the right websites in a smart fashion, you can find a lot of images paired with text describing them. So we have to assume they scraped the internet for something like this; I don't think they have much explicitly human-labeled data for this. We'll just assume they have a huge dataset, and of course they train a huge model on it, a 12-billion-parameter version of GPT-3, the famous text-generation model by OpenAI.

You can see some of the same characteristics here. My hypothesis about GPT-3 was that it smartly mixes its training data: rather than simply memorizing it, it remembers it and then interpolates between examples in a smart way. I think you can see the same kind of thing here, in that these are all pictures you could imagine in the real world, just with, say, the text changed to "openai", or a chair mixed with an avocado in a plausible way; there are surely chairs that look roughly like this. I'm not saying this to denigrate the model; the fact that it can do this is seriously cool.

They say that, like GPT-3, DALL·E is a transformer language model, and this is very interesting: it receives both the text and the image as a single stream of data containing up to 1280 tokens, and it's trained using maximum likelihood to generate all of the tokens one after another. This training procedure allows DALL·E not only to generate images from scratch, but also to regenerate any rectangular region of an existing image that extends to the bottom-right corner, in a way that is consistent with the text prompt. They say a little bit more on the right and further down, so I'm going to try to explain how this model works, with the full knowledge that I might be wrong once the paper comes out.

For that we have to go back a little and look at the models DALL·E draws from, namely the vector-quantized VAE (VQ-VAE) literature. I'll consider VQ-VAE to be one of the necessary ingredients: if we combine a VQ-VAE with something like GPT-3, we get DALL·E. That's my hypothesis for today. Why combine these two models? GPT-3 is extremely good at modeling language. If I have a piece of text, say "a cat sat on the mat", a transformer is very good at understanding that sentence and completing it; if I cross out the end and ask the transformer to continue, it will do so just fine, provided it is trained well. That's exactly how GPT-3 works.
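Written out, the training objective they describe (maximum likelihood over the single stream of up to 1280 text-then-image tokens) is just the standard autoregressive factorization that GPT-3 itself uses; this is a restatement of the blog post's one sentence, not additional information from the paper:

\[
p(x_1, \dots, x_n) \;=\; \prod_{i=1}^{n} p(x_i \mid x_1, \dots, x_{i-1}), \qquad n \le 1280,
\]

where the early \(x_i\) are text tokens, the later ones are discrete image tokens, and training minimizes the negative log of this product (the usual cross-entropy loss) over the dataset of text-image pairs.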
Now imagine that I don't have a piece of text but some sort of description of an image. Let's say I have a box, which is going to be a VQ-VAE, that can take in a description of an image, not exactly in words that humans understand, but in a kind of image language, sort of like a programming language, or, say, Egyptian hieroglyphs. So there is the hieroglyph for a human, the one for the sun, the one for a tree, and I input those, and the output is an image where the sun is shining (yes, I draw a sun like a child, it has a little smile, deal with it), there is a tree, maybe not exactly the tree from the hieroglyph but some tree that fits, and there is a human in the scene, maybe sitting at the tree, relaxing, chilling.

Now, the image on the right consists of pixels, and modeling pixels with a transformer is very hard, because in the case of our model here the images are something like 256 by 256 pixels. The transformer would have to generate 256 times 256, which is 2^16, values, and that is just too much for a transformer to model pixel by pixel. There are ways around this, for example modeling little regions, but they're not really satisfactory. So this model doesn't try to model the picture as such; it tries to predict the hieroglyphs, a language that the box can understand and produce a picture from. Its task is: given some text prefix, say "a human on a sunny day chilling under a tree", output the corresponding sequence of hieroglyphs. And that is something a transformer can do, provided you have a vocabulary, a fixed list of hieroglyphs to choose from: the human is in there (that's a worse Egyptian), the pyramid is in there too, some you need, some you don't. If there is such a vocabulary, the transformer is going to be pretty good at generating the sequence.

So you need two parts. Part one is a transformer, a language model, a GPT-3-style thing that takes in a sequence of text and outputs a sequence of tokens, just in a different vocabulary, namely this picture vocabulary. Part two is a box that takes in that picture vocabulary and actually produces an image. As I said, the first part is taken over by the custom GPT-3-like model they built for this, and the second part is taken over by something like a VQ-VAE, specifically the generator part of it.
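To put the sequence-length argument in numbers (the 32-by-32 grid of discrete codes and the 1280-token limit are stated in the blog post; treating the leftover positions as the text budget is my own inference):

\[
256 \times 256 = 65{,}536 \ \text{pixels}
\qquad\text{versus}\qquad
32 \times 32 = 1{,}024 \ \text{image tokens},
\]

so the image side of the stream is 64 times shorter than raw pixels, and \(1280 - 1024 = 256\) positions remain for the text tokens.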
So what is a VQ-VAE? The box we're going to need is this one right here, from here up to where the image is, and this thing right here is going to be that vocabulary. A VQ-VAE takes the image on the left (here is the encoder) and encodes it into a latent space. What a VAE, or a plain autoencoder, would do is encode the image into a latent space, then decode it again and try to reproduce the same image; you then assume that whatever sits in the middle is a sensible latent representation of that image. If you can train such a model, you get a representation in the middle that describes the image, otherwise you couldn't reproduce it, and many models have been built on this concept. Now, it turns out the classic autoencoder doesn't work too well, but this model works quite formidably.

What you have is this vocabulary, also called a codebook. The codebook is the vocabulary, and the rule is that the encoder can't output just any latent encoding. The encoder produces a continuous vector, but it has to be one of a fixed set of vectors at your disposal, dear encoder; you can only choose from those. You can't pick any point in latent space: there's this one, this one, this one, and you have to choose one of them. If you output something in between, which you inevitably will because neural networks output continuous values, we simply clamp you: we find the nearest vector in the codebook and act as if you had output that one. So the encoder can only hit one of the codebook vectors, you feed those codebook vectors to the decoder, and the decoder decodes from them. That turns out to be much, much better than doing the autoencoder thing continuously.

Imagine this codebook vocabulary as a vocabulary of image descriptions. What you do with an image (take this dog image; I'm going to have to draw this myself, and I can't draw dogs, I'm very good at cats though, so this is a cat) is not to encode the whole thing into one of these words. Instead you split the image up into a grid. It's not as fine as pixels, it's fairly coarse: in their experiments they use something like a 32-by-32 grid, which is also what DALL·E uses, so every image is described by 1024 tokens, that is 32 by 32. You then train an encoder such that, when the grid goes through it, this cell corresponds to one codebook vector and that cell corresponds to another one. You have your big vocabulary (the red vector, the blue vector, the green vector) and you describe the image regions with these codebook vectors.

Now, you have a lot of these vectors, in fact 8192 of them in DALL·E, and the image only consists of 1024 tokens, so you don't necessarily have to reuse the same token over and over the way you would in text. One of these tokens could, for example, be "sky": maybe this vector roughly describes sky, so this cell, this cell, and this cell should be approximately sky; maybe the red one is "animal", the blue one is "vegetation", the green one something else. You can see that if you feed this grid to a model that has to make a picture from it, the model can just look at it, and it's sort of like a low-resolution description of the image.
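Here is a minimal sketch of that nearest-neighbor quantization step, including the straight-through gradient trick mentioned a bit further down. This is generic VQ-VAE code, not OpenAI's implementation; the embedding size and the helper name `quantize` are assumptions, while the 32-by-32 grid and the 8192-entry codebook are the numbers from the blog post.

```python
import torch

def quantize(z_e, codebook):
    """Snap each encoder output vector to its nearest codebook entry.

    z_e:      (batch, 32, 32, d)  continuous encoder outputs, one per grid cell
    codebook: (K, d)              K discrete "visual words" (8192 in DALL-E's case)
    returns:  indices (batch, 32, 32) and quantized vectors z_q with the shape of z_e
    """
    b, h, w, d = z_e.shape
    flat = z_e.reshape(-1, d)                                  # (b*h*w, d)
    # squared L2 distance from every grid vector to every codebook vector
    dist = (flat.pow(2).sum(1, keepdim=True)
            - 2 * flat @ codebook.t()
            + codebook.pow(2).sum(1))                          # (b*h*w, K)
    idx = dist.argmin(dim=1)                                   # nearest codebook entry per cell
    z_q = codebook[idx].reshape(b, h, w, d)
    # straight-through estimator: the forward pass uses z_q, the backward pass
    # copies gradients to z_e, because argmin itself is not differentiable
    z_q = z_e + (z_q - z_e).detach()
    return idx.reshape(b, h, w), z_q

# toy usage: two "images" already encoded to a 32x32 grid of 256-dimensional vectors
codebook = torch.randn(8192, 256)
z_e = torch.randn(2, 32, 32, 256)
indices, z_q = quantize(z_e, codebook)   # indices are the 1024 image tokens per image
```

The `detach()` line is the straight-through estimator: the decoder sees the snapped codebook vector, but gradients flow back to the encoder as if no snapping had happened.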
It's not exactly a downsampled image, though; it's a description, because these codebook vectors themselves contain a lot of information. It's just that you can't choose any vector in latent space, you have to choose one of the vectors in the codebook. That's a vector-quantized VAE. They train everything at the same time: the encoder and decoder are trained with this straight-through estimator, because the nearest-neighbor computation isn't exactly differentiable, and they also train the codebook to match the outputs of the encoder. You can train that, or you can just take an exponential moving average of the encoder outputs.

That's the VQ-VAE, which is developed further in VQ-VAE-2 (I've linked the papers). Version two does the same thing, but multi-scale: in the encoder you take the image at multiple resolutions, a high-resolution level and a lower-resolution level, and you use vector quantization to encode each level into a grid of codebook vectors. Again, maybe this square is the red one, this square is the green one, and so on; each square has to choose one of those 8000 or so vectors to represent itself. Then you do a hierarchical thing: the decoder at this level produces a slightly higher-resolution image, you quantize again, and a decoder at the next level produces an even higher-resolution image. If you want good high-resolution images you sort of need such hierarchical models: the top decoder outputs something quite blocky, and every additional level adds detail. It's pretty impressive, and you can see the training right here. These are papers from last year and the years before, so all of this has been known.

What DALL·E does, from what I can gather from the blog post, is this: the images are preprocessed to 256 by 256 resolution during training, and, similar to the VQ-VAE, each image is compressed to a 32-by-32 grid of discrete latent codes using a discrete VAE that they pre-trained using a continuous relaxation. There's a lot in that sentence. The VAE is pre-trained, and they also say further down that their model uses maximum likelihood to generate all of the tokens one after another, that it's decoder-only, and so on. So probably this whole pipeline here is pre-trained: they pre-train a discrete VAE, and then the DALL·E model simply has to learn how to produce the tokens, how to produce these hieroglyphs, while the box stays fixed. It's possible that they also train the decoder, but I can't tell from the blog post; what's certain is that they don't train the encoder.

So here is what you would do in a single step of DALL·E. You have your text, blah blah blah, and you have a partial image. You input the text and the partial image to DALL·E, where the partial image is any image with the bottom right blacked out. They do the bottom right simply because it's the analogue of going left to right in text: you go top-left to bottom-right. You could always flip an image, maybe not actually, but it's just a bias you have to provide the model with in order to do autoregressive training.
Right, so here is the image of that cat, and you black out the bottom right; you can black out the whole image if you want the model to produce images unconditionally. So you block all of this out. Now, the text tokens are already words: you tokenize them, token, token, token, and you look them up in your text vocabulary. Maybe this is word 34, so you get 34, 34, 34. Then you take your image, rasterize it according to your grid definition, and run it through the encoder you trained, through the box, and the box tells you, for each grid cell, "in my vocabulary of image pieces, this one is number 2, this one is number 4, this is 2 again, this is 35", and so on, left to right, top to bottom. And you put that right here: the text is followed by an image of 2, 4, 2, 35, and so on. What you ask the model to do is simply this: from all of that (and the model knows which part is text and which part is image), predict the next token, this one right here. That's how you train the model. Once it gets that one right, you ask it to predict the next one, and so on, and in this way you can let it generate an entire image at inference time. They say all of these tokens are generated autoregressively.
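Here is a minimal sketch of that training step under the assumptions above: a frozen, pre-trained discrete-VAE encoder and a decoder-only transformer trained with next-token cross-entropy. The component names, the text vocabulary size, and the way the two vocabularies are merged into one index range are illustrative guesses, not OpenAI's actual code.

```python
import torch
import torch.nn.functional as F

TEXT_VOCAB = 16384     # assumed size of the text (BPE) vocabulary
IMAGE_VOCAB = 8192     # the 8192 discrete image codes from the blog post

def training_step(transformer, dvae_encoder, text_tokens, images, optimizer):
    """One step of next-token training on a text-then-image token stream.

    text_tokens: (batch, n_text) integer BPE ids
    images:      (batch, 3, 256, 256) pixel tensors
    """
    with torch.no_grad():
        # frozen, pre-trained discrete VAE encoder: image -> 32x32 grid of code indices
        image_tokens = dvae_encoder(images).flatten(1)                    # (batch, 1024)
    # shift image codes into their own index range so the vocabularies don't collide
    stream = torch.cat([text_tokens, image_tokens + TEXT_VOCAB], dim=1)   # (batch, <=1280)
    # the model sees tokens 0..i-1 and predicts token i
    logits = transformer(stream[:, :-1])        # (batch, len-1, TEXT_VOCAB + IMAGE_VOCAB)
    loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           stream[:, 1:].reshape(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

At inference time you would feed only the text tokens (plus any known part of the image) and sample the remaining image tokens one at a time until all 1024 are produced.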
Now, in my understanding, this is all the model does, because once you have the tokens, say the model says the next one is number 7, you go back to your box. Well, a different box: the first one was the encoder of the VQ-VAE; now you go to the decoder, which you have also pre-trained, and you ask it: I have 2, 4, 2, 35 and 7, please generate an image from that. Or rather, you probably wait until you have the complete image, give the complete set of these hieroglyphs to your decoder, and the decoder produces an image. The decoder says, well, okay: this cat here, the ears it probably reproduces fairly well, because you can describe them quite exactly, maybe it even just copies that part over; but then it says, well, it's a cat, so if the model has done a good job there should be some sort of a cat here, and maybe these hieroglyphs even describe how the cat looks: it looks straight ahead, it has whiskers, it has eyes, and so on. So my guess is that the part on top is trained, and the part on the bottom is pre-trained, with the option that the decoder part could also be trained at the same time as they train the language model on top.

They make some further remarks here: each image is compressed to latent codes using a discrete VAE that they pre-trained using a continuous relaxation, and they found that training with the relaxation obviates the need for an explicit codebook, EMA loss, or tricks like dead-code revival, and can scale up to large vocabulary sizes. This is the part where I'm a bit confused. Clearly they have a vocabulary in the visual domain; there are 8192 (if I know my powers of two) different words in the codebook, so there must be a codebook. But they say the relaxation obviates the need for an explicit codebook, and I don't really know what to make of that. I can tell you what a continuous relaxation might look like; this is from a different paper that they link, on Concrete random variables. If you have a discrete random variable, you need to take an argmax: you have some logits, and the argmax collapses them onto a single value. That's the same kind of operation as in the VQ-VAE, where we assign each encoder output to the nearest codebook vector: you can have only one of the codebook vectors, and that's it. When you relax this, you say: instead of a hard assignment, take mostly that codebook vector, but also a little bit of all the others, a soft assignment rather than a hard assignment to the nearest neighbor. It's sort of like the difference between k-nearest-neighbor and a Gaussian mixture model, as I understand it; not exactly what they do here, but analogous. And with that, they don't need an explicit codebook. I don't know exactly what that means; what I can imagine is that they don't actually train the codebook vectors, or maybe they quantize according to some fixed scheme, or I simply don't understand what they do. Here is an illustration of these discrete random variables: the idea is that as you drop the temperature, sampling approaches the fixed, discrete sampling more and more, where you can be either here, or here, or here, with the probability masses indicated by the size of the circles.
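Here is a minimal sketch of that temperature-controlled relaxation, the standard Gumbel-softmax / Concrete construction from the papers they cite. Whether DALL·E's discrete VAE uses exactly this form is my assumption; the blog post only says "a continuous relaxation".

```python
import torch
import torch.nn.functional as F

def gumbel_softmax_sample(logits, temperature):
    """Concrete / Gumbel-softmax relaxation of a categorical variable.

    logits: (..., K) unnormalized scores over K categories (e.g. K = 8192 image codes).
    Returns a soft one-hot vector: a broad mixture at high temperature,
    close to a hard one-hot (an argmax) as the temperature approaches zero.
    """
    gumbel_noise = -torch.log(-torch.log(torch.rand_like(logits) + 1e-20) + 1e-20)
    return F.softmax((logits + gumbel_noise) / temperature, dim=-1)

# toy example: one grid cell's logits over a tiny codebook
logits = torch.tensor([2.0, 0.5, -1.0, 0.0])
codebook = torch.randn(4, 8)                            # 4 code vectors of dimension 8

hot = gumbel_softmax_sample(logits, temperature=0.1)    # nearly one-hot
soft = gumbel_softmax_sample(logits, temperature=5.0)   # a broad mixture
# the "quantized" vector is a weighted mix of codebook entries, so gradients
# flow to the logits (and the codebook) without a straight-through trick
z_soft = soft @ codebook
```

As the temperature is annealed toward zero, the soft mixture approaches a hard one-hot choice of a single code, which is presumably the sense in which the relaxation replaces the explicit nearest-neighbor codebook assignment.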
But as you increase the temperature, you go more toward a mixture: you can still be at a corner, but you can also be somewhere in this region, or this one, or this one. As the temperature increases, the distribution becomes more of a mixture distribution, and a mixture distribution with any temperature other than zero suddenly has a defined gradient, whereas these discrete random variables do not. That's exactly why the VQ-VAE needs the straight-through estimator: the hard assignment to the codebook has no defined gradient, while the soft relaxation does. So maybe they just mean they don't need the hard assignment to the codebook, I'm not sure; or maybe they quantize in a different way, maybe they go back to a continuous latent space but still do some form of quantization. That could be a fixed quantization, where you can choose any of the basis vectors and certain mixtures defined between them, or something defined via moving averages or batch statistics. I don't know; if you do, let me know in the comments.

All right, so that was my take on what the model does and what is probably behind it. Now let's look at some more examples, because these are fun. They say it can control attributes to some extent. For example, "a pentagonal green clock": it's not always pentagonal, sometimes it's hexagonal or heptagonal, but what it does do well is color and the general object description; "lunch box" it gets, "green" it gets. What it can't do well is things like counting. I have several hypotheses about this, but first, watch how the text prompt is phrased in all of these examples: "a pentagonal green lunch box. a green lunch box in the shape of a pentagon". That's quite an unusual way to phrase a prompt. And by the way, most of the criticisms I'm leveling here are actually admitted and discussed in the blog post itself, which is pretty cool and, let's say, self-critical of them; I thought of these things, then read the small print, and they had already described what I concluded. The current climate is to make your research look as cool and flawless as possible, and this goes a bit against that.

They say the images here aren't cherry-picked, and I totally believe it, because they have a little trick: they sample, I think, 512 images from the model and then re-rank them using the other model they've released, CLIP. CLIP is a pretty good re-ranker: you give it a piece of text and an image, and it tells you how well they fit together. So the outputs you see here are strictly the best outputs according to that model. They're not cherry-picked by humans, but they are cherry-picked by a very good model. The second thing is that the text prompt itself is absolutely cherry-picked, and from the way these prompts are phrased you can already tell that the whole thing is very brittle.
I can't test the model, but it's probably very brittle in exactly how you phrase the text prompt, and I'm going to guess they tried a lot of things before releasing the few examples shown here, and made sure they work. So keep in mind that this is brittle. We already know this from GPT-3: inputs that look the same to a human, just phrased slightly differently, can make the model output completely different things, and a lot of the GPT-3 examples were very carefully constructed in terms of the input prompt.

The other thing is that, as I said, the model can do colors and textures pretty well. We've already seen the things made of things: the sphere made of noodles (those probably actually exist), the sphere made of guacamole. However, it's not very good at counting, and I have a few hypotheses about why. These image models tend to be very good at style and texture; style and texture are the home turf of anything with a convolution in it. By the way, in the transformer over image tokens (not in the VQ-VAE) they don't do full attention. Every image token can attend to every text token, but among the image tokens the attention is restricted layer by layer: in one layer a token can attend to its row of other image tokens, in another layer to its column, and in yet another layer to its local surroundings, a couple of neighbors, like a convolution. So it's not full attention among image tokens, but in every layer every image token can attend to all the text tokens.
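Here is a rough sketch of what such per-layer attention masks could look like. The blog post doesn't spell out the exact sparse pattern, so the text/image split, the window size, and which layers use which mask are illustrative assumptions; only the row / column / local-neighborhood idea comes from the description above.

```python
import torch

def build_mask(kind, n_text=256, grid=32, window=3):
    """Boolean attention mask (True = attention allowed) for n_text text tokens
    followed by a grid x grid block of image tokens. Slow but explicit; for illustration only."""
    n_img = grid * grid
    n = n_text + n_img
    mask = torch.zeros(n, n, dtype=torch.bool)
    mask[:, :n_text] = True                       # every token may attend to all text tokens
    for q in range(n_img):                        # query image token at row r, column c
        r, c = divmod(q, grid)
        for k in range(n_img):                    # key image token at row kr, column kc
            kr, kc = divmod(k, grid)
            if kind == "row":
                ok = kr == r
            elif kind == "column":
                ok = kc == c
            else:                                 # "local": a small conv-like neighborhood
                ok = abs(kr - r) <= window and abs(kc - c) <= window
            mask[n_text + q, n_text + k] = ok
    # a causal (lower-triangular) mask is applied on top for autoregressive decoding
    return mask & torch.tril(torch.ones(n, n)).bool()

row_mask = build_mask("row")       # used in some layers
col_mask = build_mask("column")    # used in other layers
local_mask = build_mask("local")   # conv-like neighborhood layers
```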
So in these models what you typically see is that texture and style come out well, but global correspondences are weaker. You see that a lot in face-generation models, where the left and the right earring don't match, and things like that. By that logic you'd actually expect objects not to come out well either, since an object is a kind of global structure, yet here this is still a clock, this is still a light bulb, this is still a stop sign; it gets the object right, which by my hypothesis it shouldn't. I think that's just a matter of how the datasets are collected: we humans take pictures of objects, so the object is the fundamental structure in these datasets, and it makes sense that the model learns that. We usually don't describe the counts in our captions, so I can see why the model has a harder time learning counting and instead focuses on the object as the one global thing. The count would also be a global thing, but it's just not prominent in the data, whereas the rest, the color and the texture, is local. The cube made of porcupine: you can see that counting to two often works well (here it mixes up "glasses" and glasses), but past two it often goes wrong; ask for five clocks and you'll get anything from three to seven. I'd also guess this is brittle: here, yes, the objects are sitting on a table, but take an object that isn't often on a table, like a playing-card club, and it becomes pretty unrecognizable whether or not it's on a table ("five clubs", "four clubs"). The model is prone to ignoring part of its input if the likelihood of another part is larger. It also can't do things like "a stack of three cubes: a red cube is on the top, sitting on a green cube". It gets cubes on top of each other, but it often gets the order wrong. As I said, anything global that isn't the object itself tends to be weak, anything local tends to be strong, and that's just a consequence of how these models are built and what the data looks like.

They also say the model can render new views, and here I'm not as convinced. There's "an extreme close-up view of a capy..." sorry, of a fox. Sometimes the outputs are extreme close-ups, and things like the forest it gets pretty well, but then there's "a ground-level view" and "an aerial view", and maybe some of the outputs are aerial views and some aren't. What is pretty cool is something like "a fisheye lens view", and a bottom view or a rear view; the rear view works better. So it does understand these kinds of things, what the rear of a fox is and what the front of a fox is, though, as you can see, not always. Texture it's very good at: something made of voxels it can do perfectly; an owl made of voxels looks like it comes straight out of Minecraft, absolutely cool. Even X-ray views work, although it doesn't always get the bones right. As I said: style and structure.

Here is an example of a completion. They give the text prompt "a photograph of a bust of Homer" plus the top part of the image, and they say that when describing a well-known figure, it can complete the figure. I don't agree that it completes Homer; it probably just sees this bust and completes whatever fits. I haven't studied Homer as a historical person, or busts of him, but I disagree that these largely depict the same person. Very often there's even completely unrelated stuff in there; the Girl with a Pearl Earring by Vermeer is in there somewhere. And what I also like: you know the game Draw Something, or Pictionary, where people who can't draw something just write it on the picture? "Screw it, this is Homer. No, I don't care what you say, this is Homer." The model does that too. When you say Cleopatra, it goes more in the female direction; Medusa, well, I'm pretty sure Medusa has the snake hair, so maybe that's more Venus. Somewhat. They test a lot of things, like whether it can do mirror reflections, and they say it can do reflections on the ground pretty well but not reflections in a mirror, because for a mirror shot the object would actually have to be in front of the mirror, and in only very few training pictures is the mirrored object also shown in front of the mirror; that kind of global correspondence just isn't there as much. There is, however, a fair bit of reflection on the ground, so to speak.
That's pretty cool, but ground reflections are probably also very common in the datasets. A cross-section view of a walnut: they explore what the model can do, and you can see that if something is common in the dataset, like a cross-section of a human head, of which there are many pictures, it works well. When it comes to a cross-section view of, say, an airplane, it's less so; it probably doesn't really know what that looks like, because across the whole internet, pictures of cross-sections of airplanes just aren't that common. So it focuses on "airplane", and from "cross-section" it knows it should somehow show some of the interior, so it produces something that roughly matches. As I said, if it can't make the likelihood of all parts of the prompt high, it tends to focus on one part and make that part's likelihood high, which is reasonable behavior for a model. Macro photographs of things are pretty cool; this is what you'd find in some image gallery.

Then it can do things like style transfer, and here is where it shines. You can have paintings of different objects in different styles, say an owl sitting in the forest in the morning, as a painting in the pop-art style, and so on; it's very, very impressive, and the same subject as a postage stamp is absolutely amazing. You can have stained-glass windows; this is where the model shines. And even "a storefront that has the word 'openai' written on it": just look at how convoluted this text prompt has to be to get it to work. It's impressive, but the prompt has to be repeated and reformulated a bunch of times. My personal favorite is the "PyTorch" chips: they're crunchy, and you get a piece of backprop in every package. You can see that it sometimes misses (this one reads "perch chips"), but it's pretty cool that it can basically do OCR in reverse: you give it a piece of text, and it makes a picture with that text on it. Very impressive, even though, as we said, the global correspondences aren't always there.

They also explore fashion, like a skirt, here the yellow skirt on these mannequins, and here "a loft bedroom with a white bed next to a nightstand; there is a fish tank standing beside the bed": they give the beginning of the image, and this is what the model comes up with. You can imagine there are a lot of pictures like this in the dataset, so the model is probably pretty good at this kind of thing, though I did find the king bed next to the nightstand with the telescope beside the bed amusing: the telescope is sometimes on the bed, sometimes next to it, there are some weird telescopes around, this one is a lot of telescopes, and that one is a weird telescope. But the quality is impressive, and this is absolutely nitpicking on my part.
Combining unrelated concepts: we've already seen the armchair in the shape of an avocado; they also have a snail made of a harp, though my personal favorite is the penguin made of garlic. The penguin made of garlic, this is perfect, absolutely adorable. And just qualitatively, this is something you'd pay a highly skilled Photoshop artist quite a bit of money for, and these models shine at exactly this kind of style-transfer, texture work. Then there are the illustrations: you can have any kind of illustration, like a baby shark with a mustache holding an umbrella, playing, running, riding a unicycle. It's just nice, and as I said, it's the same model that does all of this. These are samples, not cherry-picked, although remember that they are re-ranked.

It can do hybrids of images, hybrids of a giraffe and a turtle and so on. And they probe the model a little further: as I said, they give this cat on the top and ask for the exact same cat on the top as a photo colored blue on the bottom. You can see it doesn't always work, but it works a surprising amount of the time; sometimes it's just a blue pot. It's not a finished model yet, but it's a step in a direction that shows this is definitely possible. It can even do some of these progressive matrices where it fills in the bottom right, though they mention it's very finicky: if you, for example, invert the colors, the output for the bottom right changes and is often wrong. Sometimes, though, it's actually right, which is crazy, because some of these require the kind of inference we usually reserve for IQ tests; the debate about what counts as intelligence goes on.

They say it has geographic knowledge, but I'm not sure I'd call it that; it just associates words with particular images, as in "a photo of food of China". I'm not sure that qualifies as geographic knowledge. Same with the temporal knowledge: "a photo of a phone from the 20s", and then the different time periods, 60s, 70s, 80s, future, distant future; wow, those phones. Usually this stuff comes out pretty okay, but it's not temporal knowledge; it just associates a bunch of tokens with a certain style. Today's computer, the future computer, the distant-future computer: please, no, please don't give me that, I don't want that. I do love the action-movie poster, because the style is correct, but it literally just writes "action movie in the future" on it, like the joke: "I'm hungry." "Hi Hungry, I'm Dad."

They also have a summary, and they show what it means that they use CLIP to re-rank. On the left you see just eight samples straight from the model, and they're not too bad, but you increase the quality by sampling more and then taking the best eight according to the re-ranker, which is what you see as you go to the right.
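Here is a rough sketch of that sample-then-re-rank procedure using the released CLIP model. The DALL·E sampler itself isn't public, so `generate_candidates` is a hypothetical stand-in; the CLIP calls follow the openai/CLIP package, and the 512 and 8 are the numbers mentioned in the video.

```python
# pip install git+https://github.com/openai/CLIP
import torch
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def rerank(prompt, pil_images, top_k=8):
    """Return the top_k images (e.g. out of 512 samples) that CLIP scores highest for the prompt."""
    text = clip.tokenize([prompt]).to(device)
    images = torch.stack([preprocess(im) for im in pil_images]).to(device)
    with torch.no_grad():
        image_features = model.encode_image(images)
        text_features = model.encode_text(text)
        image_features /= image_features.norm(dim=-1, keepdim=True)
        text_features /= text_features.norm(dim=-1, keepdim=True)
        scores = (image_features @ text_features.T).squeeze(-1)   # cosine similarity per image
    best = scores.topk(top_k).indices.tolist()
    return [pil_images[i] for i in best]

# usage with a hypothetical generator:
# candidates = generate_candidates("an armchair in the shape of an avocado", n=512)
# top8 = rerank("an armchair in the shape of an avocado", candidates)
```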
So I'm going to guess they settled on 512 samples because that already gives you pretty diverse, pretty high-quality outputs. All right. Lastly, a shout-out to the authors: the primary authors are Aditya Ramesh, Mikhail Pavlov, Gabriel Goh and Scott Gray, with, I guess, secondary supporting authors and most of OpenAI behind them, though I don't know how they divide the work. I'd encourage you to go look at the model; it's pretty cool, try out all these inputs. As I said, the inputs are restricted simply because they don't trust you with the model yet; with the real model you could input any piece of text you want and get out an image. The fact that you have to select from these options is simply because that's the stuff they tried, the stuff their PR department has signed off on. And that's the thing: releasing a generative model is at the same time a PR dilemma, because, as they discuss a little bit in the blog post, it could produce very problematic images. With a classifier that's not as pronounced; it can also be dangerous, but not as dangerous as a generative model. That's the first thing, and the second thing is that there is definitely money to be made with this, so we'll see whether or not we get the full model. All right, with that, that was it from me. I hope you enjoyed the blog post, I hope you enjoyed the video; if you did, let me know, share it out, subscribe if you haven't, and bye bye.
Info
Channel: Yannic Kilcher
Views: 61,534
Rating: 4.9557033 out of 5
Keywords: deep learning, machine learning, arxiv, explained, neural networks, ai, artificial intelligence, paper, gpt, gpt-3, visual transformer, transformer, transformers, attention mechanism, vqvae, vq vae, vq-vae, codebook, relaxation, gumbel, text, images, nlp, natural language processing, autoregressive, grid, encoder, decoder, gpt3, avocado chair, porcupine sphere, animations, fisheye, text to image, image captioning, openai, sutskever, dali, dalle, walle, vector quantized, hierarchical, gan, generative, likelihood
Id: j4xgkjWlfL4
Length: 55min 45sec (3345 seconds)
Published: Wed Jan 06 2021