AI that makes thumbnails (or any image)

Captions
These people do not exist. They aren't real, because they were generated by a computer. If you haven't noticed, AI has been getting really good at creating photorealistic images. You may even have heard about newer programs like DALL·E 2 or Imagen that can create incredible images from any prompt written in natural human language. In this video I want to explore how machines do this, and use the latest methods from deep learning to generate my own YouTube thumbnails that do not exist.

So let's start with a statement of the problem: we want a program that can generate realistic images. But what does "realistic" mean, exactly? Well, if we're going to solve a problem with deep learning, the first question we have to answer is: what is our data? We need some dataset of images. It doesn't matter what kind; it could be people or mountains or cats or dogs, or all of the above. A realistic image simply looks like it could belong in that dataset, and we want a program to generate these images for us. This is a perfect problem for deep learning: I have no idea how to write an algorithm to do this, but I have plenty of data to describe it.

So if we're going to generate YouTube thumbnails, we need a collection of thumbnails, and a lot of them. Thankfully for my purposes there are plenty of public online resources that I used to create my own thumbnail dataset. We're talking cream-of-the-crop stuff here, best of the best. In all I ended up with around 200,000 thumbnails from generally popular videos and channels. Make no mistake, this dataset is going to be hard to learn. Thumbnails are complicated and contain a wide variety of subjects and styles: different languages, abstract objects, animations, video games, and worst of all, human faces. Expressive, extreme faces in weird environments with strange editing. We are very sensitive to mistakes in faces, and our model could end up generating some very uncanny things. Spoiler alert: that is exactly what will happen. But if you've seen my previous video on neural networks,
you will know that the more data you have, the better performance you get. This is part of what makes data so valuable: it is the oil of the 21st century, and the fuel of deep learning.

So we've got our data; let's talk about algorithms. Ultimately we want a program, or model, that pops out a realistic image. But not just one: we want it to produce a diversity of realistic images. This diversity is introduced with randomness. Rather than producing the same image every time, the model can be fed a bunch of random numbers and generate a different image based on that input. Essentially all deep learning methods rely more or less on this paradigm: random noise in, realistic image out. More technically, this random input is known as a latent variable, as opposed to an observed variable like the output image. We don't care what the latent input is, but the model does, and should map different inputs to different images.

There are a lot of different methods for doing this, but a classic one is known as a GAN: a generative adversarial network. At a high level, two neural networks are pitted against one another. One network, the generator, is tasked with generating images, and the other, the discriminator, decides whether a given image is real or generated. The generator is fed random noise and produces an image, while the discriminator is fed an image and outputs a single value between 0 and 1 representing its best guess as to whether the given image is real or fake. We give it both real images from our dataset and fake images made by our generator. The generator is trained to fool the discriminator by creating ever more convincing imitations of your dataset, and the discriminator is trained to become ever more perceptive, learning what real images look like and detecting the mistakes of the generator. They are playing what is called a minimax game: the discriminator is trying to maximize the probability that it has guessed correctly, while the generator is trying to minimize that same value. It is adversarial.
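The adversarial loop just described can be sketched in a few lines of PyTorch. This is a minimal illustration with tiny fully connected networks and made-up sizes, not the implementation used in the video:

```python
# Minimal GAN training step: generator vs. discriminator playing the
# minimax game. Toy MLPs stand in for real convolutional networks.
import torch
import torch.nn as nn

latent_dim, img_dim = 16, 8 * 8  # hypothetical sizes for this sketch

generator = nn.Sequential(
    nn.Linear(latent_dim, 64), nn.ReLU(),
    nn.Linear(64, img_dim), nn.Tanh())         # random noise -> fake image
discriminator = nn.Sequential(
    nn.Linear(img_dim, 64), nn.LeakyReLU(0.2),
    nn.Linear(64, 1), nn.Sigmoid())            # image -> P(image is real)

opt_g = torch.optim.Adam(generator.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(discriminator.parameters(), lr=2e-4)
bce = nn.BCELoss()

def train_step(real_batch):
    b = real_batch.size(0)
    ones, zeros = torch.ones(b, 1), torch.zeros(b, 1)

    # Discriminator: label real images 1 and generated images 0.
    fake = generator(torch.randn(b, latent_dim)).detach()
    loss_d = bce(discriminator(real_batch), ones) + \
             bce(discriminator(fake), zeros)
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()

    # Generator: try to make the discriminator answer "real" (1).
    fake = generator(torch.randn(b, latent_dim))
    loss_g = bce(discriminator(fake), ones)
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()
    return loss_d.item(), loss_g.item()

# One step on a fake "dataset" batch of pixel values in [-1, 1].
loss_d, loss_g = train_step(torch.rand(32, img_dim) * 2 - 1)
```

In a real training run, `train_step` would be called on batches from the thumbnail dataset for many epochs, and the two losses would be logged to watch the balance between the networks.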
Once we finish training, the generator is our final model, and we can smoothly and continuously walk through the images that it makes by walking through the random values that we feed the network. This is called GAN interpolation; every frame of this video should look like a realistic image, and was made by our generator.

So I decided to train a GAN on my thumbnail dataset. Because the internet is a better programmer than I'll ever be, I borrowed an open-source GAN implementation, fixed it up, and started it off training. The generator started by outputting random images, and the discriminator quickly learned that random noise does not look like YouTube thumbnails. So the generator output more distinct colors and shapes, which the discriminator grew wise to, and the game is on. You can see what the generator is cooking up as it learns, and hopefully these splotchy patterns should eventually start looking more like YouTube thumbnails.

I love GANs. I think they are an elegant solution to a complicated problem, and they remind me a lot of how camouflage evolves in the natural world. A prey species might generate ever more convincing disguises as leaves and sticks and thorns, while a perceptive predator evolves to discriminate between true and false prey. This arms race produces intricate and complex imitations, just as it does in GANs. A GAN is like a little co-evolution simulator, and I like evolution simulators. But this analogy can also help explain some inherent problems with GANs. Like populations of predators and prey, the system is a balancing act: a cunning predator can drive a poorly camouflaged species to extinction, and vice versa. The same is true for GANs, as they depend on a stable balance between the generator and the discriminator, and this can be very difficult to achieve. To get a closer look, we can chart the performance of both networks as they learn. They're both striving to make their error measurement, or loss, go down over time. Because they're playing an adversarial game,
these values are inversely related: when one dips, the other spikes. Sometimes the loss for both explodes. That's bad, but some gradient clipping should fix it. Now just the generator's loss is exploding. Okay, that one was a bug. Now the generator just kind of sucks; in fact, it looks like it's generating the same thing over and over with little variation, and it doesn't look very good. This is called mode collapse, and it's one of the tricky fail states that GANs can get stuck in, even when the loss seems stable. So I read up on some stabilization techniques, plugged in some fancy things like image noise, label smoothing, and spectral normalization, then scaled up my networks and trained for 24 hours, and, well, there you go. This was just about the best I could do. Let's interpolate and walk through the samples. Oh my. I mean, you can see what it's going for, right? There's some text, some dim outlines of heads, some diversity of colors, but it's a bit, how do you say, Lovecraftian nightmare. I'm not a big fan. I tried this over and over and over, testing different architectures and hyperparameters and datasets, and just could not get the thing off the ground. At this point I was getting a bit discouraged. I had invested a month of work into this thing and had nothing to show. If it's not obvious, I am very much an amateur at this. After all, you can get GANs to do incredible things. But as I was struggling to get mine to work, I came across a short blog post where someone had been trying to train a GAN on their dataset and had run into a wall. Finally they gave up on the GAN and tried a newer method called diffusion.

This algorithm, more formally known as the denoising diffusion probabilistic model, is a relatively new innovation, and is the part of text-to-image models like DALL·E 2 and Imagen that actually generates the images. By itself, diffusion is another way of solving the same problem of image synthesis. It works by first taking an image from your dataset and adding a small amount of random
noise to it, over and over, for a fixed number of steps, until the image has turned into pure static. A neural network is then trained to reverse this process, detecting and then removing the noise that was added at each step, gradually reconstructing the original image over the same number of steps. In order to do this, the network must know something about what the image should look like, so that it can discern what is noise and what is not. However, the forward noising process cannot be perfectly undone, as information is literally being destroyed, and by the end there really is no way to tell what the original image was. But a well-trained model will be able to make educated guesses about what the image may have looked like, turning random noise into a realistic image, which is exactly what we need.

So again I went out and borrowed an open-source implementation, this time of diffusion, and threw it at my dataset. Almost immediately there was a noticeable difference. The early images look distinctly more painted, almost like watercolor. These are aesthetically very different from the early GAN images, and to me are very pretty. Slowly the general idea of a thumbnail began to form, and then I saw it: a face, clear as day, emerged from the noise. A little goblin man, complete with eyes and ears and a mouth. Already it's better than the GAN, and we're only a few hours into training. As the model continues to learn, it soaks up patterns from the data like a sponge. More thumbnail features began to emerge: faces complexified and grew bodies, shapes turned into letters and words, and it even began to memorize logos, like Vevo. If we let it keep running for a long time, it should just keep getting better.

Diffusion is based on the real physical process of diffusion, in which a system gradually spreads out and becomes more and more disorganized over time. In order to reverse this effect, we have to start with a messy, high-entropy state and find a path back to an organized, low-entropy state. In the diffusion algorithm, we are
starting with unstructured information and finding a path back to structured information, restoring the original distribution. We know that such a path exists, because we took that path when adding the noise in the first place. These processes are not just similar; they are in some ways exactly the same. Entropy is information. Now, there are a lot of technical details that I'm brushing over here, so I'll link some better explanations if you want to dive deeper. I found both the papers and the code for diffusion much harder to parse and much more complicated than GANs. Never trust a mathematician who tells you something is simple. But this complexity is there for a reason: it works.

After training for a few days, we have a model that can denoise pure static into images that look like thumbnails, extracting meaning from randomness. Once we've generated some samples, we can scale them up to a higher resolution with another neural network. Obviously these faces are a little funny-looking, but I hope you'll agree that, compared to the GAN, these results look a lot better. It really is no contest. You may even see some familiar faces here, and you can really feel the YouTuber culture just oozing from some of these. In fact, these wacky faces and made-up words are kind of perfect as YouTube thumbnails, since they're really good at grabbing your attention, which is the whole point. The program is imitating our culture, our languages, and our interests: everything that attracts our attention. Some of these definitely still creep me out, but a few of the dramatic faces look, I think, unironically great. Even some of the really abstract ones can look beautiful, though I can't say exactly what they are; they look like abstract art to me.

Now, you may be accusing me of cherry-picking these results, and that is exactly what I'm doing. Ultimately I have to pick the best ones for the actual thumbnail. But let's take a look at some randomly selected samples that the model generated. A lot of them are just mush, with no real subject
or form. There are a lot that are just pure black or pure white, or completely abstract shapes. While it makes a lot of faces, it only rarely generates other kinds of common subjects, like food or cars or landscapes. You can kind of tell that it's trying to make things that look like animations or video games, but they're not that impressive. And yeah, we're still getting some Lovecraftian nightmares here. When it's not making up words, it can have trouble spelling simple phrases, like the famous "Khan Academy." Some results look suspiciously good, I'd say too good. These happen when a particular pattern is copy-pasted into thousands of thumbnails, and the model learns really well to copy it pixel for pixel. I might call this overfitting: the model is too good at imitating the original dataset and is just memorizing images. We want novel images, ones that look like the dataset but are not in the dataset.

So how would I improve these results? Well, the promise of deep learning is that more data and bigger models mean better performance. I've just about maxed out model size, but I can always use more data. If we want the model to be better at generating all the funky things that thumbnails can be, we need to give it more examples; we need more fuel. A larger dataset with more diversity would also make it harder for the model to exactly memorize specific faces, words, and logos, and it would be forced to learn more general patterns rather than pixel-for-pixel copies. So if I were to improve the model, the first thing I'd do is double my dataset size, a few times over. There are probably a million other improvements I could add, but for now I'm satisfied with these results. For the moment the model is not publicly available, though I may change that in the future. But if you want to see more generated thumbnails, I'll be posting them on my Twitter and Discord.

Okay, so diffusion works really well. Now we have to answer the question: why does diffusion work so much better than GANs? Well, first off, I'd say this project has
not been a fair comparison. The GAN implementation I borrowed was much older than the diffusion model, and is missing out on some major upgrades, like attention and residual connections. Again, you can get GANs to perform very well if you put in the work. But researchers have done a much fairer comparison and found that diffusion methods really do outperform GANs on a number of metrics. There are many reasons for this, and we only really understand a few, like the fact that GANs don't have a clear loss function. With diffusion, the goal of both the program and the programmer is to make this one number, the loss, go down. With GANs, there's no reliable way to interpret the loss. You can fix this by using something fancy called a Wasserstein GAN, but that's not the only problem. Like I mentioned before, GANs are a balancing act. You have to split computational resources between the generator and the discriminator, and when doing so you have to make sure they are properly balanced in scale and architecture. Because the game is adversarial, a weakness in one will be endlessly exploited by the other. This is inherently unstable, and it makes GANs very hard to train and scale if that balance is not perfect.

Diffusion, despite all of the mathematical details, is much more straightforward. There is one network with one well-defined loss function. It's built with the very direct goal of replicating the distribution of your dataset, and it does so with a strong mathematical basis in information theory. In my experience it is more stable, more scalable, and more reliable as a result. The technologies that were previously based on GANs or other methods can be upgraded by moving to diffusion. Text-to-image models have historically been built with GANs, but the impressive DALL·E 2 and Imagen were built with diffusion. Deepfakes are also based on GANs, and I think we should expect them to get much better very soon. Now, AI is a fast-moving field, and this video could be outdated in a week. Already, another method called autoregression
may have beaten both diffusion and GANs. The arms race is ever escalating, to be sure. People are still using GANs to do interesting and valuable things, and I hope they stay competitive in the field. But maybe not; sometimes beautiful technologies become obsolete, and maybe GANs themselves will one day go extinct.

Thanks for watching, everyone. This video was a ton of fun to make, but it did cost me an enormous amount of time and energy, and indeed money, so I've decided to open a Patreon. If you want to help support my channel and other projects, it would help me out a lot, and there are some benefits for patrons, like Discord roles and bonus content that didn't make it into my videos. Rest assured, the Life Engine will remain completely free and open source, and you should only donate if you feel comfortable; just watching my videos is enough. Until next time.
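The GAN interpolation mentioned in the video, walking smoothly through images by walking through latent inputs, boils down to blending two noise vectors and feeding each intermediate vector to the generator. A minimal sketch, using an untrained stand-in network rather than any real trained model:

```python
# Sketch of GAN interpolation: linearly blend two latent vectors and
# render each blend. The "generator" here is an untrained stand-in,
# purely to make the example runnable.
import torch
import torch.nn as nn

latent_dim = 16
generator = nn.Sequential(nn.Linear(latent_dim, 64), nn.ReLU(),
                          nn.Linear(64, 3 * 8 * 8), nn.Tanh())

z_start, z_end = torch.randn(latent_dim), torch.randn(latent_dim)

frames, steps = [], 10
for i in range(steps):
    t = i / (steps - 1)                  # sweep t from 0.0 to 1.0
    z = (1 - t) * z_start + t * z_end    # blend of the two latents
    with torch.no_grad():
        frames.append(generator(z).reshape(3, 8, 8))  # one video frame

# Nearby latents map to nearby images, so the frames morph smoothly.
```

With a trained generator, stringing `frames` together as video produces the continuous morphing effect seen in the interpolation footage.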
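The diffusion process described in the video, noising an image step by step and training a network to undo it, can be sketched as follows. This is a bare-bones DDPM-style illustration with a toy network and made-up sizes, not the open-source implementation used in the video:

```python
# Bare-bones DDPM-style sketch: the forward noising process plus the
# denoising training objective (predict the noise that was added).
import torch
import torch.nn as nn

T = 1000                                       # number of noising steps
betas = torch.linspace(1e-4, 0.02, T)          # noise added per step
alpha_bar = torch.cumprod(1.0 - betas, dim=0)  # cumulative signal kept

def add_noise(x0, t):
    """Jump straight to step t: x_t = sqrt(a)*x0 + sqrt(1-a)*noise."""
    noise = torch.randn_like(x0)
    a = alpha_bar[t]
    return a.sqrt() * x0 + (1 - a).sqrt() * noise, noise

# Toy denoiser: sees the noisy (flattened) image plus the step number.
img_dim = 8 * 8
denoiser = nn.Sequential(nn.Linear(img_dim + 1, 64), nn.ReLU(),
                         nn.Linear(64, img_dim))

x0 = torch.rand(4, img_dim) * 2 - 1            # fake "dataset" batch
t = torch.randint(0, T, (1,)).item()           # random step to train on
x_t, true_noise = add_noise(x0, t)

# Training target: guess the noise that was mixed in at step t.
t_input = torch.full((4, 1), t / T)
pred_noise = denoiser(torch.cat([x_t, t_input], dim=1))
loss = nn.functional.mse_loss(pred_noise, true_noise)
loss.backward()  # one gradient step toward learning to denoise

# By the final step, almost no signal remains: the image is static.
x_T, _ = add_noise(x0, T - 1)
```

Sampling then runs the learned denoiser in reverse, starting from pure static and removing a little predicted noise at each of the T steps until an image remains.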
Info
Channel: Emergent Garden
Views: 11,664
Keywords: artificial intelligence, ai, gan, deep learning, machine learning
Id: gvNdCRe3T-g
Length: 16min 59sec (1019 seconds)
Published: Sat Jul 09 2022