This is an artificial intelligence image
generator. Given a text description of a picture, it will create, out of nothing, an image
matching that description. As you can see, it is capable of generating high quality
images of all kinds of different scenes. And it’s not just images, in recent
years generative AI models have been developed that can generate text, audio,
code, and soon, videos too. All of these models are based on the same underlying
technology, namely deep neural networks. In a few of my previous videos, I’ve explained
how and why deep neural networks work so well. But I only explained how neural nets can
solve prediction tasks. In a prediction task, the neural net is trained with a bunch of
examples of inputs and their labels, and tries to predict what the label will be for a new
input which it hasn’t seen before. For example, if you trained a neural net on images labelled
with the type of object appearing in each image, that neural net would learn to predict which
object a human would say is in an image, even for new images which it hasn’t seen before.
Under the hood, the way that prediction tasks are solved is by converting the training dataset
into a set of points in a space, and then fitting a curve through those points, so prediction
tasks are also known as curve-fitting tasks.
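To make that concrete, here is a tiny curve-fitting sketch of my own (the numbers are made up purely for illustration):

    import numpy as np

    # A handful of training examples: inputs and their labels.
    inputs = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
    labels = np.array([1.1, 1.9, 3.2, 3.8, 5.1])

    # Fit a straight line (a degree-1 curve) through the training points.
    coeffs = np.polyfit(inputs, labels, deg=1)

    # Predict the label for an input the model has never seen before.
    print(np.polyval(coeffs, 2.5))

And while prediction is certainly cool and very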
useful, it’s not generation. Right? This model is just fitting a curve to a set of points.
It can’t produce new images. So where does the creativity of these generative models come
from, if neural nets can only do curve fitting? Well, all of these generative models
are in fact just predictors. Yep, it turns out that the process of
producing novel works of art can be reduced to a curve fitting exercise. And
in this video, you’ll learn exactly how. Suppose that we have a training dataset
consisting of a bunch of images. We want to train a neural net to create new images which
are similar in style to these training images. The first thing you might try is to
simply use the images as labels to train the predictor. Here we don’t care
about the mapping from inputs to outputs, so we can just use anything we like for
the inputs, for example a completely black image. Predictors learn to map inputs to
outputs according to their training data. So, this predictor, once trained, should be able
to map the dummy all-black image to new images, like those seen in the training set, right? Err,
ok, maybe not quite. That didn't work so well: instead of producing a nice, beautiful
picture, we just got this blurry mess. This demonstrates a very important fact about
predictors. If there are multiple possible labels for the same input, the predictor
will learn to output the average of those labels. For traditional classification
tasks, this isn’t really a problem, because the average of multiple class labels
can still be a meaningful label. For example, this image could plausibly be given two
different labels, both cat and dog would be valid labels. In this case a classifier would
learn to output the average of those labels, which means you end up with a score of 0.5 cat and
0.5 dog, which is still a useful label. In fact, it's arguably a better label than either of the
original ones. On the other hand, when you average a bunch of images together you do not get a
meaningful image out, you just get a blurry mess.
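You can see this averaging effect in a few lines of code (my own toy example, not anything from a real model): under a squared-error loss, the best single output for one fixed input is the mean of all the labels paired with it, which for images is a blur.

    import numpy as np

    # Three tiny 2x2 "training images" that all share the same all-black input image.
    images = np.array([
        [[1.0, 0.0], [0.0, 0.0]],
        [[0.0, 1.0], [0.0, 0.0]],
        [[0.0, 0.0], [1.0, 0.0]],
    ])

    # The squared error is minimized by the per-pixel mean of the labels,
    # so this is what the trained predictor converges to: a washed-out average.
    blurry_prediction = images.mean(axis=0)
    print(blurry_prediction)

Let's try something a bit easier this time. How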
about, instead of generating a new image from scratch, we try to complete an image which has
a part of it missing. In fact, let’s make this really easy and suppose there is only one
missing pixel, say, the bottom-right pixel. Can we train a neural net to predict the
value of this one missing pixel? Well, as before, the neural net is going to
output the average of plausible values that the missing pixel can take. But since
it’s only one pixel that we’re predicting, the average value is still meaningful. The average
of a bunch of colors is just another color; there's no blurring effect. So,
this model works perfectly fine! And we can use the value predicted by this neural net to complete images which are
missing the bottom-right pixel.
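In code, the training pairs for that single-pixel predictor could be built something like this (a rough sketch; train_images is just a placeholder name for the dataset, assumed to hold grayscale values between 0 and 1):

    import numpy as np

    def make_training_pairs(train_images):
        # train_images: array of shape (num_images, height, width)
        inputs = train_images.copy()
        labels = inputs[:, -1, -1].copy()  # the original bottom-right pixel becomes the label...
        inputs[:, -1, -1] = 0.0            # ...and is blacked out in the input
        return inputs, labels

Great, so we can complete images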
with 1 missing pixel… What about 2? Well, we can do the same thing again, train
another neural net on images with 2 missing pixels, using the value of the second missing
pixel as the label. And then use this neural net to fill in the second missing pixel. Now
we have an image with just 1 missing pixel, and so we can use the first
neural net to fill it in. Great. And we can do this for every pixel in the image;
train a neural net to predict the color of that pixel when it and all of the subsequent
pixels are missing. Now we can “complete” an image starting from a fully black image,
and filling in one pixel at a time. Crucially, each neural net only predicts one pixel,
and so there’s no blurring effect. And there we have it, we have just generated a
plausible image, out of nothing… There’s just one small problem. If we run this model again,
it will generate exactly the same image… Not very creative, is it? But not to worry, we can fix this
by introducing a bit of random sampling. You see, all predictors actually output a probability
distribution over possible labels. Usually, we just take the label with the largest
probability as the predicted value. But if we want diversity in our outputs, we
can instead randomly sample a value from this probability distribution. This way,
each time the model is run, it will sample different values at each step, which therefore
changes the prediction for subsequent steps, and we get a completely different image each
time. Now we have an interesting image generator.
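Putting those two ideas together, the whole generation loop looks roughly like this (a sketch only; predict_distribution stands in for the trained neural nets and is assumed to return a probability distribution over 256 possible pixel values):

    import numpy as np

    rng = np.random.default_rng()

    def generate_image(predict_distribution, height, width):
        image = np.zeros((height, width))                  # start from an all-black image
        for i in range(height):
            for j in range(width):
                probs = predict_distribution(image, i, j)  # distribution over 256 pixel values
                value = rng.choice(256, p=probs)           # sample, rather than take the most likely value
                image[i, j] = value / 255.0                # fill in this one pixel, then move on
        return image

But still, at the end of the day, this model is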
made of predictors. They take as input a partially masked image, and predict the value of the next
pixel. The only difference between this and a traditional image classifier is the label we used
for training. The labels for our generator happen to be pixel colors which come from the original
image itself, and not a human labeller. This is a very important point in practice: it means
we don’t need humans to manually label images for this model, we can just scrape unlabelled
images off the internet. But from the point of view of the neural net, it doesn’t know, nor
does it care, that the label came from the original image. As far as it’s concerned this is
just a curve fitting exercise, like any other. The generative model we’ve just created is called
an auto-regressor. We have a removal process, which removes pixels one at a time, and we train
neural nets to undo this process, generating and adding back in pixels one at a time. This is
actually one of the oldest generative models, the very earliest use of auto-regression dates
back to 1927, when it was used to model the timing of sunspots. But auto-regressors
are still in use today. Most notably, ChatGPT is an auto-regressor. ChatGPT generates
text by using a transformer classifier to output a probability distribution over possible next
words, given a partial piece of text. However, auto-regressors are not used to generate
images anymore. And the reason is that, while they can generate very realistic
images, they take too long to run. In order to generate a sample
with an auto-regressor, we need to evaluate a neural net once
for every element. This is fine for generating a few thousand words to make
a piece of text, but large images can have tens of millions of pixels. How can we
get away with fewer neural net evaluations? For our auto-regressor, we removed one pixel at a
time. But we don’t have to remove only one pixel, we could, for example, remove a 4
by 4 patch of pixels at a time. And train the neural net to predict
all 16 missing pixels at once. This way, when we use our
model to generate an image, it can produce 16 pixels per evaluation,
and so generation is 16 times as fast.
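The only change to the generation loop is the step size (again just a sketch; predict_patch is assumed to sample a whole 4 by 4 block of pixel values in one neural net evaluation):

    import numpy as np

    def generate_image_in_patches(predict_patch, height, width, patch=4):
        image = np.zeros((height, width))
        for i in range(0, height, patch):
            for j in range(0, width, patch):
                # one evaluation now fills in a whole 4x4 block: 16 pixels per step
                image[i:i + patch, j:j + patch] = predict_patch(image, i, j)
        return image

But there is a limit to this. We can't generate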
too many pixels at the same time. In the extreme case, if we try to generate every pixel in
the image at once, then we’re back to the original problem: there are many possible labels
that get averaged together into a blurry mess. To be clear, the reason why the
image quality degrades is that, when we predict a bunch of pixels at the same
time, the model has to decide on the values for all of them at once. There are lots of plausible
ways that this missing patch could be filled in, and so the model outputs the average
of those. The model isn’t able to make sure that the generated values are
consistent with each other. In contrast, when we predict one pixel at a time, the model
gets to see the previously generated pixels, and so the model can change its prediction for
this pixel to make it consistent with what has already been generated. This is why there’s a
trade-off: the more pixels we generate at once, the less computation we need to use, but the
worse the quality of the generated images will be. However, this problem only arises if the
values we are predicting are related to each other. Suppose that the values were
statistically independent of each other, that is, knowing one of them does not
help to predict any others. In this case, the model doesn't need to look at
the previously generated values, since knowing what they were wouldn’t change
its prediction for the next value anyway. In this case you can predict all of them at
the same time without any loss in quality. So, that means, ideally, we want our model to
generate a set of pixels that are unrelated to each other. For natural images, nearby
pixels are the most strongly related, because they are usually part of the same
object. Knowing the value of one pixel very often gives you a good idea of what color nearby
pixels will be. This means that removing pixels in contiguous chunks is actually the worst way to
do it. Instead, we should be removing pixels that are far away from each other, and hence more
likely to be unrelated. So if in each step, we remove a random set of pixels, and predict
values for those, then we can remove more pixels in each step for the same loss in image
quality, compared to contiguous chunks. In order to minimize the number of steps
needed for generation, we want the pixels we remove in each step to be as spread out as
possible. Removing pixels in a random order is a pretty good way of maximizing the average
spread, but there is an even better way. We can think of our generative model as two
processes: a removal process that gradually removes information from the input, until
nothing is left. And a generation process that uses neural nets to undo the removal process,
generating and adding back in information. So far, we have been completely removing pixels.
But rather than completely removing a pixel, we could instead remove only some of the
information from a pixel, by, for example, adding a small amount of random noise to it.
This means we don’t know exactly what the original pixel value was, but we do know it
was somewhere close to the noisy value. Now, instead of removing a bunch of pixels in each
step, we can add noise to the entire image. This way, we can remove information from
every pixel in the image in a single step, which is the most spread-out way of removing
information. And since it's more spread out, you can remove more information in each step,
for the same loss in generation quality. There is one small problem with this though. When
we want to generate a new image, we need to start the neural net off with some initial blank image.
When we were removing pixels, every image eventually ended up as a completely black image,
so of course that’s where we start the generation process from. But now that we’re adding noise,
the values just keep getting larger and larger, never converging to anything. So where
do we start the generation process from? We can avoid this problem by changing
our noising step slightly, so that we first scale down the original value and
then add the noise. This ensures that, when we repeat this noising step many
times, information from the original image will disappear, and the result will
be equivalent to a pure random sample from the noise distribution. So we can start our
generation process from any such noise sample.
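Here is what that scaled noising step looks like as a sketch (the schedule value beta is made up; real models use a carefully chosen schedule of values):

    import numpy as np

    rng = np.random.default_rng()

    def noising_step(image, beta):
        # Scale the image down slightly, then add a little Gaussian noise.
        return np.sqrt(1.0 - beta) * image + np.sqrt(beta) * rng.standard_normal(image.shape)

    x = np.full((8, 8), 0.5)            # start from some image
    for _ in range(1000):
        x = noising_step(x, beta=0.02)  # after many repetitions, x is essentially a pure noise sample

And there we have it, this is known as a denoising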
diffusion model. The overall form is identical to an auto-regressor; the only difference is
the way in which we remove information at each step. By adding noise, we can spread out the
removal of information all across the image, which makes the predicted values as independent
of each other as possible, allowing you to use fewer neural net evaluations. Empirically,
diffusion models can produce high-quality photo-realistic images in about a hundred steps,
whereas auto-regressors would take millions. Now that we understand how these generative models work at a conceptual level, there are a few important technical details you should be aware of if you ever want to implement them in practice. First, in the procedure I described for
auto-regression, I used a different neural net in each step of the process. This is certainly
the best way to get the most accurate predictions, but it’s also very inefficient, since we
need to train a whole bunch of different neural nets. In practice, you would just use
the same model to do every step. This gives slightly worse predictions, but the savings
in computation time more than make up for it. In order to train a single neural net
to perform all of the generation steps, you would remove a random number of
pixels from each input, and train the neural net to predict the next missing pixel for each input. You can also give the number of pixels removed as an input to the neural net, so that it knows which pixel it's supposed to be generating. Now this one
neural net can be used for all generation steps.
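Concretely, one training example for that shared neural net might be built like this (a rough sketch, using the simple last-pixels-first removal order from earlier; image is assumed to be a flattened array of pixel values):

    import numpy as np

    rng = np.random.default_rng()

    def make_training_example(image):
        num_missing = rng.integers(1, image.size + 1)  # remove a random number of pixels
        masked = image.copy()
        masked[image.size - num_missing:] = 0.0        # blank out the last num_missing pixels
        label = image[image.size - num_missing]        # the first missing pixel is the training label
        return masked, num_missing, label              # num_missing tells the net which step this is

In the setup I just described, for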
each training image, the neural net is trained on only one generation
step for that image. But ideally, we would like to train it on every generation
step of every image, since that gets more use out of our training data. If you did
this the naïve way, you would have to evaluate the neural net once for every generation step, which means a lot more computation. Fortunately, there exist special neural net
architectures, known as causal architectures, that allow you to train on all of these
generation steps while only evaluating the neural net once. There exist causal versions
of all of the popular neural net architectures, such as causal convolutional neural
nets, and causal transformers. Causal architectures actually give slightly
worse predictions, but in practice, auto-regression is almost always done with causal
architectures because the training is so much faster. The generation process for causal architectures is still exactly the same, though.
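For the transformer case, the causal part is essentially just an attention mask that stops each position from looking at later positions, so every next-element prediction in a sequence can be trained in a single forward pass. A minimal sketch of such a mask (assuming PyTorch; the sequence length is arbitrary):

    import torch

    seq_len = 6
    # mask[i, j] is True where position i is allowed to attend to position j (only j <= i),
    # so the prediction at position i can only depend on positions 0..i.
    causal_mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
    print(causal_mask)

For diffusion models, you can't use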
causal architectures and so you do have to just train with each data
point at a random generation step. I described the diffusion model as predicting
the slightly less noisy image from the previous step. However, it’s actually better to predict
the original, completely clean image at every step. The reason is that this makes the job of the neural net easier. If you make it predict the noisy next-step image, then the
neural net needs to learn how to generate images at all different noise levels. This means
the model will waste some of its capacity learning to produce noisy versions of images.
If you instead just have the neural net always predict the clean image, then the model only
needs to learn how to generate clean images, which is all we care about. You can then
take the predicted clean image and reapply the noising process to it to get the
next step of the generation process. Except that, when you predict the clean image directly, at the early steps of the generation process the model has only pure noise as input, so the original clean image could have been anything, and you get a blurry mess again. To avoid
this, we can train the neural net to predict the noise which was added to the image. Once we
have a predicted value for the noise, we can plug it back into the noising equation to get a prediction for the
original clean image. So we are still predicting the original clean image, just in a round-about
way. The advantage of doing it this way is that now the model's output is uncertain at the
later stages of the generation process, since any noise could have been added to
the clean image. So the model outputs the average of a bunch of different noise
samples, which is still valid noise.
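For reference, in the standard diffusion setup the noisy image is a scaled-down clean image plus scaled noise, so once the noise has been predicted the clean image can be recovered by rearranging that relationship (a sketch; alpha_bar_t denotes the product of the per-step scaling factors up to the current step):

    import numpy as np

    def predict_clean_image(noisy_image, predicted_noise, alpha_bar_t):
        # Forward process: noisy = sqrt(alpha_bar_t) * clean + sqrt(1 - alpha_bar_t) * noise.
        # Rearranging gives an estimate of the clean image from the predicted noise:
        return (noisy_image - np.sqrt(1.0 - alpha_bar_t) * predicted_noise) / np.sqrt(alpha_bar_t)

So far we've just been generating images from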
nothing, but most image generators actually allow you to provide a text prompt describing
the image you want to make. The way this works is exactly the same: you just give the neural net the text as an additional input at each step. These models are trained on pairs of
images and their corresponding text descriptions, usually scraped from image alt text tags found
on the internet. This ensures that the generated image is something for which the text prompt could plausibly have been given as a description. In principle, you can condition
generative models on anything, not just text, so long as you can find
appropriate training data. For example, here is a generative model that
is conditioned on sketches. Finally, there’s a technique to make
conditional diffusion models work better, called classifier free guidance. For this,
during training the model is sometimes given the text prompt as additional input, and sometimes it isn't. This way, the same model learns to make predictions with or
without the conditioning prompt as input. Then, at each step of the denoising process, the
model is run twice, once with the prompt, and once without. The prediction without the
prompt is subtracted from the prediction with the prompt, which removes details that are
generated without the prompt, thus leaving only details that came from the prompt, leading to
generations which more closely follow the prompt.
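In code, one guided denoising step looks something like this (a sketch; model stands in for the trained noise predictor, and the guidance scale is a typical hand-tuned value rather than anything prescribed):

    def guided_noise_prediction(model, noisy_image, step, prompt, guidance_scale=7.5):
        noise_with_prompt = model(noisy_image, step, prompt)
        noise_without_prompt = model(noisy_image, step, None)  # prompt dropped, as it sometimes was in training
        # Push the prediction away from the unconditional one, towards the prompted one.
        return noise_without_prompt + guidance_scale * (noise_with_prompt - noise_without_prompt)

In conclusion, generative AI, like all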
machine learning, is just curve fitting. And that’s all for this video. If you enjoyed
it, please like and subscribe. And if you have any suggestions for topics you'd like me to
cover in a future video, leave a comment below.