In many CSI movies, there's that scene where
someone finds a small and obscured image, and they get a clear picture out of it by
zooming and enhancing it. Is this really possible? Mostly no, those movies are nowhere near technically
accurate. But, to some extent, yes. It is indeed possible to enlarge and enhance images. That process is called super-resolution, and it's what we are going to be talking about in this video. There are multiple ways to do it. For now, let's focus on single-frame super-resolution,
where we have a single low-resolution image, and we want to upscale it. So how can we upscale an image? The simplest way to do it would be to spread
the pixels out and fill in the holes by copying the values from the closest pixels. That's how nearest neighbor upscaling works. But that doesn't really look like a higher-resolution
image, does it? It looks more like a low-resolution image
with larger pixels. You can improve it a bit by taking the weighted
averages of the neighboring pixels rather than copying the closest ones. That's essentially what bilinear and bicubic
upscaling algorithms do. But even that doesn't look good enough.
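If you want to try these classical filters yourself, here's a minimal Python sketch using Pillow; the input filename is just a placeholder:

```python
# Compare classical upscaling filters with Pillow.
# Assumes a local file "input.png"; the filename is a placeholder.
from PIL import Image

img = Image.open("input.png")
new_size = (img.width * 4, img.height * 4)  # 4x upscaling

# Nearest neighbor: copy the closest pixel value into each hole.
nearest = img.resize(new_size, resample=Image.NEAREST)

# Bilinear/bicubic: weighted averages of neighboring pixels.
bilinear = img.resize(new_size, resample=Image.BILINEAR)
bicubic = img.resize(new_size, resample=Image.BICUBIC)

nearest.save("nearest.png")
bilinear.save("bilinear.png")
bicubic.save("bicubic.png")
```

In information theory, there's a concept called the data processing inequality. It states that no matter how you process data,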
you cannot add information that is not already there. This implies that missing data cannot be recovered
by further processing. Does that mean super-resolution is theoretically
impossible? Not if you have an additional source of information. A neural network can learn to hallucinate
details based on prior information it collects from a large set of images. The details added to an image this way would still not violate the data processing inequality, because the information is there, somewhere
in the training set, even if it's not in the input image. So, how can we train such a model? If you watched my Deep Learning Crash Course
series, you might be thinking: can't we just train a neural network to learn a mapping
between low and high-resolution images? Yes, we can, and we wouldn't be the first
ones to do so. That's pretty much what the SRCNN paper did. First, we can create a dataset by collecting
high-resolution images and downscaling them, or we can simply use one of the existing super-resolution
datasets, such as the DIV2K dataset. Then, we can build a convolutional neural
network that would input only the low-resolution images, and we can train it to produce higher
resolution images that match the original ones as closely as possible. The SRCNN paper did exactly that: it simply minimized the squared difference between the pixel values of the network output and the original high-resolution image.
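To make this concrete, here's a rough PyTorch sketch of an SRCNN-style setup; the layer sizes follow the spirit of the paper, but the training step and data pipeline are simplified assumptions:

```python
# A minimal SRCNN-style model trained with MSE, sketched in PyTorch.
import torch
import torch.nn as nn

class SRCNN(nn.Module):
    def __init__(self):
        super().__init__()
        # SRCNN operates on an image already upscaled with bicubic
        # interpolation; the network only restores the details.
        self.net = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=9, padding=4),  # patch extraction
            nn.ReLU(inplace=True),
            nn.Conv2d(64, 32, kernel_size=1),            # non-linear mapping
            nn.ReLU(inplace=True),
            nn.Conv2d(32, 3, kernel_size=5, padding=2),  # reconstruction
        )

    def forward(self, x):
        return self.net(x)

model = SRCNN()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
mse = nn.MSELoss()

# The data pipeline is assumed: pairs of (bicubic-upscaled, ground-truth)
# images, e.g. built from DIV2K by downscaling the originals.
def train_step(lr_img, hr_img):
    optimizer.zero_grad()
    loss = mse(model(lr_img), hr_img)
    loss.backward()
    optimizer.step()
    return loss.item()
```

But is mean squared error really the right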
metric to optimize? This is a very old debate. Long story short, mean squared error doesn't
express the human perception of image fidelity well. For example, all of these distorted images
are equally distant from the original image in terms of mean squared error. Clearly, they don't look equally good. That's because mean squared error cares only about
pixel-wise intensity differences but not the structural information about the contents
of an image. There's a better measure of perceptual image
quality called the structural similarity index, which was developed in my lab at the University
of Texas at Austin. The structural similarity index has had a very high impact, both in academia and in industry. My doctoral advisor, Alan Bovik, and his collaborators
won a Primetime Emmy Award for this method a few years ago. This metric was initially developed to measure
the severity of image degradations. However, many researchers have also used it as a loss function to train neural networks for image restoration.
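For a quick comparison of the two metrics, here's a small sketch using scikit-image, assuming a reasonably recent version and two aligned 8-bit images on disk:

```python
# Compare MSE and SSIM on the same pair of images using scikit-image.
# "original.png" and "distorted.png" are placeholder filenames.
from skimage.io import imread
from skimage.metrics import mean_squared_error, structural_similarity

original = imread("original.png")
distorted = imread("distorted.png")

print("MSE: ", mean_squared_error(original, distorted))
print("SSIM:", structural_similarity(
    original, distorted,
    channel_axis=-1,  # treat the last axis as color channels
    data_range=255,   # 8-bit images
))
```

More recently, people have also started using pre-trained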
convolutional neural networks as perceptually-relevant loss functions. How it works is that you first take a pre-trained
model. This is typically a VGG-19 model trained on
ImageNet. Then you take its first few layers and compute
the difference between the feature maps produced by those layers. The difference between the feature maps can
be minimized to train another model, just like any other loss function. The layers that generate those feature maps
stay frozen during training and act as a fixed feature extractor. This method is commonly referred to as perceptual loss, content loss, or VGG loss.
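Here's what that might look like in PyTorch; the cutoff layer index is an assumption, and I'm skipping the ImageNet input normalization you'd want in practice:

```python
# A perceptual (VGG) loss sketched in PyTorch, using torchvision's
# pre-trained VGG-19. Which layer to cut at varies between papers,
# and ESRGAN notably compares feature maps *before* the activation.
import torch
import torch.nn as nn
from torchvision.models import vgg19, VGG19_Weights

class VGGLoss(nn.Module):
    def __init__(self, layer_index=16):  # cutoff layer is an assumption
        super().__init__()
        features = vgg19(weights=VGG19_Weights.IMAGENET1K_V1).features
        self.extractor = nn.Sequential(*list(features)[:layer_index])
        self.extractor.eval()
        for p in self.extractor.parameters():
            p.requires_grad = False  # frozen: a fixed feature extractor

    def forward(self, generated, target):
        # Minimize the difference between feature maps, not pixels.
        return nn.functional.mse_loss(
            self.extractor(generated), self.extractor(target)
        )
```

How is this relevant to super-resolution? We can use this loss function to train models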
to enhance images and get pretty decent results. But, sometimes, it doesn't feel fair to penalize
the model for pixelwise differences that don't really make much difference for human viewers. For example, does the direction of the hair
on this baboon's face really matter? What if we cared a little less about what the original high-resolution images looked like, as long as the produced images looked good? We can do that by using GANs: generative adversarial
networks. GANs consist of two networks fighting each
other to achieve adversarial goals. I made a more detailed video about this earlier. There's a GAN-based super-resolution system
called SRGAN. It uses a generator network that takes low-resolution images as input and tries to produce their high-resolution versions. It also uses a discriminator network that tries to tell whether its input is a real high-resolution image or an image upscaled by the generator. Both networks are trained simultaneously,
and they both get better over time. Once the training is done, all we need is
the generator part to upscale low-resolution images. In addition to this adversarial training setup, SRGAN also used a VGG-based loss function like the one we talked about earlier.
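A bare-bones version of that training step might look like this in PyTorch; the architectures, output shapes, and loss weighting here are placeholders, not SRGAN's exact recipe:

```python
# A skeleton adversarial training step. `generator` and `discriminator`
# stand in for SRGAN's actual architectures; the discriminator is
# assumed to output one raw logit per image.
import torch
import torch.nn as nn

bce = nn.BCEWithLogitsLoss()

def train_step(generator, discriminator, g_opt, d_opt,
               lr_img, hr_img, vgg_loss):
    n = hr_img.size(0)

    # --- Discriminator: real HR images vs. generator outputs ---
    d_opt.zero_grad()
    fake = generator(lr_img).detach()
    d_loss = bce(discriminator(hr_img), torch.ones(n, 1)) + \
             bce(discriminator(fake), torch.zeros(n, 1))
    d_loss.backward()
    d_opt.step()

    # --- Generator: fool the discriminator + match VGG features ---
    g_opt.zero_grad()
    fake = generator(lr_img)
    g_loss = vgg_loss(fake, hr_img) + \
             1e-3 * bce(discriminator(fake), torch.ones(n, 1))  # weight is a placeholder
    g_loss.backward()
    g_opt.step()
    return d_loss.item(), g_loss.item()
```

There's another paper called Enhanced SRGAN,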
which proposed a few tricks to improve the results further. Enhanced SRGAN, or ESRGAN for short, somehow
got popular in the gaming community. People started using it to upscale vintage
games, and it worked pretty well. It's surprising how well it worked on video
game graphics despite being trained only on natural images. Let's take a look at what enhancements the
ESRGAN paper proposed for better results. First, they removed the batch normalization
layers in their network architecture. This may sound contradictory to what I said
in my previous videos, but it's not. Batch normalization does help a lot for many
computer vision tasks. But for image-processing related tasks, such
as super-resolution or image restoration in general, batch normalization can create some
artifacts. They also added more layers and connections
to their model architecture. It's not surprising that a more sophisticated
model resulted in better images, but deeper models can be trickier to train, especially
if they are not using batch normalization layers. So, the authors of ESRGAN used tricks like residual scaling to stabilize the training of such a deep network.
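Residual scaling itself is a one-line idea; here's a sketch, with a simplified residual block standing in for ESRGAN's dense blocks:

```python
# Residual scaling, sketched in PyTorch: the output of each residual
# block is multiplied by a small constant before being added back to
# the identity path, which keeps early training more stable.
import torch.nn as nn

class ScaledResidualBlock(nn.Module):
    def __init__(self, channels=64, scale=0.2):
        super().__init__()
        self.scale = scale
        # A simplified stand-in for ESRGAN's dense blocks.
        self.block = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
        )

    def forward(self, x):
        return x + self.scale * self.block(x)  # scale down the residual
```

In addition to the changes in the model architecture,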
they also modified the loss functions. For example, they modified the VGG-loss in
a way that compared the feature maps before activations. Their rationale is that the feature maps are
denser and contain more information before they get clipped by the activation functions. In the original SRGAN paper, the discriminator
model was trained to detect whether its input is real or fake. In the enhanced version, the authors used a relativistic discriminator, which instead tells whether its input looks more realistic than the fake data, or less realistic than the real data.
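Here's roughly what the relativistic average discriminator loss looks like in PyTorch, assuming raw logits from the discriminator:

```python
# The relativistic average discriminator loss, sketched in PyTorch.
# `real_logits` and `fake_logits` are raw discriminator outputs for
# a batch of real and generated images.
import torch
import torch.nn as nn

bce = nn.BCEWithLogitsLoss()

def relativistic_d_loss(real_logits, fake_logits):
    # Real images should score higher than the average fake...
    loss_real = bce(real_logits - fake_logits.mean(),
                    torch.ones_like(real_logits))
    # ...and fake images should score lower than the average real.
    loss_fake = bce(fake_logits - real_logits.mean(),
                    torch.zeros_like(fake_logits))
    return (loss_real + loss_fake) / 2
```

Earlier I said minimizing the mean squared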
error might not be the best way to generate textures that look appealing to the human
visual system. Then, I went on to say maybe we shouldn't
care too much about how close the generated images are to the original ones. There's actually a trade-off there. We would still want the upscaled images to
be a faithful representation of the originals while having good-looking textures. The ESRGAN paper aims to find the sweet spot
by interpolating between models. What they do is compute a weighted average of the parameters of two models: one trained using mean squared error, and the other fine-tuned with adversarial training. Blending the parameters this way lets you dial in the right balance between the two without retraining either model.
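Parameter blending is simple enough to sketch in a few lines; this assumes both checkpoints share the same architecture, and the file names are placeholders:

```python
# Network interpolation, sketched in PyTorch: blend the parameters of
# an MSE-trained model and a GAN-fine-tuned model of the same
# architecture. alpha = 0 gives the MSE model, alpha = 1 the GAN one.
import torch

def interpolate_weights(mse_state, gan_state, alpha=0.8):
    return {
        name: (1 - alpha) * mse_state[name] + alpha * gan_state[name]
        for name in mse_state
    }

# Usage sketch (file names are placeholders):
# mse_state = torch.load("model_mse.pth")
# gan_state = torch.load("model_gan.pth")
# model.load_state_dict(interpolate_weights(mse_state, gan_state))
```

More recently, another paper also explored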
the idea of network interpolation, and their results also look promising. Super-resolution is a relatively hot topic,
and many researchers are experimenting with different ways of approaching this problem
and are publishing their results. One paper, titled "Zoom to Learn, Learn to Zoom," for example, focuses on building a model that mimics optical zoom directly on
raw sensor data. The authors created a dataset of raw images,
and their corresponding optically zoomed ground truth. They also proposed a loss function named "contextual
bilateral loss" to handle slightly misaligned image pairs. Speaking of raw images, Google Pixel's Super
Res Zoom feature showed that it's possible to achieve super-resolution through a burst
of raw images. Google's method makes use of slight hand movements
to fill in the missing spots in an upscaled image. So what if the user is shooting on a tripod, and the image is perfectly still? Then, they deliberately jiggle the camera between the shots. So, to implement something like this, you need to have complete control of the hardware. Unlike the other methods we covered so far, Google's Super Res Zoom is a multi-frame super-resolution algorithm.
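To get a feel for the idea, here's a toy shift-and-add sketch in NumPy for grayscale frames with known sub-pixel shifts; real burst pipelines have to estimate the shifts and handle motion and occlusion, so treat this purely as illustration:

```python
# Toy multi-frame super-resolution: scatter each burst frame onto a
# denser grid according to its sub-pixel shift, then average.
import numpy as np

def shift_and_add(frames, shifts, scale=2):
    h, w = frames[0].shape  # grayscale frames for simplicity
    acc = np.zeros((h * scale, w * scale))
    count = np.zeros_like(acc)
    for frame, (dy, dx) in zip(frames, shifts):
        # Place each low-res sample at its sub-pixel position on the
        # high-res grid (wrap-around at borders is a simplification).
        ys = (np.arange(h) * scale + round(dy * scale)) % (h * scale)
        xs = (np.arange(w) * scale + round(dx * scale)) % (w * scale)
        acc[np.ix_(ys, xs)] += frame
        count[np.ix_(ys, xs)] += 1
    return acc / np.maximum(count, 1)  # average; holes stay zero
```

If you don't have such bursts of images and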
want to upscale your pictures, you can easily use the single-frame super-resolution methods
that we overviewed today. ESRGAN, for example, operates on a single input image and is easy to run on any picture you like.
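A typical workflow looks something like this; note that `load_pretrained_esrgan` is a made-up placeholder for however you build the network and load its weights (check the official ESRGAN repository's test script for the real entry point):

```python
# A hypothetical end-to-end sketch for running a pretrained
# ESRGAN-style model on one image. `load_pretrained_esrgan` is a
# made-up placeholder, not a real API.
import numpy as np
import torch
from PIL import Image

model = load_pretrained_esrgan("esrgan_weights.pth")  # hypothetical helper
model.eval()

img = np.asarray(Image.open("photo.png").convert("RGB"),
                 dtype=np.float32) / 255.0
x = torch.from_numpy(img).permute(2, 0, 1).unsqueeze(0)  # HWC -> NCHW

with torch.no_grad():
    y = model(x).squeeze(0).permute(1, 2, 0).clamp(0, 1).numpy()

Image.fromarray((y * 255).round().astype(np.uint8)).save("photo_4x.png")
```

There are also task-specific super-resolution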
models, which are worth mentioning. For instance, face-upscaling models use face priors to synthesize realistic details on faces. Basically, the models know what a face typically looks like and use that information to hallucinate the details. As you can tell, those methods are absolutely
not suitable for CSI purposes, since all the details in the upscaled version are completely
made up. Alright, that's all for today. I hope you liked it. I put the links to all referenced papers in
the description below. Subscribe for more videos. And as always, thanks for watching, stay tuned,
and see you next time.