To support the production of more high-quality content, consider supporting us on
Patreon or a YouTube membership. Additionally, consider visiting
our parent company EarthOne, for sustainable living made simple. Throughout this deep learning series, we have gone from the origins of the field and how the structure of the artificial neural
network was conceived to working through an intuitive example, covering the main aspects and
some of the many complexities of deep learning. Now all of these videos
have only been focused on one type of neural network,
the feedforward network. The focus of this video then
will be to initiate discussion on another very popular and important neural network architecture, the convolutional neural network. For our discussions on
convolutional neural networks, we will go with a very common
example, number recognition. This example was chosen because of this great
interactive resource by Adam Harley, a robotics PhD at Carnegie Mellon University. I have linked the resource
in the description so that you too can experiment with it and see the internals of
a convolutional network. Now, if you can recall from our example on an image pattern recognizer
using a feedforward network, we set up an idealized system
and made the assumption that each hidden layer in our network would build upon further
layers of abstraction, from vertical lines to
combinations of vertical lines. However, in actuality, as we explained, this wouldn't happen,
and the receptive fields in the network would be a lot
more random to the human eye due to the architecture
of feedforward networks and how they compute. With convolutional networks,
however, this isn't the case, and we can, to some extent, see these layers of
abstraction building up. Before we delve into why
and how this happens, let's first set up a
structure for the network we will be using. As with our previous example
in image recognition, the input will be the
individual pixels of our image and the output the patterns
we are trying to classify. In this example's case,
we have 10 outputs, numbers zero to nine. For the layers in between, we will have two convolutional
layers, two pooling layers, and two fully connected layers. Now, that was a lot of new
terminology thrown around. So let's break it down layer by layer, starting with the input. While we have already
stated that the input is the individual pixels of the image, I want to mention some important details about how a computer views this image. Every image in a digital device is stored as a matrix of pixel values. This is referred to as a channel, a certain component of an image. Now, with a typical digital camera, every image will have three channels, red, green, and blue, RGB, which you can imagine as three 2D matrices stacked upon one another. For the sake of simplicity, we will assume our input
has just one channel, the luminance of the image, with the value of each pixel
represented in eight bits. In other words, the pixel values will
range from zero to 255, with zero indicating no luminance, that is, black, and 255 as fully bright.
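As a rough illustration, here is a minimal NumPy sketch of how such a single-channel image might look in code; the pixel values are made up:

```python
import numpy as np

# A toy 4x4 single-channel (luminance) image, stored as a matrix of
# 8-bit pixel values: 0 is black (no luminance), 255 is fully bright.
image = np.array([[  0,  50, 120, 255],
                  [ 10,  80, 200, 230],
                  [  0,  40, 160, 180],
                  [  5,  20,  90, 100]], dtype=np.uint8)

print(image.shape)  # (4, 4): height x width, a single channel
```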
As a side note, digital systems store all types of sensor data and other information in this fashion. For example, for a convolutional network that would operate on speech, we could represent that speech as a matrix of frequency values over time, a spectrogram. Coming back on topic, now that we understand
the format of the input, let's finally delve into
where the real magic happens: the convolutional layer. As you may have guessed, this layer and the network as a whole get their names from the convolution operator, which in layman's terms is a mathematical combination of two functions to produce a third. In CNNs, this operation is implemented in what is referred to
as a feature detector, filter, or most commonly, a kernel. You can think of a
kernel as a mini matrix, much smaller than the input. In a convolution operation, then, the kernel moves across the input image, taking the dot product of the kernel and the patch of pixels beneath it and saving the
values to a new matrix, dubbed the feature map
of the original image.
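Here is a minimal sketch of that sliding-window operation, assuming a single-channel input, a stride of one, and no padding. (Strictly speaking, this computes cross-correlation, which is what deep learning libraries implement under the name convolution.)

```python
import numpy as np

def convolve2d(image, kernel):
    """Slide the kernel across the image, taking the dot product of the
    kernel and the patch beneath it at each position to build the feature map."""
    kh, kw = kernel.shape
    ih, iw = image.shape
    feature_map = np.zeros((ih - kh + 1, iw - kw + 1))
    for y in range(feature_map.shape[0]):
        for x in range(feature_map.shape[1]):
            patch = image[y:y + kh, x:x + kw]
            feature_map[y, x] = np.sum(patch * kernel)  # multiply element-wise, then sum
    return feature_map
```

Real libraries vectorize this heavily, but the arithmetic is exactly the same.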
At this point, you may be wondering how this is able to detect any features at all. Well, if a kernel is initialized with values in a specific configuration, it can be used to
transform an input image and find various patterns. Take these kernels, for instance. When loaded with the
appropriate values and convolved with the input image, they produce an output which highlights various
edges in the photo. Now I won't delve into the specifics of the various types of
kernels in this video, but there are many resources that talk about them extensively, such as Computerphile and their
various videos on kernels, like the Sobel edge detector.
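For instance, here are the Sobel kernels just mentioned, reusing the convolve2d sketch and the toy image from earlier:

```python
import numpy as np

# The Sobel kernels approximate the image's intensity gradient in the
# horizontal and vertical directions.
sobel_x = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]])
sobel_y = sobel_x.T

vertical_edges = convolve2d(image, sobel_x)    # responds to vertical edges
horizontal_edges = convolve2d(image, sobel_y)  # responds to horizontal edges
```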
Coming back on topic, as you can see, kernels are essentially a way to perform fast, local computation on an image, producing a new output with an effect applied or certain features highlighted. Programs like Photoshop
use these simple kernels for effects like blurs, for instance. Now, when we look at a typical image, take in how much is actually going on. There are various edges,
shapes, textures, et cetera, that stack together to
make various objects in the overall image we are looking at. At the start of a convolutional network, the types of kernels we
use would be quite simple and more geometric, detecting
things such as edges, corners, and simple shapes and patterns, like a circle, for instance. Additionally, each convolutional layer can have multiple kernels that
produce multiple feature maps of their own. So for our example, we can see that our first
layer has six kernels, and these kernels are
detecting simple patterns, like horizontal lines,
vertical lines, and corners on our drawn numbers. As a side note, for each
of the convolved pixels in our feature map, a
non-linearity function was also applied, in
this instance, a ReLU. We discussed this function and the need for nonlinearity in neural networks in previous videos; applying it here makes our feature maps more adaptable to real-world data.
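As a rough sketch, the ReLU step amounts to no more than this, reusing the toy pieces from earlier:

```python
import numpy as np

def relu(feature_map):
    """ReLU non-linearity: keep positive activations, zero out the rest."""
    return np.maximum(0, feature_map)

# A "rectified feature map": convolve, then apply the non-linearity.
rectified_map = relu(convolve2d(image, sobel_x))
```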
Also, you can now see how complicated this process would become if we had more than one channel, as each channel would
have six feature maps, in this instance leading
to 18 for an RGB image, and then those feature maps would have to be filtered together or operated on separately, depending on your specifications. For instance, in a self-driving
system, a red octagon is a stop sign, so we would need separate color channels and kernels for them. However, coming back to our example, just one channel, luminance,
was more than enough. Moving on, now that we have
our rectified feature maps, the next layer in CNNs
after the convolution is a pooling layer. Pooling layers are used to
downsample our feature maps, keeping the most important
parts and discarding the rest. This is done primarily to reduce overfitting and to speed up calculations in later layers, thanks to the reduced
spatial size of the image. The type of pooling our network
implements is max pooling, in which we take another kernel and slide it across
our input feature maps, and the largest pixel value in each region is what is saved to our new output feature map.
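A minimal sketch of max pooling, assuming the common choice of a 2x2 window with a stride of two:

```python
import numpy as np

def max_pool(feature_map, size=2, stride=2):
    """Downsample by keeping only the largest value in each window."""
    h, w = feature_map.shape
    pooled = np.zeros((h // stride, w // stride))
    for y in range(0, h - size + 1, stride):
        for x in range(0, w - size + 1, stride):
            pooled[y // stride, x // stride] = feature_map[y:y + size, x:x + size].max()
    return pooled

pooled_map = max_pool(rectified_map)  # half the width, half the height
```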
As you can see in this number classifier, the max pooled layer still retains the important information from the convolutional layer, while being smaller in spatial size. For instance, in this feature map, you can still make out
the horizontal lines, and in this one, the vertical from their respective
original feature maps. So now with the defining layers of the convolutional
neural network understood, A, the convolutional layer
which extracts features, and then B, the pooling layer which downsamples the feature maps, the next step is repeating these layers to build up more
abstraction in the network. As stated earlier, in later
layers of the network, more complex kernels are incorporated that detect shapes, objects,
and other complex structures, which is done by leveraging the previously generated feature maps and their detected simple features to build more complex ones. This is essentially the
definition of abstraction. You can see in our example that in the next convolutional layer, there are 16 kernels, and these kernels are now much more complex. In the first convolutional layer, we were detecting simple
edges and corners. And now, as you can see, horizontal and vertical
edges are being combined as well as other simple shapes from the previous feature maps. For instance, with this drawn four, you can see this feature map
combining both horizontal and vertical edges from various
earlier layer feature maps, and resembling a bit of
the structure of a four, or with these feature maps of a five, where diagonal lines and S-shaped patterns are being detected. As you can see, similar higher level features
are present for all numbers. Now, after this convolutional layer, just as before, there is
another max pooling layer, to reduce the spatial
dimensions of the feature maps before the fully connected layers. As seen here, with
these newly pooled maps, they look a lot more
random to the human eye, since the spatial
resolution is now very low. However, to a computer, they are just the key points in a matrix that represent a drawn number. You can also see now why we
needed these pooling layers. From a raw input 32 by 32
pixel image with 1,024 pixels and no discernible features, to six low-level feature
maps of size 14 by 14 pixels resulting in 1,176 pixels. And then finally, 16
high-level feature maps, of five by five pixels,
resulting in 400 pixels.
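These counts fall straight out of the layer arithmetic; here is a quick sketch, assuming 5 by 5 kernels with no padding and 2 by 2 pooling, which is consistent with the sizes quoted:

```python
size = 32                # raw input: 32 x 32 = 1,024 pixels
size = size - 5 + 1      # first convolution:  32 -> 28
size = size // 2         # first max pool:     28 -> 14
print(6 * size * size)   # 6 feature maps of 14 x 14 -> 1,176 pixels
size = size - 5 + 1      # second convolution: 14 -> 10
size = size // 2         # second max pool:    10 -> 5
print(16 * size * size)  # 16 feature maps of 5 x 5 -> 400 pixels
```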
And this, in general, was the whole point of these convolutional and pooling layers: to detect high-level features at as low a spatial resolution as possible, which is why this part of the
convolutional neural network is referred to as feature extraction. At this point, with
the features extracted, we still need to be able to classify them. Hence the second part of a convolutional
network, the classifier. This part is composed of fully connected layers, similar to a feedforward network, except now the input to these perceptron layers is the set of high-level abstracted features from our image, rather
than the raw input pixels. In this example's case, there are 120 neurons in the
first fully connected layer, and 100 in the second. And as you can see, these are sufficient to
correctly classify numbers with a high degree of accuracy.
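Putting the whole architecture together, here is what the network described in this video might look like as a PyTorch sketch; the 5 by 5 kernel size is an assumption consistent with the feature-map sizes traced above:

```python
import torch.nn as nn

model = nn.Sequential(
    # Feature extraction
    nn.Conv2d(1, 6, kernel_size=5),    # 1 x 32x32 -> 6 x 28x28
    nn.ReLU(),
    nn.MaxPool2d(2),                   # -> 6 x 14x14
    nn.Conv2d(6, 16, kernel_size=5),   # -> 16 x 10x10
    nn.ReLU(),
    nn.MaxPool2d(2),                   # -> 16 x 5x5
    # Classifier
    nn.Flatten(),                      # -> 400 features
    nn.Linear(400, 120),               # first fully connected layer
    nn.ReLU(),
    nn.Linear(120, 100),               # second fully connected layer
    nn.ReLU(),
    nn.Linear(100, 10),                # 10 outputs, digits zero to nine
)
```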
Now, I hope you can recall, as we've covered in videos past, that during backpropagation and gradient descent, the weight and bias values are tuned as a network learns. In convolutional networks, those values are still altered during this process, along with other values, such as the kernel coefficients of our convolutional layers. (The pooling layers, by contrast, have nothing to learn.)
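A minimal sketch of what one training step might look like for the model sketched above; the images and labels tensors are hypothetical placeholders for a batch of drawn digits:

```python
import torch
import torch.nn as nn

loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)  # learning rate is arbitrary

logits = model(images)           # forward pass: images is an N x 1 x 32 x 32 tensor
loss = loss_fn(logits, labels)   # how wrong the predictions are
optimizer.zero_grad()
loss.backward()                  # backpropagation: compute the gradients
optimizer.step()                 # gradient descent: weights, biases, and
                                 # kernel coefficients are all updated
```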
Now, before continuing, I want to stop and highlight the fact that in our discussion of convolutional networks, we have made many generalizations and overlooked many details. For instance, hyper-parameters
like kernel size, stride and dilation rate,
transposed convolutions and padding, different
types of pooling layers, as well as other parameters
and hyper-parameters in our feature-extracting layers and their effects on the output.
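To give a flavor of these, here is a PyTorch sketch with arbitrary illustrative values:

```python
import torch.nn as nn

conv = nn.Conv2d(1, 6,
                 kernel_size=3,  # size of the kernel
                 stride=2,       # how far the kernel jumps at each step
                 padding=1,      # zeros added around the border
                 dilation=2)     # gaps inserted within the kernel
up = nn.ConvTranspose2d(6, 1, kernel_size=2, stride=2)  # a transposed convolution
avg = nn.AvgPool2d(2)  # average pooling, an alternative to max pooling
```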
Not to mention the impact of all the parameters in our fully connected layers, which we already mentioned
in previous videos. If you want more
information on these topics, check out the sources in the disclaimer and description below, as
they are beyond the scope of this video. Coming back on topic,
throughout this video, we've now seen how and why convolutional networks excel at tasks related to image classification. However, they don't perform as well on tasks such as
natural language processing. This is because these
tasks require memory. But not to fret, there are networks designed specifically for this, which, among others, will be
discussed in the next video in this series. However, this doesn't mean you
have to wait to learn more. If you want to learn
more about deep learning, and I mean really learn about the field, from how these artificial
learning algorithms were inspired by the brain to their foundational building block, the perceptron, scaling
up to multilayer networks, different types of networks, such as convolutional networks
and recurrent networks, and much more, then brilliant.org
is a place for you to go. For instance, this section on convolutional neural networks in the artificial neural networks course goes through many of the concepts we have discussed in this video. Now, what we love about how the topics in these courses are presented is that first an intuitive
explanation is given, and then you're taken
through related problems. If you get a problem wrong, you can see an explanation
for where you went wrong and how to rectify that flaw. In a world where automation
through algorithms will increasingly replace more jobs, it is up to us as individuals
to keep our brains sharp and think of creative solutions to multidisciplinary problems. To support Futurology and
learn more about Brilliant, go to brilliant.org/futurology,
and sign up for free. Additionally, the first 200
people that go to that link will get 20% off their
annual premium subscription. At this point, the video has concluded. We'd like to thank you for
taking the time to watch it. If you enjoyed it, consider supporting us on
Patreon or YouTube membership to keep this brand growing. And if you have any topic suggestions, please leave them in the comments below. Consider subscribing for more content, and check out our website
and parent company EarthOne for more information. This has been Ankur, you've been watching Futurology, and we'll see you again soon. (upbeat music)