To support the production of more high-quality content, consider supporting us on
Patreon or a YouTube membership. Additionally, consider visiting
our parent company EarthOne, for sustainable living made simple. Throughout this deep learning series, we have gone from the origins of the field and how the structure of the artificial neural
network was conceived to working through an intuitive example, covering the main aspects and
some of the many complexities of deep learning. Now all of these videos
have only been focused on one type of neural network,
the feedforward network. The focus of this video then
will be to initiate discussion on another very popular and important neural network architecture, the convolutional neural network. For our discussions on
convolutional neural networks, we will go with a very common
example, number recognition. This example was chosen because of this great
interactive resource by Adam Harley, a robotics PhD at Carnegie Mellon University. I have linked the resource
in the description so that you too can experiment with it and see the internals of
a convolutional network. Now, if you can recall from our example on an image pattern recognizer
using a feedforward network, we set up an idealized system
and made the assumption that each hidden layer in our network would build upon further
layers of abstraction, from vertical lines to
combinations of vertical lines. However, in actuality, as we explained, this wouldn't happen,
and the receptive fields in the network would be a lot
more random to the human eye due to the architecture
of feedforward networks and how they compute. With convolutional networks,
however, this isn't the case, and we can, to some extent, see these layers of
abstraction building up. Before we delve into why
and how this happens, let's first set up a
structure for the network we will be using. As with our previous example
in image recognition, the input will be the
individual pixels of our image and the output the patterns
we are trying to classify. In this example's case,
we have 10 outputs, numbers zero to nine. For the layers in between, we will have two convolutional
layers, two pooling layers, and two fully connected layers. Now, that was a lot of new
terminology thrown around. So let's break it down layer by layer, starting with the input. While we have already
stated that the input is the individual pixels of the image, I want to mention some important details about how a computer views this image. Every image in a digital device is stored as a matrix of pixel values. This is referred to as a channel, a certain component of an image. Now, with a typical digital camera, every image will have three channels, red, green, and blue, RGB, which you can imagine as three 2D matrices stacked upon one another. For the sake of simplicity, we will assume our input
has just one channel, the luminance of the image, with the value of each pixel
represented in eight bits. In other words, the pixel values will
range from zero to 255, with zero indicating no luminance, that is, black, and 255 as fully bright.
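As a rough illustration, here is a minimal NumPy sketch of how such a single-channel image might look in code; the pixel values are made up:

```python
import numpy as np

# A toy 4x4 single-channel (luminance) image, stored as a matrix of
# 8-bit pixel values: 0 is black (no luminance), 255 is fully bright.
image = np.array([[  0,  50, 120, 255],
                  [ 10,  80, 200, 230],
                  [  0,  40, 160, 180],
                  [  5,  20,  90, 100]], dtype=np.uint8)

print(image.shape)  # (4, 4): height x width, a single channel
```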
As a side note, digital systems store all types of sensor data and other information in this fashion. For example, for a convolutional network that would operate on speech, we could represent that speech as a matrix of frequency values over time, a spectrogram. Coming back on topic, now that we understand
the format of the input, let's finally delve into
where the real magic happens: the convolutional layer. As you may have guessed, this layer and the network as a whole get their names from the convolution operator, which in layman's terms is a mathematical combination of two functions to produce a third. In CNNs, this operation is implemented in what is referred to
as a feature detector, filter, or most commonly, a kernel. You can think of a
kernel as a mini matrix, much smaller than the input. In a convolution operation, then, the kernel moves across the input image, taking the dot product of the kernel and the patch of pixels beneath it and saving the
values to a new matrix, dubbed the feature map
of the original image.
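Here is a minimal sketch of that sliding-window operation, assuming a single-channel input, a stride of one, and no padding. (Strictly speaking, this computes cross-correlation, which is what deep learning libraries implement under the name convolution.)

```python
import numpy as np

def convolve2d(image, kernel):
    """Slide the kernel across the image, taking the dot product of the
    kernel and the patch beneath it at each position to build the feature map."""
    kh, kw = kernel.shape
    ih, iw = image.shape
    feature_map = np.zeros((ih - kh + 1, iw - kw + 1))
    for y in range(feature_map.shape[0]):
        for x in range(feature_map.shape[1]):
            patch = image[y:y + kh, x:x + kw]
            feature_map[y, x] = np.sum(patch * kernel)  # multiply element-wise, then sum
    return feature_map
```

Real libraries vectorize this heavily, but the arithmetic is exactly the same.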
At this point, you may be wondering how this is able to detect any features at all. Well, if a kernel is initialized with values in a specific configuration, it can be used to
transform an input image and find various patterns. Take these kernels, for instance. When loaded with the
appropriate values and convolved with the input image, they produce an output which highlights various
edges in the photo. Now I won't delve into the specifics of the various types of
kernels in this video, but there are many resources that talk about them extensively, such as Computerphile and their
various videos on kernels, like the Sobel edge detector.
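For instance, here are the Sobel kernels just mentioned, reusing the convolve2d sketch and the toy image from earlier:

```python
import numpy as np

# The Sobel kernels approximate the image's intensity gradient in the
# horizontal and vertical directions.
sobel_x = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]])
sobel_y = sobel_x.T

vertical_edges = convolve2d(image, sobel_x)    # responds to vertical edges
horizontal_edges = convolve2d(image, sobel_y)  # responds to horizontal edges
```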
Coming back on topic, as you can see, kernels are essentially a way to perform fast, local computation on an image, producing a new output with an effect applied or certain features highlighted. Programs like Photoshop
use these simple kernels for effects like blurs, for instance. Now, when we look at a typical image, take in how much is actually going on. There are various edges,
shapes, textures, et cetera, that stack together to
make various objects in the overall image we are looking at. At the start of a convolutional network, the types of kernels we
use would be quite simple and more geometric, detecting
things such as edges, corners, and simple shapes and patterns, like a circle, for instance. Additionally, each convolutional layer can have multiple kernels that
produce multiple feature maps of their own. So for our example, we can see that our first
layer has six kernels, and these kernels are
detecting simple patterns, like horizontal lines,
vertical lines, and corners on our drawn numbers. As a side note, for each
of the convolved pixels in our feature map, a
non-linearity function was also applied, in
this instance, a ReLU. We discussed this function and the need for nonlinearity in neural networks in previous videos; applying it here makes our feature maps more adaptable to real-world data.
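As a rough sketch, the ReLU step amounts to no more than this, reusing the toy pieces from earlier:

```python
import numpy as np

def relu(feature_map):
    """ReLU non-linearity: keep positive activations, zero out the rest."""
    return np.maximum(0, feature_map)

# A "rectified feature map": convolve, then apply the non-linearity.
rectified_map = relu(convolve2d(image, sobel_x))
```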
Also, you can now see how complicated this process would become if we had more than one channel, as each channel would
have six feature maps, in this instance leading
to 18 for an RGB image, and then those feature maps would have to be filtered together or operated on separately, depending on your specifications. For instance, in a self-driving
system, a red octagon is a stop sign, so we would need separate color channels and kernels for them. However, coming back to our example, just one channel, luminance,
was more than enough. Moving on, now that we have
our rectified feature maps, the next layer in CNNs
after the convolution is a pooling layer. Pooling layers are used to
downsample our feature maps, keeping the most important
parts and discarding the rest. This is done primarily to reduce overfitting and to speed up calculations in later layers, thanks to the reduced
spatial size of the image. The type of pooling our network
implements is max pooling, in which we take another kernel and slide it across
our input feature maps, and the largest pixel value in each region is what is saved to our new output feature map.
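A minimal sketch of max pooling, assuming the common choice of a 2x2 window with a stride of two:

```python
import numpy as np

def max_pool(feature_map, size=2, stride=2):
    """Downsample by keeping only the largest value in each window."""
    h, w = feature_map.shape
    pooled = np.zeros((h // stride, w // stride))
    for y in range(0, h - size + 1, stride):
        for x in range(0, w - size + 1, stride):
            pooled[y // stride, x // stride] = feature_map[y:y + size, x:x + size].max()
    return pooled

pooled_map = max_pool(rectified_map)  # half the width, half the height
```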
As you can see in this number classifier, the max pooled layer still retains the important information from the convolutional layer, while being smaller in spatial size. For instance, in this feature map, you can still make out
the horizontal lines, and in this one, the vertical from their respective
original feature maps. So now with the defining layers of the convolutional
neural network understood, A, the convolutional layer
which extracts features, and then B, the pooling layer which downsamples the feature maps, the next step is repeating these layers to build up more
abstraction in the network. As stated earlier, in later
layers of the network, more complex kernels are incorporated that detect shapes, objects,
and other complex structures, which is done by leveraging the previously generated feature maps and their detected simple features to build more complex ones. This is essentially the
definition of abstraction. You can see in our example that in the next convolutional layer, there are 16 kernels, and these kernels are now much more complex. In the first convolutional layer, we were detecting simple
edges and corners. And now, as you can see, horizontal and vertical
edges are being combined as well as other simple shapes from the previous feature maps. For instance, with this drawn four, you can see this feature map
combining both horizontal and vertical edges from various
earlier layer feature maps, and resembling a bit of
the structure of a four, or with these feature maps of a five, where diagonal lines and S-shaped patterns are being detected. As you can see, similar higher level features
are present for all numbers. Now, after this convolutional layer, just as before, there is
another max pooling layer, to reduce the spatial
dimensions of the feature maps before the fully connected layers. As seen here, with
these newly pooled maps, they look a lot more
random to the human eye, since the spatial
resolution is now very low. However, to a computer, they are just the key points in a matrix that represent a drawn number. You can also see now why we
needed these pooling layers. From a raw input 32 by 32
pixel image with 1,024 pixels and no discernible features, to six low-level feature
maps of size 14 by 14 pixels resulting in 1,176 pixels. And then finally, 16
high-level feature maps, of five by five pixels,
resulting in 400 pixels.
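These counts fall straight out of the layer arithmetic; here is a quick sketch, assuming 5 by 5 kernels with no padding and 2 by 2 pooling, which is consistent with the sizes quoted:

```python
size = 32                # raw input: 32 x 32 = 1,024 pixels
size = size - 5 + 1      # first convolution:  32 -> 28
size = size // 2         # first max pool:     28 -> 14
print(6 * size * size)   # 6 feature maps of 14 x 14 -> 1,176 pixels
size = size - 5 + 1      # second convolution: 14 -> 10
size = size // 2         # second max pool:    10 -> 5
print(16 * size * size)  # 16 feature maps of 5 x 5 -> 400 pixels
```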
And this, in general, was the whole point of these convolutional and pooling layers: to detect high-level features at as low a spatial resolution as possible, which is why this part of the
convolutional neural network is referred to as feature extraction. At this point, with
the features extracted, we still need to be able to classify them. Hence the second part of a convolutional
network, the classifier. This part is composed of fully connected layers, similar to a feedforward network, except now the input to these perceptron layers is the set of high-level abstracted features from our image, rather
than the raw input pixels. In this example's case, there are 120 neurons in the
first fully connected layer, and 100 in the second. And as you can see, these are sufficient to
correctly classify numbers with a high degree of accuracy.
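Putting the whole architecture together, here is what the network described in this video might look like as a PyTorch sketch; the 5 by 5 kernel size is an assumption consistent with the feature-map sizes traced above:

```python
import torch.nn as nn

model = nn.Sequential(
    # Feature extraction
    nn.Conv2d(1, 6, kernel_size=5),    # 1 x 32x32 -> 6 x 28x28
    nn.ReLU(),
    nn.MaxPool2d(2),                   # -> 6 x 14x14
    nn.Conv2d(6, 16, kernel_size=5),   # -> 16 x 10x10
    nn.ReLU(),
    nn.MaxPool2d(2),                   # -> 16 x 5x5
    # Classifier
    nn.Flatten(),                      # -> 400 features
    nn.Linear(400, 120),               # first fully connected layer
    nn.ReLU(),
    nn.Linear(120, 100),               # second fully connected layer
    nn.ReLU(),
    nn.Linear(100, 10),                # 10 outputs, digits zero to nine
)
```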
Now, I hope you can recall, as we've covered in videos past, that during backpropagation and gradient descent, the weight and bias values are tuned as a network learns. In convolutional networks, those values are still altered during this process, along with other values, such as the kernel coefficients of our convolutional layers. (The pooling layers, by contrast, have nothing to learn.)
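A minimal sketch of what one training step might look like for the model sketched above; the images and labels tensors are hypothetical placeholders for a batch of drawn digits:

```python
import torch
import torch.nn as nn

loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)  # learning rate is arbitrary

logits = model(images)           # forward pass: images is an N x 1 x 32 x 32 tensor
loss = loss_fn(logits, labels)   # how wrong the predictions are
optimizer.zero_grad()
loss.backward()                  # backpropagation: compute the gradients
optimizer.step()                 # gradient descent: weights, biases, and
                                 # kernel coefficients are all updated
```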
Now, before continuing, I want to stop and highlight the fact that in our discussion of convolutional networks, we have made many generalizations and overlooked many details. For instance, hyper-parameters
like kernel size, stride and dilation rate,
transposed convolutions and padding, different
types of pooling layers, as well as other parameters
and hyper-parameters in our feature-extracting layers and their effects on the output.
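To give a flavor of these, here is a PyTorch sketch with arbitrary illustrative values:

```python
import torch.nn as nn

conv = nn.Conv2d(1, 6,
                 kernel_size=3,  # size of the kernel
                 stride=2,       # how far the kernel jumps at each step
                 padding=1,      # zeros added around the border
                 dilation=2)     # gaps inserted within the kernel
up = nn.ConvTranspose2d(6, 1, kernel_size=2, stride=2)  # a transposed convolution
avg = nn.AvgPool2d(2)  # average pooling, an alternative to max pooling
```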
Not to mention the impact of all the parameters in our fully connected layers, which we already mentioned
in previous videos. If you want more
information on these topics, check out the sources in the disclaimer and description below, as
they are beyond the scope of this video. Coming back on topic,
throughout this video, we've now seen how and why convolutional networks excel at tasks related to image classification. However, they don't perform as well on tasks such as
natural language processing. This is because these
tasks require memory. But not to fret, there are networks designed specifically for this, which, among others, will be
discussed in the next video in this series. However, this doesn't mean you
have to wait to learn more. If you want to learn
more about deep learning, and I mean really learn about the field, from how these artificial
learning algorithms were inspired by the brain to their foundational building block, the perceptron, scaling
up to multilayer networks, different types of networks, such as convolutional networks
and recurrent networks, and much more, then brilliant.org
is a place for you to go. For instance, this section on convolutional neural networks in the artificial neural networks course goes through many of the concepts we have discussed in this video. Now, what we love about how the topics in these courses are presented is that first an intuitive
explanation is given, and then you're taken
through related problems. If you get a problem wrong, you can see an explanation
for where you went wrong and how to rectify that flaw. In a world where automation
through algorithms will increasingly replace more jobs, it is up to us as individuals
to keep our brains sharp and think of creative solutions to multidisciplinary problems. To support Futurology and
learn more about Brilliant, go to brilliant.org/futurology,
and sign up for free. Additionally, the first 200
people that go to that link will get 20% off their
annual premium subscription. At this point, the video has concluded. We'd like to thank you for
taking the time to watch it. If you enjoyed it, consider supporting us on
Patreon or YouTube membership to keep this brand growing. And if you have any topic suggestions, please leave them in the comments below. Consider subscribing for more content, and check out our website
and parent company EarthOne for more information. This has been Ankur, you've been watching Futurology, and we'll see you again soon. (upbeat music)