ml5.js: What is a Convolutional Neural Network Part 1 - Filters

Captions

Hello and welcome to another Beginner's Guide to Machine Learning with ml5.js video. This is a video. You're watching it. And I am beginning this journey to talk about, and think about, and attempt to explain and implement convolutional neural networks. So this is something that I refer to in the previous video, where I took the pixels of an image and made those the inputs to a neural network to perform classification. And I did this in even earlier videos with pretrained models. And I mentioned that those pretrained models included something called a convolutional layer, but my example didn't include a convolutional layer. So ml5 has a mechanism for adding convolutional layers to your ml5 neural network. But before I look at that mechanism, what I want to do in this video and in the next one is just explain what are the elements of a convolutional neural network, how do they work, and then look at some code examples that actually implement the features of that convolutional layer. I'm not going to build from scratch a full convolutional neural network. Maybe that's some other video series that I'll do someday. We're going to use the fact that the ml5 library just makes that possible for you. In the first part I will just talk about from the zoomed out view, what a convolutional layer is, then I will look at with code, this idea of a filter. In the second part, I'll come back and look at this other aspect of a convolutional layer called pooling. I hope you enjoy this and you find it useful. And I'll see you-- I'll be back in this outfit at the end of the video. Let me start by diagramming what the neural networks looked like with ml5 neural network to date in the videos that I've made. So there's been two layers-- a hidden layer and an output layer-- and then also there's some data coming into the neural network. And in this case, in the previous example, it was an image, which was flattened. So I used the example of 10 by 10 pixels, each with an R, a G, and a B. 
So that made an array of 300 inputs. All these pixel values, those are the inputs. And those go into the hidden layer. But just for the sake of argument, let me simplify this diagram and I'm just going to consider an example with four inputs. I'm going to consider that example as having five hidden nodes-- hidden units. And then let's say, it's a classification problem and there's three possible categories. So when I call the function ml5.neuralNetwork, it creates this architecture behind the scenes and connects every single input to every hidden unit and every hidden unit to each output. [MUSIC PLAYING] So this is what the neural network looks like. Each one of these connections has a weight associated with it. Each unit receives the sum of all of the inputs times the weights passed through an activation function, which then becomes the output, which then all of those with those weights are summed into the next layer, and so on and so forth. So this is what I have worked with before. While in the previous example, I was able to get this kind of architecture to work with image input and get results that produced something in the output, this can be improved upon. There is information in this data that's coming in that is lost when it is flattened to just a single flat array. And the information that's lost is the relative spatial orientation of the pixels. It's meaningful that these colors are near other colors. Something in what we're seeing in the image has to do with the spatial arrangement of the pixels themselves in two dimensions. In order to address that, we want to add into this architecture-- I really spent a lot of time drawing this diagram, which I'm now going to mostly erase-- we want to add something called a convolutional layer. So in this video, I want to explain what are the elements. There are units, nodes, neurons, so to speak, in a convolutional layer, but what are they? 
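The fully connected architecture described above can be sketched in plain JavaScript. This is only an illustration, not how ml5 implements it internally: the helper name `denseLayer`, the fixed weight value of 0.1, and the choice of a sigmoid activation are all assumptions made for the example, and biases are left out because the explanation above does not mention them.

```javascript
// Hypothetical helper: one fully connected layer.
// weights[j][i] is the weight on the connection from input i to unit j.
function denseLayer(inputs, weights, activation) {
  return weights.map((row) => {
    let sum = 0;
    for (let i = 0; i < inputs.length; i++) {
      sum += row[i] * inputs[i]; // input times weight, summed per unit
    }
    return activation(sum);
  });
}

// Assumed activation function; the transcript does not name one.
const sigmoid = (x) => 1 / (1 + Math.exp(-x));

// A 10 by 10 image with R, G, and B per pixel flattens to 300 inputs.
const flattenedLength = 10 * 10 * 3;

// The simplified diagram: 4 inputs, 5 hidden units, 3 outputs.
const inputs = [0.2, 0.4, 0.6, 0.8];
const hiddenWeights = Array.from({ length: 5 }, () => new Array(4).fill(0.1));
const outputWeights = Array.from({ length: 3 }, () => new Array(5).fill(0.1));

const hidden = denseLayer(inputs, hiddenWeights, sigmoid);
const outputs = denseLayer(hidden, outputWeights, sigmoid);

// Every input connects to every hidden unit, and every hidden unit to every output.
const connectionCount = 4 * 5 + 5 * 3;
console.log(outputs, connectionCount);
```

Note that `inputs` is a flat list: nothing in this structure records which pixel was next to which, and that is the information a convolutional layer is meant to keep.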
And the word that's typically used is actually called a filter, which makes a lot of sense. Now, convolutional neural networks can be applied to lots of scenarios besides images and there's a lot of research into different ways that they can be used effectively, but I'm going to stick with the context of working with images because the word "filter" really fits with that. We're filtering an image. How is this layer filtering an image? So the idea of a convolutional layer is not a new concept, and it predates the era that we're in now of so-called deep learning. And if you want to go back and look at the origins of convolutional neural networks, you can find them in this paper called "Gradient-Based Learning Applied to Document Recognition" from 1998. Section two, convolutional neural networks for isolated character recognition. And here, we can see this diagram, which I'm attempting to kind of talk through and create my own version of over here on the whiteboard itself. This is also the original paper associated with the MNIST dataset-- a dataset of handwritten digits that's been used umpteen times in research papers over the years related to machine learning. I know I'm going back and forth a lot here, but let's go back to thinking of the input as a two-dimensional image itself. So this two-dimensional image-- and let's not say it's 10 by 10. Let's use what the MNIST dataset is, which is a 28 by 28 pixel image. And of course now, much higher resolution images are used. And this is what is coming in to the first convolutional layer. This image is being sent to every single one of these filters. A filter is a matrix of numbers. And let's just, for example, let's have a 3 by 3 matrix. Each one of these filters represents nine numbers-- a matrix that's 3 by 3. You could have a 5 by 5 filter and so on and so forth, but a sort of standard size, or a nice example size for us to start with, is 3 by 3.
Each one of these filters is then applied to the image through a convolutional process. This by the way, is not a concept exclusive to machine learning. This idea of a convolutional filter to an image has been part of image processing, and computer science, and computer vision algorithms for a very long time. To demonstrate this, let me actually open up-- I can't believe I'm going to do this, but I'm going to open up Photoshop. So here I am in Photoshop and I've opened this image of a kitten. And there's a menu option called Filter. This word is not filter by accident. There's a connection. So all of these types of operations that you might do-- for example, like blur an image-- these are filters-- convolutions applied to the image. I'm going to go down here under Other and select Custom. All of a sudden, you're going to see here, I have this matrix of numbers. This matrix of numbers in Photoshop is exactly the same thing as this matrix of numbers I'm drawing right here. Each one of these filters in the convolutional layer represents a matrix of numbers that will be applied to the image. So let me actually just put some numbers in here. [MUSIC PLAYING] This particular set of numbers happens to be a filter for finding edges in an image. And you can think of it as these are all weights for a given pixel. So for any given pixel, I want to subtract colors that are to the left of it and emphasize colors that are at that pixel and above and below. This draws out areas of the image where the neighboring pixels are very, very different. Interestingly enough, I could switch these to 0. [MUSIC PLAYING] Switching the filter to have the negative numbers on the top, you can see now I'm still detecting edges, but I'm detecting horizontal edges. If you go back and look at the cat that I had previously versus this one, you can see vertical edges versus horizontal edges. So there are known filters, which draw out certain features of an image. 
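To make the two edge filters concrete, here is a small sketch that applies them to a tiny grayscale image stored as rows of brightness values. The exact numbers typed into Photoshop are not given in the captions, so the kernel values below are an assumption chosen to match the description (negative weights to the left, or on the top, of a positive column or row), and `applyKernel` is a hypothetical helper name.

```javascript
// Tiny grayscale image: dark on the left, bright on the right,
// so it contains one vertical edge and no horizontal edges.
const image = [
  [0, 0, 255, 255],
  [0, 0, 255, 255],
  [0, 0, 255, 255],
  [0, 0, 255, 255],
];

// Assumed values: subtract the left neighbors, keep the pixel's own column.
const verticalEdges = [
  [-1, 1, 0],
  [-1, 1, 0],
  [-1, 1, 0],
];

// Assumed values: the same idea with the negative numbers on the top row.
const horizontalEdges = [
  [-1, -1, -1],
  [1, 1, 1],
  [0, 0, 0],
];

// Weighted sum of the 3 by 3 neighborhood centered on column x, row y.
function applyKernel(img, x, y, kernel) {
  let sum = 0;
  for (let j = -1; j <= 1; j++) {
    for (let i = -1; i <= 1; i++) {
      sum += img[y + j][x + i] * kernel[j + 1][i + 1];
    }
  }
  return sum;
}

const atEdge = applyKernel(image, 2, 1, verticalEdges); // pixel just right of the edge
const flatRegion = applyKernel(image, 1, 1, verticalEdges); // left neighbors match the pixel
const horizontalResponse = applyKernel(image, 2, 1, horizontalEdges);
console.log(atEdge, flatRegion, horizontalResponse);
```

The vertical-edge kernel responds strongly only where a pixel differs from its left neighbors, and the horizontal-edge kernel gives no response at all on this image because no row differs from the row above it.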
And that's exactly what each one of these filters does. If all of the nodes of a neural network can draw out and highlight different aspects of an image, those can be weighted to indicate and classify the image in certain ways. The big difference between a convolutional layer in a neural network and what I'm doing here by hardcoding in sort of known filters is that the neural network is not going to have filters hardcoded into it. It's going to learn filters that do a good job of identifying features in an image. This relates to the idea of weights, I think. So if I go back to my previous diagram, where every single input is connected to each hidden neuron with a weight, now the input image is connected to every single one of these filters. In a way, there are now nine weights for every single one. Instead of learning a single weight, it's going to learn a set of weights for an area of pixels to identify a feature in the image. All of these filters will start with random values, and then the same gradient descent process-- the error backpropagating through the network, adjusting all the dials, adjusting all the weights in these matrices and all of these filters-- works in the same way. So in the ml5 series, I haven't really gone through and looked at the gradient descent learning algorithm to adjust all the weights in detail. I do have another set of videos that do that if you're interested, but the same gradient descent algorithm that is applied to these weights is applied to all of the different values in each one of these filters. Incidentally, just to show a very common convolution operation to blur an image, blurring an image is taking the average of a given pixel and all of its neighbors. So here, you can see if I give the same weight to a 5 by 5 matrix of pixels around a center pixel, and then divide that scale-- let's divide by 25 because there's 25-- that's averaging all of the colors. If I click on Preview, blurred, not blurred, blurred, not blurred.
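The blur described above is a plain average, which a short sketch can show. The 5 by 5 patch of pixel values and the helper name `blurCenter` are made up for illustration; only the equal weights and the divide-by-25 scale come from the explanation.

```javascript
// A 5 by 5 kernel where every neighbor gets the same weight.
const size = 5;
const blurKernel = Array.from({ length: size }, () => new Array(size).fill(1));

// Dividing by the number of weights turns the weighted sum into an average.
const scale = size * size; // 25

// Hypothetical grayscale patch: one bright pixel surrounded by black.
const patch = [
  [0, 0, 0, 0, 0],
  [0, 0, 0, 0, 0],
  [0, 0, 250, 0, 0],
  [0, 0, 0, 0, 0],
  [0, 0, 0, 0, 0],
];

// New value for the center pixel of the patch.
function blurCenter(pixels, kernel, divisor) {
  let sum = 0;
  for (let j = 0; j < kernel.length; j++) {
    for (let i = 0; i < kernel[j].length; i++) {
      sum += pixels[j][i] * kernel[j][i];
    }
  }
  return sum / divisor;
}

const blurred = blurCenter(patch, blurKernel, scale);
console.log(blurred); // the bright pixel is spread out over its 25 neighbors
```

Without the division the sum could reach 25 times the brightest value, so the scale keeps the result in the same range as the original pixels.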
Of course, there are other more sophisticated convolutions, like a Gaussian blur. You can take a look at a Gaussian blur. There's different ways to pronounce it. You can take a look and research what that is, but again, I'm not going down the road to look at common image processing convolutions. Instead, I'm talking about the concept of a convolution as applied to an image in the process of a convolutional neural network. Just to take this a little bit further, I'm going to demonstrate how to code the convolution algorithm in p5.js. In truth, ml5 and TensorFlow.js are going to handle all of the convolution operations for us and creating all the filters. We're just going to configure a convolutional layer from a high level. But I think it's interesting to look at how you might code an image processing algorithm in p5. I have some videos that do things like this previously, but let's look at it in this context. So I took a low resolution 28 by 28 image of a cat. This comes from the Quick Draw dataset, which I've made videos about before and I will also use to see if we can create a doodle classifier as part of this series. And all I want to do is apply a convolution to that image. So first, I'm going to create a variable and I'm going to call it filter. So this is going to be our filter. And I'm going to make it a two-dimensional array. So let me just put all zeros in it to start. So this is the filter. And let's go with that one that looks for edges. The cat image is actually quite low resolution, just 28 by 28 pixels, but I'm drawing it at twice the size. I want to write the code to apply this filter to the image and draw the filtered image to the right. I'm going to create a variable called dim for dimensions and just call this 28. And then I want another variable to store the filtered image. And in setup, I can create that image. This creates a blank image of the same dimensions as the original cat drawing. Then I can write a loop.
And this loop is going to look at every single pixel for all the columns x and all of the rows y. And I wrote int there because I'm half the time programming in Java. But one thing that's important here, if we're going to take this 3 by 3 matrix and apply it to every single pixel of the original image, if we're applying it to that first pixel 0,0, there's no pixel to the left and no pixel above it. It doesn't have all of its neighbors. So there's various ways around this. I'm just going to ignore all the edge pixels. So the loop will go from 1 to dimensions minus 1. Now, there's a lot more work to be done here just to apply this filter to any given pixel. I think a way that might make sense to do this is to actually have a new function. I would call the function filter-- let's just call it convolution. I'm going to write a function called convolution. It receives an image, an x and a y, and a filter, and it returns a new color. So the idea of this function is that it receives all the things it needs. It receives the original image, the filter to apply to it, which particular pixel we want to process, and then will return back to new RGB value after that pixel is processed. And the reason why I'm doing that in a separate function is I need another nested loop to go over the filter. So I need to go from 0 to 3-- 0, 1, 2 columns in the filter, 0, 1, 2 rows in the filter. And it would be getting to be quite a lot if I had four nested loops right in here. Now, I probably shouldn't have some of this hardcoded in here-- the number 3 and that sort of thing-- but you can imagine how you might need to use variables if the filter size is flexible. Now, we have a really sort of like sad fact, which is true about most cases where you're doing image processing with some framework. And in this case, our framework is JavaScript, and canvas, and p5.js. 
And the sad fact is though even though all of this is built-- all of this discussion is built upon the fact that we are retaining the spatial orientation of the pixels. We're thinking of it as a two-dimensional matrix of numbers. The actual data is stored in one array. And so I've gone over this in probably countless videos, but there's a simple formula to look at if I have a given x,y position in a two-dimensional matrix, how do I find the one-dimensional lookup into that matrix, assuming that the pixels were counted by rows-- 0, 1, 2, 3, 4, 5, 6, 7, blah, blah, blah, next row, 28, 29, 30, blah, blah blah. And that formula is let index-- oh, well, I need to do that before this nested loop because right now, I just want the center pixel-- that x,y. Let index equal x plus y times img.width. But there's more, oh! So this is the form. And if you think about it, it makes sense because it's all the x's, and then the offset along the y's is how many rows times the width of the image. But there's another problem, which is that in JavaScript in canvas, for every single pixel in this image, there are actually four numbers being stored-- an R, a G, a B, and an alpha-- the red, green, and blue channels and the alpha channels-- channel, singular. So each pixel takes up four spots. So this index actually needs to say times 4. So guess what? You know it's going to make a lot of sense. I'm going to need this operation a lot. Let's write a function for it. I'll just call it index, and it receives an x, y, and a width, and it returns-- you know what? The width is never going to change in my sketch, so I don't want to be so crazy as to have to pass it around everywhere. So we're just going to pull it from a global variable. Return x plus y times img.width. And that's not img, it's cat.width. OK, so once again, this is terrible what I'm doing, but I'm just saving myself a little bit of heartache here and there. So this index-- ooh, let's call this pixel. 
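The lookup formula can be written out as a small standalone function. This is a sketch under one assumption: the width lives in a constant named `imageWidth` (a made-up name standing in for the global `cat.width` used in the sketch), set to 28 to match the cat image.

```javascript
// Assumed stand-in for the global image width (cat.width in the sketch).
const imageWidth = 28;

// Convert a column x and row y into the position of that pixel's red value
// in a flat array that stores four numbers (R, G, B, alpha) per pixel.
function index(x, y) {
  return (x + y * imageWidth) * 4;
}

const firstPixel = index(0, 0); // start of the array
const secondPixel = index(1, 0); // one pixel over is four slots over
const secondRow = index(0, 1); // one row down skips a full row of pixels
const lastPixel = index(27, 27);

// Green, blue, and alpha sit in the three slots after the red value.
const lastAlpha = lastPixel + 3;
console.log(firstPixel, secondPixel, secondRow, lastPixel, lastAlpha);
```

For a 28 by 28 image the array holds 28 * 28 * 4 = 3136 numbers, so the alpha value of the last pixel lands at position 3135, the final slot.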
Oh, and this should be times 4. This pixel is that function index x,y. Now, I have something I could do to simplify this, but I might as well write the code for if this were a full RGB image. This is a grayscale image, but it has all the channels in it. The thing that I need to do to perform this convolution operation is to take all of the weights-- the numbers that are in the filter matrix-- and I need to multiply each one times the pixel value of all of the neighbors and their corresponding locations, add them all up together, and maybe divide by something if I wanted to sort of, like, average it out. But in this case, I actually don't want to divide by anything. I'm just going to leave the weights are the weights are the weights are the weights. And actually, this right here is irrelevant. I need to do this inside the loop. You'll see in a second. I think it's going to make sense. So I need sum. I'm going to make a sum of all the R values, a sum of all the green values, and a sum of all the blue values. All right, wait a sec, wait a sec, wait a sec. Actually, I think this is going to make more sense. Let's go from negative 1 to 2. You'll see why. I mean, I'll explain why. And negative 1 to 2. Let's do that instead. And maybe it's more clear to say less than or equal to 1. Less than or equal to 1 because-- and let me draw this diagram once again-- if this is pixel 0,0, this is pixel negative 1, negative 1. This is 1,1. This is 1,0. This is 1, negative 1. I guess I'll do them all. So you can see that the neighboring pixels are offset by negative 1 and 1, and negative 1 and 1. So the pixel x value is x plus i. The pixel y value is y plus j. And then the pixel index is call the index function x, which returns the actual index into that array for pixel x and pixel y. And actually, maybe it makes more sense for me to just say that I don't necessarily need separate variables. It might actually be just as clear just to put this right in here. 
So now, I just need to add the red, green, and blue values of this particular pixel to the sum. So sumR plus equal img.pixels at that pixel index. And then G and B. G is the next one, and B, blue, is the next one. And let's add a plus 0 here just to be consistent. So ultimately, what I'm actually returning here is r is sumR, g is sumB, and b is sum-- oh, sorry, g is sumG and b is sumB. So this is the process now of adding up all the pixels. I've gone through every single pixel in a 3 by 3 neighboring area and added up all the reds, greens, and blues, and I'm returning those back. But I'm missing the crucial component, which is as I'm adding all the pixels up in that area, I need to multiply each one by the value in the filter itself. Incidentally, I should also mention that the operation that this really is is the dot product, and in an actual machine learning system, all this would be done with matrix math, but I'm doing it sort of like longhand just to sort of see the process and look at it. What should I call this in the filter, like the factor? Now, I need to look up in the filter, i,j. Only here's the thing-- because I decided to go from negative 1 to 1, negative 1 to 1, the filter doesn't have those index values. It goes 0, 1, 2, 0, 1, 2. So this has to be i plus 1, j plus 1. So it's all six of one, half dozen of the other, whether I go from 0 to 2 there and do the offset in the pixels. But the point is the pixel array, I'm looking actually to the negative and positive to the left and right, but the filter is just a 3 by 3 array starting with 0,0 on the top left. So now, I should be able to multiply by factor. And there we go. I have the full convolution operation. Now, I might have made a mistake here. I think this is right. When I run it, we'll find out if I made a mistake. I'm summing up a 3 by 3 neighborhood of pixels, all multiplied by weights that are in a 3 by 3 filter. Oh, but I actually have to call that function here. 
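Putting the pieces together, the convolution for a single pixel might look like the sketch below. It runs on a plain `Uint8ClampedArray` in place of a p5 image so it can be tried outside a sketch; the 3 by 3 test image and the filter values are assumptions for the example. It also indexes the filter row first, as `filter[j + 1][i + 1]`, so that a filter typed out as rows in the source lines up with the image the way it looks on screen.

```javascript
const width = 3; // assumed tiny image, 3 by 3 pixels

// Position of a pixel's red value in the flat RGBA array.
function index(x, y) {
  return (x + y * width) * 4;
}

// Multiply each neighbor by its filter weight and sum per channel.
function convolution(pixels, x, y, filter) {
  let sumR = 0;
  let sumG = 0;
  let sumB = 0;
  for (let i = -1; i <= 1; i++) {
    for (let j = -1; j <= 1; j++) {
      const pix = index(x + i, y + j);
      // Offsets run -1..1 but the filter array runs 0..2, hence the + 1.
      const factor = filter[j + 1][i + 1];
      sumR += pixels[pix + 0] * factor;
      sumG += pixels[pix + 1] * factor;
      sumB += pixels[pix + 2] * factor;
    }
  }
  return { r: sumR, g: sumG, b: sumB };
}

// Build a grayscale test image whose columns have brightness 0, 100, 200.
const columns = [0, 100, 200];
const pixels = new Uint8ClampedArray(width * width * 4);
for (let y = 0; y < width; y++) {
  for (let x = 0; x < width; x++) {
    const pix = index(x, y);
    pixels[pix + 0] = columns[x];
    pixels[pix + 1] = columns[x];
    pixels[pix + 2] = columns[x];
    pixels[pix + 3] = 255;
  }
}

// Assumed edge filter: subtract the left neighbors from the pixel's own column.
const filter = [
  [-1, 1, 0],
  [-1, 1, 0],
  [-1, 1, 0],
];

const result = convolution(pixels, 1, 1, filter);
console.log(result);
```

For the center pixel each of the three rows contributes 100 - 0, so every channel sums to 300. That is outside the 0 to 255 range, which is fine for the returned sums but gets clamped once written back into a pixel array.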
Now, it should be relatively easy because all of the work was in there. So if I say let I'm just going to call this rgb equal convolution, the cat at the given x and y with the filter, then the new image, which is called filter-- oh. I have to look up. It's OK. No problem. The pixel is index x,y, and then filter-- so I have to look up the one-dimensional location in the new image, and then at .pixels at that pixel is the rgb-- the red value that came back plus 0 plus 1 plus 2, green and blue. And then if all goes according to plan, I should be able to draw the filtered image at offset to the right with the same size. I did miss something kind of important, which is that if I am working with pixels of an image in p5, I need to call loadPixels. So cat.loadPixels filtered.loadPixels. And then I haven't changed the pixels of the original cat image, but since I changed the pixels of the filtered image, afterwards I need to call updatePixels. And now is the moment of truth. [DRUM ROLL] Never good when I press the snare drum button. I'm going to run the sketch. Whoops. All right, well, I've already got an error. [SAD TROMBONE] Cannot read property loadPixels. Oh, filter, filter, filtered. That should be filtered. Also this isn't right-- createCanvas. The size of the canvas is times 10 times 2 times 10. Remember, the image is just 28 by 28. Let's try this again. [DRUM ROLL] [SAD TROMBONE] Well, a little bit better. We didn't get any errors. I don't see an image. Do I need to give it a hardcoded transparency of 255? Yes. [BELL] Oops. So it was fully transparent. So I'm not pulling the transparency over. I could pull it over, but I just know I don't want it to be transparent. Look at that. Look at how it found the-- oh, oh, oh, oh. Look at this. That doesn't look like it's finding the vertical edges-- pixels that are different to the left. It looks like it's finding horizontal edges. 
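The whole pass over the image, including the hardcoded alpha of 255, can be sketched without p5 as follows. In the actual sketch the arrays would be `cat.pixels` and `filtered.pixels` after calling `loadPixels()`, followed by `filtered.updatePixels()`; here a plain `Uint8ClampedArray` stands in for both, and the function name `applyFilter`, the 4 by 4 test image, and the filter values are assumptions.

```javascript
// Position of a pixel's red value in a flat RGBA array of the given width.
function index(x, y, dim) {
  return (x + y * dim) * 4;
}

// Weighted sum of the 3 by 3 neighborhood, per color channel.
function convolution(src, x, y, dim, filter) {
  const sum = [0, 0, 0];
  for (let i = -1; i <= 1; i++) {
    for (let j = -1; j <= 1; j++) {
      const pix = index(x + i, y + j, dim);
      const factor = filter[j + 1][i + 1];
      for (let c = 0; c < 3; c++) {
        sum[c] += src[pix + c] * factor;
      }
    }
  }
  return sum;
}

// Filter every pixel that has all eight neighbors; edge pixels are skipped.
function applyFilter(src, dim, filter) {
  const out = new Uint8ClampedArray(src.length);
  for (let x = 1; x < dim - 1; x++) {
    for (let y = 1; y < dim - 1; y++) {
      const [r, g, b] = convolution(src, x, y, dim, filter);
      const pix = index(x, y, dim);
      out[pix + 0] = r; // values outside 0..255 are clamped by the array type
      out[pix + 1] = g;
      out[pix + 2] = b;
      out[pix + 3] = 255; // without this the result stays fully transparent
    }
  }
  return out;
}

// Assumed 4 by 4 grayscale test image: left half black, right half white.
const dim = 4;
const src = new Uint8ClampedArray(dim * dim * 4);
for (let y = 0; y < dim; y++) {
  for (let x = 0; x < dim; x++) {
    const pix = index(x, y, dim);
    const value = x < 2 ? 0 : 255;
    src[pix + 0] = value;
    src[pix + 1] = value;
    src[pix + 2] = value;
    src[pix + 3] = 255;
  }
}

const edgeFilter = [
  [-1, 1, 0],
  [-1, 1, 0],
  [-1, 1, 0],
];
const out = applyFilter(src, dim, edgeFilter);
console.log(out[index(2, 1, dim)], out[index(1, 1, dim)]);
```

Skipping the border leaves those pixels at zero in every channel, including alpha, so the filtered image has a transparent one-pixel frame. Other strategies exist, such as padding the source image, but ignoring the edges is the choice made in the video.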
Even though I've typed this out in a way that visually, these negative 1's appear in a column, it's actually those correspond not to the j index, but to the i index. So I think one way to fix that would just be to swap it here. And maybe there's like a more elegant way of doing this, but this now, if I run it this way, you'll see, ah, look at those vertical edges. So now, we see how this convolution is applied to the image. The difference in the neural network here-- the convolutional neural network-- is we're not hardcoding in specific filters that we know highlight things in an image. The neural network is going to learn what values for the filters highlight important aspects of the image to help the machine learning task at hand, such as classification. So it might draw out, you know, cats tend to have ears that appear a certain way and this kind of filter, like, brings that out, and then leads to the final layer of the network activating with a high value for that particular classification. So just to keep my example simulating the neural network process a bit more, let's just every time I run it, give it a random filter because that's what the layer would begin with. Just like a neural network begins with random weights and learns the right weights, the filters begin with random values and it learns optimal values. So right here in setup, I'll write a nested loop and give it a random value between negative 1 and 1. In truth, there are other mechanisms and strategies for the initial weights of a convolutional neural network, but picking random numbers will work for us right now just to see. So every time I run it, you can see we get a different resulting image that is filtering the image in a different way. OK, that was a lot and I think it would be good to take a break. So this was the first part of my explanation, a long-winded attempt to answer the question, what is a convolutional neural network? So the first thing to look at is the convolutional layer.
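The random starting filter can be sketched with a nested loop like the one described. In a p5 sketch the value would typically come from `random(-1, 1)`; the version below uses `Math.random()` so it runs on its own, and the function name `randomFilter` is an assumption.

```javascript
// Build a size-by-size filter with every weight picked uniformly from -1 to 1.
function randomFilter(size) {
  const filter = [];
  for (let j = 0; j < size; j++) {
    filter[j] = [];
    for (let i = 0; i < size; i++) {
      // Math.random() gives 0 up to (not including) 1; stretch and shift it.
      filter[j][i] = Math.random() * 2 - 1;
    }
  }
  return filter;
}

const startingFilter = randomFilter(3);
console.log(startingFilter);
```

Each run produces different weights, which is why the filtered image changes every time. In a trained network these starting values would then be adjusted by gradient descent instead of being left as they are.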
It's made up of filters. And so this video attempted to explain that. And I think we could take a break, have a cup of tea, talk to your pet, or friend, or plant, or something, meditate, relax. And then if you want-- if you want, you can come back and in the next video, I'm going to look at the next piece-- the next component of the convolutional layer, an operation called pooling or more specifically, max pooling. And then I'll be able to tie a little ribbon and put a little bow on this explanation about convolutional neural networks and move towards actually implementing one with the ml5 built-in functionality. All right, so maybe I'll see you in the future and have a great rest of your day. Goodbye. [MUSIC PLAYING]

Info

Channel: The Coding Train

Views: 54,156

Keywords: machine learning, cnn, convolutional neural network, ml5.js, ml5, JavaScript, filters, image processing, pixel array, pixels

Id: qPKsVAI_W6M

Length: 28min 3sec (1683 seconds)

Published: Sun Feb 23 2020