Hello and welcome to
another Beginner's Guide to Machine Learning
with ml5.js video. This is a video. You're watching it. And I am beginning this journey
to talk about, and think about, and attempt to
explain and implement convolutional neural networks. So this is something that I
referred to in the previous video, where I took the
pixels of an image and made those the inputs
to a neural network to perform classification. And I did this in even earlier
videos with pretrained models. And I mentioned that those
pretrained models included something called a
convolutional layer, but my example didn't include
a convolutional layer. So ml5 has a mechanism for
adding convolutional layers to your ml5 neural network. But before I look at
that mechanism, what I want to do in this
video and in the next one is just explain what
are the elements of a convolutional
neural network, how do they work, and then
look at some code examples that actually implement the features
of that convolutional layer. I'm not going to
build from scratch a full convolutional
neural network. Maybe that's some other video
series that I'll do someday. We're going to use the fact
that the ml5 library just makes that possible for you. In the first part
I will just talk about from the zoomed out view,
what a convolutional layer is, then I will look at with
code, this idea of a filter. In the second part,
I'll come back and look at this other aspect
of a convolutional layer called pooling. I hope you enjoy this
and you find it useful. And I'll see you-- I'll be back in this outfit
at the end of the video. Let me start by diagramming
what the neural networks looked like with ml5 neural network
to date in the videos that I've made. So there's been two layers-- a hidden layer and
an output layer-- and then also there's some data
coming into the neural network. And in this case, in
the previous example, it was an image,
which was flattened. So I used the example of 10 by
10 pixels, each with an R, a G, and a B. So that made
an array of 300 inputs. All these pixel values,
those are the inputs. And those go into
the hidden layer. But just for the
sake of argument, let me simplify this
diagram and I'm just going to consider an
example with four inputs. I'm going to
consider that example as having five hidden nodes-- hidden units. And then let's say, it's
a classification problem and there's three
possible categories. So when I call the
function ml5.neuralNetwork, it creates this architecture
behind the scenes and connects every single
input to every hidden unit and every hidden
unit to each output. [MUSIC PLAYING] So this is what the
neural network looks like. Each one of these
connections has a weight associated with it. Each unit receives
the sum of all of the inputs times the weights
passed through an activation function, which then becomes the
output, which then all of those with those weights are
summed into the next layer, and so on and so forth. So this is what I have
worked with before. While in the previous
example, I was able to get this
kind of architecture to work with image input and get
results that produced something in the output, this
can be improved upon. There is information
in this data that's coming in that is lost
when it is flattened to just a single flat array. And the information
that's lost is the relative spatial
orientation of the pixels. It's meaningful that these
colors are near other colors. Something in what we're
seeing in the image has to do with the spatial
arrangement of the pixels themselves in two dimensions. In order to address
that, we want to add into this architecture-- I really spent a lot of
time drawing this diagram, which I'm now going
to mostly erase-- we want to add something
called a convolutional layer. So in this video, I want to
explain what are the elements. There are units,
nodes, neurons, so to speak, in a convolutional
layer, but what are they? And the word that's
typically used is actually called a filter,
which makes a lot of sense. Now, convolutional
neural networks can be applied to lots of
scenarios besides images and there's a lot of
research into different ways that they can be
used effectively, but I'm going to stick
with the context of working with images because the word
"filter" really fits with that. We're filtering an image. How is this layer
filtering an image? So the idea of a convolutional
layer is not a new concept, and it predates
the era that we're in now of so-called
deep learning. And if you want to go back
and look at the origins of convolutional
neural networks, you can find them in this paper
called "Gradient-Based Learning Applied to Document
Recognition" from 1998. Section two, convolutional
neural networks for isolated
character recognition. And here, we can see
this diagram, which I'm attempting to
kind of talk through and create my own version of
over here on the whiteboard itself. This is also the
original paper associated with the MNIST dataset-- a dataset of handwritten
digits that's been used umpteen
times in research papers over the years related
to machine learning. I know I'm going back
and forth a lot here, but let's go back to
thinking of the input as a two-dimensional
image itself. So this two-dimensional image-- and let's not say it's 10 by 10. Let's use what the
MNIST dataset is, which is a 28 by 28 pixel image. And of course now, much higher
resolution images are used. And this is what is coming in to
the first convolutional layer. This image is being
sent to every single one of these filters. A filter is a matrix of numbers. And let's just, for example,
let's have a 3 by 3 matrix. Each one of these filters
represents nine numbers-- a matrix that's 3 by 3. You could have a 5 by 5
filter and so on and so forth, but a sort of standard size
or a nice example size for us to start with is 3 by 3. Each one of these filters
is then applied to the image through a convolutional process. This by the way,
is not a concept exclusive to machine learning. This idea of a convolutional
filter to an image has been part of
image processing, and computer science, and
computer vision algorithms for a very long time. To demonstrate this, let
me actually open up-- I can't believe I'm
going to do this, but I'm going to
open up Photoshop. So here I am in
Photoshop and I've opened this image of a kitten. And there's a menu
option called Filter. This word is not
filter by accident. There's a connection. So all of these types of
operations that you might do-- for example, like
blur an image-- these are filters-- convolutions
applied to the image. I'm going to go down here
under Other and select Custom. All of a sudden, you're
going to see here, I have this matrix of numbers. This matrix of
numbers in Photoshop is exactly the same thing
as this matrix of numbers I'm drawing right here. Each one of these filters
in the convolutional layer represents a matrix
of numbers that will be applied to the image. So let me actually just
put some numbers in here. [MUSIC PLAYING] This particular set
of numbers happens to be a filter for
finding edges in an image. And you can think
of it as these are all weights for a given pixel. So for any given pixel,
I want to subtract colors that are to the left
of it and emphasize colors that are at that
pixel and above and below. This draws out
areas of the image where the neighboring pixels
are very, very different. Interestingly enough, I
could switch these to 0. [MUSIC PLAYING] Switching the filter to have
the negative numbers on the top, you can see now I'm
still detecting edges, but I'm detecting
horizontal edges. If you go back and
look at the cat that I had previously
versus this one, you can see vertical edges
versus horizontal edges. So there are known
filters, which draw out certain features of an image. And that's exactly what each
one of these filters does. If all of the nodes
of a neural network can draw out and highlight
different aspects of an image, those can be weighted
to indicate and classify the image in certain ways. The big difference between
a convolutional layer in a neural network
and what I'm doing here by hardcoding in
sort of known filters is that the neural
network is not going to have filters
hardcoded into it. It's going to learn filters that
do a good job of identifying features in an image. This relates to the idea
of weights, I think. So if I go back to
my previous diagram, where every single
input is connected to each hidden
neuron with a weight, now the input image is
connected to every single one of these filters. In a way, there are now nine
weights for every single one. Instead of learning
a single weight, it's going to learn a set of
weights for an area of pixels to identify a
feature in the image. All of these filters will start
with random values, and then the same gradient
descent process-- the error backpropagating
through the network, adjusting all the dials,
adjusting all the weights in these matrices and
all of these filters-- works in the same way. So in the ml5 series,
I haven't really gone through and looked at
the gradient descent learning algorithm to adjust all
the weights in detail. I do have another
set of videos that do that if you're interested,
but the same gradient descent algorithm that is
applied to these weights is applied to all of
the different values in each one of these filters. Incidentally, just to show
a very common convolution operation to blur an
image, blurring an image is taking the average of a given
pixel and all of its neighbors. So here, you can see if I give
the same weight to a 5 by 5 matrix of pixels
around a center pixel, and then divide that
scale-- let's divide by 25 because there's 25-- that's averaging
all of the colors. If I click on Preview,
blurred, not blurred, blurred, not blurred. Of course, there are other more
sophisticated convolutions, like a Gaussian blur. You can take a look
at a Gaussian blur. There's different
ways to pronounce it. You can take a look and
research what that is, but again, I'm not
going down the road to look at common image
processing convolutions. Instead, I'm talking about the
concept of a convolution as applied to an image
in the process of a convolutional
neural network. Just to take this a
little bit further, I'm going to demonstrate
how to code the convolution algorithm in p5.js. In truth, ml5 and
TensorFlow.js are going to handle all of the
convolution operations for us and creating all the filters. We're just going to configure
a convolutional layer from a high level. But I think it's
interesting to look at how you might code an image
processing algorithm in p5. I have some videos that do
things like this previously, but let's look at
it in this context. So I took a low resolution
28 by 28 image of a cat. This comes from the Quick Draw
dataset, which I've made videos about before and I
will also use to see if we can create a
doodle classifier as part of this series. And all I want to do is apply
a convolution to that image. So first, I'm going
to create a variable and I'm going to call it filter. So this is going
to be our filter. And I'm going to make it
a two-dimensional array. So let me just put all
zeros in it to start. So this is the filter. And let's go with that
one that looks for edges. The cat image is actually
quite low resolution, just 28 by 28 pixels, but I'm
drawing it at twice the size. I want to write the code to
apply this filter to the image and draw the filtered
image to the right. I'm going to create a variable
called dim for dimensions and just call this 28. And then I want another variable
to store the filtered image. And in setup, I can
create that image. This creates a blank image
of the same dimensions as the original cat drawing. Then I can write a loop. And this loop is going to
look at every single pixel for all the columns x
and all of the rows y. And I wrote int there
because I'm half the time programming in Java. But one thing that's
important here, if we're going to take
this 3 by 3 matrix and apply it to every single
pixel of the original image, if we're applying it to
that first pixel 0,0, there's no pixel to the
left and no pixel above it. It doesn't have all
of its neighbors. So there's various
ways around this. I'm just going to ignore
all the edge pixels. So the loop will go from
1 to dimensions minus 1. Now, there's a lot more work
to be done here just to apply this filter to any given pixel. I think a way that
might make sense to do this is to actually
have a new function. I would call the
function filter-- let's just call it convolution. I'm going to write a
function called convolution. It receives an image, an
x and a y, and a filter, and it returns a new color. So the idea of this
function is that it receives all the things it needs. It receives the original image,
the filter to apply to it, which particular pixel
we want to process, and then it will return
back a new RGB value after that pixel is processed. And the reason why I'm doing
that in a separate function is I need another nested
loop to go over the filter. So I need to go from 0 to 3-- 0, 1, 2 columns in the filter,
0, 1, 2 rows in the filter. And it would be getting
to be quite a lot if I had four nested
loops right in here. Now, I probably
shouldn't have some of this hardcoded in
here-- the number 3 and that sort of
thing-- but you can imagine how you might
need to use variables if the filter size is flexible. Now, we have a really sort
of like sad fact, which is true about most cases
where you're doing image processing with some framework. And in this case, our framework
is JavaScript, and canvas, and p5.js. And the sad fact is that even
though all of this is built-- all of this discussion
is built upon the fact that we are retaining
the spatial orientation of the pixels, thinking of it as
a two-dimensional matrix of numbers-- the actual data is
stored in one array. And so I've gone over this
in probably countless videos, but there's a simple formula to
look at if I have a given x,y position in a
two-dimensional matrix, how do I find the
one-dimensional lookup into that matrix, assuming
that the pixels were counted by rows-- 0, 1, 2, 3, 4, 5, 6, 7, blah,
blah, blah, next row, 28, 29, 30, blah, blah blah. And that formula is let index-- oh, well, I need to do that
before this nested loop because right now, I just want
the center pixel-- that x,y. Let index equal x plus
y times img.width. But there's more, oh! So this is the formula. And if you think about
it, it makes sense because it's all the
x's, and then the offset along the
y's is how many rows times the width of the image. But there's another
problem, which is that in JavaScript in
canvas, for every single pixel in this image,
there are actually four numbers being stored-- an R, a G, a B, and an alpha-- the red, green,
and blue channels and the alpha channels-- channel, singular. So each pixel takes
up four spots. So this index actually
needs to say times 4. So guess what? You know it's going to
make a lot of sense. I'm going to need
this operation a lot. Let's write a function for it. I'll just call it index, and it
receives an x, y, and a width, and it returns-- you know what? The width is never going
to change in my sketch, so I don't want to be
so crazy as to have to pass it around everywhere. So we're just going to pull
it from a global variable. Return x plus y times img.width. And that's not img,
it's cat.width. OK, so once again, this is
terrible what I'm doing, but I'm just saving myself
a little bit of heartache here and there. So this index-- ooh,
let's call this pixel. Oh, and this should be times 4. This pixel is that
function index x,y. Now, I have something I
could do to simplify this, but I might as well write
the code for if this were a full RGB image. This is a grayscale image, but
it has all the channels in it. The thing that I need to do
to perform this convolution operation is to take
all of the weights-- the numbers that are
in the filter matrix-- and I need to multiply each one
times the pixel value of all of the neighbors and their
corresponding locations, add them all up together,
and maybe divide by something if I wanted to sort of,
like, average it out. But in this case, I actually
don't want to divide by anything. I'm just going to leave the
weights are the weights are the weights are the weights. And actually, this right
here is irrelevant. I need to do this
inside the loop. You'll see in a second. I think it's going
to make sense. So I need sum. I'm going to make a sum
of all the R values, a sum of all the green
values, and a sum of all the blue values. All right, wait a sec,
wait a sec, wait a sec. Actually, I think this is
going to make more sense. Let's go from negative 1 to 2. You'll see why. I mean, I'll explain why. And negative 1 to 2. Let's do that instead. And maybe it's more clear to
say less than or equal to 1. Less than or equal
to 1 because-- and let me draw this
diagram once again-- if this is pixel 0,0, this is
pixel negative 1, negative 1. This is 1,1. This is 1,0. This is 1, negative 1. I guess I'll do them all. So you can see that
the neighboring pixels are offset by negative
1 and 1, and negative 1 and 1. So the pixel x
value is x plus i. The pixel y value is y plus j. And then the pixel index
comes from calling the index function, which returns the actual
index into that array for pixel x and pixel y. And actually, maybe it
makes more sense for me to just say that I
don't necessarily need separate variables. It might actually be
just as clear just to put this right in here. So now, I just need to add the
red, green, and blue values of this particular
pixel to the sum. So sumR plus equal img.pixels
at that pixel index. And then G and B.
G is the next one, and B, blue, is the next one. And let's add a plus 0
here just to be consistent. So ultimately, what I'm actually
returning here is r is sumR, g is sumB, and b is sum-- oh, sorry, g is
sumG and b is sumB. So this is the process now
of adding up all the pixels. I've gone through every
single pixel in a 3 by 3 neighboring
area and added up all the reds, greens, and blues,
and I'm returning those back. But I'm missing the
crucial component, which is as I'm adding all the
pixels up in that area, I need to multiply each one by
the value in the filter itself. Incidentally, I
should also mention that the operation that this
really is is the dot product, and in an actual
machine learning system, all this would be
done with matrix math, but I'm doing it sort
of like longhand just to sort of see the
process and look at it. What should I call this in
the filter, like the factor? Now, I need to look
up in the filter, i,j. Only here's the thing-- because I decided to go
from negative 1 to 1, negative 1 to 1,
the filter doesn't have those index values. It goes 0, 1, 2, 0, 1, 2. So this has to be
i plus 1, j plus 1. So it's all six of one,
half dozen of the other, whether I go from
0 to 2 there and do the offset in the pixels. But the point is
the pixel array, I'm looking actually to
the negative and positive to the left and right,
but the filter is just a 3 by 3 array starting with
0,0 on the top left. So now, I should be able
to multiply by factor. And there we go. I have the full
convolution operation. Now, I might have
made a mistake here. I think this is right. When I run it, we'll find
out if I made a mistake. I'm summing up a 3 by 3
neighborhood of pixels, all multiplied by weights
that are in a 3 by 3 filter. Oh, but I actually have to
call that function here. Now, it should be relatively
easy because all of the work was in there. So if I say let-- I'm just
going to call this rgb-- equal convolution, the
cat at the given x and y with the filter, then the new
image, which is called filter-- oh. I have to look up. It's OK. No problem. The pixel is index
x,y, and then filter-- so I have to look up the
one-dimensional location in the new image, and then
its .pixels at that pixel is the rgb-- the red value that
came back at plus 0, and at plus 1 and plus 2, green and blue. And then if all goes
according to plan, I should be able to
draw the filtered image at offset to the right
with the same size. I did miss something
kind of important, which is that if I am working
with pixels of an image in p5, I need to call loadPixels. So cat.loadPixels
filtered.loadPixels. And then I haven't changed
the pixels of the original cat image, but since I changed the
pixels of the filtered image, afterwards I need to
call updatePixels. And now is the moment of truth. [DRUM ROLL] Never good when I press
the snare drum button. I'm going to run the sketch. Whoops. All right, well, I've
already got an error. [SAD TROMBONE] Cannot read property loadPixels. Oh, filter, filter, filtered. That should be filtered. Also this isn't
right-- createCanvas. The size of the canvas is
times 10 times 2 times 10. Remember, the image
is just 28 by 28. Let's try this again. [DRUM ROLL] [SAD TROMBONE] Well, a little bit better. We didn't get any errors. I don't see an image. Do I need to give it a
hardcoded transparency of 255? Yes. [BELL] Oops. So it was fully transparent. So I'm not pulling
the transparency over. I could pull it
over, but I just know I don't want it
to be transparent. Look at that. Look at how it found the-- oh, oh, oh, oh. Look at this. That doesn't look like it's
finding the vertical edges-- pixels that are
different to the left. It looks like it's
finding horizontal edges. Even though I've typed
this out in a way that visually, these negative
1's appear in a column, it's actually those
correspond not to the j index, but to the i index. So I think one way to fix that
would just be to swap it here. And maybe there's like a more
elegant way of doing this, but this now, if
I run it this way, you'll see, ah, look at
those vertical edges. So now, we see how
this convolution is applied to the image. The difference in the
neural network here-- the convolutional
neural network-- is we're not hardcoding
in specific filters that we know highlight
things in an image. The neural network
is going to learn what values for the
filters highlight important aspects of the image
to help the machine learning task at hand, such
as classification. So it might draw
out, you know, cats tend to have ears that appear
a certain way and this kind of filter, like, brings
that out, and then leads to the final layer of
the network activating with a high value for that
particular classification. So just to keep my example
simulating the neural network process a bit more, let's
just every time I run it, give it a random filter
because that's what the layer would begin with. Just like a neural network
begins with random weights and learns the right
weights, the filters begin with random values and
it learns optimal values. So right here in setup,
I'll write a nested loop and give it a random value
between negative 1 and 1. In truth, there are other
mechanisms and strategies for the initial weights of a
convolutional neural network, but picking random
numbers will work for us right now just to see. So every time I
run it, you can see we get a different resulting
image that is filtering the image in a different way. OK, that was a
lot and I think it would be good to take a break. So this was the first
part of my explanation, a long-winded attempt to
answer the question, what is a convolutional neural network? So the first thing to look at
is the convolutional layer. It's made up of filters. And so this video
attempted to explain that. And I think we could take
a break, have a cup of tea, talk to your pet, or friend, or
plant, or something, meditate, relax. And then if you want-- if you want, you can come
back and in the next video, I'm going to look
at the next piece-- the next component of
the convolutional layer, an operation called pooling or
more specifically, max pooling. And then I'll be able
to tie a little ribbon and put a little bow
on this explanation about convolutional
neural networks and move towards
actually implementing one with the ml5 built-in
functionality. All right, so maybe I'll
see you in the future and have a great
rest of your day. Goodbye. [MUSIC PLAYING]
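
For reference, the convolution walked through in this video can be sketched in plain JavaScript roughly as follows. This is a minimal sketch under a few assumptions, not the exact sketch from the video: `pixels` is a flat RGBA array like the one p5.js exposes as `img.pixels`, the names `pixelIndex`, `convolve`, `applyFilter`, and `randomFilter` are illustrative, and the image width is passed in as a parameter rather than read from a global. The filter is looked up as `filter[j + 1][i + 1]`, so the rows typed in the array line up with the y offset, which is the swap made near the end of the video.

```javascript
// A 3x3 filter that responds to vertical edges: subtract the pixel
// to the left and keep the pixel itself.
const edgeFilter = [
  [-1, 1, 0],
  [-1, 1, 0],
  [-1, 1, 0],
];

// Flat-array position of the red channel of pixel (x, y).
function pixelIndex(x, y, width) {
  return (x + y * width) * 4;
}

// Weighted sum of the 3x3 neighborhood around (x, y), one sum per channel.
function convolve(pixels, width, x, y, filter) {
  let sumR = 0;
  let sumG = 0;
  let sumB = 0;
  for (let i = -1; i <= 1; i++) {
    for (let j = -1; j <= 1; j++) {
      const pix = pixelIndex(x + i, y + j, width);
      // Offsets run from -1 to 1, filter indices from 0 to 2.
      // Row index is the y offset, column index is the x offset.
      const factor = filter[j + 1][i + 1];
      sumR += pixels[pix + 0] * factor;
      sumG += pixels[pix + 1] * factor;
      sumB += pixels[pix + 2] * factor;
    }
  }
  return { r: sumR, g: sumG, b: sumB };
}

// Apply the filter to every pixel except the one-pixel border,
// where some neighbors are missing. Border pixels stay fully transparent.
function applyFilter(pixels, width, height, filter) {
  // Uint8ClampedArray clamps each written value to the 0..255 range.
  const out = new Uint8ClampedArray(pixels.length);
  for (let x = 1; x < width - 1; x++) {
    for (let y = 1; y < height - 1; y++) {
      const rgb = convolve(pixels, width, x, y, filter);
      const pix = pixelIndex(x, y, width);
      out[pix + 0] = rgb.r;
      out[pix + 1] = rgb.g;
      out[pix + 2] = rgb.b;
      out[pix + 3] = 255; // hardcoded alpha so the result is visible
    }
  }
  return out;
}

// A filter with random values between -1 and 1, standing in for
// the untrained starting point of a convolutional layer.
function randomFilter() {
  return Array.from({ length: 3 }, () =>
    Array.from({ length: 3 }, () => Math.random() * 2 - 1)
  );
}
```

In a p5.js sketch, the input would come from `cat.pixels` after calling `cat.loadPixels()`, and the result would be copied into `filtered.pixels` before calling `filtered.updatePixels()`.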