CNN: Convolutional Neural Networks Explained - Computerphile

Captions
This is kind of a follow-up to Brais' videos on deep learning. Deep learning is a big thing at the moment, and there's some disagreement between researchers over whether this is it, the big thing that's going to change everything, or whether it's another flash in the pan, like artificial neural networks were in the 80s: everyone got very excited, they got quite good results, and then they realized they couldn't solve all the problems with them. For what it's worth, I think these are a big deal.

[offscreen] Let's talk about convoluted neural networks. Have I said that right?

Convolutional neural networks.

[offscreen] Ah, right, ok.

They combine deep neural networks, which is what Brais was talking about, and kernel convolutions, which is what I talked about in a previous video. I would thoroughly recommend people watch that video. You know, it's got an entertaining host, right? *laugh from offscreen* Because if you don't know what a kernel convolution is, this isn't going to make much sense to you, so watch that video first.

[offscreen] So those are the kernel convolutions we did on graphics and things: Sobel op-

Yeah, Sobel operators, Gaussian blurs, and things like this. Sobel operators in particular, and edge detection.

So, think back to a traditional artificial neural network. We've got some kind of input we're trying to learn from, we've got some hidden layers, and then we've got some output layer, maybe just one node. And these are fully connected: there are connections from every node in one layer to every node in the next; I'm only drawing in a few of them.

Using Brais' analogy, we were talking about house prices. So the inputs would be something like the number of bedrooms, whether it's got a pool, what the floor space is, whether it's got a good garden, and so on; then lots of inner nodes that we don't particularly care about; and finally, at the end, a house price.

Now, that house price is a complicated function of these inputs. It's complicated because each hidden node is some linear, or rather non-linear, combination of the inputs: a bit of this, plus a bit of this, plus a bit of this, passed through some non-linearity function. The next node is a different combination of the inputs, and so on, and the output is in turn a different combination of those. So you can see you're building up a level of abstraction, where you've got combinations of combinations, and that overall function is very complicated.

When Brais talked about a black box, in some ways that's exactly what it is, because we can't look at an individual weight and say "well, that's got 0.2 of this one, so that must mean this". In the grand scheme of the whole network, we don't know what that individual weight means, and to be honest, we might not even care that much. What we really care about is how well it predicts house price: change the input, read the output, is it good? Yes? Brilliant!

Now, for images, which is obviously what I spend most of my time around, this is a start, but it's not very useful to me.
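As a rough illustration of "a bit of this plus a bit of this, through a non-linearity", here is a minimal sketch of that fully connected network in Python with NumPy. The layer sizes, the random weights, and the choice of ReLU are all illustrative assumptions; a real network learns its weights rather than having them written down.

```python
import numpy as np

def relu(x):
    # A common non-linearity: keep positive values, zero out negatives
    return np.maximum(0, x)

# Hypothetical house inputs: bedrooms, has_pool, floor_space, garden (all made up)
x = np.array([3.0, 1.0, 0.8, 0.5])

# Made-up weights for a 4-input -> 3-hidden -> 1-output network;
# in practice these values are learned, not chosen by hand
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(3, 4)), np.zeros(3)
W2, b2 = rng.normal(size=(1, 3)), np.zeros(1)

hidden = relu(W1 @ x + b1)  # each hidden node: a weighted mix of all inputs
price = W2 @ hidden + b2    # the output: a combination of combinations
print(price)
```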
If you imagine that these are our inputs, and I give you a picture of a house and say "right, tell me how much this house is worth", well, what do you do? There are two things I could do. First, I could try to calculate things like the number of bedrooms based on the image: I'd be calculating some features and then putting them in here and learning on those features. That is quite a smart way of doing it, apart from being obviously quite difficult, because we don't need many more neurons. If anything, we can use the same network we used before for our house model; all we have to do is work out the bit of code that does the image analysis. But anyone who's tried to find out the number of rooms in a house from only one picture of the outside will tell you that it can't be done. That's hard.

So you could naively think: what we could do instead is just put the image in here. Make this the first pixel, this the second pixel, this the third pixel, and so on. Then the network has all the information it could ever need. But that's the problem: a 7-megapixel image is 7 million input nodes. Say we have 7 million nodes on the next layer, and each one connects to each other one. You can see that's just going to melt my computer; it's not even going to try to create it, it's too much information.

That's why we downsample our space a little bit. What we would usually do is calculate some small subset of features and then put those in at this end. That's quite important: traditional machine learning is done a bit like that. Michel's done some videos on this: calculate some features about someone's face and put those into some machine learning algorithm. What you don't do is run the machine learning algorithm directly on the face, because there's too much information there. Until now, that is. That's where convolutional neural networks step in.

Convolutional neural networks replace each of these nodes with a kernel convolution, like a Sobel edge detector. So instead of what I would have done before, which was run a Sobel over something and then machine-learn on that, I give the network the opportunity to learn which features are interesting. Maybe it is an edge detector, maybe it's a corner detector, maybe it's something that highlights whatever's in the middle of the picture, or the top left-hand corner. It doesn't really matter, and the point is: I don't know what they are. If I give you two thousand pictures of houses and ask you to predict house prices from the pictures, I don't know for sure what matters. I can guess; it might be how many windows they have, and things like this, but I don't know for sure. A computer can brute-force through those possibilities much quicker than I can and tell me. Then I can both predict the price and look back and say "oh, it was windows after all".

So, let's imagine we have our image. I'm going to move away from the house analogy now, because I'd have to draw a lot of pictures of houses. Let's talk about how a CNN works, and why it's useful. We have an image of something. Now, I have seen convolutional neural networks used for non-images, but for now we'll just talk about images. This is a picture of, let's say, me.
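To make the "melt my computer" point concrete, here is the back-of-the-envelope arithmetic. The 7-million figures come from the transcript; the 64-filter, 5x5, 3-channel numbers anticipate the convolutional layer described below:

```python
# Fully connecting a 7-megapixel image to a 7-million-node hidden layer
inputs = 7_000_000
hidden = 7_000_000
fully_connected_weights = inputs * hidden
print(f"{fully_connected_weights:.1e}")  # 4.9e+13 weights -- hopeless

# A convolutional layer shares one small kernel across the whole image,
# so its parameter count doesn't depend on the image size at all.
# For example, 64 filters of 5x5 over 3 input channels (ignoring biases):
conv_weights = 64 * 5 * 5 * 3
print(conv_weights)  # 4800 weights
```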
It's not a great likeness, but I'll stick by it. Now, there are three channels here, so this is actually a 3D volume in some sense. Remember when we talked about 3D images: you can view RGB as, in some sense, 3D. The planes are our R, G, and B, in some order.

If we performed a Sobel edge detection on this, it would produce another image, slightly smaller than this one and only one deep. Hypothetically, it would be another image where the edges, let's say the horizontal edges, were highlighted. So it would look something like that: some half of my face where the horizontal edges are highlighted. It's not a great diagram. But there would only be one output channel, because Sobel just outputs a number between 0 and 255, once you scale it.

Now, the problem is that I don't know that Sobel is the best thing for this task. It might be: it might be useful to detect edges on houses to work out their prices, or to detect the size of a face, and that kind of makes sense. On the other hand, it's going to produce a lot of erroneous bits. If I were sitting in front of a tree, there'd be loads of edge stuff going on there that I don't care about.

In a convolutional neural network, what we do is, let's say, 60 of these on the first layer. So we have one, and behind it another one, and behind it another one, and so on, going this way. The first one will be some convolution applied to this whole image that takes three input channels and outputs one output channel. The next one will be a different kernel convolution operation: each of these has a different kernel, and those kernels are our weights, these values here, in our analogy back to normal learning. Let's say we have 60 of those, or 64. One of them might be detecting edges, one of them might be detecting corners, and then we use them as our features for learning.

Now that's a start, but this is deep learning, right, so what do we do now? We compute more features based on these features. We find combinations of corners and combinations of edges that make something interesting. My face is not just a circle of edges; it's a number of corners and edges and bits of texture, all in a specific shape that is unique to a human face, and even unique to me, because we're capable of distinguishing between different people.

So this kernel window will go down to this pixel here; it slides about the image and produces this output image, and then the next one does the same, and the next one does the same. Then we do the same thing on this output; let me do it in a different pen so we can see better. Here's my red kernel convolution, and this slides about and produces another image, which is some combination of, maybe, corners and edges. At this second level it's not going to be too abstract, but you get the idea. There'll be some sort of shape that won't make much sense to us, but it'll make some sense to this machine.
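Here is a minimal sketch of that first layer in PyTorch. The 3-in/64-out and 5x5 kernel numbers follow the transcript; the 224x224 input size is an assumption for the example. Each of the 64 filters is a small learned kernel that slides over all three colour channels and produces one output channel:

```python
import torch
import torch.nn as nn

# A fake 3-channel image: batch of 1, RGB, 224x224 (size chosen arbitrarily)
image = torch.randn(1, 3, 224, 224)

# 64 learnable 5x5 kernels, each reading all 3 input channels and
# writing 1 output channel -- 64 "Sobel-like" filters, but learned
conv1 = nn.Conv2d(in_channels=3, out_channels=64, kernel_size=5)

features = conv1(image)
print(features.shape)  # torch.Size([1, 64, 220, 220])
# Note the spatial shrink: a 5x5 kernel can't reach the outer 2 pixels,
# so 224 -> 220 on each side, exactly as described in the video.
```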
And there'll be another set of these, lots of them, going back. All of them will look different and be some different representation of my face, transformed in some way to be useful. And again, I haven't picked these; they've been learned, just like in a normal deep learning algorithm. I haven't had to say "I definitely think edges are important for this", because I don't know that for sure.

So this goes on, and we keep doing this, and sometimes we also downsample the size of these images, just to save memory, but we won't dwell on that too much. Because of the way we downsample, and because these convolution operations slightly shrink the image (they don't go all the way to the edge: with a 5-by-5 kernel you can't go within 2 pixels of the edge, because you'd be going off it, so the output just gets slightly smaller), we end up, in the end, with a much smaller image and lots of features going all the way back. These are my convolutions of convolutions, of convolutions, of convolutions. Each one will look different and represent something different, and we don't know what that is. This one could be highlighted when there's a face in the middle, and dark when there isn't. This one might be highlighted when there's an ear at a certain position, and so on.

Eventually, these get down to being just one pixel, and very, very long. Essentially what we've done is completely remove the spatial dimension. There's no spatial information left; we don't know where anything is, but we know what it is, because it's listed in all these features. These now are our neurons at the end. We have a couple more layers that point to these, and then finally one at the end that says "is this a picture of Mike's face?" It produces a 1 if it is, and a 0 if it isn't.

And then, just like a normal network, we train it. We say "here's a picture of me, so this should be a 1", and let's say it outputs 0.5, because it starts off kind of random. So we adjust these weights, and we adjust the weights inside all these kernel convolutions.

[offscreen] So does that adjustment happen manually?

No, it's coded in, usually performed by a library, using a process called back-propagation. What we do is we basically work out what direction we have to move the weights in to improve our output, and then we move them slightly in that direction. And we have to do it in reverse order, because these layers depend on the ones before them. We say: given that the network said a 0.5 chance of Mike and we want a 1, how do I change these weights here to get slightly closer to 1? And I do it. Then I say "how do I change these again to do even better?", and so on, working my way back. That kind of maths we're not going to go into.

A lot of these things are implemented in libraries. As a researcher, much as I'd like to implement some of these things myself, it takes quite a long time, because programming takes a while, and it's better for me to apply these things and get good results than to reinvent the wheel constantly. If everyone were programming the same things over and over again, no one would get anything done. I'd have to start by programming up Linux; I'm not claiming I can, by the way.
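A minimal sketch of that training step in PyTorch, to show what the library does for you. The model, the optimizer choice, the learning rate, and the image size are all placeholder assumptions; `model` stands in for whatever stack of convolutions and final layers you've built:

```python
import torch
import torch.nn as nn

# Placeholder model: in reality this would be the conv stack described above
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, 1), nn.Sigmoid())
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.BCELoss()  # suits a 0/1 "is this Mike?" output

image = torch.randn(1, 3, 64, 64)  # a stand-in "picture of Mike"
target = torch.ones(1, 1)          # it's Mike, so the answer should be 1

prediction = model(image)          # maybe ~0.5 at first: the weights are random
loss = loss_fn(prediction, target)
loss.backward()                    # back-propagation: gradients flow in reverse order
optimizer.step()                   # nudge every weight slightly toward a better answer
optimizer.zero_grad()
```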
So, you know, let's not reinvent the wheel. I do this: I send in, let's say, 1000 pictures, 500 of which are me (so I've been to a photo shoot or something) and 500 of which are not. And I train it so that the convolutions, and these weights on the output, are such that it gives 1 when it's a picture of me and 0 when it isn't. And then I can look at those convolutions and ask "what is it about me that's distinctive?" It's probably going to be finding weird shapes on my face, because it's a bit of a weird shape; things that are unique to me.

Now, in a more general situation, there's a big database called ImageNet, and they have a competition every year to see who can classify its images best. Dogs, cats, planes, trees, and so on: they're all in there, and there are a thousand or so images of each. So we take a really big network, much bigger than this little one I drew, and we throw millions of images at it, thousands of cats, thousands of dogs, with lots more outputs than just the one, and we ask "what is it?", and it says "it's a dog", and it is. *dog bark*

Convolutional neural networks have been around for a little while, but they really started to be big in about 2012, when someone came along, applied one of these to ImageNet, and got incredible results. And now there's this big push and everyone's trying to get even better results.

Now, I work on more of the applied end of computer science, so I'm more interested in how this affects plant science and things like that; that's what we're working on. And the kind of results we're seeing are really, really impressive. Case in point: I've done some root tip detection, detecting root tips in images of plants. I've got some software that I'd already programmed, a low-level feature detector approach to detecting root tips, and it's about 70% accurate, which is what you would expect, because maybe some root hair gets confused for a root tip, or a blotch of dirt, or maybe there are two root tips really close together and it gets confused. The CNN that I trained is 98% accurate, and it finds them with 99% accuracy. It doesn't make many mistakes. And that's over thousands of images.

[offscreen] So does that mean the work you've done already just goes out the window?

Yep. Uh, no: to an extent yes, and to an extent no. You need expertise to be able to craft a network, train it, and prepare the images. And there's obviously work to be done, and there's some disagreement over how much of a problem you can solve with a convolutional neural network. There are lots more things you can do with roots beyond finding tips. Can you do all of them with a convolutional neural network? I don't know; we are trying, but we'll see. Maybe not. So maybe you use this as a tool, just like other machine learning algorithms, within a package that does lots of other things as well. On the other hand, if you're just doing cat and dog detection, you might as well use a CNN, because it's going to do better than anything else.

[preview of next video] The other way the botnet can use its parts is for distributed computing... [fades out]

[fades in] Now, some objects obviously are more amenable to this than others, but the more images we get, the better it is. There's no depth involved here at all, ok.
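Pulling the whole pipeline together, here is a minimal sketch of the kind of network described in this video: stacks of learned convolutions, occasional downsampling, the spatial dimensions squeezed away to one pixel, then a couple of ordinary fully connected layers and a single "is it Mike?" output. Every layer size here is an illustrative assumption, not taken from any real system:

```python
import torch
import torch.nn as nn

# Illustrative layer sizes throughout; a real network would be tuned and trained
mike_detector = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=5),    # 64 learned "Sobel-like" filters over RGB
    nn.ReLU(),
    nn.MaxPool2d(2),                    # downsample to save memory
    nn.Conv2d(64, 128, kernel_size=5),  # combinations of edges and corners
    nn.ReLU(),
    nn.MaxPool2d(2),
    nn.AdaptiveAvgPool2d(1),            # squeeze space to 1x1: "what", not "where"
    nn.Flatten(),                       # a long feature vector, no spatial info left
    nn.Linear(128, 64),                 # a couple more ordinary layers
    nn.ReLU(),
    nn.Linear(64, 1),
    nn.Sigmoid(),                       # 1 = "picture of Mike", 0 = not
)

print(mike_detector(torch.randn(1, 3, 128, 128)).shape)  # torch.Size([1, 1])
```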
Info
Channel: Computerphile
Views: 705,760
Rating: 4.95051 out of 5
Keywords: computers, computerphile, computer, science, CNN, Convolutional Neural Network, Computer Vision, Artificial Intelligence, Dr Mike Pound, University of Nottingham, kernel convolution, image net, imagenet, Computer Science
Id: py5byOOHZM8
Length: 14min 16sec (856 seconds)
Published: Fri May 20 2016