How Convolutional Neural Networks work

Video Statistics and Information

Captions
Hello, and welcome to How Convolutional Neural Networks Work. Convolutional neural networks, or ConvNets or CNNs, can do some pretty cool things. If you feed them a bunch of pictures of faces, for instance, they learn some basic things like edges and dots, bright spots and dark spots, and because they are a multi-layer neural network, that's what gets learned in the first layer. In the second layer are things that are recognizable as eyes, noses, and mouths, and in the third layer are things that look like faces. Similarly, if you feed one a bunch of images of cars, at the lowest layer you'll again get things that look like edges, higher up things that look like tires and wheel wells and hoods, and at a level above that, things that are clearly identifiable as cars.

CNNs can even learn to play video games by forming patterns of the pixels as they appear on the screen and learning what the best action to take is when a certain pattern appears. In some cases, a CNN can learn to play video games far better than a human ever could. Not only that: if you take a couple of CNNs and set them to watching YouTube videos, one can learn objects, again by picking out patterns, and the other can learn types of grasps. Coupled with some other execution software, this can let a robot learn to cook just by watching YouTube. So there's no doubt CNNs are powerful. Usually when we talk about them, we do so the same way we might talk about magic, but they're not magic. What they do is based on some pretty basic ideas applied in a clever way.

To illustrate these ideas, we'll talk about a very simple toy convolutional neural network. What this one does is take in an image, a two-dimensional array of pixels. You can think of it as a checkerboard, where each square is either light or dark. By looking at that, the CNN decides whether it's a picture of an X or of an O. So, for instance, on top we see an image with an X drawn in white pixels on a black background, which we would like to identify as an X, and below it an O, which we would like to identify as an O.

How a CNN does this has several steps. What makes it tricky is that the X is not exactly the same every time: the X or the O can be shifted, it can be bigger or smaller, it can be rotated a little bit, thicker or thinner, and in every case we would still like to identify whether it's an X or an O. The reason this is challenging is that for us, deciding whether two things are similar is straightforward; we don't even have to think about it. For a computer, it's very hard. What a computer sees in this checkerboard, this two-dimensional array, is a bunch of numbers: ones and minus ones. A one is a bright pixel, a minus one is a dark pixel. What it can do is go through pixel by pixel and compare whether they match or not. To a computer, it looks like there are a lot of pixels that match, but some that don't, quite a few that don't, actually, and so it might look at this and say, "I'm really not sure whether these are the same." Because a computer is so literal, it would say "uncertain": it can't say that they're equal.
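To make this concrete, here is a minimal sketch of that literal comparison, assuming the images are stored as NumPy arrays of +1 and -1 values; the tiny 5-by-5 X is illustrative, not the video's exact 9-by-9 image:

```python
import numpy as np

# A literal, pixel-by-pixel comparison. Images are 2-D arrays where
# +1 is a bright pixel and -1 is a dark pixel.
x_image = np.array([
    [ 1, -1, -1, -1,  1],
    [-1,  1, -1,  1, -1],
    [-1, -1,  1, -1, -1],
    [-1,  1, -1,  1, -1],
    [ 1, -1, -1, -1,  1],
])

# The same X shifted one pixel to the right, with the vacated left
# column filled in as dark pixels.
shifted_x = np.roll(x_image, shift=1, axis=1)
shifted_x[:, 0] = -1

# Compare pixel by pixel: a literal computer sees many mismatches and
# says "uncertain", even though both images are clearly X's to us.
matches = x_image == shifted_x
print(f"{matches.mean():.0%} of pixels match")
```

A shift of just one pixel is enough to make a literal whole-image match fail, which is exactly the problem the next trick addresses.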
Now, one of the tricks that convolutional neural networks use is to match parts of the image rather than the whole thing. If you break the image down into its smaller parts, or features, it becomes much clearer whether two things are similar. Examples of these features are little mini-images, in this case just three pixels by three pixels. The one on the left is a diagonal line slanting downward from left to right, the one on the right is also a diagonal line, slanting in the other direction, and the one in the middle is a little X. These are little pieces of the bigger image, and you can see, as we go through, that if you choose the right feature and put it in the right place, it matches the image exactly.

So, okay, we have the bits and pieces. Now, to take a step deeper, the math behind matching these is called filtering. The way this is done is that a feature is lined up with a little patch of the image, and then, one by one, the pixels are compared: they're multiplied by each other, then added up and divided by the total number of pixels. To step through this and see why it makes sense: starting with the upper left-hand pixel in both the feature and the image patch, multiplying the 1 by a 1 gives you a 1, and we keep track of that by putting it in the position of the pixel we're comparing. We step to the next one: minus 1 times minus 1 is also a 1. We continue stepping through pixel by pixel, multiplying them all by each other, and because they're always the same, the answer is always 1. When we're done, we take all these ones, add them up, and divide by 9, and the answer is 1. Now we want to keep track of where that feature was in the image, so we put a 1 there to say: when we put the feature here, we get a match of 1. That is filtering.

Now we can take that same feature, move it to another position, and perform the filtering again. We start with the same pattern: the first pixel matches, the second pixel matches, but the third pixel does not match; minus one times one equals minus one, so we record that in our results, and we go through and do the same for the rest of the image patch. When we're done, we notice we have two minus ones this time. Adding up all the products gives five; divided by nine, we get 0.55. This is very different from our 1, and we can record the 0.55 in the position where it occurred. By moving our filter around to different places in the image, we find different values for how well that filter matches, or how well that feature is represented, at each position. This becomes a map of where the feature occurs.

Moving the feature around to every possible position is convolution: the repeated application of this feature, this filter, over and over again. What we get is a nice map across the whole image of where this feature occurs, and if we look at it, it makes sense. This feature is a diagonal line slanting downward from left to right, which matches the downward, left-to-right diagonal of the X. So, in our filtered image, all of the high numbers, the ones and 0.77s, lie right along that diagonal, which suggests that the feature matches along that diagonal much better than it does elsewhere in the image.

As a shorthand notation, we'll use a little X with a circle in it to represent convolution, the act of trying every possible match. We repeat that with other features: with our X filter in the middle, and with our upward-slanting diagonal line on the bottom. In each case, the map we get of where that feature occurs is consistent with what we would expect from what we know about the X and about where our features match. This act of convolving an image with a bunch of filters, a bunch of features, and creating a stack of filtered images, is what we'll call a convolution layer: a layer because it's an operation we can stack with others, as we'll show in a minute. In convolution, one image becomes a stack of filtered images; we get as many filtered images out as we have filters. So the convolution layer is one trick that we have.
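Here is a minimal sketch of that filtering and convolution step under the same +1/-1 pixel convention; the names filter_patch and convolve are just illustrative:

```python
import numpy as np

def filter_patch(feature, patch):
    """Multiply the feature and the image patch pixel by pixel, add up
    the products, and divide by the total number of pixels. A perfect
    match gives 1.0; the mismatched patch from the walkthrough gives 0.55."""
    return (feature * patch).sum() / feature.size

def convolve(image, feature):
    """Slide the feature across every possible position in the image,
    filtering at each one, to build a map of where the feature occurs."""
    fh, fw = feature.shape
    rows = image.shape[0] - fh + 1
    cols = image.shape[1] - fw + 1
    feature_map = np.zeros((rows, cols))
    for i in range(rows):
        for j in range(cols):
            feature_map[i, j] = filter_patch(feature, image[i:i+fh, j:j+fw])
    return feature_map

# The downward, left-to-right diagonal feature from the walkthrough.
diagonal = np.array([[ 1, -1, -1],
                     [-1,  1, -1],
                     [-1, -1,  1]])
```

Running convolve(x_image, diagonal) on the X from the earlier sketch produces a map whose highest values fall along the image's downward diagonal, just as in the walkthrough.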
The next big trick we have is called pooling. This is how we shrink the image stack, and it's pretty straightforward. We start with a window size, usually 2 by 2 pixels or 3 by 3 pixels, and a stride, usually 2 pixels; in practice these settings just tend to work best. Then we take that window and walk it, in strides, across each of the filtered images, and from each window we take the maximum value. To illustrate, we start with our first filtered image. We have our 2 pixel by 2 pixel window; within that window, the maximum value is 1, so we keep track of that, then move by our stride of 2 pixels. We move 2 pixels to the right and repeat: out of that window, the maximum value is 0.33, then 0.55, and so on. When we get to the end, we have to be creative: we don't have all the pixels represented, so we take the max of what's there. We continue doing this across the whole image, and when we're done, what we end up with is a similar pattern, but smaller. We can still see that our high values are all on the diagonal, but instead of the 7 by 7 pixels of our filtered image, we have a 4 by 4 pixel image, about half as big as it was.

This makes a lot of sense to do if you imagine that instead of starting with a 9 by 9 pixel image, we had started with a 9,000 by 9,000 pixel image: shrinking it makes it more convenient to work with. The other thing pooling does is that it doesn't care where in the window the maximum value occurs, which makes it a little less sensitive to position. The way this plays out is that if you're looking for a particular feature in an image, it can be a little to the left, a little to the right, maybe a little rotated, and it'll still get picked up. So we do max pooling with our whole stack of filtered images and get, in every case, a smaller set of filtered images. That's our second trick.

The third trick is normalization. This is just a step to keep the math from blowing up and to keep it from going to zero. All you do here is: everywhere in your image that there is a negative value, change it to zero. For instance, looking back at our filtered image, we have what are called rectified linear units; that's the little computational unit that does this, and all it does is step through, and everywhere there's a negative value, change it to zero. By the time you're done, you have a very similar-looking image, except there are no negative values, just zeros. We do this with all of our images, and this becomes another type of layer. So, in a rectified linear unit layer, a stack of images becomes a stack of images with no negative values.
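Here is a minimal sketch of both operations, handling the partial windows at the edges by taking the max of whatever pixels are there, as described above:

```python
import numpy as np

def max_pool(image, size=2, stride=2):
    """Walk a window across the image in strides and keep the largest
    value from each window. Windows at the right and bottom edges may
    be partial; we take the max of whatever pixels are there."""
    h, w = image.shape
    rows = range(0, h, stride)
    cols = range(0, w, stride)
    pooled = np.zeros((len(rows), len(cols)))
    for i, r in enumerate(rows):
        for j, c in enumerate(cols):
            pooled[i, j] = image[r:r+size, c:c+size].max()
    return pooled

def relu(image):
    """The rectified linear unit: replace every negative value with zero."""
    return np.maximum(image, 0)
```

A 7-by-7 filtered image pooled this way comes out 4 by 4, matching the walkthrough above.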
Now, what's really fun, where the magic starts to happen, is when we take these layers, convolution layers, rectified linear units, and pooling layers, and stack them up so that the output of one becomes the input of the next. What goes into each of these and what comes out of each of these looks like an array of pixels, or an array of arrays of pixels, and because of that we can stack them nicely: we can use the output of one as the input of the next, and by stacking them we get these operations building on top of each other. What's more, we can repeat the stacks. We can do deep stacking. You can imagine making a sandwich that is not just one patty, one slice of cheese, one piece of lettuce, and one tomato, but a whole bunch of layers: double-decker, triple-decker, quadruple-decker sandwiches, as many layers as you want. Each time, the image gets more filtered as it goes through convolution layers, and it gets smaller as it goes through pooling layers.

The final layer in our toolbox is called a fully connected layer. Here, every value gets a vote on what the answer is going to be. We take our now much-filtered and much-reduced-in-size stack of images, break it out, and rearrange it into a single list, because it's easier to visualize that way. Then each of those values connects to each of the answers we're going to vote for. When we feed in an X, there will be certain values here that tend to be high; they tend to predict very strongly that this is going to be an X, so they get a strong vote for the X outcome. Similarly, when we feed a picture of an O into our convolutional neural network, there are certain values at the end that tend to be very high and predict strongly that we're going to have an O, so they get a lot of weight, a strong vote, for the O category.

Now, when we get a new input and we don't know what it is and we want to decide, the way this works is that the input goes through all of our convolution, rectified linear unit, and pooling layers and comes out at the end. We get a series of values, and based on the weight each value gets to vote with, we get a nice average vote at the end. In this case, this particular set of inputs votes for an X with a strength of 0.92, and for an O with a strength of 0.51. Here, X is definitely the winner, and so the neural network would categorize this input as an X. So, in a fully connected layer, a list of feature values becomes a list of votes.

Again, what's cool here is that a list of votes looks a whole lot like a list of feature values, so you can use the output of one as the input of the next. That means you can have intermediate categories that aren't your final votes; sometimes these are called hidden units in a neural network, and you can stack as many of these together as you want, too. In the end, they all end up voting for an X or an O, and whoever gets the most votes wins. If we put this all together, then a two-dimensional array of pixels goes in, and a set of votes for a category comes out at the far end.
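Here is a minimal sketch of that weighted vote, with made-up feature values and weights; in a real network the weights are learned, as described next:

```python
import numpy as np

def fully_connected(feature_values, weights):
    """Each feature value casts a weighted vote for each category."""
    return feature_values @ weights  # one total vote per category

feature_values = np.array([0.9, 0.65, 0.45, 0.87])  # the flattened stack
weights = np.array([[0.9, 0.1],   # columns are the [X, O] categories;
                    [0.0, 0.6],   # each row is how strongly one feature
                    [0.2, 0.1],   # value votes for each category
                    [0.1, 0.0]])

votes = fully_connected(feature_values, weights)
categories = ["X", "O"]
print(categories[int(np.argmax(votes))])  # whoever gets the most votes wins
```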
There are some things that we have glossed over here. You might be asking yourself where all the magic numbers come from. Things that I pulled out of thin air include the features in the convolution layers, those convenient three pixel by three pixel diagonal lines and the little X, and also the voting weights in the fully connected layers; I really waved my hands about how those are obtained. In all these cases the answer is the same: there is a trick called backpropagation. All of these are learned. You don't have to know them, and you don't have to guess them; the deep neural network figures them out on its own.

The underlying principle behind backpropagation is that the error in the final answer is used to determine how much the network adjusts and changes. In this case, if we knew we were putting in an X and we got a 0.92 vote for X, that would be an error of 0.08. And if we got a 0.51 vote for O, that would be an error of 0.51, because it should be zero. Adding those up, we get a total error of 0.59. What happens with this error signal is that it helps drive a process called gradient descent. If there is one other bit of special sauce in deep neural networks, it is the ability to do gradient descent.

For each of these magic numbers, each of the feature pixels and each voting weight, the value is adjusted up and down by a very small amount to see how the error changes. The amount of adjustment is determined by how big the error is: a large error, and they're adjusted a lot; a small error, just a tiny bit; no error, and they're not adjusted at all, because you have the right answer, so stop messing with it. You can think of each adjustment as sliding a ball slightly to the left and slightly to the right on a hill. You want to find the direction where it goes downhill; you want to go down that slope, down that gradient, to find the very bottom, because the bottom is where you have the least error. That's your happy place. So after sliding it to the left and to the right, you find the downhill direction and leave it there. Doing that many times, over lots and lots of iterations, lots of steps, helps all of these values, across all the features and all the weights, settle into what's called a minimum. At that point the network is performing as well as it possibly can; adjusting any of those values a little bit would make its behavior worse.
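Here is a minimal sketch of that nudge-and-check idea, reusing the error numbers from the example and a toy one-weight error surface; note that real backpropagation computes these slopes analytically rather than by trial nudges:

```python
import numpy as np

# The error signal from the example above: a 0.92 vote for X and a
# 0.51 vote for O when the true answer was an X.
votes = np.array([0.92, 0.51])
target = np.array([1.0, 0.0])
total_error = np.abs(target - votes).sum()
print(round(total_error, 2))  # 0.59

def nudge(weight, error_fn, step=0.001):
    """Try slightly left and slightly right; move a little downhill."""
    if error_fn(weight - step) < error_fn(weight + step):
        return weight - step   # downhill is to the left
    if error_fn(weight + step) < error_fn(weight - step):
        return weight + step   # downhill is to the right
    return weight              # flat: right answer, stop messing with it

# A toy error surface whose minimum, the "happy place", is at 0.3.
error_fn = lambda w: (w - 0.3) ** 2
weight = 1.0
for _ in range(2000):       # lots and lots of small steps
    weight = nudge(weight, error_fn)
print(round(weight, 3))     # settles near the minimum at 0.3
```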
Now, there are some things called hyperparameters. These are knobs that the designer gets to turn, decisions the designer gets to make, that are not learned automatically. In convolution: how many features should be used, and how big those features should be, how many pixels on a side. In the pooling layers: the window size and the window stride. And in the fully connected layers: the number of hidden, or intermediate, neurons. All of these are decisions the designer gets to make. There are some common practices that tend to work better than others, but there is no principled way, no hard and fast rules, for the right way to do this, and in fact a lot of the advances in convolutional neural networks come from finding combinations of these that work really well. In addition, there are other decisions the designer gets to make, like how many of each type of layer to use and in what order, and, for those who really like to go off the rails, whether to design entirely new types of layers and slip them in to get new, fun behaviors. These are all things people are playing with to try to eke out more performance and address stickier problems with CNNs.

Now, what's really cool is that, although we've been talking about images, you can use any two-dimensional, or for that matter three- or four-dimensional, data. What's important is that in your data, things closer together are more closely related than things far away. What I mean is that in an image, two rows or two columns of pixels right next to each other are more closely related than rows or columns far away. So you can take something like sound and chop it up into little time steps, and for each piece of time, the time steps right before and right after are more closely related than time steps far away; the order matters. You can also chop it up into different frequency bands: bass, mid-range, treble, and you can slice it a whole lot more finely than that. Again, frequency bands that are closer together are more closely related, and you can't rearrange them; the order matters. Once you do this with sound, it looks like a picture, it looks like an image, and you can use convolutional neural networks on it.

You can do something similar with text, where the position in the sentence becomes the column and the rows are words in a dictionary. In this case, it's hard to argue that the order of the dictionary rows matters, or that some words in the dictionary are more closely related than others in all cases. So the trick here is to take a window that spans the entire column, top to bottom, and slide it left to right; that way it captures all of the words, but only a few positions in the sentence at a time.

The other side of this is a limitation of convolutional neural networks: they're really designed to capture local spatial patterns, spatial in the sense that things next to each other matter quite a bit. If the data can't be made to look like an image, they're not as useful. An example would be some customer data, where each row is a separate customer and each column is a separate piece of information about that customer, such as their name, their address, what they bought, and what websites they visited. This doesn't look so much like a picture: I can rearrange those columns and rearrange those rows, and it still means the same thing and is still equally easy to interpret. If I were to take an image and rearrange its columns and rows, I would get a scramble of pixels, and it would be difficult or impossible to say what the image was; I would lose a lot of information. So, as a rule of thumb, if your data is just as useful after swapping any of the columns for each other, then you can't use convolutional neural networks.

The take-home is that convolutional neural networks are great at finding patterns and using them to classify images. If you can make your problem look like finding cats on the Internet, then they're a huge asset. If you'd like to continue your study of CNNs, I recommend looking at the notes from the Stanford CS231n course from Justin Johnson and Andrej Karpathy, and checking out the writings of Christopher Olah, who is an exceptionally clear writer. Feel free to check out another presentation I did, Deep Learning Demystified, which talks about some of the properties of deep neural networks in general, for someone who is new to the topic. If you'd like to dig even deeper and play with some of these, there is a variety of toolkits; they each have their strengths and weaknesses, and I invite you to dig deep into them and learn all about them. Thanks for listening. Feel free to connect with me online; I would love to follow up with you.
Info
Channel: Brandon Rohrer
Views: 840,109
Rating: 4.9613175 out of 5
Keywords: deep learning, deep neural network, convolutional neural network, machine learning, artificial intelligence, convolution, neural network, artificial neural network, machine vision, how it works
Id: FmpDIaiMIeA
Length: 26min 13sec (1573 seconds)
Published: Thu Aug 18 2016