MIT 6.S191 (2020): Convolutional Neural Networks

Video Statistics and Information

  • Original Title: Convolutional Neural Networks | MIT 6.S191
  • Author: Alexander Amini
  • Description: MIT Introduction to Deep Learning 6.S191: Lecture 3 New 2020 Edition Convolutional Neural Networks for Computer Vision Lecturer: Alexander Amini ...
  • Youtube URL: https://www.youtube.com/watch?v=iaSUYvmCekI
Captions
Hi everyone, and welcome back to MIT 6.S191. Today we're going to be talking about one of my favorite topics in this course, and that's how we can give machines a sense of vision. Vision, I think, is one of the most important senses that humans possess. Sighted people rely on vision every single day, for things like navigation, manipulation, how you pick up objects, how you recognize objects, and how you recognize complex human emotions and behaviors, and I think it's very safe to say that vision is a huge part of human life. Today we're going to be learning about how deep learning can build powerful computer vision systems capable of solving extraordinarily complex tasks that maybe just fifteen years ago would not even have been possible to solve.

One example of how deep learning is transforming computer vision is facial recognition. On the top left you can see an icon of the human eye, which visually represents vision coming into a deep neural network in the form of images, pixels, or video, and on the output at the bottom you can see a depiction of a detected human face. But this could also be recognizing different human faces, or emotions on a face, or key facial features, and so on. Deep learning has transformed this field because it means the creator of this AI does not need to tailor the algorithm specifically towards facial detection. Instead, they can provide lots and lots of data to the algorithm and swap out the end piece: instead of facial detection, they can swap in many other detection or recognition tasks, and the neural network can try to learn to solve that task. For example, we can replace the facial detection task with the detection of diseased regions in the retina of the eye, and similar techniques could be applied throughout healthcare towards the detection and classification of many different types of diseases in biology and so on.

Another common example is in the context of self-driving cars, where we take an image as input and try to learn an autonomous control system for that car. This is entirely end-to-end: we have vision and pixels coming in as input, and the actuation of the car coming out as output. This is radically different from how the vast majority of autonomous car companies operate; if you look at companies like Waymo and Tesla, this end-to-end approach is radically different. We'll talk more about this later on, but this is actually one of the autonomous vehicles that we build here as part of my lab at CSAIL, which is why I'm bringing it up.

So now that we have a sense, at a very high level, of some of the computer vision tasks that we as humans solve every single day and that we can also train machines to solve, the next natural question to ask is: how can a computer see? Specifically, how does a computer process an image or a video — how does it process the pixels coming from those images? Well, to a computer, images are just numbers. Suppose we have this picture here of Abraham Lincoln; it's made up of pixels. Since it's a grayscale image, each of these pixels can be represented by a single number, and we can represent the whole image as a two-dimensional matrix of numbers, one entry for each pixel. That's how a computer sees this image: it's just a two-dimensional matrix of numbers. If we have an RGB image — a color image instead of a grayscale image — we can simply represent it as three of these two-dimensional matrices stacked on top of each other, one for the red channel, one for the green channel, and one for the blue channel, and that's RGB.
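To make that representation concrete, here's a minimal NumPy sketch; the 28x28 image size is an arbitrary choice of mine, not a value from the lecture.

```python
import numpy as np

# A grayscale image is just a 2D array of brightness values, one number per
# pixel (the 28x28 size here is arbitrary).
grayscale = np.random.randint(0, 256, size=(28, 28), dtype=np.uint8)
print(grayscale.shape)   # (28, 28) -> height x width

# An RGB image stacks three such 2D arrays: a red, a green, and a blue channel.
rgb = np.random.randint(0, 256, size=(28, 28, 3), dtype=np.uint8)
print(rgb.shape)         # (28, 28, 3) -> height x width x channels
print(rgb[0, 0])         # the three channel values of the top-left pixel
```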
Now we have a way to represent images to computers, and we can think about what types of computer vision tasks this allows us to solve given this foundation. Two common types of machine learning that we saw in lectures 1 and 2 yesterday are classification and regression: in regression our output takes a continuous value, and in classification our output takes a discrete class label.

Let's first start with classification, and specifically the problem of image classification. We want to predict a single label for each image. For example, we have a bunch of US presidents here, and we want to build a classification pipeline to determine which president is in the image we're looking at, outputting the probability that this image is each of those US presidents. In order to correctly classify this image, our pipeline needs to be able to tell what is unique about a picture of Lincoln versus a picture of Washington versus a picture of Obama: it needs to understand the unique differences between those classes.

Another way to think about this image classification pipeline, at a high level, is in terms of features that are characteristic of a particular class. Classification is done by detecting these types of features in an image: if you detect enough of the features specific to a class, then you can probably say with pretty high confidence that you're looking at that class. One way to solve this is to leverage your domain knowledge. Suppose we're dealing with human faces: we can use our knowledge about human faces to say that if we want to detect faces, we can first detect noses, eyes, ears, and mouths, and once we have a detection pipeline for those features we can determine whether we're looking at a human face or not.

There's a big problem with that approach, though, and it's that preliminary detection pipeline: how do we detect those noses, ears, and mouths? That hierarchy is the bottleneck. Remember that these images are just three-dimensional arrays of numbers — really, three-dimensional arrays of brightness values — and images can hold tons of variation: occlusions, variations in illumination, and even intra-class variation. When we're building our classification pipeline we need to be invariant to all of the variations within a single class while remaining sensitive to the variations between classes. Even though our pipeline could use features that we as humans define, the manual extraction of those features is where this really breaks down: due to the incredible variability of image data, detecting these hand-defined features is very difficult in practice, and manually extracting them can be extremely brittle. So how can we do better than this? That's really the question we want to tackle today.
One way is to extract these visual features and detect their presence in the image simultaneously, and in a hierarchical fashion, and for that we can use neural networks like we saw in lectures one and two. Our approach is going to be to learn the visual features directly from data, and to learn a hierarchy of those features, so that we can build up a representation of what makes up our final class label.

Now that we have that foundation of how images work, we can move on to asking how we can learn those visual features, specifically with a certain type of operation, and neural networks will allow us to learn those features directly from visual data if we construct them cleverly. In lecture one we learned about fully connected, or dense, neural networks, where you can have multiple hidden layers and each hidden layer is densely connected to its previous layer; densely connected, just to remind you, means that every input is connected to every output in that layer. Let's say we want to use these densely connected networks for image classification. That would mean taking our two-dimensional image — it has a two-dimensional spatial structure — collapsing it down into a one-dimensional vector, and feeding that through our dense network, so every pixel in that one-dimensional vector feeds into the next layer. You should already appreciate that all of the two-dimensional structure in that image is now completely gone: by collapsing the image into one dimension we've lost all of that very useful spatial structure, and all of the domain knowledge we could have used a priori. Additionally, we're going to have a ton of parameters in this network, because it's densely connected: we're connecting every single pixel in our input to every single neuron in the hidden layer. So this is not really feasible in practice.
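To make that parameter-count concern concrete, here's a back-of-the-envelope sketch; the image size, layer width, and filter counts are illustrative assumptions of mine, and the small filter bank anticipates the patch-based approach introduced next.

```python
# Flattening an image and feeding it to a dense layer: every pixel connects
# to every hidden neuron.
height, width, channels = 200, 200, 3   # illustrative image size
hidden_units = 1000                     # illustrative dense-layer width

dense_weights = height * width * channels * hidden_units
print(f"dense layer weights: {dense_weights:,}")   # 120,000,000

# A convolutional layer instead reuses a few small filters across the image.
num_filters, filter_size = 32, 3
conv_weights = num_filters * filter_size * filter_size * channels
print(f"conv layer weights:  {conv_weights:,}")    # 864
```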
Instead, we need to ask how we can build spatial structure into neural networks, so that we can be a little more clever in our learning process and handle this type of input in a more reasonable and well-behaved way, making use of the prior knowledge that spatial structure is super important in image data. To do this, let's first represent our two-dimensional image as an array of pixel values, just as before. One way to keep and maintain that spatial structure is by connecting patches of the input to a single neuron in the hidden layer. So instead of connecting every input pixel to every neuron in the hidden layer, as in a dense network, we connect just a small patch, and notice that only that region of the input image influences this single neuron in the hidden layer. To define connections across the entire input, we apply the same principle of connecting patches of the input layer to single neurons in the subsequent layer, simply by sliding the patch window across the input image — in this case by two units each time. In this way we maintain all of the spatial structure and spatial information inherent to our image input. But remember that the final task we really want to accomplish is to learn visual features, and we can do this very simply by weighting the connections in the patches. Instead of connecting each patch uniformly to the hidden layer, we weight each of those pixels — similar to what we saw in lab 1 — so we have a weighted summation of all of the pixels in the patch, and that feeds into the next hidden unit in the hidden layer to detect a particular feature.

In practice this operation is called convolution, which gives way to the name convolutional neural network, which we'll get to later on. Let's think about this at a high level first. Suppose we have a four-by-four filter, which means we have sixteen different weights. We're going to apply this same filter to four-by-four patches across the entire input image, and use the result of that operation to define the state of the neurons in the next hidden layer. We shift the patch across the image — for example in units of two pixels each time — grab the next patch, and repeat the convolution operation. That's how we can start to think about extracting features from our input.

But you're probably wondering how this convolution operation actually relates to feature extraction. So far we've just defined the sliding operation, where we slide a patch over the input, but we haven't really talked about how that lets us extract features from the image itself. Let's make this concrete by walking through an example. Suppose we want to classify X's from a set of black-and-white images, where black is represented by -1 and white by 1. To classify X's, clearly we're not going to be able to just compare the two matrices directly: there's too much variation within this class, and we want to be invariant to certain deformations of the images — scale, shift, rotation. So instead, we want our model to compare images of X's piece by piece, or patch by patch, and the important patches it looks for are the features. If our model can find rough feature matches across two images, then it can say with pretty high confidence that they're probably coming from the same class: if they share a lot of the same visual features, they probably represent the same object.

Each feature is like a mini image — each of these patches is itself a small two-dimensional array of numbers — and we'll use filters, as I'll now call them, to pick up on the features common to X's. In the case of X's, filters representing diagonal lines and crossings are probably the most important things to look for, and you can see those on the top row here: we can capture these features in terms of the arms and the main body of the X. Note that the smaller matrices are the filters of weights — the actual values of the weights that correspond to that patch as we slide it across the image. Now all that's left to do is to define the convolution operation itself: when we slide the patch over the image, what is the actual operation that takes the patch and the underlying image and produces the output at the next hidden layer?
Convolution preserves the spatial structure between pixels by learning image features in small squares, or patches, of the input data. The computation is as follows: we first place the filter on top of a patch of the input image of the same size — here we're placing it on the top left of the image, on the X shown in green — and we perform an element-wise multiplication: for every pixel of the image that the filter overlaps, we multiply the image value by the corresponding filter value. The result, which you can see on the right, is a matrix of all ones, because there's perfect overlap between the filter and the image at this patch location. The only thing left to do is sum up all of those numbers, and when you sum them up you get nine, and that's the output at the next layer.

Let's go through one more example, a little more slowly, so you can appreciate what the convolution operation is intuitively telling us. That was mathematically how it's done; now let's see what it's showing us. Suppose we want to compute the convolution of this five-by-five image, in green, with this three-by-three filter. To do this, we need to cover the entire image by sliding the filter over it, performing the element-wise multiplication and adding up the outputs for each patch. This is what that looks like: first we place the yellow filter on the top-left corner, element-wise multiply, add all of the outputs, and get four, and we place that four in the first entry of our output matrix, which is called the feature map. We then continue by sliding the three-by-three filter over the image, element-wise multiplying, adding up the numbers, and placing the next result, three, in the next column, and we keep repeating this operation until we've covered the whole image. That's it: the feature map on the right reflects where in the image there is activation by this particular filter. If you look at this filter, it's an X, or a cross: it has ones on both diagonals. And in the feature map you can see that the image is being maximally activated along its main diagonal, where the four appears, showing that there is maximum overlap between this filter and the image along that central diagonal.

Now let's take a quick look at how different filters — different weights in your filter — can produce different feature maps. Simply by changing the weights in your filter, you change what the filter is looking for and what it will activate on. Take for example the image of the woman (Lenna) on the left: that's the original image. If you slide different filters over this image you get different output feature maps: you can sharpen the image with the filter shown in the second column, detect edges with the filter in the third column, or detect even stronger edges with the fourth, and these are all just different choices of the weights in the filter. So I hope you can now appreciate how convolution allows us to capitalize on spatial structure and use sets of weights to extract local features within images, and how easily we can detect different features simply by changing the weights and using different filters.
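Here's a minimal NumPy sketch of that sliding multiply-and-sum; this is not code from the lecture, and the image and filter values below are illustrative, chosen so the cross-shaped filter lines up perfectly at the centre of an X (where the sum comes out to nine, as in the example above).

```python
import numpy as np

def convolve2d(image, kernel, stride=1):
    """Slide `kernel` over `image`; at each position, element-wise multiply
    and sum to produce one entry of the feature map (no padding)."""
    kh, kw = kernel.shape
    ih, iw = image.shape
    out_h = (ih - kh) // stride + 1
    out_w = (iw - kw) // stride + 1
    feature_map = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i * stride:i * stride + kh,
                          j * stride:j * stride + kw]
            feature_map[i, j] = np.sum(patch * kernel)
    return feature_map

# A 5x5 black-and-white "X" (white = 1, black = -1) and a 3x3 cross-shaped
# filter with ones on both diagonals.
image = np.array([[ 1, -1, -1, -1,  1],
                  [-1,  1, -1,  1, -1],
                  [-1, -1,  1, -1, -1],
                  [-1,  1, -1,  1, -1],
                  [ 1, -1, -1, -1,  1]])
kernel = np.array([[ 1, -1,  1],
                   [-1,  1, -1],
                   [ 1, -1,  1]])
print(convolve2d(image, kernel))
# Centre entry of the feature map is 9: perfect overlap between filter and X.
```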
Okay. These concepts — preserving spatial information and spatial structure while performing local feature extraction with the convolution operation — are at the core of the neural networks we use for computer vision tasks. So now that we have convolutions under our belt, we can start to think about how to use them to build full convolutional neural networks for solving computer vision tasks. These networks are very appropriately named convolutional neural networks, because their backbone is the convolution operation. We'll first look at a CNN architecture designed for image classification, and we'll see how the convolution operation feeds into spatial downsampling operations so that we can build the whole thing end to end.

First, let's consider a very simple CNN for image classification. The goal here is to learn features directly from the data and to use these learned feature maps for classification of the images. There are three main parts to a CNN that I want to talk about. The first is convolution, which we discussed before: this extracts the features from your image, or, more generically, from your previous layer. The second is applying a nonlinearity: as we saw in lectures 1 and 2, nonlinearities allow us to deal with nonlinear data and introduce complexity into our learning pipeline so we can solve more complex tasks. The third is the pooling operation, which lets you downsample the spatial resolution of your image and deal with multiple scales of the features within that image. Finally, the computation of class scores — if we're dealing with image classification — can be output by a dense layer at the end, after the convolutional layers, representing the probability of each class, and that can be the final output.

Now we'll go through each of these operations and break the ideas down a little further, so we can see the basic architecture of a CNN and how you could implement one as well. Going through this step by step, the first step is the convolution operation, and it's the same story we've been going through: each neuron in the hidden layer computes a weighted sum of its inputs from its patch, applies a bias, as in lectures one and two, and activates with a nonlinearity. What's special here is the local connectivity that I keep stressing: each neuron in the hidden layer sees only a patch of the original input image. We can write out the actual computation for a neuron in the hidden layer: its inputs are the neurons in its patch in the previous layer; we apply a matrix of weights — that's the filter, a four-by-four filter in this case — do an element-wise multiplication, add up the results, apply a bias, and activate with the nonlinearity, and that's it: that's a single neuron in the hidden layer, and we keep repeating this by sliding the patch over the input. Remember, the element-wise multiplication and addition is simply the convolution operation we talked about earlier; I'm not saying anything new except for the addition of that bias term before the nonlinearity.
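As a minimal sketch of that single-neuron computation, assuming an illustrative 2x2 patch, filter, and bias of my own choosing (not values from the lecture), with a ReLU as the nonlinearity:

```python
import numpy as np

def conv_neuron(patch, weights, bias):
    """One hidden-layer neuron: element-wise multiply the input patch by the
    filter weights, sum, add a bias, then apply a ReLU nonlinearity."""
    z = np.sum(patch * weights) + bias
    return max(z, 0.0)

patch   = np.array([[0.9, 0.2], [0.6, 0.4]])   # illustrative 2x2 input patch
weights = np.array([[1.0, -1.0], [0.5, 0.5]])  # illustrative 2x2 filter
print(conv_neuron(patch, weights, bias=0.1))   # -> about 1.3
```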
This defines how the neurons in convolutional layers are connected, but within a single convolutional layer we can have multiple different filters — multiple different features that we might want to extract or detect. The output of a convolutional layer is therefore not a single image but a volume of images, one for each of the filters. Here the depth d is the number of filters, or the number of features you want to detect in that image, and that's set by the human: when you define your network, you define, at every layer, how many features you want to detect at that layer. We can also think about the connections of a neuron in a convolutional layer in terms of its receptive field: the locations in the input that that specific node is connected to. These parameters define the spatial arrangement of the output of the convolutional layer. To summarize, we've seen how the connections of these convolutional layers are defined, and how the output of a convolutional layer is a volume whose depth is the number of filters we want to learn. With this information we've defined a single convolutional layer, and we're well on our way to defining the full convolutional neural network; the remaining steps are kind of just icing on the cake at this point.

The next step is applying the nonlinearity. On that output volume we apply an element-wise nonlinearity — in this case I'm showing the rectified linear unit, or ReLU, activation function. This is very similar in spirit to lectures 1 and 2, where we also applied nonlinearities to deal with highly nonlinear data. We haven't talked about ReLU yet, but it's just an activation function that takes any real number as input and sets everything less than zero to zero, while keeping anything greater than zero the same. Another way to think about it is that it caps the minimum of whatever you feed in at zero: if the input is greater than zero it doesn't touch it, and if it's less than zero it clips it to zero.

The next key idea of convolutional neural networks is pooling, and that's how we deal with different spatial resolutions and become invariant to spatial size in the image. The pooling operation is used to reduce the dimensionality of the input layers, and it can be applied after any convolutional layer: you can take your input image, apply a convolutional layer, apply a nonlinearity, and then downsample using a pooling layer to get a different spatial resolution before applying your next convolutional layer, and repeat this process over many layers in a deep network. A common technique for pooling is called max pooling, and the idea is as follows: you slide another window, or patch, over your layer, and for each patch you simply take the maximum value within it. Say we're dealing with two-by-two patches, like the red patch you can see on the top right: we take the maximum value in that red patch, which is six, and that appears in the output on the right-hand side; we repeat this over the entire image. This allows us to shrink the spatial dimensions of our image while still maintaining the spatial structure.
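Here's a minimal NumPy sketch of ReLU and 2x2 max pooling, assuming non-overlapping 2x2 windows and an illustrative 4x4 input of my own choosing (not code or values from the lecture):

```python
import numpy as np

def relu(x):
    """Rectified linear unit: anything below zero becomes zero."""
    return np.maximum(x, 0)

def max_pool_2x2(feature_map):
    """Take the maximum of each non-overlapping 2x2 patch, halving each
    spatial dimension."""
    h, w = feature_map.shape
    trimmed = feature_map[:h - h % 2, :w - w % 2]   # drop odd edge rows/cols
    blocks = trimmed.reshape(h // 2, 2, w // 2, 2)
    return blocks.max(axis=(1, 3))

x = np.array([[ 1., -3.,  2.,  6.],
              [ 5.,  0., -1.,  2.],
              [-2.,  4.,  3., -5.],
              [ 0.,  1., -4.,  2.]])
print(max_pool_2x2(relu(x)))   # [[5. 6.]
                               #  [4. 3.]]
```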
Actually, this is a great point to pause on, because I encourage all of you to think about what other ways you could perform a pooling operation: how else could you downsample these images? Max pooling is one way — you take the maximum of each of these two-by-two patches — but there are a lot of other clever ways as well, and it's interesting to think about other ways we could potentially perform this downsampling operation.

Now, with all of this knowledge, we're ready to put things together and build these networks end to end. We have the three main operations I described before — convolution, local nonlinearities, and pooling — and with CNNs we can layer these operations to learn a hierarchy of features that we want to detect if they're present in the image data or not. A CNN built for image classification — I'm showing the first part of the CNN here on the left — can be broken down roughly into two parts. The first part is feature learning: that's where we want to extract and learn the features from our image data, and it's simply applying the idea I showed you before, stacking convolutions and nonlinearities with pooling operations and repeating this throughout the depth of the network. The second part is to take those extracted, learned features and classify the image: the ultimate goal is not just to extract features but to use them to make a classification or a decision based on the image. So we feed the output features into a fully connected, or dense, layer, and that dense layer outputs a probability distribution over the image's membership in the different categories or classes. We do this using a function called softmax, which you already got some experience with in lab 1, whose output represents this categorical distribution.

So now let's put this all together and code our first end-to-end convolutional neural network from scratch. We start by defining our feature extraction head, which begins with a convolutional layer — here with 32 filters, so 32 is the number of filters we want to extract in this first convolutional layer. We downsample the spatial information using a max pooling operation, as discussed earlier, and then feed this into the next set of convolutional layers in the network: now, instead of 32 features, we extract even more, 64 features. Finally we flatten all of the spatial features we've learned into a vector and learn a final probability distribution over class membership, and that allows us to classify the image into one of the different classes we've defined.
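In TensorFlow/Keras (which the course labs use), a sketch of that architecture might look like the following; the input shape, kernel sizes, and number of classes are my own illustrative assumptions, not values given in the lecture.

```python
import tensorflow as tf

def build_cnn_classifier(input_shape=(28, 28, 1), n_classes=10):
    """Feature learning (conv + pool) followed by a dense classification head.
    Input shape, kernel sizes, and class count are illustrative."""
    return tf.keras.Sequential([
        # Feature learning: extract 32 features, downsample, then extract 64.
        tf.keras.layers.Conv2D(32, kernel_size=3, activation='relu',
                               input_shape=input_shape),
        tf.keras.layers.MaxPool2D(pool_size=2),
        tf.keras.layers.Conv2D(64, kernel_size=3, activation='relu'),
        # Classification: flatten the spatial feature maps and output a
        # probability distribution over the classes with softmax.
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(n_classes, activation='softmax'),
    ])

model = build_cnn_classifier()
model.summary()
```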
So far we've talked only about using CNNs for image classification tasks. In reality, this architecture extends to many, many different types of tasks and applications. When we consider CNNs for classification, we saw that the pipeline has two main parts: first the feature learning part, and then a classification part. What makes a convolutional neural network so powerful is that you can take the feature extraction part of the pipeline and attach whatever kind of output you want to it: you can treat the convolutional feature extractor simply as that, a feature extractor, and then plug in whatever other type of neural network you want at its output. So you can do detection by changing the output head; you can do semantic segmentation, where you want to detect semantic classes for every pixel in the image; and you can do end-to-end robotic control, like the self-driving car we saw before.

What's an example of this? We've seen a significant impact of computer vision in medicine and healthcare over the last couple of years. Just a couple of weeks ago there was a paper in which deep learning models were applied to breast cancer detection in mammogram images. What was shown is that CNNs were able to significantly outperform expert radiologists at detecting breast cancer directly from the mammogram images. That's done by feeding the images through a convolutional feature extractor, passing those learned features to dense layers, and then performing classification on top of them.

Instead of predicting a single number — breast cancer or no breast cancer — you could also imagine predicting, for every pixel in the image, the class of that pixel. Here we're showing a picture of two cows on the left; it's fed into a convolutional feature extractor and then upscaled through a decoder of transpose convolutions to predict, for every pixel, the class of that pixel. You can see that the network correctly identifies the two cows in brown, while the grass is in green and the sky is in blue. This is basically detection, but not as a single yes-or-no answer over the whole image — is there a cow or not — but as a class for every pixel, which is a much harder problem. This output is created using upsampling operations: this is no longer just a dense head; we have what are called transpose (or "inverse") convolutions, which scale our feature maps back up and allow us to predict full images as outputs, not just single numbers or a single probability distribution. And of course this idea can very easily be applied to many other applications in healthcare as well, especially for segmenting various types of cancers, such as the brain tumors shown on the top, or the regions of blood infected with malaria on the bottom.
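As a rough sketch of that upsampling idea, here is a toy encoder-decoder in Keras; the layer sizes, strides, and class count are my own assumptions, and this is not the model from the paper or the lecture.

```python
import tensorflow as tf

def build_segmentation_model(input_shape=(128, 128, 3), n_classes=3):
    """Toy encoder-decoder: convolutions downsample into feature maps, and
    transpose convolutions scale them back up to per-pixel class scores.
    All sizes and filter counts are illustrative."""
    return tf.keras.Sequential([
        # Encoder: convolutional feature extraction with downsampling.
        tf.keras.layers.Conv2D(32, 3, strides=2, padding='same',
                               activation='relu', input_shape=input_shape),
        tf.keras.layers.Conv2D(64, 3, strides=2, padding='same',
                               activation='relu'),
        # Decoder: transpose convolutions upsample back to full resolution.
        tf.keras.layers.Conv2DTranspose(32, 3, strides=2, padding='same',
                                        activation='relu'),
        tf.keras.layers.Conv2DTranspose(n_classes, 3, strides=2,
                                        padding='same', activation='softmax'),
    ])

seg_model = build_segmentation_model()
print(seg_model.output_shape)   # (None, 128, 128, 3): a class score per pixel
```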
Let's see one final example before ending this lecture, going back to self-driving cars. The idea is pretty similar: say we want to learn a neural network to control a self-driving car and perform autonomous navigation. Specifically, we want our model to go from images of the road — maybe from a camera attached to the top of the car; think of the actual pixels coming from that camera being fed to the neural network. In addition to the camera pixels, we also have an image of a bird's-eye street view of roughly where the car is in the world, and we can feed in both of these images. They're just two two-dimensional arrays of pixels, but they represent different things: one represents your perception of the world around you, and the other represents roughly where you are in the world globally. What we want to do is directly predict, or infer, a full distribution of possible control actions the car could take at that instant; if it doesn't have a goal destination in mind, it could say that it could take any of these three directions and steer toward them, and that's what we want this network to predict.

One way to do this is to train your neural network to take these camera images from the car as input, pass each of them through convolutional encoders or feature extractors, and, now that you've learned features for each of those images, concatenate them all together, so you have a global set of features across all of your sensor data, and then learn your control outputs from those, on the right-hand side. Again, this is done entirely end to end: we never told the car what a lane marker was, what a road was, how to turn right or left, or what an intersection is. We never gave it any of that information, yet it's able to learn all of it and extract those features from scratch just by watching a lot of human driving data, and learn how to drive on its own.

Here's an example where a human enters the car and inputs a desired destination, which you can see on the top right: the red line indicates where we want the car to go on the map, so think of it like Google Maps — you plug in where you want to go — and then the convolutional neural network outputs the control commands, given what it sees on the road, to actuate the vehicle towards that destination. Note that the vehicle is able to successfully navigate through those intersections even though it has never driven in this area before, has never seen these roads before, and was never even told what an intersection was: it learned all of this from data using convolutional neural networks.

The impact of CNNs has been very wide-reaching beyond the examples I've given today, and it has touched so many different fields of computer vision, ranging across robotics, medicine, and many other fields. I'd like to conclude by taking a look at what we've covered in today's lecture. We first considered the origins of computer vision, how images are represented as brightness values to a computer, and how convolution operations work in practice. Then we discussed the basic architecture: how we build up from convolution operations to convolutional layers, and from those to convolutional neural networks. And finally we talked about the extensions and applications of convolutional neural networks, how we can visualize a bit of their behavior, and how we can actually act on the real world with them, whether by making predictions on medical scans or by actuating robots that interact with humans in the real world. That's it for the CNN lecture on computer vision; next up we'll hear from Ava on deep generative modeling. Thank you.
Info
Channel: Alexander Amini
Views: 222,613
Rating: 4.9756923 out of 5
Keywords: deep learning, mit, artificial intelligence, neural networks, machine learning, 6s191, 6.s191, mit deep learning, ava soleimany, soleimany, alexander amini, amini, lecture 2, tensorflow, computer vision, deep mind, openai, basics, introduction, deeplearning, ai, tensorflow tutorial, what is deep learning, deep learning basics, cnn, convolutional, convolution, vision, self driving, autonomous vehicles, machine vision, image processing, semantic segmentation
Id: iaSUYvmCekI
Length: 37min 20sec (2240 seconds)
Published: Fri Feb 21 2020