MIT 6.S191: Convolutional Neural Networks

Captions
Hi everyone, and welcome back to MIT 6.S191! Today we're going to be talking about one of my favorite topics in this course, and that's how we can give machines a sense of vision. Vision is one of the most important human senses: sighted people rely on vision for everything from navigating in the world, to recognizing and manipulating objects, to interpreting facial expressions and understanding very complex human emotions. I think it's safe to say that vision is a huge part of everyday human life, and today we're going to learn about how we can use deep learning to build very powerful computer vision systems that can predict what is where by only looking, specifically by looking at only raw visual inputs. I like to think of that as a very simple definition of what vision, at its core, really means. But vision is so much more than simply understanding what an image is of: it means understanding not just what the image is of, but also where the objects in the scene are, and predicting and anticipating what's going to happen next.

Take this scene, for example. We can build computer vision algorithms that can identify objects in the scene, such as this yellow taxi or this white truck on the side of the road. But to achieve true vision we need to understand, on a different level, where all of these objects are going. For that we should probably focus more on the yellow taxi than on the white truck, because there are some subtle cues in this image that lead us to believe that the white truck is parked on the side of the road: it's stationary and probably won't be moving for the time that we're observing the scene. The yellow taxi, on the other hand, even though it's also not moving, is only stationary as a result of the pedestrians crossing in front of it. That's something very subtle, but it can be reasoned about very effectively by our brains. Humans take this for granted, but it's an extraordinarily challenging problem in the real world, because building true vision algorithms requires reasoning about all of these different components, not just in the foreground: there are also very important cues we can pick up in the background, like this traffic light, as well as obstacles in the far distance. Building these vision algorithms really does require an understanding of all of these very subtle details.

Now, deep learning is bringing forward an incredible revolution, or evolution, of computer vision algorithms and applications, ranging from allowing robots to use visual cues to perform things like navigation, to the algorithms you're going to learn about today, which have become so mainstream and so compact that they all fit and run in our pockets, in our phones, for processing photos and videos and detecting faces for greater convenience. We're also seeing some extraordinarily exciting applications of vision in biology and medicine, for picking up on extremely subtle cues and detecting things like cancer, as well as in the field of autonomous driving. And finally, in a few slides I'll share a very inspiring story of how the algorithms that you're going to learn about today are also being used for accessibility, to aid the visually impaired.
Now, deep learning has taken computer vision by storm because of its ability to learn directly from raw image inputs, and to learn to do feature extraction only through observation of a ton of data. One example that is really prevalent in the computer vision field is facial detection and recognition. On the left-hand side you can see an icon of a human eye, which I'm using pictorially to represent images that we perceive and that we can also pass through a neural network to predict facial features. Deep learning has transformed this field because it allows the creator of the machine learning algorithm to easily swap out the end task: given enough data, we learn the neural network in the middle, between the vision input and the task, and try to solve it. Here we're performing the end task of facial detection, but that end task could equivalently be in the context of autonomous driving, where we take an image as input, which you can see in the bottom right-hand corner, and we try to directly learn the steering control as output: from this one observation of the scene, what steering command should the car execute? This is done completely end to end; the entire control system of this vehicle is a single neural network learned entirely from data. This is very different from the approach of the majority of other self-driving car companies, like Waymo and Tesla, and we'll talk more about this later. I wanted to share this clip with you because this is one of the autonomous vehicles that we've been building in our lab here in CSAIL, which I'm part of, and we'll see more about that later in the lecture as well.

We're also seeing, like I mentioned, a lot of applications in medicine and healthcare, where we can take raw images and scans of patients and learn to detect things like breast cancer and skin cancer, and, most recently, take scans of patients' lungs to detect COVID-19.

Finally, I want to share this inspiring story of how computer vision is being used to help the visually impaired. In this project, researchers built a deep-learning-enabled device that can detect a trail for running and provide audible feedback to a visually impaired user so that they can run. To demonstrate this, let me share this very brief video: "The machine learning algorithm that we have detects the line and can tell whether the line is to the runner's left, right, or center. We can then send signals to the runner that guide them left and right based on their positioning." "The first time we went out, we didn't even know if sound would be enough to guide me, so it's sort of that beta-testing process that you go through." "From human eyes it's very obvious to recognize the line; teaching a machine learning model to do that is not that easy. You step left and right as you're running, so there's a shake to the line, left and right. As soon as you start going outdoors, the light is a lot more variable: tree shadows, falling leaves. And the line on the ground can be very narrow, and there may be only a few pixels for the computer vision model to recognize." "There was no tether, there was no stick, there was no furry dog; it was just being with yourself." "Ah, that's the first time I've run alone in decades."
So these are often tasks that we as humans take for granted, but it's really remarkable to see how deep learning is being applied to some of these problems focused on doing good and helping people, in this case the visually impaired: a man who had never run without his guide dog before is now able to run independently through the trails with the aid of this computer vision system. Like I said, we often take these tasks for granted because they're so easy for each sighted individual to do routinely, but we can actually train computers to do them as well. In order to do that, though, we need to ask ourselves some very foundational questions, specifically stemming from how we can build a computer that can, quote unquote, "see". Specifically, how does a computer process an image? Let's use an image as our base example of sight to a computer.

To a computer, images are just numbers: two-dimensional arrays of numbers. Suppose we have a picture, here of Abraham Lincoln. It's made up of what are called pixels, and each pixel is simply a number, represented here in a range of either 0 to 1 or 0 to 255. Since this is a grayscale image, each of these pixels is just one number. If you have a color image, you would represent each pixel by three numbers: a red, a green, and a blue channel (RGB). So what does the computer see? We can represent this image as a two-dimensional matrix of these numbers, one number for each pixel in the image, and this is it: this is how a computer sees an image. If we have an RGB image rather than a grayscale image, we can represent it by a three-dimensional array: three two-dimensional arrays stacked on top of each other, where one of those two-dimensional arrays corresponds to the red channel, one to the green, and one to the blue, representing this RGB image.
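To make this representation concrete, here is a minimal sketch using NumPy (the tiny random arrays simply stand in for a real image, so the sizes and values are assumptions for illustration):

```python
import numpy as np

# A grayscale image is a 2D array: one brightness value per pixel.
# Here we fabricate a tiny 6x6 example with values in [0, 255].
grayscale = np.random.randint(0, 256, size=(6, 6), dtype=np.uint8)
print(grayscale.shape)   # (6, 6) -> height x width
print(grayscale[0, 0])   # a single pixel is just one number

# Rescaling to the [0, 1] range is a common preprocessing step.
grayscale_01 = grayscale.astype(np.float32) / 255.0

# An RGB image is a 3D array: three 2D arrays (red, green, blue)
# stacked along the channel dimension.
rgb = np.random.randint(0, 256, size=(6, 6, 3), dtype=np.uint8)
print(rgb.shape)         # (6, 6, 3) -> height x width x channels
print(rgb[0, 0])         # one pixel is now three numbers: [R, G, B]
```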
Now we have a way to represent images to computers, and we can start to think about what types of computer vision algorithms we can perform with this. There are two very common types of learning tasks, which, like we saw in the first and second lectures, are regression and classification. In regression tasks our output takes the form of a continuous value, and in classification it takes the form of a single class label.

Let's consider first the problem of classification: we want to predict a label for each image. For example, say we have a database of all U.S. presidents, and we want to build a classification pipeline to tell us which president an image is of. We feed the image that you can see on the left-hand side to our model, and we want it to output the probability that this image is of any of the particular presidents that this database consists of. In order to classify these images correctly, though, our pipeline needs to be able to tell what is actually unique about a picture of Abraham Lincoln versus a picture of any other president, like George Washington, or Jefferson, or Obama.

Another way to think about these differences between images in the image classification pipeline is, at a high level, in terms of the features that are really characteristic of a particular class. For example, what are the features that define Abraham Lincoln? Classification is then simply done by detecting the features in a given image: if the features for a particular class are present in the image, then we can predict with pretty high confidence that that class is occurring with a high probability. So if we're building an image classification pipeline, our model needs to, one, know what the features are, and two, be able to detect those features in a brand-new image. For example, if we want to detect human faces, some features that we might want to be able to identify would be noses, eyes, and mouths, whereas if we want to detect cars, we might be looking for things like wheels, license plates, and headlights; and similarly for houses: doors, windows, and steps. These are all examples of features for the larger object categories.

Now, one way to solve this problem is to leverage knowledge about a particular field, say human faces: if we want to detect human faces, we could manually define what we believe those features are in images and use the results of our detection algorithm for classification. But there's a huge problem with this type of approach, and that is that images are just three-dimensional arrays of brightness values, and each image can have a ton of variation. This includes things like occlusions in the scene, variations in illumination (the lighting conditions), and even intra-class variation: variation within the same class of images. Whatever classification pipeline we're building really needs to be invariant to all of these types of variations, while still being sensitive to the inter-class variations: being able to distinguish a feature that is unique to one class from variations of that feature that are present within the class.

Now, even though our pipeline could use features that we as humans define, that is, if a human who knew something about the problem a priori were to manually define, extract, and break down the features they want to detect for a specific task, even then, due to the incredible variability of image data, the detection of these features is still an extremely challenging problem in practice, because the detection algorithm needs to be invariant to all of these variations. So instead of manually defining these features, how can we do better? What we actually want to do is extract features and detect their presence in images automatically, in a hierarchical fashion. This should remind you of the first lecture, when we talked about hierarchy being a core component of deep learning. We can use neural-network-based approaches to learn these visual features directly from data, and to learn a hierarchy of features to construct a representation of the image internal to our network. So again, like we saw in the first lecture, we can detect low-level features and compose them together to build mid-level features, and then, in later layers, higher-level features, to really perform the task of interest. Neural networks will allow us to learn these hierarchies of visual features from data if we construct them cleverly.
This will require us to use some different architectures than what we have seen so far in the class, namely the architectures from the first lecture with feedforward dense layers and, in the second lecture, recurrent layers for handling sequential data. This lecture will focus on yet another way that we can extract features, specifically focusing on the visual domain.

So let's recap what we learned in lecture one. There we learned about fully connected neural networks, also called dense neural networks, where you can have multiple hidden layers stacked on top of each other, and each neuron in each hidden layer is connected to every neuron in the previous layer. Now let's say we want to use a fully connected network to perform image classification, and we're going to motivate the use of something better than this by first starting with what we already know and seeing its limitations. In this case, remember, our input is this two-dimensional image; it's a two-dimensional array, but it can be collapsed into a one-dimensional vector of pixel values if you just stack all of those dimensions on top of each other. What we're going to do is feed that vector of pixel values into our hidden layer, connected to all neurons in the next layer. Here you should already appreciate something: all of the spatial information that we had in this image is gone. It's lost, because by flattening this two-dimensional image into one dimension we have removed any spatial information that we previously had, and by the next layer our network has to relearn all of that very important spatial information, for example that one pixel is close to its neighboring pixels. That's something very important in our input, but it's lost immediately in a fully connected layer.

So the question is: how can we build some structure into our model so that we can inform the learning process and provide some prior information to the model, and help it learn this very complicated and large input image? To do this, let's keep our representation of our image as a two-dimensional array of pixel values; let's not collapse it down into one dimension. One way that we can use the spatial structure is to connect patches of our input, not the whole input but just patches of it, to neurons in the hidden layer. Before, everything was connected from the input layer to the hidden layer, but now we're only going to connect things that are within a single patch to the next neuron in the next layer. That is really to say that each neuron only sees, if we look at this output neuron, the values coming from the patch that precedes it. This will not only reduce the number of weights in our model, but it's also going to allow us to leverage the fact that, in an image, spatially close pixels are likely to be somewhat related and correlated to each other, and that's a fact that we should really take into account. So notice how only a small region of the input layer influences this output neuron, and that's because of this spatially connected idea that we want to preserve as part of this architecture. To define connections across the whole input, we can then apply the same principle of connecting patches in our input layer to single neurons in the subsequent layer.
We can do this by sliding that patch across the input image, and for each position of the patch we're going to have a new output neuron in the subsequent layer. This way we can take into account some of the spatial structure inherent to our input. But remember that our ultimate task is not only to preserve spatial structure but to actually learn the visual features, and we do this by weighting the connections between the patches and the neurons so that we can detect particular features: each patch is going to try to perform that detection of a feature. So now we ask ourselves: how can we weight this patch such that we can detect those features?

In practice, there's an operation called a convolution, and we'll first think about this at a high level. Suppose we have a 4x4 patch, or filter, which will consist of 16 weights. We're going to apply this same filter to 4x4 patches in the input and use the result of that operation to define the state of the neuron in the next layer: the output of that single neuron is defined by applying this filter, of equal size and with learned weights, to the patch. We're then going to shift the patch over, let's say in this case by two pixels, to grab the next patch and thereby compute the next output neuron. This is how we can think about convolutions at a very high level. But you're probably wondering: how does the convolution operation actually allow us to extract features? I want to make this really concrete by walking through a very simple example.

Suppose we want to classify the letter X in a set of black-and-white images of letters, where black is equal to negative one and white is equal to positive one. To classify, it's clearly not possible to simply compare the two images, the two matrices, on top of each other and ask whether they are equal, because we also want to classify this X even if it has some slight deformations, if it's shifted, enlarged, rotated, or deformed. We want to build a classifier that is robust to all of these changes. So how can we do that? We want to detect the features that define an X. We want our model to compare images of an X piece by piece, and the really important pieces that it should look for are exactly what we've been calling the features. If our model can find those important, rough features that define the X in roughly the same positions, then it can get a lot better at understanding the similarity between different examples of X, even in the presence of these types of deformations.

So let's suppose each feature is like a mini image: it's a patch, a small two-dimensional array of values, and we'll use these filters to pick up on the features common to X's. In the case of this X, for example, the filters we might want to pay attention to might represent things like the diagonal lines on the edges, as well as the crossing point you can see in the second patch here. So we'll probably want to capture these features in the arms and the center of the X in order to detect all of these different variations. Note that these smaller matrices, the filters you can see on the top row here, represent the filters of weights that we're going to use as part of our convolution operation in order to detect the corresponding features in the input image.
in order to detect the  corresponding features in the input image   so all that's left for us to define is actually  how this convolution operation actually   looks like and how it's able to pick up on these  features given each of these in this case three   filters so how can it detect given a filter where  this filter is occurring or where this feature is   occurring rather in this image and that is  exactly what the operation of convolution is   all about convolution the idea of convolution  is to preserve the spatial relationship between   pixels by learning image features in small little  patches of image data now to do this we need to   perform an element-wise multiplication between  the filter matrix and the patch of the input image   of same dimension so if we have a patch  of 3x3 we're going to compare that to an   input filter or our filter which is also of  size 3x3 with learned weights so in this case   our filter which you can see on the top left all  of its entries are of either positive one or one   or negative one and when we multiply this filter  by the corresponding green input image patch   and we element wise multiply  we can actually see the result   in this matrix so multiplying all of the positive  ones by positive ones we'll get a positive one   multiplying a negative one by a negative one will  also get a positive one so the result of all of   our element-wise multiplications is going to be  a three by three matrix of all ones now the next   step in as part of the convolution operation is  to add all of those element-wise multiplications   together so the result here after we add those  outputs is going to be 9. so what this means now   actually so actually before we  get to that let me start with   another very brief example suppose we want  to compute the convolution now not of a   very large image but this is just of a five by  five image our filter here is three by three so   we can slide this three by three filter over the  entirety of our input image and performing this   element-wise multiplication and then adding  the outputs let's see what this looks like so   let's start by sliding this filter over the top  left hand side of our input we can element wise   multiply the entries of this patch of this filter  with this patch and then add them together and   for this part this three by three filter is  placed on the top left corner of this image   element-wise multiply add and we get this  resulting output of this neuron to be four and we can slide this filter over one one spot by  one spot to the next patch and repeat the results   in the second entry now would be corresponding  to the activation of this filter applied to   this part of the image in this case three and  we can continue this over the entirety of our   image until the end when we have completely filled  up this activation or feature map and this feature   map really tells us where in the input image was  activated by this filter so for example wherever   we see this pattern conveyed in the original input  image that's where this feature map is going to   have the highest value and that's where we need  to actually activate maximally now that we've   gone through the mechanism of the convolution  operation let's see how different filters can be   used to produce feature maps so picture a woman  of a woman a picture this picture of a woman's   face this woman's name is lena and the output of  applying these three convolutional filters so you   can see the three filters that we're considering  on 
Now that we've gone through the mechanics of the convolution operation, let's see how different filters can be used to produce different feature maps. Take this picture of a woman's face (the woman's name is Lena, a classic test image). The outputs of applying three different convolutional filters are shown here, with each filter shown in the bottom right-hand corner of its image. Simply by changing the weights of these filters, each filter here has a different set of weights, we can detect very different features in the image: we can sharpen the image by applying this very specific sharpening filter, we can detect edges, or we can detect very strong edges, simply by modifying the filters. Now, these particular filters are not learned filters; they are hand-constructed, and historically there has been a ton of research on hand-engineering such filters. What convolutional neural networks do instead is learn the weights defining these filters: the network learns what kind of features it needs to detect in the image. Does it need to do edge detection, or strong edge detection, or does it need to detect certain types of edges, curves, certain geometric objects, and so on? What are the features that it needs to extract from this image? By learning the convolutional filters, it's able to do exactly that.

I hope you can now appreciate how convolution allows us to capitalize on very important spatial structure, to use sets of weights to extract very local features in the image, and to very easily detect different features simply by using different sets of weights, different filters. These concepts of preserving spatial structure and local feature extraction using the convolution operation are core to the convolutional neural networks used for computer vision tasks, and that's exactly what I want to dive into next. Now that we have the mathematical foundation of convolutions under our belts, we can start to think about how we can utilize this operation to actually build neural networks for computer vision tasks, and tie this whole thing in to the paradigm of learning that we've been exposed to in the first couple of lectures. These networks are aptly named convolutional neural networks, and first we'll take a look at a CNN designed specifically for the task of image classification.

So how can you use CNNs for classification? Let's consider a simple CNN whose goal is to learn features directly from the image data, and to use these learned features to map onto a classification task for these images. There are three main operations that are core to a CNN. The first is what we've already gotten some exposure to in the first part of this lecture: the convolution operation, which, like we saw earlier, allows us to generate feature maps and detect features in our image. The second is applying a non-linearity; we saw the importance of non-linearities in the first and second lectures, and here they help us deal with the fact that the features we extract are highly non-linear. Thirdly, we need to apply some sort of pooling operation, which is another word for a downsampling operation; this allows us to scale down the size of each feature map. The computation of class scores, which is what we're doing when we define an image classification task, is then performed using the features that we obtain through convolution, non-linearity, and pooling, by passing those learned features into a fully connected network, or dense layer, like we learned about in the first lecture.
first part of the class in   the first lecture and we can train this model end  to end from image input to class prediction output   using fully connected layers and convolutional  layers end to end where we learn as part of the   convolutional layers the sets of weights of the  filters for each convolutional layer and as well   as the weights that define these fully connected  layers that actually perform our classification   task in the end and we'll go through each one of  these operations in a bit more detail to really   break down the basics and the architecture  of these convolutional neural networks so first we'll consider the convolution  operation of a cnn and as before each neuron   in the hidden layer will compute a weighted  sum of each of its inputs like we saw in the   dense layers we'll also need to add on a bias  to allow us to shift the activation function   and apply and activate it with some non-linearity  so that we can handle non-linear data   relationships now what's really special here is  that the local connectivity is preserved each   neuron in the hidden layer you can see in the  middle only sees a very specific patch of its   inputs it does not see the entire input neurons  like it would have if it was a fully connected   layer but no in this case each neuron output  observes only a very local connected patch as   input we take a weighted sum of those patches we  compute that weighted sum we apply a bias and we   apply and activate it with a non-linear  activation function and that's the   feature map that we're left with at the end of a  convolutional layer we can now define this actual   operation more concretely using a mathematical  equation here we're left with a 4x4 filter matrix   and for each neuron in the hidden layer its  inputs are those neurons in the patch from the   previous layer we apply this set of weights wi  j in this case like i said it's a four by four   filter and we do this element-wise multiplication  of every element in w multiplied by the   corresponding elements in the input x we add the  bias and we activate it with this non-linearity   remember our element-wise multiplication  and addition is exactly that convolutional   operation that we talked about earlier so if you  look up the definition of what convolution means   it is actually that exactly it's element-wise  multiplication and then a summation of all of   the results and this actually defines also how  convolutional layers are connected to these ideas   but with this single convolutional layer we  can how can we have multiple filters so all   we saw in the previous slide is how we can take  this input image and learn a single feature map   but in reality there are many types of features  in our image how can we use convolutional layers   to learn a stack or many different types of  features that could be useful for performing   our type of task how can we use this to do  multiple feature extraction now the output layer   is still convolution but now it has a volume  dimension where the height and the width are   spatial dimensions dependent upon  the dimensions of the input layer   the dimensions of the filter  the stride how how much we're   skipping on each each time that we apply the  filter but we also need to think about the   the connections of the neurons in these layers  in terms of their what's called receptive field   the locations of their input in the in the  in the model in in the path of the model that   they're connected to now these parameters actually  define 
the spatial arrangement of how the neurons   are connected in the convolutional layers and how  those connections are really defined so the output   of a convolutional layer in this case will have  this volume dimension so instead of having one   filter map that we slide along our image now we're  going to have a volume of filters each filter   is going to be slid across the image and compute  this convolution operation piece by piece for each   filter the result of each convolution operation  defines the feature map that that convolution that   that filter will activate maximally so now we're  well on our way to actually defining what a cnn is   and the next step would actually be to apply that  non-linearity after each convolution operation we   need to actually apply this non-linear activation  function to the output volume of that layer and   this is very very similar like i said in the  first and we saw also in the second lecture   and we do this because image data is highly  nonlinear a common example in the image domain   is to use an activation function of relu which  is the rectified linear unit this is a pixel-wise   operation that replaces all negative values with  zero and keeps all positive values with whatever   their value was we can think of this really as a  thresholding operation so anything less than zero   gets thresholded to zero negative values indicate  negative detection of a convolution but this   nonlinearity actually kind of uh clamps that to  some sense and that is a nonlinear operation so   it does satisfy our ability to learn non-linear  dynamics as part of our neural network model   so the next operation in convolutional  neural networks is that of pooling   pooling is an operation that is commonly used to  reduce the dimensionality of our inputs and of   our feature maps while still preserving spatial  invariants now a common technique and a common   type of pooling that is commonly used in practice  is called max pooling as shown in this example   max pooling is actually super simple and intuitive  uh it's simply taking the maximum over these two   by two filters in our patches and sliding that  patch over our input very similar to convolutions   but now instead of applying a element-wise  multiplication and summation we're just simply   going to take the maximum of that patch so in  this case as we feed over this two by two patch of   filters and striding that patch by a factor of two  across the image we can actually take the maximum   of those two by two pixels in our input and that  gets propagated and activated to the next neuron   now i encourage all of you to really think  about some other ways that we can perform   this type of pooling while still making sure that  we downsample and preserve spatial invariants   taking the maximum over that patch is  one idea a very common alternative is   also taking the average that's called mean pooling  taking the average you can think of actually   represents a very smooth way to perform the  pooling operation because you're not just taking   a maximum which can be subject to maybe outliers  but you're averaging it or also so you will get a   smoother result in your output layer but they  both have their advantages and disadvantages so these are three operations three  key operations of a convolutional   neural network and i think now we're actually  ready to really put all of these together and   start to construct our first convolutional  neural network end to end and with cnns   just to remind you once again 
we can layer  these operations the whole point of this   is that we want to learn this hierarchy  of features present in the image data   starting from the low-level features composing  those together to mid-level features and then   again to high-level features that can be used  to accomplish our task now a cnn built for image   classification can be broken down into two parts  first the feature learning part where we actually   try to learn the features in our input image  that can be used to perform our specific task   that feature learning part is actually done  through those pieces that we've been seeing so far   in this lecture the convolution the non-linearity  and the pooling to preserve the spatial invariance now the second part the convolutional layers and  pooling provide output those the output excuse me   of the first part is those high-level features of  the input now the second part is actually using   those features to perform our classification  or whatever our task is in this case   the task is to output the class probabilities that  are present in the input image so we feed those   outputted features into a fully connected or dense  neural network to perform the classification we   can do this now and we don't mind about losing  spatial invariance because we've already down   sampled our image so much that it's not really  even an image anymore it's actually closer to a   vector of numbers and we can directly apply our  dense neural network to that vector of numbers   it's also much lower dimensional now and we  can output a class of probabilities using a   function called the softmax whose output actually  represents a categorical probability distribution   it's summed uh equal to one so it does  make it a proper categorical distribution   and it is each element in this is strictly between  zero and one so it's all positive and it does sum   to one so it makes it very well suited for the  second part if your task is image classification   so now let's put this all together what does a  end-to-end convolutional neural network look like   we start by defining our feature extraction head  which starts with a convolutional layer with 32   feature maps a filter size of 3x3 pixels and we  downsample this using a max pooling operation   with a pooling size of 2 and a stride of 2. 
this  is very exactly the same as what we saw when we   were first introducing the convolution operation  next we feed these 32 feature maps into the next   set of the convolutional convolutional and pooling  layers now we're increasing this from 32 feature   maps to 64 feature maps and still down scaling our  image as a result so we're down scaling the image   but we're increasing the amount of features  that we're detecting and that allows us to   actually expand ourselves in this dimensional  space while down sampling the spatial information   the irrelevant spatial information now finally now  that we've done this feature extraction through   only two convolutional layers in this case we can  flatten all of this information down into a single   vector and feed it into our dense layers and  predict these final 10 outputs and note here that   we're using the activation function of softmax  to make sure that these outputs are a categorical   distribution okay awesome so so far we've talked  about how we can use cnns for image classification   tasks this architecture is actually so powerful  because it extends to a number of different tasks   not just image classification and the reason for  that is that you can really take this feature   extraction head this feature learning part and  you can put onto the second part so many different   end networks whatever and network you'd like to  use you can really think of this first part as   a feature learning part and the second part as  your task learning part now what that task is   is entirely up to you and what you desire so and  that's that's really what makes these networks   incredibly powerful so for example we may want  to look at different image classification domains   we can introduce new architectures for  specifically things like image and object   detection semantic segmentation and even  things like image captioning you can use   this as an input to some of the sequential  networks that we saw in lecture two even so let's look at and dive a bit deeper into  each of these different types of tasks that   we could use are convolutional neural networks  for in the case of classification for example   there is a significant impact in medicine and  healthcare when deep learning models are actually   being applied to the analysis of entire inputs of  medical image scans now this is an example of a   paper that was published in nature for actually  demonstrating that a cnn can outperform expert   radiologists at detecting breast cancer directly  from mammogram images instead of giving a binary   prediction of what an output is though cancer or  not cancer or what type of objects for example in   this image we may say that this image is an image  of a taxi we may want to ask our neural network to   do something a bit more fine resolution and tell  us for this image can you predict what the objects   are and actually draw a bounding box localize this  image or localize this object within our image   this is a much harder problem since there may  be many objects in our scene and they may be   overlapping with each other partially occluded  etc so not only do we want to localize the object   we want to also perform classification on that  object so it's actually harder than simply the   classification task because we still have to  do classification but we also have to detect   where all of these objects are in addition  to classifying each of those objects   now our network needs to also be flexible and  actually and be able to infer not 
just potentially   one object but a dynamic number of objects in the  scene now if we if we have a scene that only has   one taxi it should output a bounding box over just  that single taxi and the bounding box should tell   us the xy position of one of the corners and maybe  the height and the width of that bounding box as   well that defines our bounding box on the other  hand if our scene contains many different types   of objects potentially even of different types of  classes we want our network to be able to output   many different outputs as well and be flexible  to that type of differences in our input even   with one single network so our network should not  be constrained to only outputting a single output   or a certain number of outputs it needs to have a  flexible range of how we can dynamically infer the   objects in the scene so what is one maybe naive  solution to tackle this very complicated problem   and how can cnns be used to do that so what we  can do is start with this image and let's consider   the simplest way possible to do this  problem we can start by placing a random box   over this image somewhere in the image it has  some random location it also has a random size   and we can take that box and feed it through  our normal image classification network like   we saw earlier in the lecture this is  just taking a single image or it's now   a sub image but it's still a single image and it  feeds that through our network now that network is   tasked to predict what is the what is the class  of this image it's not doing object detection   and it predicts that it has some class if there is  no class of this box then it simply can ignore it   and we repeat this process then we pick another  box in the scene and we pass that through the   network to predict its class and we can keep doing  this with different boxes in the scene and keep   doing it and over time we can basically have many  different class predictions of all of these boxes   as they're passed through our classification  network in some sense if each of these boxes give   us a prediction class we can pick the boxes that  do have a class in them and use those as a box   where an object is found if no object is found we  can simply discard it and move on to the next box   so what's the problem with this well one is that  there are way too many inputs the this basically   results in boxes and considering a number of  boxes that have way too many scales way too   many positions too many sizes we can't possibly  iterate over our image in all of these dimensions   and and and have this as a naive solute and have  this as a solution to our object detection problem   so we need to do better than that so instead of  picking random boxes or iterating over all of the   boxes in our image let's use a simple heuristic  method to identify some places in the image   that might contain meaningful objects  and use these to feed through our model   but still even with this uh extraction of region  proposals the the rest of the store is the exact   same we extract the region of proposal and we feed  it through the rest of our network we warp it to   be the correct size and then we feed it to our  classification network if there's nothing in that   box we discard it if there is then we keep it and  say that that box actually contained this image   but still this has two very important problems  that we have to consider one is that it's still   super super slow we have to feed in each region  independently to the model so if we 
extract   in this case 2000 regions we have here we have to  feed this we have to run this network 2 000 times   to get the answer just for the single image it  also tends to be very brittle because in practice   how are we doing this region proposal well  it's entirely heuristic based it's not being   learned with a neural network and it's also  even more importantly perhaps perhaps it's   detached from the feature extraction part so  our feature extraction is learning one piece   but our region proposal piece of the network or  of this architecture is completely detached so   the model cannot learn to predict regions  that may be specific to a given task that   makes it very brittle for some applications  now many variants have been proposed to   actually tackle and tackle some of these issues  and advance this forward to accomplish object   detection but i'd like to touch on one extremely  quickly just to point you on in this direction   for those of you who are interested and that's the  faster rcnn method to actually learn these region   proposals the idea here is instead of feeding  in this image to a heuristic based feedback or   region proposal network or method we can have a  part of our network that is trained to identify   the proposal regions of our model of our image  and that allows us to directly understand or   identify these regions in our original image where  there are candidate patches that we should explore   for our classification and our for our object  detection now each of these regions then are   processed with their own feature extractor as  part of our neural network and individuals or   in their cnn heads then after these features for  each of these proposals are extracted we can do   a normal classification over each of these  individual regions very similar as before   but now the huge advantage of this is that it only  requires a single forward pass through the model   we only feed in this image once we have a region  proposal network that extracts the regions   and all of these regions are fed on to perform  classification on the rest of the image   so it's super super fast compared to the previous  method so in classification we predict one class   for an entire image of the model in object  detection we predict bounding boxes over all   of the objects in order to localize them and  identify them we can go even further than this   and in this idea we're still using cnns to predict  this predict this output as well but instead of   predicting bounding boxes which are rather coarse  we can task our network to also here predict   an entire image as well now one example  of this would be for semantic segmentation   where the input is an rgb an image just a normal  rgb image and the output would be pixel-wise   probabilities for every single pixel what is  the probability that it belongs to a given class   so here you can see an example of this image  of some two cows on the on some grass being   fed into the neural network and the neural  network actually predicts a brand new image   but now this image is not an rgb image it's a  semantic segmentation image it has a probability   for every single pixel it's doing a classification  problem and it's learning to classify every single   pixel depending on what class it thinks it is  and here we can actually see how the cow pixels   are being classified separately from the grass  pixels and sky pixels and this output is actually   created using an up sampling operation not a down  sampling operation but up sampling 
to allow the   convolutional decoder to actually increase its  spatial dimension now these layers are the analog   you could say of the normal convolutional layers  that we learned about earlier in the lecture   they're also already implemented in tensorflow so  it's very easy to just drop these into your model   and allow your model to learn how to actually  predict full images in addition or instead of   single class probabilities this semantic  segmentation idea is extremely powerful   because it can be also applied to many  different applications in healthcare as well   especially for segmenting for example  cancerous regions on medical scans or   even identifying parts of the blood that are  infected with diseases like in this case malaria let's see one final example here of how we can  use convolutional feature extraction to perform   yet another task this task is different from  the first three that we saw with classification   object detection and semantic segmentation now  we're going to consider the task of continuous   robotic control here for self-driving cars  and navigating directly from raw vision data   specifically this model is going to take as  input as you can see on the top left hand side   the raw perception from the vehicle this is  coming for example from a camera on the car and   it's also going to see a noisy representation of  street view maps something that you might see for   example from google maps on your smartphone and it  will be tasked not to predict the classification   problem or object detection but rather learn  a full probability distribution over the space   of all possible control commands that this  vehicle could take in this given situation   now how does it do that actually this entire model  is actually using everything that we learned about   in this lecture today it can be trained end to  end by passing each of these cameras through their   dedicated convolutional feature extractors and  then basically extracting all of those features   and then concatenating them flattening them down  and then concatenating them into a single feature   extraction vector so once we have this entire  representation of all of the features extracted   from all of our cameras and our maps we can  actually use this representation to predict the   full control parameters on top of a deterministic  control given to the desired destination of the   vehicle this probabilistic control is very  powerful because here we're actually learning   to just optimize a probability distribution over  where the vehicle should steer at any given time   you can actually see this probability distribution  visualized on this map and it's optimized simply   by the negative log likelihood which is the  negative log likelihood of this distribution which   is a normal a mixture of normal distributions and  this is nearly identical to how you operate in   classification as well in that domain you  try to minimize the cross-entropy loss   which is also a negative log likelihood  optim or probability function   so keep in mind here that this is composed of  the convolutional layers to actually perform this   feature extraction these are exactly the same  as what we learned about in this lecture today   as well as these flattening pooling layers and  concatenation layers to really produce this single   representation and feature vector of our inputs  and finally it predicts these outputs in this case   a continuous representation of control that this  vehicle should take so this is really 
This is really powerful because a human can enter the car, input a desired destination, and the end-to-end CNN will output the control commands to actuate the vehicle toward that destination. Note here that the vehicle is able to successfully recognize when it approaches intersections and take the correct control commands to navigate through brand-new environments that it has never seen before and never driven in its training data set. The impact of CNNs has been very wide-reaching beyond the examples I've explained here today, and it has touched so many different fields, especially in computer vision.

I'd like to conclude this lecture by taking a look at what we've covered, and we've really covered a ton of material today. We covered the foundations of computer vision: how images are represented as arrays of brightness values, and how convolutions work. We saw how we can build up these convolutions into the basic architecture defining convolutional neural networks, and discussed how CNNs can be used for classification. Finally, we talked about many of the extensions and applications of these basic convolutional neural network architectures: using them as a feature extraction module and then building your task of interest on top, and a bit about how we can visualize the behavior of our network and understand a little of what it's doing under the hood, for example through semantic segmentation maps, which give a more fine-grained, high-resolution classification of the input images it is seeing.

With that, I would like to conclude this lecture and point everyone to the upcoming lab, which will be focused specifically on computer vision. You'll get very familiar with many of the algorithms we've talked about today, starting with building your first convolutional neural networks and then building this up into a facial detection system, and you'll learn how we can use unsupervised generative models, like we're going to see in the next lecture, to make sure that these facial classification algorithms are fair and unbiased. So stay tuned for the next lecture on unsupervised generative modeling to get more details on that second part. Thank you.
Info
Channel: Alexander Amini
Views: 93,398
Rating: 4.9852862 out of 5
Keywords: deep learning, mit, artificial intelligence, neural networks, machine learning, 6s191, 6.s191, mit deep learning, ava soleimany, soleimany, alexander amini, amini, lecture 2, tensorflow, computer vision, deep mind, openai, basics, introduction, deeplearning, ai, tensorflow tutorial, what is deep learning, deep learning basics, cnn, convolutional, convolution, vision, self driving, autonomous vehicles, machine vision, image processing, semantic segmentation
Id: AjtX1N_VT9E
Length: 55min 57sec (3357 seconds)
Published: Fri Feb 19 2021