Capsule Networks: An Improvement to Convolutional Networks

Video Statistics and Information

Captions
hello world, it's Siraj! Geoffrey Hinton is one of the godfathers of deep learning. In the 80s he popularized the backpropagation algorithm, which is the reason deep learning works so well, and he believed in the idea of neural networks during the first major AI winter, while everyone else didn't think they would work; that's why he's awesome. Anyway, Hinton recently published a paper on this idea of a capsule network, which he's been hinting at for a long time (hinting, Hinton, yes). He's been hinting at it in the machine learning subreddit and in talks he's been giving, but a few days ago he finally published the paper, so I was very excited to read it and talk about it. It offers state-of-the-art performance on the MNIST dataset, the handwritten digit dataset: it's kind of the baseline, and you probably know about it if you've done any kind of AI before. The task is classifying handwritten characters as the digits they are, and capsule networks beat convolutional networks at this, and convolutional networks are the state of the art, so it's a really exciting time right now. What this video is, is me talking about what the current state-of-the-art algorithm, the convolutional network, is, and all the developments that have happened in convolutional networks; then I'll talk about capsules and how they work, and we'll end the video with me going through the TensorFlow code of a capsule network. Right, so here's an image of a convolutional neural network. If we used a standard multilayer perceptron, with all the layers fully connected to each other, it would quickly become computationally intractable, because images are very high-dimensional: there are lots of pixels, and if we continuously applied these operations to every single pixel in an image, in every layer, it would take way too long. The solution was to use a convolutional network, and this was really popularized by Yann LeCun, who
is now director of AI research at Facebook, back in the early 90s. The convolutional network looks like this. First we have an input image, and the input image has an associated label: if it's a picture of a car, like we see here, it's going to have the associated label "car". You do that for an entire dataset, and what a convolutional network does is learn the mapping between the input data and the output labels. The idea is that eventually, after training, you'll give it a picture of a car and it will know "hey, that's a car", because it has learned the mapping. When we first feed this network a car, the image first goes through the convolutional layer, which is basically a series of matrix multiplications followed by a summation operation. I have a video on how convolutional networks work, really in depth, that I'm going to link to in the video description under "more learning resources", but right now I'll go over it at a high level. A convolutional layer is kind of like a flashlight being moved over every single pixel in the image, looking for the most relevant parts of that image; it's a multiplication and a summation operation. The convolutional layer outputs a feature map, which represents a set of features it has learned from that image, each represented by a matrix. Once we have these features, learned by this filtering operation, we apply a non-linearity to them, like a rectified linear unit (ReLU), for example. Applying a non-linearity serves several purposes: the first is so that the network can learn both linear and nonlinear functions (neural networks are universal function approximators). But for the rectified linear unit in particular, one of the reasons for using it, as opposed to other types of nonlinearities, is that it helps solve the vanishing gradient problem during backpropagation.
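As a rough illustration of that multiply-and-sum "flashlight", here's a minimal NumPy sketch of a single-channel valid convolution; the image and filter values are made up purely for illustration:

```python
import numpy as np

def conv2d(image, kernel):
    """Slide the kernel over the image; at each position,
    multiply element-wise and sum (a 'valid' convolution)."""
    ih, iw = image.shape
    kh, kw = kernel.shape
    out = np.zeros((ih - kh + 1, iw - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            patch = image[i:i + kh, j:j + kw]
            out[i, j] = np.sum(patch * kernel)  # multiply, then sum
    return out

image = np.array([[1., 2., 3.],
                  [4., 5., 6.],
                  [7., 8., 9.]])
edge_filter = np.array([[1., -1.],
                        [1., -1.]])  # a toy vertical-edge detector
feature_map = conv2d(image, edge_filter)  # one 2x2 feature map
```

Real layers apply many such filters at once, one feature map per filter.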
When we forward propagate through a network, we perform a series of operations: we find the output class probability, we compare that to the actual label, we compute an error value, and we use the error to compute a gradient; the gradient tells us how to update our weights as we backpropagate through the network. What ReLU does is help with the vanishing gradient problem: sometimes, as the gradient is backpropagated, it gets smaller and smaller and smaller, so the weight update gets smaller and smaller, and this is not good; we want a meaningful weight update, we don't want it to vanish, and ReLU helps prevent the gradient from vanishing. At the end of this convolutional block is the pooling operation. Say you have a matrix of pixel values, a bunch of numbers for pixel intensities between 0 and 255. What pooling does (max pooling, for example) is create sections over all of these pixel values, take only the maximum pixel value from each section, and propagate that forward; so a smaller map is propagated forward, and this speeds up training. If we look at convnets, anyone can really implement a convolutional network these days with libraries like Keras, which is very high level: anyone can implement a very powerful convolutional network in just a few lines of code, where each line of code corresponds to a layer in the network. Once you've defined your network, each layer with its own line of code, you can compile it, define the optimizer and the loss function, train it with the fit function, and then evaluate it on your testing data.
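Both operations from that block, ReLU and max pooling, are short enough to sketch in NumPy; the feature-map values here are made up:

```python
import numpy as np

def relu(x):
    # pass positives through unchanged, zero out negatives
    return np.maximum(x, 0)

def max_pool(x, size=2):
    """Split x into size-by-size sections and keep only the
    maximum value from each section."""
    h, w = x.shape
    x = x[:h - h % size, :w - w % size]  # trim to a multiple of size
    return x.reshape(h // size, size, w // size, size).max(axis=(1, 3))

fmap = np.array([[1., -2.,  3., 0.],
                 [4.,  5., -6., 7.],
                 [-1., 0.,  2., 1.],
                 [3.,  2.,  0., 4.]])
activated = relu(fmap)        # negatives become 0
pooled = max_pool(activated)  # 4x4 -> 2x2, one max per 2x2 section
```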
I've got this huge graphic here of the modern history of object recognition, which I'm not going to go through in detail, but it's definitely something to check out after you watch this video; I've got the link to it in my slides in the video description. It's a really detailed image of the history of convolutional networks across all the different ImageNet competitions and all the improvements that have been made, but I will go through some that I think are really significant. One of the first improvements to CNNs was the AlexNet network, in 2012. There were some key improvements here. The first was that it used ReLU, which, as I said, helps prevent the vanishing gradient problem. It also introduced the concept of dropout: a technique where neurons are randomly turned on and off in each layer to prevent overfitting. If a network is overfit to your training data, it won't be able to classify images that are similar but different. So dropout is a regularization technique that prevents overfitting by randomly turning neurons on and off; by doing this, the data is forced to find new pathways through the network, and because of that, the network generalizes better. AlexNet also used the idea of data augmentation. Convolutional networks, as awesome as they are, are really bad at classifying images in different rotations, or upside down; an object has to be in roughly the exact position it was trained on, or very close to it. So what the authors of AlexNet did was feed in different rotations of the images, rather than just a single rotation, and that made the network generalize better to different rotations. Lastly, AlexNet was a deeper network: they just added more layers, and this improved classification accuracy.
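As a sketch of the dropout idea just described, here's a minimal "inverted dropout" in NumPy; the keep probability and the all-ones activations are illustrative, and the rescaling by 1/keep_prob is how frameworks typically keep the expected activation unchanged:

```python
import numpy as np

def dropout(activations, keep_prob=0.8, rng=None):
    """Randomly zero out neurons; scale the survivors so the
    expected activation stays the same (inverted dropout)."""
    if rng is None:
        rng = np.random.default_rng(0)  # seeded only for this demo
    mask = rng.random(activations.shape) < keep_prob
    return activations * mask / keep_prob

acts = np.ones((4, 4))
dropped = dropout(acts, keep_prob=0.5)
# each entry is either 0.0 (dropped) or 2.0 (survivor, rescaled)
```

At test time dropout is simply turned off; every neuron participates.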
After that there was VGGNet, which was a major improvement, and really the only big difference there was adding more layers; that was it. After that there was GoogLeNet, which looks like this: convolutions with different filter sizes are processed on the same input and then concatenated together. So in a single layer, rather than going through just one convolutional operation, or set of operations (remember, it's multiplications and then a summation), it does several of those together (multiply and sum, multiply and sum, multiply and sum), then takes the outputs of all of those, concatenates them, and propagates that forward. By doing this, it could learn better feature representations at each layer. And then there was ResNet. The idea behind ResNet was: if we just keep stacking layers, does the network get better every time? The answer was no: it gets better up to a certain point, and then, if you add more, there's a drop in performance. So what ResNet said was: OK, we know there is some optimal limit to the number of layers we should add, so after every two layers, let's add an element-wise addition operation. It just added this one operation, and it improved gradient propagation: it made backpropagation easier and helped further mitigate the vanishing gradient problem. After that there was DenseNet, which proposed connecting entire blocks of layers to one another, a more complex connection scheme. So there are some patterns here across all of these networks: the networks are designed to be deeper and deeper; there are also computational tricks being added on, like ReLU, or dropout, or batch normalization, that improve performance; and lastly, there is an increasing use of connections between the layers of the network.
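ResNet's element-wise addition is easy to sketch: the block's input is added back to the output of a couple of layers, giving gradients a direct path around them. Here toy_layers is a made-up stand-in for the real conv layers:

```python
import numpy as np

def residual_block(x, transform):
    """Output = F(x) + x: the identity 'skip connection' lets
    gradients flow straight through, easing very deep training."""
    return transform(x) + x

def toy_layers(x):
    # stand-in for conv -> ReLU -> conv (not a real conv stack)
    return np.maximum(0.1 * x, 0)

x = np.array([1.0, -2.0, 3.0])
y = residual_block(x, toy_layers)
# y = max(0.1*x, 0) + x = [1.1, -2.0, 3.3]
```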
But Hinton said: OK, there is a problem with convolutional networks. Remember that convolutional networks learn to classify images hierarchically. In the lowest layers of a convolutional network, it learns the lowest-level features of what it's seeing; for dogs, for example, in the lowest layer it's going to learn edges, then maybe the curvature of a dog's ear, or a single tooth. As we go up the hierarchy, to the next layer and the next, the features it learns get more complex: in the first layer there are edges, in the next layer shapes, in the next layer more complex shapes like an entire ear, and finally, in the last layer, very complex shapes like the entire body of a dog. This is very similar to how the human visual cortex works. We know for certain that there is some kind of hierarchy happening whenever we look at something; there's a hierarchy of neurons firing in order when we try to recognize something we see. That's the high level of what we know: we don't know the exact intricate details of the routing mechanism between layers, but we do know there is some kind of hierarchy between each layer. So, there is a problem with convolutional networks, and there are two reasons. First of all, subsampling loses the precise spatial relationships between higher-level parts, such as a nose and a mouth. It's not enough to just be able to detect a nose and a mouth: if you have a nose in the left corner of an image, a mouth in the right corner, and eyes at the bottom, you can't just say "oh, it has these three features, it must be a face".
No: there's also a spatial relationship (the eyes have to be above the nose, which has to be above the mouth), but subsampling, or pooling, loses this relationship. Second, convolutional networks can't extrapolate their understanding of geometric relationships to radically new viewpoints. Like I said before, they're really bad at detecting an object in a different position: if it's rotated, upside down, shifted left or right, it has to be in roughly the position of the images it was trained on, and this is a problem. The idea is that instead of invariance, we should be striving for equivariance. The original goal of subsampling, or pooling (same thing), is to make the neural activities invariant to small changes in viewpoint: no matter what position or rotation an image is in, the neural network responds in the same way; the data flow is the same. But it's better to aim for equivariance, which means that if we rotate an image, the neural network's activity should also change, adapting to how it processes that image. So what we need is a network that's more robust to changes in how images are positioned and transformed, and we also need networks that are more easily able to generalize; that's a general thing in all of AI: we need algorithms that are better able to generalize to data they've never seen before, based on what they were trained on. And there's one more thing: a paper that came out just six days ago, "One Pixel Attack for Fooling Deep Neural Networks". Basically, the authors found that they could tweak just a few pixels in an image that otherwise looks like one the network would classify perfectly, like a dog, and by changing just those few
pixels, they found that the entire network's classification fell apart, and this is a problem. That's just not how it should be; it doesn't make sense that it's that susceptible to an attack. If we're thinking about creating self-driving cars, these huge machines flying down the road, using computer vision to detect things, they can't be susceptible to these kinds of pixel attacks; they've got to be very robust. So Hinton introduced this idea of the capsule network, and here is an image of it. The basic idea Hinton had was that the human brain must achieve translational invariance in a much better way; it's got to do something other than pooling. He posits that the brain has modules, which he calls capsules, that are really good at handling different types of visual stimuli and encoding things in a certain way. CNNs do routing by pooling: they route how data is transferred across each layer via this pooling operation. If we input an image, we apply a convolutional operation to it, then a non-linearity, and then we pool; based on the output of that pooling layer, the data is going to go in a certain direction and hit certain units in the next layer. But pooling is a very crude way to route data; there's got to be a better way. The basic idea behind a capsule network is that it's a neural network where, instead of just adding another layer (usually we're adding different types of layers), it nests a new layer inside a layer. That's all it is: instead of saying "OK, we've got this single-layer neural network, let's add a different layer", we put another layer inside of that layer. It's a nested layer inside of a layer, and that nested layer is called a capsule, which is a group of neurons,
and so a typical layer of neurons or units becomes a layer of capsules. So instead of making the network deeper in the sense of height (or, I guess you could even say, width), it's making it deeper in terms of nesting, or inner structure; that's basically all it is. This model is more robust to transformations such as rotation, and it achieved state of the art for MNIST, which really is a big deal; we'll see how it scales later on to really huge datasets, but for MNIST, that's pretty significant. The capsule network has two really key features: the first is layer-based squashing, and the second is dynamic routing. In a typical neural network, the output of each single unit is squashed by a non-linearity: we have a set of output neurons, and we apply a non-linearity to each of them individually. Here, instead of applying a non-linearity to each individual neuron, we group neurons into a capsule and apply a non-linearity to the entire set, the vector of those neurons; when we apply a non-linearity, it's to the whole vector instead of individual neurons. It also implements dynamic routing: it replaces the scalar-output feature detectors of CNNs with vector-output capsules, and it replaces max pooling with routing by agreement, so when the capsules in each layer forward propagate data, it goes to the next most relevant capsule; it's kind of like a hierarchical tree of nested layers inside of layers. The cost of this new architecture is the routing algorithm. Basically, the key difference from a regular convolutional network is that the forward pass has an extra outer loop: it takes a few iterations over all the units, instead of one, to compute the output, and the data flow is a little more complicated, because it says: for every capsule nested inside a layer, apply these operations, whether a softmax or a squashing function.
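The squashing non-linearity from the paper is v = (‖s‖² / (1 + ‖s‖²)) · s/‖s‖: it preserves a capsule vector's direction while mapping its length into [0, 1), so short vectors shrink toward zero and long vectors saturate just below unit length. A NumPy sketch:

```python
import numpy as np

def squash(s, eps=1e-9):
    """Squash a capsule's vector output: keep its direction,
    but map its length into [0, 1)."""
    sq_norm = np.sum(s ** 2)
    scale = sq_norm / (1.0 + sq_norm)          # in [0, 1)
    return scale * s / np.sqrt(sq_norm + eps)  # rescaled unit vector

v = squash(np.array([3.0, 4.0]))      # length 5 -> 25/26, close to 1
short = squash(np.array([0.1, 0.0]))  # short vectors shrink sharply
```

The output length then acts like a probability that the entity the capsule represents is present.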
What this does is make the gradient harder to calculate, and the model may suffer from vanishing gradients on larger datasets, which could prevent it from scaling and becoming the next big thing; in the paper they only applied it to MNIST, not hundreds of thousands of images. But I suspect this network is going to scale really well, because (a) it's Hinton, and if Hinton says something's going to work, it's probably going to work. OK, so let's go ahead and look at some code to see what this looks like. What I found is this TensorFlow implementation of CapsNet, which is still in progress (this paper is relatively new), but it's code we can look at right now. Notice it's only got two imports, NumPy and TensorFlow, so it's a really clean architecture. It's got this capsule convolutional layer, which takes a number of outputs, a kernel size, and hyperparameters like the stride, and it's got an if-else statement that says: if we're not going to do the routing scheme, build it this way; if we are, build it that way. So let's look at both. Without the routing scheme, it starts off with a list of all of our capsules, then iterates through the number of units we specified and says: for each one, create a convolutional layer (a standard TensorFlow convolutional layer), store it in this capsule variable, and append the capsule variable to the capsule list. Eventually we have all of these convolutional layers inside of this capsules layer (so this is how it's nested), and once we have all of these layers, we'll
concatenate them together and then squash them with this novel non-linearity function. If we do have the routing mechanism, the one the paper describes, it will again create a list of capsules, and for however many outputs we want, it creates this Capsule class instance, appends it to the list, and returns a tensor with the right shape. If we look at this Capsule implementation right here, this class is basically the routing mechanism, the routing algorithm we just saw: it goes through the number of iterations we want and applies these operations. Notice that there is a non-linearity here that the paper talked about, which is not ReLU; it's a novel non-linearity, which looks like this. This is kind of interesting: it's a non-linearity they found was good for applying to a group of neurons rather than a single neuron. ReLU we apply to a single neuron, but when applying a non-linearity to a group of neurons, a capsule, they found this squashing non-linearity worked best, and that's what is happening here. So that's the capsule layer. Once we have the capsule layer, we can import it into the capsule network file and build our architecture: we have a primary caps layer and a digit caps layer, and we add a capsule layer to each of them. These are both capsule layers, so for each of these layers there is a nested convolutional network inside, a capsule, and inside those capsules the non-linearity, the novel squashing function we saw in the paper, is applied.
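The routing loop itself can be sketched in NumPy. This follows the paper's routing-by-agreement algorithm (logits b start at zero, coupling coefficients are a softmax over them, each higher capsule's input is a weighted sum of prediction vectors, and the agreement û·v updates the logits), but the shapes and the random prediction vectors here are purely illustrative:

```python
import numpy as np

def squash(s, axis=-1, eps=1e-9):
    # map each vector's length into [0, 1), keeping its direction
    sq_norm = np.sum(s ** 2, axis=axis, keepdims=True)
    return (sq_norm / (1.0 + sq_norm)) * s / np.sqrt(sq_norm + eps)

def route(u_hat, iterations=3):
    """Dynamic routing by agreement.
    u_hat: (num_lower, num_higher, dim) prediction vectors."""
    num_lower, num_higher, dim = u_hat.shape
    b = np.zeros((num_lower, num_higher))  # routing logits
    for _ in range(iterations):
        # coupling coefficients: softmax over the higher capsules
        c = np.exp(b) / np.exp(b).sum(axis=1, keepdims=True)
        # each higher capsule's input: weighted sum of predictions
        s = (c[..., None] * u_hat).sum(axis=0)  # (num_higher, dim)
        v = squash(s)
        # agreement: dot product between each prediction and output
        b = b + np.einsum('ijd,jd->ij', u_hat, v)
    return v

rng = np.random.default_rng(1)
u_hat = rng.normal(size=(6, 2, 4))  # 6 lower caps, 2 higher, dim 4
v = route(u_hat)                    # one output vector per higher cap
```

Lower capsules thus learn, over a few iterations, to send their output to the higher capsule that "agrees" with their prediction.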
Then, at the very end of this network, after it has applied these operations to each layer with each capsule, it applies this decoder. In the paper they found that, at the end of the network, they could use a decoder to reconstruct a digit from the DigitCaps layer: after the network has applied this first set of capsule operations, it can reconstruct the input from that learned representation, and then apply a reconstruction loss to improve that learned representation over time. That's what this is: the reconstruction loss used to learn and improve that representation. This TensorFlow implementation is a work in progress; I've got a link to it in the video description as well as in my slides, but if you want, you can train it (I've been training it as well). All you have to do is run python train.py, and it'll start training, and you'll see the loss decrease over time. Please subscribe for more programming videos, and for now I'm going to learn more about capsules. Thanks for watching!
Info
Channel: Siraj Raval
Views: 135,707
Keywords: capsule neural network, capsule networks, capsules, siraj raval capsule, capsule network, capsule net, capsnet, capsule network hinton, capsule networks hinton, capsulenet, capsule hinton, hinton capsules, hinton capsule, siraj, siraj raval, udacity, python, programming, coding, software, engineering, data, analytics, deep learning, convolutional network, ML, coursera, capsule, image classification, hinton, geoffrey hinton, artificial intelligence, google, deepmind, machine learning
Id: VKoLGnq15RM
Length: 22min 43sec (1363 seconds)
Published: Tue Oct 31 2017