ResNet (actually) explained in under 10 minutes

Video Statistics and Information

Captions
I want you to imagine approximating a function parameterized by a deep neural network. In this example, we're going to pass our network an upsampled low-resolution input image and push it through each of the network's layers. We want the network to output the input image, but now in high resolution, a task commonly known as super-resolution. Unfortunately, in practice, after training our network on high- and low-resolution image pairs, somehow our network is spitting out images that are even worse than our input. After putting all your effort into a beautifully deep architecture, you are horrified to see that, instead of going down, your training loss shoots endlessly upwards. Your classmates and colleagues can't help but laugh. It seems counter-intuitive, because the model now has more parameters. So how can we address this and get your loss going in the right direction?

This problem partly comes down to the fact that the input signal is lost the deeper we go into our network, as the signal passes through the non-linear function at each layer. Look at what can happen to a training signal even after being passed through a single ReLU function, the most popular activation function for neural networks. Essentially, you are asking the network to do two things: first, retain the input signal, and second, figure out what needs to be added to the input image to transform it from a low- to a high-resolution image.

Instead, let's look at the problem from a different angle. Let's first subtract the low- and high-resolution images from one another. This gives us what is known as a residual image, the difference between the two images. Now let's rearrange this equation to get our intended output on the right-hand side. Given that we already have the low-resolution image at training time, let's just get our network to learn the only bit we actually care about: the residual. Framing the problem in this way makes the network's life easier, as it no longer needs to retain the entire input signal.

This was the same intuition that inspired the authors of the 2015 paper "Deep Residual Learning for Image Recognition". The paper is now considered seminal in deep learning, with over 130,000 citations, and it is rare to run into a model architecture today that doesn't use its contributions in some fashion.

In the previous example I gave you an easy and intuitive introduction to residuals. Let's have another look at the layers of a neural network. I chose to present residual connections using the example of super-resolution because it can be visualized very easily: by simply adding the input onto the output, we can instead learn the mapping to the residual image, as you can see here. However, the approach I've shown you so far has two major problems when generalizing to other tasks. The first problem arises when the inputs and outputs don't share the same dimensionality, for example in image classification, where you take an image input and map it to a single class label. How would you meaningfully add the inputs and outputs in this scenario? The second problem is how the input signal is propagated throughout the network. Consider the midpoint of our super-resolution network: at this point, no matter what our input or output is, it is still easy for the network to lose the training signal, an important piece of information that the network would benefit from having access to.
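To make the residual framing above concrete, here is a minimal PyTorch sketch. The tiny network, tensor shapes and variable names are illustrative placeholders rather than the architecture from the video; the point is only that the network predicts the residual, and the upsampled input is added back on to form the high-resolution estimate.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Stand-in image-to-image network; any architecture could go here.
model = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.Conv2d(64, 3, kernel_size=3, padding=1),
)

lr_up = torch.randn(1, 3, 128, 128)   # upsampled low-resolution input (dummy data)
hr = torch.randn(1, 3, 128, 128)      # high-resolution target (dummy data)

residual_pred = model(lr_up)          # the network learns only the residual
sr = lr_up + residual_pred            # output = input + predicted residual

# Training on the reconstruction is equivalent to training on the residual itself,
# since hr - lr_up is exactly the residual image described above.
loss = F.mse_loss(sr, hr)
loss.backward()
```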
To remedy both of these problems, we can add what are known as residual connections all the way along our network. This not only boosts the input signal along the network, but also makes it easier to sum inputs and outputs as the feature dimensionality is adjusted on the way. We can now view the network as a series of residual blocks instead of a series of independent layers. Most importantly, the network now has the option not to fully utilize every block, since it is easy for each block to output the identity function and take no penalty in relation to the loss function. This opens the door to training extremely deep networks. Now let's have a deeper look at the main idea I introduced here: the residual block.

So what exactly was the ResNet block proposed in the original paper? Let's go through it step by step. Firstly, we pass our inputs through a 3x3 convolutional layer with a stride of one and padding of one. These parameters mean that our output features will have the same dimensionality as our input. We then apply batch norm to renormalize these features and pass them through an activation function such as ReLU. We then pass the features through a second convolutional layer, exactly the same as the first, again followed by batch norm. At this stage we just have a normal, vanilla neural network, so let's now add a residual connection. We can do this by simply adding the block's inputs onto the current set of features. We do this element-wise, since our inputs and features share the same dimensionality; remember, this is only because we have carefully chosen our convolutional parameters. For tasks such as image classification, however, we do actually want to reduce the dimensionality throughout the network; more on that in a moment. Finally, we pass our features through a final activation function. And that is essentially it; it really is quite a simple idea.

Now let's have a quick look at the official PyTorch implementation of a ResNet block's forward pass and consolidate what we've just learned. We start with an input tensor x and save a copy of it as our identity to use later. We then pass our input through a set of convolutional, batch norm and activation layers, downsampling the features if required (more on that in a moment when we discuss dimension matching). We can then simply add our saved identity features to the current set of features in the network; this is done element-wise. Finally, we pass the result through a final activation function and return it as the output of our residual block (a condensed code sketch of this forward pass appears at the end of this passage). Note that some of these choices are arbitrary, such as applying the activation function after adding the identity; this is simply what the authors found to give the best results.

When performing a residual connection, we must ensure that the dimensions match so that we can do element-wise addition. In the original paper, the authors choose to reduce dimensionality every few residual blocks, as their end goal is image classification, where you go from a high-dimensional input to a low-dimensional output. They decided to reduce dimensions by halving the height and the width of the current set of features, and to keep the computational requirements of each part of the network consistent, they also double the number of channels every time they halve the height and the width. This leaves us with potentially two scenarios of mismatched dimensions: firstly, where the height and width don't match, and secondly, where the channels don't match; it could be either one of these or a combination of the two.
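As promised above, here is a condensed PyTorch sketch of the residual block's forward pass. It mirrors the structure of torchvision's basic block but leaves out the optional downsampling path, which the next part of the transcript covers; the class name and channel counts are illustrative choices, not the official implementation.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Two 3x3 convolutions with batch norm, wrapped by a skip connection."""

    def __init__(self, channels):
        super().__init__()
        # stride=1, padding=1 keeps spatial dimensions unchanged, so the
        # input can be added element-wise at the end of the block.
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, stride=1, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, stride=1, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        identity = x                              # save the input for the skip connection
        out = self.relu(self.bn1(self.conv1(x)))  # conv -> batch norm -> ReLU
        out = self.bn2(self.conv2(out))           # second conv -> batch norm
        out = out + identity                      # element-wise addition with the input
        return self.relu(out)                     # activation applied after the addition

block = ResidualBlock(64)
y = block(torch.randn(1, 64, 32, 32))   # output shape matches the input: (1, 64, 32, 32)
```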
Let's have another look at the ResNet block and understand how the network can downsample features. Consider the first convolution, which I told you earlier has a stride of one and padding of one to keep the input feature dimensionality the same as the output's. The authors propose to downsample features directly by occasionally altering this convolutional layer to have a stride of two, which produces features with half the height and half the width. When the authors downsample in this fashion, they also double the number of convolutional filters, which in turn doubles the number of channels in the output features. This is where we have a problem with dimension matching, as the input that is sent through our residual connection no longer has the same dimensions as the features coming through the network.

Let's now have a look at the input features coming through the residual connection and see what options are available for addressing this dimension mismatch. The authors proposed two solutions. The first is to match the number of feature channels by zero padding. This option has the benefit of introducing no new parameters into the model; it is done simply by filling out half the features with zeros. Although no parameters are added, we are now wasting computation on meaningless features full of zeros. The second solution is to match the number of channels by passing our input features through a one-by-one convolution. This of course adds extra parameters, but it means that our output features only contain real information. For example, if our input features have three channels, we would use six one-by-one convolutional filters to double the number of channels in the output space. For both options, a stride of two is again used, meaning that our output feature maps have half the width, half the height and double the channels, so they exactly match the current features in the network. Essentially, these two options are very similar: they both skip over every other pixel in the input features. The main difference is whether we zero pad the output or use the one-by-one convolution to match the number of channels. The authors found that the one-by-one convolutional option led to the best results (a short code sketch of this projection shortcut follows the transcript).

Hey guys, it's Rupert. Thanks for watching the video. I hope you now have an intuitive understanding of residual networks; the main idea is that you can now train your networks deeper and deeper whilst keeping training stable. Please don't forget to hit that like and subscribe button for more machine learning videos, and let me know in the comment section down below what you want to see in the next video.
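As mentioned at the end of the transcript, here is a small PyTorch sketch of the projection shortcut (the paper's "option B"). The channel counts and spatial sizes are illustrative only: a 1x1 convolution with a stride of two halves the height and width while doubling the channels, so the skip connection matches the downsampled main path of the block.

```python
import torch
import torch.nn as nn

# Illustrative sizes: 64 channels at 56x56 projected to 128 channels at 28x28.
in_channels, out_channels = 64, 128

projection = nn.Sequential(
    nn.Conv2d(in_channels, out_channels, kernel_size=1, stride=2, bias=False),
    nn.BatchNorm2d(out_channels),   # batch norm on the shortcut, as in common implementations
)

x = torch.randn(1, in_channels, 56, 56)
identity = projection(x)
print(identity.shape)   # torch.Size([1, 128, 28, 28]): half height/width, double channels
```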
Info
Channel: rupert ai
Views: 84,941
Id: o_3mboe1jYI
Length: 9min 47sec (587 seconds)
Published: Mon Oct 24 2022