Kernel Size and Why Everyone Loves 3x3 - Neural Network Convolution

Captions
In this video we're going to talk about one of the most fundamental convolution options: kernel size. I'll explain what kernel size controls and which values you should use for your neural network.

First of all, "kernel" means the same thing as "filter"; they're used interchangeably, sometimes in the same documentation (I'm looking at you, TensorFlow). Kernel size determines the height and width of the filters. We combine the filter with a patch of the same size, so the kernel size also determines the size of the input patch. We can select any value from 1 pixel by 1 pixel up to the size of the input, and you can pick different values for the width and the height.

Also notice how the kernel size affects the size of the output: the output resolution will be the number of patches that we can fit along the input. A kernel as large as the input resolution will cause the output to shrink to 1x1, because it uses the entire input for a single patch. A 1x1 kernel will use each pixel for its own patch, so the output will be the same resolution as the input.

So which size should you use? The short answer is that you typically want 3x3, and sometimes 1x1. For the long answer, we'll look at a brief history of the ImageNet competition.

In 2012, AlexNet became the first convolutional neural network to win this competition. At that point in time AlexNet was a pioneer, and 3x3 kernels hadn't become king yet: it used an 11x11 kernel for its first convolution and a 5x5 kernel for the second convolution, then 3x3 kernels for the remaining convolutions.

The next year, in 2013, Matt Zeiler won the competition with an optimized version of AlexNet. One optimization was reducing the 11x11 kernel in the first convolution to 7x7, which led to cleaner filter patterns in the first and second convolutional layers. Notice the mix of high-frequency and low-frequency patterns in the 11x11 kernel, with poor coverage of medium-sized patterns. In the 7x7 kernel the medium-frequency patterns are covered much better, and the low-frequency patterns are more vivid and colorful. Don't worry about the high-frequency patterns; they're now detected by a deeper layer. Back to the architecture: the second convolution's kernel size remained 5x5, but that was about to change.

The very next year, in 2014, 3x3 filters took the throne. The Visual Geometry Group from Oxford won with a network using exclusively 3x3 kernels. You see, they made an important realization: we don't need larger kernels. Something magical happens when we chain multiple 3x3 convolutions together. Each pixel in the first layer can only see a 3x3 patch of the input, but the second layer sees a 3x3 patch of pixels in the first layer, and that 3x3 patch in the first layer can see a combined area of 5x5 in the input image. We can say that the second layer has a receptive field of 5x5. So a chain of two 3x3 convolutions can effectively see the same-sized patch as a single 5x5 convolution.

This is a big deal, because smaller kernels mean fewer weights and less computation. Two 3x3 convolutions use only 72 percent of the parameters and computation of a single 5x5, and three 3x3 convolutions use only 55 percent of the resources of a single 7x7. The one exception to this logic is the very first layer, since the input only has three channels, so it's common to see a 5x5 or 7x7 filter for the first layer.

So that's the story of why everyone loves 3x3. The next most common size is 1x1. You'll see this quite frequently, simply because it's the least expensive way to change the number of features in your feature map, which can be a handy thing to do.

What about even-sized filters? We don't talk about those. Seriously, I've never seen a research paper even acknowledge that they exist; please discredit me in the comments.

To summarize: the kernel size controls the height and width of the filters, which also determines the height and width of the input patch, and it inversely affects the height and width of the output. You should generally use 3x3 convolutions, because they're more efficient than 5x5 or 7x7, except for the first layer, where a 5x5 or 7x7 can be more efficient. Also, use 1x1 convolutions when you want to efficiently change the feature count.

And that wraps it up for kernel size. In upcoming videos I'll be covering filter count, padding, stride, and more, so subscribe if you're interested.
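The receptive-field and parameter arithmetic in the transcript is easy to check. Here is a small Python sketch (plain functions, names my own, no framework assumed) that computes the receptive field of a chain of stride-1 3x3 convolutions and the weight-count ratios quoted above:

```python
def receptive_field(num_layers, k=3):
    """Receptive field of a chain of stride-1 k x k convolutions.
    Each extra layer extends the field by (k - 1) pixels per side pair."""
    return 1 + num_layers * (k - 1)

def conv_weights(k, channels):
    """Weights in one k x k convolution mapping `channels` input feature
    maps to `channels` output feature maps (bias terms ignored)."""
    return k * k * channels * channels

# Two stacked 3x3 convs see the same 5x5 patch as one 5x5 conv,
# and three stacked 3x3 convs match a 7x7:
print(receptive_field(2))  # 5
print(receptive_field(3))  # 7

# ...but they use only 72% / ~55% of the weights and computation.
c = 64  # channel count cancels out; any value gives the same ratio
print(2 * conv_weights(3, c) / conv_weights(5, c))  # 0.72
print(3 * conv_weights(3, c) / conv_weights(7, c))  # ~0.551
```

The ratios are kernel-area ratios (2·9/25 and 3·9/49), which is why the channel count drops out.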
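To make the output-size and 1x1 points concrete, here is a hedged NumPy sketch (the names and sizes are illustrative, not from the video). With no padding and stride 1, the output resolution is `input - kernel + 1`, and a 1x1 convolution is just a per-pixel matrix multiply that changes the channel count:

```python
import numpy as np

def output_size(input_size, kernel_size):
    """Number of kernel-sized patches that fit along one axis
    (no padding, stride 1 -- a 'valid' convolution)."""
    return input_size - kernel_size + 1

print(output_size(224, 224))  # full-size kernel -> output shrinks to 1
print(output_size(224, 1))    # 1x1 kernel -> output matches the input

# A 1x1 convolution changes the feature count without mixing pixels:
# every pixel's channel vector is multiplied by the same weight matrix.
h, w, c_in, c_out = 8, 8, 64, 16
feature_map = np.random.rand(h, w, c_in)
weights = np.random.rand(c_in, c_out)  # the 1x1 kernel's weights
out = feature_map @ weights            # shape (8, 8, 16)
print(out.shape)
```

This is why 1x1 convolutions are the cheapest way to change the number of features: the cost is one `c_in x c_out` multiply per pixel, with no spatial window at all.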
Info
Channel: Animated AI
Views: 24,652
Id: V9ZYDCnItr0
Length: 5min 54sec (354 seconds)
Published: Thu Jul 14 2022