(1/12) MobileNets: Standard Convolutions (Computational Cost)

Captions
Hello everyone. One of the most important building blocks in MobileNets, in both MobileNetV1 and MobileNetV2, is the depthwise separable convolutional layer. This layer is what makes MobileNets so fast. But before explaining how this building block works, it is necessary to review standard convolutional layers. So in this video I will first give you a recap of standard convolutional layers and then estimate their computational cost. This is important because later we will compare this cost with the computational cost of depthwise separable convolutions, to see the advantage they provide.

Let's start by exploring the main steps in a standard convolutional layer. In general there are three steps: the first is the convolution itself, the second is batch normalization, and the third is the application of a non-linearity (an activation function). We usually use ReLU; MobileNets use ReLU6. As an aside, in recent CNN architectures we apply batch normalization at every layer, whereas in older architectures normalization was applied only once, at the input of the model.

What I'm interested in here is the convolution step. At this step we take an input F (I'm using the notation from the original paper, so it will be easier for you if you want to go and read more details there). The input F has a height of D_F, a width of D_F, and a depth, or number of channels, equal to M. At the output of this step we get a feature map G. Before giving the dimensions of this output, it is important to specify some hyperparameters of the convolution. The authors assume that the convolution uses a stride of 1, meaning each convolutional kernel moves by one pixel at a time, and that it is a "same" convolution, meaning the spatial dimensions of the output equal the spatial dimensions of the input: since the input here is D_F × D_F, the output is also D_F × D_F. This requires padding; if you are not familiar with padding, there are plenty of videos on the topic, and my purpose here is just to give a recap. Finally, we assume that the number of kernels is N, and since each kernel produces one channel when convolved with the input F, the number of channels in the output is N. So we use N kernels K_i, where i is simply the index of the kernel, running from 1 to N.
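To make these three steps concrete, here is a minimal sketch of such a standard convolutional block, assuming PyTorch; the channel counts are illustrative placeholders of mine, not values from the paper:

    import torch
    import torch.nn as nn

    M, N = 16, 32      # illustrative: input depth M, number of kernels N
    D_F, D_K = 9, 3    # spatial size of the input and of each kernel

    # Standard convolutional layer: convolution -> batch norm -> ReLU6.
    # stride=1 with padding=D_K // 2 gives a "same" convolution, so the
    # output keeps the D_F x D_F spatial dimensions of the input.
    block = nn.Sequential(
        nn.Conv2d(M, N, kernel_size=D_K, stride=1, padding=D_K // 2, bias=False),
        nn.BatchNorm2d(N),
        nn.ReLU6(),
    )

    F = torch.randn(1, M, D_F, D_F)  # input feature map F
    G = block(F)                     # output feature map G
    print(G.shape)                   # torch.Size([1, 32, 9, 9])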
We also assume that the dimensions of each kernel K_i are D_K × D_K × M. The depth of each kernel must equal the depth of the input, so the M is necessary, and this holds for every i. Convolving the input F with the N kernels K_i is equivalent to convolving F with a single four-dimensional kernel tensor K of shape D_K × D_K × M × N: the first two dimensions are the spatial dimensions of each kernel, the third is the number of channels of each kernel, and the fourth is the total number of kernels.

That was the recap. Now I want to figure out the computational cost of convolving all of these kernels with the input F. To make things easier, I've prepared a visual example: I convolve a 9 × 9 input with a 3 × 3 kernel. Of course, there are N kernels, and each kernel K_i has dimensions 3 × 3 × M, because its depth must equal the depth of the input, which I assume is M. The input is zero-padded, so there are zeros around its border.

When we compute the computational cost of a convolution, we usually count the number of multiplications, so the cost C equals the number of multiplications. At this stage you might be wondering why we do not also count the additions. The short answer is that a multiplication is much more expensive than an addition, especially when large numbers are multiplied, which is why counting multiplications is enough to estimate the computational cost.

So what is the number of multiplications? First, at each position a kernel performs one element-wise multiplication per kernel element, so the count starts with the number of elements in each kernel K_i. We multiply that by the number of kernels, which is N. Then, how many times does each kernel perform this element-wise multiplication? It performs it at the first position, then moves to the second position, the third, the fourth, and so on until the end of the row, then continues on the next row, moving one pixel at a time because the stride is 1. So we must also multiply by the total number of positions taken by each kernel. Given our notation, the number of elements in each kernel K_i is D_K × D_K × M, since, as I explained earlier, each kernel K_i has dimensions D_K × D_K × M. We multiply that by the number of kernels N, and then by the number of positions taken by each kernel.
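This counting argument translates directly into code; here is a small sketch in plain Python (the function name and the concrete M, N values are my own) that multiplies the three factors just described: elements per kernel, number of kernels, and positions per kernel:

    def standard_conv_cost(d_k, d_f, m, n):
        """Multiplications for a stride-1 'same' standard convolution."""
        elements_per_kernel = d_k * d_k * m  # each kernel K_i is D_K x D_K x M
        positions_per_kernel = d_f * d_f     # stride 1 + same padding
        return elements_per_kernel * n * positions_per_kernel

    # The 9 x 9 input with 3 x 3 kernels from the video, with illustrative
    # M = 16 and N = 32: 3 * 3 * 16 * 32 * 9 * 9 = 373,248 multiplications.
    print(standard_conv_cost(d_k=3, d_f=9, m=16, n=32))  # 373248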
With a stride of 1 and same convolutions, the number of positions taken by each kernel is D_F × D_F, or 9 × 9 in this specific example. So the cost is C = D_K² × D_F² × M × N. This result is important, and we will later compare it with the computational cost of depthwise separable convolutions.
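As a preview of that comparison, here is a sketch of the ratio between the two costs; the depthwise separable formula below is taken from the MobileNets paper (Howard et al., 2017) rather than derived in this video, and the concrete numbers are illustrative:

    # Standard cost vs. depthwise separable cost (depthwise + 1x1 pointwise),
    # using the formulas from the MobileNets paper.
    d_k, d_f, m, n = 3, 9, 16, 32  # illustrative values
    standard = d_k * d_k * m * n * d_f * d_f
    separable = d_k * d_k * m * d_f * d_f + m * n * d_f * d_f
    print(separable / standard)    # 0.1423..., i.e. ~7x fewer multiplications
    print(1 / n + 1 / d_k ** 2)    # identical: the ratio is 1/N + 1/D_K^2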
Info
Channel: Zardoua Yassir
Views: 244
Id: CWm5wBn1_fk
Length: 12min 56sec (776 seconds)
Published: Fri Aug 27 2021