Ali Ghodsi, Deep Learning, Sept 28 2017, [Lect 5]

Captions
Okay, let's begin. We are still talking about regularization. We learned about dropout in the previous lecture, which, as I told you, is a very popular and very effective method. The last thing that I want to tell you about regularization is a technique called batch normalization. It comes from a paper published in 2015, and again it turned out to be a very effective method; it is a built-in function now in many packages.

I don't know if you are familiar with the concept of covariate shift, or domain adaptation. It's not specific to neural networks; it's a general concept. In many learning problems you have an issue called covariate shift, and the solution for covariate shift is domain adaptation. The problem is that you train a model on your training set and then you want to test it, but the distribution of your test set is different from the distribution of your training set. For example, you are building a model to predict breast cancer, and your training set is based on the patients you have had so far, those who came in for a test; some of them had breast cancer and some of them didn't. Usually people at age 40 or so start to get concerned about their health, go to a physician, and get tested to see whether they have a problem. So that is your training set, but you want to apply the model to everyone, say to someone at age 20, to see whether this person has breast cancer or not. Clearly the distribution of your test set is very different from the distribution of your training set, and the model that you trained is not going to work well on your test set.

Another example: you are building a model for sentiment analysis. You read comments and you want to decide whether a comment is positive or negative, say for Amazon, and you train your model on books: you look at the comments people have written for books and you train a model to decide whether a comment about this product is positive or negative. Then you want to apply your sentiment model to another product, a camera for example. The distribution of words and the way the sentences convey the meaning are different. Someone writes in a comment for a book, "this is a book that I return to every day", and it means it's a very good book, I read it every day; but for a camera someone says "this is a camera that I returned the first day", and it means it's a very bad camera, I didn't like it. So it's a very different distribution. This gap between the test and train distributions is called covariate shift, and the solution is domain adaptation, a family of techniques that try to make the distribution of the test set similar to the distribution of the training set.

The people who published this batch normalization paper had a very nice and interesting observation about neural networks. What they realized is that we have a problem similar to covariate shift between the layers of the network. When we are learning, say this is the first hidden layer and the second hidden layer tries to learn something from the first one; but in the next iteration, and the iterations after that, it's quite possible that the distribution of the output of this first hidden layer is very different from the distribution the next layer saw before.
That's because we are applying a nonlinear function to the output of this layer, and that function can be quite nonlinear and completely shift the covariates, so the second layer cannot learn properly from the first one. Let's try to fix this.

Suppose that your activation function is a logistic function and the input to it is a simple linear form, ax + b. Different things can happen depending on the value of ax + b: you might be in the regime which is almost linear, or in one of the saturated regimes. If you are in a saturated regime you are in trouble, because of the vanishing gradient problem: in backpropagation the gradient there is small, and when you backpropagate it is going to vanish after a couple of layers. That's one thing. The other thing is that you don't want your covariates to shift a lot during learning; something ideal would be to keep them around the center, so the distribution doesn't change much. Many of the pre-processing steps we usually take in neural networks are there to take care of exactly this sort of problem: we initialize the weights carefully and choose them close to zero, to be in the near-linear regime and not in the saturated regime; we choose a small learning rate, because we don't want the weights to explode; we use rectifier units instead of sigmoids, because we want to avoid the regime with small gradients. It seems that all of these problems are side effects of this covariate shift in the different layers, which we are trying to solve with smart or careful initialization, rectifier units, and so on. We can't afford to do proper domain adaptation in every layer, because trying to match the distribution of one layer's output to another distribution would be computationally expensive. But the intuition here is quite simple. The problem, as I was explaining, is that the output of the first hidden layer is the input of the second layer, and the distribution of this output may change during learning, which causes the problem.

So the idea is simple: let's normalize the data after each layer. Take the mean out, divide by the standard deviation; center the data and make the variance equal to one after each layer. Then you are around zero and you don't have much variance, and you do it every time. It's not computationally expensive; you can do it quite easily in batch learning: you have a batch, so you can compute the mean of the batch, compute the variance of the batch, and normalize. If the batch size is m, you compute the mean, you compute the variance, and you normalize the data. After that you rescale the data: if I multiply a variable with zero mean and unit variance by a constant gamma, the variance becomes gamma squared, and if I add beta to it, the mean becomes beta.
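Writing the transform that was just described in one place, as it appears in the 2015 batch normalization paper, for a mini-batch of size m and a single activation (the epsilon term is discussed a bit further down; it is only there for numerical stability):

\mu_B = \frac{1}{m}\sum_{i=1}^{m} x_i, \qquad
\sigma_B^2 = \frac{1}{m}\sum_{i=1}^{m} (x_i - \mu_B)^2, \qquad
\hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}, \qquad
y_i = \gamma\,\hat{x}_i + \beta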
So we rescale the data, but we treat this gamma and this beta as new parameters of the model. Roughly speaking, you can think of the method this way: after each layer I am going to normalize the data, make it zero mean and unit variance; but maybe I'm wrong, maybe it shouldn't be that way, so let the data be rescaled to some other variance, not one, and some other mean, not zero, and let the model decide what the appropriate mean and the appropriate variance are. These are my parameters, so when I do backpropagation I take the derivative with respect to them and adjust them.

It's a very simple idea and very effective in practice. Yes, you are forcing the distribution of the data; for example, if the data is Gaussian you are forcing it to be spherical, and it's not clear that this is always good practice, because it's not a pre-processing step that you do once and then let the model learn; you do it after each hidden layer. The hidden layer learns some parameters, then you take the output of that layer, which is a function of those parameters, and you change the output without changing the parameters. We don't know which distribution would be best, and that's why we let the network decide. What is happening is that we don't want the output of a layer to have a huge variance each time, shifting the covariates in a way that makes it hard for the next layer to learn, and we can't afford proper domain adaptation here because it's computationally too expensive. So I force the output to have a certain distribution, not necessarily the appropriate one: I say your mean is zero, your variance is one, but who knows if this is the correct distribution. The only thing I want to avoid is that the covariates shift so much that suddenly the model is somewhere that has nothing to do with what was learned in past iterations. So I force it to mean zero and variance one, but I give the model some flexibility to adjust this if it is very far off.

Question: what happens at test time? Good question; at test time you replace the batch statistics with the population statistics: the total mean and the total variance, not the batch ones. Question: what is epsilon? This epsilon is just for numerical stability; if you don't have it, you may run into numerical instability when sigma is very small, so you add a small value. Mathematically it doesn't mean anything here; it just avoids the instability you might otherwise have.

In practice it's a pretty effective method. Here are a couple of examples from the 2015 paper. First, this is the distribution after a hidden layer without batch normalization, and this is the one with batch normalization; just visually you can confirm that the distribution changes a lot when you don't apply batch normalization and it is much smoother when you do.
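As a rough illustration of the training-time versus test-time behaviour just described, here is a minimal NumPy sketch of the forward pass of a batch normalization layer. The use of an exponential running average for the population statistics, and the momentum value, are assumptions for the sketch rather than something specified in the lecture.

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, running_mean, running_var,
                      training=True, eps=1e-5, momentum=0.9):
    """x: (batch_size, features). gamma, beta: learned scale and shift."""
    if training:
        mu = x.mean(axis=0)                  # batch mean
        var = x.var(axis=0)                  # batch variance
        # keep an estimate of the population statistics for test time
        running_mean = momentum * running_mean + (1 - momentum) * mu
        running_var = momentum * running_var + (1 - momentum) * var
    else:
        mu, var = running_mean, running_var  # population statistics at test time
    x_hat = (x - mu) / np.sqrt(var + eps)    # eps only for numerical stability
    y = gamma * x_hat + beta                 # let the model pick mean and variance
    return y, running_mean, running_var
```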
In terms of training time, this is the training curve without batch normalization and this is the one with batch normalization: it learns much faster. Intuitively it's clear why it learns faster: without batch normalization this hidden layer has to learn from an input that varies a lot, but when you restrict the variation of that input you help the layer learn faster. It also lets you choose a possibly larger learning rate. They have many experiments in that paper showing that it converges faster and helps performance, sometimes even something like 30 times faster with batch normalization. There are many more experiments that I will skip; if you want, look at the paper. It is a built-in function in many packages, including Keras, which I am going to show you.

Question: are these learned? Yes, exactly, these are new parameters of the model. You take the output of one activation and normalize it; after normalization the mean is 0 and the variance is 1; then you multiply by a constant, and that constant is a parameter; you add a constant, and that is another parameter; and you learn these two constants using backpropagation. This is actually how Keras implements it.

Question: what if the transformed data that comes out the other side has a very wild distribution with lots of higher moments, so that the mean and variance don't capture that much? In those cases, in practice, the distributions in the higher layers are usually simpler than the distribution you have in the input. It's possible that the input distribution is not smooth while the distributions of the hidden layers are smoother, but it's very unlikely to be the other way around. It's somewhat like kernel methods: you map the data into a higher-dimensional Hilbert space and the structure becomes more linear; very roughly, and not quite correctly, you can assume that any distribution in that Hilbert space is sort of Gaussian. It's the same idea here: the distributions are not that complicated in the hidden layers.

Question: are we only doing this because of mini-batch gradient descent? No; even if you do gradient descent rather than batch gradient descent, this could still be a problem. You still have a hidden layer with a nonlinear function that can change the distribution completely, and you don't want that to happen. You have ax + b, and then this very nonlinear function is applied to it; a small change in ax + b may push you from one regime of the activation to another. During training ax + b changes, because you change a and you change b, and a small change there may shift you from one part of the activation function to another part, which changes everything.

Question: could this make the model fabricate patterns out of noise? Well, I don't think so; let me explain.
Fabricating a pattern out of nothing can happen in any complicated model. If your input is noise, there is no underlying structure in the distribution, and theoretically you are not able to learn anything from it; but with a very complicated function you may be under the wrong impression that you have learned something. That can happen anywhere when you have a complicated function, including neural networks, and whether or not batch normalization intensifies it, I'm not sure; but in any complicated model you can be under the wrong impression that you have learned something.

Have you seen those controversial papers that come out once in a while, where someone claims to have identified a gene which is correlated with believing or not believing in God, for example? I was doing gene selection for a project and I was under the impression that everything worked fine; then I was debugging my code and I realized that the labels I was fitting to the model were not really the labels of the patients who had the disease or didn't. It was gene expression data: many genes and a small number of patients, say 100 patients but 8,000 genes, so 100 points in an 8,000-dimensional space. I realized that what I was fitting to the model was a completely random vector; it had nothing to do with these patients, and still we were finding a gene. It's quite clear what's happening: you are in an 8,000-dimensional space with 100 points, so it's pretty easy to fit a hyperplane that goes through them; it doesn't matter what your labels are, who believes in God and who doesn't, you are going to find something. So it is always possible, when the model is very complicated, the space is very high-dimensional and you have many parameters, that you wrongly think you have learned something which is not true.

Okay, so let me go quickly through these slides. We covered this before, when I showed you the Jupyter notebook for the digit recognition example that we did the other day. I put the slides up; they are a review of what we did in class, just to show you that at each of those steps there are many other things you can do and many other parameters you can change. We define the model structure here, and when we defined the model architecture we used ReLU as the activation function; there are many other activation functions you can use, and there is a long list of them, so just take a look at the slides later to get familiar with some of the parameters you can change in the examples I've already provided on the web page. Dropout, which we learned about, is just a layer in Keras: the same way we added a dense layer of 128 nodes with an activation function, we can add another layer to the network, model.add(Dropout(0.2)), and Keras will do dropout for you. The 0.2 is the parameter of the Bernoulli distribution that is applied to the output of the layer; it is the fraction of the nodes that is going to be inactive. Batch normalization is also just a layer in Keras: you can simply say model.add(BatchNormalization()).
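As a small sketch of what adding these layers looks like in Keras; the 784-dimensional input and the 10-class output are assumptions matching the digit recognition example, and the exact layer ordering is just one reasonable choice:

```python
from keras.models import Sequential
from keras.layers import Dense, Dropout, BatchNormalization

model = Sequential()
model.add(Dense(128, activation='relu', input_shape=(784,)))  # dense hidden layer
model.add(Dropout(0.2))              # drop 20% of the activations during training
model.add(BatchNormalization())      # normalize, then learn gamma and beta
model.add(Dense(10, activation='softmax'))

# loss and optimizer as in the in-class example (discussed just below)
model.compile(loss='mean_squared_error', optimizer='sgd', metrics=['accuracy'])
```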
For the loss function we used mean squared error; there is a long list of other loss functions built into Keras that you can use. For optimization we used stochastic gradient descent; again, there is a list of other optimizers you may want to try. The difference between these optimization methods is that some of them get to the minimum faster and some slower; for example, the red curve is stochastic gradient descent and the blue one is Adam, which is more common these days: it takes a better path to the minimum compared to stochastic gradient descent.

Any questions? Question: where should you apply batch normalization? It's up to you. Intuitively it seems that you should do it after every layer; in practice you may do it only after some layers, because it's not entirely clear what's going on; but if this intuition about covariate shift is correct, you should do it after each layer.

Okay, so the next topic we are going to cover in this class is convolutional neural networks. A convolutional neural network is a neural network that has a convolutional layer, and a convolutional layer consists of three parts: a convolution stage, a detector stage, and a pooling stage. These three together make a convolutional layer, and if you have a neural network with at least one convolutional layer, then that network is a convolutional net.

The first step in a convolutional layer is the convolution stage: we convolve the input with another signal. I'm sure you are familiar with the concept of convolution. If you have two signals you can convolve them, and if you remember, the convolution theorem says convolution has a close relation with the Fourier transform: the Fourier transform of the first signal times the Fourier transform of the second signal is the Fourier transform of the convolution of the two signals. Convolution can be defined on a continuous domain or a discrete domain. Basically, when you have two signals, it's as if you flip one of them, shift it, multiply it with the other signal, and then sum it up from minus infinity to plus infinity; in the discrete case, if you have a limited domain, you compute the summation over that limited domain instead. And it's not just for one-dimensional signals: you can apply it to two-dimensional signals, three-dimensional signals, and so on.

There are two related concepts: convolution and cross-correlation, and they are pretty similar to each other. In convolution you flip the second signal, then shift and multiply; in cross-correlation you don't flip it, you just shift and multiply. In many implementations of convolutional neural networks, what has actually been implemented is cross-correlation, not convolution; but in this course we will not be picky about whether something is really convolution or cross-correlation.
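For reference, the discrete one-dimensional definitions being contrasted here, written in the standard form (the lecture only describes them verbally):

(f * g)[n] = \sum_{m=-\infty}^{\infty} f[m]\, g[n - m] \qquad \text{(convolution: flip, then shift, multiply and sum)}

(f \star g)[n] = \sum_{m=-\infty}^{\infty} f[m]\, g[n + m] \qquad \text{(cross-correlation: shift, multiply and sum, no flip)}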
So in cross-correlation we shift one of the signals, multiply it with the other signal, and take the summation; that is the cross-correlation between the two signals. If you do convolution, you flip the signal first, then multiply and take the summation; you can see here that the signal has been flipped before multiplying. Cross-correlation actually shows the similarity between two signals: if the value is high at a certain point, the two signals are similar at that time shift, and if it's small, they are very dissimilar at that shift.

The fact that we can use cross-correlation to find similarity can be quite useful; let me show you. I have two signals here and I compute the cross-correlation between them. It's a discrete function, so it is this value of one function times this value of the other, plus the next times the next, and so on; at time shift zero the value is seven point something. Then suppose I shift one signal and compute the same quantity again, and I just keep going. At this shift, look at the value: 19.41, which is a large value compared to 7, so at this time shift the two signals are similar; you can see the similarity between them. Here they are very dissimilar: the value is negative. And again here, 17 is a relatively large number, because you can again see the similarity between the two signals at that shift. So the value of the cross-correlation can tell me at which time shift two signals are similar to each other.

Now suppose that you have a two-dimensional signal, like an image, and you compute the cross-correlation of this image with another image. The values of this cross-correlation tell me at which positions the two images are very similar to each other. This can have many applications and can be very useful. Here is the original image, and this is the second signal that I want to convolve with it, or compute the cross-correlation with. The second one looks like an edge: it's a simple three-by-three matrix, but it looks like an edge, because what is an edge? It is basically a line with contrast on both sides of it, where the color changes significantly; here we go from black to white. So it is similar to an edge. I convolve it with the original image, or compute the cross-correlation, and this is the result. Remember what we had in the one-dimensional case: the higher values correspond to the positions where there is similarity. The higher values highlighted here are the positions where there is similarity between this kernel and the image, which means similarity between an edge and the image. So this is basically an edge detector: we can compute the edges of this image. If I change the kernel to this other one, I get this other result; in one case we get roughly the horizontal edges and in the other the vertical edges.
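Returning to the one-dimensional case for a moment, here is a minimal NumPy sketch of that shift, multiply and sum computation; the two signals are made up for illustration, not the ones on the slide:

```python
import numpy as np

# two illustrative discrete signals (not the ones from the slide)
f = np.array([1.0, 2.0, 3.0, 2.0, 1.0])
g = np.array([0.0, 1.0, 2.0, 3.0, 2.0, 1.0, 0.0])

# cross-correlation: slide f over g, multiply elementwise, sum at each shift
xcorr = np.correlate(g, f, mode='valid')
print(xcorr)            # the largest value marks the shift where f and g line up

# convolution is the same computation with f flipped first
conv = np.convolve(g, f[::-1], mode='valid')
print(np.allclose(xcorr, conv))   # True: flipping turns correlation into convolution
```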
Maybe I can show you an example which is even clearer. This is the original image and this is the second signal that I convolve with it. If I convolve this kernel with the image I get this result: it detects only the horizontal edges. If I compute the cross-correlation between the other kernel and the original image, I get this result, which captures only the vertical edges. And this last one is basically the combination of the two, which gives me all of the edges of the image.

In the literature of convolutional neural networks we are always convolving a signal with the input signal, and we call the signal that is going to be convolved with the original one a kernel, or a filter. So the first stage of a convolutional layer is the convolution stage. In the convolution stage we have a kernel, and it is going to be convolved with the original image, meaning we compute the cross-correlation between this kernel and the image, and the output is called a feature map. So we convolve a kernel, or filter, with the input image and the output is called a feature map. That's the first step in a convolutional net, and you can see that through this convolution stage we may learn something useful: we may get a feature map which contains, for example, the edges of the input image; it depends on the kernel, and it could be another feature which is useful.

In a convolutional neural network we don't fix these kernels; we let the model learn them. Intuitively, think about it this way: you are doing a classification problem, the output is y, and I want to predict y. Along the way there are some kernels that are going to be convolved with the input and produce some features. Ahead of time I don't know which feature is useful for predicting y, so I let the network decide which kernel is useful; useful means that it produces features which are discriminative enough for this classification problem and let me predict y more precisely. That's basically the idea.

Question: do we set the kernels? No, you learn them; they are like weights in a neural network, and soon we are going to see that they are exactly weights. Question: how do you combine the horizontal and vertical edge maps? You just add them up: in the first image any horizontal edge has a large value and everything else is very small, and in the other the vertical edges are large and everything else is small, so you add them. Any other questions? Yes, the kernel was a three-by-three matrix, basically 1 1 1 / 0 0 0 / -1 -1 -1.

The mechanics work like this. This is the original image, and you have a kernel, say a three-by-three kernel, which in this case is 1 0 1 / 0 1 0 / 1 0 1. You multiply this kernel element-wise with this portion of the image and add the results up, like cross-correlation: 1 times 1, plus 1 times 1, and so on, so you get 4 here; that is the first element of the feature map I am producing. Then I shift the kernel by one pixel and do exactly the same thing: I get 1 plus 1 plus 1, so it's 3. You keep going and you produce the whole feature map.
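A small NumPy sketch of this sliding-window computation. The 5 by 5 binary image below is an assumption on my part, a common illustrative input that reproduces the values 4 and then 3 worked out above; only the 3 by 3 kernel comes from the lecture.

```python
import numpy as np

image = np.array([[1, 1, 1, 0, 0],     # assumed 5x5 binary input image
                  [0, 1, 1, 1, 0],
                  [0, 0, 1, 1, 1],
                  [0, 0, 1, 1, 0],
                  [0, 1, 1, 0, 0]])

kernel = np.array([[1, 0, 1],           # the 3x3 kernel from the example
                   [0, 1, 0],
                   [1, 0, 1]])

def feature_map(img, k):
    """Slide the kernel over the image (no padding, stride 1), summing elementwise products."""
    h = img.shape[0] - k.shape[0] + 1
    w = img.shape[1] - k.shape[1] + 1
    out = np.zeros((h, w), dtype=int)
    for i in range(h):
        for j in range(w):
            out[i, j] = np.sum(img[i:i + k.shape[0], j:j + k.shape[1]] * k)
    return out

print(feature_map(image, kernel))
# the first row starts with 4, then 3, matching the values worked out in class
```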
Question: does the feature map have the same size as the original image? Not necessarily. You can make it the same size as the original one, but you need padding. Padding means that you expand the image: you add some columns and rows, so that when the kernel is at the border you can assume there are, say, two more columns and two more rows there. Zero padding means you just add zeros there; same padding means you replicate the border columns and rows. But in general the feature map doesn't need to be the same size as the input.

We don't decide about the kernel ahead of time in a convolutional neural network; that edge detector was just an example to show you the effect of a kernel on an image. In a convolutional neural network you don't set the values of the kernel: these are parameters of the model which will be learned through backpropagation. This kernel is the one that is supposed to be convolved with the input image, and I don't know what the values are; these are my parameters. I just set some initial values and do backpropagation, and through backpropagation the algorithm decides the values of this kernel based on my cost function. My cost function is to predict the output y; if it's face detection and circular features are useful, the network has to produce features that are useful for face detection; but we let the model decide that, we don't set a kernel ahead of time.

Question: is it common to initialize with known kernels? No, that's not a common practice. There are a couple of kernels, I forget the names I wanted to mention, that people have observed in practice to be helpful. But in general, something interesting about machine learning is that you always let the model decide about the features. I can show you this example later: there is a set of pictures and you want to classify between frontal view and side view. If you ask a person to find a feature to make the distinction, you can think of many features, but probably not the feature that the model learned, which is just the light at the edge of the nose: when it is a side view it's very bright, and when it is a frontal view it's not. You would never think that this is an important feature. So we let the model learn which features are important instead of deciding for it; although there are some structures that have been observed to produce useful features, and sometimes kernels are assigned randomly instead of being learned; in that case some structure could help as initial values.

Question about transfer learning: that's different. The problem of transfer learning is not exactly the same as the problem of domain adaptation that I explained to you, but it is similar. You learn a model for one task but you want to transfer it to another task. For example, I train an image recognition model on ImageNet, which has objects in it like pens and airplanes and cars and different things, but I want to apply it to a medical imaging problem; we never observed such images in ImageNet. The idea of transfer learning is that maybe the knowledge we already learned there is useful here, because all images are made of some primitive shapes: corners, vertical and horizontal edges, angles, and so on, so maybe the features I learned on ImageNet are useful for classifying these images as well. One technique for transfer learning is to use a pre-trained model: say you trained AlexNet on ImageNet; you use this pre-trained model for the new problem, add some layers to it, and either learn only the new layers or fine-tune the weights that you learned before.
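A rough Keras sketch of that recipe. VGG16 is used here as the pre-trained base simply because it ships with keras.applications (AlexNet does not); the input size, the frozen base, and the two-class head are assumptions for illustration.

```python
from keras.applications import VGG16
from keras.models import Model
from keras.layers import Dense, Flatten

# pre-trained convolutional base (ImageNet weights), without its classification head
base = VGG16(weights='imagenet', include_top=False, input_shape=(224, 224, 3))
for layer in base.layers:
    layer.trainable = False          # freeze the base: learn only the new layers

# add a small new head for the new task (e.g. a 2-class medical imaging problem)
x = Flatten()(base.output)
x = Dense(256, activation='relu')(x)
out = Dense(2, activation='softmax')(x)

model = Model(inputs=base.input, outputs=out)
model.compile(optimizer='adam', loss='categorical_crossentropy')
```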
This is different from what we are discussing here: those features have been learned, and what I'm saying is that you can't decide ahead of time what the kernel, what the weights, should be.

Okay, so that was cross-correlation, or convolution. Now think about what's happening here. With the whole story of convolution and cross-correlation and convolving a kernel with the input, what we are doing is two things. One is making the weights sparse. Think about a feed-forward neural network: you have an input layer and a hidden layer, so there is a matrix of weights which is multiplied with the input to produce the hidden layer; there is a weight which gets multiplied with this value, and a weight which gets multiplied with that value, but the matrix is dense: from anywhere to anywhere there is a weight. Now think about convolution or cross-correlation, where we multiply a kernel with the input image. If the kernel had the same size as the input image, it would be as if for every pixel there is a corresponding weight: I multiply that weight with that pixel and add them all up, and that produces one element of my feature map; another element of the feature map would come from another set of weights, and so on. But if the size of the kernel is smaller than the size of the image, which is the case in convolutional neural networks (we had a 3-by-3 kernel, for example), then I am not using all of those weights. Say the kernel has size 2: I multiply the weights of this small kernel with these two pixels and produce one element; then I shift the kernel one pixel down and do exactly the same thing, multiplying the same values with the next two pixels to produce the second element of the feature map; then I shift again. But this weight is exactly the same as that weight, and this one is exactly the same as that one. So two things happen: first, the weight matrix becomes sparse; second, the weights are tied, so we don't have many free parameters anymore, because this parameter is exactly the same as that parameter. In a dense feed-forward layer every weight was different and we had weights all over the place; here we have done two things: made the weights sparse and tied the weights. And remember we talked the other day about sparsity: sparsity really is a helpful thing everywhere, including here; one of the reasons for the success of these models could be having more sparsity.
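A tiny NumPy sketch of this view, for a one-dimensional input of length 5 and a kernel of size 2; the numbers are made up, but it shows the sparse, weight-tied matrix that the sliding kernel is equivalent to:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # made-up 1-D input
w = np.array([0.5, -1.0])                  # kernel of size 2: only two free parameters

# the equivalent dense-layer weight matrix: sparse, and the same two weights
# reappear ("are tied") in every row
W = np.array([[0.5, -1.0, 0.0, 0.0, 0.0],
              [0.0, 0.5, -1.0, 0.0, 0.0],
              [0.0, 0.0, 0.5, -1.0, 0.0],
              [0.0, 0.0, 0.0, 0.5, -1.0]])

as_matrix = W @ x
as_sliding = np.array([np.dot(w, x[i:i + 2]) for i in range(4)])
print(np.allclose(as_matrix, as_sliding))   # True: convolution = sparse, tied matmul
```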
So in a feed-forward network, if you have a layer of m nodes going to a layer of n nodes, you have m-by-n weights; but here you have a kernel of size k, so the number of weights is on the order of k times n instead of m times n. The number of weights is smaller, and we are doing parameter sharing: as I said, this weight is the same as that one; the parameters are tied.

Question: do we choose the size of the kernel? Yes, you decide the size of the kernel, the same way you decide the size of a hidden layer; you decide that a layer should have 32 or 64 or 128 nodes, and you decide that you are going to use a kernel of size 3-by-3 or 5-by-5 or 10-by-10. There is no good way to design that at the moment. Question: is this still a feed-forward network? Yes, it is; just think of it as a feed-forward neural network that has fewer weights and has some structure in the weights: in a feed-forward network you have a matrix from one layer to the next, and now this matrix is, first, sparse, and second, structured, in the sense that one part of the matrix is the same as another part.

Question: do we apply only one kernel? No, we apply many kernels; it's a good point, and I was about to show it in the practical examples later on. From one layer to the next we apply many kernels, not only one. You have an input image and you apply a kernel to it, shifting it along, and it produces a feature map for you. That's not the only kernel you apply: another kernel produces another feature map, and another one produces yet another feature map, so you produce many feature maps; say you have ten feature maps in the first layer. When you want to go to the next layer, if it is a convolutional layer again, you convolve with a kernel, but now your kernel is not just 3-by-3, it is 3-by-3-by-10: a kernel of that shape, which is convolved with all of these feature maps and produces another feature map. And again you apply more than one kernel there; you apply many kernels. Question: is there still parameter sharing? Yes, for each kernel you are sharing weights: think of each kernel as a matrix multiplication which is sparse and has some structure; you multiply by many of these matrices, but within each one you have the concept of parameter sharing and the concept of sparsity.
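A small Keras sketch of this stacking; the 28 by 28 grayscale input and the filter counts (10, then 16) are assumptions for illustration. The parameter count of the second layer reflects the 3 by 3 by 10 kernels just described, plus one bias per filter:

```python
from keras.models import Sequential
from keras.layers import Conv2D

model = Sequential()
# first convolutional layer: 10 kernels of size 3x3 over a 1-channel image
model.add(Conv2D(filters=10, kernel_size=(3, 3), activation='relu',
                 input_shape=(28, 28, 1)))           # params: 3*3*1*10 + 10 = 100
# second convolutional layer: each of its 16 kernels is really 3x3x10,
# because it is convolved with all 10 feature maps from the previous layer
model.add(Conv2D(filters=16, kernel_size=(3, 3), activation='relu'))
                                                      # params: 3*3*10*16 + 16 = 1456
model.summary()
```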
This is the definition of equivariance: if you have two functions f and g, and f(g(x)) is equal to g(f(x)), then by definition f is equivariant to g. Here is a trivial example: suppose g is the identity function, which takes x and produces x, and f is a function which maps x to y. If I first apply g and then apply f, the output is y, and that is the same as applying f first and then g, because g is the identity; so in this case f of g is g of f.

Convolution and translation have this property: if you translate first and then convolve, it is the same as convolving first and then translating. For example, say this is one signal, a step function, and this is another one. If I convolve this one with that one, I get this back as the result of the convolution; but if the second signal has its step in a different place, then the result appears shifted over here. Actually it's a slightly bad example, because I'm convolving a two-dimensional signal with a one-dimensional one; if I convolve with both steps, I get an image like this, where the pattern is repeated. Think about it this way: this is the result of the convolution, and if I only have the other step, the result is the shifted version; shift-then-convolve is the same as convolve-then-shift, it makes no difference. In two dimensions, if this is one image and the other signal has high values here, like a spike, and is zero everywhere else, then convolving them gives you an image like this. That is also the reason convolution can be used for edge detection: it produces a feature map, because you have this edge-like kernel, and the edges in the image of the elephant are shifted all over the place; convolving is as if you convolve and then shift everywhere, so you get a map similar to the edges in the original image. This property is called equivariance, and it holds with respect to translation; the operation is not equivariant with respect to other transformations like rotation or scaling, but with respect to translation it is. As I said, this is something that helps you learn a feature map which is useful.
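A quick numerical check of this translation equivariance, on a made-up one-dimensional signal. Circular shifts and circular convolution are used here only so that the two sides agree exactly at the boundaries; that choice is mine, not something from the lecture.

```python
import numpy as np

x = np.array([0.0, 0.0, 1.0, 2.0, 1.0, 0.0, 0.0, 0.0])  # made-up signal
k = np.array([1.0, -1.0])                                 # made-up kernel

def shift(v, s):
    """Translate a signal by s samples (circularly, to keep the check exact)."""
    return np.roll(v, s)

def conv(v, kernel):
    """Circular convolution, so that shifting and convolving commute exactly."""
    n = len(v)
    return np.array([sum(v[(i - m) % n] * kernel[m] for m in range(len(kernel)))
                     for i in range(n)])

# convolve-then-shift equals shift-then-convolve
print(np.allclose(shift(conv(x, k), 3), conv(shift(x, 3), k)))   # True
```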
Okay, so that was the first step of a convolutional layer, the convolution stage. The second step is the detector stage. Detection is basically similar to the activation function we had in feed-forward networks: there, we had a nonlinearity at each neuron, applied to the linear sum, which we called the activation function; it could be a sigmoid, it could be ReLU, and so on. The detector stage is just a nonlinearity that is applied to the output of your feature map. So you take the elements of your feature map, computed with that sparse, structured matrix, and you apply a nonlinearity to them; it's called the detection or detector stage, similar to the usual nonlinearity; nothing new here.

The third step, which is new, is the pooling stage. You have a feature map as the output of the convolution, you apply some nonlinearity to it, and you still have a matrix. Pooling means that you report some summary statistic of this data instead of the data itself. For example, I take the average of these four pixels: this is 2, this is 3, this is 2, and this is 1, and instead I report 2 plus 3 plus 2 plus 1 divided by 4, which is 2; so I summarize them by taking their mean. Or instead of taking the average I may take the max, and the max here is 3. This is called pooling. I may also do pooling with downsampling: instead of reporting a value for every position, I make the feature map smaller and report only the 3.

Pooling can take different forms: you can take the average over a rectangular neighborhood, as in the first example; you can take the L2 norm; you can take a weighted average; there are many different types of pooling, and the popular one is max pooling, where you take the max. You can think of it as a summary statistic of the data. Say this is the output of the detector stage and this is the pooling stage: the output of the pooling stage could be the same size as the feature map; here we look at a window of three values and take the maximum, then the next window of three, where the maximum is 1, then the next, where the maximum is 1 again, and then the next, where the maximum is 0.2. Or, with downsampling, you reduce the size of the feature map, so the size after pooling is smaller than the size of the feature map. And you may use a different stride: when you pool, you may take the next window right here, or you may shift a couple of pixels ahead; with a stride of two or three you take a step of two or three and then do the pooling.

I am going to explain the effect of this pooling in the next lecture, but I encourage you to look at a video that I posted on Piazza. It's a talk by Geoffrey Hinton, and the title of the talk is "What is wrong with convolutional neural networks". Basically he believes that pooling is a very bad idea: it works, but it's a bad idea and we shouldn't have it. It's a very high-level talk, not mathematical, and it gets you to think about the intuition behind neural networks, modeling the brain, and how convolutional neural networks relate to the brain, not mathematically but intuitively, and from that perspective why they work and why we shouldn't have the pooling there, because he believes it is completely different from what we are doing in our brains. It's an interesting talk and it lets you think more deeply about the intuition and insight behind these models.

Question: can pooling act as a regularizer? Yes, you can think of it that way; in fact, if the pooling is too severe you may even underfit the model, which is something you may observe in practice. But it has other properties too: those who believe in pooling point out many good properties which go beyond regularization; it can deal with inputs of varying size, and it can make your model invariant with respect to translation, which you are going to see later. They believe it has many good properties, but Hinton, and some other researchers, believe it doesn't. Question: do we lose information? Yes, you definitely lose some information; a summary statistic is not everything you have. But it is computationally beneficial, and I think that was the first motivation for pooling: the feature maps are too large to pass to the next layer, so let's make them smaller; how? Take the average over each rectangle. Then, secondarily, people observed that it also gives some invariance with respect to translation of the image, and so on. But the first motivation was computational: it keeps the feature maps from exploding in size by taking a summary of them.
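A minimal NumPy sketch of max pooling with downsampling, using 2 by 2 windows and a stride of 2; the feature map values are made up:

```python
import numpy as np

fmap = np.array([[2., 3., 1., 0.],     # made-up 4x4 feature map
                 [2., 1., 0., 1.],
                 [4., 0., 2., 5.],
                 [1., 1., 3., 2.]])

def max_pool(x, size=2, stride=2):
    """Max pooling with downsampling: report the max of each window."""
    h = (x.shape[0] - size) // stride + 1
    w = (x.shape[1] - size) // stride + 1
    out = np.zeros((h, w))
    for i in range(h):
        for j in range(w):
            out[i, j] = x[i*stride:i*stride+size, j*stride:j*stride+size].max()
    return out

print(max_pool(fmap))   # [[3. 1.]
                        #  [4. 5.]]
```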
Question: can the pooling itself be learned? Yes; there is a paper, I think from 2012, where they try to learn the pooling operator, and there is another paper where they try to learn the region of the pooling: instead of a rectangle, they use a sort of clustering to learn the area over which to pool. So there are these variations, but in the vanilla convolutional network you just take a rectangle and usually take the max of it; max pooling is the most common one.

Okay, don't forget to get your SHARCNET account; I repeat this every day. Download ImageNet and see how large it is and how much computation you need; don't wait until the last day. Okay, see you next week.
Info
Channel: Data Science Courses
Keywords: Deep learning, Batch normalization, CNN, Convolutional Neural networks
Id: yeRHFXiYeX4
Length: 77min 20sec (4640 seconds)
Published: Mon Oct 02 2017