Tutorial: Mathematics of Deep Learning - Part 1

Video Statistics and Information

Reddit Comments

@6:40 He's confusing dropout with dropconnect.

👍 3 · u/Kiuhnm · Aug 24 2017

Throughout most (all?) of the presentation, X refers to the parameters of the neural network.

At 25:20, he states that, as an assumption, l(Y, X) is convex in X, but says that X "is the output of the network," even though Phi on the same slide represents the outputs. Meanwhile, if X is actually the network parameters, then this assumption doesn't hold.

At 37:53, he even writes down the objective function with regularization as l(Y, X) + lambda theta(X).

Can anyone explain what's intended here?

👍 2 · u/reservedsparrow · Aug 24 2017

Here's Part 2, in case anyone's interested.

👍 1 · u/XalosXandrez · Aug 25 2017
Captions
Good morning. It's my great pleasure to welcome you to CVPR and also to welcome you to this tutorial on the mathematics of deep learning. My name is René Vidal, I'm a professor at Johns Hopkins University, and this tutorial is going to be given jointly with Raja Giryes from Tel Aviv University.

As we all know, ever since 2012 deep learning has become mainstream in computer vision, and one of the major reasons has been its outstanding performance in a variety of visual recognition tasks. In particular, up until 2012 most object recognition methods were improving at a small rate, maybe 1 or 2 percent, and in 2012 there was this dramatic 10 percent improvement obtained by training a deep network on ImageNet, a network that also had a very large number of parameters. The rest is history: in blue are methods not based on deep learning, and as of 2012 the error on ImageNet used to be around 26%. It went down to 15% in 2012 with AlexNet, in 2013 it went down again from 15 to 11, in 2014 it went down to 6, and in 2015 it went down again to 3.5% with ResNet.

At the beginning, many people thought this was only going to be the case for classification tasks, and only when you have a huge amount of data, as you do in ImageNet, where you have a million examples. But people tried a few things. They observed, for example, that if you fix the parameters of a network trained on ImageNet and use it for other things, it actually works remarkably well. What you can see on the top left is a method that just feeds the output of a network to a traditional SVM and then trains it for a different task; that's also the case for the second method. It was also applied to different domains, like recognition of faces with the DeepFace algorithm. So at some point we figured out, OK, it's not going to work just for classification problems, and mysteriously it has this property that you can use it on different datasets without much of a drop in performance. That was very unique, because many of us remember that with methods from the 2000s you would train on dataset one, test on dataset two, and you would always get a drop in performance of at least 10%. So we were proven wrong again.

It was then extended to many other tasks, one of them being object detection, for example by training the network with the output now being the bounding boxes, so that you can localize where the objects are. And it's been extended to almost any computer vision task at this point: you can do pose estimation by training a network whose output is the pose of the object, or you can do semantic segmentation; for example, there is the now famous SegNet, where the network is an encoder-decoder type of architecture.

Once it's worked for almost everything you can possibly imagine, you can start asking why this is the case. When you ask the practitioners and the experts in the topic, they will tell you one reason is that the features are learned rather than handcrafted, and that ought to have the effect that the features are more discriminative, more adapted to the task you're interested in, and that's why it should work better. Another reason that is given is that the more layers you add to the network, the more invariances you capture; you want, of course, to do recognition with invariance to viewpoint, lighting, and so on. And the example, or the proof, that is given is just experimental.
The x-axis here is the number of layers, the y-axis is the mean average precision, and indeed you do observe that the more layers you add, the better it works. This is very mysterious, because if you think about training an SVM, for instance, and you make the feature bigger and bigger, or add more parameters to the SVM, you eventually expect that the SVM would suffer from overfitting. So it is surprising that here you add more layers and you don't necessarily suffer from the overfitting problem. The counter-argument that is given is that we've got way more data, but, as we'll review briefly in a few moments, that doesn't seem to be a great explanation either, because if you go through the calculations from the 80s of how many training examples you would need given the size of the network, you would need many, many more training examples than are actually used. So that's another surprising point.

Now, another reason for the success that is given is, well, we can try lots of network architectures and we can feed them very large datasets because we have much better computing; in particular, fortunately, the algorithm for training deep networks is very suitable to implementation on graphics processing units. But the other thing people are starting to argue is: if we actually look at the networks from the 80s and compare them with the networks of today, other than the fact that the networks are bigger and the amount of data is larger, what are the actual changes that have happened? One of them is that there is a different regularization strategy. Back in the old times, for SVMs for instance, we would regularize with the sum of the squares of the weights, so we would solve a loss-plus-regularization style problem. With dropout, instead, what you do is randomly select half of the weights and not update them; you only update 50% of them following a gradient descent, backpropagation, or stochastic gradient descent strategy, and at the next step you randomly select another half of the weights. Why this strategy induces regularization, formally, as usually done, and why it prevents overfitting, continues to be a mystery.

The other change has been the type of nonlinearity. Back in the 80s the preferred nonlinearity was the sigmoid, which was meant to resemble what the activation of a neuron does, and it was replaced by the rectified linear unit, which is zero when the input is negative and the identity when the input is positive. When you ask the experts why this nonlinearity should perform better, a typical answer is that the sigmoid has a gradient that is nearly zero in most places, so it's very easy to get stuck and not move much, while the rectified linear unit has a gradient equal to 1 where the input is positive, and therefore it's not going to get stuck. That's the argument that's given. Now, if you give that argument to someone who does research in mathematical optimization, that mathematician will not be very satisfied by that answer, because whether the gradient is small or not shouldn't really be a reason to have guarantees of convergence, for example; it shouldn't be a reason to have guarantees of converging to a good minimum, or any guarantees of escaping a saddle point. So there has to be a better explanation for some of these phenomena.
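To make the gradient claim concrete, here is the standard computation behind it (my addition, not a slide from the talk):

$$\sigma(x) = \frac{1}{1+e^{-x}}, \qquad \sigma'(x) = \sigma(x)\,\big(1-\sigma(x)\big) \le \frac{1}{4}, \qquad \mathrm{ReLU}'(x) = \begin{cases} 1, & x > 0, \\ 0, & x < 0. \end{cases}$$

So a gradient backpropagated through $K$ sigmoid layers picks up a factor of at most $(1/4)^K$ from the activations alone, whereas along an active ReLU path the corresponding factor is exactly $1$.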
And last but not least, people have also observed experimentally that many of the weights end up being 0, so you end up with networks where only a few weights are different from 0. So I think it's fair to say that, by and large, with a few exceptions, the theoretical understanding of why these networks work remains very shallow.

Because neural networks are not a new thing, people obviously thought about some of these issues twenty or thirty years ago. One of the issues people thought about is what kinds of functions can be approximated by a neural network, and a very outstanding result from the late 80s was the fact that a network with just one hidden layer is a universal approximator; namely, you can train it to approximate any function. When you start thinking about that, you ask: why do we need deep networks, then? If with a shallow network, with just one hidden layer, we can approximate anything, why do we need deeper networks? Now, how do we approximate everything? We approximate everything by letting the number of neurons go to infinity, so in a sense it is by having a very wide network. So obviously you would think there is a trade-off between depth and width, and that's probably not very well understood. People have even looked at it experimentally: if you train a shallow network with lots of neurons versus a deeper network, the deeper networks do work better. So why is it the case that deeper networks work better?

Some recent explanations for particular classes of networks have been provided. There have been connections made between deep networks and wavelets through the work of Stéphane Mallat and Joan Bruna, and there's been this idea that if you ensure that at every layer you don't make the volume of the data grow, so that in a sense the map from layer to layer is a contraction, then you can have guarantees of these networks effectively making the representations invariant to deformations. The second result, from back in 2003, concerns generalization error and regularizers: it was computed back then that the number of training examples needed in order to train a network grows exponentially with the network size. That is a little bit what I was trying to argue: if we look at the sizes of today's datasets and the sizes of the networks being trained, it does not make sense by the calculations made back in 2003. So either there is something wrong or mysterious, or the way the calculations were done back then was overly pessimistic, and more likely there is extra structure in deep networks that allows the number of training examples needed to be much smaller.

This has indeed been investigated recently, and we will see today some results from Raja on the fact that if you've got networks with random weights, there is a notion of Gaussian width that allows you to capture properties of the network and to say things such as: distances between data points from different classes get preserved, while distances between points from the same class get shrunk. That's a property that networks with Gaussian random weights have. There has also been very interesting recent work on how networks should be regularized, and on the fact that things such as the sum of the squares of the weights are perhaps not a good idea; maybe you need to consider all the paths that go from the input to the output and take products of the norms of the weights as a better regularization strategy. We will see that today as well.
Regarding optimization, if we look at the literature of the late 80s and late 90s, the only case where people were successful at showing that you could get globally optimal solutions was the case of a linear neural network: if you eliminate the nonlinearity, if you only have a single hidden layer, and if the data is separable, then things are good. But very easily, as soon as you add nonlinearities, people showed that backpropagation could get stuck in local minima. So it wasn't until maybe 2005 that people started to observe a very interesting phenomenon, and the phenomenon, in the paper from Bengio in 2005, was that if the number of variables grows and grows, eventually you get some sort of convexity. I'm hand-waving here a little bit, but the idea is that you can get convex neural networks if you have an infinite number of variables. This was done for networks with a single hidden layer, and the number of variables here is just the number of hidden units in that single layer. In the last three years or so there has also been very interesting work making connections between neural networks with random Gaussian weights and certain phenomena that occur in physics, particularly spin glass models. In physics you write down the energy of a collection of particles and you want to find configurations that minimize that energy, and there is a lot of phase transition phenomena: if some parameter goes up, eventually maybe all local minima disappear. That phenomenon has been extrapolated to some classes of networks via simulation, and people are starting to observe that maybe the landscape is not as bad as we thought; maybe local minima disappear, particularly when the networks are large. Why this is the case, we will see some of that today as well.

So, to summarize, the motivation of this tutorial is that deep networks have led to dramatic improvements in several tasks, but we don't necessarily understand the mathematical reasons for it. The goal of this tutorial is to review recent progress in understanding three key properties of deep neural networks. One is optimization properties: can we guarantee that the weights that are learned are globally optimal? The second is input-output properties of neural networks: is it the case that we can have good metric learning properties from input to output? And the last topic will be generalization: how can we measure the number of training examples needed to prevent overfitting? By and large there will not be a lot of experimental results; the purpose of the tutorial is not to show yet again that it works very well on yet another example, but rather to understand the mathematical properties.

The rest of the schedule will be as follows. I'll begin by giving a talk on the topic of global optimality in deep learning. We'll have a coffee break from 10:00 to 10:30, and from 10:30 onwards there will be two talks given by Raja: the first one on a data-structure-based theory for deep learning, this question of metric learning properties, and the second one on generalization bounds for guaranteeing that there is no overfitting. We'll finish around 12:00 to 12:15 with questions and discussion. That being said, I think there are microphones in the middle, so if you have a question at any time, or otherwise at the end of each presentation, we'll be happy to take questions.
Right, so is there any question at this point while I transition? Please raise your hand or go to the mic. No? Yes: the slides are on my website. If you go to my website, under events, there is a list of all the tutorials; the slides for the CVPR 2016 tutorial are already there, and the slides for this one are very similar, so I'll update them today, but you already have last year's available.

OK, so let me begin with the first topic, which is global optimality in deep learning. What is this question of optimization, and why is optimization a problem? The key challenge in training neural networks is that the optimization problem is non-convex. Why is it non-convex? Suppose that I begin here with the features, and I'm denoting those features by V; you can think of that as the input data to the network, where, say, the number of rows of V is the number of pixels in an image and the number of columns is the number of training examples. How does a deep network operate, mathematically? The first thing you do is apply some linear transformation to the data. The most common linear transformation is convolution, but it could be a fully connected network, in which case you are doing matrix multiplication; if it is convolution, remember that convolution is a linear operation, so it can always be represented by matrix multiplication. What you do next is apply a nonlinearity, and as I mentioned earlier it could be a sigmoid or a rectified linear unit, and that gives you the second layer. Now you repeat the process: you apply a second linear transformation and a second nonlinearity, and you do this K times if you have a network with K layers.

What I want you to observe immediately is the following. I'm going to introduce the notation Φ to denote the output of the network, and the variables here, X1 through XK, are going to be the network weights: X1 is the weights from the input layer to the first hidden layer, and so on. Notice that I'm not writing any dependency on the input V here, the reason being that I'm going to assume that the input is fixed: you just gave me a training set, it is fixed, and I'm trying to train the network for that fixed dataset. So the output is a function of the weights, because those are the variables from the perspective of optimization. Notice also that this map is a non-convex map: even if you had no nonlinearity whatsoever, what is this map? It's just a chain of matrix multiplications, X1 times X2 times X3 and so on. So even if you had only one hidden layer, you have the multiplication of two matrices, and you immediately start thinking: aha, this seems to be related to matrix factorization. In fact, it's very likely that learning the weights of a network can be posed as a factorization problem, where I give you the set of labels and you factorize it into the weights of the first layer and the weights of the second layer; that is, if you only had one hidden layer. In reality you have many, so it's like factorizing into a product of many things, but in addition you've got these nonlinearities on top, which make it no longer a matrix factorization problem.
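Before moving on, here is the forward map Φ written out in code, a minimal sketch of my own (the NumPy names and the toy sizes are mine, not the speaker's):

```python
import numpy as np

def relu(z):
    # Positively homogeneous nonlinearity (this will matter later).
    return np.maximum(z, 0)

def phi(V, weights):
    """Output of a K-layer network: Phi(X1, ..., XK) = psi(XK ... psi(X1 V)).

    V       : fixed input data, pixels x training examples
    weights : list [X1, ..., XK] of layer matrices; these are the variables.
    """
    out = V
    for X in weights:
        out = relu(X @ out)   # linear transformation, then nonlinearity
    return out

# Toy example: a 2-layer network mapping 10-pixel inputs to 3 outputs.
rng = np.random.default_rng(0)
V = rng.standard_normal((10, 5))            # 5 training examples
weights = [rng.standard_normal((8, 10)),    # X1: input -> hidden
           rng.standard_normal((3, 8))]     # X2: hidden -> output
print(phi(V, weights).shape)                # (3, 5)
```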
So what problem is it, then? Basically, you're given a matrix of labels Y (think of it as 0/1 depending on whether data point i belongs to class j) and you would like to compare that matrix of labels with the output of the network according to some loss. The typical loss might be the sum of the squares of the prediction errors, or it could be the entropy loss, whichever your favorite loss is. Then we put some regularization on the weights: in the 80s that regularization would simply be the sum of the squares of the entries of the weights; today often there is nothing, so the second term doesn't exist, and instead people do dropout, for example.

How do people handle the non-convexity today? By and large, the method of choice for training deep networks is backpropagation, and today it is done with a form of stochastic gradient descent, where you compute the gradient based on small batches of the data; but at the end of the day the technology is a particularly specialized form of gradient descent for doing the minimization. Because the problem is non-convex, in principle you could get trapped in many local minima, so what is typically done to get a good minimizer is a little bit like trial and error: either a good initialization for the weights is given, or otherwise you just look at the training error, and if it is not going down very fast it's very likely that it won't be a good solution, so you throw it away; you repeat this process from multiple initializations and then you pick the best one.

A few mysteries have been observed. One of them is that you can get many potential minima. What I sometimes say, joking, is that if you give data to two different PhD students and ask them to train the network, they'll come back to you with very different networks, yet the classification performance will be about the same. What is this suggesting? It is suggesting that perhaps many local minima are equal, or maybe that there are many global minima, which by definition have to be equal to each other. If that were the case, it would mean that what the non-convexity really does is produce many global minimizers, and depending on where you start, you might converge to different ones. Also, and this changes depending on whom you ask, many people say that rectified linear units work better than sigmoids, and as I said earlier, the explanation given is the phenomenon of vanishing gradients associated with sigmoids. The other phenomenon is that you can find many units whose weights are identically equal to zero, particularly for very large networks.

OK, I already mentioned a little bit of the earlier work, so let me mention the key results that are known as of today, at a very high level, a little hand-wavy, but hopefully it gets the message across. Suppose you're trying to solve this training problem for a neural network, which, as already said, is non-convex. The following assumptions are made. Assumption number one: the loss function ℓ, which is a function of two quantities, the labels (which are fixed) and the output of the network, is convex as a function of its second argument. Is that a strong assumption? Not really, because what are the common losses being used today? For instance, the sum of the squares of the errors, or the entropy loss, both of which are convex.
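Written out in symbols (my notation, following the definitions above), the training problem being discussed is:

$$\min_{X_1, \dots, X_K} \; \ell\big(Y, \Phi(X_1, \dots, X_K)\big) \;+\; \lambda\, \Theta(X_1, \dots, X_K),$$

where $Y$ is the fixed matrix of labels, $\Phi$ is the (non-convex) input-output map of the network, $\Theta$ is the regularizer (the sum of squared weights in the 80s, often absent today), and assumption one says that $\ell(Y, \cdot)$ is convex in its second argument.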
The second assumption we're going to make is that the loss is differentiable as a function of the output of the network. Is that a strong assumption? Again, no: the square loss is twice differentiable, in fact it's strictly convex and twice differentiable, so it's even better than what this requires, and likewise the entropy loss is also differentiable. So the first assumptions are very mild. The next one might sound very weird to you, so I'll spend a little bit of time explaining what it means. What is Φ? Φ, again, is the output of the network, written here as a function of the network weights. We are going to assume that both that output and the regularizer are what is written here as positively homogeneous. What on earth does that mean? It means the following: suppose that you scale the weights of the first layer by α, the weights of the second layer by α, all the way to the weights of the K-th layer by α. Then, when you look at the output, that α comes out of the map, raised to a certain power p. That p is called the degree of homogeneity, and it is "positively" homogeneous because we are going to assume this holds for non-negative scalings.

This seems like a very strange assumption. Is it actually satisfied by the neural networks we are currently using? Let's go through a few examples. Take the rectified linear unit: by definition it is the maximum between the input and zero; if the input is negative the output is zero, and if it is positive it stays the same. What if I scale the input by α? Then the maximum is scaled by α, provided that α is non-negative. So a rectified linear unit has this property of homogeneity: if I scale the input, it scales the output by the factor α, so the degree of homogeneity is 1. In summary, a rectified linear unit is a positively homogeneous function of degree 1. I hope you start to appreciate why I need "positively": if α were negative, which argument attains the maximum would change; instead of the maximum being x it would now be 0, and the property would no longer hold. The second operation of our neural networks is max pooling. In max pooling you take a neighborhood, a small patch, and you take the maximum response in that patch. What happens if the inputs are scaled by α? The responses are scaled by α, and the maximum value is also scaled by α. So max pooling is another positively homogeneous operation, also of degree 1, and I need non-negativity again because I'm taking a maximum: if α were negative, the maximizer would change. That's why we need non-negativity of α.
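A quick numerical sanity check of these homogeneity claims, a sketch of mine rather than anything from the slides:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.standard_normal(100)
relu = lambda z: np.maximum(z, 0)
pool = lambda z: z.reshape(-1, 4).max(axis=1)   # max pooling over patches of 4

alpha = 2.5                                      # non-negative scaling
# ReLU and max pooling are positively homogeneous of degree 1:
assert np.allclose(relu(alpha * x), alpha * relu(x))
assert np.allclose(pool(alpha * x), alpha * pool(x))

# The "positively" matters: for a negative scaling the identities fail,
# because which element attains the maximum changes.
alpha = -2.5
assert not np.allclose(relu(alpha * x), alpha * relu(x))
assert not np.allclose(pool(alpha * x), alpha * pool(x))
```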
So what about a deep network? Let's say I scale the weights of the first layer by α and the weights of the second layer by α. What's going to happen? Well, at each layer I have a linear transformation, and linear transformations, by definition, scale: if I scale the input by α, the output is scaled by α, so linear transformations are also homogeneous of degree 1. What's happening at each layer is that I'm composing a linear transformation with max pooling and with ReLUs, all of which are homogeneous of degree 1, so each layer is homogeneous of degree 1. If I concatenate two layers I get homogeneity of degree 2, and if I concatenate K layers I get homogeneity of degree K. So this is a property that is already satisfied by existing networks. Not by all of them: if you do some sort of normalization, then everything changes; but at least for networks where the nonlinearities are rectified linear units and max pooling, these properties hold.

The other assumption we're going to make, going back to the slide, is that both the network and the regularizer are positively homogeneous. Think about your favorite regularizer. Say it is ℓ1: what happens if you scale the weights by α? The ℓ1 norm gets scaled by α, so that's homogeneous of degree 1. What if you use ℓ2 squared? If you scale the weights by α, you get α squared, so that's homogeneous of degree 2. But look at the very last part of the assumption: it says that the network and the regularizer need to be homogeneous of the same degree. Aha. So if the network has K layers and degree K, does it make sense to use a regularizer like ℓ1? No: the degrees are not the same. What this theory predicts is that you need a regularizer whose degree matches the number of layers; you should be using something like ‖W‖ to the K in order for this to work, as opposed to ‖W‖ squared. And that might be an explanation for why people have said weight decay doesn't work so well, because by weight decay they usually mean ‖W‖ squared.
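Here is the degree-K claim checked numerically, together with one degree-matched regularizer (a product of per-layer norms); again this is my own sketch, reusing the `phi` from the earlier snippet:

```python
import numpy as np

rng = np.random.default_rng(2)
relu = lambda z: np.maximum(z, 0)

def phi(V, weights):
    out = V
    for X in weights:
        out = relu(X @ out)
    return out

V = rng.standard_normal((6, 4))
weights = [rng.standard_normal((5, 6)),   # K = 2 layers
           rng.standard_normal((3, 5))]
alpha = 3.0
scaled = [alpha * X for X in weights]

# Scaling every layer by alpha scales the output by alpha**K (degree K = 2 here):
assert np.allclose(phi(V, scaled), alpha**2 * phi(V, weights))

# A regularizer of matching degree K: the product of per-layer norms.
theta = lambda ws: np.prod([np.linalg.norm(X) for X in ws])
assert np.isclose(theta(scaled), alpha**2 * theta(weights))

# An l1-style regularizer only has degree 1, so its degree cannot match K >= 2:
l1 = lambda ws: sum(np.abs(X).sum() for X in ws)
assert np.isclose(l1(scaled), alpha * l1(weights))
```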
OK, so under those assumptions, here is a first meta-theorem, at a very high level. It says: pick your favorite algorithm for minimization, and suppose you had an oracle that tells you that you found a local minimum. If some entries of that X are equal to 0 (and I'm being deliberately vague for now as to which entries), then that local minimum is a global minimum. In other words, it gives you a sort of certificate to verify whether your local minimum is global, and all you've got to do is check that some weights are equal to zero. Which weights? I'll be much more precise in a minute, but for the time being, to have something in your head, think about a network with just one hidden layer, and suppose that all of the weights associated with one neuron are equal to zero; that's precisely what I mean here. For a network with a single hidden layer such that all the weights for one neuron are zero, that local minimum is global. Effectively, what this theorem gives you is a measure of redundancy: if one neuron has all its weights equal to zero, you can eliminate that neuron from the network without change. So the intuition of the theorem is that if the size is big enough, eventually some weights will be zero, and then you will know that you're at a global minimum.

The second theorem, which is slightly more practical, says the following: if the network size is big enough (what is the network size? again, think about the number of neurons in a network with a single hidden layer), then a local descent strategy will find that global minimizer from any initialization. Notice that I've written local descent, and I'm being very vague here: it's not just gradient descent. This local descent is going to be a weird local descent strategy, weird in the sense that typically we like optimization problems where the number of variables is fixed: there are three variables, I minimize over three variables; in deep learning, OK, maybe 60 million, but it's 60 million and I optimize over those 60 million. Here the descent strategy is going to adapt the size of the network as you go along. You might start with a network that is teeny tiny, and theorem one might not hold: you might have no weights that are zero. So you increase the size, you add some more neurons, and as you add more neurons, eventually the size will be big enough, you will get some weights that are zero, and then you know that you're at a global minimum. By local descent, what is meant here is a combination of descent, like gradient descent, plus this idea of adding or removing variables, effectively modifying the size and therefore providing a way to search a little bit over the architecture of the network.

Pictorially, what this second theorem says is the following. If I give you a generic non-convex function, the landscape can be fairly complex. Look at this figure: region A here is a flat region, maybe called a saddle plateau, because it's not a maximum and it's not a minimum; B is an isolated global minimizer; D is a non-isolated global minimizer, and the picture illustrates that there can be many of them, an entire plateau of global minima. Of course we can also have local maxima, and we can have local minima. What theorem two tells you is that if the size of the network is big enough, things such as H, where you need to increase the value of the function in order to escape, are guaranteed not to exist. You might still have a situation like this flat plateau, but notice that I can traverse the plateau without increasing the value of the objective function, so I don't need to increase the objective in order to escape those critical points.

All right, that's a very high-level explanation. Now I'm going to zoom in a little and try to explain why, what it means, and where it comes from, and I'm going to do so by first telling you about a very simple problem: matrix factorization. As I hope I've convinced you already, matrix factorization is a special case of deep learning where there is no nonlinearity and there is just one hidden layer; a linear network with one hidden layer is matrix factorization. Then I will explain how this extends to deep learning. As it turns out, matrix factorization is an important problem in machine learning in its own right, and it's got lots of applications; the Netflix challenge is one example, and in computer vision, if you think about subspaces for face recognition and face clustering, or motion segmentation, or structure from motion, these are all cases where subspaces appear all over the place, and matrix factorization is the method of choice for many such problems. There are two types of approaches for solving matrix factorization, at a very high level. One is what I would call the convex type: you give me a matrix Y and you want to approximate it by a matrix X according to some loss, and you add some regularization.
Typically you want the matrix X to be low rank, and typically the loss is the squared loss; if you put the rank here, you get PCA. This is in general an easy problem: if the regularizer is convex, it is a convex problem. The challenge is that the matrix X can be really big for big-data problems, and that's why in many cases people prefer the factorized formulation: if you know the matrix is low rank, why not exploit that to begin with? In that case, instead of solving for the product X, one solves directly for the factors U and V. PCA, non-negative matrix factorization, and sparse dictionary learning are all techniques that follow this paradigm, and the difference among them is what type of regularization you put on the factors; for instance, in sparse dictionary learning you typically want sparse coefficients, so you might put a sparse ℓ1 penalty on the V matrix. The challenge here, however, is that this problem is non-convex, and it's non-convex because of that matrix multiplication over there; that's the source of the non-convexity. Traditionally, people have said that non-negative matrix factorization is NP-hard, that there are no guarantees of global optimality; for sparse dictionary learning there are lots of alternating minimization algorithms, and in fact a good number of papers in computer vision write down an algorithm, it converges, and they verify that if you run it with three or four initializations it works; but proving that this is the case for non-convex problems is actually very, very hard.

Let me skip some of the examples, I think I've already given them. What this slide shows, and it is maybe the simplest technical slide in the sense that it gives you the gist of the results just by looking at it, is why you can have guarantees of global optimality for this non-convex matrix factorization problem. The key insight is that these two versions of the problem go hand in hand: they are not equivalent, but they are so intimately related that, by exploiting the fact that there is a convex problem right next to the non-convex one, I can get guarantees of global optimality for the non-convex one. Specifically, these problems are going to be addressed by exploiting a result known as the variational form of the nuclear norm. First of all, what is the nuclear norm? Given a matrix X, it is nothing but the sum of the singular values of that matrix, and it's supposed to be a convex surrogate for the rank: the rank is the number of nonzero singular values, so instead of counting how many are nonzero, you just sum them up, and by doing so you go from a non-convex quantity like the rank to a convex quantity like the nuclear norm.
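To keep the two formulations side by side, here they are in symbols (my transcription, with ℓ a generic loss and Θ a regularizer on the factors):

$$\text{convex:}\;\; \min_{X} \; \ell(Y, X) + \lambda \|X\|_{*} \qquad\text{vs.}\qquad \text{factorized:}\;\; \min_{U,\,V} \; \ell(Y, U V^{\top}) + \lambda\, \Theta(U, V).$$

The first is convex in $X$ whenever $\ell$ and the regularizer are convex, but $X$ can be huge; the second works directly with the small factors, at the price of the non-convexity introduced by the product $U V^{\top}$.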
But as it turns out (and I've always said: why write things simply if we can make them more complex?) we can write the nuclear norm in a very fancy way, and this is the fancy way. Suppose you give me a matrix X and my goal is to compute its nuclear norm, but I'm stubborn: I'm not going to just compute the singular values and add them up; I want to do something super fancy. You give me X and I say: find any factorization of X as UV transpose. There can be many of them, and I'm allowing the number of columns of U to be variable here. Among all possible factorizations, pick the one that minimizes the following quantity: the sum from 1 to r (r here is the number of columns of U, so it is the size of the factorization) of the norm of the i-th column of U times the norm of the i-th column of V. In other words, it is a sum of products of norms; that's what this regularizer is. And the magic is that if you minimize that over all matrices U and V, you get exactly the nuclear norm of X.

Why am I going through this convoluted procedure? Because I'm going to pick that quantity, the sum of products of norms, exactly as the regularizer for matrix factorization, and now I'm going to ask you: are these two problems equivalent? Well, it appears so, right? It's as if I just did a substitution: instead of X I've got UV transpose, and I substituted the nuclear norm by this quantity. So maybe the two problems are equivalent, and a global minimum of one is a global minimum of the other. But not quite, because of this minimum that goes in here: what is true is that this quantity is less than or equal to that one, but they may not be equal; they are only equal if I minimize over all choices of U and V. Now, this construction can be generalized. Suppose that instead of putting the two-norms in there, I put any norm; in fact, in many problems you want an ℓ1, or in many computer vision problems you want a total variation, or any norm that you like. That's what is meant here: pick any norm for U and any norm for V; it turns out that this variational construction generalizes and gives some other norm, which we are calling here the UV norm. It's maybe not a great name, but the interesting thing, and this is the gist of what I want to stick in your mind, is that a convex function of the product is equal to the minimization of a non-convex function over the factors.

Very good. So now, with that, I can state the theorem much more precisely in the case of matrix factorization. Suppose you have to solve this non-convex problem: given a matrix, factorize it as the product of U and V, regularized with ℓ1, ℓ2, whatever your favorite regularization on the factors. This problem is non-convex, so in principle you do alternating minimization, whatever your method is, and you have no guarantees of optimality. What this theorem says is the following: suppose your algorithm found a local minimum, and suppose that local minimum is such that one entire column of U and the corresponding column of V are identically equal to 0. Then that (U, V) you found is a global minimizer. In other words, in matrix factorization, a local minimum such that one column is zero is a global minimizer. Moreover, if you take the product of the U and V that you obtained, it gives you a global minimum of the convex problem. So the two problems are not equivalent, but they are intimately related: global minima of the convex problem and global minima of the non-convex problem are the same, and they are related in that the product of U and V gives you the global minimizer X. The proof may be long, but let me write just two bullet points. The reason this is true is the intimate connection between the non-convex problem I want to solve and the convex problem sitting right next to it; in fact, the convex problem is a global lower bound. It is a global lower bound because the first term, the loss term, is equal (it's just a substitution), and the second term is a lower bound because the fancy norm is the minimum of that sum-of-products quantity.
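Here is a small numerical illustration of the variational form (my own sketch; it uses the SVD to build a factorization that attains the minimum, and a random change of basis to show that other factorizations can only do worse):

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.standard_normal((6, 4))

# Nuclear norm: the sum of the singular values.
nuclear = np.linalg.svd(X, compute_uv=False).sum()

# The factorization built from the SVD attains the variational minimum:
# with U = U_hat * sqrt(S) and V = V_hat * sqrt(S) (column-wise),
# sum_i ||u_i|| * ||v_i|| = sum_i sigma_i.
U_hat, S, Vt_hat = np.linalg.svd(X, full_matrices=False)
U = U_hat * np.sqrt(S)
V = Vt_hat.T * np.sqrt(S)
assert np.allclose(U @ V.T, X)
attained = sum(np.linalg.norm(U[:, i]) * np.linalg.norm(V[:, i])
               for i in range(len(S)))
print(np.isclose(attained, nuclear))      # True

# Any other factorization X = U2 V2^T gives a value >= the nuclear norm:
G = rng.standard_normal((len(S), len(S)))           # random invertible map
U2, V2 = U @ G, V @ np.linalg.inv(G).T
assert np.allclose(U2 @ V2.T, X)
other = sum(np.linalg.norm(U2[:, i]) * np.linalg.norm(V2[:, i])
            for i in range(len(S)))
print(other >= nuclear - 1e-9)            # True
```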
Pictorially, what this theorem says is simply that the local minimizer you found is redundant. It's almost like saying: I've got a low-rank matrix but I don't know its rank, so I'm going to overestimate the size; eventually, if some columns are zero, then you can reduce the size of your factorization and still factorize the matrix perfectly. The certificate for global optimality says exactly that: you've put in too many columns, and one of them is zero.

The second theorem says that if the size is big enough (and again, what is size here? it is the size of the factorization, the number of columns of the factors), then you can reach the global minimizer from any initialization. How? Look at this picture. Suppose I start here, with some size for my factorization. I do gradient descent if you want, or alternating minimization, or proximal gradient, whatever your favorite method is, and I go down, down, down, and eventually I arrive at a local minimizer. What we're going to do is ask: are you at a local minimum? If not, keep going down. If you are at a local minimum, test whether theorem one is satisfied, namely whether one column is equal to zero. If that's the case, you're done: you've found a global minimizer. Obviously, here that's not the case, so what we do is increase the size of the factorization by one: we add one column to U and one column to V, and we select that column in a very particular way, so that I can guarantee I keep going down. Pictorially, this is really like traversing this flat plateau from left to right in just one step, by adding that extra column; I move from left to right by increasing the size, then I keep going down, and eventually I get to the global minimum.

Now, this is sort of a meta-algorithm; it's not necessarily implementable, and there are two things hidden under the rug that require further research. First, can you guarantee that you are at a local minimum of a non-convex problem? Not very often, right? Just guaranteeing that you are at a local minimum is usually NP-hard for many optimization problems; whether it is possible, and for what classes of neural networks, is still an open question that is heavily being investigated as we speak. Second, suppose that for the network you're training you had an optimization method that guarantees you are at a local minimum. Then you have the question: how do I increase the size, and how do I select a descent direction? For some regularizers that question is tractable; for the ℓ2 I was just talking about, you have to solve an eigenvalue problem, and with that you choose the new columns of U and V. But in general that question is also NP-hard, depending on which regularizer you pick. So figuring out for which regularizers the problem of augmenting the size of the network is tractable is another open challenge.
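As a deliberately simplified rendering of that meta-algorithm for the matrix factorization case, here is a sketch of mine. It uses the loss 0.5·‖Y − UVᵀ‖² with the regularizer (λ/2)(‖U‖² + ‖V‖²), whose minimum over factorizations matches the sum-of-products form, and uses the known optimality test for this particular problem, σ_max(Y − UVᵀ) ≤ λ, as the "does a descent direction exist?" check. None of this is a published implementation:

```python
import numpy as np

rng = np.random.default_rng(4)

def descend(Y, U, V, lam, steps=5000, lr=0.01):
    # Gradient descent on 0.5*|Y - U V^T|_F^2 + 0.5*lam*(|U|_F^2 + |V|_F^2).
    for _ in range(steps):
        R = U @ V.T - Y
        U, V = U - lr * (R @ V + lam * U), V - lr * (R.T @ U + lam * V)
    return U, V

def certificate_holds(Y, U, V, lam, tol=1e-3):
    # At a stationary point of this problem, global optimality amounts to
    # sigma_max(Y - U V^T) <= lam; otherwise the top singular pair of the
    # residual gives a descent direction for an appended column.
    return np.linalg.svd(Y - U @ V.T, compute_uv=False)[0] <= lam + tol

def meta_algorithm(Y, lam=1.0, max_grow=10):
    m, n = Y.shape
    U, V = 0.1 * rng.standard_normal((m, 1)), 0.1 * rng.standard_normal((n, 1))
    for _ in range(max_grow):
        U, V = descend(Y, U, V, lam)
        if certificate_holds(Y, U, V, lam):
            return U, V                      # certified (approximately) global
        # Grow the factorization by one column along a descent direction:
        u, s, vt = np.linalg.svd(Y - U @ V.T)
        U = np.column_stack([U, 0.1 * u[:, 0]])
        V = np.column_stack([V, 0.1 * vt[0]])
    return U, V

Y = rng.standard_normal((8, 8))
U, V = meta_algorithm(Y)
print("final size:", U.shape[1])
```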
Let me check the time. OK, so let me now move on to how this extends to deep networks. I hope I've convinced you so far that matrix factorization is a special case of deep learning with no nonlinearities and just one hidden layer, and that for matrix factorization you can guarantee global optimality. How did I convince you of that? I said: the big problem is non-convex, but there is a very strongly related problem that is convex, and the two are so related that you can use one to guarantee the global optimality of the other. That was the high-level message. So what types of neural networks can this be extended to?

The first one is nearly immediate: neural networks with a single hidden layer. This case is almost like matrix factorization: these are the weights from the input layer to the hidden layer, and these are the weights from the hidden layer to the output layer. If there were no nonlinearity, this would be exactly matrix factorization, and the number of columns of the factor matrix becomes the number of neurons in the hidden layer. The second type of architecture to which this framework applies is the one illustrated here. Suppose you've got whatever your favorite big network is, with as many layers as you want and as many neurons per layer as you want; the only requirements are those I stated at the beginning, that the functions need to be positively homogeneous, so rectified linear units are good, max pooling is good, but other things might not be. What you now do is replicate that network multiple times: you put many networks in parallel and then you sum their outputs in order to generate the output. In this particular case, what is the definition of size? The size is the number of parallel subnetworks. So in the first case size is the number of hidden units, and here size is the number of parallel subnetworks, the reason being that the theorems so far are restricted to having a single number characterizing size, effectively the complexity of the architecture. For matrix factorization it's very clear: it is the number of columns you put in the factorization. For deep networks it's not as trivial, because you could say it's the number of layers, or the number of neurons; which is it? So this is a bit of a contrived example where the number of layers and neurons is fixed to whatever you want for your architecture, and size, the only thing you can vary, is the number of parallel subnetworks.

This is the generalized factorization problem; matrix factorization, as I've said repeatedly, is a particular case, and both maps are positively homogeneous, which is the requirement for the theorems. So the first thing we need is to extend the concept of matrix multiplication to multiple layers with nonlinearities. Here is how: think about matrix factorization; it's UV transpose, but you can always write it as a sum, from 1 to the size, of outer products of the columns of the matrices. That's really all we need, and this is the more general formula: the output of the network depends on many layers and many weights, which could be matrices or tensors, but we write it as a sum, from 1 to the size of the network, of positively homogeneous functions of those weights. That's the key insight, and matrix factorization gives you an example: the sum runs over the size, and the positively homogeneous function is just the outer product of the columns. In the case of a three-layer neural network, this is the input-output map, and it satisfies the properties I just described.
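In symbols, my rendering of the construction just described:

$$\Phi(X^{1}, \dots, X^{K}) \;=\; \sum_{i=1}^{r} \phi\big(x_i^{1}, \dots, x_i^{K}\big), \qquad \Theta(X^{1}, \dots, X^{K}) \;=\; \sum_{i=1}^{r} \theta\big(x_i^{1}, \dots, x_i^{K}\big),$$

where $r$ is the size (the number of columns in matrix factorization, the number of parallel subnetworks here) and $\phi, \theta$ are positively homogeneous of the same degree. Matrix factorization is recovered with $\phi(u_i, v_i) = u_i v_i^{\top}$, the outer product of the $i$-th columns, and $\theta(u_i, v_i) = \|u_i\|\,\|v_i\|$.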
In fact, look at it: input data multiplied by the matrix of weights, then the nonlinearity, then multiplied by the classifiers. This is almost matrix factorization; the only difference between a neural network with a single hidden layer and matrix factorization is that nonlinear map you put in the middle. So the next, and key, thing I need to tell you is how to generalize the regularizer. The regularizer for matrix factorization came from the variational form of the nuclear norm, and what was it? It was a sum of products of norms, and there was this beautiful property that the fancy norm of a matrix is obtained by finding a factorization of it and, among all possible factorizations, minimizing the sum of the products. Now, how in the world do we generalize that to deep networks? Here is the generalization; I'm going to start from the far right, look at where my mouse is. Pick the output X of the network, and select weights X1 through XK such that the network produces X as its output. This is entirely analogous to: given a matrix X, find a factorization of it. Similarly here: given the output of the network, find weights that produce that output. Any choice will do; the choice is not unique, and there may be many, in the same way that there are many factorizations. Among all possible choices of weights that generate that X, select those that minimize something like the sums of products of norms of the weights; in fact, more generally, any positively homogeneous function of those weights will do the job. The minimum of that is a function of X, in the same way that the minimum over factorizations was some function of X; there, that function was a norm, and it was convex, and that's what gave me the convex problem. So what is this Ω of X? Here is the fundamental result: in spite of this being fancy, with lots of products, it gives you a function that is convex in the output of the network. Even though we have a non-convex problem in the weights, we can find a problem that is convex in the output.

So this is now the main theorem, stated with more proper notation. The training problem for a deep network is: minimize the loss between the labels and the output of the network, plus λ times some regularization of the weights. That problem is quote-unquote "equivalent" (it's not, but it's closely related) to minimizing the loss between the labels and some output with some regularizer Ω on that output, and, as it turns out, that regularizer is convex; so if the loss is convex in the output, you've got a convex problem to compare with. The two problems are not equivalent, but they are strongly related, and, as I said earlier, the convex problem is a global lower bound for the non-convex one; a particular case is matrix factorization. So what this theorem says, again, is: if you find a local minimizer of the neural network training problem such that some of the weights of the network are zero (which ones? all of the weights associated with one neuron, for a network with a single hidden layer, or all of the weights of one parallel subnetwork, in the other example where there were many of them in parallel), so if such a collection of weights is equal to 0, then that local minimizer is global. Moreover, the output Φ at those weights is also a global minimizer of the convex problem. And again, this suggests a meta-algorithm for optimizing both over the architecture of the network and over the weights.
The algorithm is essentially the following; let me illustrate it with a network with a single hidden layer. Suppose I start with five neurons in the hidden layer. I minimize over those weights and find a local minimum. I check: are the weights associated with some single neuron equal to zero? No. Therefore I add an extra neuron, initialize the weights of that extra neuron in a clever way, and continue with backpropagation on the new network, which has one extra neuron. You keep repeating this process, minimize, add an extra neuron, until eventually all of the weights associated with one neuron are equal to zero; when that happens, you know that you are at the global minimum.

So, to summarize, there are three big messages in this first lecture. The first one is that size matters. One reason why we might be stuck with lots of local minima is that we tend to optimize over things of fixed size; when we do matrix factorization, we tend to fix the number of columns. The message is that if the size you pick is too small, there will be many local minima, but if the size you pick is large enough, eventually those local minima disappear. Today I've covered very simple cases, like adding one neuron at a time to a network with a single hidden layer. What we would like to do in the future, and we don't quite know how yet, is to jointly add new layers and add more neurons to the layers; more broadly, how do you optimize over the architecture? I think it's fair to say that, yes, we've gone from handcrafted features to learned features, but the reality is that we've gone from handcrafted features to handcrafted architectures, and therefore having a methodology that optimizes over what the architecture should be is very important for the future. I think this framework provides the tip of the iceberg, by telling you how to optimize when all you can do is add one neuron at a time.

The second message of this lecture is that regularization matters. These sums of positively homogeneous regularizers are what show up, and, more importantly, the degree of the regularizer has to match the degree of the network: we should be using degree K for a network with K layers. How to build more sophisticated regularizers that are able to capture many layers and so on is an open problem from the perspective of optimization, but you will see in Raja's talk how to build regularizers with good generalization performance.

Now, these results are very preliminary, just the tip of the iceberg, because there are two bad things about them, or maybe three. One is that I've said many times: if the size is big enough, everything is nice; but how big is "big enough"? With the results of today, that number is too big to be practical. Typically, for a neural network, the size would need to be the dimension of the input times the dimension of the output, so for images it would be the number of pixels times the number of classes, and that size is perhaps too big. In matrix factorization, in particular, you know that number is way too big, because if I'm trying to factorize a matrix that is n by n, the rank is way smaller than n times n. So there is a lot of hope that the size criterion can be improved, but not as of today.
The other aspect is that much of the results are meta-algorithm-like: verifying that you are at a local minimum is a hard problem of its own, and adding an extra column that gives a descent direction is, in general, another hard problem. With that, thank you, and I'm ready for questions. Is there a mic there?

Q: Hi, just a quick technical question: when you have a matrix X, how do you know that there exists some sort of decomposition? For matrix factorization we know, but when you have the nonlinearity, how do you know that the decomposition exists?
A: The fact that outputs of the network exist is by construction, because you have an input and you produce an output. For a given matrix X, whether a preimage exists you don't know, because the network may only generate a subset of the space. That being said, the X is just a fictitious quantity, because you're not optimizing over it; you're always optimizing over the weights, so that X always exists by construction as you proceed with the iterations.
Q: But doesn't that break the convexity of the relaxation, which is over X? Because then you have a constraint that X has to live, if you take the relaxed problem, in some image of your network, which may be a non-convex set.
A: Yes, but again, we are not asserting that the two problems are equivalent; we are asserting that the global minima are the same, and therefore it is not needed that you impose those constraints on X.
Q: OK, thank you.

Q: Really interesting work. I'm just wondering, in the context of your study, how do you think about the function of dropout, since dropout acts like a regularization?
A: Yes. This is unpublished work that one of my students has been working on, and it takes long to explain, so let me try to do it in 30 seconds. If you apply dropout to matrix factorization by dropping out a certain factor, and you think of that dropout as a Bernoulli random variable, then the loss is evaluated at a random quantity. Suppose you take the expected value of the loss with respect to the choice of the Bernoulli variables: you can prove analytically that this gives you a loss on the factorization plus a regularizer, and you can prove that the regularizer is the sum of the products of the norms of the columns of the matrices, with the exception that they are squared. So, modulo that hand-waving part, what I conjecture is that dropout induces a form of regularization that is precisely a sum of products of weights, and that by studying it further we are going to see that dropout is a stochastic gradient descent approach for minimizing that expected loss, and therefore that you can have guarantees of global optimality, because the type of regularization that is induced falls exactly within the family I discussed today. But it is work in progress.
Q: Thank you. A follow-up: in essence, does that defeat the assumption that the positively homogeneous functions have to be of the same degree, in the case of dropout?
A: With dropout it doesn't follow directly: you get the squares, so the degree of the regularizer is twice as much, and we actually had to correct it by varying the dropout rate in a way that increases with the size of the factorization (the larger the matrix, the more you have to drop out, kind of thing) in order to recover the results for global optimality. So it doesn't follow directly; it needs a little bit of modification.
Q: OK, thank you.
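For reference, the expectation described in the dropout answer above appears to work out as follows; this is my reconstruction, not the unpublished result itself. With i.i.d. Bernoulli($p$) variables $\delta_i$ dropping out the $r$ rank-one factors,

$$\mathbb{E}_{\delta}\left\| Y - \frac{1}{p}\, U\,\mathrm{diag}(\delta)\, V^{\top} \right\|_F^2 \;=\; \left\| Y - U V^{\top} \right\|_F^2 \;+\; \frac{1-p}{p} \sum_{i=1}^{r} \|u_i\|_2^2\, \|v_i\|_2^2,$$

i.e. the expected loss equals the deterministic loss plus a sum of products of squared column norms, degree 4 rather than 2, which is consistent with the remark that the dropout rate has to be adjusted with the size of the factorization.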
Q: [Inaudible question.]
A: Yes. So maybe I was not clear enough, but we are solving the problems in the factorized space, not in the X space, so we are never solving the convex problem in a large number of variables. The benefit? I mean, if the matrix to be factorized is of the size of an image and you have thousands of images, you can't even do it with SDP programming, but you can do it with these methods. I don't have precise numbers for the speed-up that you get, but it's very clear that you cannot use an SDP for very large data.
Info
Channel: ComputerVisionFoundation Videos
Views: 17,990
Keywords: CVPR17, Deep Learning, Tutorial
Id: Mdp9uC3gXUU
Length: 69min 49sec (4189 seconds)
Published: Wed Aug 23 2017