Varun Kanade: Statistical Learning Theory II

Captions
So this is part II, and I guess there will be fewer theorems, but I'll try to look at how some of the things we introduced apply to more classical machine learning techniques. I still won't go deep into the practical side of machine learning, but I will talk about models that are commonly used, or at least some of them. The neural networks won't be exactly the kind people use to get good results on cats and so on, but I'll mention a few of the theoretical results we have about neural networks.

Let's start with the binary classification problem again. We already discussed that if the given data is nice and cleanly separable, positives on one side and negatives on the other, and we want a linear separator, a hyperplane separating the positives from the negatives, then whenever a linear separator exists we know we can find one efficiently; we discussed that briefly. It will classify all the points correctly. But as you can see in this picture, there are many such linear separators, not just one. So which one should you pick, and is there a good basis for picking one over another? The empirical risk minimization principle doesn't really tell us: all of these have zero empirical risk on the data we see, so all of them are equally good. Nonetheless, probably none of us would pick, say, the green one, which is a bit too close to one of the blue points, so there is some danger it might misclassify something. This is what I want to discuss next, and it leads to the whole theory of support vector machines. Vapnik, the Vapnik behind the VC dimension, who really founded statistical learning theory, was also very influential in developing the theory around support vector machines and kernel machines.

The principle here is called the maximum margin principle: if there are many separators, you should pick the one that maximizes the distance of the closest point from the boundary. Every point has some distance to the boundary; for any separator, look at the closest point, and that distance is your margin. It is how much room you have, because the danger comes from getting too close. In that sense you would pick the black line on the right side of this example, which is the maximum margin classifier. The points that are closest in that setting are the ones called support vectors. The reason for the name will become clearer, but you can think of them as supporting the hyperplane.

So what are these margins? In case it has been a while since you've seen high-school geometry, let me quickly recall what the distance from a linear separator means. A hyperplane in n-dimensional space is given by w · x + w0 = 0. How far is a point x in R^n from this hyperplane? It is a fairly simple calculation to show that the distance is |w · x + w0| / ||w||, the absolute value of w · x + w0 divided by the norm of the vector w. That is what is going to be your margin.
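To make the distance formula concrete, here is a minimal numpy sketch (the function names are mine, not from the lecture):

```python
import numpy as np

def distance_to_hyperplane(x, w, w0):
    """Distance of a point x from the hyperplane {z : w.z + w0 = 0}."""
    return abs(w @ x + w0) / np.linalg.norm(w)

def geometric_margin(X, y, w, w0):
    """Margin of a separator on a labelled sample: the signed distance of the
    closest point; it is positive iff every point is correctly classified."""
    return np.min(y * (X @ w + w0)) / np.linalg.norm(w)
```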
Now, points might also be misclassified, so this is what's important: for the points labelled positive, y_i = +1, you want w · x_i + w0 to be non-negative, and for points labelled -1 you want it to be negative. You want them on the right side, and for any such point the quantity above is the margin; we want to find the maximum margin classifier.

So that's the setup. What is the actual way to find such a maximum margin classifier? This is the standard formulation of support vector machines, so let's spend a little time understanding why it is equivalent to finding the maximum margin. The optimization problem says: minimize (1/2)||w||^2 (the factor 1/2 is just for convenience) subject to the constraints y_i (w · x_i + w0) >= 1 for all i. So if you find some w satisfying these constraints, what is the margin? It is 1/||w||. You know the margin is at least 1/||w||, because each y_i (w · x_i + w0) is at least 1; and at the optimum, at least one of the points must satisfy its inequality with equality, otherwise you could rescale and find a w with smaller norm. So the margin is really exactly 1/||w||, and minimizing the norm maximizes it. What we've achieved is that the rather complicated-looking expression for the margin is not that complicated after all, and we now have a much cleaner optimization problem: the only unknowns are w and w0, the constraints are just m linear inequalities, and the objective is a simple quadratic function. We are minimizing a convex function over a convex set, so this can be solved using standard convex optimization techniques, and these problems can be solved efficiently.
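To make the formulation concrete, here is a minimal sketch using the cvxpy modelling library (my choice of tool, not the lecture's; practical SVM solvers use specialized methods):

```python
import numpy as np
import cvxpy as cp

def hard_margin_svm(X, y):
    """Solve: minimize (1/2)||w||^2  s.t.  y_i (w.x_i + w0) >= 1 for all i.

    X is an (m, n) data matrix, y an array of +/-1 labels; the data is
    assumed linearly separable, otherwise the problem is infeasible."""
    m, n = X.shape
    w, w0 = cp.Variable(n), cp.Variable()
    constraints = [cp.multiply(y, X @ w + w0) >= 1]
    cp.Problem(cp.Minimize(0.5 * cp.sum_squares(w)), constraints).solve()
    return w.value, w0.value
```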
Even though in principle this is a convex optimization problem that we could solve directly, it is common to solve it through the dual of this primal problem. I've written the dual on the slide, but let me spend a little time on how one gets there. On the left we have the primal problem; we can write the Lagrangian associated with it, with Lagrange multipliers alpha_i for the constraints. Maybe it isn't on the slide, so I'll write it: the gradient of the Lagrangian with respect to w is w minus the sum over i of alpha_i y_i x_i, and we set this to zero. You can also take the derivative with respect to w0, which gives the constraint that the sum of alpha_i y_i equals zero, and substituting back into the Lagrangian gives the dual version, which you now want to maximize.

There is something worth noting in this form: it tells you that w can always be written as a linear combination of your vectors x_i, so the solution you find is going to be some linear combination of the vectors in your data set. In fact, if you look at the complementary slackness conditions, they say that alpha_i times the constraint is zero: alpha_i (y_i (w · x_i + w0) - 1) = 0. So if alpha_i is non-zero, then y_i (w · x_i + w0) - 1 must be zero, which means precisely that this point sits exactly on the margin: its inequality is satisfied with equality, so it is one of the closest points to the separating hyperplane. Those are the points that contribute to the solution; they are in the support of the linear combination that gives us the solution, and this is why they are really called support vectors.
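Written out (this is the standard hard-margin Lagrangian, which I believe matches the slide):

$$\mathcal{L}(w, w_0, \alpha) = \tfrac{1}{2}\|w\|^2 - \sum_{i=1}^{m} \alpha_i\big(y_i(w \cdot x_i + w_0) - 1\big), \qquad \alpha_i \ge 0.$$

Setting $\nabla_w \mathcal{L} = w - \sum_i \alpha_i y_i x_i = 0$ and $\partial \mathcal{L}/\partial w_0 = -\sum_i \alpha_i y_i = 0$ and substituting back gives the dual:

$$\max_{\alpha \ge 0} \; \sum_{i=1}^{m} \alpha_i - \tfrac{1}{2}\sum_{i,j} \alpha_i \alpha_j y_i y_j \,(x_i \cdot x_j) \qquad \text{s.t.} \quad \sum_{i=1}^{m} \alpha_i y_i = 0.$$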
So that is a nice picture for when the data is nicely linearly separable, but this is often not going to be the case in practice, so we have to do something else. Most likely our data will actually look like this, and we have already established that we cannot find a separator minimizing the number of misclassifications unless P = NP. So we have to find some other relaxation that we are able to solve, and this is what the SVM formulation does for the non-separable case: it introduces slack variables xi_i. Instead of the inequalities saying this quantity should be at least 1, we allow some slack and only insist that it be at least 1 - xi_i, with the xi_i non-negative. Of course we could always set the slacks enormously large and satisfy all the inequalities, which would give meaningless solutions, so we have to penalize the slack. We are allowing the inequalities to be somewhat violated, but we want to minimize the overall violation, and the way we enforce that is by adding an extra term to the objective: C times the sum of all the slack variables.

If the data were indeed separable, we would get back our original setting if we chose the correct value of C. There is a trade-off in this so-far-undefined quantity C: how much do we value a small norm of w versus how much we penalize slack. If C is very small, there is not much penalty for slack, so even if the data is linearly separable the solution might make some errors in order to find a w with very small norm. On the other hand, if C is very large and the data is separable, the optimum will make sure all the slack variables are indeed zero. So once we introduce this formulation, it may happen, if we don't use the correct C, that even for data that was linearly separable to begin with we end up with a solution that does not achieve zero training error. That is something to keep in mind. Any questions about this?

One way to think of this formulation involves more traditional approaches in machine learning: you can view it as the combination of a regularizer and a loss function. In what sense? Minimizing the sum of squares of the parameters is how regularization is usually done in machine learning methods, and the extra term is what we want to interpret as a loss, namely the hinge loss. Why is this the right loss function? At the optimal solution, the xi_i are required to be non-negative, but we would never want to set them to anything more than max(0, 1 - y_i (w · x_i + w0)), because they are being penalized in the objective, so we always want them as small as possible; and that maximum is exactly the function plotted here. In words: if you classify a point correctly, meaning y_i has the same sign as w · x_i + w0, and in fact the product is at least 1, there is no loss for that point at all; you classify it correctly with a margin of at least one. On the other hand, you are sometimes penalized even on correctly classified points, whenever the product fails to be at least 1: even if you are on the right side of the hyperplane but too close to it, you still pay a penalty for being too close to the boundary of the separating hyperplane. That is what this loss function is really doing. So in the more traditional view, you can think of this as minimizing a loss function plus a regularization term; usually you would put a parameter lambda on the regularizer instead of a C on the loss, but there is always a setting of C that gives you exactly the same objective function.
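In code, the loss-plus-regularizer view is just this (a minimal numpy sketch; each slack has been replaced by its optimal value, as argued above):

```python
import numpy as np

def hinge_loss(X, y, w, w0):
    """Per-point hinge loss max(0, 1 - y_i (w.x_i + w0))."""
    return np.maximum(0.0, 1.0 - y * (X @ w + w0))

def soft_margin_objective(X, y, w, w0, C):
    """Soft-margin SVM objective (1/2)||w||^2 + C * sum_i xi_i, with each
    slack xi_i set to its optimal value, the hinge loss of point i."""
    return 0.5 * (w @ w) + C * hinge_loss(X, y, w, w0).sum()
```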
The dual gets slightly more complicated when you have the slack variables, but not too much. We start with our primal formulation and again write out the Lagrangian: we take the objective function, and for each of the constraints (they are all inequality constraints, so all the Lagrange multipliers must be non-negative) we get extra terms, with multipliers alpha_i and mu_i. I've done it explicitly here, in case you are wondering how I got the dual version: take the derivatives of the Lagrangian with respect to the primal parameters, that is w, w0 and all the xi_i, set them all to zero, and add the dual feasibility constraints that the Lagrange multipliers be non-negative. Again it pops out that w has to be written as a linear combination of your input data vectors. And in case you are wondering, there is a reason why you get sum_i alpha_i y_i = 0 as one of the constraints; it is a good exercise to try to understand it, because what it is really saying is that you put equal weight on the positive and the negative support vectors in your solution, and it is important that it is done that way. Try to see why, from a maximum margin perspective, it makes sense for that constraint to pop out.

Substituting back, we get a maximization problem as the dual, with this function G(alpha). The mu_i actually disappear; instead we get the constraints that the alphas lie between 0 and C, plus this one equality constraint. So this is a maximization problem, but luckily G is a concave function, and this should not be surprising: we had a convex function to minimize in the primal, so we get a concave function to maximize in the dual.

Can anyone think of why one might want to solve the dual rather than the primal version of this problem? Yes. OK, so the suggestion is that it can be done more efficiently. I'll come to the Gram matrix in a bit; that is one reason, that we can start kernelizing these methods, and the dual makes a lot more sense when we talk about kernels. But there is also this: the primal constraints were these somewhat unwieldy inequalities involving y_i (w · x_i + w0), which give you some polytope that could be slightly strange, whereas the dual constraints are very nice. All the dual variables lie in a box, and there is one hyperplane on which they must all lie, which is much nicer than the unwieldy primal constraints. Purely from the point of view of solving the optimization problem, it is much nicer to solve the dual than the primal, and you do want to make these problems run as efficiently as possible. But the point about kernels is also important, and I will come to it in a little bit.

So, to summarize, here are the primal and the dual forms. The complementary slackness conditions again tell us that alpha_i times the constraint is zero, so if alpha_i is non-zero that constraint is satisfied with equality, and those points will again be our support vectors. The margin no longer necessarily has a clean geometric interpretation here: being a support vector means the corresponding hinge-loss term need not be zero, so the point is either sitting on the margin or violating it. And here is our picture again.
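Written out, the dual just summarized (a box plus one hyperplane) is:

$$\max_{\alpha} \; G(\alpha) = \sum_{i=1}^{m} \alpha_i - \tfrac{1}{2}\sum_{i,j} \alpha_i \alpha_j y_i y_j \,(x_i \cdot x_j) \qquad \text{s.t.} \quad 0 \le \alpha_i \le C, \quad \sum_{i=1}^{m} \alpha_i y_i = 0.$$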
Now that we've gone through these margins, what do they really tell us if we try to connect back to our statistical learning theory approach of giving generalization bounds? So far we've only said: here is our training data, can we find a linear separator, can we find one fast, can we find one that makes sense. We haven't said anything about data we haven't seen from the distribution, and that is the connection we want to make.

This is a binary classification problem, so we could just resort to VC dimension again. What is the VC dimension of linear separators in n dimensions? It is n + 1, one more than the dimension, and if you haven't seen it, it is a nice exercise to show that this is the VC dimension of linear separators in n dimensions. So you could derive bounds directly using only the VC dimension, and you would get a generalization bound saying that if the number of examples is larger than some quantity, you get some kind of generalization; that is perfectly fine. But we went through all this trouble with the max-margin formulation; we are not choosing just any hyperplane but a more reasonable one. So the question is: can we get a generalization bound that somehow depends on the margin? That is what we would like to pop out of this kind of analysis.

To do that, we will look at a different cost function, gamma_rho, from R x {-1, 1} to [0, 1]. The second argument is always the observed label, so -1 or 1; the first argument can be any real number. It is defined through a function phi_rho, which is a bit like the hinge loss but not exactly; it is clamped at 1. If the argument z is negative, phi_rho(z) is 1; if z is at least rho, it is 0; and in between it is linear, so that the whole function is continuous. In some sense: points that are misclassified are simply penalized with a 1; points that are classified correctly with a decent margin incur no loss; and points that are classified correctly but a bit too close to the boundary incur some penalty depending on how close to the boundary they actually are. The hinge loss, by contrast, explicitly penalizes you more and more the further you are from the boundary on the wrong side. That makes sense from the optimization point of view, but if you are only interested in binary classification, you can really just count the cost as one: once you have misclassified a point, there is no reason to assign it more cost than that.

Mind you, this is what we are going to use to prove generalization bounds; it is not the optimization problem we are going to solve. We are still going to solve the SVM formulation, because phi_rho is not a convex function, whereas the hinge loss is: with the hinge loss we actually get a convex optimization problem, which we could not get by optimizing this kind of cost function directly. This is really just for analyzing the generalization bounds, not for doing any optimization.
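Concretely, phi_rho (a "ramp" version of the hinge) fits in two lines; a sketch, with the convention that the cost is applied to the product of label and real-valued prediction:

```python
import numpy as np

def phi_rho(z, rho):
    """Ramp loss: 1 for z <= 0, 0 for z >= rho, linear in between."""
    return np.clip(1.0 - z / rho, 0.0, 1.0)

def gamma_rho(y_pred_real, y_true, rho):
    """Margin cost gamma_rho : R x {-1, +1} -> [0, 1]."""
    return phi_rho(y_true * y_pred_real, rho)
```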
So, restricting ourselves to linear functions with the norm of w bounded by some capital W, and with the norm of x bounded by a capital X for all the input vectors, this function phi_rho is (1/rho)-Lipschitz. So we can apply Talagrand's lemma again, or some instance of it, and get a bound on the Rademacher complexity if we want to use it to understand this loss function.

What exactly are we going to apply this to? I'm going to look at yet another cost function, our familiar one: simply the classification loss. gamma(y', y) is the 0-1 loss where, if I just have to make a prediction, no margins, nothing, I use the sign of y'. The SVM returns me a linear function, a w and a w0; I look at the sign of w · x + w0, which tells me which side of the hyperplane I am on, and that is my prediction. So gamma is simply the 0-1 loss of the prediction given by my support vector machine classifier, and gamma_rho is the loss we defined on the previous slide. The first thing to notice is that gamma(y', y) <= gamma_rho(y', y) always, because whenever gamma is 1, gamma_rho is also 1. Maybe I should have put the picture on this slide, but the picture is that pointwise gamma_rho dominates gamma. That is good, because it means we can say something about the classification loss in terms of the risk with respect to gamma_rho: it is always going to be an upper bound on the classification error.

So define R, the risk with respect to the classification error, the 0-1 loss of sign(w · x) against y, and R_rho, the risk with respect to gamma_rho; for both I will write R-hat and R-hat_rho for the corresponding empirical risks on your training sample. Now we get the following kind of bound. The risk R(h) of any hypothesis h is bounded by R_rho(h), using the pointwise domination above, which in turn we can bound using the Rademacher complexity theorem: at most the empirical margin risk plus something depending on the Rademacher complexity of the composition of phi_rho with H. And we already had that Rademacher complexity on the previous slide: it is WX / (rho sqrt(m)). This is where the margin really shows up, and W is what controls the margin: 1/W is going to be the margin, and you want to choose rho to match it. So you get something that actually depends on your margin, and in this case a sample of size roughly W^2 X^2 / epsilon^2 is sufficient to get excess risk epsilon over what you observe on your sample.
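Putting the pieces together, the margin bound has the following shape (constants vary with the exact statement; this is the version for h(x) = w · x with ||w|| <= W and inputs with ||x|| <= X):

$$R(h) \;\le\; \widehat{R}_\rho(h) \;+\; \frac{2}{\rho}\cdot\frac{W X}{\sqrt{m}} \;+\; O\!\left(\sqrt{\frac{\log(1/\delta)}{m}}\right) \qquad \text{with probability} \ge 1 - \delta, \text{ for all such } h.$$

At the scale of the SVM constraints, where y_i (w · x_i + w0) >= 1, one takes rho = 1 and the middle term is 2WX / sqrt(m); equivalently, normalize ||w|| <= 1 and take rho to be the geometric margin 1/W. Either way, the sample size W^2 X^2 / epsilon^2 quoted above comes out.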
What is worth noting is that this does not depend on the dimension directly at all. If there is a dependence on dimension, it comes entirely through the norms: a vector in a much higher-dimensional space is quite likely to have a higher norm, so in that sense there can be an implicit dependence, but there is no explicit dependence on the dimension. Somehow, if your data lies in a high-dimensional space but the relevant norms are bounded by small quantities, and you can actually find a linear function with small norm, then your generalization error is controlled only by some function of these norms, not by the dimension. This is crucial: if you use VC-like bounds, you have to pay for the dimension, because the VC dimension of half-spaces is n + 1 and you always get that. Whereas here, the fact that you can separate with a w of small norm essentially says that there is a large enough margin, and using the fact that your solution has a large margin gives you a better generalization bound than the VC bound out of the box. Any questions on this?

One important thing to note: although we are giving a bound on the classification error, it is not in terms of the best classification error on the training data set. There is this funny loss on the training data, and we don't even know whether we've optimized with respect to it; the loss we actually optimized on the training set is the hinge loss with some regularization. All we can say is that if we so happen to have a small value for that margin loss, then the true classification risk is also not going to be very large. This by no means guarantees that we found anything close to optimal, so it is important to keep in mind that the quantity on the right-hand side is not an empirical risk of the same kind as on the left-hand side: we bound the classification error under the true distribution in terms of the empirical risk with respect to a different loss function, and that works precisely because gamma_rho pointwise upper-bounds the classification loss.

Yes? OK, so I haven't seen that. What I have seen is that if you do have a large-margin linear separator, you can apply dimensionality-reduction techniques to show that you can also get a linear separator in a lower dimension, and you can argue that way that the VC dimension is an overkill when you have a large margin. The margin is what lets you do the dimensionality reduction. I forget who did that work, but it has been done.

And yes, the point remains that we are not solving the NP-hard problem. We are still hiding that away: all we are saying is that if, by magic, you happen to have a small enough hinge loss on your training set, then your classification error in expectation is also not going to be very large. That is what this kind of bound is telling you.
OK, so let me now talk a bit about kernels, and this is where the idea of not using dimension-dependent bounds becomes even more important, because once we start talking about kernels we may really be living in an infinite-dimensional space, and these kinds of bounds will still apply as long as we can bound the correct norms.

Let's go back to the dual formulation of the SVM for a bit. We can write the data as a matrix: if we have vectors x_1 through x_m, the i-th row of the matrix X is the data point x_i. Consider the product X X^T, which I will denote by K: its entries are simply the inner products between pairs of training vectors. This is called the Gram matrix, and it is positive semi-definite just by construction, because every entry is an inner product.

Now we can ask the following question. One can perform basis expansion. What is basis expansion? So far we have been looking at linear classifiers: I have some data, I try to find a linear separator, and that's it. But what happens if my data is not linearly separable? Here is a common example: what if my data looks a bit like this, with all the positive points here and some negative points there? There is no linear separator, but maybe there is a quadratic separator, or a cubic one: if I am willing to threshold a slightly higher-degree polynomial rather than just a linear function, I can get a separator for my data that I could not otherwise get. This is what kernels are going to prove useful for. If you do not want to use kernels directly, you can do what is called basis function expansion: simply take the data and construct monomials from it, the squares, the quadratic terms, the cubic terms and so on, getting new data points in a possibly much larger dimensional space. What was a polynomial in the original space is simply a linear function in the expanded space. So we are just using all our linear machinery in a higher-dimensional space, and all that changes is that instead of the entries x_i · x_j in this matrix, I now have the inner products Phi(x_i) · Phi(x_j).

The key thing to note is that if you want to solve the dual version of the problem, all you need are the entries of this matrix; we do not even need to know the data explicitly. So as long as we can compute the inner products Phi(x_i) · Phi(x_j), we are perfectly fine.
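A minimal numpy sketch of the Gram matrix, just to make "positive semi-definite by construction" concrete:

```python
import numpy as np

def gram_matrix(X):
    """K[i, j] = x_i . x_j for the rows x_i of X; equals X X^T, hence PSD."""
    return X @ X.T

def is_psd(K, tol=1e-9):
    """Check positive semi-definiteness of a symmetric matrix via eigenvalues."""
    return bool(np.all(np.linalg.eigvalsh(K) >= -tol))

X = np.random.randn(10, 3)   # 10 points in R^3
assert is_psd(gram_matrix(X))
```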
Let's look at an example to get a sense of how this is done; this is where you will hear the phrase "kernel trick", and what it really means is a computational trick. Say you want the ability to express thresholds of quadratic functions, and you start with two-dimensional data. I could construct the map psi taking x = (x1, x2) to the vector (1, x1, x2, x1^2, x2^2, x1 x2); now I can represent all possible quadratic functions, and thresholding one is just a linear threshold in terms of psi(x). But instead of psi I can use this slightly curious-looking map phi, which is not that different except for some funny factors of sqrt(2) hanging around here and there, and which still lets me express any quadratic function as before. What have we gained by doing that? If we use phi, the inner product phi(x) · phi(x') simplifies nicely into (1 + x · x')^2, and for degree-d polynomials into (1 + x · x')^d. Computationally this is much better: to fill in the entries of the kernel matrix I do not have to write out all the roughly n^d coordinates of a degree-d expansion and take inner products; I can just compute 1 + x · x' and raise it to the d, which takes time linear in n (plus the small cost of the power), certainly not n^d. So this is what the kernel trick actually means: with a proper kernel you get computational savings, and you never have to explicitly perform the basis function expansion.

So we have this kernel trick idea, and we call K the kernel matrix; in fact I can directly define kappa(x, x') = (1 + x · x')^2 to be my kernel and fill up the matrix by computing kappa. The property we want from kernels is that they be positive definite; such kernels are sometimes called Mercer kernels, or said to satisfy Mercer's condition. The condition essentially says: if I pick any m points and fill out this matrix, I should always get a positive semi-definite matrix. If that happens, we call kappa a positive definite kernel. A quick question, just out of curiosity: why do we want a positive definite kernel? Again, it is actually the optimization issue more than anything else: if you have a positive definite kernel, you get a concave function to maximize when solving the SVM dual. It is not so much about learning or representation, and not directly a statistical issue; it is really an optimization question. As long as K is positive semi-definite, you get a concave function to maximize, or equivalently a convex function to minimize.

So the kernelized dual formulation of the SVM is simply this: wherever you had the inner product between x_i and x_j, you now put the (i, j) entry of the kernel matrix. You can solve this, and you are now possibly finding a linear separator in a much higher-dimensional space. If you know it is a polynomial kernel, you know what that space is, but nothing stops you from defining any kappa you like, as long as it is positive definite (which is not always easy to check), and you can get increasingly complex classifiers.
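Here is the identity behind the trick checked numerically for d = 2 in two dimensions, using the feature map with the sqrt(2) factors:

```python
import numpy as np

def phi(x):
    """Explicit feature map for the degree-2 polynomial kernel in 2 dimensions."""
    x1, x2 = x
    return np.array([1.0,
                     np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1 ** 2, x2 ** 2,
                     np.sqrt(2) * x1 * x2])

x, xp = np.array([1.0, 2.0]), np.array([0.5, -1.0])
lhs = phi(x) @ phi(xp)        # inner product in the expanded space
rhs = (1 + x @ xp) ** 2       # kernel trick: computed in O(n) time
assert np.isclose(lhs, rhs)
```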
OK, so how do you predict on new points? If you had the feature vector w explicitly, it would be nice, because prediction is easy: you just compute w · x, check its sign, and make a prediction. But in the kernel form all you have are these alphas, which is a bit strange, and you need to be able to make a prediction. What you need is to keep all the support vectors in your solution: you compute the kernel kappa(x_i, x) between each support vector and the new point x, multiply by the corresponding alpha, sum up, and that gives you exactly what the prediction is. So there is a trade-off: if you solve the dual, prediction does become slightly more tricky, because you have to compute the kernel against everything in the support, which also means you always have to keep that part of your training set. You cannot just keep one vector as your output hypothesis and get rid of your training set; if you are deploying your classifier, you have to keep the support vectors with it. The other issue is that these algorithms are inherently quadratic in the sample size: at the very least you have to build the kernel matrix, which is already quadratic in the size of your data, and that is a computational cost you have to pay.

Here are some examples of kernels. There is the polynomial kernel, which we already discussed. More commonly, people use what is called the Gaussian kernel, sometimes also just called the radial basis function (RBF) kernel, although radial basis functions are more general. It says that the value of the kernel at points x and x' is the exponential of minus the squared distance between them divided by 2 sigma^2, where sigma controls the scale.
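A sketch of kernelized prediction with the Gaussian kernel (the variable names are illustrative; the alphas and w0 would come from solving the dual):

```python
import numpy as np

def rbf_kernel(x, xp, sigma):
    """Gaussian (RBF) kernel exp(-||x - xp||^2 / (2 sigma^2))."""
    d = x - xp
    return np.exp(-(d @ d) / (2 * sigma ** 2))

def svm_predict(x_new, support_X, support_y, alpha, w0, sigma):
    """Kernelized prediction sign(sum_i alpha_i y_i kappa(x_i, x_new) + w0).

    Note that the support vectors (and their alphas) must be kept around
    at prediction time, which is the point made above."""
    s = sum(a * y * rbf_kernel(xi, x_new, sigma)
            for a, y, xi in zip(alpha, support_y, support_X))
    return np.sign(s + w0)
```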
Think about what sigma is doing; maybe I'll draw a picture to explain what these algorithms are doing. Remember that our classifier looks at functions of the following form: we learn a function given by the alphas obtained by solving the dual, and when we get a new point we compute the kernel with all the support vectors, weight the terms appropriately, and apply the sign to get the binary prediction. With a Gaussian kernel, what you are really doing is placing a Gaussian bump at various points in space, at your x_i; depending on the alphas they can be negated and scaled, but you are representing your function as a linear combination of Gaussians with different centres. If sigma is very large, you are using really wide, flat Gaussians with high variance to fit your function, which in a sense makes it hard to fit many functions, because your basis functions are extremely flat. If sigma is very small, you are fitting extremely skinny Gaussians at various points. So there is a similar trade-off here: if you choose a very large sigma, you may have a large training error, because fitting functions is harder, but your generalization is probably going to be good because your model has low complexity; whereas with very skinny Gaussians you can probably fit the observed data perfectly, but it is not going to generalize very well. So you do not avoid the problem of training versus generalization at all; it just shows up in different ways in different places depending on what you do. In practice, the answer is cross-validation; that is the only answer. You try different values of sigma, train a classifier for each, hold out 10% of your training data or so, and choose the best. All these bounds tell us something in terms of sigma, but they are often very loose: they explain the story qualitatively, and you would not want to use the numbers that pop out of them too precisely to tune these values.

OK, let me move on to neural networks. Not that I am going to say very much about neural networks; certainly I am not going to explain how they are used in practice, for the most part. What is a neural network? Let's start from that; very soon I am just going to say that a neural network is simply a composition of certain kinds of functions. A single unit in a neural network has basically two components. There is an affine map, which can be high-dimensional: it takes some vector and maps it to a real number. And then you apply some nonlinear activation function, which I will call a. The combination of these is what is typically called a unit: a composition of an affine map and an activation function. There is a variety of activation functions you could use. The most common ones are the logistic sigmoid, given by sigma(z) = 1 / (1 + e^(-z)), and the ReLU, which is very fashionable these days. Strictly speaking, the rectifier is the function max(0, x), and the ReLU, the rectified linear unit, is the entire unit built from it; this is what the two activation functions look like. The sigmoid takes a real number and squashes it between 0 and 1, so it is very useful for converting real numbers into probabilities; the rectifier does something stranger which, I must confess, I do not fully understand, but that is the function.
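In code, a unit is one line on top of its activation; a minimal numpy sketch:

```python
import numpy as np

def sigmoid(z):
    """Logistic sigmoid: squashes a real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    """Rectifier: max(0, z), applied element-wise."""
    return np.maximum(0.0, z)

def unit(x, w, b, activation=relu):
    """A single unit: an activation applied to an affine function of the input."""
    return activation(w @ x + b)
```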
That is the fundamental unit of a neural network, and the way people construct these networks is just to put lots and lots of them together, typically in a layered architecture. In practice, the ones that work very well on hard problems we cannot solve otherwise usually use what are called convolutional neural networks, which are something different and which I will not really talk about today. Layered neural networks look something like this: the layers in the middle are called hidden layers, because you never actually observe anything from them. You have the input on one side and the outputs, given by whatever target labels you have in the supervised setting, on the other, but you do not observe what goes on in the middle, hence "hidden" layers. There are connections between every unit in one layer and every unit in the next layer up; typically there are no skip connections, though given that there are about 5,000 papers submitted to NIPS, probably every possible architecture you can imagine is being considered by someone at some point. So this is really the simplest possible thing; it is not the only thing people mean when they talk about neural networks.

I will focus only on fully connected, feed-forward neural networks: there are no loops, so you can think of everything as a directed acyclic graph, with a nonlinear activation applied element-wise. In principle it is also possible to apply nonlinear activations to multiple units at the same time, but I am going to think of the activations as applied individually to each unit. Please stop me if anything is unclear; this is an area with a lot of jargon, so it is a bit risky. A layer, for my purposes, is simply a nonlinear map applied element-wise to a vector obtained by applying an affine function: z maps to a(Wz + b), where W is a matrix and z and b are vectors, so you go from a layer with d1 units to a layer with d2 units through this process. That tells you how to go from one layer to the next. Yes? OK, so the suggestion is: do we really need to keep these b terms explicitly everywhere? The answer is no, you do not. You can avoid the constant terms by appending the feature 1 to every data point and simulating the constants that way, but keeping them is more common in the neural networks literature, so I will just keep them. Thanks.

So that is a single layer, and an L-hidden-layer neural network is simply a composition of all of these: you start with some input, apply layer 1, layer 2 and so on until you get to the penultimate layer; and the last layer (I am assuming there is a single output unit) is simply an affine map, with no non-linearity at the output level. That is all a neural network is: these are the kinds of functions represented by neural networks.
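The whole function computed by such a network is then just this composition (a minimal numpy sketch; weights and biases are lists of the per-layer parameters):

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def network(x, weights, biases, w_out, b_out, activation=relu):
    """Feed-forward network: compose layers z -> a(Wz + b), then apply a
    purely affine map at the single real-valued output (no final nonlinearity)."""
    z = x
    for W, b in zip(weights, biases):
        z = activation(W @ z + b)   # one layer: from len(z) units to len(b) units
    return w_out @ z + b_out
```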
Now, I am guessing you may have seen a result this morning that said something like this; Ben, did you cover this? OK, right. If you use the sign function for all the nonlinearities, then you actually get Boolean functions out of neural networks (in fact only the output has to have a sign function for that, but in any case, with sign for all the nonlinearities you get a Boolean function). You can define the class of all neural networks realizable with a certain number of units, or a certain number of parameters, and bound its VC dimension by, what I hope is correct, essentially something that depends on the total number of parameters in your model; that is what controls it. OK, so that is good, but this is typically a very large number: for the networks used in practice, the number of parameters is often much larger than the number of training samples. And this is still quite an unsolved problem: why do neural networks not overfit badly? These kinds of bounds do not immediately tell us why they do not overfit, and neither will any of the bounds I am about to show you.

But we can also try to compute the Rademacher complexity of neural networks. I am going to leave the computation as an exercise, but I will try to give a few hints along the way of how to actually do it. I am going to look at the class of functions represented by feed-forward neural networks with L - 1 hidden layers with the following properties. Every row of every weight matrix appearing in the network has l1 norm bounded by W. This l1 norm may look slightly strange, but it is important for this particular set of calculations: remember the nonlinear maps are of the type a(Wz + b), and a row of W is what multiplies the vector z; it will turn out that z has a bounded l-infinity norm, which is why you want the l1 norm on the rows of W. What you really care about is being able to bound the inner products of these quantities, and you can pair either a 2-norm with a 2-norm, or a 1-norm with an infinity-norm. Every bias vector b in the network also has a bound, this time on the infinity norm: every single element of b has absolute value bounded by some number B. All the activations in the network have to be 1-Lipschitz, so they cannot increase or decrease very rapidly; they have to be somewhat tame. And the inputs must satisfy that their infinity norm is at most 1. So that is quite a few conditions, but all of these together allow you to bound the Rademacher complexity by an expression that looks even trickier. It is actually not that hard to prove this result, but it is worth noting that there is an exponential dependence on L, the number of layers. So unless you have a reasonably shallow network, this grows very fast indeed; that is the main problem there. The dependence on B and W is otherwise OK, but the exponential dependence on the number of hidden layers means that in general, for a deep neural network, this could be very large.

I would not say that; I mean, if these are the only things you assume, then probably yes, but the real question is what the correct kinds of assumptions are for the kind of things you are doing, and that is still very much an unresolved question to some extent. The kind of regularization typically applied to the weight parameters includes the classical things, sums of squares of the weights and so on; you can do that for neural networks too, and indeed people do.
But other kinds of things, like dropout, I do not even know how to put into this kind of framework to understand what kind of regularization they are doing. So I would not take this as any kind of definitive answer on what is going on with the generalization abilities of neural networks.

OK, so here is the exercise, and there are a couple of things you need to use in order to prove the result above. What you want to prove is: if G is a class of functions and G-bar is the class obtained by taking all possible convex combinations of functions in G, then the Rademacher complexity is actually the same for both of these classes; it does not change by taking convex combinations. Yes, the number of parameters? Right: there is no dependence on the number of parameters here. It is hidden, again, through the norms: in general, if you use lots of hidden units, the l1 norms of the weight vectors your network uses are likely to grow. But there is no explicit dependence on the number of units or the number of parameters anywhere in this bound. And yes, regarding the bound on the Lipschitz constant and the bound on the norms: you really need only one of them; you can always set one of them to 1. If you wanted an activation to be 3-Lipschitz, say, you could instead allow the rows of W to have norm 3W and achieve the same thing by rescaling. So taking the Lipschitz bound to be 1 is more or less without loss of generality, as long as you accept that the W will change because of that kind of constraint.

So that is the Rademacher complexity of these kinds of neural networks; there are many papers on this, and we have put links to a few of them in the bibliography, which should be available.

Let me now say a little bit about the more classical theory of neural networks. This moves away from the kind of generalization bounds we have focused on, towards trying to understand representation. What is it that got people excited about neural networks at various times in the last fifty or so years? One of those phases was dominated by universality results. There were many very similar ones that came out at roughly the same time; this particular version is due to Cybenko, and this is a slightly simplified version of it. It says: look at a network with just one hidden layer (no multiple layers needed, just a single hidden layer) and the logistic sigmoid activation function; the result is actually a lot more general, and you can use other kinds of activation functions as long as they satisfy some basic properties. Consider the set of functions that can be represented by linear combinations of these units, a sigmoid applied to an affine function, where the number of terms can be an arbitrary integer with no bound. Then the functions of this kind are dense in the set of all continuous functions on the unit cube in R^n.
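Stated compactly (this is my reading of the slide; N is an arbitrary finite integer), the class in question is

$$G(x) \;=\; \sum_{j=1}^{N} \alpha_j \,\sigma\!\left(w_j \cdot x + b_j\right), \qquad \alpha_j, b_j \in \mathbb{R},\; w_j \in \mathbb{R}^n,$$

and the theorem says that such functions are dense in $C([0,1]^n)$ with respect to the uniform norm.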
Or, more generally, they are dense in the continuous functions on any compact set. So in this sense it is a universality result: you can approximate all continuous functions on the unit cube in R^n. It is like the classical approximation results for polynomials or trigonometric functions going back to Weierstrass, and this is part of what got people excited: in principle, neural networks can represent anything we might care to learn. But this also means that generalization could in principle be an issue, and indeed nobody really uses neural networks with one hidden layer in practice. There is no explicit bound on the number of units n here; in fact it is known that you require exponentially many units, at least exponential in the dimension, for this kind of result to hold. You cannot represent all continuous functions arbitrarily well if the number of units is not allowed to grow at least exponentially in the dimension: if you want an epsilon-approximation to an arbitrary continuous function, you need something like (1/epsilon)^n units. So you know that you are going to require a very large number of units if you use one hidden layer.

These results are very interesting from the representation point of view, but it is less clear what they really tell us about two things. First, training algorithms: it is not that one-hidden-layer networks are any easier to train; in fact the practical viewpoint seems to be that deep neural networks are much easier to train than shallow ones, and there seem to be some reasons for that. Second, it is also not clear that this is significantly better from the point of view of generalization; again, there seems to be very little evidence from practice that very deep neural networks overfit a lot. So these are all very interesting results, but so far it is unclear to what extent they actually explain what is going on in practice, and I am not going to tell you much about that, because I do not understand it very much either.

So we have these universality results, and there has been some more recent work trying to understand why it is that people use deep neural networks and not shallow ones. One possible explanation is that perhaps a reasonably large class of functions we care about learning can be expressed much better using deep networks than using shallow networks, and this is the line of work that has been followed by a few people. In particular I want to point out two results; there are a few more, and these are not necessarily my favourite results or anything, I just wanted to pick one or two. The first result, by Eldan and Shamir, concerns a function related to radially symmetric functions on R^n, functions that depend only on the distance from the origin. There are specific functions of this kind that can be approximated well by depth-3 neural networks (depth 3 here means two hidden layers; usually the depth is the number of hidden layers plus one) using polynomially many units in the dimension, but that require exponentially many units if you use a depth-2 network. So, while we know depth-2 networks are universal if we
So, at least we know that depth-2 networks are universal if we don't constrain the number of units; but there are functions which have a fairly small representation using depth-3 networks and require an exponentially large representation if you are restricted to depth-2 networks. There are similar kinds of results; this first one, by Eldan and Shamir, applies to almost all kinds of activation functions used in practice. The result by Telgarsky is slightly — well, it's incomparable, in the sense that it shows something for depths beyond just 2 and 3: it shows that for any k there is some kind of a separation. But the separation is also not as stark as k versus k+1: it shows that there are functions that can be represented well by a depth-k³ neural network but not by a depth-k neural network. So there is a large gap between the two depths it compares, whereas the Eldan–Shamir result has a gap of only one, but for the concrete depths 2 and 3; Telgarsky's holds for any k.

The proofs of these kinds of results are somewhat involved, but the key idea in many of them is to look at oscillating functions. These are very hard to express using shallow networks, but somehow get easier with deep networks: composition is very helpful if you want to create more and more oscillations, whereas linear combinations are not. That is the key idea that goes into proving these kinds of results. So highly oscillating functions are the kinds of functions that are hard to represent with shallow networks, whereas some deep networks can represent functions that look like this. That's the general flavour of these results; a small numerical sketch of the composition idea follows.
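Here is that sketch — my own toy code, in the spirit of Telgarsky's construction (the grid size and the depths shown are illustrative). The "tent" map is exactly a two-ReLU, one-hidden-layer network, and composing it — stacking layers — doubles the number of oscillations with each layer, while a one-hidden-layer ReLU network needs roughly one unit per linear piece:

```python
import numpy as np

# Oscillations via composition: the idea behind depth-separation proofs.
# tent(x) = 2*relu(x) - 4*relu(x - 1/2) equals 2x on [0, 1/2] and 2 - 2x on
# [1/2, 1]; composing it k times gives a sawtooth with 2^(k-1) peaks, so the
# number of oscillations grows exponentially with depth.

def relu(z):
    return np.maximum(z, 0.0)

def tent(x):
    return 2.0 * relu(x) - 4.0 * relu(x - 0.5)

# 2**10 + 1 grid points, so all dyadic peak locations j/2^k land exactly
# on grid points and the peak count below is exact.
x = np.linspace(0.0, 1.0, 2**10 + 1)
y = x
for depth in range(1, 6):
    y = tent(y)                             # one more layer of composition
    slopes = np.sign(np.diff(y))
    peaks = int(np.sum((slopes[:-1] > 0) & (slopes[1:] < 0)))
    print(f"depth {depth}: {peaks} peaks")  # prints 1, 2, 4, 8, 16
```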
OK, so I want to move on now to algorithmic stability, and for the next 15 minutes I want to talk about a different way of proving generalization bounds, which really originated with Bousquet and Elisseeff. So far, the results we've seen are of the following kind — either the ones using VC dimension that you may have seen this morning, or the ones using Rademacher complexity: you take some class of functions, you have some complexity measure on that class, and you possibly need some conditions on the kind of data you have — maybe some Lipschitz conditions and so on — and then, under those conditions, with high probability, for every f in the class, the actual risk and the empirical risk are close to each other. All of these results have been of this flavour. These are uniform convergence bounds: they hold for all f in the class. In some sense that's nice, because they are very powerful; but one might also ask whether this is too powerful — maybe we can get away with different kinds of analyses that are not as powerful but are sufficient to prove generalization for the kinds of algorithms we actually use. That's the idea we want to develop. These uniform convergence results depend only on some kind of complexity or capacity measure of the class of functions used by the learning algorithm, and not on the learning algorithm itself: it doesn't matter what the learning algorithm is — as long as you promise me that it is going to pick some function from this class, these bounds hold. But we might want to look specifically at certain kinds of learning algorithms, and see whether we can analyze the algorithms directly to say something about generalization, rather than only looking at classes of functions.

OK, so algorithmic stability is a notion that tries to make this precise. Let S be a sample (x₁, y₁), …, (x_m, y_m) drawn from some distribution over X × Y, and let S′ be a sample that is exactly the same as S except that it differs in exactly one point: instead of the point (x_m, y_m) it has the point (x_m′, y_m′). We are looking at learning algorithms, possibly randomized, that take a sample and output some function, and we have some cost function γ. Using this, we define the notion of uniform stability, which is the following — let's parse it carefully. A learning algorithm is uniformly β-stable if, for any two samples S and S′ of size m which differ in exactly one point, and for any point (x, y) in X × Y — any point, not necessarily one in the sample, and it doesn't have to be a classification problem — the difference between the cost of the function returned by the algorithm on S and the cost of the function returned on S′, evaluated at that point, is at most β.

Let's try to understand what this definition is saying before we talk about any results. Why might something like this be useful for generalization? It's a kind of closeness condition: if you have two data sets that don't differ by much, then the outputs the algorithm returns are also not too different. And what do these data sets differ on? One point. So this is also a way of saying that your output function cannot depend too much on any one specific point — and that's exactly what we want from algorithms that won't overfit: you don't want the output of your algorithm to depend too much on one point, or a small number of points, of your training set; somehow you want it to make use of the full training set in some meaningful way. That's the kind of thing coming out of this definition.

What do we want out of β? We will typically want β to be small — something smaller than 1/√m. So, with this definition in hand, let me state a theorem, due to Bousquet and Elisseeff, which I'm not going to prove. Suppose γ is a bounded cost function — the cost can never be larger than some M — and the algorithm A is uniformly β-stable. If you take a sample of size m from the distribution, then for any δ > 0, with probability at least 1 − δ, the risk of the output of this algorithm is bounded by the empirical risk, plus β, plus a deviation term involving mβ and M (written out below). What's important to note here is that there is no "for all f in F": it's really just talking about the output of this particular algorithm.
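For reference, here are the definition and the theorem written out. This is the standard statement from Bousquet and Elisseeff's "Stability and Generalization"; the exact constants differ slightly between presentations, so treat them as indicative:

```latex
% Uniform stability: A is uniformly \beta-stable if, for all samples S, S'
% of size m differing in exactly one point, and for every z = (x, y),
\[
  \bigl|\, \gamma\bigl(A_S, z\bigr) - \gamma\bigl(A_{S'}, z\bigr) \,\bigr|
  \;\le\; \beta .
\]
% Generalization from stability: if 0 \le \gamma \le M and A is uniformly
% \beta-stable, then for any \delta > 0, with probability at least 1 - \delta,
\[
  R\bigl(A_S\bigr) \;\le\; \widehat{R}_S\bigl(A_S\bigr) \;+\; 2\beta
  \;+\; \bigl(4 m \beta + M\bigr)\sqrt{\frac{\ln(1/\delta)}{2m}} ,
\]
% which is non-trivial only when \beta = o(1/\sqrt{m}).
```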
It directly bounds the risk of that output based on a property of the algorithm. So the algorithm could output what might seem to us like a very complicated thing — some arbitrary polynomial-size circuit or something — but as long as the algorithm has this stability property, you can still get a result like this. The βm factor here, together with the √m in the last term, also shows that β should be smaller than 1/√m; otherwise you get nothing non-trivial out of this kind of bound.

[In answer to a question:] I'm not exactly sure what you mean by robustness here — it's not just that it makes the same decision, but it is saying that, in a certain metric, your output is not going to be too different. So in that sense, yes, it is like a robustness condition; but we can talk about this afterwards if you like.

Right, so we want β to be smaller than 1/√m, which also means that you cannot use this for the 0–1 classification loss at all. Why? Look at the definition: for the 0–1 loss, γ is either 0 or 1 — two classifiers either make the same prediction at a point or they don't. And since the condition has to hold for any two samples that differ in exactly one point, the only way to achieve β-stability for any β strictly less than 1 is for the algorithm to always output the same fixed classifier regardless of the sample — so it's not really a learning algorithm at all. So you can't apply this directly to the 0–1 loss; but even for classification problems, if you use different loss functions — things like the hinge loss, as in SVMs — you may be able to apply these kinds of results.

OK, so I'm not going to prove this theorem; it's there in one of the referenced textbooks and also in the original paper by Bousquet and Elisseeff, and both are fairly easy to read. Instead, I want to show how one can treat ridge regression through this lens, and then I'll say a few more words and stop.

There is one last notion I want to introduce, that of a σ-admissible cost function. (There are a couple of things missing on that slide — a missing γ and a free parenthesis — so let me write out what it should say: |γ(f′, (x, y)) − γ(f, (x, y))| ≤ σ·|f′(x) − f(x)|.) So σ-admissibility is really a Lipschitz condition on γ: the difference between the cost of f′ and the cost of f at a point can't be larger than σ times the difference between f′(x) and f(x).

So we want to use this for ridge regression. What is ridge regression? Let's go through this again. We've seen what linear regression is, and we looked at it as a constrained optimization problem: minimize the squared error subject to the constraint that the weight vector w lies in a ball of radius W. What's more common in practice is not this kind of constrained optimization problem but an unconstrained one with a regularization term: you have the squared loss, as before, plus λ times the squared norm of w as a penalty term, and what you are trying to find is a minimizer of this objective. (Both formulas are written out below.)
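Written out — my reconstruction of the two formulas just described, hedging only the notation:

```latex
% \sigma-admissibility: a Lipschitz condition on the cost in its function argument.
\[
  \bigl|\, \gamma\bigl(f', (x, y)\bigr) - \gamma\bigl(f, (x, y)\bigr) \,\bigr|
  \;\le\; \sigma\, \bigl|\, f'(x) - f(x) \,\bigr|
  \qquad \text{for all } f, f' \text{ and all } (x, y).
\]
% Ridge regression: the penalized (unconstrained) form of linear regression.
\[
  \widehat{w} \;=\; \arg\min_{w}\;
  \frac{1}{m} \sum_{i=1}^{m} \bigl( w^{\top} x_i - y_i \bigr)^2
  \;+\; \lambda\, \lVert w \rVert_2^2 .
\]
```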
So there are a couple of things we want to note. If the norm of x is bounded and |y| is bounded — by M, say — then for any minimizer of this loss you can bound the squared norm of the minimizer. Why is this? It's not that hard to see: think of what happens when you set w = 0. If you set w = 0, you are predicting 0 for every single point; since |y| is at most M, you pay at most M² overall in the first term, and you pay nothing in the second term, because w is zero. So any minimizer ŵ must do at least as well, and in particular must satisfy λ‖ŵ‖² ≤ M². So once you fix λ, you do get a bound on the minimizer you are going to find.

OK, so that's good — but why do we want that? Well, if we have a bound on the minimizer, then the predictions ŵ·x cannot be very large, so |ŵ·x − y| can be bounded. And that's good because we are using the squared loss, and as long as we are looking at the quadratic function on a bounded interval, it always has a fixed Lipschitz constant that we can use — it's not Lipschitz on all of R, but on a bounded interval, that's fine — so the loss is σ-admissible. Really, the key point is to get a bound on the minimizer. Then we can use a result, again by Bousquet and Elisseeff: if γ happens to be convex — we need γ to be convex for this result — and σ-admissible, then ridge regression is in fact β-stable. In this case γ is indeed convex, with β being something like σ²X²/(λm); the important thing is the m in the denominator. So you actually get very strong stability: we know anything smaller than 1/√m is good enough, but here we get about as good as possible, something like 1/m.

And this is rather different from the kinds of results we saw in the first part: they are not saying anything about function classes. What we are really saying is that this algorithm finds a minimizer of this objective, and the minimizer has certain properties. We are not making a statement about all linear functions — not every w will be β-stable — but the minimizer will be.

I also want to point out a recent paper by Hardt, Recht, and Singer, which uses this idea of algorithmic stability and shows that you can analyze stochastic gradient descent directly. You have to change the notion of uniform stability a little, because it's a randomized algorithm, so a few things hold in expectation rather than always; but basically, using the same idea of algorithmic stability, they have shown that stochastic gradient descent with early stopping is uniformly stable. So this gives you some kind of generalization bound if you use stochastic gradient descent to do your optimization. (A quick numerical illustration of the 1/m stability of ridge regression follows.)
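Here is that illustration — a sketch of my own (the data distribution, the choice of λ, and the finite test grid standing in for the supremum over all points are all illustrative assumptions): fit ridge regression on a sample S and on an S′ that differs in exactly one point, and watch the largest prediction gap shrink roughly like 1/m.

```python
import numpy as np

# Numerical check of uniform stability for ridge regression: the minimizer of
# (1/m) * sum_i (w.x_i - y_i)^2 + lam * ||w||^2 has the closed form
# w = (X^T X / m + lam * I)^{-1} X^T y / m, and changing one training point
# perturbs it by O(1/m).

rng = np.random.default_rng(1)

def ridge_fit(X, y, lam):
    m, d = X.shape
    return np.linalg.solve(X.T @ X / m + lam * np.eye(d), X.T @ y / m)

d, lam = 5, 0.1
X_test = rng.normal(size=(1000, d))                   # finite stand-in for "all points"
for m in [100, 1000, 10000]:
    X = rng.normal(size=(m, d))
    y = rng.normal(size=m)
    Xp, yp = X.copy(), y.copy()
    Xp[0], yp[0] = rng.normal(size=d), rng.normal()   # S' differs in exactly one point
    w = ridge_fit(X, y, lam)
    wp = ridge_fit(Xp, yp, lam)
    gap = np.abs(X_test @ (w - wp)).max()             # largest prediction difference
    print(f"m = {m:6d}   max prediction gap = {gap:.2e}")
```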
So it's the same kind of idea. And there is one view that says: if we do want to understand why neural networks don't overfit that badly, we may have to look at the algorithms we are using to train them, rather than just at their capacity — and this kind of result is definitely in that vein.

OK, so, to summarize: in the first part we saw uniform convergence bounds as a tool to bound the generalization error — the difference between the empirical risk and the true risk, and the other way around — using Rademacher complexity bounds, and we applied these bounds to a variety of machine learning methods: linear regression, generalized linear models, support vector machines. Then we said a few things about neural networks, which are very much an active area of research; we don't have definitive answers to many of the questions surrounding neural networks, certainly not from a theoretical point of view. And at the very end we talked a little about algorithmic stability, as a different way of bounding the generalization error, one that does not go through uniform convergence. So, thanks very much.
Info
Channel: Federated Logic Conference FLoC 2018
Views: 682
Rating: 4.4285712 out of 5
Id: i80Qms4ziPI
Length: 83min 22sec (5002 seconds)
Published: Mon Jul 30 2018