Varun Kanade: Statistical Learning Theory I

Captions
So what am I going to try to cover in the roughly three hours that I have? I'll introduce the statistical learning theory framework, and I should add that I find the distinction between computational learning theory and statistical learning theory somewhat artificial; really they should all be seen as one topic. So we will make connections with lots of things that you probably saw earlier this morning. In particular I'll talk about capacity measures of classes of functions. VC dimension is one of them, but I'll focus mainly on Rademacher complexity, which gives a different way of bounding the error your classifier achieves, not on the training sample you actually observe, but on data that you haven't seen before. This is done through uniform convergence results that hold if you can bound the Rademacher complexity of certain classes of functions. This is not really going to be much about machine learning per se, which I'm guessing doesn't disappoint most people in this room, but I will talk about a few machine learning methods now and then, just to put everything in context and give a sense of why we're talking about these notions in the first place. Towards the end, if I have time, I'll talk about a different way of proving generalization bounds, through algorithmic stability rather than capacity measures of classes of functions.

I should add that machine learning is a subject full of jargon. I've tried to keep it low, but I might say something that seems obvious to me and makes no sense if you haven't seen it before, so please interrupt me and ask; essentially, just stop me while I'm talking.

Okay, so here's the outline of what we'll cover. Let me start straight away with the framework we're going to look at; this should seem familiar from what you saw this morning. We'll think of an instance space X. This is where all our data lives, at least the input part of the data, so images, documents, whatever you want to think of your data as, and we'll take it to be a subset of R^n for some n. There are also labels associated with the data, and in that sense this is supervised learning: we get to see the inputs as well as the targets. We'll denote the set of target values by Y, and two special cases cover most of what we need: Y = {-1, 1}, which is binary classification (it doesn't really matter whether you use {0, 1} or {-1, 1}; it's a bit more convenient to do the mathematics with {-1, 1}), and Y = R, which we think of as regression problems.

I'm going to think of the data as generated according to some unknown joint distribution D over X × Y. Sometimes it will be convenient to factorize this joint distribution as a marginal distribution on X and the conditional distribution of Y given X. Beyond that, we make no assumptions on how Y and X relate to each other: there is some joint distribution on this space of inputs and labels, and we get data from that distribution. This is what's sometimes called the agnostic setting. The references on the slide are to the papers that introduced these settings;
the original one really goes back to Vapnik and Chervonenkis in the late 60s and early 70s. So no specific functional relationship is assumed between the observed targets and the inputs, and we'll have to be a bit precise about how we even define the meaning of learning in this kind of situation.

Okay, so here's the setting: how can we fit something to this data? The classical approach, say pre-1960s or 1970s, has been: you get some data and you try to fit it with some fairly general approximation process. You can think of fitting polynomials or trigonometric polynomials; these are universal in the sense that they can approximate any continuous function under the right conditions. This kind of estimation, if done carefully, works fairly well for univariate data, or maybe when the data has a small number of dimensions, but these approaches break down completely for high-dimensional data. That's because the space of functions just explodes: even though you might fit the data you have perfectly, maybe even with polynomials that aren't that complex, the fit will tell you nothing about data you've not yet seen. This is what's sometimes called the curse of dimensionality. It's an overused term, and to different people it means slightly different things, but the general idea is that things get complicated in high dimensions, and what worked in nice one-dimensional settings no longer works.

In machine learning this is called overfitting: you fit some function to the data you actually have, but it doesn't help you make predictions (or makes very bad predictions) on data you haven't seen yet; you have overfit to the observations. To avoid this problem, we're not going to try to fit arbitrary functions; instead, we restrict the kinds of functions our algorithm is allowed to choose from. We'll think of function classes that have some fixed notion of "complexity" or "capacity". I've put this in quotes because it's not yet a precise notion, but we will make precise what complexity or capacity means when we actually look at learning algorithms. That will allow us to make statements of the form: the function you fit is not going to perform much worse on data you haven't seen than on the data you have seen. This difference is what's called the generalization error, and this talk is mainly going to focus on bounding this quantity. Note that we're not directly asking to minimize the training error: if your function class is very simple you might have a large training error, but at least what you see on the data you have will be exactly what you should expect on data you haven't yet seen, as long as it comes from the same distribution. The focus of this theory is really understanding how to bound the generalization error.

Let me make a few asides to connect to the sort of machine learning some of you might have seen before. What is
more common in classical statistics and machine learning is not this agnostic framework, but making explicit probabilistic models of the data. You either model the distribution of the inputs themselves together with the conditional distributions, or sometimes you only model the conditional distributions. There are names for this. Generative modeling is when you try to model the entire joint distribution over both inputs and outputs; this gives rise to methods like Gaussian discriminant analysis, naive Bayes, and many other models. Discriminative modeling, which is a bit closer to what we are going to do (though we won't make any probabilistic assumptions at all), only tries to model the conditional distribution of the target given the input. There's no assumption about where the input comes from, but conditioned on the input you make an assumption about how the labels are distributed. The classic example is linear regression, where one way (among many) to model it is to say that the observed value, conditioned on the model parameters and the input, is a linear function plus Gaussian noise; that's the standard modeling assumption. Or for classification problems you say: take a linear function, pass it through a sigmoid to convert it to a probability, and the observed label is a Bernoulli draw with that parameter. This is how machine learning and statistics have classically made models for the data.

This morning you probably saw the PAC framework in the so-called realizable setting, where an explicit functional dependency is assumed: the target is exactly a function of the input, and that function is restricted to come from a certain class. In that case it's the VC dimension of the class that controls learnability from data, and you probably saw a proof of that result this morning. So those are the connections to things you might have already seen.

Okay, so let's go back to the more general framework. I'm going to consider F to be some class of functions from X to Y, and the learning algorithm is always going to output some function from this class F. Since we're no longer restricting the targets Y to be boolean, we have to define cost functions suitably. I'll say a cost function is simply a function c: Y × Y → R₊, mapping into the non-negative reals. Again we can look at familiar examples: for a binary classification problem we can simply define the cost to be the indicator function, so if the labels match there's no cost and otherwise the cost is one. In regression you can use something like c(y′, y) = |y′ − y|^p; for p = 2 that's the familiar squared loss, but you can use other values of p as well, and there are many other cost functions you can construct. Indeed there's a staggering variety of loss functions used for various purposes. The loss of a function f
in this class F is just what you get by applying f and looking at the cost: f(x) is the prediction made by the function, and the loss on a pair (x, y) is c(f(x), y). The main notion we want to look at is called the risk, which is just the expected loss of the function on data drawn from the distribution: R(f) = E_{(x,y)~D}[c(f(x), y)]. All of this data is coming from some distribution, and we're simply looking at the expected loss if we use this function to make predictions under that distribution. That's what we'll call the risk; I'm not exactly sure where the word risk comes from, but that's what it's commonly called.

Of course we'd like to find a function that minimizes the risk. The trouble is that even calculating the risk is basically impossible in most interesting problems: you have some high-dimensional distribution, you don't have access to it, and it can be arbitrarily nasty, since we're making no assumptions about it. So minimizing it is of course very hard, but even calculating it is hard. Essentially the only access we have to the distribution is through samples drawn from it, and this is what we're going to focus on. Throughout this talk I'll think of a sample S of size m drawn iid from this distribution D, so the points are all independent and all drawn from the same distribution. This is somewhat of a simplifying assumption relative to a lot of data in practice, but it's what we'll use to develop the theory. A learning algorithm is simply a map that takes any finite sample from X × Y and outputs a function in F; it could be randomized. The guarantee we'd like is: with high probability over the sample drawn from the distribution, if f̂ is the function the algorithm outputs, then the risk of f̂ is close to the best possible risk, R(f̂) ≤ min_{f∈F} R(f) + ε. These are the kinds of guarantees we would like to make, and the talk will focus on tools that allow us to prove such results for certain classes.

So what's the first thing we can think of? Our only access to the data is through a sample, so we can look at the risk on the sample, defined simply as the average cost on the sample when you make predictions using f: R̂(f) = (1/m) Σᵢ c(f(xᵢ), yᵢ). This is what we'll call the empirical risk, and the idea is simply to find a function that minimizes it. This is what's called the ERM principle, the empirical risk minimization principle. Let me briefly make a couple of points. We're mainly going to focus on statistical questions around empirical risk minimization: if we could do empirical risk minimization, would the output be a good function in the class? Computationally, in fact, for almost any problem of interest, ERM is very hard; it's NP-hard in most instances.
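As a concrete illustration of the ERM principle (my own sketch, not anything from the slides), here is a minimal Python example: it computes the empirical risk under the 0-1 cost and picks the minimizer over a small finite class of threshold functions. The data and the class are made up for illustration.

```python
import numpy as np

def empirical_risk(f, xs, ys):
    """Average 0-1 cost of f on the sample: (1/m) * sum of 1[f(x) != y]."""
    return np.mean([f(x) != y for x, y in zip(xs, ys)])

def erm(function_class, xs, ys):
    """Return the function in the (finite) class with smallest empirical risk."""
    return min(function_class, key=lambda f: empirical_risk(f, xs, ys))

# Toy example: 1-d inputs, labels in {-1, +1}, class of threshold functions.
rng = np.random.default_rng(0)
xs = rng.uniform(-1, 1, size=100)
ys = np.where(xs > 0.3, 1, -1)                      # unknown "true" labelling
thresholds = np.linspace(-1, 1, 41)
F = [lambda x, t=t: 1 if x > t else -1 for t in thresholds]

f_hat = erm(F, xs, ys)
print("empirical risk of ERM output:", empirical_risk(f_hat, xs, ys))
```

For a finite class like this, ERM is just exhaustive search; the hardness discussed next kicks in for the structured, infinite classes one actually cares about.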
For example, suppose I give you data like this: it's a binary classification problem, and I ask you to find a linear separator that minimizes the number of disagreements. Can you do it? Maybe on this particular data, but in general it's NP-hard, so we can't really solve this problem. On the other hand, if there's a promise that the data is linearly separable, that there is a linear separator putting all the positives on one side and all the negatives on the other (which is not at all guaranteed by the assumptions made so far), then the problem is actually easy. Does anyone see how to do it? In general it's just linear programming: you find a vector that puts every point on the correct side, so you can write a linear program and solve it; see the sketch below. So under this promise the problem is easy, but without such a promise, minimizing the empirical risk is NP-hard, and in fact for most problems we'd like to solve, minimizing the empirical risk is NP-hard. So this talk is really about understanding the statistics of ERM, not the computational aspects. [In response to a question:] Yes, the promise has to be zero error; even finding a separator with at most an ε fraction of errors is already hard.

With all that said, let's focus on empirical risk minimization for a bit: how do we guarantee that the actual risk is close to optimal if we find something with the smallest possible risk on the sample we have?
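Here is a minimal sketch of the linear-programming trick for the realizable case, using scipy (my own illustration; the strict inequalities y_i(w·x_i) > 0 are replaced by y_i(w·x_i) ≥ 1, which is equivalent up to rescaling w, and the objective is irrelevant since we only need feasibility):

```python
import numpy as np
from scipy.optimize import linprog

def find_separator(X, y):
    """Find w with y_i * (w . x_i) >= 1 for all i via an LP feasibility problem.

    linprog solves min c.x subject to A_ub @ x <= b_ub; we encode
    -y_i * (w . x_i) <= -1 and use a zero objective (pure feasibility).
    """
    m, n = X.shape
    A_ub = -y[:, None] * X                       # row i is -y_i * x_i
    b_ub = -np.ones(m)
    res = linprog(c=np.zeros(n), A_ub=A_ub, b_ub=b_ub,
                  bounds=[(None, None)] * n)     # w unconstrained in sign
    return res.x if res.success else None        # None if data is not separable

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 2))
w_true = np.array([2.0, -1.0])
y = np.sign(X @ w_true)                          # separable by construction
print("recovered separator:", find_separator(X, y))
```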
Let's focus on classification for a second, since you've already seen this morning how to do this. Suppose we're looking at a binary classification problem, the class of functions has finite VC dimension, and we use the indicator cost; for binary classification there's really only one meaningful cost function. There's a result along the lines of what you saw, though slightly different, because you probably only saw the version where the data is guaranteed to be generated by some function in the class; this one holds even when it is not. It's a uniform convergence result in terms of the VC dimension d: if we have a sample of size m drawn iid from the distribution, then for any δ > 0, with probability at least 1 − δ, for every function f in the class, the true risk (the one without the hat) is bounded by the risk on the sample plus terms of roughly the form

R(f) ≤ R̂(f) + O(√(d log(m/d) / m)) + O(√(log(1/δ) / m)).

The thing to notice is that m is in the denominator, with everything under square roots, so as your sample size increases, and in particular once m is larger than the VC dimension, you get non-trivial bounds: the true risk is guaranteed to be close to the empirical risk, and this holds for every function in the class. This allows you to prove that empirical risk minimization is a good strategy. How do you finish the argument? Suppose f* is the minimizer of the true risk ("minimizer" in quotes, because in general a minimizer need not exist and we can only get arbitrarily close to the infimum, but let's say there is one), and f̂ is the minimizer of the empirical risk. Then we can write a couple of lines: the true risk of the algorithm's output is close to its empirical risk, by the theorem; on the sample, the algorithm's output is at least as good as f*, because the algorithm chose the empirical minimizer; and the empirical risk of f* is close to its true risk, by the flipped version of the theorem. I stated a one-sided version, but the theorem actually bounds the absolute difference between true and empirical risk, so we can flip it around. Chaining these, if each gap is at most ε,

R(f̂) ≤ R̂(f̂) + ε ≤ R̂(f*) + ε ≤ R(f*) + 2ε,

so the risk of the function output by the algorithm is close to the best possible risk in the class. This argument is what we'll use again and again: once we have a bound showing that the true risk and the empirical risk are uniformly close, we know that doing empirical risk minimization is enough.

Okay, let me move on to a slightly different question. I said "let F be a class of functions", and you didn't exactly challenge me on what F is and why one should use it, but it's a valid question: what F should we use? There's a trade-off, which you can see in this kind of result. We want the empirical risk to be small: if our function has a large error on the training set, then even with a small generalization gap it's not very useful, because it has high error regardless. So we do want to minimize the error on the training set, but not at the cost of a very large complexity term in the bound. That's the trade-off: with a very large class of functions the terms on the right get large, while with a very small class it may not be possible to get a small enough empirical risk, and learning is not very useful. That's really the trade-off between the complexity of the functions you fit, the generalization ability, and the ability to actually fit the data. In statistics this is sometimes referred to as the bias-variance tradeoff; it's the same kind of thing. A more complex class of functions means we can achieve smaller empirical risk, but at the risk of the difference between the true risk and the empirical risk growing larger.

There is a way around this, in principle, called structural risk minimization. Instead of a single family of functions, use an infinite family of classes indexed by some complexity measure d (think of d as the VC dimension, though it can be other complexity measures as well), and minimize not just the empirical risk but the empirical risk plus an additional penalty term that depends on the complexity d of the class the function comes from and on the amount of data m. You can imagine taking the middle term from the bound above, the relevant one because it's the one containing d, as the penalty; a sketch of this idea follows below. If you've seen more applied machine learning, this is reminiscent of what you would typically do: you don't just minimize the empirical risk, you add what's usually called a regularizer, something that penalizes the complexity of the function you're learning. Typically this would be the sum of squares of the weights (the parameters of the model you're fitting), or the sum of their absolute values, or some other such term. It comes from the same principle: you want a term that penalizes complexity, not just the error.
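To make the structural risk minimization idea concrete, here is a small sketch under assumptions I'm adding for illustration: the nested classes are polynomials of degree d, and the penalty is a constant times √(d/m). The penalty constant and the classes are made up; the point is only the shape of the objective, empirical risk plus a complexity penalty.

```python
import numpy as np

def empirical_risk(coeffs, xs, ys):
    """Mean squared error of the polynomial with the given coefficients."""
    return np.mean((np.polyval(coeffs, xs) - ys) ** 2)

def srm_select(xs, ys, max_degree=10, penalty_const=1.0):
    """Pick the degree minimizing empirical risk + penalty * sqrt(d / m)."""
    m = len(xs)
    best = None
    for d in range(1, max_degree + 1):
        coeffs = np.polyfit(xs, ys, deg=d)   # ERM within the class of degree d
        score = empirical_risk(coeffs, xs, ys) + penalty_const * np.sqrt(d / m)
        if best is None or score < best[0]:
            best = (score, d, coeffs)
    return best

rng = np.random.default_rng(2)
xs = rng.uniform(-1, 1, 200)
ys = 1.0 - 2.0 * xs + 0.5 * xs**3 + rng.normal(scale=0.1, size=200)
score, degree, _ = srm_select(xs, ys)
print("selected degree:", degree)
```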
[In response to a question:] There's a term there that depends on d and m, the √(d/m) term; in general you'll always have a term in which d reflects the complexity of the model. It means that the best bound you can get on the gap between the true risk and the empirical risk is controlled by that: if d gets very large, you can't say very much about the difference between those risks. And what you really care about is having the true risk be small, not the empirical risk. That's how complexity plays a role there. Any other questions?

Okay, so let me try to make all of this more concrete by looking at an incredibly simple model: linear regression. What is linear regression? We simply say that F, the class of functions we're fitting, is a set of linear functions: w is a vector in R^n coming from some set of vectors, and the linear function simply maps x to w·x. You can have an extra constant term, but it changes almost nothing, so I won't carry it around. We'll use the squared loss as the error: if the model predicts y′ and the actual target is y, the cost is (y′ − y)². And D is some arbitrary joint distribution over X × Y. I'm going to define g(x) to be E[Y | X = x] (the x there should be bold, sorry). This need not be a linear function by itself, but let's define it, and now take any other function h: X → R and try to understand what the risk of h is; so far nothing assumes g or h is linear. The risk of h is the quantity E[(h(x) − y)²], where (x, y) is drawn from the distribution. We can rewrite it by adding and subtracting g(x) inside the square; the extra cross term is actually zero, because g(x) is defined to be E[Y | X = x], so taking the expectation with respect to y in the cross term gives 0. What you're left with is: for any function h,

R(h) = E[(h(x) − g(x))²] + R(g).

The first term is always non-negative, because it's the expectation of a square. So the best possible thing you can do, if you want to minimize the squared loss, is simply to use the function g(x) = E[Y | X = x]; you cannot do better than that on arbitrary data.
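Written out, the decomposition we just used is the following standard computation, restated here for reference:

```latex
\begin{aligned}
R(h) &= \mathbb{E}\big[(h(X) - Y)^2\big] \\
     &= \mathbb{E}\big[(h(X) - g(X))^2\big]
        + 2\,\mathbb{E}\big[(h(X) - g(X))(g(X) - Y)\big]
        + \mathbb{E}\big[(g(X) - Y)^2\big] \\
     &= \mathbb{E}\big[(h(X) - g(X))^2\big] + R(g),
\end{aligned}
```

where the cross term vanishes by the tower property: conditioning on X first, the factor g(X) − E[Y | X] is identically zero.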
If g were indeed in F, that is, if g itself were a linear function, this is what we would refer to as a realizable setting: we don't have to say that y is exactly a linear function of x, but as long as E[Y | X = x] is a linear function we'd call it realizable. We're not really going to make that assumption; I just wanted to make the connection between the two. In some sense R(g) is the best possible risk we can hope for, and it's just down to the noise: y can be noisy, and the risk of g simply represents the noise level, which we absolutely cannot beat.

Again, a quick aside to make a few connections to machine learning. The standard discriminative model for linear regression is to model y, given w and x, as the linear function w·x plus Gaussian noise. What's commonly done is to look at the likelihood of observing the data under this model: if you assume the data really does come from a linear model of this kind, then for any parameters w and inputs x₁, …, x_m you can write down something proportional to the density of observing y₁, …, y_m under the model. In this case it's easy, because you just write the density of a Gaussian distribution, and looking at the log of this function, the log-likelihood, is slightly easier: you get a first term that doesn't depend on w at all, minus a sum of squares, which is the thing we actually care about. So there's this connection: if you make a probabilistic model of this kind, where the target is generated as a linear function plus Gaussian noise, then the maximum likelihood estimator of the model parameters is exactly the one that minimizes the squared loss. This is one of the reasons squared error is used: it's common to assume the noise is Gaussian. This is the method of least squares, and it goes back at least 200 years, I think to Laplace or Gauss, so it's a very old machine learning model indeed.
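Concretely, with noise variance σ² (notation mine, not to be confused with the Rademacher variables later), the log-likelihood is

```latex
\log p(y_1,\dots,y_m \mid w, x_1,\dots,x_m)
  = -\frac{m}{2}\log(2\pi\sigma^2)
    - \frac{1}{2\sigma^2}\sum_{i=1}^{m} (y_i - w \cdot x_i)^2 ,
```

so maximizing over w is the same as minimizing the sum of squares Σᵢ (yᵢ − w·xᵢ)².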
So we want to solve this problem; let's go back to our statistical learning theory framework. We think of a family of linear functions, where the w's form a set of vectors with norm bounded by some capital W, and we write down the empirical risk minimizer for this particular loss. The question is how to argue about the generalization properties of this algorithm, and that's the machinery we're going to develop, because now we're outside the comfort of the boolean world: we have to use a different capacity measure, since VC dimension will not work. I'll focus on Rademacher complexity, but I want to note that there are any number of capacity measures in use, and several of them could be used here, things like the pseudo-dimension, the fat-shattering dimension, and so on. We'll also require some boundedness assumptions on both the data and the functions we're trying to fit.

So I'm going to talk about Rademacher complexity; it's a good point to pause a bit and see if there are any questions. [Asked which formula the earlier bound used:] That one is only for boolean functions: the d there is the VC dimension of the class, which you can only define for boolean functions. Now our functions are no longer boolean, so we want a similar formula with the VC dimension replaced by something else, which is what we're going to do. [Asked whether the new bound will depend on the loss:] It will, in this case. For binary classification there's essentially only one meaningful loss function, your loss is either 0 or 1, which is why the earlier bound doesn't depend on it; once we look at real-valued functions there are lots of losses, and the bound will indeed depend on the loss.

Okay, so let me define Rademacher complexity. I want to state it slightly abstractly, because we will apply this notion to several different classes of functions: not just the class we're trying to learn, but also, precisely, the loss function composed with that class. That's why I want to keep it general. Let Z be some set, and let G be a class of functions from Z to an interval [a, b] on the real line. Let S be a fixed sample, a subset (even a multiset, so you can have repetitions) of Z of size m; so far I'm not bringing in any distribution, though soon we will. We define the empirical Rademacher complexity of G with respect to S as follows; there's a lot of text here, so let's parse it slowly. Introduce the quantities σ = (σ₁, …, σ_m), where each σᵢ is a {−1, +1}-valued random variable, all independent, taking the values −1 and +1 with equal probability; these are called Rademacher random variables. The empirical Rademacher complexity is

R̂_S(G) = E_σ [ sup_{g∈G} (1/m) Σᵢ σᵢ g(zᵢ) ],

the expectation over all possible σ of the supremum over g: you find the best g in the class that correlates with these random ±1 signs.

Let this sink in for a second, and meanwhile let me ask you to think about two questions. Let's go back to our boolean world; for me everything is {−1, 1}, not {0, 1}. Suppose G is a set of boolean functions, taking values in {−1, 1}, and S is a set that's shattered by the class G. What's the empirical Rademacher complexity? Right, almost: the sum inside becomes exactly m, m divided by m is 1, and you're just taking an expectation of ones, so it's exactly 1 (not 2ⁿ). So if S is shattered, the empirical Rademacher complexity is 1, which is as high as it gets: it means the class G is rich enough to fit almost anything. What we're generating with the σ's is essentially random noise, and if you can correlate perfectly with random noise, you have a very, very large class of functions: it can represent random noise. So let's look at the other extreme, still in the boolean world: suppose G
consists of only two functions, the constant +1 function and the constant −1 function. What's the empirical Rademacher complexity then? It's not quite zero. There's a sup over just these two functions, so the answer is roughly 1/√m. Calculating it exactly is a bit tricky, but think of it this way: if you generate the σᵢ randomly, then almost surely the imbalance between the +1's and the −1's is going to be roughly √m, and depending on whether there are more positives or negatives you choose the constant +1 or the constant −1 function; so you always get something like √m / m = 1/√m inside, and taking the expectation you get roughly 1/√m.

So that's the intuition: a very simple class of functions has empirical Rademacher complexity like 1/√m, whereas a rich class has empirical Rademacher complexity 1, and that's the range we'll be working in. Something like 1/√m means it's going to be very easy to generalize, because the class is very simple, whereas a constant like 1 is bad: your function class is so rich it can express noise. That's the motivation for looking at a quantity like this. Does that make sense? This is important, so I want to make sure.

Okay, so that's the notion of empirical Rademacher complexity. The Rademacher complexity itself is then not hard to define: you simply take the expectation of the empirical Rademacher complexity when the set S of size m is drawn iid from some distribution D over Z,

Rad_m(G) = E_{S~D^m} [ R̂_S(G) ].

Note that the Rademacher complexity is not worst-case, in the sense that it uses distributional information. VC dimension looks at every possible set of points and doesn't take the distribution into account, and the same is true of the empirical Rademacher complexity for a fixed set; but for the Rademacher complexity, if your distribution happens to be nice, there might be some sets with very bad empirical Rademacher complexity while the Rademacher complexity is still all right. Incidentally, you can also define a distribution-dependent version of VC dimension; I don't know if you covered that.
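To make the two extremes above concrete, here is a small Monte Carlo sketch (the finite classes and all names are mine, for illustration): it estimates the empirical Rademacher complexity of a finite class given by its value vectors on S, and checks that a class shattering the sample gives ≈ 1 while the two constant functions give something on the order of 1/√m.

```python
import itertools
import numpy as np

def empirical_rademacher(value_vectors, n_trials=2000, rng=None):
    """Monte Carlo estimate of E_sigma[ sup_g (1/m) sum_i sigma_i g(z_i) ].

    value_vectors: array of shape (num_functions, m); each row is
    (g(z_1), ..., g(z_m)) for one function g in the class.
    """
    rng = rng or np.random.default_rng(0)
    V = np.asarray(value_vectors, dtype=float)
    m = V.shape[1]
    total = 0.0
    for _ in range(n_trials):
        sigma = rng.choice([-1.0, 1.0], size=m)
        total += np.max(V @ sigma) / m          # sup over the class
    return total / n_trials

m = 10
# A class that shatters S: every possible +/-1 pattern on the m points.
shattering = np.array(list(itertools.product([-1.0, 1.0], repeat=m)))
# The two constant functions.
constants = np.array([[1.0] * m, [-1.0] * m])

print("shattering class:", empirical_rademacher(shattering))  # ~ 1.0
print("constant class:  ", empirical_rademacher(constants))   # ~ 0.25, order 1/sqrt(m)
```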
Okay, so here's our Rademacher complexity, and we would like to have a theorem, otherwise we wouldn't have defined the concept. Here it is; again there are a lot of words, so let's parse it. G is a class of functions mapping Z to the interval [0, 1] (using [0, 1] instead of [a, b] is just an issue of scaling), and S is a sample of size m drawn iid from some distribution D over Z. Then for every δ > 0, with probability at least 1 − δ, for every g ∈ G,

E_{z~D}[g(z)] ≤ Ê_{z~S}[g(z)] + 2 Rad_m(G) + √( log(1/δ) / (2m) ).

That's exactly the kind of result we want: the expectation of g under the distribution is close to its empirical expectation on the sample you draw, plus the Rademacher complexity term, plus a term that decays roughly as 1/√m (there's a log(1/δ) in it, but essentially it decays as 1/√m). The best we could hope for is for the Rademacher complexity to also decay as 1/√m, since that's what the last term does; but even a slower decay with m is okay. A constant is not okay, because then you'd always have a constant error in the estimation. So as m increases we want the Rademacher complexity to go to zero for learning to be possible.

Let me also introduce notation that will come up a lot, so I don't keep writing the summation everywhere: Ê_{z~S}[g(z)], with z drawn uniformly from S, simply means the empirical average (1/m) Σ_{zᵢ∈S} g(zᵢ). I will do a full proof of this theorem; it shouldn't be that hard if we use the right inequalities. But let's first apply it to linear regression and see what we get.

Our instance space X is some subset of R^n, and now I need some conditions on it: I'll assume every vector x in it is bounded in ℓ₂ norm by some X (big X), the target values are bounded in some interval [−M, M] (so they can't range over the whole real line), and we're looking at the class of linear functions in which every w has ℓ₂ norm bounded by capital W. We want to compute the Rademacher complexity, first of the class F of linear functions itself. Take some set S of size m and write down the definition of the empirical Rademacher complexity, pulling the 1/m outside. The first thing to notice is that we can pull w out of the summation, since it appears in all the inner products, leaving w · Σᵢ σᵢxᵢ, and we're looking for the best such w in our set. What's the best w? It's the one pointing exactly in the direction of that sum: to maximize an inner product you align exactly with the vector you're pairing with, and then increase the norm as much as you can. That gives the factor W pulled outside (we give the vector that norm, aligned with Σᵢ σᵢxᵢ), and what we're left with is simply the ℓ₂ norm:

R̂_S(F) = (1/m) E_σ [ sup_{‖w‖≤W} w · Σᵢ σᵢxᵢ ] = (W/m) E_σ ‖ Σᵢ σᵢxᵢ ‖₂.

It's really just the equality condition of Cauchy-Schwarz, in case you're tracking. Okay, let's stop there; that's what we have so far, and now we apply a couple of elementary inequalities. First we use Jensen's inequality, or the arithmetic mean being at most the root mean square, whatever you want to call it, to move the expectation inside a square root; then we can compute that expectation a lot more easily, because now we have a squared norm inside, and the σ's are all independent, taking −1 and +1 with equal probability, so all the cross terms vanish. We're simply left with the sum of the σᵢ², which are all 1, times the squared norms of the x's. Working it out a little
bit more, you can show that this is at most W X / √m. Nothing here is hard: the calculations just use inequalities all of us know, applied in the right places. And something here is satisfying, because there's this issue of scale: once we're looking at real-valued rather than boolean functions, scale should somehow matter in a complexity measure. The relevant scale here is how large w·x can be, and since all the x's are bounded in norm by X and all the w's by W, the product WX really is the right scale; you get that divided by √m, which is what you'd hope for as the Rademacher complexity.

Good, so we're making progress, but we don't just want to apply this result to linear functions: we want to apply it to the risk, so we have to do a little more work and compose the loss function with the linear function. For that we'll use a lemma of Talagrand, which I won't prove but will just state. (The φ's on the slide should all be the same symbol; sorry about that.) Take some class of functions G from Z into [a, b], and let φ: [a, b] → R be a univariate L-Lipschitz function. If you compose every function in G with φ and look at the Rademacher complexity of the new class, you get

R̂_S(φ ∘ G) ≤ L · R̂_S(G).

So composing with a Lipschitz function just picks up an additional Lipschitz constant in the Rademacher complexity. That's what Talagrand's lemma gives us, and that will basically allow us to complete the proof and get generalization bounds for linear regression. The class of functions we look at is this: (x, y) ↦ (f(x) − y)², where x is in the input space, y is the target, and f is one of these linear functions; each f defines such a map. Consider φ(t) = t² on the interval [−(WX + M), WX + M]. The reason for this interval is no mystery: the largest possible value of f(x) is WX, so the largest possible value of |f(x) − y| is WX + M, since everything is bounded within those intervals. We want to apply the squaring function to that, and while there's no Lipschitz constant you can define for the quadratic function over all of R, on a bounded interval it has a fixed Lipschitz constant, here 2(WX + M). Together with the fact that a single constant function bounded by M has empirical Rademacher complexity at most M/√m, and a simple fact which you should try to prove as an exercise, just to make sure you understand the definitions (if you look at F + G, the class of all sums of a function from F and a function from G, then R̂_S(F + G) ≤ R̂_S(F) + R̂_S(G)), you get that the Rademacher complexity of the loss class is bounded by quantities depending on the bounds on X, M and W, but crucially with √m in the denominator. That means that, assuming everything else is fixed, as you get more data you get a good guarantee on the generalization error.
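As a quick numerical sanity check (entirely my own illustration), the supremum over the ball ‖w‖ ≤ W has the closed form we just derived, W‖Σᵢ σᵢxᵢ‖/m, so the empirical Rademacher complexity of the linear class can be estimated directly and compared to the bound WX/√m:

```python
import numpy as np

def rademacher_linear(X_data, W, n_trials=2000, rng=None):
    """Estimate the empirical Rademacher complexity of {x -> w.x : ||w|| <= W}.

    By Cauchy-Schwarz the sup over the ball is attained at
    w = W * v / ||v|| with v = sum_i sigma_i x_i, so the sup equals
    W * ||sum_i sigma_i x_i|| / m and no optimization is needed.
    """
    rng = rng or np.random.default_rng(3)
    m = X_data.shape[0]
    vals = [np.linalg.norm(rng.choice([-1.0, 1.0], size=m) @ X_data) * W / m
            for _ in range(n_trials)]
    return np.mean(vals)

rng = np.random.default_rng(3)
m, n, W = 400, 5, 2.0
X_data = rng.normal(size=(m, n))
X_data /= np.maximum(1.0, np.linalg.norm(X_data, axis=1, keepdims=True))  # ||x|| <= 1
X_bound = np.linalg.norm(X_data, axis=1).max()

print("estimate:        ", rademacher_linear(X_data, W))
print("bound WX/sqrt(m):", W * X_bound / np.sqrt(m))
```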
That √m in the denominator is what you should be looking out for. I went a bit fast on this, but hopefully it made enough sense to at least see what's going on. [In response to questions:] As long as the Rademacher complexity is bounded by something that decays with m, our bound decays with m. It doesn't directly show that you need more data, I agree, but there are also lower bounds showing that you do need an amount of data matching the Rademacher complexity. And yes, this only gives a bound on the generalization error, the difference between the error on your training set and the error under the distribution: if you have a lot of noise, you'll see a lot of error on the training set and a lot of error under the distribution, but the difference between them won't be large; that's what this says. As for where the lemma comes from, this is how it's stated in the learning theory literature; I'm sure there's a more general version in probability theory, but I don't know it off the top of my head.

One more remark: it's easier to bound the empirical Rademacher complexity, and as long as you can prove a bound for any set S, that gives a bound on the actual Rademacher complexity. We made no assumptions on the set S apart from its size, so the Rademacher complexity at parameter m decays as 1/√m, which proves what we need.

So that tells us that if we can perform empirical risk minimization, we get generalization via Rademacher complexity bounds. It doesn't tell us that we can perform empirical risk minimization. In the case of linear regression it turns out that we actually can, and it's worth spending a little time on this, though not too much. What's the optimization problem we're trying to solve? Here's our empirical risk, which I've called J(w), mainly to keep the optimization framing: we want to find the ŵ that minimizes J(w) subject to the constraint that the norm of w is bounded by capital W. How do you solve this problem? Without the constraint on the norm it's actually very easy, and there's a closed-form solution. With the constraint it's not too hard either, but you have to resort to some kind of convex optimization technique: you can use (stochastic) gradient descent with projections onto the constraint set, so that the norm of the iterate is always at most W, and these algorithms are guaranteed to find a near-optimal solution. The reason I wanted to talk a little about optimization is to make this point: in machine learning more generally, it's often enough to use not-very-fancy optimization algorithms. Gradient descent based algorithms are not great in the sense that the dependence on ε is
polynomial in 1/ε rather than polynomial in log(1/ε). But ultimately you're going to incur a large error term from the generalization bound anyway, so there's not much point in minimizing your empirical risk below the ε you're going to get from the generalization bound. It's often enough to get reasonably optimal solutions for the empirical risk, and these algorithms are much faster than more complicated convex optimization methods. These are the kinds of algorithms actually used in practice, certainly for more complex models, but even for something like linear regression it's perfectly fine to just use gradient descent.

Briefly, here's what the algorithm does. There's a parameter η that sets the step size; the algorithm takes a gradient step, and the reason you need the extra projection step is that after following the gradient your norm might be larger than the bound you have, so you project back into the feasible set (the projection operator simply reduces the norm). The output is the average of all the iterates from steps 1 to T. An informal version of the kind of theorem you get: if your feasible set is bounded and the gradients of your loss functions remain bounded, then you don't need many steps; you get error decaying roughly as 1/√T, where T is the number of gradient steps, so T = 1/ε² is enough for ε accuracy.
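Here is a minimal sketch of that projected gradient method for the norm-constrained least-squares problem. The step-size schedule η/√t and the iteration count are my own illustrative choices, consistent with the informal theorem above:

```python
import numpy as np

def project_ball(w, W):
    """Project w onto the Euclidean ball of radius W (rescale if too long)."""
    norm = np.linalg.norm(w)
    return w if norm <= W else w * (W / norm)

def projected_gd(X, y, W, T=1000, eta0=0.1):
    """Minimize (1/m)*||Xw - y||^2 over {w : ||w|| <= W} by projected GD.

    Returns the average of the iterates, as in the theorem sketched above.
    """
    m, n = X.shape
    w = np.zeros(n)
    avg = np.zeros(n)
    for t in range(1, T + 1):
        grad = 2.0 / m * X.T @ (X @ w - y)   # gradient of the empirical risk
        w = project_ball(w - eta0 / np.sqrt(t) * grad, W)
        avg += w
    return avg / T

rng = np.random.default_rng(4)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.0, -0.5, 0.25]) + rng.normal(scale=0.1, size=200)
w_hat = projected_gd(X, y, W=2.0)
print("w_hat:", w_hat, " norm:", np.linalg.norm(w_hat))
```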
Now let's go back to actually proving the main result on Rademacher complexity, which is the main thing we're after. What is it that we're trying to prove? We have a class of functions G mapping Z to [0, 1], D is some distribution over Z, we get a sample S of size m drawn from D, and we want to prove that for any δ, with probability at least 1 − δ, for every g ∈ G, the bound stated earlier holds.

What's the strategy? We'll use a concentration of measure inequality, McDiarmid's inequality in particular, which is the following; again there's a lot of text, so let's parse it slowly. Z is some set, and f: Z^m → R is an m-variate function such that for any i there is some cᵢ with the following property: for any possible values z₁, …, z_m and any zᵢ′,

|f(z₁, …, zᵢ, …, z_m) − f(z₁, …, zᵢ′, …, z_m)| ≤ cᵢ,

that is, you apply f, and you apply f again replacing only zᵢ by zᵢ′ with everything else unchanged, and the difference is bounded by cᵢ. If you think about it, this is bounding the influence of any individual variable on the m-variate function: flipping one variable can't change the function value by much. And that alone is already enough to give you very good concentration. The easiest case is when you apply this to the sum of independent random variables, which gives you things like Chernoff bounds, but you get these kinds of bounds for much more general functions: all you need is that no single variable has too much influence on the function f, and as long as you can bound that, you get exponential concentration. So, provided you have such a function, if z₁, …, z_m are drawn iid from any distribution taking values in Z, then

P( f(z₁, …, z_m) ≥ E[f] + ε ) ≤ exp( −2ε² / Σᵢ cᵢ² ).

This is where the dependence on the cᵢ shows up, and as long as the cᵢ are reasonable you'll get a good bound. It's a very powerful inequality, used very widely in learning theory, and we'll use it to prove our main result.

Here's what we'll do. Start with a sample S, and I'm also going to take a second sample S′, which I won't use immediately but we will need; both are drawn iid from the same distribution, samples of size m each. Define the following function on any set of size m:

φ(S) = sup_{g∈G} ( E_{z~D}[g(z)] − Ê_{z~S}[g(z)] ).

Pause for a second to make sure this is okay. It shouldn't be surprising, because this is exactly what we'd like to bound: the term on the left inside the sup is what we want to bound by the term on the right plus some small things, so if we can bound this quantity, that's essentially what we want. So what can we say about φ? We can think of φ as an m-variate function. If instead of S we use the set Sᶦ, which is simply S with the i-th element replaced by zᵢ′, and we consider φ(S) − φ(Sᶦ), the difference between the two sups can be bounded just by the difference between the g(zᵢ) and g(zᵢ′) terms, and there's a 1/m from the empirical average; since g takes values in [0, 1], the difference is bounded by 1/m. I'll give you a few seconds to think about why that's the correct bound; it's not very hard, though writing a complete proof is a little annoying because of the sup (if it were a max it would be easier). So McDiarmid's inequality applies directly with cᵢ = 1/m for every i, which gives us this concentration: the probability that φ(S) is larger than its expectation plus ε is bounded by exp(−2ε²m). Alternatively, flipping this around: for any δ, with probability at least 1 − δ,

φ(S) ≤ E[φ(S)] + √( log(1/δ) / (2m) );

you just set the tail probability to δ instead of ε and reverse the roles of ε and δ. That's good; it's starting to look like the kind of thing we want.
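For reference, here is the bounded-difference step spelled out (my own write-up of the step the audience was asked to verify; the inequality sup A − sup B ≤ sup (A − B) is the only trick):

```latex
\varphi(S) - \varphi(S^i)
  = \sup_{g \in G}\Big(\mathbb{E}_D[g] - \hat{\mathbb{E}}_S[g]\Big)
    - \sup_{g \in G}\Big(\mathbb{E}_D[g] - \hat{\mathbb{E}}_{S^i}[g]\Big)
  \le \sup_{g \in G}\Big(\hat{\mathbb{E}}_{S^i}[g] - \hat{\mathbb{E}}_S[g]\Big)
  = \sup_{g \in G}\frac{g(z_i') - g(z_i)}{m}
  \le \frac{1}{m},
```

and symmetrically for φ(Sᶦ) − φ(S), since g takes values in [0, 1].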
So here's what we have so far. Remembering that φ(S) is a sup over all g in G, we get in particular that for any δ, with probability at least 1 − δ, for every g,

E_{z~D}[g(z)] ≤ Ê_{z~S}[g(z)] + E[φ(S)] + √( log(1/δ) / (2m) );

this is just spelling out what φ is, with the empirical expectation written in the hat notation. And what we want to show, which was the goal of our main theorem (here it is again, in case you've forgotten), is

E_{z~D}[g(z)] ≤ Ê_{z~S}[g(z)] + 2 Rad_m(G) + √( log(1/δ) / (2m) ).

So we have something that looks almost like it, except for this odd expectation E[φ(S)], and what we should hope for is that this expectation is nicely bounded by the Rademacher complexity of the class G: if we can show E[φ(S)] ≤ 2 Rad_m(G), we're done. So let's prove that. This is where you have to put on your probability theory hat and think a bit creatively about how to get around this. Start with what E[φ(S)] is: it's the expectation, over S drawn from D^m, of the sup over g, and inside the sup there's another expectation, E_{z~D}[g(z)]. This is where our fresh sample S′ becomes useful: we artificially introduce S′ inside, replacing the inner expectation E_D[g] by the expectation over a fresh sample of the empirical average on that sample, E_{S′~D^m}[Ê_{S′}[g]]; the combined effect is exactly the same expectation, so nothing has changed so far. (If you've seen the proof of the VC-dimension bound, it uses the same sample-doubling trick; at some point in all of these inequalities you double the sample and use a trick like this to get these kinds of bounds.) Next we move the supremum inside the inner expectation, pulling the expectation over S′ outside, which can only increase the quantity, and we get

E[φ(S)] ≤ E_{S,S′} [ sup_{g∈G} ( Ê_{S′}[g] − Ê_S[g] ) ],

which starts to look a bit nicer. Now we want to use the fact that S and S′ are exactly identically distributed. We draw 2m points, but we don't want to decide ahead of time which points go to S and which to S′; we want to be able to swap points around between them. We can get this by introducing Rademacher random variables: for each i, σᵢ takes the values ±1 uniformly, and σᵢ controls whether zᵢ goes to S and zᵢ′ to S′ (sign +1, the term appears as it does now) or the two are swapped (sign −1, since then the first point contributes to the second expectation and the second point to the first). Okay, so this is what I have;
I'll pause so that you can see if you believe the last step: because S and S' are identically distributed, it doesn't matter which of the two points I put into which sample, and that's the same as saying I can pick sigma_i uniformly at random from {-1, +1} and introduce it inside; sorry, there should be a sum over i in both of those expressions on the slide, it's missing. Once we have that, it's actually nice, because sigma is exactly what we want for the Rademacher complexity. There's a difference of two terms there; we can split it, pull a factor of two outside, and upper bound by just one of the terms, since the two are identically distributed anyway. So, assuming the sigma and the sum are in place, that's exactly twice the Rademacher complexity we wanted. Does that make sense? And that really completes the proof, because all that remained was to bound this expectation by the Rademacher complexity.

Okay, so let me show the theorem statement again. The upshot is that as long as we can bound the Rademacher complexity, not just of the class of functions we're fitting but of the loss composed with that class, we get this kind of result; and as long as we use losses that are bounded and Lipschitz, which is partly why bounded Lipschitz losses are used so often in machine learning, composing with the loss function is fine, because we can still bound the Rademacher complexity.

Okay, so actually, if it's all right, I want to go back to talking a little bit about generalized linear models now; I'm not as badly off on time as I thought, and there's something interesting going on with these generalized linear models. We mentioned linear models, which just had x mapping to w . x. So what are generalized linear models? They're slightly more than linear models: there's an extra univariate function u, which can be nonlinear, so x is now going to map to u(w . x). In general you can put various conditions on u, remove some of them or add others; what's enough for my purposes is that u is bounded, increasing, and 1-Lipschitz. In terms of neural networks, a GLM is really a single unit: this is really all that's going on in a neural network, you have a linear function and apply a nonlinear function on top of it, and that's what every unit in the network does; so in some sense this is the simplest possible neural network, it has a single unit. So we can ask about learning generalized linear models instead of linear models, and we can consider the same ERM problem as we did for linear models: write out the empirical risk in terms of the squared loss, and find a w of bounded norm that minimizes it. We again have two questions, one algorithmic and one from the point of view of statistical complexity. The statistical one is actually easy, because the Rademacher complexity can be bounded easily by everything we've seen so far: we've already bounded the Rademacher complexity of linear functions (a quick numerical illustration of that quantity is sketched below), and here we are simply composing with one additional function.
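As an aside, the empirical Rademacher complexity that all of this rests on is easy to approximate numerically for a finite class. Below is a minimal Monte Carlo sketch (all names are ours, not from the talk); for the unit-norm linear class it samples only a finite subset of weight vectors, so it under-estimates the true supremum:

```python
import numpy as np

def empirical_rademacher(values, n_trials=10_000, rng=None):
    """Monte Carlo estimate of E_sigma[ sup_g (1/m) sum_i sigma_i g(z_i) ]
    for a finite class.

    values: array of shape (num_functions, m); row g holds
    (g(z_1), ..., g(z_m)) for one function evaluated on the sample S.
    """
    rng = np.random.default_rng() if rng is None else rng
    m = values.shape[1]
    total = 0.0
    for _ in range(n_trials):
        sigma = rng.choice([-1.0, 1.0], size=m)  # Rademacher signs
        total += np.max(values @ sigma) / m      # sup over the finite class
    return total / n_trials

# Toy example: linear functions x -> w.x with ||w|| <= 1, approximated
# by 200 random unit-norm weight vectors.
rng = np.random.default_rng(0)
m, d = 50, 5
X = rng.normal(size=(m, d))                      # the sample S
W = rng.normal(size=(200, d))
W /= np.linalg.norm(W, axis=1, keepdims=True)    # unit-norm rows
print(empirical_rademacher(W @ X.T, rng=rng))
```

For comparison, the standard bound for this class is sqrt(sum_i ||x_i||^2) / m, so the printed number should come out somewhat below that.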
That additional function is 1-Lipschitz, so composing with it doesn't really change anything; note that u is assumed to be known in all of these models. So we can use the same analysis as we had earlier. What changes is the optimization problem, which now certainly becomes hard: the objective is no longer convex. In this particular case of generalized linear models there's a sort of cute trick that can help get around this problem, and that's what I want to briefly talk about before we take a break. So I want to construct a different cost function; I'd love to have a board here, but I don't, so maybe I'll just give some intuition about what this cost function is doing. I'm going to define the cost gamma(y', y) as follows: the integral from zero to u^{-1}(y') of (u(z) - y) dz. The question is why this loss, and there are two answers to that question. First, u is strictly increasing, so its inverse is well defined and I can define this integral. And the good thing about this particular cost function is that if I use it instead of the squared loss, the problem becomes convex. So why does it become convex? Really because u is monotonic, and here's what shows up. Write the loss with a = w . x as the argument; what's the gradient of the loss with respect to w? In this case it's just (u(w . x) - y) times x. And for reference let me write on the side what's going on with the squared loss: the gradient there is (u(w . x) - y), times u'(w . x), times x, up to a factor of two. So the only difference between the two is that with the new loss you don't get this annoying u' factor at all, which is great, because once you take the second derivative, the Hessian, and I won't write it out neatly, is essentially x x-transpose times some u' factor somewhere, and this u' is positive because u is monotone; that's what makes this convex, so we can actually solve this convex optimization problem. The squared-loss objective, on the other hand, is not convex; you can't see that directly from the derivatives, at least I can't, but you can construct instances where it is non-convex. And the other thing, which some of you may have heard of, and which relates a little to what you would do in neural networks: typically the functions you use for u are things like the sigmoid, where u' happens to be very close to zero over much of the domain, which means your gradients are very small. This is the gradient saturation problem you get in neural networks, because there you take gradients of the squared loss directly; it somehow disappears when you use this other loss, where the gradients are actually very well behaved. Okay, so that's good: we have some convex function which we can actually optimize. The question, of course, is whether it means anything; I just made up some kind of loss function, and by itself it doesn't tell me very much.
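To make this concrete, here is a small sketch with u taken to be the sigmoid (our choice of example, not the talk's); for this u the integral has a closed form, and the gradient is exactly the u'-free expression above:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def matching_loss(a, y):
    """gamma(u(a), y) = integral_0^a (u(z) - y) dz for u = sigmoid.
    The antiderivative of sigmoid is log(1 + e^z), so this is
    log(1 + e^a) - log 2 - y * a in closed form."""
    return np.logaddexp(0.0, a) - np.log(2.0) - y * a

def grad_w(w, X, y):
    """Gradient of the average loss: (u(w.x) - y) x, with no u' factor,
    so it does not saturate even where sigmoid'(w.x) is tiny."""
    return X.T @ (sigmoid(X @ w) - y) / len(y)

# Gradient descent on the convex objective, in the realizable setting
# E[y | x] = u(w_true . x); the minimizer should recover w_true.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = sigmoid(X @ w_true)
w = np.zeros(3)
for _ in range(2000):
    w -= 0.5 * grad_w(w, X, y)
print(w, matching_loss(X @ w, y).mean())
```

For the sigmoid this loss is, up to an additive constant, the familiar logistic loss, which is one reason logistic regression is trained with it rather than with the squared loss.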
What one can show indeed is this: in the absence of any assumptions we can't really say very much, but if we go back to the realizable setting and assume that the expectation of Y given X really is of the form u(w . x), then the global minimizers of the squared error and of this funny loss are exactly the same, and in fact approximate minimizers of one are also approximate minimizers of the other. So if you can approximately minimize this loss in the realizable setting, you actually get an approximate minimizer of the squared loss. This is also not that hard to show, but since I didn't prepare it, I would risk making an error, which I wouldn't want to do; it's really just a matter of understanding the integral, and this is where the 1-Lipschitzness comes up: you get that this integral is at least half the squared loss, essentially.

Okay, I would like to take a break roughly at this point before starting SVMs, but since you did do a computational learning theory session in the morning, let me point out a couple of other computational issues that come up with these GLMs. You can ask what happens if you drop the increasing condition and keep just the Lipschitz constraint. This actually doesn't change the statistical complexity of the problem much at all, because it's really the Lipschitz constant that determines the statistical complexity; but the computational complexity suddenly becomes vastly different, even in the realizable setting or something like it. This relates to the problem called learning parities with noise; who has heard of learning parities with noise? Okay, then maybe the reference doesn't mean much, but I'm happy to talk about it in the break if anyone is interested. The point is that the nice thing we could do when u is increasing we can no longer do if u is merely Lipschitz: the statistical complexity barely changes if you remove the increasing condition, but the computational complexity changes vastly, because we believe there are no efficient algorithms for learning parities with noise, and an efficient algorithm here would imply one there. So that's all I wanted to say about generalized linear models; after the next session I'll talk about SVMs, a bit about neural networks, and algorithmic stability. Yes? [Question from the audience.] Right, so the question was what happens if u is not bounded. I don't think it changes much: typically we will assume that X is bounded and w is bounded, so u would have to be very weird to be unbounded, because it's effectively defined on a bounded interval, and if it's Lipschitz on a bounded interval it cannot be unbounded there. So you need some kind of boundedness assumption, but it's not really a separate requirement; it's implied by the bounds on X and w. And the 1-Lipschitz condition can actually be traded off against the bound on w: if you have an L-Lipschitz u, you can just allow the norm of w to be larger by a factor of L, and that's really the same thing; there's a single shared scale for u and w, so if you want to keep u 1-Lipschitz you can just increase the allowed length of w, and that's perfectly fine.
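Both of the facts gestured at above can be written in a line or two (our sketch, assuming u is differentiable so the change of variables below is legitimate; the talk does not spell these out):

```latex
% (1) With a = w.x, the loss dominates half the squared loss: substitute
% t = u(z), dz = (u^{-1})'(t) dt, and note (u^{-1})'(t) = 1/u'(u^{-1}(t)) >= 1
% since u is increasing and 1-Lipschitz. (This integral differs from
% gamma above only by a term independent of the prediction.)
\[
\int_{u^{-1}(y)}^{a}\bigl(u(z)-y\bigr)\,dz
= \int_{y}^{u(a)}(t-y)\,(u^{-1})'(t)\,dt
\;\ge\; \int_{y}^{u(a)}(t-y)\,dt
= \tfrac12\bigl(u(a)-y\bigr)^2 .
\]
% (2) The scale trade-off: if u is L-Lipschitz, set
% \tilde{u}(z) = u(z/L) (which is 1-Lipschitz) and \tilde{w} = L w, so
\[
u(w\cdot x)=\tilde{u}(\tilde{w}\cdot x),
\qquad \|\tilde{w}\| = L\,\|w\| .
\]
```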
Yes? [Question about what these concentration results say in general.] Yes, so you have some function of random variables; you can look at the expectation of that function, and you can look at its actual realized value, and you want to ask how likely it is that the value deviates significantly from its expectation. All of these bounds say that the probability of deviating significantly from the expectation is very, very small; that's what all these concentration-of-measure phenomena are saying. And you get different rates depending on the kinds of functions involved: in this case, for example, McDiarmid says that if you can bound the influence of any single variable well enough, then you actually get these very sharp exponential tails. In general, depending on what conditions you have, you get different kinds of bounds, but that's the general idea: how close should a random variable be to its expectation, and can you give a tight bound on the probability that it lands more than some distance away from its expectation. Right.
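As a quick sanity check on this picture, here is a minimal simulation (our illustration, not from the talk) comparing the empirical tail of the mean of bounded i.i.d. variables with the exp(-2 eps^2 m) tail used above; the bound holds, and for this easy case it is quite conservative:

```python
import numpy as np

# How often does the mean of m i.i.d. [0,1]-valued variables exceed its
# expectation by eps, versus the Hoeffding/McDiarmid-style bound?
rng = np.random.default_rng(1)
m, eps, trials = 200, 0.1, 100_000
samples = rng.uniform(0.0, 1.0, size=(trials, m))  # mean 0.5, range [0,1]
tail = np.mean(samples.mean(axis=1) - 0.5 >= eps)  # empirical tail probability
print("empirical tail:", tail)
print("bound exp(-2*eps^2*m):", np.exp(-2 * eps**2 * m))
```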
Info
Channel: Federated Logic Conference FLoC 2018
Views: 1,029
Rating: 4.8 out of 5
Id: JjsOezNsXbU
Length: 78min 21sec (4701 seconds)
Published: Mon Jul 30 2018