James Worrell: Computational Learning Theory II

Captions
Okay, so I got a couple of requests during the break to clarify some things from last time, so let me quickly summarise what we did in the first half. I talked about the basic setup of PAC learning. In a PAC learning problem you have a concept class: a set of functions from the input space to {0,1}, or if you like a collection of subsets of the input space. We considered several examples, such as rectangles as a concept class, linear classifiers, and neural nets, which are got by composing linear classifiers, with a fixed architecture: you fix the graph, and the only thing that varies is the weights. We talked about the notion of PAC learnability of a concept class, and after that about the VC dimension of a concept class. This is a number that could be finite or could be infinity, and where we are heading is the idea that a class is PAC learnable if and only if it has finite VC dimension; that is the fundamental theorem. As a stepping stone along the way we talked about the growth function. The growth function of a concept class is a function from N to N: for each m its value is the maximum, over sets of size m, of the number of labellings of the set that can be realised by concepts in the class. In particular, if the class has infinite VC dimension then for every m there is a set of size m that can be shattered, so the growth function is 2^m for every m; if it has finite VC dimension, the growth function is polynomial. The reason we care about the growth function is to prove the fundamental theorem, that finite VC dimension implies learnability.

The most significant thing in the first half was the definition of what it means for a concept class to be PAC learnable. It reads: for all epsilon, the accuracy of the hypothesis we want, and for all delta, the confidence, that is, the probability of the failure event that we just learn a terrible hypothesis, there exists m, the sample size the learner needs to see in order to get an epsilon-accurate hypothesis with that confidence, and there exists a learning map. The learning map takes a labelled sample of size m and returns a classifier. In the case of rectangles in the plane, the learning map takes a labelled sample, a bunch of points labelled positive and negative, and returns a classifier, namely the smallest rectangle enclosing all the positive points. So the learning map takes a sample and returns a classifier; that is what does the learning. And what we require is that it be probably approximately correct: "approximately correct" means that the error of the learnt classifier is less than epsilon, and "probably" means this holds with probability at least 1 - delta, where the probability is over the sample. If you draw a sample that is completely unrepresentative of the true distribution, then you are dead and you just learn a terrible classifier; but if it is representative, you learn a good classifier. There is a point hidden there that I want to make explicit.
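Written out symbolically, the definition just recalled reads as follows (this is the realisable setting described here; whether the inequalities are strict is an unimportant choice):

```latex
\forall \varepsilon,\delta\in(0,1)\ \exists m\in\mathbb N\ \exists L\colon (X\times\{0,1\})^m\to\{0,1\}^X
\ \text{such that}\ \forall c\in\mathcal C\ \forall D\ \text{on}\ X:
\]
\[
\Pr_{x_1,\dots,x_m\sim D}\Bigl[\operatorname{err}_{D,c}\bigl(L\bigl((x_1,c(x_1)),\dots,(x_m,c(x_m))\bigr)\bigr)<\varepsilon\Bigr]\ \ge\ 1-\delta,
\qquad
\operatorname{err}_{D,c}(h)=\Pr_{x\sim D}\bigl[h(x)\neq c(x)\bigr].
```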
The hidden point is that the definition of error refers to the distribution D in the background, which is fixed and unknown. The distribution is what you draw your samples from, so it governs the samples, and the error of the learnt hypothesis is also defined in terms of it: it is the probability that the learnt hypothesis differs from the target concept on a random example drawn according to D, in other words the closeness of the learnt hypothesis to the target concept. That is my summary of last time.

What I am heading towards now is concept classes defined in first-order logic, and I finished with this slide, so let me give an example. Here is a structure, the reals with their order, so the signature is clear, and here is a formula. Its variables are partitioned into two kinds. The y variables I call the parameters; think of them as defining the two corners of a rectangle in the plane. And x_1 and x_2 are a point, which may or may not be in the rectangle. What the formula says is that the point is in the rectangle: x_1 lies between y_1 and y_3, and x_2 lies between y_2 and y_4. If I fix values b_1, ..., b_4 of the parameters, particular real numbers, then I get a rectangle. The notation phi(A, b) means, not the formula in the two remaining variables, but its denotation: the set of all points satisfying it, which is the rectangle whose corners are determined by the parameters. And the class of rectangles, calligraphic C, is what I get as I let the parameters vary over all 4-tuples of reals. So the concept class of rectangles is exactly realisable in this form, for this formula and this structure. This is a very generic way of getting concept classes: a single formula in a single structure; as you vary the parameters you get different concepts, and every setting of the parameters gives a concept. As I remarked, neural nets over a fixed graph can also be realised in this way, hopefully in an intuitive way.
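For the record, here is the rectangle example written out (I use non-strict inequalities; whether the rectangle is closed or open does not matter here):

```latex
\varphi(x_1,x_2;\,y_1,y_2,y_3,y_4)\ :=\ y_1\le x_1\ \wedge\ x_1\le y_3\ \wedge\ y_2\le x_2\ \wedge\ x_2\le y_4,
\]
\[
\varphi(\mathcal A,\bar b)\ =\ \{(a_1,a_2)\in\mathbb R^2:\ \mathcal A\models\varphi(a_1,a_2;\bar b)\},
\qquad
\mathcal C_\varphi(\mathcal A)\ =\ \{\varphi(\mathcal A,\bar b):\ \bar b\in\mathbb R^4\},
```

and over A = (R, <) the class C_phi(A) is exactly the class of axis-aligned rectangles.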
Now I am interested in the VC dimension of the concept class that I get from a formula and a structure, so let us see some examples of bounding it. Here is a nice structure to consider: the reals with addition and multiplication. This is nice because now we can define polynomials; before we had the reals with just the order. What I want is a bound on the VC dimension of the concept class determined by a particular formula phi in this structure, in terms of the complexity of phi, where phi is a formula of first-order logic. Why would it be helpful to have addition and multiplication in the structure? Imagine that instead of axis-aligned rectangles you wanted to define polygons in the plane. The parameters would be the vertices of the polygon, and you need a formula saying that the point lies inside the polygon; for that you need a little bit of arithmetic, and you could not do it with order alone. Triangles, for instance, could be defined with six parameters, and you can imagine a formula telling you whether a point lies inside the triangle, but you need some arithmetic to write it. So we work over this structure.

Here is the key tool, a basic theorem in semialgebraic geometry; it is ancient, really, slightly tweaked by Goldberg and Jerrum in their paper on neural nets. It says the following: consider a fixed collection of s polynomials of degree at most d in k real variables. Then the number of realisable sign assignments, each p_i being positive, negative, or zero, is at most a quantity that is exponential in the number of variables but polynomial in the other data. What I mean by a realisable sign assignment is this: these are polynomials in k real variables, so if I plug in a point of R^k, each polynomial comes out positive, negative, or zero, and the lemma bounds the number of sign vectors that actually arise.

With this in hand you can show the following. Consider the structure (R, +, x) and a formula over it. What does a formula over this structure look like? The structure has quantifier elimination, so we may as well consider only quantifier-free formulas, and a quantifier-free formula is a boolean combination of polynomial equalities and inequalities. So suppose phi is a boolean combination of s atomic formulas, each of the form p(x_1, ..., x_n, y_1, ..., y_k) > 0 or = 0, and suppose these polynomials have degree at most d. Then the VC dimension of the induced concept class is finite, and in particular it is at most a number that is linear in the number of parameters k. That is the magic thing here: we have k parameters and the bound is linear in the number of parameters.

In fact, let me give the argument, just so that it reinforces some of the concepts we are learning. I want to bound the VC dimension, so I want to bound the size of a set that can be shattered. Let S = {a_1, ..., a_m} be a shattered set. Just to be clear: for every instantiation of the parameters I get a concept, and the concepts live on R^n. Now let P be the set of polynomials that appear in some formula phi(a_i, y_1, ..., y_k): I take my formula phi and, for every element a_i of the shattered set, I plug it in, getting a formula in the k parameter variables y_1, ..., y_k. How many polynomials were there in phi? There were s, so when I plug in a_i I still have s polynomials, and there are m values of a_i to plug in, so P has size m times s, and the degrees are no bigger than before, still at most d. The previous lemma, Warren's, then tells me how many sign assignments there are to these polynomials.
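Stated symbolically, in the form used by Goldberg and Jerrum (the exact constant varies between sources): for polynomials p_1, ..., p_s in R[y_1, ..., y_k], each of degree at most d,

```latex
\#\Bigl\{\bigl(\operatorname{sgn}p_1(b),\dots,\operatorname{sgn}p_s(b)\bigr)\in\{-1,0,+1\}^s:\ b\in\mathbb R^k\Bigr\}
\ \le\ \Bigl(\frac{8eds}{k}\Bigr)^{k}\qquad\text{(for }s\ge k\text{)}.
```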
Now consider two tuples of parameters, b and b'. We are interested in shattering S; each of these tuples induces a concept, and the question is whether the two concepts give the same labelling of S. Can someone tell me: they induce the same labelling if and only if what? I want to say something about the polynomials in P, the polynomials obtained by instantiating the elements of S. Exactly: they induce the same labelling of S if and only if the polynomials in P, which are polynomials in the parameter variables, take the same signs at b as at b', in other words if and only if b and b' induce the same sign assignment on P. But the previous lemma gives an upper bound on the number of sign assignments realisable on a set of polynomials, in terms of their number and degree; pushing the arithmetic through, the number of labellings of S realised by the class is polynomial in m. For S to be shattered, however, the number of labellings must be 2^m, and 2^m eventually exceeds any fixed polynomial in m, so we get an upper bound on the size of a set that can be shattered. I will not actually do the arithmetic, since not much is gained by it, but that is how you prove the result. The thing I really want to emphasise is that the VC dimension depends on the number of parameters, which is the intuition we have been building up.

Recall the one counterexample we saw, the concept class defined using the sine function, which has infinite VC dimension. The reason everything works here is that this is at bottom a result about the number of connected components of sets definable in this structure; if I admit a function like sine, I get all kinds of wild behaviour.
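For the record, here is the arithmetic that was just skipped, as a rough sketch whose constants depend on the exact form of the bound above. If S = {a_1, ..., a_m} is shattered, then distinct labellings of S require distinct sign assignments to the set P of ms polynomials in the k parameter variables, so

```latex
2^{m}\ \le\ \Bigl(\frac{8ed\,(ms)}{k}\Bigr)^{k},
```

and since m appears only logarithmically on the right-hand side, this forces m = O(k log(ds)): linear in the number of parameters k, up to the logarithmic factor.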
Here is a useful lemma in this setting, coming purely out of model theory. Remember what the notation means: a formula phi and a structure A define a concept class, which I was calling C_phi(A), and VC(phi, A) is its VC dimension. The lemma says: take a class of structures and a formula phi(x; y_1, ..., y_n), with parameters y_1 up to y_n but just one variable x, so that it defines concepts on the universe itself, on 1-tuples rather than pairs or triples. If I have a bound on the VC dimension of these one-dimensional concept classes, then for free I get a bound on the VC dimension of the classes of m-tuples, for every m. The bound is horrible, but it exists. So when I am trying to bound VC dimensions, I only need to care about one free variable.

To give an example of a proposition that uses this theorem: let A be an o-minimal structure. An o-minimal structure is a structure with an order such that every definable one-dimensional subset of the structure is a finite union of intervals. Here is the classic example: the real field. If I think of a subset of the reals definable over (R, +, x) by a formula with one free variable, it is a finite union of intervals, and it is kind of obvious when you think about it. First, I do not need quantifiers, because this structure has quantifier elimination; then look at the atomic formulas: an atomic formula says that some polynomial p(x) is greater than zero, which clearly defines a finite union of intervals, and boolean combinations of finite unions of intervals are again finite unions of intervals.

Now, what can you tell me about the following VC dimension? Take the input space to be R and let C be the class of unions of two intervals. What is the VC dimension of this concept class on the reals; what is the largest set I can shatter, given that my concepts are unions of two, possibly overlapping, intervals? Three? Four? Before I get confused between three and four: I will say four. With unions of two intervals I can realise every labelling of four points; yes, agreed, four. But five I could not, because I could not realise the alternating labelling positive, negative, positive, negative, positive. The point is that the VC dimension is again governed by the number of parameters, here the endpoints of the intervals; in any case it is finite.
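Here is a tiny brute-force check of the two-intervals discussion above, a sketch with a hypothetical helper: a labelling of points on the line is realisable by a union of at most r intervals exactly when its positive points form at most r consecutive runs.

```python
from itertools import product

def shatters(num_intervals, num_points):
    """Can unions of at most `num_intervals` intervals realise every labelling
    of `num_points` increasing points on the line?

    A labelling is realisable by at most r intervals exactly when its positive
    points form at most r consecutive runs, so it suffices to count runs of 1s.
    """
    for labels in product([0, 1], repeat=num_points):
        runs = sum(1 for i in range(num_points)
                   if labels[i] == 1 and (i == 0 or labels[i - 1] == 0))
        if runs > num_intervals:
            return False
    return True

print(shatters(2, 4))  # True: four points are shattered by unions of two intervals
print(shatters(2, 5))  # False: the labelling 1,0,1,0,1 needs three intervals
```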
So here is what we know. By the earlier reduction result, to show that the VC dimension of every concept class defined by a formula over an o-minimal structure is finite, it is enough to show it for formulas with one free variable, with many parameters but just one variable, and these define subsets of the structure. Now here is the key point. For every choice of parameters b_1, ..., b_n, the set phi(A, b), a subset of the line, is a finite union of intervals, by o-minimality; and the key fact is that there is an absolute bound on the number of intervals, uniform as the parameters vary. That is a general result about o-minimal structures, and it gives a bound on the VC dimension. Let me add that the reals with addition and multiplication are o-minimal, and if you add exponentiation you stay o-minimal, which is Alex Wilkie's result; so for that structure you get a bound too, and you can use it to bound the VC dimension of sigmoidal neural nets. A terrible bound, mind you. Someone asked: if I add sine, is the structure still o-minimal? No: every one-dimensional definable set has to be a finite union of intervals, and the zero set of sine clearly is not; and indeed, as we saw, the concept class defined using sine had infinite VC dimension.

So these kinds of results about logic can be used to give VC bounds for neural nets with a fixed architecture. You fix the graph: a feed-forward neural net with N neurons, the nodes of the graph, and W weights in total, the weights sitting on the edges. Recall that once I fix the architecture I get a class of functions by varying the weights, and the weights are exactly what I vary when I train the net. If the activation function is piecewise polynomial, that is, a bunch of pieces with each piece a polynomial, so the ReLU activation for instance falls into this case, then the VC dimension of the corresponding class of functions is bounded by an expression proportional to the number of weights and neurons; and there is a corresponding bound for the sigmoid activation function. That second bound does not come from the o-minimality result I just described, which would give an absolutely colossal bound; it is from a paper by Karpinski and Macintyre, with a more direct analysis of the number of connected components, the number of cells, in that structure. So there is a close connection between bounding the VC dimension of formulas and bounding it for neural nets.

I want to give another example, now with more graph-like structures. So far the structure A has been the real numbers, a continuous structure; now I want to think in terms of graphs, and I want to bound the VC dimension of concept classes C_phi(A) as A ranges over a class of structures. There is a lot going on there, so just to recall: we have a fixed formula, with its variables and its parameters; over a given structure it defines a concept class, whose concepts are obtained by varying the parameters; and that concept class has a VC dimension, which we are calling VC(phi, A). I have just pointed out that if you fix the structure, say the real field, then every formula has finite VC dimension. Now I want to keep the formula fixed, let the structure vary, and ask for an upper bound on the VC dimension as the structure ranges over a class.

Here is some motivation, from databases. Suppose you want to learn a query. You know the shape your query will take, that is, you have a formula in mind, and the actual query is determined by plugging parameters into it; you see some positive and negative examples and you want to learn the parameters. But you do not know the exact shape of the database, which is your underlying structure A, so you want a bound on the VC dimension that holds uniformly as A varies over a class of structures.

In general this is impossible, and here is an example with a completely trivial formula. The signature has a single binary relation E, and consider the following structure: a bipartite graph with n nodes on the left and 2^n nodes on the right, with an edge from i on the left to j on the right just in case the i-th bit of the binary expansion of j is 1. (I managed to confuse myself drawing this at the board; so much for my binary skills.) My formula is just phi(x; y) = E(x, y), with y the parameter. Clearly, by instantiating the parameter with a suitable node on the right, I induce any subset I like of the left-hand set, so the left-hand set is shattered. The point is that by keeping the formula fixed and changing the structure, I have no bound: the VC dimension can be as high as I like.
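Here is a quick sketch of that construction (helper names are mine), checking that the single formula phi(x; y) = E(x, y) shatters the n left-hand nodes of the bit graph:

```python
from itertools import chain, combinations

def bit_graph_edges(n):
    """Edges of the 'bit graph': left nodes 0..n-1, right nodes 0..2^n - 1,
    with an edge (i, j) exactly when bit i of j is 1."""
    return {(i, j) for j in range(2 ** n) for i in range(n) if (j >> i) & 1}

def left_side_is_shattered(n):
    """Check that phi(x; y) = E(x, y) shatters the left side {0, ..., n-1}:
    every subset of the left side equals {x : E(x, j)} for some right-hand node j."""
    edges = bit_graph_edges(n)
    realised = {frozenset(i for i in range(n) if (i, j) in edges) for j in range(2 ** n)}
    subsets = {frozenset(s) for s in
               chain.from_iterable(combinations(range(n), r) for r in range(n + 1))}
    return realised == subsets

print(left_side_is_shattered(3))  # True, and the same works for every n
```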
So how can I hope for a result here? Well, let us consider a less general class of structures. Again I will consider a class of graphs, and the graphs I want to consider are forests. These are directed graphs in which every node has at most one parent, with a pointer going to the parent; a node could be an orphan; and the trees in the forest could be infinite, descending forever. So a forest is a bunch of trees with parent pointers, some orphans, and maybe infinite chains of descendants. Let K be the class of all forests. The claim now is that if I fix a formula, then I have an absolute bound: for any formula phi, the VC dimension VC(phi, A) is bounded independently of the structure A, so long as A is a forest; the bound depends only on the formula. If you like, forget about the class of structures: fix a forest, and we want to bound the VC dimension independently of which forest it is.

Again we can apply the earlier reduction result, which says that without loss of generality our formula has, apart from the parameters, just one free variable, so what it defines are subsets of the structure. Say phi has quantifier depth k. I am going to argue that there are two kinds of sets that cannot be shattered by the concept class obtained by varying the parameters. First a bit of notation: for any two nodes of the forest I want to talk about their distance, which is just their distance in the underlying undirected graph of the forest. Two nodes in different trees have distance infinity; within a tree you get the usual graph distance, so these two nodes have distance two, those have distance three, and so on.

I claim the following: a sufficiently large subset S of the forest such that any two distinct a, b in S have distance greater than 4k, where k is the quantifier depth, cannot be shattered. Think of it as a bunch of points in the forest that are all far apart, far apart relative to the formula. Just to be clear how we would shatter such a set: we take the formula and vary the parameters, and we would have to realise every dichotomy, every labelling of the set. I claim also that a sufficiently large set in which all the points are pairwise close cannot be shattered. If you grant me these claims, can someone tell me why this gives a bound, why it tells me the VC dimension is finite? I have these two kinds of unshatterable sets, and to show the VC dimension is finite I need to show that there is some number beyond which no set can be shattered. So why is it enough to show that I cannot shatter sets of these two special forms? Suppose I take a large set that I want to shatter: can I argue that, if it is big enough, it must contain a large set of one of these two types?
One type is a set in which every two distinct points are far apart; the other is one in which every two distinct points are close. Yes, that is the right intuition: there is a pigeonhole principle going on, specifically Ramsey's theorem. If I colour the edge between two points according to whether they are close together or far apart, then Ramsey's theorem says that a sufficiently large complete graph contains either a clique of a given size in one colour or a clique of a given size in the other colour. So by Ramsey's theorem it suffices to show that I cannot shatter sufficiently large sets in which everything is pairwise close, and sufficiently large sets in which everything is pairwise far apart.

Let me briefly consider a large set in which everything is far apart, and see why it cannot be shattered. Here is a forest, clearly a depiction of a forest, and here is a set, sufficiently large, in which every pair of elements is far apart. Let me also write the formula I want to shatter with: it has n parameters, phi(x; y_1, ..., y_n), and let the set be a_1, ..., a_{3n}, with maybe a_{3n+1} for good measure. The claim is that I cannot shatter a set of 3n elements that are pairwise far apart in the forest. Why? To shatter this set I would in particular need to realise the fifty-fifty labelling, the one in which exactly half of the elements are labelled positive and half negative. That means instantiating the parameters with concrete values from the forest so that exactly half of the a_i satisfy the resulting formula.

Now, the parameters live in the forest, in the trees, and the points a_i are all far apart. A parameter could be close to one of the a_i, where "close" means within a distance related to the quantifier depth k, while "far apart" meant distance greater than 4k, but it cannot be close to two of them, because those are far apart from each other. So each parameter is close to at most one of the points we want to shatter, and there are only n parameters, so without loss of generality, discarding the at most n points that are close to some parameter, I may assume the parameters c_1, ..., c_n are all far from a_1, ..., a_{2n}. That is the situation: my parameters are all far from the remaining points.

And now, I neglected to tell you one very important property of this set: I am going to assume that a_i and a_j satisfy exactly the same formulas psi(x) of quantifier depth k. In other words, the quantifier-depth-k types of a_i and a_j are the same: these points are indistinguishable by first-order formulas with one free variable and quantifier depth k. I can assume this because there are only finitely many such types: if I bound the quantifier depth and the number of variables, there are finitely many first-order formulas up to logical equivalence, so a sufficiently large set contains a large subset of points all of the same type.
Okay. So now I claim that, because of this, and this is just locality of first-order logic, the quantifier-depth-k type of the tuple (a_i, c_1, ..., c_n) is the same as that of (a_j, c_1, ..., c_n), for all distinct i and j among these far-away points. (I hope the people at the back can see; I was told not to write at the bottom of the board, so I am living dangerously.) What is the situation? I have points that are pairwise far apart, parameters that are far from all of them, and the points all have the same type. Therefore, by general locality properties of first-order logic, the type of the tuple (a_i, c_1, ..., c_n) equals the type of (a_j, c_1, ..., c_n); in particular the two tuples are indistinguishable by formulas of quantifier depth k, and in particular by the formula phi. So every one of these points gets the same label, and the fifty-fifty labelling cannot be realised. That was hard work, the detailed part. I will not cover the second bullet point, the close-together case, but this proves the claim that the VC dimension is bounded in terms of the quantifier depth.

There is a nice generalisation of this, or not quite a generalisation, but I guess it is, due to Grohe and Turán: if I have a relational signature and a class of structures of bounded treewidth, then for any formula phi, now even of monadic second-order logic, the VC dimension of the concept class C_phi(A) is bounded as A varies over the class of structures with an absolute bound on the treewidth. Rather than talking about types, they use tree automata for this.

So, just to recap, in case you are lost, to give you a chance to reconnect: we define concept classes by formulas over structures, and this is the logic-and-learning part. We have a concept class defined by a formula and a structure, and the concepts are obtained by varying the parameters. The first thing we did was fix the structure to be something we are really interested in, like the reals with exponentiation, so that the definable concept classes match classes we care about, such as neural nets, and we got general bounds on the VC dimension. (I have not yet said that finite VC dimension equals learnability; that is next.) The other thing we did was let the structure vary over a class of reasonable structures, such as structures of treewidth at most ten, and again we bounded the VC dimension of all the concept classes we could define, as long as they were defined in a reasonable fragment of logic, monadic second-order logic in this case.

Are there any questions? Yes: the signature in the forest case was just one binary relation symbol, interpreted as the parent relation, the parent function if you like. And yes, I am a little confused about timing, but I should go until half past twelve.

Okay, I want to jump over this. I have been talking about VC dimension a lot, and if you know what VC dimension is then you are happy, or maybe, if you know it too well, you are thinking "I know all this"; but if you do not know much, you may be thinking: I came to hear about learnability, and all you talk about is VC dimension. So why am I talking about VC dimension? Because of the following fundamental theorem, and if I want to tell you one thing to keep in mind, I think it would be this:
there is a very neat trick in the proof of this theorem which I want to explain. It is very standard, very common in learning theory, but it is very nice, and I expect later lectures will build on it. It is called the symmetrisation, or double-sampling, trick. Here is the key theorem, the fundamental theorem. Let C be a concept class of finite VC dimension d, let epsilon and delta be given, and let m be at least a certain expression involving the VC dimension, epsilon and delta; in particular, the smaller epsilon and delta are, the bigger m must be. We have seen expressions like this before: if you remember the sample bound for learning axis-aligned rectangles, it looked not unlike this, with 4, the VC dimension of rectangles, in place of d. Then, for any target concept we want to learn and any distribution D on the input space, the probability that a sample of size m, drawn from D and labelled by the target, is consistent with some hypothesis h in C whose error is at least epsilon, is at most delta.

Why is this theorem important? Think about the learning map. The most obvious way to define a learning map is to take the sample and map it to some hypothesis that predicts the labels on the sample, a consistent hypothesis. That seems a sensible thing to do. What could go wrong? What could go wrong is that you pick what I will call a bad hypothesis: one that happens to agree with the target on the sample you have drawn and yet has error at least epsilon, where again the error of h is the probability, for x drawn according to D, that h(x) differs from the target. So the theorem bounds the probability of that bad event when the sample is large enough, and the key point is that it does so over all hypotheses: C is in general an infinite set, and this is where finite VC dimension is essential to make things work. A corollary is that classes of finite VC dimension are PAC learnable, namely by the learning function that just picks any consistent hypothesis.

Any questions about that? Ah, that is a very good point: how do I know a consistent hypothesis exists? Well, I am assuming there is a target concept out there. When I draw a sample, what do I see? I draw the points from the input space according to the distribution, and I see them together with their labels according to the target. Think of learning rectangles in the plane: I see some points in the plane, together with labels saying whether they are inside or outside the target. There is a more general setting, the so-called agnostic setting, which is of course the real-world setting: if I am learning to recognise spam, it is not the case that there really is a true linear classifier out there. But here a consistent hypothesis exists by construction, namely the target itself, so if I fit a consistent hypothesis I cannot go wrong on that score.
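Stated symbolically, the theorem reads roughly as follows; the constant c_0 and the exact form of the logarithmic terms vary between textbook statements. For any target c in C and any distribution D,

```latex
m\ \ge\ \frac{c_0}{\varepsilon}\Bigl(d\log\frac{1}{\varepsilon}+\log\frac{1}{\delta}\Bigr)
\ \Longrightarrow\
\Pr_{x_1,\dots,x_m\sim D}\Bigl[\exists h\in\mathcal C:\ h(x_i)=c(x_i)\ \text{for all}\ i,\ \text{and}\ \operatorname{err}_{D,c}(h)\ge\varepsilon\Bigr]\ \le\ \delta.
```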
I want to go through this proof a little; it will be the last proof that I do, because I need to make progress. There is a so-called double-sampling trick, a very nice trick, and it is where the growth function and the VC dimension come in.

The first idea is this: if I draw two samples of size m from the distribution, then, intuitively, if the VC dimension of my class C is bounded, the two samples should behave similarly with respect to every statistical test from C, with high probability over the samples. Specifically, I draw samples S1 and S2, of m elements each, from D, and I am interested in the probability that there exists a hypothesis that distinguishes them, in the following sense. Let me set up some terminology. In the background there is the target concept c, and the error of a hypothesis h over a sample S of m elements is the proportion of elements x_i of S for which h(x_i) disagrees with the true label. So err(h) is the true error, over the distribution D, and err_S(h) is the error on the sample: I look at the sample and ask what proportion of its labels h gets wrong. What I am interested in is: does there exist a hypothesis whose error on the first sample is zero, that is, which is consistent on S1, and yet which, on the second sample, is seen to have a non-trivial error, say at least epsilon over 2? And I claim that the probability, over the draw of the two samples, that there is a hypothesis with this discrepancy is less than delta over 2, the hidden assumption being that m satisfies the bound from my theorem.

Now here is the neat trick. Imagine the following experiment. I draw a sample of 2m elements x_1, ..., x_{2m} from the distribution D, independently, and then, for each i from 1 to m, I swap x_i with x_{m+i} with probability one half, independently for each i. So I draw x_1, ..., x_m, x_{m+1}, ..., x_{2m}, then I swap the first element with the (m+1)-st with probability a half, independently swap the second with the (m+2)-nd with probability a half, and so on. Then I set S1 to be the first m elements after the swapping and S2 to be the remaining m elements after the swapping. Here is a question: what is the distribution of the pair (S1, S2) obtained this way, drawing 2m points, swapping, and partitioning? Can you describe it more simply? Yes: it is exactly the same distribution as if I had just drawn the two samples separately and independently. What I was interested in was the probability that, when I independently draw two samples, some hypothesis distinguishes them; and now I am saying: this is the way I am going to draw the samples. How could this convoluted way of drawing them possibly help? It helps because it lets me bound the probability of the event by conditioning on the choice of the 2m-element sample.
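In symbols, with c the target concept and writing err_S(h) for the error of h on a sample S = (x_1, ..., x_m), the claim (for m as in the theorem) is:

```latex
\operatorname{err}_S(h)\;=\;\frac{1}{m}\,\bigl|\{\,1\le i\le m:\ h(x_i)\neq c(x_i)\,\}\bigr|,
\qquad
\Pr_{S_1,S_2\sim D^m}\Bigl[\exists h\in\mathcal C:\ \operatorname{err}_{S_1}(h)=0\ \wedge\ \operatorname{err}_{S_2}(h)\ge\tfrac{\varepsilon}{2}\Bigr]\ \le\ \tfrac{\delta}{2}.
```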
So fix a choice of the 2m-element sample s. There is still some randomness left, namely the swapping, but I claim that, conditioned on this fixed choice of s, and for any fixed choice, the probability of the event E, the one whose probability I want to bound, is at most delta over 2, uniformly over every fixed choice of s.

Why is that? Let us fix a hypothesis h as well and draw a little picture. Given the 2m elements, under each element I can write a tick or a cross: a tick if h agrees with the target on that element, a cross if it does not. I am worried about a hypothesis that has no errors on S1 and lots of errors on S2. For a fixed h, with s fixed, how could that bad situation occur? It occurs only if, when we do the random swapping, all the crosses end up in the second half and all the ticks in the first half; and if there are enough crosses, the probability of that is small, since each cross independently has a half chance of landing on the wrong side for the bad event.

But here is the problem: I can bound the probability of this event for any fixed h, whereas in the event E we quantify existentially over h, so I need to deal with every hypothesis. If h were fixed, I would be basically home free: the probability is small if m is large enough. How do I deal with the fact that h is not fixed? This is where the VC dimension, via the growth function, comes in. The number of hypotheses in the concept class may be infinite, but the number of distinct behaviours of hypotheses restricted to s, and remember we have fixed s, we are conditioning on it, is given by the growth function, which is polynomial in m when the VC dimension is finite. The growth function tells me exactly how many labellings of s are realisable by hypotheses in the class. So in fact I only need to consider polynomially many, in m, different hypotheses; for each of them, once fixed, the probability that the random swapping moves all its crosses to the other side is exponentially small in m, and there are only polynomially many behaviours on s, so a union bound gives the claim. What that shows is that, with high probability, there is no hypothesis that is consistent on S1 and yet revealed to have high error on S2.

Okay, I need to make progress, so let me jump ahead, which is very bad, and prove the theorem. How do I relate this double-sampling situation to what I am really interested in? What I am really interested in is the probability of the following bad event: I am learning, I output as learner a consistent hypothesis, and the bad event is that it is consistent with the sample but in fact has high error. Call this event A: there exists a hypothesis in C whose error is at least epsilon and which is consistent with the sample I have drawn. That is the event whose probability I want to bound. And the event whose probability I know how to bound, by the previous reasoning, is the following event B: there exists a hypothesis whose error is at least epsilon, which is consistent on S1, and which makes at least m epsilon over 2 errors on S2.
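The union bound just described, in symbols: E is the event above, s the fixed 2m-element sample, and Pi_C the growth function. Each bad behaviour on s has at least epsilon m / 2 crosses, each of which must independently land in the second half, so, roughly,

```latex
\Pr\bigl[E\mid s\bigr]\ \le\ \Pi_{\mathcal C}(2m)\cdot 2^{-\varepsilon m/2},
\qquad
\Pi_{\mathcal C}(2m)\ \le\ \Bigl(\frac{2em}{d}\Bigr)^{d}\ \ \text{(Sauer's lemma, for }2m\ge d\text{)},
```

and for m as in the theorem the right-hand side is at most delta over 2.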
Here S1 and S2 are the two samples, and B asks for a hypothesis that is consistent on one and has many errors on the other; I have thrown in the extra requirement that its true error be at least epsilon, but the bound from before holds all the more with that restriction. And here is the key observation, the reason the double-sampling trick works: B is a sub-event of A, and, using the Chernoff bound that I skipped over, the probability of B given A is at least one half. Let me argue that intuitively. Suppose A has happened: there is a hypothesis that is consistent on S1 but is actually not a very good hypothesis, its true error is at least epsilon, and it was consistent just by bad luck. There may be several such; fix one of them. We are interested in the event that on the second sample it is revealed to have lots of errors. But if its true error rate is at least epsilon, it is very likely, with probability at least a half if the sample is big enough, that on the second sample it makes at least m epsilon over 2 errors, in other words its error rate on the second sample is at least epsilon over 2. That just follows from a concentration bound: a sum of many independent random variables is concentrated around its mean, so for a single fixed hypothesis, the error rate on a large enough sample is close to the true error rate. That is what gives this line, and then we are basically home free: we want to bound the probability of A, we have a bound on the probability of B, and we know the probability of B given A is at least a half, so some simple arguing finishes the proof.

I went over that rather quickly, so there are two things I really want to emphasise. We did have to use a Chernoff bound, but the Chernoff bound alone is not enough: it says that for a single hypothesis, a sample estimates its error rate accurately; we need a bound over the whole class of hypotheses, and that is where the double-sampling trick and the growth-function bound come in.

So, basically, what this says is: if the VC dimension is finite, then there is a bound on the sample size that is enough to learn. There is a converse: if the VC dimension is infinite, then no finite number of samples is enough to learn, and the proof is a nice application of the probabilistic method. If a concept class has VC dimension at least d, then for any learning algorithm there exists a bad target concept in C and a bad distribution such that, if the learner is given only d/2 examples, its hypothesis has, with some non-trivial probability, a constant error. So in fact: finite VC dimension if and only if PAC learnable.

Now, I will say: remember this slide for later. I am going to make that extremely hard for you by skipping over it, but I will make it easy again by coming back to it. What I want to explain next, maybe at a slightly less technical level, is the notion of a sample compression scheme.
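To summarise the chain of inequalities in the proof just given, with A and B the events above and the bound on Pr[B] coming from the double-sampling argument:

```latex
B\subseteq A,
\qquad
\Pr[B\mid A]\ \ge\ \tfrac12\ \ \text{(Chernoff)},
\qquad\text{hence}\quad
\Pr[A]\ \le\ 2\Pr[B]\ \le\ \delta .
```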
This is a really nice connection between logic and learning, which I really want to get to in time, and it will give you another characterisation of PAC learnability. So I have a concept class C, and I define the notion of a sample compression scheme for C. It consists of a number k, called the kernel size, and a finite set I, which stands for an information set, together with the following data. First, a compression map. What does the compression map take? It takes a labelled sample: here S is a subset of the input space and f is a function from S to {0,1}, so it is a set of instances labelled positive and negative. What the compression map returns is a subset of the sample of size at most k, with its labels, plus an element of I, a finite bit of information on the side. I will give several examples in a second to drive this home. Second, a reconstruction map, whose domain is the codomain of the compression map: it takes such a labelled subsample of kernel size k together with the side information, and it reconstructs a hypothesis. The pair is a sample compression scheme if the following holds: when I take a labelled sample, compress it using the compression map kappa, and then reconstruct using the reconstruction map rho, I get a function which, restricted to S, gives back exactly the labels of my original sample. That is what a sample compression scheme is.

I think that to understand this we need to see an example, so let us go back to the rectangles-in-the-plane lecture. What is the domain here? Samples labelled according to concepts in the concept class. What does one of these look like for rectangles in the plane? Let me draw, invisibly, the target concept, a rectangle C, and here is a sample labelled according to it. The compression map should map this sample to a subset from which I can reconstruct the labels of the whole sample. Which subset should I select, and how many elements? Yes, the four corners: the at most four extreme positive points, the ones that determine the corners of the smallest enclosing rectangle. So the compression map takes the labelled sample and maps it to those four points, and the reconstruction map constructs the smallest rectangle containing them; that hypothesis correctly predicts the label of every point in my original sample. Are there any questions about that?
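Here is a minimal sketch of that scheme for rectangles (helper names are mine): the compression map keeps the at most four extremal positive points, which determine the smallest enclosing rectangle, and no side information is needed.

```python
def compress(labelled_sample):
    """Compression map for axis-aligned rectangles: keep the (at most four)
    extremal positive points (leftmost, rightmost, lowest, highest) with their labels."""
    positives = [p for p, label in labelled_sample if label == 1]
    if not positives:
        return []
    keep = {min(positives), max(positives),              # extreme x (ties broken by y)
            min(positives, key=lambda p: p[1]),          # lowest y
            max(positives, key=lambda p: p[1])}          # highest y
    return [(p, 1) for p in keep]

def reconstruct(kernel):
    """Reconstruction map: the smallest axis-aligned rectangle containing the kernel
    (an empty kernel reconstructs to the everywhere-negative hypothesis)."""
    if not kernel:
        return lambda p: 0
    xs = [p[0] for p, _ in kernel]
    ys = [p[1] for p, _ in kernel]
    return lambda p: int(min(xs) <= p[0] <= max(xs) and min(ys) <= p[1] <= max(ys))

# A sample labelled by some rectangle; reconstructing from the compressed set
# recovers every label of the original sample.
sample = [((1, 1), 1), ((2, 3), 1), ((4, 2), 1), ((0, 0), 0), ((5, 5), 0)]
h = reconstruct(compress(sample))
assert all(h(p) == label for p, label in sample)
```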
Okay, so what about intervals on the line? Say C is the collection of unions of three intervals in R. What does a labelled sample look like? It is a sample of points on the line labelled according to some concept in C, so maybe I see something like this; you agree that it looks like a labelled sample. Now, how should I compress it; which are the important sample points? Yes: the first positive point, maybe the first negative point beyond it, the next positive point after that, essentially the points where the sign changes. So the compression map sends the sample to the subset of points where the label changes sign, and the reconstruction map constructs a hypothesis from those. Just to recall the type of the reconstruction map: in these examples there is no side information, so it simply takes the subsample and gives you a function; the reconstruction map is what generates your hypothesis.

Half-spaces are maybe a slightly less trivial example. Now I have, again in the plane, since that is all I can draw, some points labelled according to some half-space. What should the compression map remember from seeing this? Exactly: the support vectors. If you know it, there is a classical machine learning algorithm, support vector machines. If you want to learn a linear classifier, a natural one to learn here is the one with maximum margin. What I mean is: here is a linear classifier consistent with the sample, and its margin is the distance to the closest sample point, which for the maximum-margin classifier is the same on both sides. The points closest to the classifier, at distance exactly the margin, are called the support vectors. Finding the maximum-margin classifier is a convex optimisation problem, and the classifier is determined completely by its support vectors. So the compression map picks out the support vectors, and the reconstruction map takes the support vectors and reconstructs the classifier. Let me emphasise again: the classifier is determined completely by the support vectors, which becomes very clear when you set out the optimisation problem.

Someone asked what exactly a support vector is. Let us say we understand what a linear classifier is; I define the margin of a classifier to be its distance to the closest sample point. The separating line does not pass through any of the points, but there is a distance to the closest one, and that is the margin; a point whose distance is exactly equal to the margin is a support vector. It so happens, for instance, that the normal of the classifier lies in the span of the support vectors, but for the definition all you need is that a support vector is a point closest to the classifier, and these points determine the classifier. So the compressed set is a subset of the sample: the support vectors of the maximum-margin classifier. And we know a maximum-margin classifier exists, because the sample is linearly separable, having been labelled by some linear classifier from the class; so there is a consistent linear classifier, in particular one of maximum margin, and the points lying exactly on the margin are the support vectors. Someone then asked whether, in the picture, two support vectors are enough to determine the classifier. Well, it may not be obvious in this case; I think here there should be two support vectors, but I need to get back to you on that one. Somehow, analytically, if you look at the dual optimisation problem it seems clear, but let me check.
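Going back to the intervals example for a moment, here is one way to make the sign-change compression concrete (a sketch with hypothetical helpers; I also keep the very first point, so every sample point has a kept point to its left, which for unions of at most three intervals gives a kernel of size at most seven):

```python
from bisect import bisect_right

def compress(labelled_sample):
    """Keep the first point and every point whose label differs from the
    previous point's label (the 'sign changes'), with their labels."""
    pts = sorted(labelled_sample)
    return [(x, label) for i, (x, label) in enumerate(pts)
            if i == 0 or label != pts[i - 1][1]]

def reconstruct(kernel):
    """Hypothesis: a point gets the label of the closest kept point to its left
    (0 if there is none)."""
    xs = [x for x, _ in kernel]
    labels = [label for _, label in kernel]
    def h(x):
        i = bisect_right(xs, x)
        return labels[i - 1] if i > 0 else 0
    return h

# Labels consistent with the union of the intervals [1,2] and [5,6].
sample = [(0.5, 0), (1.2, 1), (1.8, 1), (3.0, 0), (5.1, 1), (7.0, 0)]
h = reconstruct(compress(sample))
assert all(h(x) == label for x, label in sample)
```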
By the way, I should say: I am rushing and fumbling and bumbling, but I have notes, and they should be reasonably watertight; I will spread them around later in the afternoon or tomorrow.

Okay, so one last example of a compression scheme, and this one lets us step out of our boring PAC world and consider a different type of learning algorithm, which will give us a compression scheme. The algorithm is the so-called mistake-bounded, mistake-driven online learning algorithm, so let us have a quick excursion into online learning. Online learning works as follows. There is no probability distribution any more, as there was in PAC learning. You are given examples from the input space, one at a time, and for each one you are required to predict its label: is it a yes or a no instance? You predict yes, and some referee says: you are wrong, mistake, it was a no. The next one comes in, you predict its label, say yes, and the referee says: no, you are wrong, it was a no. Now, this seems like a biased game, but it is not quite as rigged as it sounds, because there is a concept class lurking, and to play fairly the referee's answers on these inputs must be consistent with some concept in the class; there must exist a concept such that the answers are the labels given by that concept, so the referee cannot simply contradict everything you say. Indeed, the referee does not have to commit to a concept in advance: he or she has, if you like, a little bag of concepts, and every time they give an answer they may have to eliminate some concepts from the bag, but they just need to stay consistent with something in it. And how does your learning algorithm work, how do you make these predictions? You keep a working hypothesis, maybe a concept, you predict according to the working hypothesis, and every time you make a mistake you try to update it.

Here is an example of such an algorithm. The concept class is the class of monotone disjunctions: the input space is {0,1}^n, and the concepts are propositional formulas that are monotone disjunctions, things of the form p_2 or p_4 or p_7. What is the desideratum for the learning algorithm? We want a mistake bound: a bound on the total number of mistakes we make before every prediction we make is matched by the referee. Let me explain the algorithm before the mistake bound. The algorithm uses as its working hypothesis a linear classifier over weights w_1, ..., w_n, one weight per attribute of the 0-1 input vectors.
Let me describe the algorithm before the mistake bound. Its working hypothesis is a linear classifier over the n coordinates: there is a weight w_i for each coordinate, all weights are initialized to 1, and given an example x, a 0-1 vector in n dimensions, we predict positive if the weighted sum w_1 x_1 + ... + w_n x_n is greater than or equal to n, and negative otherwise. The referee then tells us whether we made a mistake, and what we do depends on the kind of mistake. The first kind of mistake is that we predicted negative when we should have predicted positive. In that case some weights actually got it right: if x_i was equal to 1, then the weight w_i was pulling in the correct, positive direction, so as a reward we double it — we were wrong overall, but every w_i that contributed towards being right is doubled. If, on the other hand, we predict positive when we should have predicted negative, then who are the villains? The villains are the weights w_i for which the i-th entry x_i of that example was 1, and those weights are punished — in the Winnow algorithm this means halving them (or, in one variant, setting them to zero). We keep going like this, and the mistake bound we get is the following. Note that the learning algorithm is not itself using a hypothesis from our concept class; it is using a linear classifier. But if the labels are consistent with a classifier from the concept class — a monotone disjunction with r disjuncts — then the total number of mistakes is something like 3 r log n. This is very nice, because there is no linear dependence on the dimension n, which could be huge; there is only linear dependence on the number r of disjuncts in the target concept. The idea is that there are only a few relevant attributes, and the mistake bound is determined by these relevant attributes. So that was an excursion into the Winnow algorithm, online algorithms and mistake bounds. By the way, this kind of algorithm is called mistake-driven because you update the hypothesis only when you make a mistake; when you do not make a mistake, you keep the hypothesis the same.
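Here is that algorithm in code — my own rendering, not the lecturer's; in particular the punishment step on a false positive (halving the active weights) is filled in from the standard Winnow algorithm, since the spoken description leaves it implicit.

```python
# Sketch (my own, not from the lecture) of the Winnow-style algorithm described
# above, for learning monotone disjunctions over {0,1}^n in the mistake-bound model.
import random


class Winnow:
    def __init__(self, n):
        self.n = n
        self.w = [1.0] * n                   # all weights start at 1

    def predict(self, x):
        # Predict positive iff the weighted sum reaches the threshold n.
        return sum(wi * xi for wi, xi in zip(self.w, x)) >= self.n

    def update(self, x, y_true):
        """Mistake-driven step: change the weights only when the prediction is wrong."""
        if self.predict(x) == y_true:
            return False                     # no mistake, hypothesis unchanged
        for i, xi in enumerate(x):
            if xi == 1:
                # False negative: these weights argued (correctly) for positive -- double them.
                # False positive: these weights are the villains -- halve them.
                self.w[i] = self.w[i] * 2 if y_true else self.w[i] / 2
        return True                          # a mistake was made


# Toy run: target concept is the monotone disjunction x_2 OR x_5 over {0,1}^1000.
n, relevant = 1000, {2, 5}
target = lambda x: any(x[i] for i in relevant)

random.seed(0)
learner, mistakes = Winnow(n), 0
for _ in range(2000):
    x = [random.randint(0, 1) for _ in range(n)]
    mistakes += learner.update(x, target(x))
print("total mistakes:", mistakes)           # on the order of r*log(n), far below n
```

The mistake bound is a worst-case guarantee, so it holds for any order in which the examples arrive — which is exactly what the compression construction below exploits.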
So what does any of this have to do with compression schemes? How could we get a compression scheme from such an algorithm — and a compression scheme for what? Let the concept class be, again, monotone disjunctions, and suppose we have a sample that we want to compress. What does the sample look like? It consists of an n-bit vector together with a label, say positive, then another bit vector with another label, and so on: a labelled sample of n-bit vectors whose labels, we know, come from a monotone disjunction. We want to define a compression map that picks out a subset of the labelled examples such that the reconstruction map can reconstruct a function predicting the rest of the labels in the sample. It does have to be a subset in the formulation I have given — a subset, perhaps together with some side information — and we want to use the mistake-bounded learning algorithm to get it. Exactly; let me just repeat the answer from the audience for the camera. You fix some order on the bit vectors, and to compress, you run the mistake-driven, mistake-bounded learning algorithm through the sample in that order and remember just the examples on which it made a mistake: that is your compressed set. Then, when you want to recover the label of a given element of your sample, you rerun the learning algorithm on the remembered mistake examples that precede that element and predict with the resulting hypothesis — since you remembered exactly the places where a mistake was made, this replay reproduces the hypothesis the algorithm held at that point. That was a very eloquent description, which I managed to mangle, but it is the correct answer. So, to come back to the main thread: all of these concept classes have sample compression schemes.
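Here is a sketch of this construction — again my own code, not from the lecture. It assumes a learner object with the predict/update interface of the Winnow sketch above, takes the fixed order on bit vectors to be lexicographic, and, for points that are themselves in the compressed set, simply returns their stored labels (those are exactly the examples the replayed hypothesis would get wrong).

```python
# Sketch (my own) of the sample compression scheme induced by a mistake-driven
# online learner.  make_learner() must return a fresh learner with methods
# predict(x) and update(x, y), where update returns True iff a mistake was made
# (the Winnow sketch above has this interface).


def key(x):
    return tuple(x)                          # fixed global (lexicographic) order on bit vectors


def compress(sample, make_learner):
    """Compression map: process the sample in the fixed order and keep exactly
    the labelled examples on which the learner made a mistake."""
    learner = make_learner()
    kernel = []
    for x, y in sorted(sample, key=lambda xy: key(xy[0])):
        if learner.update(x, y):             # mistake => remember this example
            kernel.append((x, y))
    return kernel


def reconstruct(kernel, make_learner):
    """Reconstruction map: to label a point, replay the learner on the remembered
    mistakes that precede it in the fixed order; kernel points carry their own labels."""
    stored = {key(x): y for x, y in kernel}

    def h(x):
        if key(x) in stored:
            return stored[key(x)]
        learner = make_learner()
        for xk, yk in kernel:                # kernel is already in the fixed order
            if key(xk) < key(x):
                learner.update(xk, yk)
        return learner.predict(x)

    return h


# Usage, assuming the Winnow class and the target concept from the previous sketch:
#   sample = [(x, target(x)) for x in xs]            # xs: a list of 0-1 vectors
#   kernel = compress(sample, lambda: Winnow(n))     # size bounded by the mistake bound
#   h = reconstruct(kernel, lambda: Winnow(n))
#   all(h(x) == y for x, y in sample)                # True: h recovers every label
```

By the mistake bound, the compressed set has size at most the mistake bound — so roughly 3 r log n — no matter how large the sample is.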
Now, what is the connection with VC dimension? The connection is that a concept class has a sample compression scheme just in case it has finite VC dimension, just in case it is learnable. There are two implications to prove, one of which is a relatively recent result. The first goes back to the invention of the notion of a sample compression scheme by Littlestone and Warmuth. It says: let C be a concept class on an instance space that has a sample compression scheme of kernel size k, the kernel size being the size of the subsamples to which you compress. Then, given ε and δ, there is a sample-size bound of the kind we have seen a lot of and are beginning to recognize, depending on ε, δ and the kernel size, such that if our sample size m is at least this bound, we have a learning function: a map that takes samples of size m and returns hypotheses. And what is the learning function? It is obtained simply by composing the compression and reconstruction maps: you take a sufficiently large sample, you compress it, and you use the reconstruction map — which correctly predicts the labels of the sample you took from the compressed set — and the claim is that this hypothesis should generalize well on unseen data. I will not give the proof, but why, intuitively, should this work? The intuition is that the reconstruction map is the one that builds your hypothesis, and the hypothesis it builds depends only on a small fraction of the data; so it is not going to overfit, and being a function of a small fraction of the data, it generalizes well. The proof is not difficult.

The interesting direction is the other way. If I have a sample compression scheme I can build a learner; but if all I know is that the class has finite VC dimension, how am I going to get a sample compression scheme out of that? I gave you a number of examples, and each of them seemed to require some ingenuity to construct. So there is a conjecture — and I hope everything I say here is correct; by the way, yes, it is 1 − δ there, the probability is correct — namely: does every concept class of VC dimension d have a sample compression scheme of kernel size d? This question has been looked at by the model theorists; this is the open direction, and it is still open.

Here are some results, going back to the concept classes we introduced that are obtained from logic: fix a structure, fix a formula, and let the parameters vary to define a concept class. These are particular concept classes of finite VC dimension that do have sample compression schemes: there is a sample compression scheme if either the formula φ is stable or the structure A is a so-called NIP structure. Let me just say what the latter means: A is an NIP structure if for every formula the VC dimension of the associated concept class is finite. An example of an NIP structure is, as we have seen, the reals with addition and multiplication: there, the VC dimension of every definable concept class is finite. If the structure is like this, then there is a compression scheme. I have to hurry a little, so I will skip over the definition of stable. In particular there is a very nice result — I think this is the Johnson–Laskowski result — that if A is o-minimal, such as the reals with addition and multiplication, or the reals with addition, multiplication and exponentiation, then a concept class defined by a formula with parameters has a compression scheme of kernel size equal to the number of parameters of φ. So this is again a wide collection of concept classes of finite VC dimension that have compression schemes, and it is something we have been seeing again and again: the kernel size, which here corresponds to the VC dimension, is determined by the number of parameters.

Those were results in model theory. Recently there is a very nice result by learning theorists that partially answers the conjecture. It says the following: a concept class of VC dimension d and dual VC dimension d* admits a sample compression scheme of kernel size on the order of d · d*. We saw some time ago that if the VC dimension of a class is a finite d, then the dual class has VC dimension at most exponential in d, so this gives a sample compression scheme of kernel size exponential in d. It does not fully answer the conjecture, but at least it shows that finite VC dimension implies the existence of a compression scheme, so we have the equivalence: PAC learnable, finite VC dimension, and the existence of a sample compression scheme are all equivalent. The proof — this was a long-standing open question — is very neat; I reproduce it in the handout notes, though I do not have time to go through it now. It is an application of the minimax theorem for zero-sum games; if you have seen the idea that boosting can be used to produce consistent hypotheses, it is very much in that spirit. I have not looked in detail at the model-theoretic proofs, so I am curious how the rather different machinery used by the model theorists compares with the rather simple-minded machinery used here.
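Written out in symbols — my formalization of what was just said, with constants as I recall them and possibly slightly off:

$$
\kappa(\mathcal{C}) = O(d \cdot d^{*}), \qquad d^{*} < 2^{\,d+1}, \qquad \text{hence} \qquad \kappa(\mathcal{C}) = 2^{O(d)},
$$

where d is the VC dimension of the class C, d* is the VC dimension of the dual class, and κ(C) denotes the smallest kernel size of a sample compression scheme for C; the open conjecture asks whether κ(C) can always be brought down to d itself.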
So let me wrap up — and be slightly embarrassed, because I am going to skip over the material that, in my delusional mind, I thought I would have a chance to cover; in my defence, I only have about fifteen years of teaching experience to fall back on. So, in summary: we have introduced a basic theoretical framework for supervised learning, the PAC learning model, and we have emphasized connections with logic. I have not spoken about automata, because I simply ran out of time, but in any case Borja will be here and will be talking about weighted automata. As for further developments this week: a colleague will come and talk about statistical learning theory, which I hope will touch on the things I have talked about here, including notions of complexity for concept classes that go beyond the two-valued case, such as Rademacher complexity; Borja will talk about automata learning; there will be further connections with logic and verification — both verification of learning algorithms and the use of learning algorithms in verification; and, branching out, there will be talks on reinforcement learning and Bayesian learning. Note that there were no prior distributions on hypotheses anywhere in the setting here: we were just picking hypotheses that fit the sample. Okay, thank you for bearing with my presentation and the slightly technical details.
Info
Channel: Federated Logic Conference FLoC 2018
Views: 386
Rating: 4 out of 5
Id: Ms9KE-tbobM
Length: 86min 50sec (5210 seconds)
Published: Mon Jul 30 2018