Borja Balle: Automata Learning II

Let's do another brief recap of what we've been doing. We're talking about learning functions on sequential data, and we want to represent those as weighted automata, because otherwise we'd be trying to learn something infinite-dimensional. I've said this a couple of times, but it's good to remind ourselves that there is a very deep connection, through the Kronecker-Fliess theorem, between Hankel matrices and weighted automata, in particular minimal weighted automata. The proof is quite intuitive: you decompose the Hankel matrix using a rank factorization H = PS, which in a sense you can interpret as measuring the correlation between the past and the future in the Hankel matrix; that's an intuition people who use Hankel matrices in other contexts, like control theory, have had for years. Using this decomposition, or this correlation interpretation, we said we can derive learning algorithms by looking at finite blocks of these Hankel matrices, using tools like the SVD that give us decompositions that are robust to noise, so we can recover automata that are also somewhat robust to noise.

So now everything is set up except estimating the Hankel matrix. I ended the first part by showing how you estimate it when you have data sampled from some distribution over strings, or from a dynamical system, and the estimators I gave you typically just involve counting things over the data. This is one of the nice things about this approach: it is very efficient in terms of what you have to compute from the data. It's a single pass, not an iterative algorithm where you have to look at the data many times; you look at the data once, you get your Hankel matrix, and then you do something with it.

The missing ingredient is how we can justify that the Hankel matrices we are estimating are not only consistent in the limit of infinite data, but also that we can say something about the quality of the approximation with a finite amount of data. This comes down to the kind of material you've probably seen in the earlier lecture on concentration inequalities, sample complexity bounds and so on, and here we have the same type of results. The first observation is that we have a block of the Hankel matrix defined by a set of prefixes and a set of suffixes — the rows and the columns — which we keep fixed, and say we are in the first setting I described, where the entries are just empirical counts: concatenating a prefix p and a suffix s gives you a string, and in the corresponding entry of the estimated Hankel matrix we put the number of times we observed that particular string in the data. This is an empirical probability, and I showed you that entrywise this works: the expectation of each entry is the right number, and by the law of large numbers, as I get more and more data, it converges to that number. The question is what we can say about the matrix as a whole, because the approximation results we have — for example the one that tells you that small differences in the Hankel matrix give small differences in the automaton — require a bound on the whole matrix, not just entry by entry.
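As a concrete illustration of this counting estimator, here is a minimal sketch in Python (illustrative names; it assumes strings are represented as plain Python strings and that the block is indexed by explicit prefix and suffix lists). It makes a single pass over the sample to count string frequencies and then fills the block.

```python
from collections import Counter

import numpy as np


def empirical_hankel(sample, prefixes, suffixes):
    """Estimate the Hankel block H[p, s] = empirical probability of the string p + s.

    sample:   list of strings drawn i.i.d. from the target distribution
    prefixes: list of prefixes indexing the rows
    suffixes: list of suffixes indexing the columns
    """
    m = len(sample)
    counts = Counter(sample)                      # the single pass over the data
    H = np.zeros((len(prefixes), len(suffixes)))
    for i, p in enumerate(prefixes):
        for j, s in enumerate(suffixes):
            H[i, j] = counts[p + s] / m           # empirical frequency of the concatenation
    return H


# Toy usage with a binary alphabet; "" is the empty string.
sample = ["ab", "a", "ab", "b", "", "ab", "a", "b"]
prefixes = ["", "a", "b"]
suffixes = ["", "a", "b"]
print(empirical_hankel(sample, prefixes, suffixes))
```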
The first thing one can show — at an intuitive level so far — is that as the sample size grows to infinity, the difference between what you are trying to estimate, the ideal Hankel matrix of the distribution your data comes from, and the empirical Hankel matrix estimated from a sample S containing m i.i.d. strings, goes to zero, and the rate is about one over the square root of m.

A few remarks. First, there are hidden constants in this type of result that will depend on the dimensions of the matrices, and sometimes on combinatorial properties of the strings you can find in the block, because that controls how many repetitions you have. Remember that the Hankel matrix is redundant: a string can appear in it many times, so every time you see that string you take that information and put it into several entries of the matrix. That makes the estimation of these matrices a bit special, because the entries are correlated, and it is not always easy to exploit this correlation to get tight bounds. Second, the norm I put here can be one of several norms; usually the easy way to prove this sort of result is to use either an operator norm, which is an induced norm, or the Frobenius norm, which is the sum of the squares of all the entries, followed by a square root. Third, I am stating the version where you count whole strings, but if you do the same thing with prefixes, with substrings, or with a single trajectory, you can always prove bounds like this. The problem is that each of these settings requires a different proof, and concentration proofs for matrices are, so to speak, exponentially harder to get right than proofs about scalar quantities. So there is one result for each setting, and all the results have different proofs: you can find the ones for strings, prefixes, and substrings in a very nice paper by Denis and collaborators, and the single-trajectory case is something we did last year with a colleague in Lille.
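Summarizing the shape of these statements in one display (a schematic rendering, with all hidden constants left implicit):

```latex
\[
\bigl\| \widehat{H}_S - H_f \bigr\| \;\le\; O\!\left(\frac{1}{\sqrt{m}}\right)
\quad \text{with high probability over a sample } S \text{ of } m \text{ i.i.d.\ strings,}
\]
where $H_f$ is the block of the Hankel matrix of the target distribution indexed by the fixed
prefixes $\mathcal{P}$ and suffixes $\mathcal{S}$, $\widehat{H}_S$ is its empirical counterpart,
and $\|\cdot\|$ can be taken to be the operator norm or the Frobenius norm
$\|M\|_F = \bigl(\sum_{i,j} M_{i,j}^2\bigr)^{1/2}$.
```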
What I want to give you now is one of these bounds. It is not the tightest or the best one, but it gives you an idea of how you prove this kind of result, and again I'm aiming for something that fits on one slide, so I'll prove something relatively simple. Have you seen McDiarmid's inequality before? Okay, some people have and some haven't, so let's recall what it says.

It is a concentration inequality, like the Chernoff or Hoeffding bounds, but a bit more interesting. Typically with Chernoff or Hoeffding you have several i.i.d. random variables and the quantity you study is a sum over them; it's always about averaging. Sometimes you have something that takes m i.i.d. random variables and produces not a sum but something different, a more complicated function of those random variables, and this requires other types of inequalities beyond Chernoff and Hoeffding. McDiarmid's is probably the simplest one that is still very general. What it says is: suppose I have a function Phi that takes m elements from some domain Omega and spits out a real number, and this function is such that changing one coordinate cannot change the output too much, in the following sense. If I pick any coordinate i, look at any setting of the m coordinates x_1 up to x_m, and also pick some x_i' to replace x_i, then — for all i and all such settings — the difference between Phi on x_1, ..., x_m and Phi on x_1, ..., x_m with x_i replaced by x_i' is bounded by c. So if the function has bounded differences when you change one point of the input, then you can expect concentration in the following sense. Suppose I evaluate Phi on a random variable X = (X_1, ..., X_m) whose components are i.i.d. copies from any distribution on Omega — actually I think you can even give up on identically distributed, you just need independence. Then you have the following concentration bound, which is the type of inequality saying that the probability that this random object, the function Phi evaluated on the random variable X, goes far from its expectation is small. That is the concentration phenomenon: things concentrate around their expectations, and that is useful for us because we know that the expectation of the Hankel matrix we are estimating depends on the automaton, so if we can show that what we get from the data is close to its expectation, we will be in good shape and able to apply all those perturbation results.

One statement of this type: the probability that Phi(X) is bigger than its expectation plus t is bounded by something that decreases exponentially in t squared. There is an m in there, so it looks like the more samples you get the less concentration you get, which is counterintuitive at first; but the c squared compensates, because in our application c will be something like 1/m, so the m gets cancelled and things become more concentrated as you get more data. Usually, instead of writing this type of bound that way, you rewrite it as follows: make the probability on the right-hand side equal to some small delta, solve for t, and state the reverse — with probability at least 1 minus delta, the random quantity is at most its expectation plus something that depends on delta, on how much things can change when you change one point, and on the number of data points.
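Written out, the standard statement of McDiarmid's inequality (with a single bounded-differences constant c, which is all we need here):

```latex
Let $\Phi : \Omega^m \to \mathbb{R}$ satisfy the bounded-differences condition: for every
coordinate $i$ and all $x_1,\dots,x_m, x_i' \in \Omega$,
\[
\bigl| \Phi(x_1,\dots,x_i,\dots,x_m) - \Phi(x_1,\dots,x_i',\dots,x_m) \bigr| \le c .
\]
Then, for independent random variables $X = (X_1,\dots,X_m)$ on $\Omega$ and every $t > 0$,
\[
\Pr\bigl[ \Phi(X) \ge \mathbb{E}[\Phi(X)] + t \bigr] \le \exp\!\left( -\frac{2t^2}{m c^2} \right),
\]
or equivalently, for every $\delta \in (0,1)$, with probability at least $1-\delta$,
\[
\Phi(X) \le \mathbb{E}[\Phi(X)] + c \sqrt{\frac{m \log(1/\delta)}{2}} .
\]
```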
So this is, as I said, a very simple concentration inequality, and we can apply it to prove one of these bounds, telling us, given m strings from some distribution over strings, how close the Hankel matrix we compute is to its expectation. Here is how you do it, and this is where the difference from Chernoff or Hoeffding that I mentioned appears. We have this empirical Hankel matrix computed from a sample, H-hat_S, which I could write as an average; but instead of trying to show directly that it concentrates around its expectation — which is ideally what you want to do, but is very difficult in the matrix setting; the bounds I pointed you to before have entire papers devoted to exactly that, so it is highly non-trivial — what you can do instead is say that the function I am interested in is the norm of the difference, in the Frobenius norm. I have mentioned it a couple of times, but at this point I should probably write down what the Frobenius norm is: for a matrix, the Frobenius norm is the square root of the sum of its entries squared; it is an l2 norm if you think of the matrix as a vector. So we define this function Phi, and we are going to show that it concentrates around its expectation, and that the expectation does the right thing.

How do we do this? As I said, we can write H-hat_S as an average of Hankel matrices with entries in {0, 1}, one indexed by each string in my sample: for a string x, the matrix H-hat_x at entry (p, s) is just an indicator of whether the concatenation ps equals x or not. So I am writing the empirical Hankel matrix as an average of zero-one Hankel matrices that have ones at the entries corresponding to the strings I observed in my sample; it is just a convenient way to write it. Then I have to introduce a new quantity, which depends on exactly which rows and columns we take in this matrix — remember this is a finite block of a Hankel matrix, indexed by a set of prefixes and a set of suffixes. I define the constant c_{P,S} as the maximum, over all possible strings, of the number of times that string occurs in my block over P and S. Basically this tells me, looking at those zero-one matrices indexed by P and S, what is the maximum number of ones they can have; and you can rewrite it as the maximum of their squared Frobenius norms — that is easy to see.

Using that, I now want to apply McDiarmid, and McDiarmid tells me I need to show that my function has bounded differences when I change one sample point; in this case the sample points are the strings. I write S for the sample x_1, ..., x_m and S' for a sample where I change one of the strings, x_i. I need to compare Phi on S with Phi on S', and by the triangle inequality the difference is at most the Frobenius distance between the two empirical matrices; using the fact that they are averages over the strings in the sample, and that I only changed one string, this is just 1/m times the difference between the two zero-one Hankel matrices that changed. So now I only need to compare H-hat_{x_i} with H-hat_{x_i'}. These two strings could in principle be arbitrary, so I apply the triangle inequality again and bound this by the sum of the two Frobenius norms, or two times the maximum of the two; and remember that the maximum of these squared Frobenius norms over all possible strings is the coefficient c_{P,S}, so I get a square root of c_{P,S} there. The bound I get for my c — the c in McDiarmid, for the Phi I am using — is 2 times the square root of c_{P,S} over m, and I think you can even improve the 2 to a square root of 2, but it didn't fit on the slide; the argument, I mean, not the square root.
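In symbols, the bounded-differences computation just sketched, writing Φ for the Frobenius-norm error:

```latex
\[
\widehat{H}_S = \frac{1}{m} \sum_{i=1}^{m} \widehat{H}_{x_i},
\qquad
\widehat{H}_{x}[p,s] = \mathbb{1}\{\, ps = x \,\},
\qquad
c_{\mathcal{P},\mathcal{S}} = \max_{x} \bigl\| \widehat{H}_{x} \bigr\|_F^{2},
\]
and for $\Phi(S) = \bigl\| \widehat{H}_S - \mathbb{E}[\widehat{H}_S] \bigr\|_F$, changing one string
$x_i \mapsto x_i'$ gives, by the triangle inequality,
\[
\bigl| \Phi(S) - \Phi(S') \bigr|
\le \bigl\| \widehat{H}_S - \widehat{H}_{S'} \bigr\|_F
= \frac{1}{m} \bigl\| \widehat{H}_{x_i} - \widehat{H}_{x_i'} \bigr\|_F
\le \frac{2 \sqrt{c_{\mathcal{P},\mathcal{S}}}}{m}.
\]
```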
Okay, so that is the first ingredient, bounding the differences — all I used was the triangle inequality. The second ingredient: McDiarmid will tell me that this error Phi(S) concentrates around its expectation, so now I need to bound the expectation, because I would like an expectation that goes to zero. What you need to do is compute this quantity, and here again you apply something very simple, Jensen's inequality, which tells you that the expectation squared is at most the expectation of the square; and by the definition of the Frobenius norm, the square gets rid of the square root, so I just have a sum of expectations of squared errors over all the entries. That is nice, because each of those terms is just a variance — the variance of whether or not I observe that string in the sample, so a Bernoulli variance with parameter H[p,s], the probability of observing that string, divided by m because we are averaging m independent indicators. When you sum these over all p and s, you can show the sum is upper bounded by c_{P,S}, so the squared expectation is at most c_{P,S}/m, and the expectation itself is at most the square root of c_{P,S}/m.

You put all this together in McDiarmid: with high probability, my function — the error I am trying to control — is upper bounded by its expectation, which I have bounded by something that goes down with m, plus the term that depends on c and m. And here the thing I mentioned before happens: McDiarmid has c times the square root of m, and because my c has a 1/m in it, the square root of m partially cancels, so I get something that goes to zero with m, at the rate I told you at the beginning.

This is more or less the only proof of this type of concentration for Hankel matrices that fits on one slide, but it is already quite nice. Think, for example, about this c_{P,S}. One thing you could do instead of this argument is something much more naive: apply Chernoff to each coordinate, since the entries are just averages of Bernoulli random variables, and then do a union bound. What happens there is that you pay something that depends on the dimension, essentially, unless you can control the correlations between the different entries, which is tricky. Here you pay something much smaller than the dimension, in a sense, because you only pay for how many times a string can appear in the block. And you can pick very large sets P and S: for example, if I take P to be all strings of length up to T, the dimension of my matrix is exponential, on the order of the size of the alphabet to the power T, so with a union bound I would be paying something much worse than this — roughly speaking.
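Putting the two ingredients together — Jensen's inequality for the expectation, then McDiarmid with the bounded-differences constant from above:

```latex
\[
\mathbb{E}\bigl[\Phi(S)\bigr]^2
\le \mathbb{E}\bigl[\Phi(S)^2\bigr]
= \sum_{p,s} \operatorname{Var}\bigl( \widehat{H}_S[p,s] \bigr)
= \frac{1}{m} \sum_{p,s} H[p,s]\bigl(1 - H[p,s]\bigr)
\le \frac{c_{\mathcal{P},\mathcal{S}}}{m},
\]
so that, with probability at least $1-\delta$,
\[
\bigl\| \widehat{H}_S - \mathbb{E}[\widehat{H}_S] \bigr\|_F
\le \sqrt{\frac{c_{\mathcal{P},\mathcal{S}}}{m}}
  + \frac{2\sqrt{c_{\mathcal{P},\mathcal{S}}}}{m} \sqrt{\frac{m \log(1/\delta)}{2}}
= \sqrt{\frac{c_{\mathcal{P},\mathcal{S}}}{m}} \Bigl( 1 + \sqrt{2 \log(1/\delta)} \Bigr).
\]
```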
Now here is the complete result, and by the complete result I mean: you combine the algorithm I gave you for going from a Hankel matrix to an automaton in a way that is robust to noise; you apply the concentration bound for Hankel matrices; you plug the resulting error into the result saying that, when you recover your automaton with the SVD, small changes in your Hankel matrix translate into small changes in the weights of your automaton; and then you further apply the result telling you how much your language changes when you change the weights of the automaton. Put all this together and you get a PAC-learning result for stochastic weighted automata. I know that is a lot of things to chain together, but I am summarizing it in one slide.

What is the setup? We have a distribution f over strings that has rank n, meaning it can be computed by some stochastic weighted automaton; you give me m samples x_1, ..., x_m i.i.d. from this distribution; and — this is an assumption about what we give the algorithm — the algorithm knows n and knows a basis, a set of prefixes P and suffixes S, such that the condition I had before holds: the rank of the sub-block equals the rank of the infinite Hankel matrix. Theoretically this is slightly unsatisfying, but the good thing is that there are many, many bases satisfying this assumption, so in practice it is not really a burden. For theoretical purposes, if you wanted to get rid of it and actually infer the basis from data, I think you would end up with something very hard to solve — I do not want to claim a particular complexity, but I think it would be really hard.

Anyway, if you assume these are given to your algorithm, then the algorithm is pretty straightforward: you take the Hankel blocks H and H_sigma for all symbols sigma, you fill them with the empirical probabilities — and here you could substitute prefixes or substrings or any of the variants I told you about before — and you apply the spectral algorithm, the one that takes the rank-n approximation given by the SVD and then multiplies H-hat_sigma by the appropriate pseudo-inverses to get an automaton. Then you get the error bound by combining all the results I mentioned, plus something I hinted at but did not show you, a telescoping bound when you sum over many strings. Basically, what you can show is that, with high probability over the sample, the error between the probabilities the original distribution assigns to all strings of length up to L and what the automaton computes is a big-O of a ratio of parameters. The first thing to observe is that there is a square root of m in the denominator, so the more data you get, the faster the error goes down, and I think there is no way to escape this 1 over square root of m rate. Then there is other stuff: the further into the distribution you look, the more you pay in the error, which is not surprising, because what you are doing here — it is not a union bound, it is something more clever — is paying for requiring guarantees for longer strings. The more symbols you have, the harder it is to learn, because you have to learn more matrices, so more symbols means more error.
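Schematically, keeping track only of the dependencies just listed (the exact polynomial is not reproduced here, so treat this purely as the shape of the bound):

```latex
\[
\sum_{x \,:\, |x| \le L} \bigl| f(x) - \widehat{f}(x) \bigr|
\;=\; O\!\left( \frac{\mathrm{poly}\bigl(L,\, |\Sigma|,\, n\bigr)}{\sigma_n(H)\, \sqrt{m}} \right)
\quad \text{with high probability,}
\]
where $\sigma_n(H)$ is the $n$-th singular value of the Hankel block being estimated (its role is
discussed next), and the $O(\cdot)$ hides constants as well as the requirement that $m$ be large
enough for the top $n$ singular vectors to be recovered accurately.
```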
More states also means more error; it is harder to learn more states than fewer states, obviously. And then there is a trickier term, the smallest singular value of the Hankel block you are looking at. You have to think about it in the following way. I am trying to approximate this Hankel block from data, and then I am going to apply an SVD to it. The matrix I am trying to estimate has rank n, but when you add noise you lose this property: the estimated matrix can have arbitrary rank. As you get more and more data, the matrix approaches what you are supposed to be getting, so it approaches a rank-n matrix. For this algorithm to succeed, what you need is to recover, up to good accuracy, the top n singular vectors, because these are the ones that give you the right linear subspace where your initial weights and transition weights live; if you do not recover that subspace accurately enough, your learning will fail. The n-th singular value of the Hankel matrix you are estimating controls exactly that. So this big-O is a big-O not only in the sense that there is a hidden constant, but also in the sense that it does not kick in until m is big enough, and the point where it kicks in is when m is big enough to make sure you estimate the top n singular vectors correctly. That is, more or less, the intuition behind the result I told you about — that a small change in the Hankel matrix gives a small change in the automaton: you have to account for this quantity, and there I hid the constant, whereas here I am giving it to you.

The other important remark about this algorithm is that it is quite nice in terms of running time: it is polynomial. Polynomial in what? The first stage is linear in the size of the sample times the number of strings you put in the block; this term changes if you move from strings to substrings or prefixes and so on, but it tends to stay linear in the sample. That is very nice because, as I said before, you do one pass over your data, you estimate the matrix, and you can forget about the data; and this is very easy to parallelize, so if you have access to a cluster and lots of data, you can parallelize it. The second part of the algorithm does not look at the data at all: it does an SVD, some pseudo-inverses, and some matrix or matrix-vector multiplications. Again this is polynomial, and its complexity depends essentially on the size of the alphabet times the number of prefixes times the number of suffixes times n; the dominating term, which is roughly quadratic in the dimensions times the rank, comes from the SVD. If you want to speed this up, you can use scalable algorithms for singular value decomposition, of which there are many, some of which even parallelize. So that is the nice thing here: polynomial running time, and sample bounds on the accuracy.
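To make the two stages concrete — one pass over the data to fill the empirical Hankel blocks, then an SVD and a few matrix products — here is a minimal sketch of the spectral method in Python (illustrative names; it assumes the empirical blocks were already estimated, for example with the counting function sketched earlier, and that the empty string is included among the prefixes and suffixes):

```python
import numpy as np


def spectral_wfa(H, H_sigma, h_prefix, h_suffix, n):
    """Recover a rank-n weighted automaton from empirical Hankel blocks.

    H        : |P| x |S| empirical Hankel block (rows = prefixes, cols = suffixes)
    H_sigma  : dict mapping each symbol a to the |P| x |S| block for strings p + a + s
    h_prefix : length-|P| vector, the column of H indexed by the empty suffix
    h_suffix : length-|S| vector, the row of H indexed by the empty prefix
    n        : target number of states
    Returns (alpha, A, beta) with f(x) = alpha @ A[x_1] @ ... @ A[x_T] @ beta.
    """
    # Rank-n truncated SVD of the empirical Hankel block.
    _, _, Vt = np.linalg.svd(H, full_matrices=False)
    V = Vt[:n].T                                   # |S| x n
    Pinv = np.linalg.pinv(H @ V)                   # n x |P| pseudo-inverse

    alpha = h_suffix @ V                           # initial weights, shape (n,)
    beta = Pinv @ h_prefix                         # final weights,   shape (n,)
    A = {a: Pinv @ H_a @ V for a, H_a in H_sigma.items()}   # one n x n matrix per symbol
    return alpha, A, beta


def evaluate(alpha, A, beta, x):
    """Compute f(x) = alpha @ A[x_1] ... A[x_T] @ beta for a string x (sequence of symbols)."""
    v = alpha
    for a in x:
        v = v @ A[a]
    return float(v @ beta)
```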
So, for example, you can compute what m you need to get a certain error epsilon, if you want. In practice it works very well — I will show you some plots at the very end if we have time — but if you care about squeezing the last drop of accuracy out of your sample, sometimes what you do is take the output of this algorithm and refine it with other methods that are better at fine-grained tuning, like gradient-descent methods; still, this is quite good, and theoretically it is very satisfying. [Question from the audience.] Yes: "with high probability" means there is a log(1/delta) hidden here, and the bound holds with probability at least 1 minus delta; you can control that, so tuning the probability also changes the error.

Okay, moving on. So far I have been assuming all along that my data was generated by some automaton. Now you might say: well, in some applications — network protocols and the like, data that comes from computers — this is pretty reasonable, but when you move to other types of data, like natural language, bioinformatics, or robotics data, this is convenient for doing theory but obviously not satisfied in lots of applications. So the question is whether we can extend these ideas to the more general setup of statistical learning, where we just assume we have samples from nature, if you want, modelled by a distribution that gives us i.i.d. samples, and whether we can do something similar — learning weighted automata from this data using the same type of techniques, estimating a Hankel matrix and then applying the reconstruction based on a factorization of the Hankel matrix — while keeping some theoretical guarantees. We can do that to some extent. What happens is that the guarantees we get are weaker, because if you insist on very strong guarantees you again run into problems that are NP-hard or worse, in the same way I pointed out at the beginning: one way to escape the negative results of Kearns and Valiant is to assume that the object generating the data is of the same type as the one you are trying to learn, as opposed to the general PAC setting where you have an arbitrary distribution and something that labels the data.

So this is the motivation for looking at learning automata in the statistical learning framework: instead of focusing on the realizable case, we no longer assume the data is generated by a nice model. This is what statistical learning theory sometimes does, and it is also the framework of agnostic PAC learning — but that is the realm where almost everything you want to do is NP-hard, so let's not go there.

What is the setup now? It has changed slightly, and it is also slightly more general than what we have been doing so far. I am going to assume again that I have a sample (x_i, y_i) containing m i.i.d. examples, and note that now I am not only getting strings: the x_i's are strings, but each of them comes with a real-valued label y_i. So I have a joint distribution over Sigma-star times R, and I am assuming I get m i.i.d. examples from this distribution.
What I am going to try to learn from this data is the function that maps a string to the corresponding real value. Note that I am not even assuming there is an underlying function from strings to real numbers: conditioned on a string, you could take the expectation over the label and think of that as what you are trying to learn — the Bayes-optimal predictor — but in general the sample could contain the same string many times with different labels. This is getting close to real data, messy stuff: say you are doing sentiment analysis and you give tweets to people to label, and they go through them saying this one is positive, this one is negative, and you get the same tweet labelled positive by one person and negative by another. That is real data.

So how are we going to learn? I guess you have seen this in the earlier lecture, so I will just recap briefly. You assume a hypothesis class, a class of functions from strings to real numbers, and you assume a loss function, which we will use to compare a true label — or rather a label coming from the sample — with the label predicted by a hypothesis, and which gives us a loss: typically, if the prediction is good the loss is zero, and if the prediction is bad the loss is a positive number. For most of the statistical results you do not need the loss to be convex, but for the algorithmic part you typically do, so I may be assuming convex losses throughout, just for simplicity.

Now we have the ingredients: data coming from the distribution, hypotheses, and a loss function. What is our problem? To find an f-hat in the hypothesis class H — a function assigning real numbers to strings — that ideally approximates the following minimum: over all possible hypotheses, I minimize the expected risk, also called expected loss or expected error. What is it? I draw a fresh example, a string and a label, from the distribution D, I compute the loss of predicting the label y coming from the distribution using the hypothesis f in my class, and I take the expectation over this new example. I want an f that minimizes this, and I call it f-star. This might be complicated for several reasons; two of them are, first, that I do not know how to represent this quantity, so there is no way I can hand it to an algorithm to minimize, and second, that depending on what H is, even if I had it, this optimization might not be easy to solve — it could be NP-hard or worse, because it is non-convex, for example.

The way you get around this — and again, I think this was covered in the earlier lecture — is that usually you do empirical risk minimization, which is motivated as follows: if I have a large sample S and a fixed hypothesis, then the thing I am interested in minimizing is an expectation, so I can approximate it by an average over my sample. Instead of taking a new expectation over x's and y's, I have a bunch of i.i.d. x's and y's, so why not compute an empirical expectation over them instead?
Usually we call the first quantity the expected loss or expected risk, L, which depends on the distribution D, the hypothesis we picked, and the loss function; and the approximation is the empirical loss or empirical risk, which depends on the sample — and since the sample is random, it is itself a random variable — and, again, on the hypothesis and the loss. The classical approach in statistical learning is empirical risk minimization, which basically says: if you want to minimize the expected risk but do not have access to it, you might as well minimize the empirical risk, provided you have a big sample. So you take the empirical loss, stick it into the minimization, and try to solve that instead.

Here is where you run into the problems I was hinting at before. For string-to-real problems, where the hypothesis class contains functions from strings to reals, I want to choose a hypothesis class for which this minimization can be solved efficiently, and that is quite non-trivial. In addition, what we want to guarantee — and that is what statistical learning is all about — is that we will not be overfitting the data: we want generalization. The risk of overfitting arises because I am minimizing over the data: for a fixed f the empirical risk approaches the expected risk, but if I pick an f that depends on the data, this may no longer be true when I plug in f-hat, because f-hat depends on the data and the approximation claim is for a fixed f. That is slightly subtle, but it is exactly what happens when you analyze empirical risk minimization. So we have to make sure we pick an H that prevents overfitting, or that gives us generalization.

How do we do this? Did you see Rademacher complexity in the earlier lecture? Okay, cool, that is very nice. Preventing overfitting means we want a generalization bound, and you have seen this before: what I want to show is that for all f simultaneously — uniformly, which is what gets around the problem that choosing f based on the data might break the approximation between the true loss and the empirical loss — with high probability over the sample, the expectation of the error over the distribution is close to the empirical expectation plus some complexity term. The complexity term depends on the sample, on the hypothesis class, and on the loss; ideally it should get smaller as I get more data and as my class H gets simpler, so that by minimizing the empirical risk I also make progress on the true risk, because I am minimizing an upper bound up to a complexity term that is independent of f. One way to control this complexity term is the Rademacher complexity of your class.
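In symbols, the quantities just introduced (schematic notation):

```latex
\[
L_D(f) = \mathbb{E}_{(x,y) \sim D}\bigl[ \ell(f(x), y) \bigr],
\qquad
\widehat{L}_S(f) = \frac{1}{m} \sum_{i=1}^{m} \ell\bigl(f(x_i), y_i\bigr),
\qquad
\widehat{f} \in \arg\min_{f \in \mathcal{H}} \widehat{L}_S(f),
\]
and the generalization bound we are after has the form: with probability at least $1-\delta$
over the sample, simultaneously for all $f \in \mathcal{H}$,
\[
L_D(f) \le \widehat{L}_S(f) + \mathrm{complexity}(\mathcal{H}, \ell, m, \delta).
\]
```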
The Rademacher complexity here is just the specialization of the concept you have already seen to hypothesis classes of functions from strings to real numbers; it is the same thing. The Rademacher complexity at sample size m of the hypothesis class is the expectation of a supremum. What is inside the expectation? Somebody gives me an f and a sample; I look at the labels that f predicts on the sample and compute the inner product with the sigma's, which are plus or minus one — so I am looking at how correlated the predictions of f are with the sigma's — and I try to find the f in the class that correlates the most with the sigma's; then I average over the sigma's, which are independent and uniform, that is, random noise. So this quantity asks: how much can your class correlate with random noise, in expectation? If your f's can correlate a lot with random noise, it means they have a lot of power to overfit, so you will have bad generalization power, which translates into a large Rademacher complexity. Sometimes the expectation over the sample is not taken and the Rademacher complexity depends on the sample itself; sometimes you take the expectation over the sample and it depends only on the sample size. I do not know which convention was used in the earlier lecture — maybe both; for my purposes I will stick to the expectation over the sample.

So if you know how much your sample of strings, with the worst f in your class, can correlate with random noise, you can control the complexity term. I am not going to prove this — there are many proofs, and different ways to show this type of statement — but let me recap the result: if your loss is bounded and Lipschitz — a pretty strong assumption, which you can sometimes relax, but it makes for the cleanest version of the bounds — then with high probability over the sample you get the type of generalization bound I was hinting at, with a complexity term that is the Rademacher complexity plus something of order 1 over square root of m. So basically, all you have to do now, to find classes of functions from strings to real numbers that prevent overfitting, is to find classes with small Rademacher complexity, meaning Rademacher complexity that goes to zero as m goes to infinity, typically at a rate of 1/m or 1 over square root of m. Not surprisingly, we can give several examples of such classes in terms of weighted automata and related objects. Here is the first of these results — and by the way, any questions so far? I think I have just been recapping material you saw in the previous lecture.
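For reference, the definition and the bound just recalled, in the same notation (standard statements, with the Lipschitz and boundedness constants of the loss hidden in the O):

```latex
\[
\mathfrak{R}_m(\mathcal{H})
= \mathbb{E}_{S,\sigma}\!\left[ \sup_{f \in \mathcal{H}} \frac{1}{m} \sum_{i=1}^{m} \sigma_i\, f(x_i) \right],
\qquad \sigma_1,\dots,\sigma_m \text{ i.i.d.\ uniform on } \{\pm 1\},
\]
and if the loss $\ell$ is bounded and Lipschitz in its first argument, then with probability at
least $1-\delta$ over the sample, for all $f \in \mathcal{H}$,
\[
L_D(f) \le \widehat{L}_S(f) + O\!\left( \mathfrak{R}_m(\mathcal{H}) + \sqrt{\frac{\log(1/\delta)}{m}} \right).
\]
```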
Okay. So how do we define classes of weighted automata — or automata, or languages — that will give us good generalization? The first and most obvious thing we can do is look at the weights: if the weights are not very big, then when I multiply these matrices the values of my f cannot grow very big, so in a sense my power to overfit is limited. Overfitting typically means: I predict exactly this number here, which was probably an outlier, and exactly that number there, which was maybe also an outlier, and then I have to come back to normal for predicting the rest of the data. You can try to prevent this by requiring the weights of your weighted automaton to be relatively small, and here the p,q-norm that I defined at the beginning reappears, which nicely closes the circle. So you look at weighted automata with n states — or at most n states — and take the subset A_{n,R} where this norm is bounded by R; you just put a bound R on the weights, as measured by this norm. Using this as your class H gives you a class of functions from strings to real numbers. One problem with this class is that it depends on the representation: because a function can have different representations, it is not obvious which functions are in the class and which are not. Anyway, if you take this as your class of functions, you can prove the following. We have a general theorem for any R, but it is easier to state when R is at most 1, because some powers of R disappear. Basically, the Rademacher complexity, for a sample of size m, of weighted automata with n states and weights bounded by R with respect to this norm, has the following form. There is a term L_m, the expectation of the length of the longest string in a sample of size m: if you are trying to learn from distributions that generate very long strings, your Rademacher complexity will be worse than for distributions with short strings, which is intuitive, because for longer strings there are more strings, hence more risk of overfitting. So there is a first term depending on the length of the strings and on 1/m, and then something that depends essentially on n squared times the size of the alphabet — the number of coefficients needed to express the transitions of your automaton — times log m over m; and you would get a rather ugly dependence on R if R were not bounded by 1. This holds for any p and q, and for probabilistic automata, with p = 1 and q = infinity, the weights are bounded by 1, so it already applies to probabilistic automata and the like.

But this has the problem I mentioned: there is some ambiguity, because a given function can have different representations, some of which fall in the class and some of which do not. You might want a more canonical way to control the complexity of your class of functions. Another thing you can do is look at the function as the infinite-dimensional vector I had before and put a norm directly on the language: the p-norm of f is the sum over all strings of the absolute value of f(x) to the power p, and then you take the p-th root. That defines a norm on the language, abstracted away from the automaton — it no longer depends on the representation, which is nice. Then I can again define a class of all the functions whose p-norm is bounded by R; in a sense this says I am not allowing functions that are too complex, or that take huge values, because the p-norm has to be bounded. Again, you can use this class to control the Rademacher complexity.
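Writing down the two hypothesis classes just described (schematic notation; the p,q-norm on the weights is the one defined at the beginning of the lecture):

```latex
\[
\mathcal{A}_{n,R} = \bigl\{\, f_A : A \text{ is a WFA with at most } n \text{ states and weights of } p,q\text{-norm at most } R \,\bigr\},
\]
\[
\|f\|_p = \Bigl( \sum_{x \in \Sigma^*} |f(x)|^p \Bigr)^{1/p},
\qquad
\mathcal{F}_{p,R} = \bigl\{\, f : \Sigma^* \to \mathbb{R} \;\text{with}\; \|f\|_p \le R \,\bigr\}.
\]
```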
What do you get here? I will give the result for general R, because it is simple. If you look at p = 2 — a kind of Euclidean norm for these infinite-dimensional vectors — then the Rademacher complexity is Theta of R over square root of m; here we have matching upper and lower bounds. This is very intuitive: allowing larger languages hurts your generalization, because you have more power to overfit, and getting more data improves your generalization, bringing the complexity down. We have a general result, but I am giving you the examples p = 1 and p = 2 because they are the intuitive ones. If you take p = 1 you get something similar, but instead of square root of m you get m, so this can potentially go to zero much faster with the sample size; in exchange you get another quantity, C_m, which again depends on the distribution, or on the sample. This time C_m is the expectation of the square root of the maximum number of times you expect to see any single string in your sample. So, yes, this is nice, because the bounds contain quantities that actually depend on the distribution, and therefore tell you which distributions are going to be easier to learn from and which are going to be harder.
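The two examples just quoted, written out as described (constants omitted):

```latex
\[
\mathfrak{R}_m(\mathcal{F}_{2,R}) = \Theta\!\left( \frac{R}{\sqrt{m}} \right),
\qquad
\mathfrak{R}_m(\mathcal{F}_{1,R}) \lesssim \frac{R \, C_m}{m},
\qquad
C_m = \mathbb{E}\!\left[ \sqrt{ \max_{x \in \Sigma^*} \bigl| \{\, i \le m : x_i = x \,\} \bigr| } \right],
\]
where the expectation defining $C_m$ is over a sample of $m$ strings from the target distribution.
```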
So I have given you two ways to control the Rademacher complexity of a hypothesis class built from weighted automata: look at the weights of the automaton, or look at the norm of the language. Both have advantages and disadvantages, but they share a common problem: you cannot get algorithms with provable guarantees that work directly on those representations. The reason is that if you try to optimize directly over the language, you have to deal with an infinite-dimensional object, or you need a way to compute these norms every time you produce an automaton — and while the l2 norm of the language can be computed from the automaton, the l1 norm cannot — so it is not easy to get good learning algorithms that way in general. On the other hand, if you control the complexity by the weights of the automaton, the natural thing is to optimize over the weights directly, and you get a non-convex optimization problem; I will come back to that later. So what is the solution? Hankel matrices, of course: we want a way to control the complexity of the class in terms of properties of the Hankel matrix. That is not so straightforward, but the right way to do it is in terms of Schatten norms.

Here is an aside for anyone who has not seen Schatten norms — I am not going to run the poll again, but I assume anyone who has seen SVDs may have seen Schatten norms, and the other way around. The Schatten norm is a norm of a matrix that depends on its singular values. If I have a matrix M of rank k, with its k singular values s_1, ..., s_k arranged in a vector, and you pick a p between 1 and infinity, then the Schatten p-norm of M, written with an S_p subscript, is exactly the p-norm of this vector of singular values. It is quite an intuitive way of measuring how big a matrix is: larger singular values give you a larger norm. It turns out that most of these norms, for particular values of p, are known elsewhere under different names. If you take p = infinity, you take the largest of these numbers — the l-infinity norm of the vector — which is s_1 by construction; so the S-infinity norm is just the largest singular value, which happens to be the spectral norm, also called the operator norm or the induced l2 norm; this norm appears everywhere under different names. If you take p = 2, you get the l2 norm of the vector, the square root of the sum of the squared singular values, and you can show that this equals the Frobenius norm from before: the sum of the squared entries equals the sum of the squared singular values. And p = 1, which is a very interesting norm, gives what is called, depending on the literature, the nuclear norm or the trace norm: it is just the sum of the singular values, since it is an l1 norm and all these values are non-negative. This one is interesting because, in a precise sense, the nuclear norm — the Schatten norm with p = 1 — is the best convex approximation of the rank function: the rank of a matrix is a non-convex function of its entries, the Schatten norms are all convex in the entries, and this one happens to be the convex envelope of the rank. That matters to us because we know that the complexity of a Hankel matrix, in terms of the size of the automaton we can recover from it, is its rank: higher rank gives more states, and intuitively more states give a more complex hypothesis, hence more potential for overfitting. So we could try to find a Hankel matrix that agrees with the data and minimizes the rank, but because the rank is non-convex this would be a hard problem; instead, we can use a convex approximation of the rank, the nuclear norm. That is a heuristic — one that can be justified theoretically in some cases — used throughout machine learning and signal processing whenever you need to minimize a rank. In general, we can use Schatten norms to control the complexity of Hankel matrices, both infinite and finite, and we can use them to define hypothesis classes for learning weighted automata.
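The aside in symbols: for a matrix M of rank k with singular values s_1 ≥ ... ≥ s_k,

```latex
\[
\|M\|_{S_p} = \Bigl( \sum_{i=1}^{k} s_i^{\,p} \Bigr)^{1/p},
\qquad
\|M\|_{S_\infty} = s_1 \;\;(\text{spectral / operator norm}),
\qquad
\|M\|_{S_2} = \|M\|_F,
\qquad
\|M\|_{S_1} = \sum_{i=1}^{k} s_i \;\;(\text{nuclear / trace norm}),
\]
and $\|\cdot\|_{S_1}$ is the convex envelope of the rank function on the unit spectral-norm ball.
```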
Concretely, given a p between 1 and infinity and an R greater than 0, we define a class of Hankel matrices — which corresponds to a class of functions from strings to real numbers, because the correspondence is one-to-one — by taking all the Hankel matrices whose Schatten p-norm is bounded by R. Using this hypothesis class for learning is, of course, a theoretical construction: we will never actually learn with infinite Hankel matrices. But if we could, we would get Rademacher complexity bounds, and hence generalization bounds, which in the p = 2 case look very similar to the one we had for the language norm, R over square root of m — although this R is a different R. For the nuclear norm we get an extra log term, an m instead of a square root of m, and a new quantity that again depends on the distribution, or on the sample, depending on which Rademacher complexity you use. If you have been keeping track of these quantities: the first one was the expectation of the longest string, which is pretty intuitive; the second was about the expected number of repetitions in your data set; and this one is a bit weird — it is just what comes out of the analysis — but it basically says that you take each string in your sample and split it into a prefix and a suffix, and you want to do this in a way that minimizes the maximum between the number of times the prefix appears and the number of times the suffix appears. That is related to the fact that when we factorize the Hankel matrix we treat prefixes and suffixes separately; there is some intuition there, but mostly it is something that comes out of the analysis. The good news is that we can control the Rademacher complexity of learning with Hankel matrices using only the Schatten norm, and by using the nuclear norm we get the best convex approximation to minimizing the rank, and therefore to minimizing the number of states of the automaton.

So far these are all statistical learning results; we need to turn them into algorithms somehow, and that is not so straightforward. The first thing you can do: I had a Rademacher complexity bound when I controlled the p,q-norm of the weights, so you could try to minimize over the weights directly. As I said, this is a non-convex optimization problem: find an automaton with n states, or at most n states, that minimizes the empirical risk over the data and has p,q-norm bounded by R. If your loss is differentiable, you can try to solve this with projected gradient descent or stochastic projected gradient descent, and people do this all the time these days with recurrent neural networks. It is not a bad algorithm; it is just that we do not know how to prove much about it. To get an idea of what you need to do here: if this were five years ago, you would probably have to compute the gradients by hand, which means that for the loss on an example with string x and label y, you would want, say, the gradient with respect to the weights of the symbol a, and if your string is abca you end up with fairly horrendous expressions, all written out manually. The good thing is that these days we have automatic differentiation — the people working on deep learning have worked very hard to get it going — software tools that compute these gradients automatically. So right now I think this is feasible, and some people have actually used it in practice; and you can justify this choice of regularizer using the Rademacher complexity bound. But that is as far as it goes.
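As an illustration of the gradients involved, here is a small sketch in Python that computes f(x) = alpha · A[x_1] ··· A[x_T] · beta for a WFA stored as numpy arrays, together with the analytic gradient of a squared loss with respect to the transition matrices — exactly the kind of expression an automatic-differentiation tool would produce for you (illustrative code, not a full training loop):

```python
import numpy as np


def wfa_value_and_grads(alpha, A, beta, x, y):
    """Forward value f(x) = alpha @ A[x_1] ... A[x_T] @ beta (alpha, beta are 1-D arrays)
    and the gradient of the squared loss (f(x) - y)^2 w.r.t. each transition matrix A[a]."""
    T = len(x)
    # Prefix vectors: p[t] = alpha @ A[x_1] ... A[x_t]  (p[0] = alpha).
    p = [alpha]
    for t in range(T):
        p.append(p[-1] @ A[x[t]])
    # Suffix vectors: s[t] = A[x_{t+1}] ... A[x_T] @ beta  (s[T] = beta).
    s = [None] * (T + 1)
    s[T] = beta
    for t in range(T - 1, -1, -1):
        s[t] = A[x[t]] @ s[t + 1]

    f = float(p[0] @ s[0])
    resid = 2.0 * (f - y)

    grads = {a: np.zeros_like(M) for a, M in A.items()}
    for t in range(T):
        # d f / d A[x_t] is the outer product of the prefix and suffix vectors around position t.
        grads[x[t]] += resid * np.outer(p[t], s[t + 1])
    return f, grads
```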
It might work very well, but potentially you get underfitting, because you might get stuck in a local minimum, at least if you use projected gradient descent; if you use stochastic gradient descent — meaning at each iteration, instead of looking at the whole data set, you look at a subsample — you can sometimes get around the local minima, but it may or may not work, and it might require a lot of tuning of the parameters of your gradient descent. So it is usable, but it does not come with theoretical guarantees.

If you want theoretical guarantees for this type of learning problem — you get a sample of strings with real labels and you want a weighted automaton that roughly agrees with them — you have to use the Hankel matrix, because then you get a completely convex optimization problem. It says: find me a Hankel matrix, over some set of prefixes and suffixes that I have fixed, that agrees with the data with respect to a loss function I specify because it makes sense for my problem, and such that its Schatten p-norm is bounded by R; then I get good generalization when I take this H-hat and convert it into an automaton using the algorithm I showed you before. This is interesting because, in a sense, what you are doing is a matrix completion problem: I pick a set of prefixes and suffixes, I am trying to learn a Hankel matrix over them, and the data tells me, for example, that on the string "a" it wants me to predict one, so I go to the entries corresponding to "a" and put a one there, and so on; I end up with a Hankel matrix with missing entries. So this is essentially a Hankel matrix completion problem, and we know how to analyze statistically the generalization of what you get out of solving it — a convex optimization problem with this regularizer — using both Rademacher complexity and some earlier analyses based on algorithmic stability.

The biggest missing ingredient here, in my opinion — and this is one of the open problems I want to highlight in this lecture — is that the way you would feed this optimization problem to standard convex optimization toolkits is pretty naive, pretty redundant, and very far from efficient. The missing bit needed to make this really practical, and in particular to scale to large alphabets — alphabet size is the biggest constraint here, because if I include strings up to length T I get on the order of the alphabet size to the power T rows, which grows really fast — is algorithms specialized to this type of optimization problem. That is the biggest open issue: we know how to solve these problems in polynomial time, but not efficiently in practice, because the solvers are not specialized to the structure of the problem.
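The convex program just described, written schematically (the constrained form; a regularized variant that adds a multiple of the nuclear norm to the objective is the usual alternative):

```latex
\[
\min_{H}\;\; \frac{1}{m} \sum_{i=1}^{m} \ell\bigl( H[p_i, s_i],\, y_i \bigr)
\qquad \text{subject to} \qquad \|H\|_{S_p} \le R,
\]
where $H$ ranges over matrices with the Hankel structure over the chosen prefixes $\mathcal{P}$
and suffixes $\mathcal{S}$ (entries indexed by the same concatenated string are tied), and each
labelled example $(x_i, y_i)$ is placed at the entries $(p_i, s_i)$ with $p_i s_i = x_i$; with a
convex loss $\ell$ this is a convex Hankel matrix completion problem.
```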
Okay, so now I want to move beyond sequential data a little bit, but any questions before we do that? It turns out that the tools I have been giving you, this idea of representing your data with a Hankel matrix and looking for objects, like weighted automata, whose algebra is closely related to the properties of the Hankel matrix so that we can do this learning trick, and so on and so forth, extend beyond sequential data. In particular, you can use the same ideas to learn models that go from sequence to sequence, as opposed to sequence to number, and you can do similar things for other types of objects that are compositional in the way sequences are compositional: trees, for example, are also compositional, and even graphs, but I am not going to go there.

Why is sequence-to-sequence important? It is important in a lot of natural language processing applications. For example, in sequence tagging somebody gives you a sentence and you want to do part-of-speech tagging, meaning that to each word I want to assign its part-of-speech role in the sentence: I want to know that this word is a verb, that one is a noun, and so on. This is a sequence-to-sequence problem, and the particularity is that length is preserved: I want one label per input symbol. In other sequence-to-sequence problems the length of the sequence might change; for example, in spelling correction somebody wrote "apple" with one p, and what your sequence-to-sequence model should output is the correct spelling, "apple" with two p's. So that is sequence-to-sequence modelling in natural language processing, but similar things happen in reinforcement learning. You have probably seen in an earlier lecture that when an agent interacts with a Markov decision process, or even a partially observable Markov decision process, it collects traces of the following form: it takes action one, action two, action three, and for each action it gets back, for example, a pair of an observation and a reward. This is again a sequence-to-sequence setting, where you might want to learn a model that answers: if I conjecture that I am going to take these actions, what rewards am I going to get, or am I going to observe these things, because I am trying to get to that shop, so will I see the sign of that shop if I am a self-driving car, and so on.

It happens that for many of these applications you can split sequence-to-sequence modelling into two parts, learning and inference (maybe inference is not the right word here). What I mean is that you can learn a function from strings to real numbers, as we have been doing all along, except that now the string carries an input symbol and an output symbol at each position, rather than a single symbol; you could also have more complicated settings where the input and output have different lengths, but I am not going to discuss that case too much. You learn something of this type from data, and afterwards, when you want to do sequence-to-sequence prediction, you say: these are my input symbols, find me the symbols in Delta that maximize this f, or that minimize it, or average over all possibilities, or any variant of that. I am not saying this second problem is easy, it can be very hard in practice, but this is how, at least in theory, you can split the problem into two nice parts. If you grant that second part, you are basically back to learning something that goes from sequences to numbers, only over a slightly more complicated alphabet. That is how you formalize it, and you can use exactly the Hankel trick, where what you are trying to learn is an input-output weighted finite automaton.
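As a toy illustration of this split (everything here, the alphabets, dimensions and weights, is made up), you can score input-output pairs with an ordinary weighted automaton whose symbols are (input, output) pairs, and then do the second, inference step by brute force for short sequences.

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(1)
n = 2
inputs, outputs = ["a", "b"], ["N", "V"]          # e.g. words and POS tags

# A weighted automaton over the paired alphabet Sigma x Delta.
alpha = rng.normal(size=n)
A = {(i, o): rng.normal(size=(n, n)) for i, o in product(inputs, outputs)}
beta = rng.normal(size=n)

def score(xs, ys):
    """f((x_1,y_1)...(x_T,y_T)) = alpha^T A_{(x_1,y_1)} ... A_{(x_T,y_T)} beta."""
    v = alpha
    for pair in zip(xs, ys):
        v = v @ A[pair]
    return v @ beta

def decode(xs):
    """Inference step: pick the output sequence maximizing the learned score.
    Brute force here; in practice this step is where the real work is."""
    return max(product(outputs, repeat=len(xs)), key=lambda ys: score(xs, ys))

print(decode(["a", "b", "a"]))   # e.g. ('N', 'V', 'N'), depending on the random weights
```

In the dynamical-system reading, the same kind of score would instead be interpreted as, say, the probability of the observations given the actions, or an expected reward; only the semantics of f changes, not the machinery.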
Honestly, there are so many different names for this object that I have given up on trying to find a standard one: people call them weighted finite-state transducers, predictive state representations when you are doing reinforcement learning, or input-output observable operator models, coming from some people in physics. In any case, it is a model of this type, where the transitions are now indexed by pairs of symbols, and these are the kinds of things you can learn. When you are doing tagging, the semantics of the function you are trying to learn can be a compatibility score between an input and an output; when you are learning dynamical models, it can be the probability that I observe these observations given that I take these actions; or, in reinforcement learning, instead of the dynamics you can try to model the rewards: if I take these actions and I get these observations, what is my expected reward after this sequence. Everything I showed you before applies, with some small caveats here and there that I am not going to get into, where you now use the Hankel trick with a Hankel matrix defined over paired sequences of inputs and outputs. You can check these references (I will put the slides on my website afterwards, and maybe on the summer school website) for applications and for more concrete algorithms for how to do this in practice, because I am hand-waving a little bit here; it is not completely straightforward, but it can be done.

The other thing I want to highlight is that similar ideas work when you are dealing with trees, and trees again appear quite often in natural language processing applications, for example when you want to model dependencies: dependencies inside a sentence (what is the noun, what is the verb, what depends on what), or even inside a document (how is the document arranged: this is the abstract, this part is the introduction). So you can have tasks where the input is a sequence, say "Mary plays the guitar", and what you want as output is a tree that explains the dependencies in the sequence. For example, here we say: here is the sentence, it has a noun phrase and a verb phrase; the noun phrase is composed of just a noun, Mary; the verb phrase is composed of a verb, plays, and a noun phrase that contains a determiner, the, and a noun, guitar. What is interesting here is that the tasks you might want to solve, the models you might be dealing with and the type of data you might be getting are very diverse, but essentially what is going on is that these techniques based on Hankel matrices can be used to learn weighted context-free languages and objects of that kind. You could even decide that you are not interested in the tree itself, only in the representational power you get from it, in the same way that you go from regular languages to context-free languages. And the data you get here can be of different types, because in NLP applications things can get really messy.
You might get labelled trees, where you have nicely annotated data like this; sometimes somebody gives you how the tree looks but not the annotations, which you can derive yourself, for example from a part-of-speech tagger; and sometimes you are doing totally unsupervised learning, where you only have the yields and you are trying to find trees that explain the regularities and dependencies in your data in an unsupervised manner. One model that serves as a kind of umbrella for all these methods is weighted tree automata. A weighted tree automaton is similar to what we have been doing so far, but instead of modelling a function from strings to real numbers you model a function from labelled trees to real numbers, and it works as follows. You have a ranked alphabet, which just means an alphabet decomposed into several disjoint sub-alphabets, Sigma 0, Sigma 1 and so on, each with its own arity. A weighted tree automaton with n states looks formally quite similar to what we have been using: you have a vector of initial weights alpha; you have a set of transition tensors (nobody should get scared: a tensor is just a generalization of a matrix, a matrix is a two-dimensional tensor, and if I stack several matrices together I get a three-dimensional object, so a collection of matrices gives you a tensor of order three), and here you have tensors of different orders depending on the rank of the symbols in the alphabet; and then you have final weights, but instead of a single vector of final weights you have one vector of final weights for each symbol in Sigma 0. This has expressive power similar to, or higher than, weighted context-free grammars and latent-variable weighted context-free grammars. Here is what is really going on. Think of trees like the "Mary plays the guitar" tree. You have leaves; the leaves are the symbols in Sigma 0, the symbols of rank or arity zero, which cannot be composed with anything and just hang at the bottom of the tree. Then you have symbols like a here, which takes two arguments and so lives in Sigma 2, or c, which takes only one argument and so lives in Sigma 1; the index i in Sigma i is the arity. The nice thing about trees is that they can be composed: I can cut anywhere and represent the tree as the composition of two trees, where the part hanging below the cut is an ordinary tree and the part above it has one copy of a special slot symbol into which you can plug a whole tree. People usually call this an inside-outside factorization: if I split at this edge, this part is the inside of the tree and that part is the outside. And the interesting thing is that, in the same way that the computation of a weighted automaton factorized across a split of the sequence, the same thing happens with weighted tree automata. I do not have much time to explain in detail how a weighted tree automaton computes, but basically it takes the vectors at the leaves and aggregates them upwards: c would have a matrix, so you multiply the leaf's vector by that matrix to get a new vector on this edge; where two subtrees meet you take their two vectors and feed them into a tensor to get a new vector; and so on. You need tensors because you sometimes need to turn two vectors into a new vector, and that is exactly the structure that factorizes.
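Here is a minimal sketch, with made-up parameters and my own helper names, of how a weighted tree automaton evaluates a tree bottom-up, following the convention described here: per-symbol weight vectors at the rank-zero leaves, a matrix for each rank-one symbol, an order-3 tensor for each rank-two symbol, and the initial weights alpha contracted at the root.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 3  # number of states

# Ranked alphabet: 'a' has rank 0 (a leaf), 'c' rank 1, 'b' rank 2.
alpha = rng.normal(size=n)                     # initial weights, used at the root
final = {"a": rng.normal(size=n)}              # one final-weight vector per leaf symbol
T1 = {"c": rng.normal(size=(n, n))}            # an order-2 tensor (a matrix) per rank-1 symbol
T2 = {"b": rng.normal(size=(n, n, n))}         # an order-3 tensor per rank-2 symbol

def beta_vec(tree):
    """Bottom-up vector of a tree written as nested tuples, e.g. ('b', ('c', 'a'), 'a')."""
    if isinstance(tree, str):                  # a leaf in Sigma_0
        return final[tree]
    sym, *children = tree
    if len(children) == 1:                     # unary symbol: matrix-vector product
        return T1[sym] @ beta_vec(children[0])
    v1, v2 = beta_vec(children[0]), beta_vec(children[1])
    return np.einsum("ijk,j,k->i", T2[sym], v1, v2)   # binary symbol: contract the tensor

def f(tree):
    return alpha @ beta_vec(tree)

t = ("b", ("c", "a"), "a")                     # the tree b(c(a), a)
print(f(t))
```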
So if I have a tree t that is an outside tree composed with an inside tree, the value that the automaton assigns to t factorizes as an inner product between something that depends only on the outside tree and something that depends only on the inside tree. And if the inside tree has the form sigma(t1, t2), an arity-two symbol applied to two trees (in the example, sigma would be b, t1 the subtree rooted at c and t2 the leaf a), then I can rewrite the inside vector of t as something that depends on the vector of t1, the vector of t2 and the tensor of the symbol in the middle. The operation I have not made explicit, taking an order-3 tensor, feeding it two vectors and producing a new vector, can be understood as follows: if you take the tensor and put its matrices next to each other instead of stacking them into a three-dimensional object, what we call flattening the tensor, you get a rectangular matrix, and applying the tensor to two vectors becomes a matrix-vector product in which the vector is the Kronecker product of the two. All of this is just to say that you can do everything I have done so far to learn weighted tree automata with the Hankel trick, where the Hankel matrix, instead of having prefixes in the rows and suffixes in the columns, both of them strings, has outside trees in the rows, that is, trees with the special slot symbol into which you can plug a tree, and ordinary trees in the columns. For example, if I take this outside tree, a with an empty slot, and I plug an a into the slot, I get minus one; and I get the same minus one if I take this other slot and plug this other tree into it, because both compositions produce the same tree. So this matrix has the same kind of redundancy structure as before, you can prove a version of the Fliess-Kronecker theorem for this type of Hankel matrix and weighted tree automata, and you can again do spectral learning. This was done first for this type of representation by Bailly and Denis, and then many people in natural language processing have taken these ideas and built very interesting algorithms with them; maybe they do not work with exactly this matrix, but the same ideas are at work underneath.
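Going back to the flattening remark, it can be checked in a couple of lines that applying an order-3 tensor to two vectors is the same as multiplying its n by n-squared flattening by the Kronecker product of the vectors (toy sizes and names of my own choosing).

```python
import numpy as np

rng = np.random.default_rng(3)
n = 3
T = rng.normal(size=(n, n, n))        # an order-3 transition tensor
v1, v2 = rng.normal(size=n), rng.normal(size=n)

applied = np.einsum("ijk,j,k->i", T, v1, v2)          # tensor applied to two vectors
flattened = T.reshape(n, n * n) @ np.kron(v1, v2)     # flattening times Kronecker product
print(np.allclose(applied, flattened))                 # True
```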
Okay, so I am very near the end; feel free to ask questions while I move to the conclusions. That is a very interesting question. You can do some things, but it requires you to hand-craft a finite-dimensional representation of your symbols. What I mean is that if you have a Hankel matrix over an infinite alphabet, you can always multiply it on the left and on the right by projections, and if you pick these projections right, this preserves the rank. So if you can pick projections that say: although I have infinitely many symbols, they have these regularities, which I know how to encode manually into the projections, then you can still show that most of the theory goes through. But it is pretty ad hoc, it is a trick for when you encounter a situation like this. For a concrete example you can look at the work I mentioned, where this idea is present in some form: they do not have an infinite alphabet, but because the matrix can get really huge they work with projections of it that preserve the rank, and the projections are derived from things you know about language. These are people who are really NLP people, they know a lot about language and they know how to derive these projections for a case where the alphabet is not infinite but where you get English words in your rows and columns, so the matrix can grow huge. Then, for the theory to go through, you have to assume that the projections satisfy this and that, and then things kind of work.

I did not show any experiments, but yes, it works too. Over the years, most of the papers I have worked on in this area were published in machine learning conferences, and there we usually have applications in mind, so we have compared the complexity of different implementations and different algorithms, and we have shown how to apply these things, for example, to problems in reinforcement learning, to learning about timing, to sequence-to-sequence modelling in different models, to language modelling, and so on. The references I gave along the way cover this; most of these plots come from those papers. This is a really rich area and there was no time to cover applications, but I wanted to show that in addition to pretty nice theorems you can get practical results; I do not think the plots are as nice as the theorems, but it works.

Open problems: there are plenty, so I just want to highlight a few. One of the most important ones, when you want to do this in practice, is how you pick P and S, the sets of prefixes and suffixes that define your Hankel block; this is a parameter of your algorithm. We know how to prove results under what some would call mild assumptions on P and S, but selecting them from data, and doing so optimally in the sense of minimizing their size, is open: the bigger they are, the more running time you need, so ideally you would like them small; but the smaller they are, the more things about the data they ignore, so you also want them big. Where the trade-off sits and how to find it from data is one of the biggest open problems here. The other one, which I already pointed out, is how to do scalable convex optimization over Hankel matrices to solve the matrix completion problem. Among the things we do not know how to do, and which might be hard in general but where one could hope for theoretical progress in some concrete cases, is constraining the output of the weighted automaton. For example, if we are learning a distribution, maybe we want the output to be a probabilistic automaton, but we cannot guarantee that with spectral methods, because there is this change of basis Q that we do not control: the data may well have come from a probabilistic automaton, but what I learn is that automaton conjugated by some Q, and the numbers in it can be negative; they do not look like probabilities any more.
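To spell out the role of that Q, here is the standard change-of-basis fact about weighted automata: for any invertible matrix Q,

f(x_1 \cdots x_T) \;=\; \alpha^{\top} A_{x_1} \cdots A_{x_T} \beta \;=\; \big(\alpha^{\top} Q^{-1}\big)\big(Q A_{x_1} Q^{-1}\big) \cdots \big(Q A_{x_T} Q^{-1}\big)\big(Q \beta\big),

so the spectral algorithm can only pin the parameters down up to such a Q, and the equivalent automaton it returns may have negative entries even when the data came from a probabilistic automaton with nonnegative ones.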
Can we do something there, something that, for example, constrains the output to be a probabilistic automaton? In that case the models are more interpretable, and you can do more interesting things, such as using the output as an initialization for other types of machine learning algorithms, Bayesian learning and so on.

There is something I did not talk about at all, which is that learning and a problem we have been studying lately, approximate minimization, are closely related. Approximate minimization is the following problem: we know that if you give me a non-minimal weighted automaton, there are algorithms that minimize it and give you the smallest automaton that can represent the same language; but you might want to go below that and say: the minimal automaton has a thousand states, that is still too big for me, I want something with five hundred states. You can pose this as what is called approximate minimization, minimization beyond the minimum. In a sense, what happens when you learn with Hankel matrices, at least when you are learning distributions, is that you put your data into the Hankel matrix, and that is the representation of an automaton that completely mimics your data: it is minimal, but it completely overfits. So you can think of the algorithm that learns the automaton by taking an SVD and then doing all this linear algebra as performing an approximate minimization of the automaton that mimics your data and does not generalize at all. I think there are very nice connections there that are still to be explored; we have a couple of papers on this, which I do not think are in the references, but you can find them on my website.

One question I get all the time, so let me be pre-emptive: how much of this can be extended to other semirings? To be honest, I do not know. I am pretty sure all of this machinery works over the complex numbers, although I have not seen any learning application that requires them. Some of what we do here works over other fields, with some caveats, for example finite fields: you do not have an SVD, but you have reconstruction algorithms for weighted automata and so on. Beyond that, for arbitrary semirings, I have no idea; but some of my colleagues in NLP have good intuitions for why in some applications you might want particular semirings, I think max-plus semirings instead of the usual sum-product semiring, to model certain natural language phenomena. So that is very much an open problem, and it would have applications if somebody knew how to do it.

And, as I said, in some cases you want to take the output of the spectral algorithm and use it as the initialization point of some other non-convex, gradient-based algorithm, because that improves your accuracy; it can give you an extra boost because you squeeze your data a little bit more. It seems perfectly plausible and very reasonable to do, and in practice it is amazing how much you can improve this way, but we have no theory about it, because doing theory about it would mean doing theory about non-convex optimization, albeit non-convex optimization where you have a good initializer; the catch is that you would need to prove that it is a good initializer. So doing theory around this would be really interesting, but I do not know how to do it either.
So, take-home points. Basically everything I have shown you is based on a single building block: you use the SVD of Hankel matrices to learn weighted automata, with different variants of weighted automata and different ways of estimating the Hankel matrix. It is remarkable how much you can do with a single building block. In addition, this building block only requires some linear algebra and a way to represent the Hankel matrix in memory, which is sometimes the biggest bottleneck, because it can grow very big if your P and S are big. To get into this area, all you need is a little bit of linear algebra, a little bit of probability, a little bit of convex optimization and a lot of patience, so it is not that far-fetched, and you can get pretty nice practical applications for a variety of models and tasks. Obviously this was done over a few years, and these days, if all you care about is accuracy on some data set, you will probably do deep learning; but if you care about using something with provable guarantees, or your domain is one where deep learning does not apply, these methods are worth knowing. I did not talk much about exact learning, but these techniques can also do exact learning; I think Ben was supposed to cover a version of this that learns exactly, but there was not enough time for it. In any case, it has applications and the theory is really interesting.

If you want to know a little more, we have a tutorial that was recorded as well, and in that tutorial you can find more about the tree material; I think there is almost an hour about trees. There are a couple of survey papers I want to highlight: the first tells you more about the history of these algorithms and their different versions, for example with queries, as opposed to only the spectral part; the second, by Thon and Jaeger, is very nice because it surveys weighted automata together with the neighbouring formalisms I mentioned at some point, predictive state representations, observable operator models and so on; it cleans house a little bit, ties all these connections together and explains all these models in a single paper. If you want to dive deeper into this literature it is a good place to look, because it will give you the keywords for the parts of the literature that do not call these models automata and call them other things. There is a very nice Python implementation of the spectral learning algorithm by people in Marseille, so if you are interested in running this in practice, check it out. And, as I just mentioned, the neighbouring literature on predictive state representations and observable operator models, by people in reinforcement learning and in physics, is another starting point to dig further. Finally, I just want to say thanks to all the people who have collaborated on this research in one way or another, all very nice people. Thank you. [Applause]
Info
Channel: Federated Logic Conference FLoC 2018
Views: 101
Rating: 5 out of 5
Id: QwXwz5Z9L-Y
Length: 89min 45sec (5385 seconds)
Published: Mon Jul 30 2018