Borja Balle: Automata Learning I

Captions
So, yeah, I work for Amazon, but most of this is stuff I finished before joining Amazon; I'm working on different things now. When Alessandra and Prakash asked me to talk about automata learning at this summer school it was a bit tricky, because there is a lot of material in automata learning, so what I'll do is just talk about a few topics, a personal selection based mostly on what I already had slides for, no shame in admitting that. But I want to put this into context. One thing I find fascinating about automata learning is that, in the same way that automata and language theory have been there since the beginning of theoretical computer science, automata learning has been there since the beginning of learning theory. This is something that mainstream people in machine learning often don't know these days, because it goes back to the late 60s and early 70s, but many of the first theoretical learning results were actually about automata, most of them in fact.

The first thing you find in the literature is a result by Gold from 1967 showing that you can learn regular languages in the limit. Learning in the limit is one way of formalizing learning; it's not one that is used today, because it's very unrealistic, but it was one of the first results in learning theory. Then Angluin refined this in 1987. By the way, Gold is not active anymore, but Angluin still is; she's still teaching at Yale, and if you ever visit, get in touch with her if you're interested in this, she's just an amazing person. She showed that regular languages can also be learned in a different setting, learning from membership and equivalence queries. I think Ben was supposed to talk about this in his computational learning lecture, but maybe he didn't get to it; did you see Angluin's algorithm with him? No? Okay, that's unfortunate, but it's fine: you can see where it comes from in some of the things I'm going to talk about, so I'll probably make a remark about it again. Learning from queries is a somewhat unrealistic model when it comes to learning from real-world data, but it is more realistic when you're trying to learn about a system you can interact with, so it's getting slightly more realistic.

Then, in the early 90s, learning theory really started to blossom, and people proposed more realistic models of learning: definitions of what it means to learn and to deal with data that might come from the real world, for example by introducing randomness. In these more realistic models, like PAC learning, the first results were kind of negative. In 1993 Pitt and Warmuth showed that PAC learning regular languages, or DFAs, is actually NP-hard, whereas in the earlier models you could do it in polynomial time; and in 1994 Kearns and Valiant extended this NP-hardness to cryptographic hardness. So that was kind of a downer, and the machine learning community largely moved on to other problems, because automata and regular languages seemed too hard.
But some stubborn people kept at it. What they did over this period of about twenty years was to say: well, the general problem might be too hard, so let's look at particular instances, restrictions of the problem, and so on. Using methods from combinatorics that came out of those early results, together with tools from statistics and linear algebra, they managed to make partial progress, and to me this progress is summarized by these 2009 results that really started the field of spectral learning of automata, which is what I want to talk about today. So what I'll talk about today are things that started around 2009; it's a modern take on automata learning, but obviously inspired by all this earlier work.

What I want to do in this lecture, or tutorial, is motivate these spectral techniques for learning weighted automata and other models of sequences and tree-structured data. There's a lot to cover, so I'll focus mainly on providing intuitions and fundamental results that, if you're interested in this area, will help you navigate the literature effectively. We'll survey some of the formal results and some of the proofs, but not in too much depth, and I especially want to emphasize the tools we use to analyze these algorithms and prove the results: linear algebra, concentration bounds from probability, and a few well-known things from learning theory that you've probably seen in the statistical learning lecture. What I won't do is dive deep into applications. There are lots of applications of these methods; I'll stay at the abstract level and just give pointers to where they can be applied. I obviously cannot provide an exhaustive treatment, that's beyond the scope of an introductory lecture, so I'll give some pointers at the end in case you want to keep digging, and instead of trying to give complete proofs of everything I state, I'll give a few proofs that illustrate the methods: the linear algebra, the concentration results, and so on.

So the plan is this. I'll start with a motivating story about sequential data and why weighted automata play a role in learning from sequential data. I'm pretty sure you've seen sequential data in the reinforcement learning lecture; I'm not sure the other lectures on Bayesian, statistical, and computational learning touched much on sequential data, so I'll do a brief introduction there. Then I'll talk about fundamental results on weighted automata, reconstruction and approximation, which are the main building blocks for designing and analyzing the learning algorithms I'll explain in parts three and four; those fall into the PAC learning and statistical learning categories, so they correspond to the setups Ben and the statistical learning lecture were talking about. Then, if I have time (all of this is for sequential data), I'll explain a little how to extend these ideas to learn transducers, which are sequence-to-sequence models, and models on other structured data like trees. And feel free to interrupt me at any time; I'm happy to take questions along the way, clarifications, notation, whatever, just stop me.

Okay. So, sequential data. I'm not sure about the audience
here, whether you come more from logic or more from learning, but in any case it's pretty obvious, if you think about learning from real data, that there are a lot of examples where you want to apply machine learning to sequential data. Starting with natural language processing, where you have sequences of words as sentences, sequences of sentences as documents, and so on; computational biology, where we represent DNA as sequences of A, C, G, T, or proteins as sequences of amino acids (that's not the only representation of this data, sometimes you also have graphs, but that's again structured data); time series analysis, which is by nature data ordered in time; sequential decision-making, which has "sequence" in the name; and robotics, where robots interacting with an environment see a sequence of observations and have to take a sequence of actions. So sequential data is very important in these physical systems, in the sense that time and physics are very interrelated.

What makes sequential data special is that most of the time we cannot apply the common machine learning algorithms everyone knows, because those algorithms assume we can represent the inputs as vectors of a fixed size, and that is tricky for sequences: you can have sequences of length 10, or 20, or 25, so how do you come up with a representation that encodes all these sequences of different lengths as vectors of a fixed size? That's what makes it hard to apply things like logistic regression, linear regression, and many other basic machine learning algorithms. Of course, if you have something that kernelizes, you can think about kernels over sequences, and there's a whole literature there, but I'm not going to go into that. What I want to focus on is the compositional aspect of sequences: if you concatenate two sequences you get a new sequence; if you have a tree and you append another tree to one of its leaves you get another tree. Data with this compositional structure lends itself very naturally to the kind of methods I'll present, which are, at an intuitive level, methods that use linear algebra to discover latent variables; you'll see more about this as I get further into the material.

One more thing to realize is that there's a lot of diversity even within sequential data: the observations in the sequence can be continuous or discrete, and the times at which we observe things could be continuous, discrete, or arbitrary, where all we know is that one symbol comes before another. What I'll assume today is that we have sequences of symbols from a finite alphabet Sigma, and Sigma star is the set of all sequences you can build from that alphabet. In some of the examples I'll use the simplest alphabet, just the letters a and b, but in general, for example in a natural language application,
Sigma could be the set of all English words. The goal will be to learn a function from strings to real numbers: we'll be given some data, and the goal is to produce a function that assigns a real number to every string, where these numbers have some semantics; we'll try to get them to do something interesting for our application. Many things can be represented like that. For example, a language model: the input to the function f is a sentence, a sequence of words, and what we'd like to predict is the likelihood of observing this sentence in a specific natural language. This is used, for example, in machine translation: say you're translating from French to English and you have a model that proposes ten candidate translations for a sentence; to decide which one to pick, one thing you can take into account is how grammatical these proposals are, and you can use the language model to score them, because things that are grammatical will have higher probability of occurring in the language than things that are not. In the computational biology example, you can have protein scoring models, where you take a protein encoded as a sequence of amino acids and use f to predict, say, the activity of this protein in a particular biological reaction. In robotics, sequential decision-making, or reinforcement learning, you can have a sequence of actions and want to predict the expected reward an agent will collect after executing that sequence. And in networking applications you can have a sequence of packets and try to determine the probability that this sequence of packets transmits a message successfully through the network. These are just a few examples to motivate that, although what I'm going to talk about next will look pretty abstract, you can use it in a lot of applied settings.

One important remark is that this f I'm trying to learn, something from Sigma star to R, is an element of R^(Sigma star), and that is an infinite-dimensional object. If we're trying to learn this thing, we'll have to put data into a computer and get something out of a computer, so we need a way to represent it finitely. This is where weighted automata come in: they provide us with finite representations for objects in this class. They're obviously not the only finite representation; there are many different ways of representing such functions, and I'll get into this a little at the end, when we talk about more complicated models that use trees to generate sequences, but for most of the lecture I'll be dealing with representations of functions from strings to real numbers given by weighted automata. If you've seen weighted automata before, a disclaimer: I'm just going to use real weights here. If you've seen weighted automata over semirings, here I'm using the simplest, most basic semiring, just the real numbers; and if you don't know what I'm talking about, don't worry.

So, what is a weighted
automaton? You can look at it in different ways. One way is the graphical representation. A weighted automaton has a finite number of states, which we usually denote n. Here's an example with two states: the states are these round things, one named q1 and the other q2, and we also have an alphabet, as I said before; here I'll just use the alphabet {a, b}. Then, for each state we have an initial weight: the initial weight of q1 is -1, indicated like this, and the initial weight of q2 is 0.5. We also have a final weight for each state, here 1.2 and here 0. And for each pair of states and a letter we have a transition weight: for example, from q1 to q2 with letter a the weight is -1, and you have this for every (state, letter, state) triple. The graphical representation is nice because, if a person designs a weighted automaton to do something (this is a random example, but still), you can probably infer what it's trying to do from the picture. But when we do machine learning the output is just going to be a bunch of numbers, so it's very hard to interpret these things, and it's actually more convenient to work with a different representation that lets us encode these objects in a computer and do algebra with them: representing all the weights as vectors and matrices.

So we have a vector alpha of initial weights, remember (-1, 0.5), whose dimension is the number of states: two states, two dimensions. The final weights form a vector beta, (1.2, 0). And for each symbol in our alphabet Sigma, in this case the two symbols a and b, we have one transition matrix of size (number of states) by (number of states) containing the transition weights associated with that letter; the -1 that goes from state 1 to state 2 with letter a becomes the entry in the first row, second column of A_a, and so on. So, succinctly, we will typically write a weighted automaton as a tuple of initial weights, final weights, and one transition matrix per symbol: A = <alpha, beta, {A_sigma}>. If the automaton has n states, which I'll sometimes denote |A|, then alpha and beta are n-dimensional vectors and the A_sigma are n by n matrices.

I said we wanted to use weighted automata to represent functions from strings to real numbers, so here is the function a weighted automaton computes. It's a function I'll sometimes denote f_A, for automaton A, and it takes a string x_1 ... x_T, a sequence of symbols from the alphabet. The typical definition is a somewhat cumbersome expression: you consider every sequence of states, each taking values between 1 and n, follow the corresponding path through the automaton labeled by the string, multiply all the transition weights along that path together with the initial weight of the first state and the final weight of the last state, and then sum over all possible paths.
This is easy to see in the graphical representation, but in the algebraic representation you have a more compact expression: you take the vector of initial weights as a row vector, multiply it by the sequence of transition matrices given by the symbols in your string, and then multiply by the final weights. So f_A(x_1 ... x_T) = alpha^T A_{x_1} ... A_{x_T} beta, which you can summarize as alpha^T A_x beta: a row vector times a matrix times a column vector, so it's an inner-product-type thing.

Obviously there are functions of this type that cannot be represented by a weighted automaton, so we give the name recognizable, or rational, to the functions that can: we say f is recognizable, or rational, if there exists a weighted automaton A that computes it. The smallest number of states for which we can achieve this, the least n for which we can find such an automaton, we call the rank of the function, and you'll see where the name "rank" comes from in a couple of minutes. Accordingly, we say an automaton is minimal if its number of states equals the rank of the function it computes.

One important observation is that weighted automata are not unique: many weighted automata can compute the same function. If I have an automaton given by these initial, transition, and final weights, and I take any invertible n by n matrix Q, multiply the initial weights by Q, conjugate each transition matrix by Q, and multiply the final weights by Q inverse, so that alpha^T becomes alpha^T Q, each A_sigma becomes Q^{-1} A_sigma Q, and beta becomes Q^{-1} beta, then the automaton I get has different weights. You can argue they're the same weights represented in a different basis, but they are different numbers; yet all the Q's cancel along the way, so the two automata compute the same function. That's going to be important, because it means that if we're trying to learn a weighted automaton, we cannot insist on getting the same automaton back; all we can insist on is getting an automaton that computes the same function. That matters when you're trying to learn things like probabilistic objects.

[Audience question about uniqueness and normal forms.] When you're minimal, yes; otherwise things can get slightly trickier. There are a few normal forms, but usually the only way to get a normal form out of learning is to learn something first and then transform it into normal form, so normal forms are not that useful for designing learning algorithms. But yes, you can come up with normal forms as well.
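To make the algebraic representation concrete, here is a minimal NumPy sketch of evaluating a two-state weighted automaton and of the change-of-basis invariance just described. It is only an illustration of the formulas: the transition weights that are not spelled out above are filled in with arbitrary numbers.

```python
import numpy as np

# A weighted automaton as a tuple (alpha, beta, {A_sigma}). The weights are
# loosely based on the two-state example above; entries not given in the
# lecture are made up.
alpha = np.array([-1.0, 0.5])          # initial weights
beta  = np.array([1.2, 0.0])           # final weights
A = {
    "a": np.array([[0.3, -1.0],        # A_a[q1, q2] = -1, as in the example
                   [0.2,  0.6]]),
    "b": np.array([[0.5,  0.1],
                   [-0.4, 0.0]]),
}

def wfa_eval(alpha, A, beta, x):
    """f_A(x_1 ... x_T) = alpha^T A_{x_1} ... A_{x_T} beta."""
    v = alpha.copy()
    for symbol in x:
        v = v @ A[symbol]              # multiply transition matrices left to right
    return v @ beta

print(wfa_eval(alpha, A, beta, "abba"))

# Non-uniqueness: conjugating by any invertible Q changes all the weights but
# not the function the automaton computes.
rng = np.random.default_rng(0)
Q = rng.normal(size=(2, 2))                                  # almost surely invertible
alpha2 = Q.T @ alpha                                         # alpha'^T = alpha^T Q
beta2  = np.linalg.inv(Q) @ beta                             # beta'    = Q^{-1} beta
A2 = {s: np.linalg.inv(Q) @ M @ Q for s, M in A.items()}     # A'_sigma = Q^{-1} A_sigma Q
print(wfa_eval(alpha2, A2, beta2, "abba"))                   # same value up to rounding
```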
Okay, so this is a very general object, and there are two very classical examples you might have seen before. One is deterministic finite automata. A DFA, the object we use to recognize regular languages, can easily be encoded as a weighted automaton where all the weights are either 0 or 1: alpha is just an indicator of which state is the initial state; beta has ones in the coordinates of the accepting states and zeros in the rejecting states; and the transition weight A_sigma(i, j) is 1 if there is a transition from i to j labeled by sigma, and for each i and sigma there is only going to be one such j. You can check, using the previous formulas, that if you take a DFA encoded like this you actually get a function from strings to {0, 1} that defines a regular language.

Another well-known example, more from the statistics community, is hidden Markov models. In an HMM the weights are probabilities, so they live in the interval [0, 1]: alpha is a distribution over initial states, so a probability vector; for beta there is some subtlety, there are different definitions, but the one I like is that beta is the all-ones vector, so there's nothing to do in beta; and the transition weight, for a symbol sigma, an initial state i and a final state j, is the probability that you go from i to j and observe symbol sigma. In an HMM you usually assume that the transition and the emission are conditionally independent given the current state, which lets you write this weight as the probability of going from i to j times the probability of observing sigma from i. If you set things up like this you get a function from strings to probabilities that defines a dynamical system: it gives you the probability of generating the next symbols, and so on. So these are just two examples to put things into context. Any questions so far?
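As a small, hedged illustration of the HMM encoding just described: the transition and emission probabilities below are made up, and the brute-force check at the end simply marginalizes over state sequences. It reuses wfa_eval from the previous sketch.

```python
import numpy as np

# Hypothetical 2-state HMM over {a, b}; the numbers are made up for illustration.
pi = np.array([0.6, 0.4])               # initial state distribution
T  = np.array([[0.7, 0.3],              # T[i, j] = P(next state j | state i)
               [0.2, 0.8]])
O  = np.array([[0.9, 0.1],              # O[i, k] = P(emit k-th symbol | state i)
               [0.4, 0.6]])             # column 0 -> 'a', column 1 -> 'b'

# Encoding as a weighted automaton:
#   alpha = pi, beta = all-ones, A_sigma[i, j] = O[i, sigma] * T[i, j]
alpha_h = pi
beta_h  = np.ones(2)
A_h = {s: np.diag(O[:, k]) @ T for k, s in enumerate("ab")}

# f(x) is the probability that the HMM emits x as its first |x| symbols.
print(wfa_eval(alpha_h, A_h, beta_h, "ab"))

# Sanity check: brute-force sum over all state sequences of length 3.
total = 0.0
for s0 in range(2):
    for s1 in range(2):
        for s2 in range(2):
            total += pi[s0] * O[s0, 0] * T[s0, s1] * O[s1, 1] * T[s1, s2]
print(total)                            # matches the value above
```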
Okay. So we have to learn these objects, and to do that we will need to exploit some structural properties. The easiest way, not the only one, to see these structural properties is to look at the Hankel matrix of the function from strings to real numbers. If you've never seen this before it might seem a bit strange at first, but at some point it becomes intuitive and you get used to it. If you have a function f from strings to real numbers, its Hankel matrix, which I'll denote H_f, is a real matrix with rows and columns indexed by strings. Because there are infinitely many strings, this matrix is in principle infinite. We usually call the row indices prefixes and the column indices suffixes, and the reason is that the matrix is defined as follows: if I have a row index, a prefix p, and a column index, a suffix s, then the entry H_f(p, s) is the function evaluated on the string obtained by concatenating p and s, a new string that has p as a prefix and s as a suffix. So, for example, the (empty, empty) entry is f of the empty string, the (empty, a) entry is f(a), and so on. You can see this is a very redundant way to represent the function: the whole function is already contained in the first row of the matrix, or in the first column, and in fact if you have a string x of length L, f(x) appears in this matrix L + 1 times.

But this redundancy gives us structure, and that is the Fliess-Kronecker theorem. It says that although this matrix is infinite, it can have finite rank, and it tells you exactly when: the Hankel matrix has finite rank if and only if the function is rational, that is, it can be represented by a weighted automaton; and moreover the rank of the matrix equals the rank of the function, the minimal number of states you need to represent it as a weighted automaton. The first time I saw this it was really surprising: you have this infinite matrix, it has finite rank, and you have a very crisp characterization of when that happens, meaning that if the matrix has rank n, you can represent it with something that has roughly n^2 numbers, even though it's an infinite matrix. (Rational means representable by a weighted automaton, the definition I gave before, so in particular f is rational exactly when the Hankel matrix has finite rank.)

I'm not going to give the full proof, but I'll give an intuition of where this comes from, because this intuition is embedded in the machine learning algorithms we derive from the theorem. Suppose we actually have a rational function, say f_A computed by some weighted automaton A. One step in proving the theorem is to show that in this setting the Hankel matrix cannot have rank larger than the number of states of A, and that is very easy to see. Take a prefix p and a suffix s, a row and a column of this matrix, and look at that entry: it's the function evaluated on ps, and by the definition of what the automaton computes this has the form alpha^T times the A's corresponding to the symbols in the prefix times the A's corresponding to the symbols in the suffix times beta. Now split it in the middle: the first part, alpha^T A_p, is a row vector, and the second part, A_s beta, is a column vector, so the entry is an inner product. I usually call alpha_p^T = alpha^T A_p a forward vector, because it moves forward through the automaton, and beta_s = A_s beta a backward vector, because it moves backward. If I now keep p fixed and change s, that is, I change which column I'm looking at, the forward part does not change, which means everything in that row is an inner product of this fixed forward vector with the various backward column vectors. So I can fix the forward vector alpha_p, arrange all the backward vectors as the columns of a matrix, and this row-times-matrix product gives me the whole row; doing this for all rows and columns, you see you've actually found a factorization of the Hankel matrix into a tall skinny matrix and a short flat matrix, which I call P_A and S_A, where the intermediate dimension is n, the number of states, the dimension of the forward and backward vectors. Since I've just written the Hankel matrix as a product of two matrices with intermediate dimension n, its rank cannot be more than n.
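Here is a short numerical continuation of the earlier sketch (it assumes alpha, A, beta, and wfa_eval from the first code block are still in scope). It builds a finite Hankel block and the forward-backward factorization induced by the automaton, just to see the rank bound in action.

```python
import numpy as np
from itertools import product

def word_matrix(A, x, n):
    """A_x = A_{x_1} ... A_{x_T}, with the identity for the empty string."""
    M = np.eye(n)
    for symbol in x:
        M = M @ A[symbol]
    return M

# All strings over {a, b} of length at most 2, used as prefixes and suffixes.
strings = [""] + ["".join(w) for r in (1, 2) for w in product("ab", repeat=r)]

# Finite Hankel block: H[p, s] = f(p s), with f computed by the toy automaton.
H = np.array([[wfa_eval(alpha, A, beta, p + s) for s in strings] for p in strings])

# Forward-backward factorization: row p of P is alpha^T A_p, column s of S is A_s beta.
n = len(alpha)
P = np.array([alpha @ word_matrix(A, p, n) for p in strings])       # |prefixes| x n
S = np.array([word_matrix(A, s, n) @ beta for s in strings]).T      # n x |suffixes|

print(np.allclose(H, P @ S))        # True: H factors through dimension n
print(np.linalg.matrix_rank(H))     # at most n (here n = 2)
```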
Okay, so that's how you prove the upper bound, and these factorizations are what we'll use to derive learning algorithms: basically, we'll start with the Hankel matrix, factorize it, and, because these forward and backward vectors are closely related to the automaton, try to go from the factorization back to the automaton. Sometimes we call this the forward-backward factorization of the Hankel matrix induced by A. Note that the Hankel matrix is infinite and both factors are infinite too. Questions? No questions, cool.

So how do we use the Fliess-Kronecker theorem to build learning algorithms? I gave you the intuition, but let's look at it in a bit more detail. I just showed you the forward-backward factorization, which is the intuition behind the theorem. Now let me do something that at first might look strange: let me pick an arbitrary symbol sigma from my alphabet and put it in the middle, between the prefix and the suffix. What happens in the factorization is that I keep my forward vector and my backward vector, but I get an extra matrix in the middle corresponding to sigma. In matrix terms: if I build the matrix, which is no longer Hankel, it doesn't satisfy the Hankel constraints, whose (p, s) entry is the function evaluated on the prefix, then the symbol sigma, then the suffix, call it H_sigma, then the factorization I get is the same one I had but with the matrix A_sigma in the middle: H_sigma = P A_sigma S. So I've just built a system of linear equations that tells me how to find A_sigma, in the following sense: if I have the factorization H = P S, and I have the matrix H_sigma, then I can solve for A_sigma. Algebraically, I have H = P S on one hand and H_sigma = P A_sigma S on the other, so I can use P and S to cancel the P and S in the second equation and get A_sigma = P^+ H_sigma S^+, where P^+ and S^+ are Moore-Penrose pseudoinverses, which are a very convenient way of writing "solve this linear system".

Maybe it's good to do a little recap of what the Moore-Penrose pseudoinverse is. Who has seen this before? Okay, just a few. This is something from linear algebra that isn't usually taught in a basic course, but it's very useful. It's a generalization of the inverse to matrices that are not square: non-square matrices cannot have inverses, but sometimes we need to do things with them anyway, and one way to generalize the inverse is the Moore-Penrose pseudoinverse. For a matrix M of size n by m we usually write it M^+, it has the transposed dimensions m by n, and it satisfies a few identities, relations of the kind the normal inverse would satisfy anyway, so it cancels things when the ranks are right, and it coincides with the inverse when the matrix is square and invertible. The most important property, the one I'm going to use a lot here, is that you can use the pseudoinverse to solve linear systems, in the following sense. Take a system of linear equations M u = b, compute the pseudoinverse M^+, and multiply it by b. If M were square and invertible, M^+ b would just solve the system. If M is rectangular, what you get is the argmin of a certain problem, and the interesting cases are these: if the system is exactly determined, then M^+ b solves it, there's only one solution and M^+ b gives you that solution. Not all systems are exactly determined: you can have underdetermined systems with more than one solution, and in that case M^+ b gives you the solution with the smallest norm, in the L2 sense, the Euclidean norm. And you can have overdetermined systems with no solution at all: in that case M^+ b gives you the minimum-norm solution of the least squares problem, so you look for the u's that minimize ||M u - b||; if there is a solution the minimum is zero, otherwise it isn't, and there may be many u's achieving the minimum, and M^+ b gives you the one of minimum norm among those. So M^+ b is a very convenient way to solve linear systems, with nice properties whether the system is exactly determined, underdetermined, or overdetermined.
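A quick numerical illustration of these regimes with NumPy's pinv; the matrices are just random placeholders.

```python
import numpy as np

rng = np.random.default_rng(1)

# Overdetermined: more equations than unknowns, in general no exact solution.
M = rng.normal(size=(5, 3))
b = rng.normal(size=5)
u = np.linalg.pinv(M) @ b                  # least-squares solution
print(np.allclose(u, np.linalg.lstsq(M, b, rcond=None)[0]))

# Underdetermined: fewer equations than unknowns, infinitely many solutions.
M = rng.normal(size=(3, 5))
b = rng.normal(size=3)
u = np.linalg.pinv(M) @ b
print(np.allclose(M @ u, b))               # it is an exact solution...
# ...and it has minimum Euclidean norm: adding any null-space vector of M
# gives another solution with norm at least as large.
z = rng.normal(size=5)
z_null = z - np.linalg.pinv(M) @ (M @ z)   # component of z in the null space of M
print(np.linalg.norm(u) <= np.linalg.norm(u + z_null))
```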
Why am I mentioning this? Partly because it's a fun fact that becomes useful quite often, but also because I was just using pseudoinverses up there, and you actually need to justify that you get the automaton back; using the properties I just sketched, you can justify it. But you can do something slightly better: you can play the same game even when you don't have the infinite Hankel matrix, even when all you have are finite blocks of the Hankel matrix, crafted in a particular way so that they satisfy certain assumptions.

So which blocks will we be looking at? Finite Hankel sub-blocks, indexed by a finite set of prefixes and a finite set of suffixes: you just give me a set of rows and a set of columns. In this example, if the set of rows is {a, b}, the orange ones, and the set of columns is {empty, a, b}, then I can identify this red block, a block of the Hankel matrix that I'm usually going to denote H. I could write H_{P,S} to remind you what P and S are, but the notation gets really cumbersome, so I'll just write H. There's another type of sub-block we'll also need, corresponding to the H_sigma I had when I was doing this with infinite matrices: given P and S, I can append a symbol sigma to all the prefixes in P and keep S as it is, and I get another sub-block. For example, if you take the letter a and append it to all the orange prefixes, you get aa and ba, the purple rows, while the green columns stay the same, so that block corresponds to these two rows. Note that I set up the example so the blocks don't overlap, but H and H_sigma can overlap: if I had both the empty string and a in P, then empty concatenated with a would give me a, and that row would be in both the red and the purple block; I just skipped that in the picture because I don't know how to draw red on top of purple.
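Extracting these sub-blocks is just indexing. Here is a small sketch that reuses the toy automaton and wfa_eval from the earlier blocks as a stand-in for the target function f; the prefix and suffix sets are the ones from the example above.

```python
import numpy as np

f = lambda x: wfa_eval(alpha, A, beta, x)   # the function whose Hankel matrix we index

prefixes = ["a", "b"]                       # the set P from the example
suffixes = ["", "a", "b"]                   # the set S (empty string written as "")

def hankel_block(f, prefixes, suffixes, symbol=""):
    """H[p, s] = f(p + symbol + s): symbol="" gives the plain block H_{P,S},
    symbol=sigma gives the shifted block H_sigma."""
    return np.array([[f(p + symbol + s) for s in suffixes] for p in prefixes])

H   = hankel_block(f, prefixes, suffixes)        # rows indexed by a, b
H_a = hankel_block(f, prefixes, suffixes, "a")   # rows indexed by aa, ba
H_b = hankel_block(f, prefixes, suffixes, "b")   # rows indexed by ab, bb
print(H)
print(H_a)
```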
Now that we know how to take sub-blocks of the Hankel matrix, which is pretty intuitive (you take a set of rows and a set of columns, and you know how to translate the block by appending a given symbol to all the prefixes), we can go to one of the basic building blocks of this theory. It tells you that if you have finite blocks of this form satisfying certain assumptions, you can get the automaton back; by that I mean an automaton that computes the same function, since remember these things are only ever equivalent up to the change of basis Q.

So what are the assumptions? Suppose you have a function f of rank n, meaning there exists a minimal weighted automaton with n states computing it, and you have identified a set of prefixes P and a set of suffixes S that both contain the empty string (that's important, because otherwise you cannot recover the initial and final weights), and such that the block of the infinite Hankel matrix indexed by P and S has the same rank as the full matrix, namely n. Remember the full matrix has rank at most n; you can actually show that when this is the case you can find blocks of size n by n with rank n, but the block doesn't need to be n by n, it can be any block, as long as it has this full rank. Obviously no sub-block can have rank bigger than n, since then the whole matrix would have rank more than n; what we need is a sub-block that achieves rank n.

When that's the case, you can construct a weighted automaton computing f as follows. Take the finite block H and compute a rank factorization H = P S; here you can use any rank factorization algorithm you like, QR, singular value decomposition, whatever. Because this matrix has rank n and you want a rank factorization, the intermediate dimension must be n, and the rank of each of the two factors equals the rank of the full block. Then, and maybe I should draw this so it makes more sense: you have your H, indexed by prefixes and suffixes, with the empty string among both the prefixes and the suffixes, and you've found this factorization H = P S with inner dimension n. The row of P indexed by the empty prefix gives you your initial weights alpha^T, the column of S indexed by the empty suffix gives you your final weights beta, and then I use the same formula I had before with pseudoinverses, but this time with finite matrices: A_sigma = P^+ H_sigma S^+, where H_sigma is the translated block obtained by appending sigma to all the prefixes in P. And I claim that what you get is a minimal weighted automaton computing f; to prove this I only need to show that it computes f, because it has n states, so it will be minimal by assumption.
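Here is a sketch of this reconstruction on the toy automaton from before (it assumes wfa_eval, hankel_block, alpha, A, beta from the earlier blocks are in scope). The prefix and suffix sets are an ad hoc choice that happens to give a full-rank block for that example, which the assert checks.

```python
import numpy as np

f = lambda x: wfa_eval(alpha, A, beta, x)
n = 2                                            # the (assumed) rank of f

# Both sets contain the empty string, and the block they index has rank n.
prefixes = ["", "a", "b"]
suffixes = ["", "a", "b"]
H = hankel_block(f, prefixes, suffixes)
assert np.linalg.matrix_rank(H) == n

# Any rank factorization H = P S works; this one comes from the SVD.
U, d, Vt = np.linalg.svd(H)
P = U[:, :n] * d[:n]                             # |prefixes| x n
S = Vt[:n, :]                                    # n x |suffixes|

# alpha^T is the row of P at the empty prefix, beta the column of S at the
# empty suffix, and A_sigma = P^+ H_sigma S^+.
alpha_hat = P[prefixes.index(""), :]
beta_hat  = S[:, suffixes.index("")]
A_hat = {s: np.linalg.pinv(P) @ hankel_block(f, prefixes, suffixes, s) @ np.linalg.pinv(S)
         for s in "ab"}

# Different weights, same function.
for x in ["", "a", "ab", "baab"]:
    print(f(x), wfa_eval(alpha_hat, A_hat, beta_hat, x))
```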
The proof is relatively easy. We know f is rational and admits a minimal weighted automaton with n states, so let A-tilde be such an automaton. If I can show that the automaton A constructed by the algorithm and A-tilde are related by an invertible Q, then they compute the same function and I'm done. How do you do this? I showed you that an automaton A-tilde induces a factorization of the full Hankel matrix, which I can also restrict to any sub-block: using A-tilde I get a factorization of H as P-tilde S-tilde, and similarly a factorization of H_sigma as P-tilde A-tilde_sigma S-tilde. Now, what did the algorithm do? It set A_sigma = P^+ H_sigma S^+, so I plug the expression for H_sigma in there and get A_sigma = P^+ P-tilde A-tilde_sigma S-tilde S^+. That basically tells me that if I'm looking for something of the form Q^{-1} A-tilde_sigma Q, maybe Q should be S-tilde S^+. Then you go and check, using the properties of the pseudoinverse, that this Q satisfies what it must satisfy and that its inverse is actually P^+ P-tilde (I think that's the trickiest calculation), and you're done. So, just by factorizing the Hankel matrix of something that could be computed by a weighted automaton, I got back a weighted automaton equivalent to the one I started from.

Now, if we want to design learning algorithms based on this, what we'll need to do is figure out what the right Hankel matrix is, estimate it, and then apply this reconstruction algorithm, but being a bit careful, because here I assumed I had the exact Hankel matrix, the exact rank, and so on, and when you do this from data you'll have noise, so you'll have to account for it and see whether the algorithm can be made robust to perturbations. This is just a diagram of what I said: all the learning algorithms based on what we call the Hankel trick work like this: take data, estimate a Hankel matrix, and get a weighted automaton back from it. That sounds very straightforward; obviously, when you go into the details it can get tricky. So what I'll do in the rest of the talk is basically iterate this diagram several times and show you how to use this paradigm in different settings, where "different" essentially means different kinds of data.

For example, sometimes we're trying to learn stochastic automata, automata that compute probability distributions, and in that case it turns out that most of the time estimating the Hankel matrix is just counting: you count how many times things were observed in the data, put that in the Hankel matrix, and get back an automaton computing these probabilities. In more general cases you'll have to do something from statistical learning theory, like empirical risk minimization. And when we design these algorithms, we know the rank of the matrix is related to the number of states of the weighted automaton, so if we want to prevent overfitting, if we want to ensure generalization, we'd like these matrices to have low rank, and we'll need a way to enforce low rank when building the Hankel matrices. Whenever we do that we want to work with finite Hankel matrices, so we'll have some parameters, namely the rows and columns of the Hankel sub-block we look at, which you typically have to choose using a bit of knowledge about your application. And once we have the Hankel matrix, we just apply the algorithm I showed you, but we must make sure we can make it robust to noise.
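Going back to the stochastic case for a moment, the counting step can be as simple as the following sketch. The sample is a made-up placeholder, and this particular estimator targets H[p, s] = f(ps) when f is the probability of generating a whole string; other variants count prefixes or substrings instead.

```python
import numpy as np
from collections import Counter

# Hypothetical i.i.d. sample of strings drawn from the target distribution f.
sample = ["ab", "a", "ab", "b", "aab", "ab", "", "a"]
counts = Counter(sample)
N = len(sample)

def f_hat(x):
    """Empirical estimate of f(x): the relative frequency of x in the sample."""
    return counts[x] / N

prefixes = ["", "a", "b"]
suffixes = ["", "a", "b"]
H_hat = np.array([[f_hat(p + s) for s in suffixes] for p in prefixes])
print(H_hat)
```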
So how do we measure this robustness to noise? By the way, any questions so far? Things are going to get slightly technical now, but don't worry, it peaks and then comes back down again.

We want to make these algorithms robust to noise, so we need to talk about approximations: approximations of these functions f and approximations of these automata. Approximations and norms go hand in hand, so let's define some norms, starting with weighted automata. If I have a weighted automaton A and a pair of numbers p and q between 1 and infinity that are Hölder conjugates, meaning they satisfy 1/p + 1/q = 1, then the (p, q)-norm of A is the maximum of the p-norm of the initial weights, the q-norm of the final weights, and the maximum over sigma of the q-norms of the transition matrices, where the q-norm of a matrix is the induced q-norm: the supremum of ||M v||_q over all vectors v with ||v||_q at most one. You'll see why this is a good definition. You could come up with many ways to measure the size of an automaton; here I'm measuring the size of the weights in a way that is compatible with the tools I want to use to derive the bounds, but it's not so unnatural: for example, if you have a probabilistic automaton, so the weights are initial probabilities, stopping probabilities, and transition probabilities, you can show that with p = 1 and q = infinity this norm is exactly 1. So there are settings of p and q that play nicely with semantic restrictions on what the automaton does; it's not a completely arbitrary way of defining a norm.
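A small sketch of this norm. It relies on the fact that NumPy's matrix norm with ord in {1, 2, inf} is the corresponding induced norm; the probabilistic automaton below is made up, with stopping probabilities in beta so that the outgoing weights of each state sum to one.

```python
import numpy as np

def pq_norm(alpha, A, beta, p, q):
    """||A||_{p,q} = max(||alpha||_p, ||beta||_q, max_sigma ||A_sigma||_q),
    with the induced q-norm on the transition matrices (q in {1, 2, inf})."""
    return max([np.linalg.norm(alpha, p), np.linalg.norm(beta, q)]
               + [np.linalg.norm(M, q) for M in A.values()])

# A small probabilistic automaton: initial distribution, stopping probabilities,
# and transition/emission probabilities summing to one in each state.
alpha_p = np.array([0.7, 0.3])
beta_p  = np.array([0.2, 0.1])
A_p = {"a": np.array([[0.3, 0.2], [0.4, 0.1]]),
       "b": np.array([[0.1, 0.2], [0.2, 0.2]])}

print(pq_norm(alpha_p, A_p, beta_p, 1, np.inf))   # exactly 1.0, as claimed
```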
I introduced these norms because I want to talk about approximations, and I'll give you two approximation results. This is where things get slightly technical, but these results will let me argue that the algorithms we're going to derive are robust to noise. The first approximation result is about the following problem. Suppose you have two weighted automata A and A', both with the same number of states n, and assume that the (p, q)-norms of both automata are bounded by some constant rho, and also, importantly, that the distance between the two automata with respect to that norm, that is, the maximum of the p-norm distance between the initial weights, the q-norm distance between the final weights, and the induced q-norm distances between the transition matrices, is bounded by delta. So I have two automata, and if delta is small they're close to each other: the weights are not too far apart. What I'd like to say is that if the weights are not too different, the functions they compute are not too different. There are many ways to do this; the one I'm showing is the one that fits on one slide, which means it's obviously not the most sophisticated one and not the one giving the tightest bounds, but all it's meant to do is convince you that we can prove this sort of thing: if the weights are close, the languages are close. Here's the statement: for any string x, the value computed by A on x and the value computed by A' on x differ by at most (|x| + 2) rho^(|x| + 1) delta.

The fact that this bound grows with the length of the string can be a bit unsettling at first, and in fact it does grow, but remember that, for example, with p = 1 and q = infinity and a probabilistic automaton I can put rho = 1 here, which makes the exponential factor go away; so if the weights are controlled, if longer strings don't make things grow too much, which is what happens with probabilities, you're fine. The other thing is that if you make delta small, the bound becomes small; that's the good thing happening here, it's a kind of continuity property.

How do you prove this? By induction on the length of x. Remember the value on x is the initial weights times the product of transition matrices times the final weights, for each automaton. First I bound the difference between the products of transition matrices in q-norm: I claim that ||A_x - A'_x||_q is at most |x| rho^(|x| - 1) delta, and you prove this by induction. If x has length zero there's nothing to do: both products are the identity, the difference is zero, and the bound is obviously satisfied. For a string x followed by a symbol sigma, I add and subtract a cross term between the two automata, the product of transition matrices of the first automaton on x times the transition matrix of the second automaton on sigma, and then apply the triangle inequality and submultiplicativity of these norms (submultiplicativity means the norm of a product of two matrices is bounded by the product of their norms, which induced norms always satisfy). One term is what I'm bounding by induction, another I bound by rho, another by rho^|x|, and the last one by delta, by assumption, and you get the bound you wanted. Then you do the full thing, which is the same kind of trick again: take alpha^T A_x beta and alpha'^T A'_x beta', add and subtract cross terms between the two automata, apply the triangle inequality, and so on. At some point you have to apply Hölder's inequality, which, let's recall, says that if you have something of the form of an inner product, alpha^T beta, and p and q are conjugate, you can always upper bound it by the p-norm of alpha times the q-norm of beta; that's what I'm applying here. So that's all you need for this proof: a little bit of algebra, the add-and-subtract trick we all know, Hölder's inequality, and submultiplicativity of these norms, and you get the bound.
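As a sanity check of the bound (not a proof), one can perturb the probabilistic automaton from the previous sketch by a small amount and compare the change in the computed value against (|x| + 2) rho^(|x| + 1) delta. This assumes wfa_eval and pq_norm from the earlier blocks are in scope.

```python
import numpy as np

rng = np.random.default_rng(2)
eps = 1e-3

# Perturb every weight of (alpha_p, A_p, beta_p) by at most eps.
alpha_n = alpha_p + rng.uniform(-eps, eps, 2)
beta_n  = beta_p  + rng.uniform(-eps, eps, 2)
A_n = {s: M + rng.uniform(-eps, eps, M.shape) for s, M in A_p.items()}

# delta is the distance between the two automata in the (1, inf) sense.
delta = max([np.linalg.norm(alpha_p - alpha_n, 1),
             np.linalg.norm(beta_p - beta_n, np.inf)]
            + [np.linalg.norm(A_p[s] - A_n[s], np.inf) for s in A_p])
rho = max(pq_norm(alpha_p, A_p, beta_p, 1, np.inf),
          pq_norm(alpha_n, A_n, beta_n, 1, np.inf))

x = "abab"
diff  = abs(wfa_eval(alpha_p, A_p, beta_p, x) - wfa_eval(alpha_n, A_n, beta_n, x))
bound = (len(x) + 2) * rho ** (len(x) + 1) * delta
print(diff, bound, diff <= bound)
```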
And as I said, you can get fancier things: for example, if you make further assumptions on your automaton and, instead of bounding things for a single string, you bound the sum of this quantity over all strings of a fixed length, you can set up big telescoping arguments that give you nice cancellations and so on, but that doesn't fit on one slide; you can look it up in my PhD thesis if you're interested.

Okay, so we now know that if you change the weights of the automaton a little, the language doesn't change too much. That's good: it means that if we want these learning algorithms to learn a particular automaton, then as long as we can make the reconstruction of the automaton robust to noise, so that we get weights close to what we were supposed to get, we'll get a good language, if the target is to learn the language. So how do we do this second step? It requires making the algorithm that takes a Hankel matrix and recovers a weighted automaton robust to noise, and since that algorithm uses a rank factorization, the way to do it is to pick a rank factorization that is robust to noise. Pretty intuitively, that decomposition is the singular value decomposition.

Who has seen the SVD before? Okay, roughly the same people that had seen the pseudoinverse, which makes sense because they're closely related. Any matrix has a singular value decomposition: in the same way that some matrices have an eigendecomposition, all matrices have an SVD. If the matrix M is n by m and has rank k, the SVD is a factorization M = U D V^T where the intermediate dimensions equal the rank of the matrix: D is a k by k diagonal matrix containing the singular values, which are k non-negative numbers sorted along the diagonal, and U and V contain what are called singular vectors. These are orthonormal, so U^T U is the identity and V^T V is the identity; U contains k left singular vectors of dimension n, and V contains k right singular vectors of dimension m. Another way to write this is that M is the sum, for i from 1 to k, of rank-one matrices obtained by multiplying the i-th left singular vector with the i-th right singular vector, scaled by the i-th singular value.

In the algorithm I showed you for going from a Hankel matrix to an automaton, you needed to pick a rank factorization, and the SVD gives you several; one convenient choice is to split the D in the middle between the two factors, and if you put a Hankel matrix here, that gives you the factorization into P and S that you can use in the algorithm. As I said, this is related to the pseudoinverse: if you know the SVD of M, you can compute the pseudoinverse M^+ by just moving these matrices to the other side, with some transpositions, and inverting D, which is very easy because D is diagonal, so inverting it is just taking one over the singular values. This is essentially the simplest way to compute the pseudoinverse.
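A compact numerical recap of these SVD facts on a random matrix:

```python
import numpy as np

rng = np.random.default_rng(3)
M = rng.normal(size=(5, 4))

U, d, Vt = np.linalg.svd(M, full_matrices=False)      # M = U @ diag(d) @ Vt
print(np.allclose(M, U @ np.diag(d) @ Vt))
print(np.allclose(U.T @ U, np.eye(4)), np.allclose(Vt @ Vt.T, np.eye(4)))

# Equivalent rank-one expansion: M = sum_i d[i] * u_i v_i^T.
print(np.allclose(M, sum(d[i] * np.outer(U[:, i], Vt[i]) for i in range(len(d)))))

# A rank factorization M = P S obtained by absorbing D into the left factor.
P, S = U * d, Vt
print(np.allclose(M, P @ S))

# Pseudoinverse from the SVD: M^+ = V D^{-1} U^T.
print(np.allclose(np.linalg.pinv(M), Vt.T @ np.diag(1 / d) @ U.T))
```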
And this factorization is robust to noise. One way to see it is that the SVD provides optimal low-rank approximations, in the following sense. Suppose you have a matrix of rank k, where k might be large because, for example, M is a low-rank matrix plus noise, and you give me some k' that is smaller than k. Then there is a matrix M_{k'} of rank k' that is the best rank-k' approximation to M, and the way to build it is very simple: you take the sum of rank-one terms from the SVD and truncate it at k', discarding the last k - k' singular values and singular vectors. Stated formally, the property is that this M_{k'} minimizes, over all matrices of rank at most k', the approximation error to M in operator norm, and the same thing holds if you put the Frobenius norm there instead. This operator norm, by the way, is the induced 2-norm we've been using all along.
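Here is a small illustration of this truncation on a low-rank matrix plus noise, which is exactly the situation we are about to face with estimated Hankel blocks:

```python
import numpy as np

rng = np.random.default_rng(4)

# A rank-2 "clean" matrix plus a small full-rank noise term.
clean = rng.normal(size=(6, 2)) @ rng.normal(size=(2, 6))
noisy = clean + 0.01 * rng.normal(size=(6, 6))
print(np.linalg.matrix_rank(noisy))                # the noise blows the rank up to 6

def truncated_svd(M, k):
    """Best rank-k approximation of M in operator (and Frobenius) norm."""
    U, d, Vt = np.linalg.svd(M, full_matrices=False)
    return (U[:, :k] * d[:k]) @ Vt[:k, :]

approx = truncated_svd(noisy, 2)
print(np.linalg.norm(noisy - approx, 2))           # small: only noise is discarded
print(np.linalg.norm(approx - clean, 2))           # close to the clean matrix
```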
Then you can show, and this definitely doesn't fit in one slide, so this is probably the most important result that I'm not going to prove, that for any pair of Hölder conjugates p and q, the bound here is big O of delta. This big O hides a lot of constants: constants that depend on the dimensions and sizes of these objects, constants that depend on things like the singular values of H, and many other things. But the conceptually important point is that as H-hat gets closer to H, that is, as delta gets smaller, this error gets smaller too: a better approximation of the Hankel matrix gives a better approximation of the weights of the automaton, which is quite intuitive. These are the results we use to build learning algorithms.

[In response to a question about the number of states:] Yes, that would be a parameter. You could hope to prove approximation results there as well; for example, if you run this with, say, n minus 1, so instead of a full rank-n SVD you do a rank n minus 1 SVD, you can also prove approximation results of that type. In practice what you also do is tune this by cross-validation: you try several values of the parameter and see which one performs best on your data.
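For instance, a hypothetical model-selection loop along those lines might look like this (a sketch only; wa_value evaluates a weighted automaton on a string, spectral_reconstruct is the sketch above, and the use of held-out squared error against some validation targets is my assumption, not something specified in the lecture):

    import numpy as np

    def wa_value(alpha, A, beta, x):
        """Value alpha^T A_{x_1} ... A_{x_T} beta of a weighted automaton on
        a string x (a sequence of symbols)."""
        v = alpha
        for sigma in x:
            v = v @ A[sigma]
        return float(v @ beta)

    def pick_rank_by_validation(H_hat, H_sigma_hat, val_pairs, ranks):
        """Try several numbers of states and keep the automaton with the
        smallest squared error on held-out (string, target) pairs."""
        best, best_err = None, np.inf
        for n in ranks:
            # spectral_reconstruct is the sketch shown earlier in this section
            alpha, A, beta = spectral_reconstruct(H_hat, H_sigma_hat, n)
            err = sum((wa_value(alpha, A, beta, x) - t) ** 2
                      for x, t in val_pairs)
            if err < best_err:
                best, best_err = (alpha, A, beta, n), err
        return best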
Okay, questions? Maybe not; it's not quite time for a break yet. So let's briefly recap what we've done so far. We've seen that if we want to represent functions on sequential data in a finite, compact way, we can use weighted automata; weighted automata are very closely related to Hankel matrices, and there is a way to go from Hankel matrices, or sub-blocks of them, back to the automaton. We hope that we can estimate these matrices from data, and if we do that and there is a little bit of noise, we know the reconstruction algorithms tolerate it, so the noise propagates through our approximations gracefully. So all that is left to do, in a sense, is to estimate these Hankel matrices in such a way that the noise becomes small as we get more and more data, and that is what I'm going to do in the next two sections, in two different settings: the PAC setting, the computational learning setting that you saw in Ben's lecture, and the statistical learning setting, which is the non-realizable case that you saw in Barron's lecture.

The first setting is interesting for the following reason. Ben told you about PAC learning, and in PAC learning you are usually trying to learn a concept or a function that assigns some y's to some x's; you have pairs where the x's are drawn from some distribution and the labels y are produced by the object you're trying to learn. That means that in usual PAC learning there are two things going on: you have the f and you have the distribution. For automata this is bad, because the negative results I showed you at the beginning are proved exactly by showing that, no matter which automaton you pick, I can always find a bad distribution that will give you examples that don't contain enough information for you to learn properly. The way around this is to try to learn distributions computed by automata, in which case the thing you're trying to learn and the distribution generating the data are the same object, so there is no room for an adversary to pick a distribution that tries to fool you. This is how some of the results I mentioned, the ones that work in particular cases, manage to prove learning guarantees that are efficient both in sample complexity and computationally: by removing this dissociation between what you sample from and what you're trying to learn.

So the type of results I'm going to prove are about learning probability distributions on strings computed by weighted automata, or particular classes of weighted automata. Usually you say: I have a function f from strings to real numbers that computes probabilities, so the values are between 0 and 1, and there are a few settings you can consider. The two typical ones are these. First, f is a stochastic language, meaning it is a probability distribution over all strings, so summing f(x) over all x gives 1; in this case you can assume you can sample strings from this distribution and use them as your data. Something slightly different that also appears in applications is dynamical systems: there you could potentially generate strings of infinite length, because the system evolves over time and keeps going until somebody stops it, and we are not trying to model the stopping time. In that case f gives you a probability distribution over strings of each fixed length, so for any t greater than 0, summing over all strings of length t gives a distribution, but you have one distribution for each t; that means I can sample a prefix and then keep sampling further symbols, the process doesn't stop. The first setting is more common in, say, natural language processing, and the second is more common in reinforcement learning, robotics, time-series analysis and so on. I'm going to state the results specifically for the first case, but you can extend every last one of them to the second case with minor tweaks, so that's fine.

Okay, so how do we estimate the Hankel matrix from this type of data? The first thing you can do is completely obvious: you give me a sample x_1, ..., x_m containing m strings sampled i.i.d. from a stochastic language f, which is a probability distribution over Sigma-star; I pick a set of prefixes and suffixes, and I estimate a Hankel block whose entry at prefix p and suffix s is an approximate value, computed on the sample, for the concatenation ps. This approximation is just an empirical count: for a string x, I count how many times x appears in my data and normalize by m, which is the empirical frequency of the string x in the sample.
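In code, that naive estimator is just a normalized count over the sample; a minimal sketch (the function and argument names are mine):

    import numpy as np
    from collections import Counter

    def empirical_hankel(sample, prefixes, suffixes):
        """Entry (p, s) is the empirical frequency of the full string p + s:
        (number of sample strings equal to p + s) / m."""
        m = len(sample)
        counts = Counter(sample)
        return np.array([[counts[p + s] / m for s in suffixes]
                         for p in prefixes])

    # e.g. empirical_hankel(["ab", "a", "ab", "b"], ["", "a"], ["", "b"])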
This is straightforward, and it is very easy to show that it is unbiased and consistent: as I get more data, it converges to the true H, or to the sub-block of the Hankel matrix I'm estimating, which is the expectation of this quantity under sampling from the distribution. But depending on how you choose the prefixes and suffixes of your Hankel matrix, this can be data-inefficient. What I mean is that if I take P and S that are too small, I'm going to ignore some of the data in my sample. For example, with this Hankel block, this string does not appear in it, so when I do my counting it gets ignored; the same happens for this string, and this one, and many others. That's unsatisfactory. What could you do? You could take a bigger set of prefixes and suffixes, but that might not be satisfactory either, because there is a trade-off between ignoring data and filling the matrix with entries estimated from too little data. If I grow the sets of prefixes and suffixes very large, at some point I'll have a bunch of zeros, and are those zeros there because the distribution really puts no mass there, or because I didn't have enough data to observe these longer prefixes and suffixes? It might be that the probability of generating strings of a certain length decreases with the length, so unless I get more samples I will never see those probabilities, and I'll be putting zeros in my matrix that shouldn't be zeros. So either you introduce more noise into your matrix, or you ignore data, and we want a way around this.

[In response to a question:] Well, it's not clear what the neighborhood would be in this case. You can do lots of things; I'm just describing the basic method here, and people have done many variations. You could, for example, label each entry with the variance, or the uncertainty, that you estimate for it, and then come up with fancier decomposition methods that take those uncertainties into account. What I'm going to show you are the things that play well with the automaton representation we are assuming for the f we're trying to learn.

One of them is the following observation: I'm estimating these counts from full sequences, and that's why, for example, this string gets ignored. But if, instead of estimating from the full string, I were estimating from prefixes, then this string has prefix aa, and aa does appear here. So is there a way to put here something that depends not on observing the string aa and nothing else, but on observing the prefix aa? That is what you do when you estimate this thing from prefixes, which is one way to make it slightly more data-efficient. Under the same assumptions and with the same type of sample, instead of the f-hat I had before I now build an f-bar, and this f-bar on an input string x computes the number of times x appeared as a prefix of a string in my sample, normalized by m, so now I will be ignoring far fewer strings.
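A sketch of that prefix-based estimator, under the same naming assumptions as before:

    import numpy as np

    def prefix_hankel(sample, prefixes, suffixes):
        """Entry (p, s) is f_bar(p + s): the fraction of sample strings that
        start with p + s, so a string like 'aab' now contributes to the
        entries for '', 'a', 'aa' and 'aab'."""
        m = len(sample)

        def f_bar(x):
            return sum(1 for w in sample if w.startswith(x)) / m

        return np.array([[f_bar(p + s) for s in suffixes] for p in prefixes])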
And you can actually show that if your f is computed by an automaton, then this estimator also gives you statistics about the automaton, about the stochastic automaton that generates the distribution. The way to see it is to first compute the expectation of this quantity: the expectation of f-bar(x) is the sum over all suffixes y of f(xy), which is the probability, under the distribution f, of generating a string that starts with x. That is fairly immediate, since what I'm counting is strings that begin with x. The nice thing is that if f is computed by a weighted automaton, which in this case we usually call a stochastic automaton because it defines a stochastic language, then when you expand this expression in terms of the automaton you can show that these counts come from a related automaton. Here is how: the probability of generating a string that begins with x is the sum over all suffixes y of f(xy); I expand that in terms of the automaton, so it's the initial weights, the transition weights for x, the transition weights for y, and beta; I push the summation over y inside, so I have the part that depends on x and then this summation; and summing over all strings y is the same as summing, over all lengths t, the t-th power of the sum of the transition matrices over all symbols. So I get the sum over all t >= 0 of A^t times beta, where A is the sum of the A_sigma. And in the same way that, as I'm sure you all know, the sum of rho^t for t >= 0 is 1/(1 - rho) when it converges, here, if the series converges, the sum of the A^t is (I - A)^{-1}, which plays the same role. So, nicely, you can say that the counts I'm estimating come from a different automaton that has alpha as its initial weights, the same initial weights as before, and the same transition weights I'm interested in, so that's all fine; the only thing that changed is the final weights: instead of the original final weights beta of the automaton I'm trying to learn, I have a beta-bar related to the original weights by beta-bar = (I - A)^{-1} beta, and if, for example, I learn the A_sigma, I can compute this, because A is just the sum of the A_sigma. So from these counts, which look at your sample more thoroughly so to speak, you can still get back the automaton you're interested in.
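A tiny sketch of that transformation, plus a one-state sanity check (the numbers in the toy automaton are my own; the check just confirms that the prefix-count value for the empty prefix is 1, since every string starts with the empty prefix):

    import numpy as np

    def prefix_final_weights(A_sigma, beta):
        """beta_bar = (I - A)^{-1} beta with A = sum_sigma A_sigma, assuming
        the geometric series sum_t A^t converges."""
        A = sum(A_sigma.values())
        n = A.shape[0]
        return np.linalg.solve(np.eye(n) - A, beta)

    # One state: emit 'a' w.p. 0.3, 'b' w.p. 0.2, stop w.p. 0.5.
    A_sigma = {"a": np.array([[0.3]]), "b": np.array([[0.2]])}
    beta = np.array([0.5])
    print(prefix_final_weights(A_sigma, beta))   # [1.0]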
Now you're going to say this is still data-inefficient, because if I have very long strings in my data and I'm only looking at prefixes, then I may be looking at every string, but I only look at an initial piece of each one, and that still makes the method somewhat data-inefficient. You're right, and we can do something even better: instead of looking at prefixes, we can look at substrings. This gets slightly more tricky, so maybe after this it will be time for a break; that's your incentive to bear with me for the next five minutes. How do we do this with substrings? Again you give me a sample, and now the Hankel entry I estimate, for a string x that is p times s, is something I denote f-tilde, with tildes to distinguish it from the others. It is again a count: I average, over all strings x_i in my sample, the quantity |x_i|_x, by which I mean the number of times x appears as a substring, a contiguous sequence of symbols, of x_i, which I can write using indicator-variable notation. So now I'm definitely using all of my sample: for each x that indexes an entry of my Hankel matrix, I go through all the strings in my sample and count how many times x appears as a substring. There's no way of using the data more than this.

The good thing is that, again, you can play the same trick and show that this comes from a transformation of the original automaton, so you're learning something that is actually related to your automaton. That matters, because not just any counting scheme works: you could instead count not how many times x appears as a substring, but merely the frequency of strings in which x appears as a substring somewhere, and that other quantity, even though it looks similar, is not related to the automaton at all. So this is something you need to be careful about when you do these derivations. And again, when you take the expectation of this, you get what you'd expect: the expected number of times x appears as a substring of a string y drawn from the distribution, which you can rewrite as the number of times x appears as a substring of y, times the probability of sampling y, summed over y. Using that representation you can again derive this from your original weighted automaton, where now you have to sum over prefixes and suffixes at the same time, so instead of just a beta-bar you get both an alpha-bar and a beta-bar.

Okay, so this is the most data-efficient way, when you're dealing with m i.i.d. strings, to obtain a Hankel matrix that represents your data and then learn the automaton, and if you're really interested in the original initial and final weights, you can invert these transformations.
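A sketch of the substring-based estimator (again my own naming; the convention for the empty string, one occurrence per position, is an assumption):

    import numpy as np

    def substring_hankel(sample, prefixes, suffixes):
        """Entry (p, s) is f_tilde(p + s): the average over the sample of
        |w|_x, the number of (possibly overlapping) occurrences of x = p + s
        as a contiguous substring of w."""
        m = len(sample)

        def occurrences(x, w):
            if x == "":
                return len(w) + 1   # convention: one occurrence per position
            return sum(1 for i in range(len(w) - len(x) + 1)
                       if w[i:i + len(x)] == x)

        return np.array([[sum(occurrences(p + s, w) for w in sample) / m
                          for s in suffixes] for p in prefixes])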
But before the break I want to do one last thing. Remember I told you that you can have distributions over strings, or you can have a dynamical system that never stops. When you have a dynamical system that never stops, it may be that you don't have access to several trajectories of the system; you might just have one very long trajectory. In practice you can come up with all sorts of heuristics, for example chopping the trajectory into pieces so that they look like independent samples, but you can also do something principled here, which is to estimate the Hankel matrix from a single string. How do you do this? Again, the quantity you put in your Hankel matrix at index x depends on m, which is how long your trajectory has been going on, and it counts how many times x has appeared in the trajectory so far; it's pretty intuitive. What happens, again, is that if the dynamical system is computed by some weighted automaton, you can do the calculations and show that the expectations of the things you're estimating, which by the way are Cesàro averages, averages over time instead of across samples, are also related to your weighted automaton. I don't want to inflict more of this algebra on you; basically, the beta stays the same, the transition weights stay the same, and you get initial weights that depend, in expectation, on how many observations you have made. And if what you're dealing with has a genuinely probabilistic interpretation, say it's a probabilistic system, an HMM, or a probabilistic automaton without stopping probabilities, you can show that this converges to some sort of stationary distribution. So in most of the cases you can imagine, we can get estimates of the Hankel matrix that have the right expectation, so they're unbiased and consistent, and the remaining question is how to show that what you get from a certain number of samples is a good approximation of the Hankel matrix of the thing you're trying to learn, so that we can then invoke all these robustness-to-noise results. That's what we'll discuss right after the break.
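A sketch of that single-trajectory estimator (normalizing by the current trajectory length is my assumption about the exact convention):

    import numpy as np

    def trajectory_hankel(trajectory, prefixes, suffixes):
        """Entry (p, s) is the number of occurrences of x = p + s as a
        contiguous substring of the single observed trajectory, divided by
        the trajectory length so far (a time average, i.e. a Cesaro average)."""
        T = len(trajectory)

        def occurrences(x):
            if x == "":
                return T
            return sum(1 for i in range(T - len(x) + 1)
                       if trajectory[i:i + len(x)] == x)

        return np.array([[occurrences(p + s) / T for s in suffixes]
                         for p in prefixes])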
Info
Channel: Federated Logic Conference FLoC 2018
Views: 453
Rating: 5 out of 5
Id: IW2MBOVfrUM
Length: 83min 19sec (4999 seconds)
Published: Mon Jul 30 2018