Stephen H Muggleton: Inductive Logic Programming I

Video Statistics and Information

Video
Captions Word Cloud
Reddit Comments
Captions
inductive logic programming is really the intersection of various different fields and this is the way that it was formulated so it's the intersection of machine learning computational logic and programming but its roots go back now in twenty eighteen fifty years to the start of the PhD very important PhD of Gordon Plotkin who's well known in program semantics and and other areas of computational logic as well but his PhD was on doing induction within first order logic and he was inspired by some of the resolution theorem working there improving work that was going on that came out of Robinson's work so so I'm going to take youth I'm not going to take you through 50 years of research what I'm going to do is I'm going to make give you a kind of rapid progress in the first part of my tutorial through some of the key ideas possibly at the first twenty even thirty years of that and then fast forward for the remainder of the talk into things that are current okay so this field just to give you an idea of has been having international conferences one every year since 1991 that still goes on the next one if anybody was interested it is in is in Italy coming up this summer and so I as I say I'm not going to cover all of that the way that this is arranged is in two 90-minute blocks and I broke broken each of those into three parts each of which should take half an hour so in the first one this this part of the lecture or this one point one part I'll be giving you an introduction to to what - the the ideas behind inductive logic programming key ideas to do with generalization and some techniques that have been introduced as well as noted mathematical notions to do that that relate entailment to the notion of generalization I'm then going to progress through into as I said more recent stuff basically in the last five years in which there's a new framework that has started to be explored in a lot of detail and I'll motivate why that has become important what it deals with that the previous inductive logic programming framework didn't do and this framework has called meta interpretive learning I'll show you various different aspects of that which cover both the logical side of things but also end up with the probabilistic at the entry the probabilistic constructs the way that Bayesian inference is worked it I'll Kant has been worked in to Metro interpretive learning and although as we said this is logic day in fact Bayesian inference has up has played an important part at least since the 1990s in the formulation of the machine learning side of inductive logic programming so this is a rather recent part of that so I'm trying to give you as much detail of what new and interesting in the hope that you're drawn to this area if you're intending to look at it from a research perspective ok so the lecture material in case you want to get hold of it is on the web and the web address I've given here does anybody wants to look this up it's www I see AC UK till the SA chairman which of my initials flock ILP and then if you look in that directory you'll find lecture 1.1 etc there's also for each one of those lectures there are associated purrs so just to give you an idea for this part 1.1 the papers 0 1 and 0 2 from that site are there so I'll be if you want to refer back to the material and understand it in more detail because I'm just gonna in half an hour I can't give you more than a kind of glimpse idea of what what's going on these are inverse internment and pro goal and it's the extension paper to theory completion so take a look at those if you want to study further the material so what I'm going to do is I'm going to start with one of the key questions that that as I mentioned Gordon Plotkin's work addressed which is that the intersection area of understanding logic and and machine learning because the notion of generalization is is is at the heart of the idea of induction so for those of you unaware of of philosophical induction this is an idea from Aristotle where you can take specific facts you just take a set of individual statements and then you you generalize those statements you form a generalization of them so I'm going to give you some examples here so suppose we looked at statement a and statement B I could ask you now which do you think of a and B is more general Daffy Duck can fly or all ducks can fly so put up your hand if you think it's a half a hand there anybody for B everybody agrees Daffy Duck can fly or all ducks can fly is more general than Daffy Duck can fly ok now let's look at statement C and D which is more general Merrick lives in London or or merit lives in England so anybody who thinks that C is more general one there two three interesting for and D Merrick lives in England the rest of you okay so so I'm afraid the that although everybody was right with this first one most people were wrong with the second one actually for obscure reasons I'm just about to show you Merrick lives in London is a more general statement that married lives in England which is confusing to everybody when they look at this problem to begin with but in order to to give you an idea of how that comes about you have to start you have to actually study some of Plotkin's formulation of the notion of generalization and we're going to start off actually pre Plotkin with John Reynolds work on simple generalization and we start with a lot of the notion of a logical substitution which is a set of pairs of variables versus terms which tell you how to replace the set of variables in a formula f with corresponding terms here okay we write F theta where theta is this substitution is form by replacing all of those variables and then we can take the notion applied to atoms and we can introduce an ordering which Reynolds refers to a sub sumption this a greater than or equal to B and say that that's true if and only if there exists a subsea such that a theta is equal to B okay so this is a this is an idea of generalization if you like over atoms so an atom the atom a is a more general one than be just as our they are actually the other way around all ducks can fly is more general than a in this in circumstances that we can apply a substitution and get a an atom B then Plotkin generalized this to clauses so disjunctions of atomic formula universally quantified disjunctions of atomic formula and he says course c sub seems clause d similar to this or generalizing over the notion of atomic subsumption if and only if there exists a substitution theta such that C theta is a subset of D where we're taking clauses as being sets the disjunctions represent the set of the literals in that clause okay so but so these are both ways of describing a form of generalization known as Thetis absorption so let's have a look at a particular example the one that we looked at before we have to take the statement Daffy Duck can fly turn it into a logical formula that might be something like can fly the predicate can fly applied to the to the constant Daffy or ducks can fly so universally quantified can fly X I'm using kind of Prolog e-type style conventions of leaving out the Universal quantification over X here but just read that as for all X can fly X and then we can see that can fly X theta subsumes can fly Daffy because there exists a substitution theta is equal to X substitute for Daffy right so so we can take in this case this is support to the statement that all ducks can fly is more general in IE at theta subsumes the other statement right so that idea of theta sub sumption is a very much of a a syntactic notion if you like but it is in line if you think about those relationships with specific applications of the notion of entailment ie the relationship between of subsets relationship between the models of the formulate so in this more general setting we can say that C is more general than D if and only if C entails D and in fact if you look back at the Plotkin definition here this is a special case of of that formula so this is not true in the case that C and D a recursive clauses but otherwise it's it's consistent so plot can then generalize this notion to relative entailment where he brings in this notion that when you're learning you often have some background knowledge about the formulae that you looked at you're looking at and he says that you can generalize this simple generalization notion to that of relative entailment with respect to some background formula be if you say that C is more general the D with respect to B if and only if the conjunction of B and C entail D ok so B and C entails D is your notion of general general entailment okay so what where might this be useful the idea of relative entailment well let's look again at the second example the harder one and you can see where this kind of notion of background knowledge might be useful because when we write out these formula in logic we write merrick lives in london and it says lives merrick london and lives merrick england there's something that probably most of us in the room know which is that London and England are not simply symbols in some arbitrary signature but they themselves have some relationship which we know about it's our background knowledge which could be stated and that background knowledge might be stated in a formula like this for instance that anybody who lives in London lives in England okay that living in London implies something about living in England right so we cannot simply treat these two statements in the way that we would did before because clearly neither subsumes the other right because there's no substitution that takes you from lives marek England to merit London or vice versa however they are they are related statements there is some relationship between them but only if we take into account this background knowledge right and if we do so we can see that one of them implies the other with respect to the background knowledge okay so looking that at that now ask yourself which one implies the other and in fact with respect to this background knowledge if you know that Marek lives in London you can entail that Marek lives in England okay so that means that there's a relative entailment between the first statement and the second statement and that means that according to Plotkin relative generalization the first statement is more general than the second now that's confusing that confused everybody apart from a few in the audience but if you think about it in the sense that which is more informative of the two statements then it's clear that the first statement is more informative than the second it tells you more right it Olymp eliminates more models and so therefore it's it's a more general statement in that sense okay so we've already seen that the three elements that are used in that in all of the definitions of inductive logic or at least in the standard definition there are some non-standard generalization settings for ILP but in this one which comes initially from Plotkin we always consider there as being at least three elements okay we can think of background knowledge examples so background knowledge might be like the statement about London and England examples a set of unit clauses and hypotheses and if you put these into a logic programming framework or actually into a computational logic framework in general you find that you can you can set up a model of generalization which is based on Plotkin's relat relative generalization such that if you've given some examples and you you're provided with some background knowledge things you already know then you learning consists of formulating a hypothesis and adding it to the background knowledge such that it explains the examples okay so this is a this is clearly a form of machine learning because it involves generalization and it involves hypotheses and the generalizations allow you to predict outside of the initial sample of examples that you were given but it involves an element this B element which you won't find in any of the other forms of machine learning that are around that I know of okay so I'm not I'm saying that with a lot of knowledge of having looked at all of the different forms of machine learning this is really not this is one of the most important aspects that duct divides inductive logic programming as a form of machine learning from all others okay there are other aspects that divide it so one of these is it's a general way of formulating the notion of programming and programming by examples ie learning or synthesizing a program from examples it allows and support it's relationships in a way that you don't find in other forms of machine learning which are feature based and so and the the allowing of background knowledge also has a rather interesting property when you think about what we mean by the notion of learning okay if I told you that as an as a human being every time you learn you have to do so in the absence of all previous learning that you'd ever done you would find that very strange because learning allows you to learn more things over and above what you've learned already if you look at standard forms of machine learning they don't because they don't have a place for background knowledge you can't easily formulate a notion of continuous learning where you build our background knowledge which allows you to learn better the next thing that you're learning so even understanding for instance how to learn mathematics is a very weird thing if you think about it in most forms of machine learning in inductive logic programming it's completely natural because background knowledge can be augmented revised and used to learn further things so it's quite a it's quite a kind of natural notion of learning with respect to human learning right search and refinement these are the kind of key idea again a key idea once you've got to the the idea that the its fulfilling the relationship between B H and D that is the the goal of the learning the obvious question is how do you do the learning therefore this is the algorithmic question algorithmically how do we find H given B and E and there are many different answers to this okay so in general you're carrying out a search of clauses you might do this for instance from simple to complex you start considers considering simple hypotheses and you only consider more complex ones afterwards or you might go the other way around you might start with general concepts and then make them more specific the idea of refining a hypothesis successive refinement of hypothesis is again again one of the central bricks of of inductive logic programming introduced by Judy Shapiro at least 10 years after Plotkin's work completely in the context of logic programming in his PhD thesis under Dana Angwin at Harvard and there was a fantastic piece of work if you ever have a chance to read that the book of this it's really really worth reading because all of the notions are so clear and he managed to prove some very very general results just to give you an idea of what was quite of his central idea of refinement so this is the same well this is an adaptation of the notion of refinement within within programming languages so Shapiro's refinement works by considering formulae in this case let's start with the simplest possible formula false and then alter that formula in a variety of different ways so he has refinement rules one of those rules allows you to add a claw a clause with an additional item in it so you can go from false through here through two lives UV okay you can consider lives UV as a hypothesis if you then wanted to consider a more specific hypothesis because this says everywhere everybody lives everywhere then you might say okay well we'll say lives UV of lives WX so this is adding another literal into the clause but in this case because it's a lot a definite Clause we have to add it into the body but all of the variables are disjoint from the head another a form of refinement shown on the second arc here you say okay let's bind those two variables together make U and V the same and the third one we might bind not the variables together but bind them to a function symbol in the language or a constant in the language and Shapiro managed to show in his thesis that just with those three operations you can construct every definite Clause program using it that using a systematic search of that kind so yeah so there's a very beautiful and elegant theory that Shapiro has incorporated into his Mis system which has also been used in a nuttin countless other systems including Ross Quinlan's file system the search in all of those systems which is based on refinement has a property which is that it's open-ended okay so it's never quite clear how far down in the subsumption lattice the lattice of subsuming formula you should apply before stopping every time you and every time you consider another point in the space you need to test those clauses that you've formulated against all of the examples and it's an infinitely descending chain okay so it's an infinite hypothesis space there is no entered and in fact some of Plotkin's results show precisely that that you can always construct subsuming clauses as far down as you like so this was a weakness I I think in retrospect of the Shapiro model there was an alternative approach which was introduced back in night in the mid-1990s called inverting entailment which was somewhat different I mean if you notice carefully there's a relationship here possibly between the examples which you're constructing these formulae out out of and the function symbols and predicates symbols that you've been told her in the signature but there is no relationship there's no bounding relationship to the background knowledge and I told you background knowledge it's really essentially important within Plotkin's definitions of of the framework so here B and H entails e when you look at this or when when I was looking at this I suddenly started to realize that you could you could rearrange this formula okay so if I said B and not e entails not age so just by switching a and H around the entailment sign and negating them you could actually get an equivalent formula so this is true for every benh that the first one is true of but it has a different property because the problem with B and H in tells E is that we don't have any way of going backwards from E to H right so we know that we can use deductive inference to go forwards but the question that we're looking at in induction is how do we go backwards in an inverse inverse since Hellman we solve this problem by simply switching those around the formula and now we can use deduction to take the given elements background and examples and deductively derive every hypothesis that is put that is possible so B and E entails H gives you a formula for generating hypotheses okay and in in particular you can generate clauses and I managed to prove that there exists a most specific clause which all of the individual clausal hypotheses entail right and because they all entail it in fact all of them subsume it so you can turn this problem with background knowledge into a pure substantial refinement graph where you know that once you've considered the entailing hypotheses of bottom claws then you've considered the entire hypothesis space so it's a it's a way of using the logic if you like in order to restrict and simplify the problem the outcome is you get a bounded search with bottom at the at the base of this and the the empty empty Clause at the top and the subsuming relationship here holds right so whatever our hypothesis is it subsumes h and h subsumes bottom so it makes it a much more contained search and the effect of that when it was used on the first kind of hard problem in trying to identify patterns within molecules that were mutagenic cancer-causing happens is that you've got a dramatic reduction in the size of the search so without the bottom clause you've got 3 times 10 to the 7 clauses that you have to consider once you use the bottom clause it reduces that by a factor of 10,000 and in the end you can look at 2500 clauses is enough to to consider everything that's worth considering right the all of the other clauses actually don't even entail the examples right so they are they're unrelated to the problem so it's a way of zeroing in on which of the parts of the hypothesis space that are really worth considering okay so that's the I'm finishing off this first part and this is a summary of what I hope you've got an idea of logical entailment gives you a general framework for the notion of generalization refinement provides a mechanism for search through the space of those generalizations that's so the first thing is what can I do the second thing is Shapiro's idea inverse entailment is a model theoretic approach to I hope you based on algebraic transformations of the logical constraint so it cuts down the space and then there's a system which incorporated this the pro goal system which uses admissible search and is efficient because it's it's supports finite interval search so progo is still used as a system though it dates back to the 1990s are still widely used and probably the most cited of the ILP systems and immediately it showed this new approach showed that you could search spaces that previously were in if the search techniques were ineffective in handling okay so we're more or less on schedule if anybody has a short question I'll answer it at this point No okay well then I'll move on to the next part of the tutorial so one point two okay so so that's it there's a nice story there of a development over maybe 20 years of techniques in inductive logic programming but there was a kind of shock wave hit the field in in 2011 when people had found certain types of problems that really were not being handled properly within the Plotkin framework or the initial Plotkin framework and an example of this appeared when people started looking at simple grammar learning to learn automata our finite or context-free grammars and the paper here which is an attempt to approach that comes from 2014 so not not that long ago and so okay if we look if we look at what inspired inductive logic programming really the logic programming movement the kovarsky and others introduced back in the 1980s we saw as an exciting thing because it showed that you could treat programs as logical forms of inference and you could formulate programs in in a clean and clear way when when thinking about how to do how to machine learn such programs the this topic of inductive logic programming started off with a workshop in 1991 that I I ran and the thing that I think inspired most of the people who were there was this idea that if we start with a general-purpose programming framework we could do machine learning not just on decision trees or conjunctions of formulae we could machine learn arbitrary programs we could think of this as being a tool in the toolbox of computer science that allows you to go in programming and build any program that you like out of examples and in fact a lot of the early stuff that you see in terms of examples you see people including my own papers we're looking at how do you learn quicksort for instance okay so we're actually inspired by the the issues in programming but when we had progressed for some time we found that various difference simplifications have been made and they were you know there were ways of making systems run more efficiently and when assessed in 2011 there was a meeting or was a panel where people have discussed what the limitations were and there were various basic things that we really couldn't find being done in any of the ILP systems despite the original intentions and in particular this idea of predicate invention and the idea of recursion came out as two things which really were not being handled properly and in particular so what is predicate invention well predicate invention from a programming perspective this is really central this is the notion that when I'm writing a function f right I would like to be able not to just to write it as one huge function description I would like to break it into parts and once I've broken it into F 1 and F 2 and some function between them then I'll break f1 and f2 each one of those individually into F 1 point 1 and F 1 point 2 and so on so I'll break out recursively divide and and simplify the problem by introducing and subdividing the problem ok so completely central to to their idea of programming that we shouldn't we should do programming in this in this way that breaks out individual sub programs and predicate invention which has introduced the late-1980s where there were methods that were there those methods really hadn't been efficient enough to be used for hard problems similarly recursion was giving all sorts of problems and people didn't quite see what was going wrong if you trace it back you find actually it's to do with limitations of Plotkin's definition of generalized of logical generalization with respect to background knowledge but to see that you'd have to read through quite a lot of detail of that thesis to understand but both of these were impeding the systems that existed so to look at that problem students and colleagues of mine work together on a specific problem and what we were looking at the starting point can be summarized in this example here so imagine you're trying to learn a finite state acceptor okay so something simple like this so this is a parity acceptor so it's got two states Q 0 Q 1 you come in on the start state Q 0 you can accept as many zeros as you like if you get a 1 then you change to state Q 1 and you can't accept in state Q 1 unless you get another one and go back to Q 0 but in Q 1 you can accept as many zeros as you like so every string that is accepted must if you look at it have an even number of ones okay so unless the string has got an even number of ones then it's not acceptable so let's look here at some example so the empty string has an even number of ones so does the string 0 so does the string 1 1 0 0 101 okay if we write this down we can write this down in Prolog we can translate it into a set of productions and then we can translate those into something called a definite clause grammar and a definite Clause grammar is a recursive program like this which has the predicate symbols you can imagine just being named after the names of the states so you've got a definition for Q 0 keya zero accepts the empty string or something with a zero goes back to q0 q0 with a 1 goes to q1 so you notice with these clauses there's a direct relationship between each clause and an individual arc in the finite state acceptor so if I look at this one here that goes q1 1 to q0 that's here that's q1 1 to Q 0 okay and I can associate the kind of strings that are accepted by elements of that clause so here is 1 101 requires you to for instance have that particular clause now if I look at this from the perspective of 2011 there's various different aspects of this tiny simple little logic program that make it totally intractable and one of those is that it's not only recursive necessarily but it's mutually recursive and in fact there's a small literature you can see back from the 90s in saying we can't learn mutually recursive definitions so why well because to do the mutual recursion you actually have to invent a predicate and in this case q1 if you were just given examples of q0 you'd have to invent q1 as a state and then you'd have to consider how mutual recursion happens within it so it's got its own definition and you can't remove any parts of it without destroying the whole structure so it's a hard problem to learn from a pre 2011 perspective but take another look at this sorry no it was possible but with quite strong restrictions okay so the learning of recursion was typically done you had to have a base case example such that when you learn the recursion you could test it down to the point at which the base case accepted right if you just had one example you couldn't learn from that because you'd have to have at least two examples one containing the recursion and another containing the base case okay whereas here it seems quite natural that if I was given this example I could just with that example consider introduction of a new symbol and it's recursion but that should be within the hypothesis space of this generalization I mean if I think about it as an Intel ment relationship that should be there somewhere in the hypothesis space but it was not and it was because of a detail of something called C derivations in Plotkin's thesis that that people just haven't done that okay but what we did in this paper is we said let's take a different look let's look again at this and what we've got here and consider the fact that all of these clauses look really very similar what most of them do these last four all have pretty much the same structure right all that's really varying between those clauses is whether you've got a q0 or a q1 in the head q0 or q1 over here and whether you've got a 1 or a 0 in that column there and the rest of the structure is identical right why is that because we've represented this in chomsky normal form and in Chomsky normal form we can always represent every rule in a very simplified way it's either of this very this first type or that second type right so why can't we just tell the learner that's to begin with we say look somehow you gonna learn rules like that or we're gonna learn rules like that where QQ C and P those are the things that we'll select and the rest of it will just always have that structure okay if all we were trying to do was learn regular grammars this would be a useful thing to tell the learner okay now how do we tell it that okay so we want to tell it that these two things are the case the first problem with with inductive logic programming is it's all formulated and first-order logic so we can't tell it that information at all so we have to at least allow ourselves a second-order logic so that we can consider variables of that kind within the within the language of but that biases the learning but then what even if we had that how would we do it well in the case of a learning regular languages we might consider writing a parser such that the parser somehow incorporated these higher-order rules okay and this is such a positive response says well we're going to look we're going to parse things of this form Q empty empty or Q C X Y and then we've got some conditions on and those conditions are this is only in the case the Q is an acceptor and this is only in the case that a Delta relationship holds between these predicate symbols Q P and there's an and see now that Delta function is normally what's used in order to identify a particular finite state acceptor right you can tabulate it usually so suppose we can think of this as a tabulation and the rest of this was always the same okay so now what happens well if we if we were given a particular set of ground fatsia like q0 is an acceptor and Delta 1 goes Kia 0 0 Q 0 and so on where all of these relate to the transitions that we had the transition arcs then these two together these ground facts and that that passer are sufficient to accept exactly the parity language so now okay we if you if you can see that that actually works if you've got enough if you understand the logic of it then from the machine learning perspective we don't want to have to be told these facts we want to be able to guess them right so how do we guess those facts if we're not given them together well we've still given this general purpose regular grammar parser well we can do it using logical abduction is the answer if we just say abuse Delta 1 an acceptor we can incorporate a very powerful method from logic programming called abduction that works as long as the facts the only thing that you want to do is is grab it is construct from your examples a set of ground facts that are consistent with the examples so we've turned a problem of learning general rules in this case into a problem of learning some specific facts these specific facts though themselves have a kind of higher-order aspect to them because they're they're facts about predicate symbols ok so this is a delta one has arguments the predicate symbols Q 0 and Q 1 so these are meta facts of the some of them ok so the general idea is here you've got a pair background knowledge and examples you've got some kind of parser meta interpreter which has atomic background that can be given to it and the hypothesis is formulated as higher or existentially data vlog atoms such that BH entails e we need to make sure we can also handle negative example so that we can do the kind of general version of machine learning in this and in fact this whole framework is turns out to be consistent with inverse entailment a special case of inverse entailment involving essentially abduction of facts right so how does this work in practice so I'm given a set of examples like this I say pars this is the these are the positive examples you must be able to pass empty 1 1 0 1 1 etc these are negative examples don't pass any of these and then the negative examples can be written out in as as logical formula I can slightly short of time on this but there's various different properties which are consistent with the properties of regular grammars so we the definition of in centrally entailment so general entailment we can redefine that in terms of the MIR setting we produce a lattice again of the same kind that we do in in inductive logic programming we have a lattice of generality relations it turns out there is a unique top theory the most general theory which is unique and there's also a unique bottom theory and most specific theory in all cases in learning regular grammars so the question then is have we just done something very specific here right does this only really work for regular grammars and in the same paper we showed no actually it's very straightforward to extend the learning of regular grammars to the learning of context-free grammars in the same way and all that it requires is going back and looking at the general definitions of what a a context-free grammar is turning those into parse statements with different kinds of Delta functions and you've got a general-purpose way of learning context-free grammars now this to me was shocking because I actually looked at trying at the literature of learning context-free grammars and realized nobody had published anything although there's loads and loads of papers on how to learn automata there aren't any general-purpose ways of learning context-free grammars from examples and there are problems with doing so relating to some of gold' results but when you implement this in Prolog you get the most amazingly compact machine learning algorithms okay so here's a parser that one that I showed you before we introduced an abduction element into it a two line definition of how to do abduction using member and we give you a set of skolem functions at the bottom which are scrolling constants and then query a query to that program simply consists of the list of positive examples with a list of negative examples with not in front of them and you pass that query to prologue and Prolog comes back and says the answer is and it's a set of deltas okay so Prolog just just taking this little input this small amount of input comes back and gives you an answer and if you look carefully at the answer here Delta s 1 0 1 1 s 1 1 s 0 is the parity except it's found the parity accepted from that small number of examples if you give it any smaller than it'll actually not find it if you give it any more it'll find it right so if you give it a superset of these this is the simplest explanation of that set of examples I if I had prologue running here I would tell you but it basically it gives you an infinity of different answers this is the first one in the derivation tree that it finds negation as failure yeah yeah so you so couldn't be simpler it's logically makes sense you can give all with your positive and negative examples in one line as a query and it tells you the answer right it tells you there's a finite state machine and this does it right all done using the simplest of Prolog to do it okay so so this was very exciting the first time we did it we then thought okay we need to carry out some experiments to see how we can how well this approach works so we formulated something called meta call R for regular grammars and various different hypotheses for the experiments that you can't learn from randomly chosen regular languages you can't outperform our state-of-the-art ILP systems you can't learn randomly chosen context-free grammars so these are all null hypotheses so and and so on and then we ran experiments by randomly sampling hypotheses and example sets and we got a set of results which were interesting so meta goal R for instance here in terms of predictive accuracy outperforms one of the most recent state-of-the-art IRP systems there by a considerable degree I mean so this is topping our way before medical R is the looking at the time meta Goulart is taking hugely less time to do so when we look at the second hypotheses we're finding that with context-free grammars the situations even better we're learning meta Gould CF is continuously learning and improving its performance whereas MT top log is is actually degrading after some time the time again is completely way out here so these are you know imagine it's 60,000 milliseconds and this is really low for a medical CF and then we improved the performance even better by having a system medical our CF which simply start off by assuming that the grammar is regular and then if it finds there are no solutions it switches to being to looking for a context-free grammar and this does even better than either Metagross see our own medical CF so it outperforms it but again both in terms of speed and time which is which is quite satisfactory because you can exceed you can understand what's going on in terms of the sizes of hypotheses that are the size of the hypothesis space that's being that's being searched and you can currently you can look at that and explain it in terms of the Blumer bound which is what were that one of the most basic computational learning theory bounds for this so we were able to use the same approach for learning simple things like a you know a grammar for a small dog walks into the house and in this case you can see that interval noun phrase components verb phrase noun phrase and so on are being built by the context-free grammar learner and so this is related work there's quite a lot of related work by anyway who agura this is a big survey that was done in 2010 there's some learnability results of context-free grammar learning this is the closest to a general-purpose context-free grammar learner from 2000 Sakakibara is one in 2002 requires that you have past sentences rather than example rather than example sequence is given so and Langley and Strom Stone is a purely heuristic approach so there's no guarantees of convergence so summary and limitations metric interpretive learning looked like something interesting right this is something we haven't seen before it in our community yes good question we haven't tried very hard problems because we set off in a different direction after this but there are standard linguistic barriers I suppose to do with [Music] how do you for instance handle things like anaphora right so so language learning is pretty hard and those things that people know they are not handled by context-free grammars so that's one and agreement of verbs and all of that kind of stuff but as far as taking fragments of of English of English and learning those context-free fragments there's no reason not to do so that you are bound to to hit certain barriers in terms of the complexity of what can be learned and later things that I'll show you we know that there are limits as to the size of the theory that you can consider before the search becomes overwhelming okay so on so this this is a kind of ongoing story but so so what did we cover here so we've covered that interpretive learning it's theory and so on the grammar application the predicate invention and recursion the fact that this is a this is a form of declarative bias this idea that you can give these kind of general-purpose rules of the kind that I showed you for what for what a regular grammar consists of and we were left with a lot of questions at the end of it so for instance by using this kind of top-down theorem proving that you get for free and Prolog are you not giving yourself a headache shouldn't you be trying answer set programming if you're looking at grammar parsing shouldn't you also have some notion of chart parsing underneath to make it more efficient there are natural you know how what could we learn in terms of natural grammars which is your question what happens when you try to learn from nothing with no background knowledge and so on but we decided that in terms of our resources and efforts we wanted to see whether or not learning grammars was really was really at the core of what was going on with Metro interpretive learning or whether we should be looking at more general logics and that's what the next the next lecture is going to be about so the learning of madmen arrogance and dyadic logics high to recursive languages yes yes true so and I hope that this is going to address that because what we discovered when we looked outside of that the grammars is that there that universal Turing machines are an interesting generalization of what we were doing and so hopefully I'll show you some of that in this lecture okay so Soumitra interpretive learning of higher-order dyadic data log so this is the next step or this from our point of view is the next step at that point the paper if you want to take a look at it came out in machine learning journal back in 2015 and that's it they're the kind of central issue is this this representation of higher-order dyadic data log which is something that we came up with meaning of the obviously higher-order logic people have done a lot in that but for some reason people hadn't taken seriously the data log fragment of higher-order logic and in particular we thought we wanted to concentrate on the dyadic fragment for reasons that I hope will become clear and we wanted to really take a look at the issue of predicate invention which had come up in the grammars but is is prevalent in all kind of general-purpose programming issues same similar kind of motivation as last time but we're wanting to look at how to solve this general problem of predicate invention and recursion in inductive logic programming not just in grammar learning okay so we thought okay what what do we start with now let's take the standard you know introduction to Prolog one where you're shown how you can you can define a family relation as a logic program okay so imagine you've got this family tree here Jake and Matilda get married they have very stiff an offspring we marry and so on at the bottom we've got various different individuals okay so you can represent that family tree as a tree but you can represent similar information including generalizations of what's there in a target theory so particular facts might be things like Ted is the father of Bob and Ted's father a Jane but you might also want more general facts like every parent pair is a mother Paris Oh mother XY implies parent XY father XY implies parent XY you also might want to define recursive things in this setting so for instance ancestor is naturally defined recursively saying that every parent pair is an ancestor pair and similarly if X is the parent of Z and said is the parent of Y then X is the parent of Y okay so so as a look that we haven't yet defined what the learning problem is yeah but there's different ways that you might do this well one way is I give you some examples about parent I give you some examples about ancestor and then I maybe learn each one of those definitions separately and then I'm done okay so that's a kind of standard approach in machine learning and also in ILP but what we were interested in here was what about predicate and invention and recursion right so why not strip the whole thing down let's say I don't I'm not going to tell you about parents I'm just going to give you examples of ancestor here and I'm going to give you some I'm gonna give you the relationships of father ship mother ship etc from the family tree and then you have then the learner has to construct the whole rest of the theory so that means not only building a recursive definition of ancestor but also introducing the parent relationship okay so inventing that get made it won't give it the name parent little college as something arbitrary let's say that's happy with that but is it possible just from examples to do all of that because that's equivalent to what we were trying to do when we were learning the parity example okay because we had to learn recursive transitions and we had to introduce new States in this case parent is like a new state right when we do that we're going to have to think about the information that's represented we can think about it in first sort of terms so we have examples background knowledge and hypotheses which look like this but when um when a a Metron turpitude learner learns things it has these it basically considers various different higher-order facts and we'll call these meta substitutions okay so we'll have facts a bit like this but there'll be instances of particular meta rules okay so just as before we had instances if you remember of the transition function here we're going to have to have instances of general forms of rule okay so let me show you what I mean by that okay so here's a generalization to begin with of the part of the notion of a parser right if you generalize the notion of a parser to general purpose programming you get the notion of an interpreter right so a program interpreter in Prolog can be defined that in a couple of program clauses one tells you what happens in the base case when you're answering a query and you've got nothing left then you get the program I'm giving this I'm turning though this though into something which creates a program okay and has a given program so think about the arguments here the first argument to prove think of those as the examples that are being given the second argument is the background knowledge and the third argument is going to be the hypothesis now this hypothesis unlike the one I've shown you so far the third argument contains the second argument so it's an extension of the of the of the second argument but these are programs okay now when we try to prove examples the atoms here we do so not like prologue prologue goes off and tries to grab a claw from its claws base to prove it what this matter interpreter does it goes off and it gets a matter rule which has a general higher order form but it still has an atom in the head and the body to it and by when it when it matches the atom that comes in the example that comes in it unifies that with the head of the metal rule creating a partial meta substitution okay so it creates some substitution for predicate symbols and constants within that meta substitution it has associated this turned out to be a very important aspect that hadn't been necessary in the grammar learning you need to have some kind of ordering over the search and so that's represented by an order constraint which has to hold and having found the meta substitution you then have to abuse it or simply save that substitution into a substitution base and the substitution base it turns out is the revised program okay so you're taking a set of substitutions that come in and you're adding further substitutions into it so that your final program is simply a set of substitutions right so you think well I wanted a program why did I've only got a set of substitutions but their substitutions associated with meta rules so as soon as you substitute them into the meta rules you've got a first order program okay so it may sound complicated but it's it's the seemed it was exactly the thing that was required but there's a second part there's a double recursion here having matched the head you now have to go in and prove the body right so the body is from the matter or the body if you if you remember is basically undefined right so you now have to prove that and in the process of proving it you extend the program even further but what have you extended it by doing either you've matched some things that you already had introduced as predicates or you've introduced new predicates in the process of proving it right so all this happens as a as a side-effect of this recursion and then you have to prove the rest of the examples right so by doing that double recursion you take one pass through all of the examples and in the process you construct the whole program consistent with those examples maybe obviously with some backtracking involved but the overall structure is extremely simple and related to top-down programming your this thing is writing a top-down program this is going down into the structure and constructing further invented predicates and predicates required by those invented predicates and then finishing off by ensuring that they all agree with the examples okay so know that now the mystery what are the Metall rules okay well to understand the Met rules we have to in the case that I was showing you you have to look as we did with the grammar at the form of the rules here and you notice once more that all of these clauses here are suspiciously similar to each other they're all the same clause they just have various different predicate symbols involved so we're going to have to have oh sorry these these three are the same that one's different okay and those are different so we're going to have to have meta rules associated with each different form of clause that we've got there this first one gives us ground facts okay so when that's executed then the X's and Y's become constants and those constants are abused as you normally would in an abduction engine the second one P X Y of Q X Y generates a clause that's like this but with substitutions for P and Q and the order constraint says that there's a total ordering over the predicate symbol such that P must be greater than Q that's critical because that's to do with forcing a termination I'll show you later of both the learning process and the program that's learned so you can you end up with a guarantee that this this program is going to terminate chain the chain rule gives you P XY if QX y ry so that's a bit like the ancestor rule except the ancestor rules or can also be thought of a special case of this thing called tail req which again has orderings so these orderings here are P greater than Q and Q B greater than R whereas in tail rec it turns out you need an additional constraint to ensure something called interval convergence which turned out to be important so that's this constraint here over the constants involved okay so there's something weird going on in these because you might ask the question so why are these lower case X and why and why are those upper case X and why well there's a there's a notation that's being used here upper case means existentially quantified variable lower case means universally quantified variable when the prove when they when the the matter interpreter runs it saves the substitutions only of the existentially quantified variables in these rules and leaves the universally quantified variables as they are so that's how this works and if you think about it logically would you ever really want a rule that said for instance P of X Y is true for all Q X Y and that's universally quantified it doesn't make sense it's not true for all P and Q clearly but in your program they may exist exist a P and the Q such that those things are true and that's the semantics of these naturals there to do with finding solutions for some the substitutions that give a program fragment okay so that the general form is this you have a general meta rule can be formulated by putting existentially quantified variables for P and Q Universal for x and y and the whole idea here is that that can then support predicate and object invention through the order constraints oops coffee and yeah so in the family relations we consider data log logic programs and we consider a fragment which we call h 2 2 so H 2 2 might sound as though it's hydrogen or oxygen or something but it's actually a hypothesis space of logic programs in which you have at most two variables or to place the places for each predicate okay so you're allowed up to dyadic predicate x' and in the body of the definition you're allowed up to two atoms in the body okay so it's again it's reminiscent of the constraints for Chomsky normal form but in what placed in a kind of logic programming setting so arity of most two and two atoms in the body so what is h2 two it seems it seems to work for our little family relations fragment but does it what are its boundaries now it turns out that that question has been asked quite a long time ago in 1977 Tamland I think in the second logic programming international logic programming conference answered that question and showed that in fact h2 to is is sufficiently general to contain the universal Turing machine it's very expressive right surprisingly expressive right because you think well you can't really do very much for the language like that but here's a variant of that universe of the universal Turing machine so we just call it UTM imagine that the variable to represent tapes or lists in the first statement it says if the if the tape s has a halt statement on it then terminate in the second one it says look at this that the the tape s executed to give you the new tape s1 and then rip and then it right okay so this gives you kind of this the central loop of a of a universal Turing machine the execute loop and then lastly what is execute mean it means fetch an instruction and the instruction fetch is done by simply inspecting the the tape identifying where the present execution symbol is let's call it F and then applying F to the tape right so here you have an explicit second-order formula which is an important variant of this so if you if you have h2 to you in in your learning setting you have a problem right and the problem is the halting problem right how can we ensure that if we're taking that fragment h2 to that we will halt not only halt in terms of the learning process there learning procedure but any of the program's themselves because if we can't guarantee that the program's halt we can't guarantee that the learning will halt if you think about it because the learning has to actually execute the program and if it doesn't if the program doesn't halt then the learning will not halt and to find the solution to that we went right back to news and Bendix I don't know if anybody here is familiar with Newton Bendix fantastic piece of work in the 1970s on rewrite rules trying to find how you how you formulate a set of rewrite rules that are confluent ok these are these are these are general rewrites that you want to ensure come to a halt they actually stopped so that Nathan Bendix approach was proved to work but it applies only to rewrite rules it was extended then by Yahya Fernandez and minka in 94 to apply to data log programs so gerund termination data log programs that include recursion and this is done in both cases by assuming certain kinds of ordering so orderings over the rules or orderings over the predicate symbols if lots of different orderings are involved okay so one of them if you ensure that the Herbrand base is totally ordered you can guarantee termination simply by the fact that every proof is if every proof is forced to consider only things lower in the ordering in order to complete the proof you can guarantee that as long as you're always heading down the ordering and that there's a finite finite termination at the I mean it's it's a it's finite downwards may be infinite upwards but finite downwards then you can't go lower than the base of the Herbrand base so in in what we did and you saw it in that in that order constraint we use both lexographic orders and interval orders interval orders you imagine the way this in terms of a sequence where you've got X and Zed in the way x and y and we said that x and y you must have must contain said what that ensures is as you recurse down to X 2 versus Zed you're shrinking and shrinking an interval over the natural number sequence and again you can easily demonstrate that you can't keep shrinking forever until two two values coincide within the natural numbers so that was one of the things that made Matta call that improved it because a guaranteed termination the second thing was looking at the question of how you can learn a set of concepts ok suppose that I'm giving you five different things that you have to learn if you provide an episode sequence so it says before you learn let's say ancestor you should you need to learn parent and before you learn parent you have to learn something before it so you can show very easily that if you try to learn everything simultaneously then you get an exponential explosion in the search if you instead because it basically you've got a product search you've got all these hypothesis spaces and the product of the hypothesis space is the one that you're searching if you search them in order it's the sum of the hypothesis spaces right which is much more tractable okay so by introducing sequencing episodes over the learning you can guarantee better much better efficiency within an episode then there's a another thing to increase the efficiency which is to do what in other areas of artificial intelligence is called iterative deepening that is you start off by considering that the size of the definition that you're trying to learn has has size 0 okay ie in order to explain the examples you don't need anything additional right the background knowledge you've got already explains it suppose that fails right so there's a certain example that can't be so can't be explained right now for those examples you then consider a definition of with one clause in it right now with one clause you can only define you can only consider quite simple programs and you can exhaust that clause quite quick that set quite quickly if you fail to find a one clause solution you then step on to two clauses and so on right so what's the advantage of this the advantage is you're always considering the smallest hypothesis space that you possibly can which is a cheaper than it's cheaper to search but secondly if you think about it at the end of that process you guarantee that the hypothesis that fits is the smallest hypothesis that can fit because you've considered all smaller positive spaces right with all the above you can get a pack result out of this Metro interpretive learning that's probably approximately correct result which is in the paper by showing that if you put a constraint of log2n where you say I've got any examples I'm only ever going to consider a small definitions of up to the logarithm of the number of examples I've got in that particular case the but the space is sufficiently bounded to ensure that you can you can you can find the solution in polynomial time and the error is polynomially bounded in the size of the formula okay so all of this gave us an implementation to get the implementation is available to anybody wants to download it from its github site and we've developed a nice user interface which you can use to put your examples in so you don't have to type in in to type in lots of Prolog and get lots of errors and process that from your typing so this PHP interface you can also try out and try some of the simple examples what you'll find is it's actually it might sound as though the search is going to take a long time but in fact it's blindingly fast most of the things that you try will take a fraction of a second in order to find solutions so so I suggest giving it a go enough in our experiments we tried out doing more complicated programs in in our first paper which was actually a rich guy we looked at robot examples robot strategy examples and we built little programs that contain the mixture of actions and form in the form of dyadic predicate and tests fluence which were monadic predicates so this is our language array shooty language in operation there was very certain predicate inventions here the thing we were trying to learn in this case suppose that you're given there a robot it's given up a bunch of examples and plays around and sees some of the structures that it makes when it puts them together they fall over and some of them don't right so you want the positive examples for stability of a built thing come from those that don't fall over right ok so you've got a B and C so in fact C will will fall over because it'll collapse in the middle B will fall over if you make it high enough a if you keep on extending it is the kind of thing that you see in every wall in the world so we'd like to be able to learn something like a and this little strategy after after giving it enough positive and negative examples learn something that will accept structures like a and reject structures like B and C and it Doug did so by inventing both actions and fluence in the process again we train this by providing random inputs in terms of numbers of examples you can see that math Cody actually goes up from a base level of about 50% up to close to a bit below 100% and in this relatively small numbers of examples right so after about 20 examples it's a fairly complex domain but what's also encouraging is that the time that it takes increases pretty much as a something quite close to linear so the the various different approaches that we put in to make it more efficient were were effective in practice we tried it in other domains so this was something a project going on at Carnegie Mellon called the neverending learning of language so the project is based on applying natural which pauses to webpages its slogan is read the web see just read everything that's not up there on the web after a couple of years they had 50 million facts of in the form of triples and these triples were ideal for a dyadic learner right so we'd have thought huh we want to have these so these were a small set that Tom Mitchell sent me when we started looking at this so it has things like place bosses it's got a name eva longoria plays baseball and so on so it's very different facts about American sports and we were able to use those in order to learn rules such as the one up here in the process we unveiled a bug in the nail system because although we learn rules that they agreed were consistent when you applied them to the actual stadiums and so on and America and looked up the web pages he found actually they're wrong they're making wrong predictions and the reason is because they had some assumptions built in to know about the relationship between how many football teams had were in stay they were assuming every every there was a one-to-one relationship between two stadiums and football teams and it turned out that that was not the case and that comes out I think in in some of these examples so one and three disagree with reality we're able to also to learn some simple higher-order meta rules that was kind to explain symmetry and so this was something that now they didn't have any way of doing this they were putting in all of these kind of symmetry relationship ships by hand and yeah so this is related work it's quite a lot of related work so in particular this work in the National Institute of Informatics in Tokyo they had been doing a kind of a deductive predicate invention this has cuts in the in any ways work using using rules in a way that this is somewhat related to what we're doing so they do what they called meta-level abduction when looking more carefully at the the authors agreed with me that what they were doing was essentially propositional right so the thing that was different in our Metro interpretive learning was that we're dealing with relations there dyadic relations so we there's also quite a long literature going back on the use of higher-order background knowledge within ILP so particularly Lloyd has wrote a book back in 2003 but there's quite big differences between the higher-order representation that he uses and ours and there's this work on higher-order provo learning which was also relevant okay so just to summarize these are the various different things we achieved just again in terms of formulating a kind of declarative machine learning there's this is something a term that Luke Durant used back in 2012 his intention there was that all of machine learning could be replaced by constraint solvers right and because the formulation of hypotheses is a constraint solving problem which I like that idea the issue is how do you formulate the declarative statements and what kind of statements being allowed our meta rules our declarative statements and our way of solving them is a form of of constraint solving it can be done using this approach and in a ways group has shown that matter interpretive learning can be done using answer set programming but the idea of incorporating a kind of general-purpose matter interpreter as the learner reflects Luke's idea in that paper I think so it's also we found that hd2 is is a tractable language it's also turing-complete and we can think of it as a fragment of two higher-order logic the Knuth Bendix style ordering allows us to guarantee determinations and something really satisfying to me and using using this approach because you it shows that you can get guarantees on the termination and the properties of programs as you're formulating them you can build that those guarantees in so that you've already proved the termination criteria by the time you've formulated the hypothesis in this setting going beyond classification learning so pretty much all of machine learning actually is just about classification or maybe a regression in some cases but we've been able to show several times that you can do things like strategy learning where are you learning quite complex structural hypotheses in this kind of setting so these are some of the limitations this first one we haven't really solved dealing with classification noise I'll tell you about in one of the upcoming lectures probabilistic matter interpretive learning how do you introduce probabilities into the hypothesis and into the search process an active learning how do you avoid the kind of curiosity gap that's typical of machine learning that you just have to load all this data in and the learner does very little Lynch in in the selection process if we think about something like scientific experimentation it's an active process where you carry out experiments in the world in order to gather your data an active learning is something that we've started to explore in our group within the context of this kind of probabilistic measure interpretive learning
Info
Channel: Federated Logic Conference FLoC 2018
Views: 4,261
Rating: 5 out of 5
Keywords:
Id: 7ocgy8VjfJA
Channel Id: undefined
Length: 91min 15sec (5475 seconds)
Published: Mon Jul 30 2018
Related Videos
Note
Please note that this website is currently a work in progress! Lots of interesting data and statistics to come.