Stephen H Muggleton: Inductive Logic Programming II

Captions
Okay, so this is the second of the three lectures. In this lecture I'm going to talk less about the basic theory of ILP, and less about the basic theory of meta-interpretive learning, and more about particular issues that come up when you try to apply it in real settings. We're going to explore a notion called bias reformulation. The paper for this lecture, which again is on the website I put up at the beginning, appeared at the European Conference on AI. It was a collaboration between my group and Josh Tenenbaum's group at MIT, and the inspiration for it came from a presentation at a seminar we both attended on a system called FlashFill, developed by Microsoft, which I think was at the time the first mass-market product associated with program induction. So you'll see this term "one-shot function induction", which was the thing that really caught my eye, and Josh's.

I'm going to start by asking you to look at a standard kind of IQ test you've probably seen before: a textual analogy problem. We have the word "bob" in lower case, and the word "Bob", and this is supposed to say: "bob" goes to "Bob" as "alice" goes to question mark. Pretty much everybody can get this one. So now I say: is this the answer? Anybody want to vote for this as the answer? Why not? Because I can tell you that, just as there is a program which turns "bob" into "Bob" by upper-casing the initial character, there is a program that takes "alice" to this other answer. Can anybody tell me what that program does? Very good: it takes the first two letters, so it always puts "bo", just as it did there, and then it takes the end of the string and upper-cases it. So why didn't everybody jump to that conclusion? Because I suspect everybody jumped to the idea
that this would be upper-case "Alice" here. Well, I would argue that when we face this as an IQ problem, we have already looked at a lot of text in the past, and we have a very strong bias as to what we think is the case. If you treat it as a machine learning problem, we have been given just one example of what happens as you take "bob" to "Bob", and we're supposed to now know what would happen with every other example. In machine learning terms that's extremely hard to do. How could you machine-learn from just one example? Typically, if we were using a neural net, you would need tens of thousands of examples to get anywhere; yet people seem quite happy to do this every time with one example, and all these analogy problems are like that. So how do we do it? I think we have a very strong bias.

Now here are other variants of this. Imagine you had these tasks: task one takes this string to that one, task two takes this string to this one, task three this one to that one. What was interesting is that the Microsoft add-on to Excel from 2013, whose presentation we saw, was able to handle problems like this: you would type in one example, saying this goes to that, and it would fill in the columns of the spreadsheet and get them right almost every single time. We were astonished that this was actually possible. When Sumit Gulwani explained what they had done, it was quite clear it hadn't been easy: it took something close to six months of a group of around thirty programmers to get the bias honed well enough, from examples they had collected from users, to be able to do this. So what I was curious about afterwards was: could we actually learn that bias? Could we learn a general bias of that kind by looking at lots of example tasks like these, and zeroing in
on a bias so strong that, once we had it, it was sufficient to learn from one example? A very weird thing to try to do, but it seemed worth doing. Another way of thinking about it: could you learn this human text-transformation bias? Could you get a computer to learn to act like a human in terms of text transformation?

This is the thing that amazed us. This is from Gulwani's POPL papers back in 2011 and 2012 involving the FlashFill system, and it's the example he showed us. You've got all these email addresses, Brent.Harold@hotmail.com and so on, and the task says that has to be transformed into "Brent Harold". Once the user types in that one example, FlashFill, in a flash, fills in the rest like this. In the process it has considered some astonishing number of different candidate programs, and it comes out with one that does all of these: it will always, somehow, turn this address into "Matthew Russman" and so on. Having seen that, we thought, well, we'll have to beat that within a short period, so we'll use meta-interpretive learning to do it. It sounds impossible, and Josh's RA said it would be impossible, but we tried.

One thing that encouraged me, when the group tried to build up sets of examples, was that FlashFill sometimes made very weird errors. They made up some unusual examples, and in this one the training example is "Rodney", and the test example "Stanley.Travis" comes out as something strange, roughly "ITanley Travis". If you remember the "bo" answer I had earlier, this is exactly that: it has found some crazy way of doing the transformation. We stared at this for long enough, to try to work out what the program must be, and at last figured it out. If you
want to try it yourself, do, but there really is a program in their space which performs this weird transformation. So we wanted to do better than that.

In our approach, we built a little background knowledge involving various primitive transformers, things like skip_rest, copy_word and so on, and then we gave examples, just one example at a time. Here's one taken from the previous slides: Brent.Harold@hotmail.com. It's all given as a list in Prolog, where every element of the list is an individual character, and this second list is the output; the two lists are treated as a pair. This was one of the target programs to be learned; in our set of examples, which had lots of different types, it was problem number four. After nine seconds of considering this problem, Metagol pops out this answer, a dyadic program of the kind I've shown you before. It has lots of invented predicates and it's a little hard to follow, but once you get the gist of it you can see the structure. At the top level it breaks the transformation into two subproblems, f4_1 and f4_2. f4_1 again breaks into two subproblems, f4_3 and f4_4; f4_2 does f4_3 followed by a skip_rest; and f4_3 does a make_uppercase and a copy_word.

So follow it down. What does f4_1 do? It does f4_3, which does a make_uppercase, taking that lower-case "b" and making it upper case, and then copies the word, so "rent" up to the dot is copied to the output. It then does f4_4, which says skip one, so it skips the dot, and then writes a space on the output. So f4_1 is finished: it has produced "Brent ". That's the first half of what it did. Then comes f4_2, and if you go through it you can
actually figure out that it reuses this f4_3 again: another make_uppercase to take the "h" to "H" for "Harold", then a copy_word to get "arold" up to the "@", and then it's finished; at the end it skips the remainder of the input with skip_rest. So there is an interesting program under here, one which involves decomposition, a little program the system figured out. It took nine seconds of computation, which is actually a very long time for Metagol, and it is treating only one example, but what it gets correctly performs the transformation in the same way FlashFill does, on all similar input-output pairs of the kind given. So it's correct.

Now, in trying this out we had seventeen different tasks, and at one stage, for some reason, we looked at sequencing those tasks, and something odd happened. One of the tasks, EP02, was this top one: transform lower-case "james" to upper-case "James" with a full stop at the end. We ran EP02 and then EP04 one after another, and got an odd result: whereas it had taken nine seconds to find an EP04 solution on its own, it now took three seconds, and that's three seconds to learn both tasks. How can that be? How can it take less time to learn solutions to two separate tasks than to learn just one? It doesn't seem to make any sense. So we tried it the other way round, EP04 then EP02, and it didn't happen; it took more than nine seconds in that case. So why, in this ordering, is it so much shorter? It turns out that in doing the first task it invents a predicate, EP02_1, which is then used twice in the construction of EP04, which means EP04 has a definition one clause shorter than it was before: five clauses originally, only four now. So by having this little library, where things that are found to be useful get reused, it massively reduces the time for the learning.
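The decomposition just described can be sketched in ordinary code. This is a hand-written Python approximation of the learned program, not Metagol output: the primitive names (make_uppercase, copy_word, skip1, write_space, skip_rest) follow the slide, and the invented predicates f4_1 to f4_4 become helper functions composed exactly as in the walkthrough. State is an (input, output) pair of strings, mirroring the dyadic (Input, Output) arguments of the Prolog program.

```python
# Hand-written sketch of the learned transformation for
# "Brent.Harold@hotmail.com" -> "Brent Harold".  Each primitive maps
# a (remaining input, output so far) pair to a new pair.

def make_uppercase(s):          # upper-case the next input character
    inp, out = s
    return inp[1:], out + inp[0].upper()

def copy_word(s):               # copy letters up to the next non-letter
    inp, out = s
    i = 0
    while i < len(inp) and inp[i].isalpha():
        i += 1
    return inp[i:], out + inp[:i]

def skip1(s):                   # drop one input character (the dot)
    inp, out = s
    return inp[1:], out

def write_space(s):             # emit a space on the output
    inp, out = s
    return inp, out + ' '

def skip_rest(s):               # discard the remaining input
    inp, out = s
    return '', out

# The invented predicates become function composition:
def f4_3(s): return copy_word(make_uppercase(s))   # "Brent", "Harold"
def f4_4(s): return write_space(skip1(s))          # skip '.', emit ' '
def f4_1(s): return f4_4(f4_3(s))                  # first half: "Brent "
def f4_2(s): return skip_rest(f4_3(s))             # second half: "Harold"
def f4(s):   return f4_2(f4_1(s))                  # the whole program

print(f4(("Brent.Harold@hotmail.com", ""))[1])     # -> Brent Harold
```

The same composed program generalises to other addresses of the same shape, which is the point of the one-shot setting.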
So we thought: great, but it may just be one of those lucky things that happens in that one case, so we decided to study it more systematically. First of all, though, we wanted to understand the complexity results, what the complexity advantages were. We worked out the combinatorics of the space: if m is the number of metarules, n the number of clauses in the target definition, and p the number of predicate symbols, then the size of the hypothesis space, if you consider all of the hypotheses, is of order (m p^3)^n. The 3 there is to do with H2-2: it comes from choosing the three predicate symbols in a dyadic H2-2 clause. So what is it in this formula that is the killer, in terms of the size of the search? It's n, the number of clauses: the search increases exponentially with the number of clauses. So whatever strategy we apply, we'd like it to end up searching over smaller numbers of clauses, and precisely that is what made the big reduction in learning time in the case we had looked at.

What does that tell us? It tells you that if you're doing a multi-task learning problem, as we were, there may be a good ordering of the tasks which reduces the overall amount of search and correspondingly improves the out-of-sample error. So we tried this idea using a particular strategy. You start off by considering all seventeen tasks, and you apply a constraint: find me solutions expressible in only one clause. It considers all of the hypotheses for all of the tasks, and it turns out only two of the tasks we had, 14 and 16, had a one-clause solution.
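The complexity bound above can be tabulated directly. This helper is just illustrative arithmetic under those definitions (m metarules, p predicate symbols, n clauses), not part of Metagol:

```python
# Size of the H2-2 hypothesis space: each clause independently picks
# one of m metarules and (up to) three of p predicate symbols, so
# programs with n clauses number on the order of (m * p**3) ** n.

def space_size(m, p, n):
    return (m * p**3) ** n

# n dominates: with m = 5 metarules and p = 10 predicate symbols,
# adding a single clause multiplies the space by 5000.
for n in range(1, 5):
    print(n, space_size(m=5, p=10, n=n))
```

This is why reuse of invented predicates pays: shaving one clause off a definition divides the space to be searched by a factor of m p^3.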
Having found a one-clause solution, you are now allowed to reuse that definition and any invented predicates, and in the next round you're allowed up to two clauses. It turns out that at two clauses you get tasks 2 and 15, and task 15 is able to reuse the definition of 16; so there's a little bit of a saving there. But at the next level up, when you're allowed up to three clauses, you suddenly find that 15 has introduced something really significant, which can be reused by lots of other things. So all of this reuse starts to grow rapidly. That's what's happening here with what we call dependent learning. In independent learning, by contrast, no reuse is allowed: you have to learn everything independently. So here you could learn 14 and 16, and here you could learn 2, but you couldn't learn 15, which you could only have learned by reusing 16, and so on up. By the time you reach the timeout, in independent learning there are still four tasks which are unlearnable within the clause bound, whereas with dependent learning only task 9 cannot be defined, because along the way you have introduced all sorts of useful, reusable predicates.

Looking at this dependency graph, which is in fact a calling diagram: 3 is calling something in 12, or maybe an invented predicate of 12, and that is calling an invented predicate of 17; 15 calls down to 16. So there's quite a lot of dependency; it's quite a deep program that's being learned, and this is a slice through that program. There is 3 at the top, calling 12_1 and 12 itself; then 12 is calling 12_1 and 12_2, 12_1 calls further predicates, and so on all the way down. If you look at that as a Prolog program, you're producing something not very dissimilar from deep learning, except this has been done in logic.
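The dependent-learning strategy described above can be sketched as a loop over an increasing clause bound with a shared library of solved definitions. Here `learn_task` is a stand-in for a Metagol call, and the toy learner only illustrates the reuse effect; its task numbers and clause costs are made up for the illustration.

```python
# Sketch of dependent (multitask) learning: raise the clause bound one
# step at a time, attempt every unsolved task at that bound, and add
# each solution to a shared library so later tasks can reuse it.

def dependent_learning(tasks, learn_task, max_bound):
    library = {}                      # task -> learned program
    unsolved = set(tasks)
    for bound in range(1, max_bound + 1):
        for task in sorted(unsolved):
            program = learn_task(task, bound, library)
            if program is not None:
                library[task] = program   # reusable from now on
                unsolved.discard(task)
    return library, unsolved

# Toy stand-in learner: task 15 needs 3 clauses on its own, but only
# 2 once task 16 is in the library (mirroring the reuse effect).
def toy_learner(task, bound, library):
    cost = {2: 2, 14: 1, 15: 3, 16: 1}[task]
    if task == 15 and 16 in library:
        cost = 2
    return "program_%d" % task if cost <= bound else None

library, unsolved = dependent_learning([2, 14, 15, 16], toy_learner, 2)
print(unsolved)                 # dependent learning solves all four tasks
print(toy_learner(15, 2, {}))   # independently, 15 is unsolvable at bound 2
```

Independent learning is the same loop with the library argument always empty, which is exactly why it leaves more tasks unsolved at the same bound.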
Because the search space grows so fast, the definitions become too big to find, so we just set a timeout; beyond five clauses the learning was taking far too long, because of the exponential increase in search time. By the time you've gone to five you've covered, in this case, all of the tasks, and only 9 is left out: over here several tasks, 17, 4, 9 and 5, are timing out, while here only 9 times out, because its definitions are too big even after you've introduced all these extra predicates that could be used for defining it.

When you look at this, you see that not only is it very similar to the way deep learning builds one layer on top of another, it also has one of the problems of deep learning: it's more or less impossible to read the program. It's really hard work, and this is only a bit of the program, to interpret what 12_1 or 12_3 are. You do have a slight advantage, though: if you work through it systematically you can give sensible names, say calling 15_1 "skip_alphanum" or "skip1", substitute those names in, work your way through, and end up with a readable program. So we have some of the upsides and some of the downsides of deep learning, even in a logical setting.

In terms of performance, what happened that was interesting is that the independent learner was tested both allowing recursion and with non-recursive definitions, and throughout, independent learning solved fewer tasks and had higher error. In the comparisons done by Josh's group against human beings, we find that humans, even after enciphering the inputs, have a slightly higher performance than our learner achieved, but not very different from FlashFill's: FlashFill appears very good until you encipher the inputs, and then it looks very much like a human learner. But notice that we were bound by the seventeen tasks we had extracted
from the Gulwani papers, and our curve was still heading up; given more tasks, we may well have reached the bound that both FlashFill and the humans seem to have. So we are progressively learning the, quote, human-like bias, but not quite to human levels in this experiment. If we look at the times involved, on a log scale, you can see that we're making exponential savings in time when we compare dependent with independent learning, which is as you would expect from the complexity bounds introduced at the beginning, because we're systematically reducing the number of clauses needed to build the definitions, and that leads to an exponential speed-up. It also reduces the error throughout.

So, as a summary: what we were looking at here is multi-task learning, with the learning restricted to a single example per task. All the programs I've shown you were learned from single examples, in the same way FlashFill had been demonstrated to do, and as you learn from those single examples, your competence on other single-example learning tasks increases. It shows you a rather interesting strategy: rather than having lots of examples on just one task, why not have one example on lots of tasks? It's a completely different way of learning, but one that's much more familiar from a cognitive perspective: people don't learn from millions of examples, but they do learn lots of things simultaneously. The use of predicate invention and of learning recursion both paid off here, predicate invention particularly, because it populates the space of string transformations with useful new predicates that get reused within this context. You also end up with an approach to inductive programming that is simple, compact and undemanding on the programmer: you're only asking for a single example in each case, which is again one of the
aspects built into FlashFill, because it's designed for naive programmers. People are actually writing programs; they just don't know that that's what they're doing. They're typing in, you know, "Brent Harold" or whatever, but what's happening under the hood is that they're generating a more and more complex program. So there are ideas here that could be used in intelligent interfaces for programming.

But we were short on a number of different strengths we could have had. It could really help to have more general transformations: if you think about string transformation as a general class of programming, it is in fact Turing-complete. If I can give you any string as the input and any string as the output, I could clearly ask for a sorting program, where my input string is "32895" and the output is the sorted version of the same. And at this point we didn't have a way of handling noisy examples; indeed it was unclear what noise even means when learning from one example. These were all interesting things that came out of it, and I would claim that not just looking at the theory, but actually working through real examples that people were trying at Microsoft and so on, had the advantage of making us think harder about the theory of what's going on and what we could do to improve it.

Any questions? [Question about the efficiency of the learned programs.] That's a very good point, and it's something we followed up the following year at IJCAI, in a paper Andrew Cropper and I wrote on how to learn efficient programs. We had two different models of that. In one of them we had primitives like these, but each of them consumed a resource: within the definition, every time you use a primitive, a counter goes up.
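That size-then-cost idea, as I understand the Cropper and Muggleton approach, can be caricatured in a few lines. `candidates` is an assumed stand-in for an enumerator of programs of a given clause count, each paired with its summed resource cost; the toy enumerator below is invented for the illustration.

```python
# Caricature of learning efficient programs: first find the smallest
# program, then keep searching larger (non-minimal) sizes in case a
# bigger program has a lower total resource cost.

def learn_efficient(candidates, max_size):
    best = None                        # (cost, size, program)
    for size in range(1, max_size + 1):
        for program, cost in candidates(size):
            if best is None or cost < best[0]:
                best = (cost, size, program)
    return best

# Toy enumerator: a 1-clause program exists but is slow; a 2-clause
# program costs less, so the search prefers it despite its size.
def toy_candidates(size):
    return {1: [("slow_sort", 10)], 2: [("fast_sort", 3)]}.get(size, [])

print(learn_efficient(toy_candidates, 3))   # -> (3, 2, 'fast_sort')
```

The point is that minimal textual size and minimal resource cost can disagree, which is exactly the quicksort-versus-naive-sort situation mentioned below.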
So when evaluating the solutions, you evaluate their cost, and the search we did first finds the smallest program it can, and then allows you to search larger spaces containing non-minimal-size programs with lower overall resource cost. We weren't looking at that at this stage, but it has been looked at since, and we have some quite nice results about the degree of optimality you can achieve in learning time-efficient programs, including learning quicksort, where all the predicate invention has to be done by the system itself.

[Question about the clause bound.] We went up to five clauses, and the reason we didn't go beyond is that it was taking silly amounts of time; we couldn't complete the experiment. The interesting thing is that we could have stopped at four and still found the dependent learner doing better; five was the outer limit of what we could bear to run. Even at four, dependent learning leaves out just 3 and 9, whereas independent learning leaves out 3, 13, 11, 17, 4, 9 and 5. One could give it more running time, but eventually... a good question. I'd have to look back at the data set; I can't remember. Some of the problems weren't from the FlashFill corpus; they were ones the MIT group had decided would be interesting, like reversing a list, and it may have been one of those. I could check that.

Okay, so this is the second part of lecture two. You may have noticed that as I've gone along I've always said what we couldn't do at various stages, which is, you know, a way to figure out where to go next, and many of those lists include "can't deal with noisy data". That is a bit shameful, because all ILP systems back to the 1990s have been
able to deal with noisy data. So in this task, for which the paper came out this year in the Machine Learning journal, we decided to make it really hard on ourselves. I had read, I don't know if anybody here has, The Master Algorithm by Pedro Domingos, written a couple of years ago, which gives a survey of machine learning and the things the various approaches are good and bad at. One of the sentences in there that struck me as irritating said that in the next hundred years inductive logic programming is never going to be able to deal with images, because, you know, this is something neural nets are great at and logic is hopeless at. So I thought: okay, let's try that. This is all about how to learn from noisy images using inductive logic programming, or rather using meta-interpretive learning in particular. (Sorry, this slide shows the wrong paper; the paper is called "Meta-interpretive learning from noisy images", and I'll make sure the right one is on the website I told you about.)

The motivation: first of all, we think vision is an interesting problem; secondly, it's particularly interesting both in the scientific settings we wanted to look at and in problems involving, for instance, robots. When we look at scientific data, say this picture of the Moon, the thing I had been wondering about, having read it recently at the time, was something written by Galileo back in 1610, when he made the first observations ever of the surface of the Moon, through a telescope he had designed to his own specifications. He saw the Moon and started describing things which went far beyond what anybody had previously thought of, because nobody had seen it as clearly
as he did. To begin with he pointed out, for instance, that the Moon had jagged edges, which meant it wasn't a perfect sphere and must therefore have mountains, because it's a real object. And not only that: the mountains on the Moon were weird, because they weren't simply convex; some of them were pitted, they had concavities. He didn't have the term for it, but these were craters, and he described their geometry exactly. What amazed me is: how did he figure all of that out just by looking through this thing? Nobody had seen such a thing before, but he was well educated and very scientifically oriented, and he figured it out from shadows. If you read the description carefully, he understood what was happening in large part by inference from the shadows. When we see this crescent Moon, it was obvious to Galileo that there's a light source somewhere, and the light source was also obvious to him: the Sun. Part of the thesis he was trying to establish is that the Sun is at the centre of everything, so even if you can't see it, it's still there, shining on the Moon as you look. We don't see a Sun in this picture, but we don't need to, because with just a little background knowledge about how light works, we know something must be lighting that part up and not the other part. So even for a spherical object there's something we can infer about its shape, its convexity and so on.

The theory behind that should be fairly straightforward to learn: when you have a light source and it bounces off a convex object, the side closest to the light source is the bright part; if the object is concave, the light source lights up the far side. And this produces certain strange effects. Take these four
shapes here: I think most people will interpret them visually as either concave or convex. Does anybody want to hazard a guess whether the one in the lower left-hand corner is concave or convex? Convex. Now that's interesting, because if you looked closely at what I've just told you, you'd realise there's another interpretation: it is convex in the situation where the light source is above it, and concave if the light source is below. So you're assuming the light source is above, and that's not surprising, because the whole of humanity assumes light comes from above; we just have it built into us. So what about this one, on the same basis? We know it's the Moon, so we say it's convex. But we also know, as Galileo did, that light can come from below, in which case it could be concave; the natural interpretation under our built-in assumption would say one thing, but it might not correspond with what we know about the Moon.

These minimal ideas about light, and the movement of a single photon of the kind I'm showing here, can inform you about what's going on in other cases in vision. For instance, here we have a ball, not the Moon, with some shadow pattern: the light must be coming from above, in which case the surface must be concave. We also have a ball here, but we can't see all of it. Why not? Because the light is not reaching us; something is obstructing it. All of those things can be logically inferred, if we had the axioms to do it, and we do this every day. But machine vision doesn't: machine vision does all of these recognition tasks without a theory of light. Think how absurd that is. Every picture involves reflection, light and photons, and yet we are assuming, for some reason, in the whole
area of computer vision that an image is just an array with some properties. It makes a nonsense of what's actually happening. We figure out all the time that when you have a glass of water and can see through it, and you put your finger behind it and your finger looks bent, something is going on with light that tells you that. So having at least a theory of light is, I would claim, essential to doing machine vision better, and it can be done within an ILP setting; ILP actually has an advantage here as a machine vision approach.

We're going to try to do this using the meta-interpretive learning approach I've already shown you, but the first problem we hit is that we're going to have noise in our data. The reason Metagol didn't deal with noise is that it is harder for it than for other machine learning algorithms: it carries out consistent construction from the examples, so it has to assume the examples are correct. The way around this that we've applied is that you sample the examples you're given: you take small samples, repeatedly, and you build models from those under the assumption that each sample is noise-free. If you take enough samples and test out-of-sample within the training set, some of those samples will give high out-of-sample predictive accuracy, and by optimising over that you can get around noise. It's maybe a complicated slide, but it means Metagol can be used in the pure logical sense, using subsampling, and still obtain consistent, noise-tolerant solutions to the learning.

There are other problems involved in understanding the objects: in particular, we need to build models of the surface of some kind. This is equivalent to the issue in the string-transformation problem of having to have primitives of some kind.
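The subsampling scheme just described can be sketched as follows. `fit` is a deliberately trivial stand-in for a MIL call: it returns a hypothesis consistent with a presumed noise-free sample, or None when the sample is contaminated; the real system learns a logic program rather than a threshold.

```python
import random

# Noise handling by repeated subsampling: MIL assumes its examples are
# correct, so learn from many small samples and keep the hypothesis
# scoring best out-of-sample on the remaining training examples.

def noise_tolerant_learn(examples, fit, predict, sample_size=2, rounds=50):
    rng = random.Random(0)
    best_h, best_acc = None, -1.0
    for _ in range(rounds):
        sample = rng.sample(examples, sample_size)
        h = fit(sample)                 # fails if the sample is inconsistent
        if h is None:
            continue
        rest = [e for e in examples if e not in sample]
        acc = sum(predict(h, x) == y for x, y in rest) / len(rest)
        if acc > best_acc:
            best_h, best_acc = h, acc
    return best_h, best_acc

# Trivial stand-in "learner": hypothesis is a threshold t with y = (x >= t).
def fit(sample):
    pos = [x for x, y in sample if y]
    neg = [x for x, y in sample if not y]
    t = min(pos) if pos else 10
    return None if any(x >= t for x in neg) else t

def predict(t, x):
    return x >= t

examples = [(x, x >= 5) for x in range(10)]
examples[3] = (3, True)                 # one mislabelled (noisy) example
h, acc = noise_tolerant_learn(examples, fit, predict)
```

Samples that happen to contain the noisy example tend either to be inconsistent or to score poorly on the held-out remainder, so the selected hypothesis is driven by the clean majority.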
When we started this work we began with one primitive: the notion of a point. The idea came from Euclid, who looked at vision-type problems but treated them as geometric: there are these things in the world called points, and we make our objects out of them, lines out of connected points, then other shapes out of lines, and so on. So we tried to take a purely Euclidean approach. In some cases, though, we realised the objects we were dealing with would be better handled by elliptical models, and that's what we're doing here. We take various high-contrast points sampled from the image, the red points, connect them together using lines, and then use them to fit an ellipse model. The ellipse model in this case gives us errors: if this ellipse were correct, the point here should be a high-contrast point. So we resample, find another point, and get a better fit. The identification of the sub-objects is thus done at a lower level, below the meta-interpretive learner, effectively by hypothesis formation and active learning.

Is that good enough to deal with moons and microbial slides? These were our two scientific tasks, and indeed it was: we could find elliptical models that worked for the Moon and for some of these microscopic single-celled organisms. We then carried out a learning task in which we were trying to train the learner to decide where the Sun is: you show it pictures of the Moon and ask, where's the Sun? It's a very weird question from a machine learning point of view, or even from a vision point of view, because the Sun is not there; it's not in the picture. So how can you ask me where
the Sun is, when it's not in the picture? The answer has to be somewhere outside the picture, and we're given a label saying which compass direction it should indicate in the training examples, and similarly in the test examples. There are light sources here that we can identify with the micrographs, and here is the experiment. In these experiments we did very low-sample cases, learning from one or two examples, and we got high levels of accuracy in our meta-interpretive learning of both the moons and the protists, compared to some state-of-the-art machine vision approaches.

And then we moved on to this problem, which is the one I was talking about right at the beginning. So again I'm going to ask you, now that you should be better informed: which of these two is the crater? Hands up for this one being the crater, and hands up for this one. Well, they're both the same image; one of them is upside down. They are identical images: just turn the slide around and you'll find it's the same image again. A lot more people thought this one was the crater than that one. That's because the light's coming from the top, isn't it?

So how can we treat this as a logic learning problem? What we did is produce a model using meta-interpretive learning, with metarules rich enough for it to invent a theory of light. So we had at least the base and the recursive rules, and then we gave it basic examples and some background facts, and it came up with hypotheses; in this case, multiple hypotheses. So it said: this could be the solution. The light source is local light, the light source angle is from the south, and the object is convex. If you make a different assumption about where the light source is, if it comes from the north, then
the object is concave. So for the same image it's not giving a single classification; it's saying: under these circumstances this is the answer, and under those circumstances that is the answer. Now from a logical perspective that's the right answer; that's what machine vision should be telling you. It depends on what you're assuming. But standard forms of machine learning are set up just to give you a classification, which is one thing. This is saying: no, it's lots of things, all these possibilities, and this is the way the background knowledge will allow you to interpret what's going on.

We did other experiments with robots, again trying to mix this with some working machine vision. A computer vision person working with us identified the approach of superpixels: you can take an image and break it up into low-variance sub-areas, and you can treat these as objects with relationships between them. We then used that as the set of objects in the learning task, identifying various different relationships between them, and got the learner to produce relational rules saying where the ball is in each case. We were able to get it to do that, and we did a comparison against a CART learner using the same information, learning from one or two randomly selected negative examples, and we outperformed CART, which is a pretty powerful propositional learning algorithm, by using relational learning. Again, learning from smaller numbers of examples, you get to higher accuracy faster. That's a log scale on the bottom, over the number of training images.

So the summary from this is: Pedro Domingos is wrong. You can do logic-based learning from images. We can deal with classification noise, and that was one of the technological advances here. We used some degree of active learning, though it would be nice to incorporate that better into the whole setup, and to some degree we can do
efficient problem decomposition as part of the problem. There was a mixture of predicate invention; almost all of the predicates were invented. I didn't show you all of the definitions. Concave/convex was invented: there were two stages to the learning. The first stage was learning on this prediction task, and a predicate was invented that was equivalent to concavity. In the second stage, which we evaluated in the paper if you read it, we evaluated the task of telling which objects were concave and which were not. We provided information simply about what was going on, and it used those predicate symbols, which we'd renamed here so that they're readable.

[Question about showing it a picture of a painting.] So it wouldn't handle that well. Well, it depends what the painting was about, and there are limitations to the underlying primitives that we've got here. If the painting were a Mondrian painting with lots of straight lines, it would probably tell you something about the structure of that, particularly if you were giving positive and negative examples of certain types of Mondrian painting. But if it's, say, a picture of the Girl with a Pearl Earring, then it might identify the earring and so on, because it's got this ellipse identifier, and it might be able to use that in order to make predictions. So it's only as good as the primitives that you give it to begin with. We think there may be a very strong set of primitives that you could use in a lot of different tasks, and at least in some of these there's some evidence of that, because micrographs and pictures of the moon are not the same source, but what they have in common is light sources. So if what you're trying to do is identify things about light sources, you can learn from micrographs and apply it to the moon, or vice versa.
Every time we do experiments and publish papers, we put the version that we used onto the website, so anybody can use what we've got; we provide the data and so on. Is it in development? Yes, it is, because it's research: it's continuously being changed. We're producing research vehicles for carrying out these experiments, but they tend to build one on top of another, so there's something of an upward compatibility.

OK, so now this is the last section, which is on stochastic logic programs and Bayesian meta-interpretive learning. Stochastic logic programs have been around for some time, since the mid-1990s (1995); Bayesian meta-interpretive learning is from about 2014. The key thing they have in common, because they depend on each other here, is the use of probabilities within the learning. Here are the papers for the lecture.

The key idea is the notion of providing a Bayesian prior over the hypothesis space. Imagine such a Bayesian prior: suppose you've got an infinite hypothesis space and you decide to order all of the hypotheses in descending order of probability. Then you're going to get some kind of curve over those, and the bounding curves that you consider have to be probability distributions; if they're continuous, they should still sum to one. So how do you go about setting up a distribution function over a structured object of this kind?

Well, let's start by thinking about probability distributions over strings. One way to define a probability distribution over a set of strings is to define a stochastic automaton. A stochastic automaton is simply an automaton with certain states, let's say these are accepting states, and transitions which have associated probabilities. So when you're in state 0, with probability 0.4 you can either accept or emit an a, and then in state 0 you
with probability 0.6 you can emit a b. The assumption with the stochastic automaton is that, for the outgoing arcs, the probabilities must sum to 1. So this is a stochastic automaton, and we can take a particular string, like abb, and figure out its probability: since we can treat the derivation of the string as a Markov chain, we just take the product of the arc probabilities, 0.4, 0.6, 0.7 and so on, which gives about 0.05. And it's easy to prove that with a stochastic automaton, when we sum over the set of strings, the overall probability is 1, because if you draw out the graph, all of the subgraphs sum to 1. So this is one way of defining such a distribution over a set of strings.

But strings aren't necessarily the way that you would like to describe, say, a logical hypothesis space. We can, however, represent the equivalent of a stochastic automaton as a grammar, and here is a set of production rules that have exactly the same distribution over strings: each of these production rules simply represents one of the arcs in the automaton I just showed you. Similarly, you can go from there to re-express that same grammar as a logic program, a DNF logic program, in much the way I mentioned in part one. But here we're associating probabilities like 0.4 and 0.6 with individual clauses. So how do we read this logic program? The easiest way to read it is as an automaton: if you look at the derivations of the logic program, they are Markovian. When I try to prove a string like abb, I have to make a series of choices through the selections of SLD resolution in order to derive that string. And since we can build a logic program of this kind, we can then ask the question: well, this doesn't just have to be strings; this
could be arbitrary logic programs. Instead of something that looks like a DNF formula, like these, we could put anything you like: you could put quicksort there, or the definitions of families that I showed you earlier, or the description of a chess board, allowing yourself to randomly choose chess boards with associated pieces in places. So this is, in a way, a nice way of setting up a structural prior. This is a point that was made first by James Cussens, I think: you can define a prior over, say, Bayesian networks by defining the class of Bayesian networks as a logic program and then associating probabilities with the clauses; you've then got an SLP generator over the structure of Bayesian networks. It's not the probabilities associated with the conditional probability tables of the network; it's the structure of the object that's being generated. You're defining a prior over graphs, rather than a prior over strings as we were doing here.

So that's a nice idea, but when James described it to me I immediately thought: well, why not describe a prior over all logic programs in that way? And this is the entry point of the next idea in this line. I've told you what a meta-interpretive learner is; there's also the notion of refinement. Remember I told you about refinement graphs and this idea of refining programs. There's a notion of stochastic refinement, where instead of systematically following the various different ways of refining a logic program, you do it by random sampling. If you randomly sample from the refinement graph over all logic programs, what you will pull out is a randomly chosen logic program. So now we've got an interesting idea: a refinement generator, generated by a meta-interpretive learner. What if we took
the meta-interpretive learner and treated it as a stochastic logic program? When we give a set of examples to the learner, it will generate a consistent hypothesis; when it has to make a choice, it will do so randomly. So you randomly sample from the set of consistent hypotheses in the space (only consistent hypotheses can be generated by the meta-interpretive learner), and in Bayesian terms that's called a Gibbs learner. We can use this idea to build a Gibbs learner that identifies consistent hypotheses and randomly samples from that space. In the example I've shown here, we're randomly choosing solutions to parity, for instance: you're learning a finite state acceptor as a logic program, and you're doing so by identifying the delta rules.

So you might say: so what? Why would we want to sample consistent hypotheses from a space? I won't tell you that immediately; I'll just set up the framework. What does it actually say? Well, we've got metarules, and we've got stochastic refinement. Refinement is Shapiro's idea, but here, instead of just coming up with individual clauses, we associate clauses with probabilities. We use that to define a consistent prior; using that prior we can define a likelihood function, a one/zero function for a given set of examples with respect to the background knowledge; and we can then normalise a posterior. With a normalising constant, using Bayes' formula, we take the likelihood together with the prior, with respect to the background knowledge, to give us posterior probabilities. So we've now got a Bayesian framework that we can apply to the setting I've just described, where we do stochastic refinement using a meta-interpretive learner identified as a stochastic logic program.

So where does that take us? Which stochastic logic program are we using? As I said, the meta-interpretive learner. It's exactly the same one as I showed you before, with the abduction and so on. It's just that
when you look at the top level, these two clauses have probabilities associated with them, and there are also probabilities lower down, in the selection rule for metarules: which metarule are you going to choose. So everything is a stochastic logic program as part of this definition. Which metarules? Suppose we're doing finite state acceptors: we're using dyadic learning, and here are the metarules that work. So everything is as it was in the meta-interpretive learning setting.

What do we want to do with this? We can now sample whole consistent hypotheses given a set of examples, and we can calculate the associated posterior probability by taking products over Markov chains. What can we do that's interesting with that? Actually, a number of different things. One of them is Bayesian prediction. There are a number of proofs in the literature, going back to the 90s, showing that the best machine learner is actually the one that does model averaging. By model averaging I mean it considers the entire hypothesis space, and when making a prediction it weights the predictions of each of the consistent hypotheses according to their posterior probabilities. We can generate the consistent hypotheses, and we can generate their associated posteriors, so we can approximate this Bayes prediction algorithm, which is supposed to give the highest predictive accuracy of any machine learner.

So we can do that, and then the question is: how do we do the sampling over the space of derivations? We found two different ways of doing that sampling. One is sampling with replacement, the standard statistical approach to sampling from a set; the other is without replacement. Now you might wonder: why do it without replacement, when most of statistics is based on sampling with replacement? Well, we implemented
the with-replacement one first, and we found that it was over-sampling: it was finding hundreds or thousands of identical solutions, and those were especially associated with the hypotheses that had highest probability. That's a lot of wasted time: if we can simply calculate the probability associated with a hypothesis, we don't need to re-estimate it again and again. So what about doing it without replacement? Now we've got the issue of how you sample from a derivation space without replacement. There's got to be some kind of bookkeeping involved; it's not straightforward, but we do have a way of doing it. And the comparator we wanted for our experiment was a MAP learner, which simply gives the maximum a posteriori solution, the one with highest probability. You can think of Metagol's solutions as typically being of this kind: if you assume an Occam prior, Metagol provides the smallest solution, with the fewest clauses, if you just run it in the standard way, and that would be identified as the maximum a posteriori solution.

So how do we do the without-replacement approach? You have to consider that we've set up the stochastic logic program in such a way that, from each choice point, the subtrees of the derivation have equal probabilities; that's this sigma function with equal probabilities. Now consider the entire hypothesis space again, maybe even an infinite hypothesis space, but imagine the cumulative probability across that hypothesis space as you move through it from left to right. This interval here is where you've got the hypotheses with cumulative probability up to 0.1, then up to 0.2, and so on; when you get to here you've covered half of the probability volume of the entire space, and at the end you've got cumulative probability of one, the entire space. Now, what I'm
going to do, what I want to do, is sample without replacement, and I want to do that in an even way. One way of doing that is something called regular sampling. What regular sampling does is that, instead of being completely random, you choose target probabilities like 0.1, 0.2, 0.3, and so on, and you sample the point at that place in the space. So this is a particular hypothesis, the last one to bring the cumulative probability up to 0.1 or closest to it, then 0.2, and so on. The effect is that you spread your search of the hypothesis space out over the entire space, evenly, in such a way that you could simply shift the targets a little and you would get a new spread of hypotheses, equally separated in terms of their probability.

But how do you do that? How do you identify, for instance, where the 0.4 one is going to be? Well, if we're at this point in the space, and there's one-third probability under each branch, then is the hypothesis going to be under the left-hand branch? No, because the left-hand branch ends with cumulative probability 0.333 recurring. Is it going to be under the middle branch? Yes, because that goes from one third through to two thirds; and it's not going to be under the right-hand one. So if I've got a target probability, I've got a determinate search to find exactly where in the space that hypothesis is. We can walk through the refinement graph, walk through the derivation of the logic program, to find exactly the hypothesis associated with the target probability, and we can simply go through a selection of targets to get any number of equally spaced hypotheses out of that, and we will never repeat. So we're getting sampling without replacement by using regular sampling, and it's very efficient, because we simply drop down through the hypothesis space in a guided way to find each
hypothesis, each consistent hypothesis. OK, so we tried this out, learning on training examples for both family examples and finite state automata. We tried MetaBayes RS (regular sampling) and MetaBayes MAP. You notice the predictive accuracy: for the regular automata, MetaBayes RS does pretty much the same as MetaBayes MAP, maybe slightly better; in the second case, family, it outperforms it. So this is our Bayes predictor, and the Bayes predictor is supposed to be the best around, and so it is; but it's noticeable that its performance increase over other approaches is most accentuated when you have small numbers of examples. Using the prior, if it's a good prior, is very valuable, and that's where the Bayesian approach is gaining, when you have small numbers of examples. When you've got large numbers of examples, it pretty much doesn't matter what you use; you're just wasting a lot of time doing all of the sampling and resampling. With large enough numbers of examples you might as well do something extremely simple, and it's going to converge on consistent hypotheses anyway. But when you only have a small number of examples, it outpaces everything. So that's interesting. MetaBayes RS, with our regular sampling approach, is actually performing very well; but if you look at the time it takes, it takes 10 times longer than MetaBayes MAP, and that's really only worth it when you're trying to learn from small numbers of examples. So there's something quite interesting we found out from doing this, which we couldn't see simply by looking at the 90s bounds that Haussler, Kearns and Schapire had come up with for Bayesian prediction.

We carried on that work, as you'll find in the published paper, by looking at something else, superimposed logic programs, where instead of just taking the sample and
predicting in a Bayes-prediction way from it, you build probabilistic logic programs by superimposing the solutions that are identical and counting the frequency of occurrences of the individual clauses. And we found again, in tests, that our convergence relative to standard ProbLog was better, because we could depend on the law-of-large-numbers convergence theorems in that case.

OK, so here's related work. Back in 1994, Bayes prediction was shown to be optimal by Bernardo and Smith, and in Buntine's thesis he also suggested this should be the case. Haussler, Kearns and Schapire in '94 showed optimal upper bounds: they show that in fact there's only a factor of 2 between the MAP prediction bound and the Bayes prediction bound. Ensemble methods (Freund and Schapire) are based on an essentially sample-based approach to Bayes prediction. SRL and PILP, statistical relational learning and probabilistic ILP, use a different approach; our logic programs are still pure logic programs, we're just sampling from them, apart from in the case of superimposed logic programs, and there we're actually getting slightly higher accuracies than either of these with the superimposition approach.

So, the summary. SLPs allow you to generalise stochastic grammars. The Bayesian prior can be implemented using an SLP over a meta-interpreter. If you then use that mechanism to sample, you can get as good an approximation as you like, depending on the sample sizes, to Bayes prediction. We see in practice that this outperforms MAP, more so when you've got small numbers of examples than large, but there's a cost: you have a large slowdown, based on the fact that you need to repeatedly sample from the space. You can improve that by sampling without replacement, but it's still expensive. And whereas we had thought
that we would be able to build a noise model out of this, we've done that in other ways since then; but we are now looking, in a PhD in the group, at active learning using this general approach. The stochastic logic programs are used to produce the sampling mechanism for the meta-interpreter. What we learn are called superimposed logic programs, which have many of the properties of stochastic logic programs except that they have a better declarative reading: in a superimposed logic program you can take the statement of a clause and treat its probability as a degree, as representing the proportion of models in the world for which it's true. So it's a degree of uncertainty that's being represented. In a stochastic logic program the probability is harder to associate with the meaning of the clause; it's to do with the sample frequency. So SLPs and SiLPs look similar in their structure, but one represents something quite declarative and the other something quite procedural, in terms of what the probabilities implicitly mean.

So what happens is, when the choices are being made, it makes them with equal probability, so representing all of the probabilities isn't really necessary; in fact it's better to simply maintain them as equal. At certain points, as you go further in the learning, the number of choices will either increase or decrease; as you provide more and more examples you get a changing prior, depending on the choice set. So having a fixed probability for all of those doesn't make sense, but having them all equal does. It's a stochastic logic program in essence, but it's a bit virtual in reality, because it's really binding all of those choices to be equal at all points.

[Question.] That's an interesting question, but I don't really have a lot to say about it. I can imagine that you could, for instance, try to learn a meta-interpretive
learner using a meta-interpretive learner, but that's a kind of weird thing to do. I haven't talked here about work that we have done on using higher-order background knowledge. There is a paper on that, and it's actually very interesting: it allows a kind of acceleration of the learning which you don't get otherwise. We're still not at the point at which we can learn higher-order background knowledge, but it would be very valuable to do so, because we found, for instance, that by introducing background knowledge that defines a while loop or a for loop, you get much more compact definitions than you would otherwise, because you don't have to define the recursive and base cases, and other higher-order functions of that kind produce much better, much more effective learning. So we're interested in that, but we don't have a lot to say about it at the moment.
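The subsampling scheme described in the lecture for making Metagol noise-tolerant (draw small samples, learn assuming each is noise-free, keep the model that predicts best out of sample) can be sketched in a few lines. This is a schematic reconstruction, not the actual Metagol code: `learn`, `score` and the parameter values are placeholders for whichever consistent learner and accuracy measure you plug in.

```python
import random

def learn_with_subsampling(examples, learn, score, n_samples=50, sample_size=3, seed=0):
    """Repeatedly draw small samples, learn a model from each under the
    noise-free assumption, and keep the model with the best out-of-sample
    score on the remaining examples."""
    rng = random.Random(seed)
    best_model, best_score = None, float("-inf")
    for _ in range(n_samples):
        sample = rng.sample(examples, sample_size)
        held_out = [e for e in examples if e not in sample]
        model = learn(sample)          # assumes the sample is noise-free
        if model is None:              # no consistent model for this sample
            continue
        s = score(model, held_out)     # out-of-sample predictive accuracy
        if s > best_score:
            best_model, best_score = model, s
    return best_model, best_score
```

With, say, a toy threshold learner on labelled numbers, a single mislabelled example gets washed out because samples avoiding it produce models that score better on the held-out data.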
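The fit-and-resample step for the ellipse models (fit a model to sampled high-contrast points, then resample the point that disagrees most with the fit) can be illustrated with a simplified sketch. To keep the least-squares step linear this uses a circle, a special case of the ellipse; the `worst_fitting_point` helper is a hypothetical stand-in for the resampling decision, not the system's actual routine.

```python
import numpy as np

def fit_circle(points):
    """Algebraic least-squares circle fit: x^2 + y^2 = 2a*x + 2b*y + c,
    which is linear in the unknowns (a, b, c); radius^2 = c + a^2 + b^2."""
    pts = np.asarray(points, dtype=float)
    A = np.column_stack([2 * pts[:, 0], 2 * pts[:, 1], np.ones(len(pts))])
    rhs = (pts ** 2).sum(axis=1)
    (a, b, c), *_ = np.linalg.lstsq(A, rhs, rcond=None)
    return (a, b), np.sqrt(c + a ** 2 + b ** 2)

def worst_fitting_point(points, centre, radius):
    """Index of the sampled point deviating most from the fitted model,
    i.e. the candidate to resample, as with the mismatched ellipse point."""
    pts = np.asarray(points, dtype=float)
    residuals = np.abs(np.hypot(pts[:, 0] - centre[0], pts[:, 1] - centre[1]) - radius)
    return int(np.argmax(residuals))
```

A full ellipse fit replaces the linear system with a conic fit, but the loop is the same: fit, find the worst residual, resample, refit.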
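The crater example's point, that the learned theory returns one answer per lighting assumption rather than a single classification, can be mimicked procedurally. The function below is an illustrative stand-in for the induced logic program; the names and the bright-side encoding are invented for this sketch.

```python
def interpret_shape(bright_side, light_assumptions):
    """For a shaded region whose bright side is known, return one
    (assumption, classification) pair per lighting assumption.
    A bump (convex) is bright on the side facing the light; a crater
    (concave) is bright on the side facing away from it."""
    opposite = {"north": "south", "south": "north", "east": "west", "west": "east"}
    hypotheses = []
    for light_from in light_assumptions:
        if bright_side == light_from:
            hypotheses.append((f"light from {light_from}", "convex"))
        elif bright_side == opposite[light_from]:
            hypotheses.append((f"light from {light_from}", "concave"))
    return hypotheses
```

For the same image (bright on the south side), assuming light from the south yields "convex" and assuming light from the north yields "concave", matching the two hypotheses in the lecture.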
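The stochastic-automaton calculation (treat the derivation of a string as a Markov chain and multiply the arc probabilities) is easy to reproduce. The transition table below is an illustrative reconstruction of the lecture's example, with arc probabilities 0.4, 0.6, 0.7 and 0.3 chosen so that the outgoing probabilities from each state sum to one; for simplicity every state is treated as accepting.

```python
def string_probability(transitions, string, state=0):
    """Probability of a string under a stochastic automaton.
    `transitions` maps (state, symbol) -> (next_state, probability);
    the outgoing probabilities from each state must sum to 1."""
    prob = 1.0
    for symbol in string:
        if (state, symbol) not in transitions:
            return 0.0  # no arc: the string is not derivable
        state, p = transitions[(state, symbol)]
        prob *= p
    return prob

# Illustrative automaton: emit 'a' (p=0.4) or 'b' (p=0.6) from state 0, etc.
T = {(0, "a"): (0, 0.4), (0, "b"): (1, 0.6),
     (1, "b"): (1, 0.7), (1, "c"): (2, 0.3)}
```

With these arcs, the string "abbc" gets probability 0.4 x 0.6 x 0.7 x 0.3, which is about 0.05, the figure quoted in the lecture.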
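Regular sampling can be sketched over an idealised derivation tree in which every choice point splits the remaining probability equally among its children, as with the lecture's sigma function. The `branching` and `depth` parameters are assumptions for the illustration; the real system walks the refinement graph of the meta-interpretive learner rather than a fixed-arity tree.

```python
from fractions import Fraction

def locate(target, branching, depth):
    """Descend the tree and return the choice sequence whose
    cumulative-probability interval contains `target`. For target 0.4 with
    three equal branches, the first choice is the middle branch, because
    the left branch only covers cumulative probability up to 1/3."""
    lo, width = Fraction(0), Fraction(1)
    path = []
    for _ in range(depth):
        width /= branching
        for branch in range(branching):
            # take the subtree whose interval [lo, lo + width) holds target
            if target < lo + width or branch == branching - 1:
                path.append(branch)
                break
            lo += width
    return path

def regular_sample(n, branching, depth):
    """n evenly spaced targets give n distinct derivations: sampling
    without replacement, with no bookkeeping of what was seen."""
    return [locate(Fraction(2 * i + 1, 2 * n), branching, depth) for i in range(n)]
```

Because distinct targets land in disjoint intervals, the walk never returns the same derivation twice, which is exactly why regular sampling gives sampling without replacement cheaply.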
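The contrast between Bayes (model-averaged) prediction and MAP prediction can be shown with a toy hypothesis set. The posteriors and predicate functions below are invented for illustration; in the lecture's setting the hypotheses would be sampled logic programs with posteriors computed over Markov chains.

```python
def bayes_predict(hypotheses, x):
    """Bayes prediction: weight each consistent hypothesis's prediction by
    its posterior probability and return the highest-weighted answer.
    `hypotheses` is a list of (posterior, predict_fn) pairs."""
    weight = {}
    for posterior, predict in hypotheses:
        y = predict(x)
        weight[y] = weight.get(y, 0.0) + posterior
    return max(weight, key=weight.get)

def map_predict(hypotheses, x):
    """MAP prediction: commit to the single highest-posterior hypothesis."""
    _, predict = max(hypotheses, key=lambda h: h[0])
    return predict(x)
```

The two can disagree: if the single best hypothesis has posterior 0.4 but two others totalling 0.6 predict the opposite, model averaging overrules the MAP choice, which is why Bayes prediction can be more accurate, especially when few examples leave many hypotheses consistent.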
Info
Channel: Federated Logic Conference FLoC 2018
Views: 745
Rating: 5 out of 5
Id: nmjCXFJsMfc
Length: 78min 57sec (4737 seconds)
Published: Mon Jul 30 2018