Pedro Domingos - Unifying Logical and Statistical AI

Captions
This production is brought to you by the University of Edinburgh.

Thank you all for being here. I'm going to talk about unifying logical and statistical AI; this is joint work with a number of other people. I'll start with a little bit of motivation and background, then describe Markov logic, which is the heart of this talk. I'll say a little about how to do efficient inference in Markov logic and how to do learning, mention some software implementations that are available, talk about some of the applications people have done so far, and conclude with a little discussion.

Here's one possible way to motivate this talk. Back at the Dartmouth conference, when people started the field of AI, they were aiming to have reached human-level AI by now. But if you compare the IQ of people with the IQ of machines and extrapolate the trend, even a century after the original Dartmouth conference we're not going to be anywhere near human-level AI. The question is what we can do about it. We could just give up, but that seems like a cowardly thing to do. I would argue that what we want to do is increase our rate of progress: not just make more incremental improvements of the kind we make every day, but come up with things that make all those increments bigger, so that we progress toward our goal at a higher speed. If we can come up with enough of these, I think it becomes possible that human-level AI will be reached within the lifetime of at least some of the people here.

What I'm going to talk about today is one possible candidate for producing one of these increases in the slope of progress, and it has to do with what you might call the great AI schism, which has riven the field since its very beginnings: the chasm between the logical approaches on the one hand and the statistical approaches on the other. In pretty much every subfield of AI there's a logical approach, or set of approaches, and a statistical one. In knowledge representation, on the logical side you have things like first-order logic and all its variants; on the statistical side you have things like Bayesian networks and Markov networks. In automated reasoning you have satisfiability testing and theorem proving on one side, and MCMC and belief propagation on the other. In machine learning, on the logical side you have inductive logic programming and rule induction; on the statistical side you have things like neural networks and EM. The same divide extends to planning (classical planning versus MDPs), natural language processing (definite clause grammars versus probabilistic context-free grammars), and so forth. If you go to AI conferences you will often hear people who work on one side of this divide complaining about the limitations of the other, and the interesting thing is that they're both right: neither logical approaches by themselves nor statistical approaches by themselves are enough for the kinds of problems we need to solve in AI. What we really need to do is unify the two.
The real world that our intelligent agents are going to have to work in is both very complex and full of uncertainty. Logic is good for handling the complexity, and probability is good for handling the uncertainty, but what we need is both in one language, one system, and that's the goal. We're of course not the first to work on this: there's work going back at least to Nilsson's probabilistic logic in AI, the theoretical work that Joe Halpern and others did, and knowledge-based model construction in the early 90s. More recently the field of statistical relational learning has grown up, where the goal is to combine statistical learning and relational learning, and a whole bunch of representations have been proposed there, including stochastic logic programs, probabilistic relational models, relational Markov networks, and many others. The ones I mention here are just the most direct predecessors of what I'm going to talk about, which is Markov logic. Compared to previous approaches, Markov logic has a number of advantages. One is that it's more general: it really is full first-order logic and full probabilistic modeling. It's also, interestingly, a much simpler language in many ways than many of the things that came before. And it comes with a complete set of learning and inference algorithms that are available in open-source software; you could use them tomorrow. For previous approaches, in many cases there are no algorithms, for example for learning, and when there are algorithms, they're often not actually available. So if you want to do this in practice today, Markov logic is probably one of your best options.

Here's the one-slide summary of what I'm going to talk about. The syntax of Markov logic is very simple: it's just first-order logic with weights on the formulas. The semantics is that those formulas are treated as templates for features of Markov networks, so a set of weighted formulas basically tells you how to build a probabilistic model. Then we'll see various inference algorithms; the one I'll focus on here is the most recent one, called lifted belief propagation, which combines belief propagation with ideas from theorem proving, namely the whole idea of lifted inference. The learning algorithms, not surprisingly, combine ideas from statistical learning, like the voted perceptron and pseudo-likelihood, with inductive logic programming. The software implementing all of this is called Alchemy. Markov logic is actually only a few years old, but it has really taken off, and people have applied it to a lot of things by now. The application I will use here to illustrate what you can do with it is information extraction, but it's been applied more broadly to a lot of natural language problems, social networks, computational biology, robot mapping, etc., and the list continues to grow.

Let me start with a little background to make sure we're all on the same page. First of all, what are Markov networks? We're going to combine Markov networks with first-order logic, so let's start by seeing what Markov networks are. A Markov network, like a Bayesian network, is a model of the joint distribution of a set of variables, and it's a graphical model: we have a node for each variable, and an undirected edge between two nodes if there is a direct dependency between them.
The independence semantics is that graph separation tells you which variables are independent of which, given others. For example, if I condition on Cancer and Asthma, Cough becomes independent of Smoking. So the graph gives you the independence structure. The parameters are given by what are called potential functions: there's a potential function defined over each clique of the graph, each completely connected subgraph, and the potential functions can take on any non-negative real values. For example, here's the clique of Smoking and Cancer. Assuming the variables are Boolean, there are four states, and the potential function assigns a value to each of them. The main thing this potential function is capturing is that the state "smoking and no cancer" is less likely than the others, and the reason, of course, is that smoking causes cancer. The probability of a state of the world is then just the product of all the potential functions, normalized.

Now, there's a problem with this: if the cliques are large, this becomes intractable, because the size of a potential function is exponential in the number of variables in the clique. What can we do about that? We can use what's called a log-linear form for the model, something very popular with statisticians. The idea is that instead of a product of factors, we have an exponentiated sum of weighted features, and you can always convert from one to the other by defining a feature for every state of the clique and letting the weight be the log of the potential. What you gain is that I could have a very large clique with a gazillion states, but if I can find the few important features of that clique, I can just write down those features and their weights and still have a compact model. For example, if I define a feature that's 1 if you don't smoke or you have cancer, and 0 otherwise, and give it a weight of 0.51, this implements the same Markov network as before. In this case the savings are not very impressive, but with large cliques the savings can be huge.
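To make that conversion concrete, here is a minimal runnable sketch. The transcript only quotes the weight 0.51; the potential values 4.5 and 2.7 are my assumption, chosen so that ln(4.5/2.7) is approximately 0.51, matching the quoted weight:

```python
import math
from itertools import product

# Smoking/Cancer clique: potential 4.5 for every state except
# (smokes, no cancer), which gets 2.7 (assumed values, see above).
def potential(s, c):
    return 2.7 if (s and not c) else 4.5

# Log-linear form: a single feature "not smokes, or cancer" with weight
# w = ln(4.5 / 2.7) ~= 0.51; the overall scale washes out in Z.
w = math.log(4.5 / 2.7)

def loglinear(s, c):
    return math.exp(w * (1.0 if (not s or c) else 0.0))

for name, phi in (("potential", potential), ("log-linear", loglinear)):
    z = sum(phi(s, c) for s, c in product([False, True], repeat=2))
    dist = {(s, c): round(phi(s, c) / z, 4)
            for s, c in product([False, True], repeat=2)}
    print(name, dist)  # both print the same four probabilities
```

Both representations give the "smokes without cancer" state probability 0.1667 and each of the other three states 0.2778, confirming the equivalence.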
Let me also say a little about first-order logic, just to get some terminology down. Formulas in first-order logic are made up of logical connectives (conjunction, disjunction, quantifiers, etc.) and four types of symbols: constants, representing objects in the domain, say Anna; variables, like x; functions, like MotherOf(x); and predicates, which represent properties of objects and relations between objects, like Friends(x, y). I'm going to call a grounding of a predicate or a formula the result of replacing all of its variables by constants. For example, if Friends is one of the predicates in my domain and Anna and Bob are two of the constants, one possible ground atom is Friends(Anna, Bob), and this is just a Boolean variable that's true if Anna and Bob are friends and false otherwise. And I'm going to call a world (also known as a possible world, model, interpretation, or state, among other things) an assignment of truth values to all the ground predicates. So I have my language of predicates and constants, I combine them in all possible ways, I get a very large number of Boolean variables, and by assigning truth values to them I have completely described the state of the world. The main thing we're going to be dealing with in this talk is probability distributions over possible worlds.

All right, now let's get to the heart of things: Markov logic. Here's an intuitive way to introduce it. A logical knowledge base can be seen as a set of hard constraints on possible worlds: if you violate even one constraint, even a single grounding of a single formula, the world becomes impossible. This is why logic is so brittle. Logic does not distinguish between a world that violates one formula and a world that violates all of them; as soon as you violate even one little thing, you fall off a cliff. What we would like is a smoother degradation as the world violates more of your formulas. So how about we make the formulas soft constraints instead of hard constraints? Now, when a world violates a formula, it doesn't become impossible; it just becomes less probable. We're going to give each formula a weight that represents how strong a constraint it is: if I really believe in a formula, I give it a high weight, and you pay a big penalty for violating it. The probability of a world is then given by a log-linear model of the form we just saw, in which the groundings of the first-order formulas are the features. So we get the behavior we want: the more formulas a world satisfies, and the higher their weights, the more probable the world is.

That's the intuition; here's the more precise definition. I'm going to call a Markov logic network, or MLN for short, a set of pairs (F, w), where F is a formula in standard first-order logic and w is a real number, positive or negative. That's the syntax. The semantics is as follows: together with a set of constants, which could be just the constants appearing in the MLN or others that you define, the MLN defines a Markov network that has one node for each grounding of each predicate in the MLN, and one feature for each grounding of each formula in the MLN, with the corresponding weight. So all the features that are groundings of the same formula have the same weight.

Here's an example that will hopefully make things clearer. This is a picture from the New York Times a few months ago, about a social network of who smokes and who doesn't. It turns out that a very important factor in whether you smoke is whether your acquaintances and friends smoke. This was a study of about a thousand people, shown in 1971 and in 2000, and the nice thing is that many fewer people smoke now: the circles are smaller and there are fewer of them. But the other thing that's noticeable is that people don't stop smoking at random; they stop smoking in clumps. If your friends stop smoking, that makes you more likely to stop, and if they don't, you'll have a hard time stopping as well. So let's try to model this. Let's start by doing it in English. We have two statements: the first is "smoking causes cancer", and the second is "friends have similar smoking habits". Anybody who knows logic knows how to turn these into formulas: you write, for every x, Smokes(x) implies Cancer(x), and for every x and y, Friends(x, y) implies that Smokes(x) if and only if Smokes(y), or something equivalent. Very easy. There's only one problem: even if the two English statements are true, these two formulas are
false. They're false because not everyone who smokes gets cancer, and certainly not every pair of friends has the same smoking habits. But if we turn this into an MLN by attaching weights to the formulas, we're back to what we want: we have statements about statistical regularities in the world, "most smokers get cancer", and so on. One formula gets a higher weight than the other because it expresses a stronger regularity; intuitively, stronger weights correspond to stronger regularities.

So now we have this MLN; what is it actually saying about the world? Suppose we have a very simple world with only two people in it, on a desert island or something, because that's what fits on a slide. Let's call them Anna and Bob (usually it's Alice and Bob, but Anna is even shorter, and to be shorter still we'll just represent them by A and B). The rule is that we have a node in the network for every grounding of every predicate. So we have Cancer(A) and Smokes(A), and the same for Bob, and then we have Friends(A, B) and Friends(B, A). Notice friendship is not symmetric; sociologists will tell you that Bob can be a much better friend to Anna than she is to him, and this happens in practice. And then there are Friends(A, A) and Friends(B, B), which maybe have to do with self-esteem or something; a degenerate case of friendship. So those are the nodes, and at this point what I have is a bunch of Boolean variables; in some sense I can now forget where they came from. The question is, what is my distribution over these Boolean variables? The rule was that I have a feature for every grounding of every formula, and wherever I have a feature, I connect the corresponding nodes in a clique. For example, Smokes(x) implies Cancer(x) creates the feature Smokes(A) implies Cancer(A), and therefore there's an edge between those two nodes, and the same for Bob. That formula involves two atoms, so it's a single edge; the friends formula has three literals in it, so it's a three-way clique and creates triangles, for example between Smokes(A), Friends(A, B), and Smokes(B). So now this is my complete graph of the Markov network, and the full distribution it represents is the normalized exponential of the sum, over all formulas, of the weight of the formula times its number of true groundings. So the more people smoke, and the more smokers have cancer, the more probable the world becomes; and if there's one smoker who doesn't have cancer, that makes the world slightly less likely, but not impossible.

Another important thing to notice is that an MLN does not represent a single distribution. This is one of the things that makes it powerful: it actually represents a whole family of distributions, depending on the domain you apply it to. If you apply it to a large domain with a lot of constants, you get a very large Markov network; otherwise you get a small one. What they have in common is that they're all repetitions of the same templates.
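Written out, the distribution being described here is the standard Markov logic joint, where $n_i(x)$ is the number of true groundings of formula $i$ in world $x$ and $w_i$ is its weight:

$$P(X = x) \;=\; \frac{1}{Z}\,\exp\!\Big(\sum_i w_i\, n_i(x)\Big),
\qquad Z \;=\; \sum_{x'} \exp\!\Big(\sum_i w_i\, n_i(x')\Big).$$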
Another important point: this looks very nice so far, but it doesn't look remotely practical. I'm going to have an exponential number of nodes; I'm not even going to get off the ground. So what do we do? This is where a lot of the progress has been, and we'll see some of it shortly, but the first thing you can do, very simple and already very useful, is to have typed variables and constants, and only replace variables by constants of the corresponding types. This already really cuts down on the size of your network. In Markov logic we can handle the full range of constructs of first-order logic, including functions, existential quantifiers, infinite domains, and continuous domains. I'm not going to talk about that here for the sake of brevity, but if you're interested I can certainly point you to the papers.

How does this relate to the kinds of statistical models that people use in AI and machine learning today? The nice thing is that pretty much all of them fall out as very simple special cases of Markov logic, and we'll see examples later. Pick your favorite method: conditional random fields, Bayesian networks, maximum entropy models, etc.; all of these can be straightforwardly expressed in Markov logic. In addition, Markov logic gives you something very important: the ability to deal easily with non-i.i.d. data. I.i.d. means all your objects are assumed to be independent and to come from the same distribution. This assumption is almost always made in statistical learning because it makes life easy, but it's false: my properties don't just depend on me, they depend on the properties of the people I'm related to; you might influence me to start smoking, and so forth. In Markov logic this is easy to handle, in the way we just saw: you introduce predicates that represent relations, and then you have dependencies that are mediated by them. Markov logic also makes it very easy to compose models, simply by having them share predicates. I write the formulas that represent a CRF and the formulas that represent a logistic regression, and I can compose the two simply by virtue of the shared predicates, and do joint inference over all of them.
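As one concrete instance of that reduction (my worked example, not from the talk): take an MLN over Boolean evidence atoms $X_1,\dots,X_n$ and a query atom $Y$, with a unit formula $Y$ of weight $w_0$ and formulas $X_j \wedge Y$ of weight $w_j$. Conditioning the resulting log-linear model on $x$ yields exactly logistic regression:

$$P(Y = 1 \mid x) \;=\; \frac{\exp\!\big(w_0 + \sum_j w_j x_j\big)}{1 + \exp\!\big(w_0 + \sum_j w_j x_j\big)}.$$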
What about the relation to first-order logic? It should be fairly intuitive that first-order logic is the limit of Markov logic that you get when you let the weights go to infinity, because then the soft constraints become hard constraints again. Indeed, there's a theorem that says every entailment query in first-order logic can be answered by computing conditional probabilities in the MLN obtained by taking the logical knowledge base and giving infinite weight to all the formulas. Of course, we're really interested in the case where the weights are not infinite, and there we can still say something interesting, which is the following: if your knowledge base is satisfiable, meaning it's possible to make all the formulas true at the same time, and all the weights are positive (which you can always ensure by negating formulas that have negative weights and flipping the sign of the weight), then the satisfying assignments are the modes of the distribution. So even in the finite case, the worlds of first-order logic are still there: they're the modes, the peaks of the distribution. The difference is that when you move away from a peak, you don't fall off a cliff; your probability just gradually goes down.

Most importantly, in first-order logic, if you have a contradiction, direct or indirect, between your formulas, all hell breaks loose: at that point everything follows and things break down. That means you can't really build large knowledge bases, because it becomes impossible to build one while making sure there's no contradiction, and you can't combine knowledge bases from multiple people, as on the Semantic Web, because they'll be full of contradictions. With Markov logic, however, contradictions between formulas are not a problem, because Markov logic weighs the evidence on both sides and outputs a probability for your proposition.

So we can potentially do a lot of things with Markov logic, but this will all be only a theoretical exercise unless we can come up with efficient inference algorithms. [In response to a question:] We'll see shortly how we handle this, but briefly: the way we do inference with a satisfiability solver is weighted satisfiability. If you have hard constraints, you're still in the logical situation, and a contradiction will make things break down. But if you have a contradiction between formulas with finite weights, we just look for the possible world that maximizes the total weight, because that's the most likely world; if one of the contradictory formulas has the higher weight, we prefer that one. We can also just compute the conditional probability of a statement given the evidence, including the contradictory formulas. [In response to another question:] Right, the statement about modes presupposed that the knowledge base is satisfiable. If it's not satisfiable, you may still have modes, but those modes will no longer be the worlds that first-order logic likes; at that point the correspondence breaks down.

Okay, so, inference. How can you be efficient in this language? Things don't look very promising, because inference in first-order logic is not exactly efficient (in fact, it's undecidable), and inference in probabilistic models is also not efficient (it's #P-complete). So we're combining #P-complete with undecidable, and things aren't looking good. Surprisingly, though, we and others have actually been able to get very far, and I'll show you one example of how and why. There are two main kinds of inference people usually want to do. One is what's called MAP or MPE inference, which is finding the most likely state of the world given your evidence; this is, for example, what Viterbi does in hidden Markov models, for those of you familiar with it. We've developed a number of algorithms for this, which I won't go into here, based on weighted satisfiability testing, as I just mentioned. The other type of inference you often want is computing marginal and conditional probabilities: what is the probability that Anna gets cancer, given that Bob smokes and they're friends? For this we can use things like Markov chain Monte Carlo, and again we've developed some efficient algorithms, including something called MC-SAT, which combines MCMC with satisfiability testing.
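For the MAP side, here is a minimal sketch of a weighted MaxWalkSAT-style local search over ground weighted clauses, in the spirit of the weighted-satisfiability solvers just mentioned. The data structures are my own simplification, not Alchemy's implementation:

```python
import random

def maxwalksat(clauses, atoms, max_flips=10000, p=0.5):
    """Weighted MaxWalkSAT-style search. `clauses` is a list of
    (weight, literals) pairs, where each literal is (atom, sign);
    a clause is satisfied if any literal matches the current state.
    Returns the best assignment found (lowest total weight of
    unsatisfied clauses), i.e. an approximate MAP state."""
    state = {a: random.random() < 0.5 for a in atoms}

    def unsat(st):
        return [c for c in clauses
                if not any(st[a] == sign for a, sign in c[1])]

    def cost(st):
        return sum(w for w, _ in unsat(st))

    best, best_cost = dict(state), cost(state)
    for _ in range(max_flips):
        broken = unsat(state)
        if not broken:
            return state  # everything satisfied: cost 0 is optimal
        _, lits = random.choice(broken)  # pick an unsatisfied clause
        if random.random() < p:
            # random walk: flip a random atom in the clause
            atom = random.choice(lits)[0]
        else:
            # greedy: flip the atom in the clause that lowers cost most
            atom = min((a for a, _ in lits),
                       key=lambda a: cost({**state, a: not state[a]}))
        state[atom] = not state[atom]
        c = cost(state)
        if c < best_cost:
            best, best_cost = dict(state), c
    return best
```

Hard constraints can be emulated by giving clauses very large weights, which recovers the purely logical behavior described above.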
There's also an old idea, which we've generalized, called knowledge-based model construction, where you don't ground out the entire network; you only ground out the minimal network needed to answer your query, and that in itself can save you a lot. The most recent method, and the one I will cover here, is something called lifted belief propagation. The idea is that we want to do lifted inference in the same way we can do lifted inference in logic, except in this new, more general probabilistic setting. What's the idea? The thing that's so good about inference in first-order logic is that I can do inference for a whole set of objects at once, irrespective of the size of the set; in fact, I can even do it for an infinite number of objects in finite time. In the past this has not been possible in probabilistic models: what people do in these models is create the ground network, which is very large, and then do inference on it, which might not be possible, or, even if possible, might be very slow. What we'd like is lifted probabilistic inference, and the way we do it here is by grouping the atoms and clauses in the network into indistinguishable sets. By indistinguishable I mean the following. Belief propagation, which I'll review shortly, works by atoms and features sending messages to each other. If two atoms send and receive exactly the same messages throughout belief propagation, it's a waste of time to do that computation twice; I can do it once for both of them, and if there's a million of them, my computation is a million times smaller. That's exactly what we're going to do here. Let me just mention that prior to our work, the only thing people had developed was a lifted version of variable elimination, which was certainly an important advance. The problem, of course, is that variable elimination is an exact inference method, and it blows up; it doesn't scale to real-world problems. For all the things we tried to apply lifted variable elimination to, we weren't even able to finish, because it ran out of memory.

So how can we lift belief propagation? Here's the basic idea. For purposes of belief propagation, it's useful to think of your network as what's called a factor graph. A factor graph is a bipartite graph with two types of nodes: some nodes are your variables, in our case the ground atoms, and the other nodes are your features, and each node is connected to the features it appears in, and vice versa. The way belief propagation works is that, in essence, we start with each node having a crude approximation of its marginal, say a uniform distribution. The nodes send messages to the features, and the features send messages back to the nodes, and basically the message a feature sends to a node is what it thinks the node's marginal distribution should be. The way a node computes its new marginal is by multiplying together the marginals according to all the features it's in, and the way a feature computes a new marginal for a node is by summing out the other nodes from the feature: I take e to the w times f, the exponentiated weighted feature, and (this is why it's an approximation) I pretend the other nodes are independent, so I take their marginals, multiply them, multiply by the factor, and sum everything else out, and what I have is a marginal for this node. The node receives the marginals from the features, multiplies them, and that's its new hypothesized marginal, and this keeps going until things converge.
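In symbols, these are the standard loopy BP updates on the factor graph, with $nb(\cdot)$ denoting neighbors and $w$ the weight of feature $f$; in the lifted version described next, these same messages simply get raised to powers counting how many identical ground messages they stand for:

$$\mu_{x \to f}(x) \;=\; \prod_{h \in nb(x) \setminus \{f\}} \mu_{h \to x}(x),
\qquad
\mu_{f \to x}(x) \;=\; \sum_{\sim\{x\}} \Big( e^{\,w f(\mathbf{x})} \prod_{y \in nb(f) \setminus \{x\}} \mu_{y \to f}(y) \Big).$$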
Belief propagation has received a lot of attention in recent years, and how to do it better is still a very active research area. The problem we're dealing with here is that people typically run it on networks with tens or maybe hundreds of nodes, and we're looking at networks with millions or billions of nodes and features, so we need to do something different. What we're going to do is notice that often some sets of nodes send exactly the same messages to their features, and vice versa. If I can identify these (call them supernodes and superfeatures), then instead of sending all those messages back and forth, I can send just one: a single message from a supernode to a superfeature replaces all the individual messages each node would have to send to each feature. As it turns out, I can then do belief propagation on this lifted network in pretty much the same way as before, except that the messages get raised to certain powers, because they represent products of many identical messages. So here's how the algorithm works: first I form my lifted network, and then I run belief propagation on it. The lifted network is composed of supernodes and superfeatures, where a supernode is a set of nodes that all send and receive exactly the same messages, and likewise for superfeatures. We can guarantee that when you run belief propagation on this network, which hopefully is much, much smaller, you get exactly the same results as if you ran it on the full ground network, and of course the time and memory savings can be orders of magnitude.

The crucial question is how you form the lifted network; that's the hard problem. Let me give you a rough idea of how it's done. Initially we have some evidence, meaning known truth values for some ground predicates, so we can divide each predicate into three supernodes: one containing all its true groundings, one all its false groundings, and one all its unknown groundings. (You don't necessarily have all three, but in the most general case you do.) The first step is to form superfeatures from the supernodes: a superfeature is, in essence, the join of the supernodes that appear in it. If I have R(x, y) and S(y, z), then I have a feature that's the conjunction of the two, a feature over x, y, and z, joined on y. So the superfeatures are joins of the supernodes: if each predicate can be true, false, or unknown, then for a superfeature over two of them there are the combinations true-true, true-false, true-unknown, and so on. In this way I create a first generation of superfeatures from my initial supernodes.
Now think of this as having started at the evidence: the evidence induces a finite subdivision of the features into superfeatures, but these in turn can cause the nodes they're connected to to become subdivided. How do we do that? By projecting the superfeatures down onto the predicates. Say I have a feature over x, y, and z that includes a predicate R(x, y); I project it onto R(x, y) by getting rid of z and adding up the results. This is the key: my new supernode is all the groundings of the predicate that receive the same number of projections from the same superfeatures, because if two atoms receive exactly the same number of projections from the same features, they will receive the same messages, and therefore they can be treated as one supernode. So now I have a new generation of supernodes, from which I form a new generation of superfeatures, and I keep doing this until it converges. We have a theorem that says there exists a unique minimal lifted network (minimal in having the minimum number of nodes and features), that this algorithm finds it, and that BP on the lifted network gives the same results as on the ground network.

There's still an important remaining issue, which is how to represent the supernodes and superfeatures during the network construction. Once I've constructed the network, it looks like a regular network: I can run belief propagation on it, it's guaranteed never to be slower than running on the full network, and in fact it can be way faster. But during the construction I need to somehow represent my nodes and features. The easiest way is to represent them extensionally, by database tables: lists of tuples, the list of true ground atoms, and so forth. The problem is that those tables can become very large and become the bottleneck. The next option is a more resolution-like approach using equality constraints: I can say "these are all the atoms except the ones where x equals A", or "only the atoms where x equals A, with y and z free to vary", and this is already more efficient. What we can do that's even better, and that we've been working on recently (it's not published yet), is to form clusters of nodes and clusters of features, which can be much more compact even than the resolution-like representation; you can also imagine doing that approximately and getting even more gains, which we have.
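To make the grouping step concrete, here is a toy sketch of lifted-network construction by iterated grouping (sometimes called color passing). This is my own simplified rendering over an extensional, table-like representation, not the Alchemy implementation:

```python
def lift(atoms, factors, evidence, iters=5):
    """Group ground atoms that would send and receive identical BP
    messages. `atoms`: list of atom names; `factors`: list of tuples
    of atom names; `evidence`: dict mapping known atoms to True/False."""
    # Initial supernodes: true, false, and unknown groundings.
    color = {a: str(evidence.get(a, "?")) for a in atoms}
    for _ in range(iters):
        # Superfeatures: a factor's signature is its atoms' colors.
        fsig = {f: tuple(color[a] for a in f) for f in factors}
        # Project superfeatures back onto atoms: an atom's new color
        # combines its old color with the (signature, position) pairs
        # of the factors it appears in.
        raw = {a: (color[a],
                   tuple(sorted((fsig[f], f.index(a))
                                for f in factors if a in f)))
               for a in atoms}
        ids = {}  # canonical renumbering of the new colors
        color = {a: ids.setdefault(raw[a], len(ids)) for a in atoms}
    groups = {}
    for a in atoms:
        groups.setdefault(color[a], []).append(a)
    return list(groups.values())

# Two-person smokers example, one Smokes(x) => Cancer(x) factor per
# person. With no evidence, Anna's and Bob's atoms are indistinguishable
# and collapse into two supernodes; with evidence, they split apart.
atoms = ["Smokes(A)", "Smokes(B)", "Cancer(A)", "Cancer(B)"]
factors = [("Smokes(A)", "Cancer(A)"), ("Smokes(B)", "Cancer(B)")]
print(lift(atoms, factors, {}))                   # two supernodes
print(lift(atoms, factors, {"Smokes(A)": True}))  # four singletons
```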
Let me skip over some interesting open questions that remain here, in the interest of time (we can come back to them if people are interested), and talk a little about learning. At this point we actually have inference algorithms that are good enough to do real applications in reasonable time, and I'll mention some later. But where do the weights and the formulas come from? We'd like to learn them from data. The data now is a relational database, not just one table as in traditional machine learning, but a database with multiple relations. Here I need to make the closed-world assumption, which is that every atom not in the database is assumed false; that makes things easier. If you don't want to make that assumption, those atoms are unknown, and you need EM versions of these algorithms; they're available in the Alchemy software that I mentioned, but I'm not going to go into them here. So now there are two tasks, learning parameters (in our case the weights) and learning structure (in our case the formulas), and each can be done generatively or discriminatively. Let's briefly look at each.

First, generative weight learning. The standard approach is maximum likelihood: we want to find the weights that make the database we're seeing maximally probable. Unfortunately you can't find them in closed form, but fortunately the likelihood as a function of the weights is concave, which means it has a single global optimum, and we don't have to worry about initialization and getting trapped in local optima and all of that. So we can do this by gradient ascent, using fast second-order techniques like L-BFGS or conjugate gradient, which again are available in Alchemy. The gradient has a very intuitive form: the derivative of the log-likelihood with respect to a weight is just the difference between the number of true groundings of the corresponding clause in the data and the number predicted by the MLN. So if the MLN predicts that a formula is true less often than it really is, its weight goes up, and if it predicts it more often, the weight goes down; once the predicted and empirical counts line up, we've reached the maximum-likelihood point and we're done. Very nice. There's only one problem, and it's a big one: this requires inference, at every step of the optimization, so it's going to be very expensive.

Can we do something quicker? Yes: we can borrow the idea of pseudo-likelihood from the world of Markov random fields. The idea of pseudo-likelihood is to have a measure that's like likelihood but doesn't require inference: it's just the product, over all variables, of the probability of each variable conditioned on the state of its neighbors in the data. All we need to compute is a conditional probability, which is very easy, so this requires no inference, and it's a consistent estimator, which is good; it's been used a lot in fields like vision, spatial statistics, NLP, etc. It does have the problem that it's not quite the right thing to optimize: if your queries involve long chains of inference at query time, which pseudo-likelihood doesn't take into account, the results may not be very good. So when it works, it's nice, because it's very fast and reliable, but when it doesn't, you need something else.

Here's another thing you can use: discriminative learning. These days people tend to prefer discriminative learning over generative learning anyway, because it's more accurate. The idea is that if you know in advance which variables are going to be queried (call them Y) and which are going to be evidence (call them X), then you can just maximize the conditional likelihood of Y given X and not waste any effort modeling the joint distribution of the X's, because you'll already know them at query time.
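In symbols, the three objectives just described, in their standard forms, where $n_i$ counts true groundings and $MB(x_j)$ is the Markov blanket of $x_j$:

$$\frac{\partial}{\partial w_i} \log P_w(x) \;=\; n_i(x) - E_w[\,n_i\,],
\qquad
\mathrm{PL}_w(x) \;=\; \prod_j P_w\big(x_j \mid MB(x_j)\big),$$

$$\frac{\partial}{\partial w_i} \log P_w(y \mid x) \;=\; n_i(x, y) - E_w[\,n_i \mid x\,].$$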
This is both more efficient and more likely to give you an accurate model, because you won't be compromising the prediction of Y for the sake of better prediction of X. The gradient is still the same, except that now it's in terms of the counts conditioned on x and y, and we have fewer cliques to model. And there's another thing we can do. What makes inference hard is that there are usually many isolated modes, but when you condition on evidence, those modes start to disappear; in the limit, they all disappear and leave only one. So we can approximate the expected counts, which are essentially a sum over all possible states, by the counts in the single most likely state. The idea is that hopefully there's a dominant peak, and most of the probability mass is at that peak and nearby states, so instead of doing an exponential sum, I just find the peak and use the counts there as an approximation to the overall expected counts. This idea was originally proposed by Mike Collins for training hidden Markov models, and he called it the voted perceptron, because it's a lot like the perceptron algorithm. There the network is a linear chain: these are the observations, and these are the hidden variables. The way the voted perceptron works is that you initialize the weights to zero and then go into a cycle: estimate the most likely Y given X, in his case using the Viterbi algorithm; take the difference between the counts in the data and the counts in the MAP solution as your training signal; multiply it by a learning rate and update your weights. At the end you return the average of the weights over all iterations, which is why it's called "voted", and this tends to give better results. How do we generalize this to Markov logic? The answer is very simple: all we need to do is replace Viterbi by MaxWalkSAT, or by some other procedure for finding the MAP solution (a weighted satisfiability solver, max-product belief propagation, or whatever), and we now have something that applies to networks with arbitrary topology. The algorithm is otherwise pretty much the same. In addition, we've more recently developed faster algorithms that use second-order techniques, like scaled conjugate gradient and Newton methods, which are often a lot faster than the voted perceptron, but the basic idea is still similar.
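Here's a minimal sketch of that generalized voted perceptron, with the MAP step abstracted behind a callable. The interfaces are hypothetical; in practice the MAP step would be a weighted-satisfiability or max-product solver, like the MaxWalkSAT sketch above:

```python
def voted_perceptron(true_counts, map_counts, n_weights,
                     epochs=100, rate=0.1):
    """Discriminative MLN weight learning, Collins-style.
    `true_counts[i]`: true groundings of clause i in the training data.
    `map_counts(w)`: counts of each clause in the MAP state under
    weights w, found e.g. by MaxWalkSAT (this replaces Viterbi in
    Collins's linear-chain version). Returns the averaged weights."""
    w = [0.0] * n_weights
    total = [0.0] * n_weights
    for _ in range(epochs):
        predicted = map_counts(w)              # MAP inference step
        for i in range(n_weights):
            # gradient approximation: data counts minus MAP counts
            w[i] += rate * (true_counts[i] - predicted[i])
            total[i] += w[i]
    return [t / epochs for t in total]         # averaging = "voting"
```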
Finally, structure learning, which is the problem of discovering what formulas hold in your data. This is both a generalization of feature induction in Markov networks and a form of inductive logic programming, since what ILP does is discover clauses that hold in the data. There are a couple of important differences, though. One is that here we want to use arbitrary clauses, not just Horn clauses, and our evaluation function should now be likelihood, because we're building a probabilistic model. In principle, any kind of ILP search algorithm could be used here, but there's a big problem: in order to evaluate a new candidate clause, I need to learn weights for it. So I need to learn weights for every single candidate I try during the search, that could be a really huge number, and learning weights is not exactly instantaneous. Surprisingly, though, if you do a couple of very simple things, weight learning turns out not to be the bottleneck. First, when I modify a clause and relearn the weights, I don't start from scratch; I start from the current weights, because chances are most of the weights aren't going to change, and even the ones that change won't change much. Second, I can use looser convergence thresholds, because I just want to figure out which candidate wins, not exactly what the weights should be; at the end of each selection step I can do the exact weight learning. Things like L-BFGS are extremely fast when you start close to the optimum, so we often converge in just two or three iterations. Interestingly, the bottleneck turns out to be counting clause groundings, which is an NP-complete problem in itself. When we were doing weight learning we only had to do it once, which wasn't too bad; now we have to do it every time we introduce a new clause. But again, there's a bunch of things we can do to speed this up, including caching and various kinds of subsampling. If a clause has a billion groundings, I don't need to go and check whether each one of them is true; I can just subsample, see what fraction of the sample is true, and extrapolate that to the whole.
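A sketch of that subsample-and-extrapolate estimate (my illustration; Alchemy's actual estimators are more sophisticated):

```python
import random

def estimate_true_groundings(groundings, is_true, sample_size=10000):
    """Estimate a clause's number of true groundings by checking a
    random subset and extrapolating, instead of checking all of them
    (exact counting is intractable for clauses with huge domains)."""
    n = len(groundings)
    sample = random.sample(groundings, min(sample_size, n))
    frac = sum(1 for g in sample if is_true(g)) / len(sample)
    return frac * n
```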
With that, structure learning in Markov logic becomes about as efficient as other kinds of ILP, which means it's not blindingly fast, but it's something you can do in practice. You then have the usual choices of initial state, operators, evaluation function, search algorithm, and so on; you can do top-down search, bottom-up search, or various combinations. For example, the system might learn that smoking is useful in predicting whether you're going to get cancer, or that whether your friends smoke is predictive of whether you smoke; yes, that's exactly the kind of thing it learns. There's another thing you might want to do, which is predicate invention, where you discover the predicates themselves. In some ways this is the most interesting problem; it's something we and others have worked on, and I'm not covering it here, but I'd be more than happy to talk about it. This is actually where most of our recent work has been, because that really is the bigger challenge. It's a hard problem, but we've made some good progress on it. The statistical equivalent of predicate invention is discovering hidden, or latent, variables; predicate invention and latent variable discovery are really two sides of the same coin, and what we're doing right now basically combines those aspects.

Let me briefly mention the software all of this is implemented in. It's called Alchemy, and it's available at the URL I'll put up again at the end. It has all the algorithms I've talked about. You encode your knowledge, your assumptions, in first-order logic, which is usually something you can do very quickly, and it's very easy to learn, because if you know logic, you already know Markov logic at some level. It also has a bunch of nice programming-language features: you can link in external functions, define types, add comments, the usual things. Once we started using Markov logic, we realized it really is best thought of as a kind of programming language, in the same way that Prolog is a declarative programming language, except with some interesting new features. So I think it's interesting to compare Alchemy with what's available on both the symbolic side and the statistical side. If you compare Alchemy with Prolog: in Prolog the representation is Horn clauses; in Alchemy it's arbitrary first-order formulas. In Prolog the inference engine is theorem proving; in Alchemy we have a bunch of them, one of which is lifted BP, plus satisfiability testing, weighted MaxSAT, and things like that. But the big difference is that in Alchemy you have learning and uncertainty handling right out of the box, whereas in Prolog you have to somehow add them afterward, and that's where your hard work begins. You can also contrast Alchemy with something like BUGS, probably the most popular package for graphical models out there; the name is short for Bayesian inference Using Gibbs Sampling. BUGS uses Bayesian networks; we use Markov networks, and in fact our representation is more general, in the sense that Markov logic lets you encode arbitrary log-linear models, and Bayesian networks are only a subset of the possible log-linear models. In BUGS your inference is Gibbs sampling, which is excruciatingly slow for a lot of models; in Alchemy we have much faster things available. BUGS learns parameters; in Alchemy you can also learn structure. But the most important difference is that Alchemy is relational: in Alchemy you can talk about relations between objects, non-i.i.d. data, etc. very easily, whereas in BUGS that's either impossible or very hard, depending on exactly what you're trying to model. So Alchemy in some sense combines a lot of the best features of both, and the most important thing is that in Alchemy you have the logic and the probability together and don't even have to think about the distinction between the two anymore; you just have both capabilities at your disposal. Of course, both Prolog and BUGS are much more mature systems than Alchemy, so they have a lot more bells and whistles than Alchemy does at this point, but for many things you might want to do, Alchemy is already a better starting point than either.

As an illustration of that, let me, to finish up, talk a little about applications. As I mentioned, Markov logic is only a few years old, but it's been applied to a remarkable number of things, by us and by various other people. Here's a partial list: information extraction; entity resolution; link prediction; other natural language processing tasks (Sebastian, for example, has done a lot of applications of Markov logic to things like semantic role labeling and entity resolution, and won some competitions along the way); web mining; computational biology; social networks; robot mapping; activity recognition. Some of you might be familiar with Cyc, the world's largest knowledge base: the folks at Cycorp are turning parts of Cyc into MLNs so they can do probabilistic reasoning with them. Also, CALO, the largest project in DARPA history, used Markov logic for its core inference engine, and again this was very successful. And so on.
So let me pick one particular application, because I think there's nothing like seeing a concrete application to see what the power of something like Markov logic is. Information extraction, a very important and very widely studied problem these days, is the problem of taking text and extracting from it a database that you can then issue queries to. One popular example is citation databases like CiteSeer and Google Scholar: the goal is to go to the citation lists at the end of papers and extract the papers, so as to build a citation database. There are basically two parts to this. The first is segmentation, where I want to detect that "AAAI-06" is a venue field, and so is "Proceedings of the Twenty-First National Conference on Artificial Intelligence", and so forth. The second step is entity resolution: I need to figure out that "AAAI-06" and "Proceedings of the Twenty-First National Conference on Artificial Intelligence" actually represent exactly the same thing, and this can be very hard, because the two strings look completely different. Once I've matched the fields, I also want to match the citations as a whole: I want to identify that these two citations are actually the same paper, otherwise I'll get duplicate records, my queries will return garbage, and any modeling I try to do on top of this will also be broken.

The state of the art today is to use something like an HMM or a CRF to do the segmentation, by assigning each token to a field (author, author, author, title, title, title, venue, ...), and then to do the entity resolution by taking a classifier, typically logistic regression or naive Bayes, applying it to every pair of fields or citations, and predicting whether they're the same. Then you need to apply transitive closure, which can be done in a number of ways, usually quite heuristic ones: if I've decided that A and B are the same, and B and C are the same, then I need to decide that A and C are the same as well. If you were to go and implement your own information extraction system in Java or C++ today, it would probably take you on the order of weeks, and it would probably run to tens of thousands of lines of code. In Alchemy you can do this with just seven formulas: seven formulas that fit on a slide and implement a complete information extraction system. If you don't believe me, you can download this from the Alchemy site and try it yourself; in fact, I've given talks where people did exactly that while I was speaking and came to me at the end with questions about what their system was doing.

So how do we do this? An MLN file in Alchemy has three parts: type declarations, predicate declarations, and the formulas themselves. The type declarations are actually optional, because the types can be inferred from the data, so we almost never write them, but for clarity I'll do so here. We have four types. The first is token: all the words that appear in the data. The second is field, which in this case is author, title, and venue, but could be other things (cities, companies, etc.). Then I have all my citations, and an index that runs over positions within a citation. Those are the types.
Now, the predicates. The predicate Token(t, i, c) is my evidence: it says that token t appears in position i of citation c; for example, the token "singular" appears in position one of citation one. My query predicates are these three. InField does the segmentation: it says that position i of citation c belongs to field f; for example, the first position of citation one belongs to the author field. The other two do the entity resolution: SameField says that, for example, the author fields of citations one and two refer to the same person, and SameCitation says that two citations are the same.

With these predicates, here's the MLN. It has seven formulas, as advertised, all pretty short. The first three basically implement an HMM (or actually a CRF, because we train it discriminatively, but either way), and the last four, as we'll see, do the entity resolution. How do these formulas implement an HMM? It's quite simple. The first formula says that having a token in some position implies that a particular field is in that position. All the open variables are implicitly universally quantified, as in Prolog, and we have this "+" notation, which basically makes the formula a summary of all the instances you get by replacing the "+" variables with all the tokens and all the fields. So when we learn the weights for this formula, we're learning a matrix of token by field, and this is exactly the observation matrix of the HMM. The next formula, not surprisingly, is the transition model of the HMM, and what it's capturing is the correlation between the fields of successive positions. There's one more rule here that isn't strictly necessary (it just improves things somewhat), so let me jump over it.

Now, the entity resolution. The first rule that does a lot of the work says that if two fields have the same tokens, they're the same field: the more tokens two fields have in common, the more likely they are to be the same. Of course, some tokens are more informative than others, and again we have a weight for this formula for each token-field pair, so a token that is very indicative of a particular field gives its instance of the rule a high weight. You can look at this as a logistic regression, where SameField is the predicted variable and we've just constructed the features for the regression, except the features are defined by logic. You can also look at it as a similarity comparison: I'm comparing fields, and more similar ones are more likely to be the same. Then I have a little rule that does a lot of work, which says that the fields are the same if and only if the citations are the same. Notice this is an equivalence. Certainly, if the citations are the same, the fields necessarily have to be the same; but in the other direction, if the titles are the same, that's evidence that the citations are the same, and even if just the authors are the same, that's weak evidence. As a logical statement, saying that same authors implies same citations is broken, but in Markov logic it's just capturing a correlation.
And notice something else important: if I figure out that, for example, those two citations I had from AAAI were the same because the authors and the titles were the same, then, since the citations are the same, the venues have to be the same too. So at this point I've inferred that "AAAI" and "National Conference on Artificial Intelligence" are the same thing. I've actually done something that you would think requires domain knowledge, but I've been able to figure it out indirectly, just from the data. And even better, this now becomes available for matching other citations: now that I've figured out that AAAI and the National Conference on Artificial Intelligence are the same, I can use that to match other things. So this rule is very short, but it actually does a lot of work.

And then these last two rules just implement transitive closure, and the beauty of it is that they're just statements of the logical axiom of transitivity. The only thing we did here was add the axiom: if C is the same as C' and C' is the same as C'', then C is the same as C''. You throw that in and the system does transitive closure. A few years ago there was a widely cited paper by McCallum and Wellner at NIPS where what they basically did was add transitivity to a CRF, and the paper was a tour de force, because they had to design new inference algorithms and new learning algorithms; they basically had to do everything from scratch to make it work, just to add transitive closure. In our case, adding transitive closure is just one more formula that you write. Of course it increases the complexity of the inference, but fundamentally, as a user, you don't have to do anything different.

Now, as an entity resolution model this is actually a pretty good one; as a segmentation model it's not so great. The reason it's not so great is that it's not very good at deciding exactly where the boundaries of the fields are. People in information extraction have noticed this, and what they usually do is write rules to predict where the boundaries are. But those are more like the rule-based approaches to information extraction, whereas CRFs and HMMs are more the statistics-based approach. So the question is: can we combine the two? Actually, we can combine them very easily. Here's a simple example of how you can do that in citation matching: there's a very good predictor of where a field ends, and that's the period; most fields are separated by periods. So all I need to do is modify this rule to say that consecutive tokens only want to be in the same field as long as the first of them is not a period; the period stops the propagation of fields being the same.

And here are some experimental results on the Cora database, which is the standard benchmark for these purposes. This is precision versus recall, and you want to be up there in the top right. The green mark is the model without the change I just made: that's just the logistic regression model, with no sequential component; this one is with the sequential correlation; and when you make that one modification I just showed, suddenly you get up there.
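Concretely, in the same Alchemy-style notation as above, that one-line modification might look like this; the quoted punctuation constant is an assumption about how the tokenizer represents periods:

    // A field propagates to the next position only if the current token
    // is not a period; the period blocks the propagation
    InField(i, +f, c) ^ Next(i, j) ^ !Token(".", i, c) => InField(j, +f, c)
    // (the same pattern with "," gives the comma rule discussed next;
    // the learned weights determine how strongly each separator blocks propagation)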
And if you do the same thing for the comma, which is often a field separator but a weaker one, you get even better results. So refining your MLN to take more information into account is usually quite easy: you just change your rules or add new rules. And here are some more results.

OK, let me just briefly conclude. If you look at other areas of computer science, there's a remarkably consistent pattern as to when the speed of progress picks up, and it's when people define what you might call an interface layer that separates the infrastructure from the applications. For example, in the case of the Internet, of networking, the interface layer is the Internet Protocol; the infrastructure is things like protocols and routers, and the applications are things like the web and email and whatnot. The power of this is that once you have this layer, the people doing applications above it don't need to worry about what's below, and likewise the people who work on what's below don't need to worry about what's above; they're just implementing this language. So the N-squared problem of every application communicating with every piece of infrastructure becomes an N problem. And again, in pretty much every field you see this: in databases the interface layer is the relational model; in programming languages it's high-level languages; in VLSI design it's VLSI modules; in operating systems it's virtual machines; in microprocessor design there's an interface layer as well; in HCI it's the graphical user interface.

So now we might ask ourselves: what is the interface layer for AI? In the early days of AI, some people would have said it's first-order logic, and they built expert system shells and whatnot. The problem, though, is that first-order logic by itself doesn't have everything you need in a good interface layer for AI, so it hasn't really taken off in that way; in particular, it can't handle uncertainty. On the other hand, some people would say that graphical models are in some sense such a layer: you can describe all probabilistic models using graphical models and then develop inference algorithms for them and so forth. But again, graphical models per se don't have everything you need in an interface layer for AI; in particular, they work with a very impoverished representation of the world, just a vector of variables; they can't talk about objects and relations and things like that. However, if we combine the two in something like Markov logic, or a similar language, now potentially you do have a viable interface layer for AI, because now you have combined in one language everything that you need. Now if you work, for example, in learning, you can develop learning algorithms, and as long as they use these APIs, they become available for all the applications and they will work with the other pieces without you having to worry about it. If you work on inference, maybe you have a better SAT solver; that can then work with the learning, even though you yourself don't have to know anything about the learning. And if you're on the application side: for example, in NLP today people build these NLP systems that basically have everything implemented in them, and they spend their entire time maintaining them instead of doing NLP research. With something like this, you are at least potentially able to focus on what you care about, which is the application.
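As a concrete illustration of that division of labor, here is roughly what the application side's workflow looks like with Alchemy's command-line tools. The file names here are hypothetical, and the flags follow the Alchemy 1.0 tutorial, so treat this as a sketch rather than a recipe:

    # Learn weights discriminatively (-d), treating the query predicates
    # as non-evidence (-ne), from a hand-labeled training database
    learnwts -d -i citations.mln -o citations-trained.mln -t train.db \
        -ne InField,SameField,SameCitation

    # Answer queries (-q) on new evidence with MC-SAT inference (-ms)
    infer -ms -i citations-trained.mln -r results.out -e test.db \
        -q InField,SameField,SameCitation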
And our goal with Alchemy is, of course, to support such an interface layer. So if you're someone on the application side, you can use Alchemy to produce MLNs; if you're somebody on the infrastructure side, you can produce algorithmic improvements, and as long as you use the Alchemy APIs, they then become available to the people doing applications. And ideally, and we're starting to see this, when you do an application, you take your Markov logic network and make it available; then others can start from your Markov logic network, pick and choose the parts they need, and build on top of that. And hopefully, in this way, we can actually make faster progress towards human-level AI, or something approaching it. Thank you.

This production is copyright the University of Edinburgh.
Info
Channel: The University of Edinburgh
Views: 11,210
Rating: 4.9444447 out of 5
Keywords: logical, statistical, AI, artificial, intelligence, learning, algorithms, logic, lecture, pedro, domingos, edinburgh, university, informatics, ventures, computerscience
Id: bW5DzNZgGxY
Length: 60min 20sec (3620 seconds)
Published: Thu Sep 17 2009