AGI 2011 - Probabilistic Programs: A New Language for AI

Captions
It's my pleasure to introduce our next tutorial speaker, Noah Goodman from Stanford University.

Thanks, Moshe, for the introduction and the invitation. Let's see. First of all, if I get too far away from this thing, wave at me.

So the big question that I'm attempting to address in my work is pretty similar to what everybody else here, I think, is addressing: what is thought, what is intelligence? It's based on the core principle from the computational theory of mind, which is that the mind is actually an information processing system. But this of course leaves open a really important free variable in that metaphor, which is: what kind of information processing system? And this has changed a lot through the decades. There was a time when the only information processing systems that we really knew about were pneumatic tubes and valves, and for a while that's what we thought the mind was like: pneumatic tubes and valves. Then at some point we invented these neat stored-program digital computers, and all of a sudden that's what the mind was like. And there have been some more levels in the middle.

What I'm going to try to describe to you today is a set of ideas that we've been working on in cognitive psychology and machine learning for how to think about the kind of information processor that's necessary in order to give rise to intelligence: what computational metaphor should we be using for intelligence? I'm going to do this from a slightly different point of view than, for instance, Ben just did, because I'm really a cognitive psychologist and an AI researcher. What I'm looking for is not the ability to start off making whole systems, whole cognitive architectures; I'm looking for principles, mathematical, formal principles that underlie intelligence, which I can apply in specific case studies and compare with data from human experiments. So that's where we're heading.

To motivate the approach, let me talk through a slightly revisionist history of the ideas in cognitive science over the last five decades or so. It's based on a couple of observations about thought. The first observation is: wow, thought is actually really useful, and it's useful in an incredibly uncertain world. That uncertainty can come in the form of noise, perceptual noise that gets in the way. You're driving down the 280 and there's rain all over the windshield and you don't really know what the sign says, but you don't just explode and drive into the ditch; you make do, you deal with that noise. Noise also comes in an actually more common and more interesting form, which is lack of information. Uncertainty can come in situations like this: imagine you call your best friend on the phone, and before he says anything or you say anything, he just yells at you and hangs up the phone. You're asking yourself, well, what happened? Maybe he wanted to hurt you, or maybe he thought you were a telemarketer. This is a case where nature gives you one piece of information, the action of your friend, and you're trying to recover two different things: his beliefs and his desires. Of course you don't have enough information to do that; this is an under-constrained inference problem. And this is exactly the kind of situation where we cope with uncertainty in the form of sparse information.

Okay, well, how do we explain the ability to cope with uncertainty? Well,
there's a principle that people have pointed to many times over the decades in cognitive science, which is probabilistic inference. Probabilistic inference is a calculus for reasoning under uncertainty: it lets you successfully reason under uncertainty. This has been behind a lot of research traditions; I'm just showing here some icons depicting connectionism, various kinds of exemplar and prototype models, various Bayesian approaches, and there are many others. Okay, so that's one corner.

In the other corner of the ring, we have the observation that thought is an incredibly productive system. As von Humboldt said of language, it is that which makes infinite use of finite means. This productivity shows up in a couple of different places. One is a very short-term thing. Imagine that you open up the newspaper one day and read this sentence: "a big green bear who loves chocolate." You don't have any trouble imagining such a thing, even though you've probably never thought of it before and probably never will again, unless you hear me give this talk again. Somehow we're able to put together existing ideas in order to productively form new ideas and new concepts on the fly. This kind of productivity also shows up at a much longer time scale, a time scale that's both developmental and in terms of the development of whole societies. For instance, as a society we've created the ideas of mass and momentum, or the idea of a nation, and every child eventually has to learn these concepts. There are an enormous number of concepts which are not plausibly innate, so they have to be learned somehow.

This observation, that thought is incredibly productive, also has a response, a mathematical principle that people have used to explain it, which is compositional representation. If you think that thoughts are built up combinatorially out of a small set of simple pieces, then you can explain the apparently unbounded variety of different thoughts that we can construct and manipulate. This, of course, also has a long history; I'm just throwing up a few: Alonzo Church and Alan Turing, generative grammar, lots of different logical approaches in AI where this was really important. And Jerry Fodor philosophically summed this up in his book The Language of Thought.

Now, these are usually seen as competing principles. In psychology there are always these debates about whether it's rules or statistics, for instance, and I'm going to argue that they actually shouldn't be seen as competing. But there's one more ingredient that I have to add to this discussion, which is a question about, in some sense, what kind of knowledge we should be representing when we're representing knowledge. Imagine that you get a bunch of examples of red circles and blue circles, and they have only one property, which I'm drawing as the position along this line. There are two different ways you can try to capture knowledge about this. One is to capture knowledge about the specific task: if I see a position on the line and have to label it as a red circle or a blue circle, how should I do that? Maybe all you want is a decision boundary: to the left are the red ones, to the right are the blue ones. The other way is to represent generative knowledge about how the circles got there: something like the distribution of the red circles and the
distribution of the blue circles. You can think of this, in some sense, as two different causal directions: either the arrows point from the input towards the output, from perception towards motor, or you flip the arrows around and represent them as pointing towards perception. You say: this is knowledge that describes how I think the world gives rise to the observations that I see, and then I can use it to answer questions. That latter idea is the idea of generative models. The really key thing about generative models is that they're a kind of knowledge that lets you answer many, many different questions. So maybe I should have a third little box up here, for the flexibility of thought, the fact that we're not tied to a specific task. And this is the thing that I think really differs between standard, run-of-the-mill AI and machine learning on the one hand, and artificial general intelligence, or just actual human intelligence, on the other: this kind of task-agnostic flexibility.

Okay, so my research is basically about putting together these three principles: probabilistic inference, compositional representation, and generative models. I give it the tagline "the probabilistic language of thought hypothesis," partly because this makes psychologists kind of irate, which is fun. So what is that? The probabilistic language of thought hypothesis, in informal terms, is basically this set of assertions: that mental representations are in fact compositional, but their meaning is probabilistic, not truth-functional or logical; that they are used to encode generative knowledge, as opposed to discriminative knowledge; and that because of all this, they support thinking by probabilistic inference and can be learned by probabilistic inference.

Now, you should be holding on to your wallets and asking me: can this actually be formalized? Is this better than just a set of big philosophical principles? That's also the question that I've asked myself a lot. So here's what I want to do in the rest of the tutorial: the first two sections are really a development of the formal ideas that go into this probabilistic language of thought idea, and the last two sections are a series of applications to human cognition and modeling psychology experiments. And the first bit you just heard; that was the introduction.

Okay. Rather than jump into the full-on formal system, I want to motivate this from the ground up, both because I think it's incredibly illustrative of why we want to go where we're going, and also to make sure we're all on the same page. So let's start with logic. Logic, we all know, is a great way to represent things that we believe about the world, and it's a very simple way. In logic we can start with a truth table: there are a bunch of different propositions, basic things about the world, like whether somebody is coughing or sneezing or has the flu or has TB, and for every combination of those we assign a possibility judgment: is this a possible world or not? And that's great; we've captured our beliefs about which worlds are possible, which combinations of these things are possible. And we can do inference, we can reason. The basic way to think about inference in this truth-table version of logic is to imagine crossing out rows. What do we do if we want to know, given that somebody is coughing
but doesn't have the flu, whether they have TB? Well, we would cross out all the rows where they're not coughing, we would cross out all the rows where they do have the flu, and we would look at what's left. And we would say: hey, in every possible world that's left, the person has TB. So we conclude: okay, the person has TB. This is a really simple way to describe how to do reasoning, how to do inference. We could say, okay, we're done, let's go home.

There are a couple of problems, of course, with using this kind of representation of knowledge. The basic problem is that the size of the truth table is exponential in the basic description of the world, in the set of different propositions, and a corollary of that is that this inference by crossing out rows is going to become ridiculously impractical. But don't forget about it: I think this idea of crossing out the worlds that you don't think are possible is a really key idea. We want to find ways to make it more efficient and more useful, but it's going to be the key idea behind reasoning in general, in almost all of the logical systems that we're going to talk about and probably that you're familiar with.
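A minimal sketch of that crossing-out-rows picture, in plain Scheme (the Lisp family the talk builds on later); the encoding, the helper names, and the "cough if and only if flu or TB" theory are my reconstruction of the slide, not a quote of it:

(define all-worlds                       ; all assignments to (cough flu tb)
  (let loop ((n 3))
    (if (= n 0)
        '(())
        (let ((rest (loop (- n 1))))
          (append (map (lambda (w) (cons #t w)) rest)
                  (map (lambda (w) (cons #f w)) rest))))))

(define (possible? w)                    ; the theory: cough iff (flu or tb)
  (let ((cough (car w)) (flu (cadr w)) (tb (caddr w)))
    (eq? cough (or flu tb))))

(define (keep pred lst)                  ; filter, written out for portability
  (cond ((null? lst) '())
        ((pred (car lst)) (cons (car lst) (keep pred (cdr lst))))
        (else (keep pred (cdr lst)))))

(define (all? pred lst)
  (or (null? lst)
      (and (pred (car lst)) (all? pred (cdr lst)))))

; Premises: coughing, but no flu. Cross out impossible and premise-violating
; rows, then check the conclusion (TB) in every surviving world.
(all? (lambda (w) (caddr w))
      (keep (lambda (w) (and (possible? w) (car w) (not (cadr w))))
            all-worlds))
; => #t  (the only surviving world is (#t #f #t): coughing, no flu, TB)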
So the solution that people found, in propositional logic, is to use a formal language to more compactly represent these tables of truth values. The basic idea is: let's define operators which are represented by conditional truth tables. It's no longer just the one big truth table, but the truth table of, for instance, A or B, conditioned on A and, separately, B; this is the kind of truth table you usually see. Then we describe the possible worlds by composing the operators. For example, I could take that big truth table and just write it as "cough if and only if flu or TB," and you can probably see that the thing on the right is a lot more compact than the thing on the left; in many cases it's no longer exponential. So the key idea here was to make a language by compositionality, and it's compositionality of these conditional truth tables. The representations are often, though not always, much more compact. There are new inference techniques that we can develop based on this idea, such as the proof rules that you probably all know and love. And we get a much more intuitive specification in some sense: those conditional operators match the way that we think about logic much more than the giant truth tables do, although maybe that's because of our education.

You might wonder whether this really has anything to do with mental representation in humans. It's a good question. Interestingly, the place this whole set of ideas was actually introduced, Boole's 1854 book, was called The Laws of Thought (and then it had a much longer name), so he was really thinking about this stuff as representing how people think, as mental representations and reasoning. Interestingly, he also had some chapters at the end about probability, so he was way ahead of me in many ways.

You might ask whether this is enough. We all know that it's not; we don't stop there when we're doing logic, because propositional logic isn't compositional enough. It's still not as compact as we intuitively think it should be. Why? Because we have to fix the set of variables in advance, and because of that we miss a lot of opportunities for generalization and compactness. So the key idea for going beyond propositional logic was: compositionality can still work even if the things we're composing are truth functions, not truth values themselves. In the conditional truth tables, what we're composing is really just truth: true and true compose via "or" to give you true, and so on. The key idea was to do this at a finer-grained level. Instead of just having boolean operators from truth to truth, let's introduce things like predicates, which are functions from objects to truth values, that we can compose with objects; or, in higher-order logic, let's introduce higher-order functions. You all know this, I'm sure, but as a quick recap: in first-order logic we introduce predicates, functions from objects to truth values, and things like quantifiers, higher-order functions that assign objects to object variables. And all of a sudden something amazing happens: we get first-order logic, which is vastly more compressive than propositional logic was. Stuart Russell did this beautiful calculation where he wrote down, or maybe estimated, the rules of chess: they come to about one page of first-order logic and over a thousand pages of propositional logic. So by this finer-grained compositionality you get a big win in expressiveness. We can go further: first-order logic doesn't have as fine-grained compositionality as we might want, so instead we can allow functions of higher type, functions on functions and so on, and the banner-carrier logic of this is lambda calculus. I'm going to come back to lambda calculus in a few minutes, so I won't describe it just yet.

Okay, so that was a whirlwind description of logic, and a motivation for the kind of logic we ultimately want. Now what I want to say is: that's great, but we know that deterministic logic alone has failings. It's very brittle, it doesn't cope with uncertainty, and so on. So what happens if we run the same story on the probability side? Well, it starts out exactly the same; there's a very elegant correspondence. Instead of the possibility of worlds, we represent our degree of belief in each world, in other words a distribution. We can represent that by a big table again, but the big table, instead of having a yes or no for each world, has a real number between 0 and 1, such that the sum over the whole thing adds up to 1. It's a probability distribution. And reasoning is described exactly the same way as for deterministic logic. If we want to compute a conditional, something like "given that somebody's coughing, do they have the flu?", or more abstractly the probability of A given B, we can just think about that as crossing out rows again. Except we can't just cross out rows anymore, because we have the restriction that the probabilities sum to 1; so we have to first cross out rows and then renormalize. It turns out that's exactly equivalent to the usual mathematical definition of conditional probability: you can run the arithmetic to show that the usual definition follows from crossing out rows. And in general, the usual axiomatic laws of probability really are just: you have a table that sums to one, and you cross out rows.
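The same move in code: a sketch (my notation, not the slide's) where a distribution is a list of (world . probability) pairs, and conditioning is literally crossing out rows and renormalizing:

(define (keep pred lst)                  ; filter, written out for portability
  (cond ((null? lst) '())
        ((pred (car lst)) (cons (car lst) (keep pred (cdr lst))))
        (else (keep pred (cdr lst)))))

(define (condition dist b?)              ; dist: list of (world . probability)
  (let* ((rows (keep (lambda (r) (b? (car r))) dist))  ; cross out rows where B fails
         (z    (apply + (map cdr rows))))              ; what's left sums to P(B)
    (map (lambda (r) (cons (car r) (/ (cdr r) z)))     ; renormalize
         rows)))

; two fair coins, conditioned on at least one heads:
(condition '(((h h) . 1/4) ((h t) . 1/4) ((t h) . 1/4) ((t t) . 1/4))
           (lambda (w) (or (eq? (car w) 'h) (eq? (cadr w) 'h))))
; => (((h h) . 1/3) ((h t) . 1/3) ((t h) . 1/3))

Dividing by z is exactly the usual definition P(A | B) = P(A and B) / P(B) falling out of crossing out rows.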
And that's great, because probability is, with some argument, the right way to reason under uncertainty. There are arguments like the Dutch book arguments, or Cox's theorem, that say that, given the right assumptions, if you want to reason under uncertainty you should use the probability algebra. All of the cases where there's some discussion about this are really questions about whether you want a refinement of the probability algebra, not whether you should toss it out and use something completely different. There are also a whole lot of really appealing effects that come from using the probability algebra, and I'll touch on these a little as we go along. The biggest one, from an AI point of view, might be that non-monotonic reasoning comes for free; we'll talk in a minute about explaining away, which is a main example of that. You also get things falling out like Occam's razor (there's this thing called the Bayesian Occam's razor), hierarchical learning-to-learn effects, and a lot of good things.

Seems great. The problem, of course, is exactly the same one we saw for deterministic logic: the table of probabilities grows exponentially in the number of variables in the world, the variables being those little propositional things. That leads us to ask whether there are compositional languages for probability. Well, let's play the same game. For logic, in order to get from big tables to propositional logic, we started with our set of basic propositions, in this case variables, and then made the move to conditional tables, and through that to a compositional language. So we're going to do the same thing. The basic goal is to take a big joint probability distribution table like this one and find a more compact way to represent it, and the basic move is the same: we start by switching from the big joint distribution to conditional distributions, and then we see how we can use those to compose. The basic observation is that, because of the definition, or lemma, that gives an algebraic form for the conditional probability (the thing I'm writing at the top), we can always factor a joint distribution into a big product of conditional distributions. If you expand it out, you cancel everything, and it's a very simple identity, usually called the product rule for probabilities. So far this is a different way to represent the same probability, but no more compact: if you count up the number of numbers we need, it's exactly the same. Except that sometimes it turns out it can be more compact. We say that one variable is independent of another, x2 is independent of x1, if the conditional probability of x2 given x1 just turns into an unconditional probability: if wiggling x1 doesn't tell you anything about x2, if the probability of x2 is the same for every value of x1. Given that, we can simplify these tables quite a bit. For instance, if we realize that the probability of sneezing depends on flu but not TB, then all of a sudden we don't need the eight numbers, or whatever it was before; we just need two numbers. So we can exploit this conditional independence, and this idea of conditional factoring, to get a much more compact representation.
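Written out, the two steps are the standard product rule and then the independence simplification (standard identities, not the slide's exact notation):

  P(x1, ..., xn) = P(x1) P(x2 | x1) P(x3 | x1, x2) ... P(xn | x1, ..., x(n-1))

and, if sneezing is independent of TB given flu,

  P(sneeze | flu, tb) = P(sneeze | flu)

so the table for sneezing needs only two numbers, P(sneeze | flu) and P(sneeze | no flu), instead of one for each of the four combinations of flu and TB.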
Okay, so this is the basic idea behind Bayes nets, and I'm sure a lot of you are familiar with it; this is basically Pearl's 1988 book, and things before and after that by other people. The basic idea is that to specify a distribution, and in particular to specify the independence structure, we write down a directed graph on the variables, and a conditional probability table for each variable. So here's a set of variables for a little medical diagnosis case: tuberculosis, flu, coughing, sneezing. We write down a graph that's supposed to specify which variables actually depend on which other ones, and then, to fill it in, we add a conditional probability table for each variable. An interesting side note is that the names and the graph are redundant: if you have the names in the conditional probability tables, you can recover the graph, and if you have the arrows, you don't actually need the names. The arrows are really a binding mechanism in Bayes nets.

Here's an aside for those of you who have heard the term "Bayesian" floating around a lot and are now thoroughly confused; probably most of you here are on board with this. A Bayes net is not the same thing as Bayesian probability. A Bayes net is just a particular representation for a distribution, one that's particularly compact. Bayesian statistics, or Bayesian inference, is a set of rules for doing inference on a given distribution, on a given model; and in particular Bayes' rule, the sloganeering kind of thing you see written down for Bayesian statistics, is just one particularly useful, common pattern when you're doing Bayesian inference. So those are two separate things: Bayes nets and Bayesian statistics.

Okay, so I've argued that Bayes nets are a more compact way of writing down a distribution on possible worlds. It turns out that compactness also gives us gains in the algorithms available for inference. Instead of just crossing out rows and summing up, we can play games pushing the summations through the products, which I'm showing you a quick version of here. If we count up how many values we have to consider, we find that by exploiting the independence relations we go from having to sum over, say, eight values in this particular case to summing over two and then two values, which is somewhat less work. And of course the point is that if this Bayes net got really big, if the world got really big, you would start to get really big wins: you would reduce exponential work down to, optimistically, non-exponential work, although sometimes it's still exponential.
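The sum-pushing move, written out for this little net (a sketch of the bookkeeping, not the slide's exact arithmetic):

  P(cough) = Σ_flu Σ_tb P(flu) P(tb) P(cough | flu, tb)
           = Σ_flu P(flu) [ Σ_tb P(tb) P(cough | flu, tb) ]

The factored form touches two values in the inner sum and two in the outer sum, instead of summing over all eight rows of the joint table.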
Let me give you a quick example of this in practice, with the medical Bayes net again. In this Bayes net, TB and flu are independent of each other: there's no arrow from one to the other, which means that a priori they're independent. What's interesting is that they become dependent if we make the right kinds of observations. We can compute the probability of tuberculosis a priori: it's ten percent. Then we can ask: what if we observe that a patient has the flu? We compute the probability of tuberculosis: still ten percent; those are independent. But what if we observe that the patient is coughing? Well, if the patient is coughing, it's now much more likely that they have tuberculosis; that's a symptom of tuberculosis. So we're using the probability algebra to reason backwards. Again, this is one of these generative models: we have a model that points towards the observable things, instead of from the observable things to the answer, and we're using the probability algebra to turn it around and get the answer we need. Now, the interesting thing is what happens if we observe both that the patient is coughing and that they have the flu: the probability of tuberculosis drops back down. This is called explaining away, and it was really the thread that Pearl pulled on to convince the world that probabilities are actually useful, because what we're seeing is non-monotonic reasoning: we're adding additional premises, and first our degree of belief goes up, and then our degree of belief goes back down. That's something that can't happen in classical logic, and to convince yourself of that, all you have to think about is what happens when you cross out rows: once you've crossed out rows, you can't get them back, so you can only narrow and narrow the set of possible conclusions; you can never increase it again. In the probability algebra, because you have to renormalize, something nonlinear starts to happen, which can make the degree of belief go back up. It's a very simple, very elegant, nice thing to have fall out.

There are a lot of other algorithms you can use for computing these conditional probabilities. I showed you a simple case of the sum-product algorithm, but there's a huge literature and a huge amount of effort looking at dynamic programming, message passing, variational techniques, and, recently, especially Monte Carlo techniques. You can also extend this formalism to non-boolean variables. You might be a little worried that all we can deal with here is boolean variables: does he have flu or does he not? Nothing really changes, except that it becomes much harder to write down the conditional probability tables, and sometimes impossible, for instance if the variables we want are continuous: you can't write an uncountably infinitely large table. There's a solution to this, the statistician's notation, where we say that the variable X is a sample from the distribution D, depending on the parents of X. For example, here's a little case that I made up, which I think has no scientific content but gives you the idea: maybe somebody's temperature depends not only on whether they have the flu but on their body weight, which in turn depends on their diet. These are continuous variables with continuous dependencies, and we capture all of that using the statistician's notation. The key point here is not the details, but that we again have a compositional language that we can use to describe complicated distributions. What this really exposes is that we're taking these stochastic functions D, and the graph is describing how to compose them together, based on the parents of the different variables. So a graphical model, a Bayes net, really defines a composition of stochastic functions. Now, here's a foreshadowing of what I'm going to say in a few slides: you might be asking yourself what language the stochastic functions are written in. It's not a Bayes net: you didn't see any little Bayes nets inside that Bayes net I showed you. So maybe that's something to worry about.
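To foreshadow the answer: that made-up temperature example can be written directly as composed stochastic functions, here in the Scheme-based notation introduced later in the talk. Every number is an invented placeholder, and I'm assuming a gaussian primitive whose arguments are a mean and a spread; this is a sketch of the statistician's notation X ~ D(parents), not the slide's content:

(define diet (if (flip 0.5) 'healthy 'junk))            ; a random root variable
(define weight                                          ; weight ~ D(diet)
  (gaussian (if (eq? diet 'healthy) 70 90) 10))
(define flu (flip 0.1))
(define temperature                                     ; temperature ~ D(flu, weight)
  (gaussian (+ 36.5 (if flu 1.5 0) (* 0.01 weight)) 0.5))

Each "X ~ D(parents)" line becomes one definition whose right-hand side samples from a stochastic function of the parents.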
But overall, these directed graphical models ("directed graphical model" is the other term for Bayes net) are great. They give you a very compact representation for distributions. They capture, in some sense, a causal-process intuition about how our observations got to be there in the world, which we can then turn around, using inference techniques, some of them reasonably efficient, to make inferences about what must be true of the world given our observations. And as a modeler, a cognitive modeler or an AI modeler, this representation also makes it easier to just think about what you're modeling, which is another payoff.

Well, aren't we done? What we have so far is basically a probabilistic propositional logic. Notice that the entire story was parallel to what happened in propositional logic: we started with big tables, we factored them with conditional probabilities, and then we had networks that describe compositions of conditional probabilities. So this really is the probabilistic version of propositional logic, and of course we're not done, for exactly the same reasons. We had to start with a fixed set of variables. We had to hide or ignore the structure within the variables, and the structure within the factors, within the CPTs. And finally, something that is particularly important to me: there was no abstraction mechanism, no way within the language to make a new stochastic function. Let me show you, on three quick slides, examples of these failings for Bayes nets, for graphical models.

The first: I was going fast and loose talking about "the patient," but of course Bob having TB doesn't mean that Alice has TB. I need a different TB variable for each of these people, and one thing I could do is just make infinitely many copies of that basic TB-causes-coughing net. That's not so compact, so statisticians and AI people have invented the plate notation, which is basically a poor man's universal quantifier: it says duplicate what's inside the plate as many times as you need for the set that you're quantifying over, in this case, for every person, make a copy of that little bit. But of course there's a question: what if you don't know the set ahead of time? Then this is going to fall apart.

Here's another example: whether I have TB changes over time. Whether I have TB now is going to affect whether I'm coughing later, and whether I have TB later, and so on and so forth. That's not represented by the simple static Bayes net on the left. I can make up for that: I can make what's called a dynamic Bayes net, where I make lots of copies of the Bayes net, one for each time step, and have arrows pointing forward in time. That's fine, and it's a lot like a plate, but it's not a plate; it's another thing we have to add to the language to capture this basic idea. And that seems kind of frustrating: why can't we have a common notation, a common language, that covers both plates and these DBNs? They're so similar. An even more troubling case: if you sat down to think about what the Bayes net would look like for a probabilistic context-free grammar, a very standard probabilistic object in natural language, you would discover that you end up writing down a Bayes net that is infinite and has loops in it. It's not actually a Bayes net,
and basically that's a consequence of the recursion, of the fact that grammars can generate structures of unbounded size. So Bayes nets, graphical models, really just are not useful at all for representing models like a PCFG.

Yet another thing to worry about is context-specific independence. If you look carefully at the distribution on worlds that I wrote down on the right, you might discover that there's actually a lot of structure in that distribution: namely, if A is true, then D doesn't depend on B or C. But you can't exploit that structure in a graphical model, because it's not true unconditionally; it's only true when A has a particular value, so you can't do that factorization. As a result, that additional structure, which we might like to represent, gets hidden inside the conditional probability table when we write down a Bayes net for this model. Another way of saying this is that the stochastic functions, the conditional probabilities that we're composing together, are black boxes: there's stuff hidden in there, and we'd like to open them up and peer inside.

A final example, and then I'll resolve this tension and tell you what we can do about it: what happens when the domain of a variable is actually infinite, so that conditional probability tables aren't an option? A really good example of this is the one from the beginning, your friend hanging up the phone on you. I had this little thing that looked like a Bayes net relating beliefs and desires to actions, and in some sense that's totally valid: actions depend on beliefs and desires. The problem is that there are infinitely many different beliefs and desires and actions you could have, so to write this as a conditional probability table we'd need an infinite by infinite by infinite table. That doesn't seem like at all a good way to write down what we know about how people behave in the world. So the question is how we can write down this kind of dependence, and also the dependencies for continuous variables, in a way that doesn't have to be written in some other language, something that's not graphical-model-ese, but can be written in the same language. How can we open up these black boxes?

Okay. So far, what I've been telling you about is really a systematic set of ideas that motivate the move from very simple representations, for either logic or probability, to increasingly compositional ones, and on the probability side we got all the way up to a probabilistic propositional logic, Bayes nets. Then I tried to convince you that this really isn't good enough. The answer shouldn't surprise you: we need a more fine-grained compositionality in our probabilistic models. So how are we going to do that? How are we going to create a language for probabilities that's more fine-grained than Bayes nets? The set of ideas is really inspired by the different ideas for how to do logic. We can certainly do something built on first-order logic, and this, it turns out, leads to undirected methods; one of the really good examples is Markov logic, which Pedro Domingos's group at Washington has been working on for a number of years. The problem for me with things like Markov logic is that they're fundamentally undirected, and so they don't capture this idea of a causal relation, of
capturing causal structure in the world. And what we've been discovering recently in cognitive psychology, in Bayesian cognitive psychology, is that you kind of need this: people seem to endow the world with a lot of causal structure, and we need something like a directed model to capture that. I won't justify that too carefully, except that you're going to see examples towards the end where we're really using the directedness of the model.

So the other option is directed methods, and the basic idea we're going to use is to exploit the control flow from higher-order logic, in other words from a programming language, to structure our distributions. The basic idea here is that programs, first of all, are logic, and second of all, they're the kind of logic that gives you a sequence of causal instructions: they tell you, do this thing, and then do this thing; in other words, doing the second thing causally depends on having done the first thing. We're going to exploit that to build probabilistic models. Now, you can do this for imperative languages or functional languages. The trade-off is that for imperative languages the syntax is much more familiar to a lot of people, but the logical basis, the formal basis, gets obscured in that kind of representation, and in particular the causality gets all messed up: it gets smushed together with the flow of time. We don't want to do that. So the other option is to build on a functional language, which is what we're going to do, and in addition to giving you a really clear foundation, this also brings a lot of powerful classic AI ideas to bear, since classic AI was largely written in Lisp and built on those ideas. Okay, so that's where we're going. Any questions while I... yeah?

So the question was: I was claiming here that first-order logic gives rise to undirected probabilistic models, and that's the case for Alchemy, for Markov logic, but is that necessarily the case? It's a good point; the story is more complicated than this. If you build on Horn clauses, as in Prolog, then you can get something that's more directed, but you have to be very careful with the semantics of how you add the probabilities; the obvious ways to add probability still end up giving you undirected models. There are some examples of things that are essentially first-order and based on Herbrand semantics; I think Luc De Raedt's current favorite version of this is called ProbLog, and it does something a lot like that. It's a hazy boundary between the two, and I'll try to show you a few examples of why it's hazy as we go along. But that's my basic motivation: to get that kind of directedness as an immediate thing. There are also some issues that have to do with being able to easily express hierarchical models, which I'll point out as we go along.

Okay, so I'm going to build on functional programming, and in particular on lambda calculus. I'm just going to give you a very brief, intuitive reminder of what lambda calculus is. First of all, I'm going to use Lisp notation, so the parentheses go on the "wrong" side, (sine x) instead of sine(x), and the operators always go at the beginning, (+ x y); don't let that throw you. Now, lambda calculus is the simplest idea in the world; it's beautifully simple. It basically just says: let's have a language that can define new functions, and then apply functions to things.
You define new functions by using the lambda operator. This expression, (lambda (x) (+ x x)), says: the lambda x part means I'm making a new function, and what goes in is called x; and then (+ x x) is what I do with x once I get it: I add it to itself. A new function that adds the input to itself. I'm going to allow myself define statements, even though you don't need them as a special addition, because they make things clearer: here I'm saying define double to be the function that adds x to itself, and then I can use it; I can apply double to 3, and if I do that I get 6. I can also do this for higher-order functions, which is where the power of lambda calculus comes in. For instance, I can define repeat, which takes in a function f and returns a function that takes in x and does f twice, repeats f. I can then do things like (repeat double): I get a new function, I apply it to 3, and I end up with 12. Or if I already had a derivative operator, I could repeat the derivative operator and get a second-derivative operator. It's this ability to really easily make and manipulate higher-order functions that makes lambda calculus so powerful. Indeed, it's universal: as far as we know, there's no other notion of computation which is more powerful than the lambda calculus. So this is all we need, which is, if you ask me, shocking: this is so simple, how could it be everything we need? But there you go.

A little more formally, just to have it up there: lambda calculus has a syntax with three kinds of constructions, a set of variables, lambda statements, and application statements, and it has a semantics that parallels that syntax, with three kinds of reduction. First, we can change the names of variables. Second, there's beta reduction, or substitution, where when we apply a function we get the body with the argument substituted for the variable. I'm not giving you all the definitions of the notation; I'm just putting it up there for completeness' sake. In what happens next, I'm going to assume applicative order, so the arguments are always reduced before the function. That's not important for the classical lambda calculus (the Church-Rosser theorem says it's not important), but it becomes very important when we make this into a stochastic calculus, which we do next.
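The examples from a moment ago, written out as runnable Lisp/Scheme:

(define double (lambda (x) (+ x x)))   ; a new function: add the input to itself
(double 3)                             ; => 6

(define repeat                         ; higher-order: takes a function f...
  (lambda (f)
    (lambda (x) (f (f x)))))           ; ...returns a function that applies f twice
((repeat double) 3)                    ; => 12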
Again, the basic idea here is really simplicity itself. How can we use this incredibly compositional system to describe probabilities? We make a new calculus, which is just a slight tweak: the stochastic lambda calculus. Basically, we introduce a new operator, the random choice operator, which takes two expressions, along with a reduction rule that says: when you get to a random choice between M and N, with 50% probability you reduce to M, and otherwise you reduce to N. And that's it. Now think about what that does: as we're doing these function applications and substitutions, occasionally we're going to get to a random choice and make that choice one way or the other, and as a result there's going to be a random execution, a random reduction, that happens when we're evaluating these expressions. In other words, what we get is a way of specifying how to sample an ultimate return value from an expression. It's not terribly hard to see that this induces a distribution, and I'll give you an example on the next slide; it's just the distribution on return values that you get by running the sampling process a lot of times.

So this stochastic lambda calculus is the formal basis I'm going to work with. To make it a kind of useful modeling language, we do the same move that happened to make Lisp out of lambda calculus: we add some additional primitive operators and data types, so that we don't have to encode everything in Church encodings, which is hard, and we add additional primitive operations, not just fair choice but things like a Gaussian distribution, although it turns out you don't actually need to add those: mathematically, all you need to get a universal system is fair choice. The language we end up with we call Church, after Alonzo Church, to distinguish it from the formal basis, the stochastic lambda calculus. I guess this is the same move that Ben made; it seems to be pretty useful to distinguish the set of ideas from the language from the implementation.

Okay, so let me give you this example. Basically, we've taken a simple expression, really Lisp or lambda calculus, and we've added an operator for flipping a coin, which flips a coin with probability 0.3. If we run it once, we'll get something like heads, tails, heads; we add those up, (+ a b c), and we get 2. We can do this again, we can do it a lot, and then we can make a histogram, and we're going to get something like this. So on the left, what you're seeing is instructions for sampling a value, and on the right, a distribution over the return values. Now, the really important and deep thing is that that relationship turns out to be complete. There's a theorem (I've only written down the interesting half here) that says that any computable distribution can be represented by a Church expression, and, the other way, any Church expression that halts almost always represents a probability distribution. This is really, really cool, because it says we have two different notions of the semantics for probabilistic representations: on the left a sampling-based semantics, and on the right a distribution semantics, and these are equivalent. We can work in whichever one we choose to work in.
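That coin-flipping example as Church code (the variable names are my guess at the slide's):

(define a (if (flip 0.3) 1 0))   ; heads with probability 0.3, counted as 1
(define b (if (flip 0.3) 1 0))
(define c (if (flip 0.3) 1 0))
(+ a b c)                        ; one run might return 2 (heads, tails, heads);
                                 ; the histogram of many runs is the distribution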
Yeah, question? Yeah, absolutely. The question was whether there are multiple ways to make the universal distributions; actually, the question could have been two different things, and the answer would have been yes to both. One is that this is not a one-to-one relationship in the theorem: there are many ways to represent the same distribution in the language. That's true. The other question was whether we could play the same game with other random operators, and with other process calculi as the basis, and absolutely we could. The important thing is to have a very fine-grained compositional system that we augment with randomness, and then essentially the same proofs go through to show that it's universal, with a few caveats.

For me, the compositionality that we're after is much more straightforward viewed in the sampling semantics than in the distribution semantics. Why? Because in the sampling semantics, compositionality is really just function composition: we sample a value from one function, we feed it by function composition into the next function, and away we go. You can write down the composition laws in terms of the distributions, and you'll get them, but the problem is that they involve integrals over those in-between values, so you actually have basically an infinite set of different composition laws that you're playing with. That's a big part of the motivation for working in the sampling representation when we're actually trying to build models.

Okay, we're not quite done as far as expressing models, because we don't want to just write down distributions; we also want to be able to ask questions of those distributions. We want to be able to do inference. So in Church we add an interface for inference, which we call query. Basically, a query expression looks like: a set of definitions that give you a model; an expression that you want the value of, called the query expression; and then a conditioning expression, which must be true. So this example is basically the conditional distribution on a + b + c, conditioned on a + b being equal to 1. Of course we can do the same thing, take samples and look at the distribution, and we get something that never gives you 0 or 3, because those can't satisfy the condition.

It turns out that we don't actually have to add query formally as an additional primitive operation; that's why it didn't show up in the stochastic lambda calculus definition I gave you. We can actually define it within the language. The definitional way to see it is to think of the query notation I just gave you as sugar: it desugars into something where we have a distribution to sample from and a predicate that says whether or not a sample is okay; or, the way I've written it here, it desugars into a pair of the query value and the condition value. Then we can define query, the inference operator, by a function within the language which is a rejection sampler. This basically says: you sample from the desugared procedure, you check whether the condition is true; if so, you return the query value, and if not, you try again. If you work through all the math, you discover that this is exactly equivalent to specifying the conditional distribution that you think it should be. And I think it's really important that you can define it within the language; you don't have to add anything extra to get this in stochastic lambda calculus. That's a stark contrast to Bayes nets: the idea of defining a conditional probability as a Bayes net is kind of unthinkable, or at least weird.
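A sketch of that definition inside the language, using Church's flip primitive; the helper name is mine (chosen to avoid clashing with Church's built-in query forms), and the desugared model is a thunk returning a (query-value . condition) pair:

(define (query-by-rejection model)       ; model: () -> (value . condition)
  (let ((sample (model)))
    (if (cdr sample)                     ; is the condition true?
        (car sample)                     ; yes: return the query value
        (query-by-rejection model))))    ; no: throw it away and try again

; the a+b+c example, desugared into such a thunk:
(query-by-rejection
  (lambda ()
    (let* ((a (if (flip 0.5) 1 0))
           (b (if (flip 0.5) 1 0))
           (c (if (flip 0.5) 1 0)))
      (cons (+ a b c)                    ; query expression
            (= (+ a b) 1)))))            ; condition: a + b = 1

Since the condition forces exactly one of a and b to be 1, the returned sum is always 1 or 2, never 0 or 3, matching the histogram described above.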
However, you're saying to yourselves, or trying to say to me, except you're very polite: that's ridiculously inefficient; you don't ever, or almost ever, actually want to do rejection sampling. And you're absolutely right, so let me make two points about this. The first is that we want universal inference algorithms for this sort of representation: ways to implement query which are always correct in the long run, but make different approximations in the short run, and are hopefully reasonably efficient for a wide class of different programs, of different expressions. Why do we want that? Well, as a modeler this would be fantastically useful: when we're doing our little, at least narrow-AI kinds of projects, we can just write down the model in this language, wrap it in a query that says what we've observed, press go, and we don't have to spend the three or four months implementing an inference algorithm that we usually do. As a cognitive scientist, and as a big-view AI and AGI sort of person, I think this is also really important, because... let me say this a different way: it would be a very bad thing if for every different mental representation that we ascribe to the mind, we also had to posit a different inference algorithm that works over it, because every different task would give you a different little representation, and the set of different algorithms we'd be positing just explodes. I'm not at all against having a collection of different algorithms, but altogether they have to have coverage wide enough to do all the things that people do, and as far as we know, what people do is unbounded; not particularly well in most cases, but I'll challenge you to tell me the thing that you can't think about.

Okay, so rejection sampling as a definition is great, but as an actual implementation it's terrible: it's incredibly slow. It turns out we have other options, though. One of the big technical efforts in developing probabilistic programming languages is figuring out how to create these universal inference algorithms, usually by adopting and adapting things that already existed for specific probabilistic models in the UAI/NIPS sort of community. For Church, we have at the moment both an MCMC, Metropolis-Hastings based algorithm and a dynamic programming based algorithm, and each of these is universal in principle and particularly good for different classes of problems. I'm not going to say much today about how these are implemented, for time reasons and because the details get quite nitty-gritty, but we have papers on this and I can point you to them.

Okay, to tie this back to something concrete, let's look at our little medical diagnosis example again. In Church we're going to represent it as something like this (there are many ways to write it): flu is a random choice, TB is a random choice, and coughing is a conditional that depends on whether the patient has flu or TB: a high probability if so, a low probability otherwise. The thing to note about this is that it's really equivalent to that little Bayes net, but it has all of the conditional probability table information already inside it; we don't need to add extra tables. If we want to do something like infer the chance of flu given coughing, we write it with a query: there's the model, tell me about flu given coughing; we get some reasonably high probability. If we now know the patient is coughing but also has TB, the probability of flu goes down. That's explaining away again. Going a little beyond Bayes nets, we can do something like what I showed you in plate notation by extending those functions to be functions that take an argument, the person: flu is a function from the person to whether they have the flu, TB is a function from the person to whether they have TB, and so on and so forth. And all of a sudden what we have is a parameterized Bayes net, exactly the kind of thing we had from a plate, parameterized by the particular person we're asking about.
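In Church, that model and query look something like this; the talk gives P(TB) = 0.1, and the other numbers here are placeholders, not the slide's:

(query
 (define flu (flip 0.2))                          ; prior on flu (placeholder)
 (define tb  (flip 0.1))                          ; prior on TB, 10% as above
 (define cough (flip (if (or flu tb) 0.9 0.05)))  ; high probability if flu or TB
 flu                                              ; query expression
 (and cough tb))                                  ; condition; conditioning on cough
                                                  ; alone raises P(flu), adding tb
                                                  ; drops it back: explaining away

; the "parameterized Bayes net": properties become functions of a person
(define (flu person) (flip 0.2))
(define (tb person) (flip 0.1))
(define (cough person)
  (flip (if (or (flu person) (tb person)) 0.9 0.05)))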
Now, if you're on the ball and looking at this, you might notice something a little bit funny going on in that representation: if I ask whether Bob has the flu twelve times, I might get twelve different answers. Well, I guess I can only get two different answers in this case, true and false, but I'll get different answers. So we add a construct to Church that makes it much, much easier to write the kinds of models that capture the notion of objects in the world with persistent properties. It's based on the idea of memoization. Here's what I just said: if I ask whether Bob is coughing twice, I'm not going to get "true" a hundred percent of the time, or fifty percent of the time if I ask whether the conjunction is true. So I'm going to add an operator that takes a function and memoizes it. Now, classically, in Lisp or any classical language, memoization is fantastically useful: it makes things go a lot faster, but it has no semantic content. It basically says cache the values in case you need them, and that doesn't change what you're computing. Interestingly, in stochastic lambda calculus, memoization does have semantic content: it changes what the function means, because basically what it says is, take a function and return a new function that does something random the first time you call it, and then reuses that random value every time you call it after that.

Yeah, question? Yeah, so there's no particular reason that I'm using this domain for these random functions. If you wanted to do the thing you were just describing, you would still have to have something in the semantics that causes this persistence, so that once it becomes defined, it's defined; you end up having basically the same kind of thing.

Okay, so for instance, the transformation we do to this simple coughing model is to say: flu is the function you get by first creating a function which takes a person and randomly decides whether they have the flu, and then memoizing it, so that flu is now a persistent property of that person, and every time we ask we'll get the same answer. And now, if we ask whether Bob is coughing and Bob is coughing, we're going to get that that's true half the time: either it's true and it stays true, or it's false and it stays false. The key thing here is that between different executions of the whole program the randomness can be different, so we have different possible random worlds; but within one execution it's going to be persistent.
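The memoized transformation, using Church's actual mem operator (probabilities again placeholders):

(define flu   (mem (lambda (person) (flip 0.2))))  ; persistent per-person property
(define tb    (mem (lambda (person) (flip 0.1))))
(define cough (mem (lambda (person)
                     (flip (if (or (flu person) (tb person)) 0.9 0.05)))))

(and (cough 'bob) (cough 'bob))   ; equals (cough 'bob): within one run it's either
                                  ; true both times or false both times, unlike the
                                  ; unmemoized version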
Let me take this moment to emphasize something else that's really important. It's part of the discourse in cognitive science, and I think it should be part of the discourse in all the rest of the fields where we care about intelligence: we're separating the process of inference from the representations and the inferences that they license. Church is a language that lets us specify representations that describe how you think the world is, and because they do that probabilistically, they thereby specify the inferential role: they specify what inferences you should make given different observations, in virtue of those representations. Church doesn't tell you anything directly about how to do those inferences. So it separates the representation from the process. This is basically the idea that David Marr and Tomaso Poggio had, usually called Marr's levels. The idea is that we can ask three different levels of question about intelligence. We can ask the computational level, or "what," question, where we try to describe what it is that an intelligent system or a person does. We can ask the algorithmic level, or "how," question, about how you actually do that, what the process is. And then you can ask the implementation level question of how that is actually implemented in a physical substrate. The argument, which you can get from Marr or many other people, is that these are not alternative questions; you have to give an answer to each of the three. They support each other, and we can get ideas from one level and apply them to the others, but we need to understand and describe the mind at all three levels. What I'm talking about here is purely at the computational level. We certainly have ideas about how this connects through to the algorithmic and implementation levels, and I'm happy to talk about those in the question period.

So the actual language that I'm using, Church, also has a bunch of implementations. There's MIT Church, which is an MCMC-based engine built on an interpreter; it's mostly defunct now because it's too complicated for me to want to look at. There's Bher, which is again MCMC but is a compiler; it's faster and more fun to play with. And there's Cosh, which is based on a very fancy dynamic programming algorithm. Bher and Cosh each have their uses: Cosh is what we use for the cases where you embed a query inside a query, which I'm going to show you in a minute and which turns out to be very intractable for standard algorithms, and Bher is good for more standard kinds of models. Oh yes, and we have a website, which is occasionally maintained and updated, at projects.csail.mit.edu/church, where you'll find various kinds of information about Church implementations and papers. Maybe most critically, the very top link on that page is a link to a tutorial on probabilistic models of cognition using Church, where you can get a more hands-on idea of how Church works and what it's for. I'll just quickly show you an example; we're going to come back to this in a second. This is a page from the Church tutorial on the wiki, and the really nice thing, except that my internet connection is acting up, is that there are these little runnable boxes (oh, you can't see this; okay), so you can
type your code into these boxes and press go (you can go over here and press go), and it sends the code to our server at MIT, which sends back the results. So without actually installing the software, which is a little bit of a pain, you can play around and get a feeling for what it does and how it works. We limit the jobs to two or three minutes because our server is not that beefy.

Okay, so I'm going to go on now and give you a bunch of examples of using Church to model aspects of human cognition. First I'm going to show you what's mostly a thought experiment about representing complex reasoning, and then I'm going to give you a couple of examples that connect up to psychological data, about multi-agent reasoning and concept acquisition. Any questions before I go on?

(Question about undirected knowledge.) Yes, it turns out to be pretty straightforward. I can't remember if I have an example of this, but because the conditioning statement in Church can be an arbitrary statement, essentially undirected knowledge goes in the condition. Say you have two variables A and B, and you want to say that there's some tendency for them to be the same: then you condition on flipping a coin, whose weight depends on whether they're the same, coming up heads, and that means the same thing as an undirected statement. So I can show you how to write Ising models and things like this. Certainly Church is prettier for directed models; it's intended to express those easily. The correspondence I was mentioning earlier is that once you get a sufficiently rich language, there's a kind of blurry line between the two, and in particular you could make arbitrary Markov logic statements and condition on them in Church and get something that's equivalent to Markov logic. Other questions?

(Question.) Okay, so the question was whether there's any way to learn a Church program from data, in the way that people try to learn Bayes nets from data. Great question. I'm going to talk about that at the very end, hopefully, if I have time. The short version is, as you might expect, that it's hard; it's basically a program induction problem. I think there are specific cases where we can make real progress, and in the third section I'll actually show you very simplified versions of that predicting human concept learning performance with very high accuracy. But we'll come back to that.

Okay, let's talk about complex reasoning a little bit. Imagine a scenario like this: the game of tug-of-war. You all know about the game of tug-of-war, and if you start thinking about what tug-of-war entails, there are a lot of interesting concepts that are part of your knowledge about it. There's a concept like strength, and the magnitude of strength: strength isn't boolean, it's a magnitude. There's laziness: maybe on a particular match somebody is just lazy on that match. There's pulling, and of course laziness doesn't make sense without pulling. There's the concept of a team, people on one side of the rope, and of course the concept of the winner of a match. So how would we capture something like this? I'm going to show you a very simple kind of Church model that captures those concepts. It's not meant to be the end-all and be-all of the knowledge we have about tug-of-war, but it's meant to illustrate how you can capture fairly rich causal knowledge using a language like this.
First of all, we say that strength is a persistent property of a person, so we use mem to create a persistent property: a function that, for each person, associates a draw from a Gaussian, a Gaussian over the strengths of individuals. Similarly, we treat laziness as a random quantity, but laziness is not persistent: you can be lazy on one match and not lazy on another, so we don't memoize it. We have a simple definition of pulling, where we say that how hard somebody is pulling depends on whether they're being lazy: if they're being lazy, they pull with half their strength, otherwise with their full strength; a simple representation of some knowledge about the relationship between strength, pulling, and laziness. We define how hard a team is pulling all together, and this is just a little functional magic that says: there's a pulling function, you apply it to every person on the team, and then you add up however much they're pulling. Since we're at Google, I should call this a little MapReduce over the team. And finally, there's the concept of who wins a match between two teams, which is just the simple-minded thing of saying: if team one is pulling harder than team two, then they win, and otherwise team two wins.

Already you can ask some interesting questions from just this simple model, questions like: given the outcomes of a bunch of matches, how strong do you think Bob is? If there's a match where Bob and Mary play against Tom and Sue, and Bob and Mary win, and another match where Bob and Bev play against Jane and Jim, and Bob and Bev win, do you think Bob is strong or weak? Well, he keeps being on these winning teams, so, all other things being equal, he's probably relatively strong. We did a little experiment like this in the lab, well, using Mechanical Turk; "we" being Tobias Gerstenberg and I. We showed subjects, human subjects, or I should say "people," since we're not at a psychology conference, little scenarios like this, where there were a bunch of matches and an indication of who won, and then we asked them to rate the strength of one of the players. We varied both the size of the matches and the compositions of the different teams fairly systematically, so there were 20 conditions, eight of them single-player and 12 of them two-player. The result is really encouraging. On the y-axis I'm showing you the prediction of this very simple Church model that I wrote down for you, and on the x-axis I'm showing you the mean ratings of the human subjects. What you see is a scatterplot that is very linear, indicating that we're predicting extremely well the judgments that people make in this scenario; the correlation is something entirely ridiculous, 0.988, so there's really not much variance left. The claim here is not that we're capturing all the knowledge people have of tug-of-war, but that we're capturing the kind of causal dependencies that go into making these particular simple judgments.

What's interesting about this kind of language is that we're not limited to just the kind of evidence and reasoning we used in that experiment. We can easily extend this and make complicated assumptions, like: assume that Bob and Mary won their match, but assume you also happen to already know that Tom is stronger than Bob; then what's your conclusion?
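Here is a reconstruction of that little model, close to the version in the Church tutorial; the Gaussian parameters and the laziness probability are placeholders:

    (define samples
      (mh-query 100 10
        ;; strength is persistent, so it is memoized
        (define strength (mem (lambda (person) (gaussian 10 3))))
        ;; laziness is per-match, so it is not
        (define lazy (lambda (person) (flip 0.25)))
        (define (pulling person)
          (if (lazy person) (/ (strength person) 2) (strength person)))
        ;; the little map-reduce over the team
        (define (total-pulling team) (sum (map pulling team)))
        (define (winner team1 team2)
          (if (> (total-pulling team1) (total-pulling team2)) team1 team2))
        ;; query: how strong is Bob?
        (strength 'bob)
        ;; condition: the observed match outcomes
        (and (equal? '(bob mary) (winner '(bob mary) '(tom sue)))
             (equal? '(bob bev) (winner '(bob bev) '(jane jim))))))

A condition like "Tom is stronger than Bob" is just one more conjunct, (> (strength 'tom) (strength 'bob)), and a complex question is just a richer query expression.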
So this is a complicated condition, a complicated assumption, and we can ask complicated questions too. We can ask: how likely is it that either Bob was being lazy or Mary is stronger than Sue, given such-and-such? You can start to see how having this fine-grained compositionality in a probabilistic language starts to give you the kind of productivity I was asking for at the very beginning of this talk: with a very small set of functions, of concepts, we can ask a lot of different questions and make inferences about them. So the points of this example are to illustrate what I just said, that a small set of concepts leads to a huge set of potential inferences, and also, critically, that by composing stochastic functions instead of deterministic ones we get productivity together with graded probabilistic inference, which turns out to be what we need to capture the kinds of judgments people actually make.

Now you might say, okay, that's fine, but let's make this more interesting: how can we add really interesting concepts to this kind of game? We could add something like a reasoning task: say Alice believes that jumping causes luck, and Alice wants to be lucky at the game; how likely is it that Alice will jump before the game? People have no trouble reasoning about that kind of thing. Or, more complicated: if John said "some of the plants sprouted," and these plants almost always sprout, and John couldn't actually see all of the plants, how likely is it that all of the plants sprouted? In the next section, what I want to do is show you some work we've done to capture this sort of reasoning using Church, and some experiments that go along with it.

(Question: even for those complex conditions?) We haven't finished running those conditions yet; check back. But I agree, and maybe the following sections will make you feel better that we're exploring that generativity. The real promise of this kind of thing is to capture the productivity of people's reasoning from these more complicated, more heterogeneous kinds of information.

(Question.) No, not at all, and that's critically not the claim. The claim is about the representations and the inferences that they license, not about the process of the inference. MCMC happens to be useful; in the simulations I showed you, I don't remember whether we actually used MCMC or a more explicit strategy, but the point is that the competence predicted by having that knowledge captures pretty well the confidence that people exhibit. I'm happy to talk more about the algorithmic and brain connections toward the end.

Okay, so we're on to applications, and the first set I want to tell you about are applications to capturing human social reasoning, also called theory of mind. Let me start with a thought experiment; well, actually this is a real experiment, but let's do it as a thought experiment first. Imagine you've got this friend Bob, and Bob has a favorite toy, this box, which he one day brings into the office. He puts it on the table, he presses the two buttons, and the light comes on. The question for you is: how does the box work? It could be that the green button alone makes the light go on, or the red button alone makes the light go on, or either one of those buttons would have made the light go on, or maybe you need both of the buttons, A and B,
together, to make the light go on; or maybe there's no relationship at all. Okay, so let's do a little informal experiment. How many people think that A alone, the green button alone, causes the light to go on? How many people think that the red button alone causes the light to go on? How many people think that either button could have made the light go on? And how many people think you need buttons A and B together? By my estimate, about 30-ish percent said either button could have done it, and roughly the rest said you need both of the buttons, which is great; I'll show you data from experimental subjects momentarily that confirms that.

The interesting thing about this scenario is that if you treat it as a purely causal learning scenario, you can't actually draw any strong conclusions at all. We can model this; here's a very simple sketch of a Church model. Basically what we're saying is: the world's causal structure comes from some prior over causal structures; the action Bob took is random, since we didn't know what he was going to do; the outcome comes from applying the world's causal structure to the state of the world and the action Bob took; and we ask what the world's causal structure is, conditioned on the observations: he pressed the two buttons and the light came on. This, by the way, is a model that's been published and that has done very well at capturing a lot of human judgments; it's basically the model of Griffiths and Tenenbaum (2005). (Question.) We're going to get there, give me two seconds. The prediction of this simple causal learning model is actually the right prediction from the point of view of a good scientist: this was confounded evidence, two events followed by a third event, and you shouldn't draw any conclusions until you can unconfound it. This model does exactly that; it says, "I really don't know."

On the other hand, you guys were drawing pretty strong conclusions, so how can we account for that? The basic idea is that when people reason about this situation, they reason about why Bob took the actions that he took, exactly the suggestion from a second ago. So we need to add some more information to the model, information about the relationship between beliefs, desires, and actions, and of course I already argued that this isn't something you can usefully capture in a simple Bayes net. Fortunately, we can capture it pretty directly in this richer language, in Church. What I've written down here is one possible way to do that. It's basically a representation of a rational agent assumption, which says that people take actions that approximately optimize the probability that their goal is achieved. Decide is a function that depends on the state of the world according to the agent and the causal model according to the agent (those are the beliefs), and on the goal of the agent. What the agent does is an inference: infer the action I should take, conditioned on my goal being satisfied if I take that action in the world that I believe to be true. This gives you something basically equivalent to standard Bayesian decision theory; it's a softmax rather than a hard-max policy, but it's a standard representation of rational agents. Now what we can do is take that and compose it with the rest of our model, the previous causal learning model.
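A minimal sketch of that rational-agent assumption and of the composition, in the style of the Church tutorial's theory-of-mind examples; action-prior, structure-prior, initial-state, and light-on? are hypothetical helpers standing in for the details on the slide:

    ;; a rational agent: infer an action such that, in the world the
    ;; agent believes in, taking that action satisfies the goal
    (define (decide state causal-model goal?)
      (rejection-query
        (define action (action-prior))
        action
        (goal? (causal-model state action))))

    ;; composing with causal learning: the source of Bob's action is
    ;; no longer uniform but a rational decision by a knowledgeable agent
    (define (learn-structure)
      (rejection-query
        (define structure (structure-prior))
        (define action (decide initial-state structure light-on?))
        structure
        ;; observed: Bob pressed both buttons, and the light came on
        (and (equal? action '(a b))
             (light-on? (structure initial-state action)))))

The point of the sketch is just the shape of the composition: one query (the agent's decision) nested inside another query (your inference about the box).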
So we compose this knowledge about agents with the knowledge we have about causal structure in the world. Now I'm showing you a model where we make two additions. First, we make the rational agent assumption: the source of actions is not uniform, but is a rational decision by the agent we're watching. Second, in this case I add a knowledgeable agent assumption: the agent knows how the box works. In all of these scenarios we say something like "this is Bob's favorite toy," so it's pretty plausible that he knows what's going on. Now we do one big joint inference over this composed model, and the prediction we get is that it's much more likely that you have to press both buttons, A and B, to make the light go on than anything else. To anthropomorphize a little, what's going on in this inference is: well, why else would he have pressed both of the buttons, unless he thought he needed to?

Since I'm a psychologist most of the time, part of the time, I did an experiment to verify the intuitions here. We set up a bunch of scenarios parallel to this Bob's-box scenario, things like: you work at a genetically engineered plant nursery, and one of your coworkers is tending to some almost-dead flowers that you haven't seen before; your coworker poured a yellow liquid and a blue liquid on the flowers, and by the end of the day the flowers are growing again. Then we asked the subjects what causes the flowers to grow. We elicited the answers in the form of bets on the different causal structures, to get something comparable to a probability distribution. Critically, we had a control condition: the same scenario, but where the two events were the outcome of a physical event, with no agent present, something like a small earthquake knocking over a yellow liquid and a blue liquid, which pour onto the flowers; everything else the same. We did this with several different cover stories across several different domains, to make sure it wasn't specific to machines or something like that. The result is that people, as you did, quite strongly draw the inference that you need both of the buttons to make the light go on, just like the combined social-causal model and not at all like the causal-alone model. And that effect basically goes away, and even reverses a little, in the physical condition where there's no agent. We've replicated this a bunch of times, with tighter control conditions and so on, so it's a pretty robust effect. The conclusion I draw from it is that people are definitely doing causal reasoning in the context of social reasoning, and that to capture that formally, we need models that compose these different pieces of domain knowledge, from different domains, really compositionally.

(There's a question.) Yeah, you definitely could. The model I showed you is the base model, where we say: let's assume that when I'm thinking about another agent, I ascribe rationality to them. The function I wrote down to capture my generative knowledge about that person was essentially rational, softmax-optimal. There's nothing to stop me from giving a more detailed representation of how I think people make decisions, one that includes potential irrationalities.
And a really interesting research program that hasn't been touched yet is to take these theory-of-mind experiments and look for evidence of which of the irrationalities that people have other people know about, because there's a distinction between the actual irrationalities in decision-making that people exhibit and what they ascribe to other people when they're reasoning about other people. It's a totally interesting set of questions.

(Question.) So there are two kinds of things that go wrong. At one level you could say nothing goes wrong: I could take those programs and write a Bayes net that approximates them. What's unsatisfying is twofold. The first is that what you get represents decision-making by a couple of little arrows, with all the work done by a very complicated function which isn't represented in the Bayes net. Josh Tenenbaum and I like to call these the thick arrows: real knowledge about the world has thick arrows. The Bayes net is not false; it's true that there's dependence. It's just a black box that doesn't expose the actual knowledge inside those thick arrows. The other thing is that as you go beyond a very simple situation, you want to apply this to a generative notion of what people's goals are in different situations in the world, and this model immediately extends: you ascribe to the person a more complicated set of beliefs about the world, and the set of possible goals and possible actions immediately grows, because it's all built compositionally. What you would have to do with a Bayes net is have the modeler sit down and make a new Bayes net for each of those additional situations. It's very much like the relationship between first- or higher-order logic and the propositional grounding of the same thing: with respect to any given problem you can write down a giant propositional grounding, like Stuart Russell's calculation about chess, where you can certainly take the one page of first-order logic and write out the thousand pages of propositional logic, because you know what all the squares of the chessboard are. But it's both less productive, because you couldn't transport that to a different chessboard, and a much less clear representation of what people know. And I should say that you can get the results from this kind of model the way we usually do in probabilistic-AI land, using an informal mixture of English and math and dependence diagrams to write the model, and then going off and implementing it. You can do that, but then you won't have a formal representation of the model that you can compose with other things and use productively and generatively, which is kind of the goal.

Okay, so I want to move on to a slightly more sophisticated kind of social reasoning: situations where it's not just me reasoning about some other agent that I see in the world, but me reasoning about another agent who, I think, is reasoning about me. These are multi-agent reasoning, or recursive social inference, situations, and the classic ones are coordination games. Schelling was one of the main people to introduce
these in the 60s, and they've been followed up by a lot of people, including my colleague Herb Clark. They're situations like this one: Alice and Bob have arranged to meet at the bar, and it's only after they get off the phone that each of them realizes they didn't agree on which bar they were going to meet at; and let's say their phones have died, so they can't call back. How do they decide which bar to go to? Now let's imagine, a little bit further, that there's a slightly more popular bar in town, not a whole lot more popular, but a little. The observation Schelling made is that in situations like these, people will choose the same outcome much more easily, rapidly, and reliably than it seems they should, based just on the base situation. The example he uses is: say we decide to meet in Manhattan; where will we both go? I don't actually know Manhattan, so I'm not sure what the right answer is, but I think it's Times Square. Does that sound like a plausible place to meet in Manhattan? Grand Central? All right, obviously there's some diversity of opinions there.

We're going to capture this in an incredibly simple way. We start by modeling Alice and Bob separately as deciding where to go. Each of them knows which is the good bar and which is the bad bar, and has a slight prior toward the good bar: 55 percent of the time they'll go to the good bar. Each of them is a function, and when you call the Bob function, it samples a location from that prior; 55 percent of the time he'll end up at the good bar. They're not going to coordinate very often this way. So let's make it a little fancier. What if Bob is actually doing an inference? Bob says: I'm sampling the location I should go to from my prior, but conditioned on ending up at the same place that Alice will; conditioned on "where I go is where Alice goes." That sub-call to the Alice function is Bob reasoning about Alice. Now Alice, so far, is being very self-centered: she just goes where she wants. But that wasn't how I set up the scenario; I at least implied that Alice likes Bob as well, so Alice is going to try to go where Bob goes, and we write down a symmetric function where Alice decides where to go, conditioning on that being where Bob will go. This is really elegant, with one slight drawback: it's a mutually recursive pair of functions that will never stop recursing. There are two things we can do about this. The first is to just add an extra argument everywhere, the depth of reasoning, and make them stop reasoning about each other after some point. So here we've added this depth argument, and the critical piece is that once we get to depth zero, Alice decides to just be selfish and go to the bar she wants to go to; we condition on true. The prediction of this model is the following: if they don't coordinate at all, if there's no depth of recursion, they'll go to the same bar fairly infrequently, about half the time (slightly more than half, but basically at random). As they reason about each other to a greater degree, they begin to coordinate, and it very quickly approaches an essentially deterministic choice of the popular bar.
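A sketch of the depth-bounded version, following the coordination-game example in the Church tutorial; the 55 percent prior is from the scenario, everything else is illustrative:

    (define (sample-location)
      (if (flip 0.55) 'popular-bar 'other-bar))

    (define (alice depth)
      (rejection-query
        (define alice-location (sample-location))
        alice-location
        (if (= depth 0)
            true  ;; at depth zero, just be selfish
            (equal? alice-location (bob (- depth 1))))))

    (define (bob depth)
      (rejection-query
        (define bob-location (sample-location))
        bob-location
        (if (= depth 0)
            true
            (equal? bob-location (alice (- depth 1))))))

    ;; (alice 0) is just the prior; as depth grows, (alice n) and
    ;; (bob n) land on the popular bar nearly deterministically

For the variant described next, you would replace the depth test with a coin flip, something like (if (flip 0.8) (equal? alice-location (bob)) true), so the recursion stops stochastically rather than at a fixed depth.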
So basically what we've done is make a model which is essentially the same as the social inference model I gave you before: we're just saying, what if they both have theory of mind and each has the goal of going where the other one is going? And what falls out is coordination. There's a slightly prettier way to do this. Instead of adding this depth argument, which is a little bit funny, we just say that there's only some probability that Alice is going to bother reasoning about Bob: with probability 20 percent she'll just stop and go where she wants to go, and otherwise she'll reason some more about Bob. You can think of this as a geometric mixture of the different reasoning depths, and the prediction here is that they'll coordinate pretty well, but not perfectly, and it depends of course on that probability of recursion. This, by the way, is something that is very natural to write down in a full probabilistic programming language but is totally crazy in most other representations, because it's a stochastic recursion that builds unboundedly large computations: it will always stop recursing at some point, but you can't put an upper limit on the size of the recursion.

(Comment from the audience.) Yes, let's start talking about that kind of thing; I always like it when I get leading questions like that, it makes me feel kind of superhuman. Okay, so I want to move on now from simple coordination to more explicit communication, communication games in particular. The beautiful thing about coordination is that those two people, without really communicating anything explicitly, managed to coordinate and do something useful: they end up at the same bar. But now imagine that there are a thousand different bars, or a complexly structured space of bars, or that they never actually talked on the phone and decided they were going to a bar, so they don't actually know what Alice wants to do, and so on. Pretty soon, simple coordination as a coordination game blows up; it goes back to chance and isn't very useful. So what we need to do is introduce ways of communicating with each other that narrow the space of possibilities and get the coordination game off the ground. When I say "what we need to do," that's, I guess, a design claim about language: this is what it would be great for language to do. So the first step is to ask what happens if we actually have an opportunity to pass signs to each other in order to help us coordinate, signs which themselves have a meaning in the natural world. Let me give you a little thought experiment on this; this one is less robust, but I'll try it anyway. Imagine there are two dice, A and B, and each of them is a three-sided die with a red side, a blue side, and a green side. We sit together and roll them a bunch of times, and we see that they're actually weighted (yes, I know, a three-sided die, they're magical, but just imagine). What we see is that the sides come up in certain proportions, so they're definitely not fair dice, but we get some feeling for how they actually behave, for the frequencies of the different sides. Now imagine two different situations. In the first situation, I take the two dice, put them into a bag, shake it up, and without looking in, I draw one out and roll it, and one of the sides comes up, and I show you: oh, it's the green side that came up.
Now, how many of you think this is die A, and how many think it's die B? How many think it's die A? A couple. How many think it's die B? Most of you. Okay. The modeling of this situation is pretty straightforward; it's just a simple Bayesian inference. There was a random thing in the world whose hypothesis we want to infer, so this just looks like a function where the observer is a function of the side of the die that the observer sees, and that's a query: an inference conditioned on the outcome, conditioned on "if I had rolled that die, I would have seen that side." And the conclusion is: it's probably die B, a little less strongly than you guys had it, but still pretty clear.

Now here's the interesting case. Imagine we play a different game. I say: hey, let's play a game where I try to teach you, to communicate to you, which die I'm holding, but I can only do it by showing you sides of these dice; no talking. So now I put the dice into the bag, and I look into the bag and choose one, so I know which die it is, and then I show you one of the faces of the die, and I again show you the green face. Now how many of you think this is die A, and how many think it's die B? Okay, so it shifted: now slightly more of you think it's die A than die B, which is definitely a big difference.

So how are we going to model this? We do the same thing: we model this as you, the observer, reasoning about me, the teacher, who is reasoning about you, the observer, and so on. We start out with a simple observer, which doesn't do any social reasoning. Then we say that the speaker (let me call it a speaker instead of a teacher, because that carries over to what happens next) basically does an inference: which side should I show, conditioned on the observer inferring the right die if they see that side? That's the goal: for the observer to get the right die. Now the observer themselves, that was you guys, is not just a simple-minded Bayesian observer, because you infer which die it is given that the speaker chose to show you that side of the die. And now we turn the crank: we wrap a query around all of this and ask what the listener should infer, given the green face. What we get is a switch: for the simple observer model, with no social inference, die B was the most likely conclusion, and for the social inference model in this social game, the listener model, die A is now the more likely conclusion. So it's the same switch that you exemplified for us, although in slightly different proportions.

(Question: did we run the recursive version?) We did; I'll show you the next version, where I have a better answer to that. The next version is one where we do recursive reasoning, instead of going just one level deep, with the same kind of cutting-it-off trick: the listener decides whether to just reason about the physical process or to reason about the speaker. We're using dynamic programming for this, so the computational cost is linear in the depth of recursion that you approximate to, and bad in the size of the hypothesis space, the state space. It's a correct approximation in the sense that this is exact inference, except that we have to bound the recursion depth at some depth.
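A minimal sketch of the one-level nested-query structure just described; the die weights are placeholders, and whether the listener's conclusion flips relative to the naive observer's depends on choosing them suitably:

    (define (roll die)
      (multinomial '(red green blue)
                   (if (equal? die 'a) '(0.1 0.4 0.5) '(0.3 0.5 0.2))))

    ;; which die, given a side that came up on a random roll?
    (define (naive-observer side)
      (rejection-query
        (define die (if (flip 0.5) 'a 'b))
        die
        (equal? side (roll die))))

    ;; which side should I show, so that the observer infers the right die?
    (define (speaker die)
      (rejection-query
        (define side (roll die))
        side
        (equal? die (naive-observer side))))

    ;; which die, given that the speaker chose to show me this side?
    (define (listener side)
      (rejection-query
        (define die (if (flip 0.5) 'a 'b))
        die
        (equal? side (speaker die))))

Each nested query carries its own unknown normalizing constant, which is exactly the intractability issue discussed next.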
We take it out to a hundred iterations or something. (What's that?) It's not bad in this particular case. This is probably a good time to make a comment about these sorts of models. You'll notice that their structure is queries nested within queries, recursively. Each query is a conditional probability, an inference, so in more traditional probabilistic terms, each of those recursions accumulates an extra unknown normalizing constant. Every time you have an unknown normalizing constant, that's called intractable; if you do it twice, that's a class of models people worry about called doubly intractable; so this is infinitely intractable, or something horrible like that. MCMC is completely hopeless for this. But it turns out that there is appropriate structure to do dynamic programming in these models, essentially by sharing on the recursion, and we have an algorithm that analyzes the program, detects that structure, and then does the dynamic programming fairly tractably.

(Question: can you actually solve problems this way?) Well, I would say yes, we can; by which I assume you mean very simple problems. It's really interesting in this domain: the problems are very simple in the sense that the state space, the world the agents are in, has to be very simple, but the reasoning gets very complicated. We're doing a series of experiments where we give people what we call sitcom situations, where we manipulate things like whether Alice knows that Bob doesn't like her, and whether they agreed to go to the same place, and so on; things which in terms of the worlds are simple, but in terms of the reasoning and the potential outcomes get very complicated. It's less tractable than the simplest Markov logic, but it's also more powerful in terms of representation. And I should say, and I'll get to this at the end of the talk, that a big part of this talk is to say: look, we have this language, this representational system, and we have implementations that do well enough to show that it has real promise for modeling interesting cognition. But it's really a call to arms: people need to start working on this kind of thing and really understanding the possibilities for inference and the difficulties. Not a case closed at all; quite the opposite, a case wide open.

(Question.) Yes, exactly: you can write this model with helpfulness as the explicit goal. It's much bigger, but that's basically what's going on; it collapses down to that particular goal. Okay, the recursive model does about the same thing, although the inferences are stronger. I'll just briefly say that my colleague Pat Shafto and I have proposed this as a model of how humans reason in natural pedagogy contexts, where they have to teach by example. We've done a bunch of experiments on teaching games that are a lot like that dice game, although slightly richer. For example, there's a rectangle concept that you have to transmit to each other; maybe I, as the teacher, give you those examples, and you have to decide whether it's a little rectangle or a big rectangle. It should be pretty clear to all of you that it's the big rectangle, because why else would I have drawn that negative example way out there? That's not the prediction of
standard inference models; it's the prediction of this communication-based model, and it also happens to be what people infer. And we have a control condition: if people don't have reason to believe that they're in a helpful teaching context, those inferences go away. We've also done something very similar with kids (sorry, I'm skimming over this because I'm realizing time is getting short), where we make the prediction that explicit pedagogy will actually limit the amount of exploration that kids do, because the model basically reasons that if there were anything else to see in this toy, the adult would have shown it; the adult didn't, so there must not be anything else. And what we see, kind of amazingly, is exactly this, what Liz Bonawitz, the first author of this work, calls the double-edged sword of pedagogy: children explore much less when they're given direct pedagogical instruction than when they see the equivalent thing accidentally, or are just handed the toy to play with.

Okay, I'm going to skip the next part, which would be moving from these natural signs, signs with a natural grounding in the world, to more arbitrary signs, in other words real language, and just tell you that we've used this to model scalar implicature, and it turns out that scalar implicatures, like the one where "some" implies "not all," fall out of exactly the same kind of recursive social reasoning. Let me skip on to the next section, which is a little bit about concept learning.

Before we actually get to concept learning per se, I just want to emphasize that the ability to make a new abstraction within the language is really key, and is a really unusual thing for probabilistic modeling languages. It lets you do some kind of wild things. We could imagine an abstraction which takes a triangulation and renders it to an image; that's just graphics, and graphics is a solved problem. Because that's a program, we can package it up into an abstraction and have it explicitly within our probabilistic model, and then we can try things like: assume that we observe a noisy image, and infer the triangulation that gives rise to that image. This is something we've been playing with, and this is an example of running that Church query given the target image on the right; I'm showing you the trajectory of the inference over time as it figures out the triangulation that would give rise to something that looks like that face. Now, inference is clearly incredibly hard here, which is sort of the point. This is an example from a paper that David Wingate and I and several others have written, trying to use non-standard interpretation techniques to make probabilistic inference work better. So the inference issues certainly get bad, but the fact that we can write this down at all within the language is because we have such an expressive language, one that can have very complicated computations packaged up into one piece.

Going to something a little bit simpler, but in my opinion deeper: we can make new abstractions that are stochastic abstractions within the language. Here's a teeny little fragment that just says: my-logic-function is going to be a random choice among logical relations, say an or, a noisy-or, and a noisy-and, and then the variable C is related to the variables A and B through my-logic-function.
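A minimal sketch of that fragment; the relations and probabilities are illustrative:

    (define samples
      (mh-query 1000 10
        ;; the abstraction itself is a random choice among relations
        (define which (uniform-draw '(or noisy-or noisy-and)))
        (define (my-logic-function a b)
          (cond ((equal? which 'or) (or a b))
                ((equal? which 'noisy-or) (flip (if (or a b) 0.9 0.1)))
                (else (flip (if (and a b) 0.9 0.1)))))
        (define a (flip))
        (define b (flip))
        (define c (my-logic-function a b))
        ;; query: which logical relation holds between A, B, and C?
        which
        ;; condition: some hypothetical observations
        (and a (not c))))

Conditioning on A being true while C is false, for instance, shifts belief away from the or-like relations; the inferred value of which is exactly a thick arrow between the variables.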
Now what's interesting is that I can do an inference where I make a query that says: given some observations I've made, say I observe that A is true and C is false, what is the logic function likely to be? In other words, I'm actually inferring the logical relation between these variables, that thick arrow that I mentioned earlier. We've applied this as a simple, naive, but, it turns out, successful model of human concept learning. Traditional human concept learning, or boolean categorization, tasks look like the following: imagine I tell you that these are feps, and I tell you those are not feps, and now I ask you, the participant in this experiment: do you think this one is a fep? How many of you think this is a fep, and how many think it's not a fep? Okay, mostly you say it's a fep, which is great; you just replicated another classic psychology experiment. The standard results from this experiment are, first of all, that this item, which was not any one of the examples shown, is rated as a better fep than any of the examples shown; that's called prototype enhancement. Second, there's a gradation in these judgments, and some of the feps you've seen are judged as better feps than others. This was a really big problem for early theories of concept learning based on classical, deterministic logic, because there was no real way to understand this gradedness and these typicality effects from a view where you deterministically learn a rule that explains the examples. What we set out to do was ask whether that was because of the rules themselves or because of the lack of probabilistic inference in those algorithms. So we did a really simple thing. You can easily represent a space of simple feature-based rules in lambda calculus, things like "it's a fep if it has a flat head and round wings," and use these to classify different bugs. Then we can play a game where we define a stochastic function, a rule generator, which makes a random choice of whether to just stop and sample a feature, like flat-head or round-wings, or to recursively sample the rest of the rule and combine it with a specific feature. In this case I've shown you a simple combinator that just takes conjunctions, so at the end you get a conjunction of features. The key property here is that because you have to flip that coin a lot of times to get a longer rule, the probability of longer rules is lower, so we get a kind of Occam's razor for free, just from the generative process we go through to generate a big rule. We do two other things, for reasons that are technical and that I won't talk about: we put some uncertainty over the rule probabilities, and we use a slightly richer language, full propositional logic in disjunctive normal form, so ORs as well as ANDs. Then we do the obvious thing: we assume there is a rule in the world that came from the rule generator, we ask what it's going to say about a new object, and we condition on all the labeled objects I've seen: that one's a fep, that one's not a fep. Only, if I just did this, I would be back in deterministic rule-learning land, so instead I'm going to do this probabilistic inference with a little bit of noise.
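A minimal sketch of such a rule generator, with conjunction as the only combinator; the stop probability, the feature set, and the has-feature? helper are placeholders:

    (define features '(flat-head round-wings spots long-legs))

    ;; sample a classifier: a random conjunction of feature tests
    (define (gen-rule)
      (if (flip 0.5)
          ;; stop: a single random feature test
          (let ((f (uniform-draw features)))
            (lambda (obj) (has-feature? obj f)))
          ;; recurse: conjoin one more feature with the rest of the rule
          (let ((f (uniform-draw features))
                (rest (gen-rule)))
            (lambda (obj) (and (has-feature? obj f) (rest obj))))))

A rule with n conjuncts costs n coin flips, so its prior probability falls off geometrically with its length, which is the free Occam's razor just described.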
That is, I add a function that says some of those observations I made might have been outliers, with some probability. Now we use this model and compare it to human data. The horizontal axis is the predictions of this rule induction model, and the vertical axis is the human data. What we see, first and foremost, is that we're explaining human inferences ridiculously well: the correlation is 0.99 here, and we're fitting only one free parameter, this outlier probability. In addition, we capture the qualitative features that people worried about for several decades in category learning tasks: graded judgments, typicality, prototype enhancement, and so on (you can't see selective attention here). We've now applied this kind of model to a bunch of different concept learning tasks, some of them relational; we've looked at an interesting developmental case, learning the natural number concepts; and we've also gone in a direction that I'm particularly excited about. There's a disconnect between this model and the overall motivation of this tutorial: we're doing probabilistic inference, but we're inducing deterministic logic, and I've been arguing that concepts are something like a probabilistic logic. So the direction we've been moving in is the question you asked about earlier: when is it the case that we need a probabilistic logic, something like Church, as the representation for the concepts? What we've been focusing on is highly structured examples, where each example provides a lot more structure to learn from.

(There was a question back there about doing the inference.) I can show you; it takes about 30 seconds, because it turns out this is not a ridiculously complicated model. Here's the model I just described, written out in Church, and now, if I can get it on screen, which seems to be the problem... I can press that button... all right, something decided to die. Anyhow, it takes about 30 to 50 seconds depending on the server, not terribly long; this is not a terribly complicated rule inference task.

So we've moved on to these structured objects, and the idea is that structured objects provide a lot more evidence that you need a structured concept, so we're going to have to move from simple boolean features to something much more interesting for our representation. We did a set of experiments with these little colored trees; they're not perfect experiments, but I'm really fond of them. We looked at trees that were generated from concepts with varying complexity of structure, from simple prototype concepts all the way up to recursive concepts, single and multiple recursion, and things in the middle, like parameterized sub-parts. By and large, a Bayesian inference model over these more structured representations, which are now probabilistic (the concepts are little scraps of Church code), is able to explain human inferences pretty well across a bunch of different kinds of concepts. Now I should put a caveat on this: when we wrote this paper, about a year and a half ago, we didn't actually know how to do inference properly, so we did some kind of ugly approximations, and I suspect the story will be clearer now that we're figuring out how to actually do this.
So what we've been doing here, to think algorithmically about how we can possibly do this program induction task, is really drawing on ideas from a bunch of existing literatures. The basic idea is: program induction is really hard, so let's not try for a universal algorithm for the moment; let's see what we can do in more specific cases. In particular, what we've been trying to use is what I like to call syntactic analogy, which is the same thing as anti-unification, together with an argument compression move where you stochastically remove arguments from functions. So there's the inverse-inlining, or anti-unification, move, which finds two things that look about the same, with possible mismatches, abstracts them out into a new function definition, and thereby potentially compresses the observations. There's de-argumentation, where you look at all the arguments to one function and say, "I don't really care about the differences between those," so you get rid of that argument and replace it with a random draw; and there are a couple of moves like that. And there's a different de-argumentation move that says: if I look at all the arguments to that function F and I see sub-calls to F in several of them, let's introduce a stochastic recursion that randomly either does more of whatever F does or stops with the base case; so I introduce a stochastic recursion and get rid of that argument. The basic idea here is very parallel to the Bayesian model merging idea of Omohundro and colleagues, and it also builds pretty heavily on things from genetic programming and ILP, literatures where they care about things like anti-unification. We start out with an initial state, which is just a program that's an explicit mixture of the examples we see; we translate those trees directly into our language, because we have data constructors for them. Then we apply these transformations using the moves above, and we do search by whatever your favorite search method is; at the moment we're using beam search because it's simple, but obviously something like stochastic search would probably be better in the long run. We compare the goodness of two potential programs by looking at the posterior probability, and neat things happen. For example, if I gave you that little tree on the left as the input, you would first represent it as a program. You would then notice, hey, there's a lot of repeated structure, so I can extract out this function that says "a red node with a blue node, and then something." Then you would notice that a lot of the arguments to F are sub-calls to F, so you would consider changing that into a stochastic recursion, and it turns out that that actually has higher posterior probability here. Then I just drew forward samples from this concept, posterior predictive samples, and what you see is that, basically from this one example with the recursive structure, the concept you've learned is a tree that looks like that, with some number of units, and you're not too bothered if there are more units rather than fewer. Here's a fancier case, with a lot more structure, that we actually ran. This is some work we're writing a tech report on right now, myself, Irvin Hwang, who's sitting in the back there, and Andreas Stuhlmüller, and it'll be out shortly.
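A minimal sketch of that stochastic-recursion move on a toy tree program; node, leaf, and the recursion probability are placeholders:

    ;; before: the repeated structure has been abstracted into f, but the
    ;; depth of the tree is fixed by how f's argument nests calls to f
    (define (f x) (node 'red (node 'blue x)))
    (f (f (f leaf)))

    ;; after de-argumentation: the argument is replaced by a stochastic
    ;; recursion, so the learned concept generates trees of any depth
    (define (f) (if (flip 0.7) (node 'red (node 'blue (f))) leaf))
    (f)

Posterior predictive samples from the second program are trees with a geometrically distributed number of red-blue units, which is the "not too bothered by more units" behavior just described.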
Also, Irvin is looking for a job, so you should all hire him; he's good. Okay, I am nearly exactly out of time, so let me just say that the basic overall flow of what I was trying to argue for you is that...