21. Probabilistic Inference I

PATRICK WINSTON: Here we are, down to the final sprint. Three to go. And we're going to take some of the last three, maybe two of the last three, to talk a little bit about stuff having to do with probabilistic approaches-- use of probability in artificial intelligence. Now, for many of you, this will be kind of a review, because I know many of you learned about probability over the [? sand ?] table and every year since then. But maybe we'll put another little twist into it, especially toward the end of the hour, when we get into a discussion of that which has come to be called belief nets.

But first, I was driving in this morning, and I was quite astonished to see, as I drove in, this thing here. And my first reaction was, oh my god, it's the world's greatest hack. And then I decided, well, maybe it's a piece of art. So I'd like to address the question of how I could come to grips with that issue. There's a distinct possibility that this thing is a consequence of a hack, or possibly the result of some kind of art show. And in any event, some sort of statue appeared, and statues don't usually appear like that. So I've got the possibility of thinking about how all these things might occur together or not occur together.

So the natural thing is to build myself some sort of table to keep track of my observations. So I have three columns in my table. I've got the possibility of a statue appearing, a hack having occurred, and some sort of art show. And so I can make a table of all the combinations of those things that might appear. And I happen to have already guessed that there are going to be eight rows in my table. So it's going to look like this. And this first row is the combination where none of those things occurs at all. And down here is the situation where all of those things occur. After all, it's possible that we can have an art show and have a hack be a legitimate participant in the art show. That's why we have that final row. So we have all manner of combinations in between. So those are those combinations. Then we have F, F, T, T, F, F, T, T, F, T, F, T, F, T, F, T. So it's plain that the number of rows in the table, for these binary possibilities, is 2 to the number of variables. And that could be a big number. In fact, I'd love to do a bigger example, but I don't have the patience to do it.

But anyhow, what we might do, in order to figure out how likely any of these combinations are, is observe the area outside the student center and the rest of campus over a long period of time, and keep track of what happens on 1,000 days. Or maybe 1,000 months or 1,000 years. I don't know. The trouble is, these events don't happen very often. So the period of time that I use for measurement needs to be fairly long. A day is probably not long enough. But in any case, I can keep a tally of how often I see these various combinations. So this one might be, for example, 405, this one might be 45, this one might be 225, this one might be 40, and so on. And so having done all those measurements, kept track of all that data, then I could say, well, the probability that in any given time period one of these things occurs will just be the frequency-- the number of tallies divided by the total number of tallies. So that would be a number between 0 and 1. So that's the probability for each of these events. And it's readily calculated from my data.
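To make the bookkeeping concrete, here is a minimal sketch in Python of the table just described. The full set of eight tallies is hypothetical-- only a few of the numbers (405, 45, 225, 40) were actually read out in the lecture-- and the `probability` helper is mine; the tallies are chosen to sum to 1,000 and to roughly reproduce the probabilities quoted in the demo below.

```python
# Variables, in the order used for the row keys below.
variables = ("statue", "hack", "art_show")

# One tally per row of the 2**3 = 8 row table: (statue, hack, art_show) -> count.
# Hypothetical classroom numbers; they sum to 1,000 observation periods.
tallies = {
    (False, False, False): 405, (False, False, True):  10,
    (False, True,  False): 225, (False, True,  True):   5,
    (True,  False, False):  40, (True,  False, True):  60,
    (True,  True,  False): 230, (True,  True,  True):  25,
}

total = sum(tallies.values())
joint = {row: n / total for row, n in tallies.items()}  # each entry lies in [0, 1]

def probability(event, given=None):
    """P(event | given): probability mass of rows matching both, divided by
    the mass of rows matching the condition. Both arguments map name -> bool."""
    given = given or {}
    def matches(row, constraint):
        return all(row[variables.index(name)] == value
                   for name, value in constraint.items())
    condition_mass = sum(p for row, p in joint.items() if matches(row, given))
    event_mass = sum(p for row, p in joint.items()
                     if matches(row, given) and matches(row, event))
    return event_mass / condition_mass

print(probability({"statue": True}))                      # 0.355
print(probability({"statue": True}, {"art_show": True}))  # zooms up, as below
print(probability({"hack": True}))                        # about 50-50
```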
And once I do that, then I can say that I've got myself a joint probability table, and I can perform all manner of miracles using that joint probability table. So let me perform a few of those miracles while we're at it. There's the table. And now, what I want to do is count up the probability in all the rows where the statue appears. So that's going to be the probability of the statue appearing. So I'll just check off those four boxes there. And it looks like the probability of the statue appearing is about 0.355 in my model. I don't think it's quite that frequent, but this is a classroom exercise, right? So I can make up whatever numbers I want.

Now, I could say, well, what's the probability of a statue occurring given that there's an art show? Well, I can limit my tallies to those in which art show is true, like so. And in that case, the probability has just zoomed up. So if I know there's an art show, there's a much higher probability that a statue will appear. And if I know there's a hack as well as an art show going on, it goes up higher still, to 0.9. We can also do other kinds of things. For example, we can go back to the original table. And instead of counting up the probability that we've got a statue, as we just did, we're going to calculate the probability that there is an art show. I guess that would be that one and that one, not that one, but that one. So the probability there's an art show is one chance in 10. Or we can do the same thing with a hack. In that case, we get that one off, that one on, that one off, that one on, that one off, that one on, that one off. So the probability of a hack in any given time period is about 50-50. So I've cooked up this little demo so it does the "ands" of all these things. It could do "ors," too, with a little more work. But these are just the "ands" of these various combinations.

Then you can ask more complicated questions. For example, you could say, what is the probability of a hack given that there's a statue? And that would be limiting the calculations to those rows in which the statue thing is true. And then what I get is 0.781. Now, what would happen to the probability that it's a hack if I also know that there's an art show? Will that number go up or down? Well, let's try it. Ah, it went down. So that's sort of because the existence of the art show sort of explains why the statue might be there.

Now, just for fun, I'm going to switch to another situation, very similar. And the situation here is that a neighbor's dog often barks. It might be because of a burglar. It might be because of a raccoon. Sometimes, there's a burglar and a raccoon. Sometimes, the damn dog just barks. So let's do some calculations there and calculate the probability that a raccoon is true, similar to what we did last time. Looks like on any given night-- it's kind of a wooded area-- there's a high probability of a raccoon showing up. And then we can ask, well, what is the probability of the dog barking given that a raccoon shows up? Well, in that case, we want to just limit the number of rows to those where a raccoon-- or where the dog is barking. Looks like the probability of the dog barking, knowing nothing else, is about [? 3/7. ?] But now we want to know the probability of the raccoon-- that's these guys here that need to get checked. These are off. So that's the probability of a raccoon. Did I get that right? Oh, that's the probability of a burglar. Sorry, that was too hard. So let me go back and calculate-- I want to get the probability of a raccoon.
That's true, false, true, false, true, false, true. So the probability of a raccoon, as I said before, is 0.5. Now, what happens to that probability if I know the dog is barking? Well, all I need to do is limit my rows to those where the dog is barking, those bottom four. And I'll click that there, and you'll notice all these tallies up above the midpoint have gone to zero, because we're only considering those cases where the dog is barking. In that case, the probability that there's a raccoon-- just the number of tallies over the total number of tallies-- gee, I guess it's 225 plus 50 divided by 370. That turns out to be 0.743. So about 75% of the time, the dog barking is accounted for-- well, the probability of a raccoon under those conditions is pretty high.

And now, once again, I'm going to ask, well, what is the probability of a raccoon, given that the dog is barking and there's a burglar? Any guess what will happen there? We did this once before with the statue. The probability first went up when we saw the statue and then went down when we saw another explanation. Here's this one here. Wow, look at that. It went back to its original condition, its a priori probability. So somehow, the existence of the burglar and the dog barking means that the probability of a raccoon is just what it was before we started this game.

So those are kind of interesting questions, and there's a lot we can do when we have this table by way of those kinds of calculations. And in fact, the whole miracle of probabilistic inference is right in front of us. It's the table. So why don't we go home? Well, because there's a little problem with this table-- with these two tables that I've shown you by way of illustration. And the problem is that there are a lot of rows. And I had a hard time making up those numbers. I didn't have the patience to wait and make observations. That would take too long. So I had to kind of make some guesses. And I could kind of manage it with eight rows-- those up there. I could put in some tallies. It wasn't that big of a deal. So I got myself all those eight numbers up there like that. And similarly, the art show calculations produced eight numbers. But what if I added something else to the mix? What if I added the day of the week or what I had for breakfast? Each of those things, if it's a binary variable, would double the number of rows. So if I have to consider 10 influences all working together, then I'd have 2 to the 10th-- about 1,000 numbers to deal with. And that would be hard. But if I have a joint probability table, then I can do these kinds of miracles.

But Dave, if I could have this little projector now, please. I just want to emphasize that although we're talking about probabilistic inference, and it's a very powerful tool, it's not the only tool we need in our bag. The trouble with most ideas in artificial intelligence is that their hardcore proponents think that they're the only thing to do. And probabilistic inference has a role to play in developing a theory of human intelligence. And it certainly has practical value, but it's not the only thing. And to illustrate that point, I'd like to imagine for a few moments that MIT were founded in 1861 BC instead of 1861 AD. And if that were so, then it might be the case that there would be a research program on what floats. And this, of course, would be a problem in experimental physics, and we could imagine that those people back there in that early MIT, being experimentally minded, would try some things.
Oh, I didn't know that's what happened. It looks like chalk floats. Here's a rock. No, it didn't float. Here's some money. Doesn't float. Here's a pencil. No, it doesn't float. Here's a pen. Here's a piece of tin foil I got from Kendra. That floats. That's a metal. The other stuff's metal, too. Now I'm really getting confused. Here's a little wad of paper. Here's a cell ph-- no, actually, I've tried that before. They don't float. And they also don't work afterward, either.

I don't need to do any of that in the MIT of 1861 AD and beyond, because I know that Archimedes worked this all out. And all I have to do is measure the volume of the stuff, divide that by the weight, and if that ratio is big enough, then the thing will float. But back in the old days, I would have to try a lot of stuff and make a big table, taking into account such factors as how hard it is, how big it is, how heavy it is, whether it's alive or not. Most things that are alive float. Some don't. Fish don't, for instance. So it would be foolhardy to do that. That's sort of a probabilistic inference. On the other hand, there are lots of things where I don't know all the stuff I need to know in order to make the calculation. I know all the stuff I need to know in order to decide if something floats, but not all the stuff I need to know in order, for example, to decide if the child of a Republican is likely to be a Republican. There are a lot of subtle influences there, and it is the case that the children of Republicans and the children of Democrats are more likely to share the political party of their parents. But I don't have any direct way of calculating whether that will be true or not. All I can do in that case is what I've done over here: do some measurements, get some frequencies, take some snapshots of the way the world is, and incorporate that into a set of probabilities that can help me determine if any given parent is a Republican, given that I've observed the voting behavior of their children. So probability has a place, but it's not the only tool we need. And that is an important preamble to all the stuff we're going to do today.

Now, we're really through, because this joint probability table is all that there is to it, except for the fact that we can't record all those numbers, and it quickly becomes a pain to guess at them. There are two ways to think about all this. We can think about these probabilities as probabilities that come out of looking at some data. That's the frequentist view of the probabilities. Or we could say, well, we can't do those measurements, so I'll just make them up. That's sort of the subjective view of where these probabilities come from. And in some cases, some people like to talk about natural propensities, like in quantum mechanics. But for our purposes, we either make them up, or we do some tallying. The trouble is, we can't deal with this kind of table. So as a consequence of not being able to deal with this kind of table, a gigantic industry has emerged for dealing with probabilities without the need to work up this full table. And that's where we're going to go for the rest of the hour. And here's the path we're going to take. We're going to start with an overview of basic probability. Then we're going to move ourselves step by step toward the so-called belief networks, which make it possible to turn this into a practical tool. So let us begin. The first thing is basic probability. And basic probability-- all probability flows from a small number of axioms.
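Stated up front in symbols-- a restatement of the three axioms about to be walked through on the board, in the and/or notation he settles on:

```latex
\begin{aligned}
  &\text{Axiom 1:}\quad 0 \le P(a) \le 1 \\
  &\text{Axiom 2:}\quad P(\mathrm{true}) = 1, \qquad P(\mathrm{false}) = 0 \\
  &\text{Axiom 3:}\quad P(a \lor b) = P(a) + P(b) - P(a \land b)
\end{aligned}
```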
We have that the probability of some event a has got to be greater than or equal to 0 and less than or equal to 1. That's axiom number one. In a binary world, things have a probability of being true and a probability of being false. But the true event doesn't have any possibility of being anything other than true, so the probability of true is equal to 1. And the false event-- the false condition-- has no possibility of being true, so that's 0. Then the third of the axioms of probability is that the probability of a plus the probability of b minus the probability of a and b is equal to the probability of a or b. Yeah, that makes sense, right? I guess it would make more sense if I didn't switch my notation in midstream-- a and b. So those are the axioms. Mathematicians love to start out that way, and they can derive everything there is to derive from that.

But I never can deal with stuff that way. I have to draw a picture and think of this stuff in a more intuitionist type of way. So that's the formal approach to dealing with probability, and it's mirrored by intuitions that have to do with discussions of spaces, like so, in which we have circles, or areas, representing a and b. And to keep my notation consistent, I'll make those lowercase. So you can think of those as spaces of all possible worlds in which these things might occur. Or you can think of them as sample spaces. But in any event, you associate with the probability of a the size of this area here relative to the total area in the rectangle-- the universe. So the probability of a is the size of this circle divided by the size of this rectangle in this picture.

So now all these axioms make sense. When a is certain, the circle fills up the whole thing, and there's no other place for a sample to be, so it has to be a. So that probability goes all the way up to 1. On the other hand, if the size of a is just an infinitesimal dot, then the chance of landing in that world is 0. That's the bound on the other end. So this-- axiom number one-- makes sense in terms of that picture over there. Likewise, axiom number two. What about axiom number three? Does that make sense in terms of all this stuff? And the answer is, sure, because we can just look at those areas with a little bit of colored chalk. And so the probability of a is just this area here. The probability of b is this area here. And if we want to know the probability that we're in either a or b, then we just have to add up those areas. But when we add up those areas, this intersection part is added in twice. So we've got to subtract that off in order to make the equation come out right. And so axiom three makes sense, just as axioms one and two did.

So that's all there is to basic probability. And now you could do all sorts of algebra on that, and it's elegant, because it's like circuit theory or electromagnetism: from a very small number of axioms-- in this case, three-- you can build an elegant mathematical system. And that's what subjects like probability do. But we're not going to go there, because we're focused on getting down to a point where we can deal with that joint probability table that we currently can't deal with. So we're not going to go into a whole lot of algebra with these things-- just what we need in order to get to that network. So the next thing we need to deal with is conditional probability. And whereas those are axioms, this is a definition.
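In symbols, the definition about to be built up on the board:

```latex
P(a \mid b) \;\stackrel{\text{def}}{=}\; \frac{P(a \land b)}{P(b)},
\qquad\text{equivalently}\qquad
P(a \land b) = P(a \mid b)\,P(b).
```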
We say that the probability of a given b is equal to, by definition, the probability of a and b. I'm using that common notation to mean [INAUDIBLE] as is conventional in the field. And then we're going to divide that by the probability of b. You can take that as a definition, and then it's just a little bit of mysterious algebra. Or you can do like we did up there and take an intuitionist approach and ask what that stuff means in terms of a circle diagram and some sort of space. And let's see, what does that mean? It means that we're trying to restrict the probability of a to those circumstances where b is known to be so. And we're going to say that-- we've got this part here, and then we've got the intersection of a with b. And so it does make sense as a definition, because it says that if you've got b, then the probability that you're going to get a is the size of that intersection-- the pink and orange stuff-- divided by the whole of b. So it's as if we restricted the universe of consideration to just that part of the original universe covered by b. So that makes sense as a definition. And we can rewrite that, of course, as P of a and b is equal to the probability of a given b times the probability of b. That's all basic stuff.

Now, we do want to do a little bit of algebra here, because I want to consider not just two cases. What if we divide this space up into three parts? Then we'll say that the probability of a, b, and c is equal to what? Well, there are lots of ways to think about that. But one way to think about it is that we are restricting the universe to that part of the world where b and c are both true. So let's say that y is equal to b and c-- the intersection of b and c, where b and c are both true. Then we can use this formula over here to say that the probability of a, b, and c is equal to the probability of a and y, which is equal to the probability of a given y times the probability of y. And then we can expand that back out: the probability of a given y is the probability of a given b and c, and the probability of y is the probability of b and c, like so. Ah, but wait-- we can run this idea over that one, too, and we can say that this whole works is equal to the probability of a given b and c times the probability of b given c times the probability of c.

And now, when we stand back and let that sing to us, we can see that some magic is beginning to happen here, because we've taken this probability of all things being so, and we've broken it up into a product of three probabilities. The first two are conditional probabilities, so they're really all conditional probabilities-- the last one's just conditioned on nothing. But look what happens as we go from left to right. a is dependent on two things. b is only dependent on one thing and nothing to the left. c is dependent on nothing and nothing to the left. So you can sense a generalization coming. So let's write it down. So let's go from here over to here and say that the probability of a whole bunch of things-- x1 through xn-- is equal to some product of probabilities. We'll let the index i run from n down to 1: the probability of x sub i, conditioned on x sub i minus 1 down to x sub 1, like so. And for the first one in this product, i will be equal to n. For the second one, i will be equal to n minus 1.
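Written out, the three-variable expansion just derived and the generalization it suggests-- the chain rule:

```latex
P(a \land b \land c) = P(a \mid b, c)\,P(b \mid c)\,P(c),
\qquad
P(x_n, \ldots, x_1) \;=\; \prod_{i=n}^{1} P\!\left(x_i \mid x_{i-1}, \ldots, x_1\right).
```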
But you'll notice that as I go from n toward 1, these conditionals get smaller-- the number of things conditioned on gets smaller-- and none of the things to the left appear. It's only stuff that lies to the right. What I mean to say is, all of the things in the conditional have an index that's smaller than this index. None of the ones that have a higher index appear in that conditional. So it's a way of taking the probability of the "and" of a whole bunch of things and writing it as a product of conditional probabilities. So we're making good progress. We've done one. We've done two. And now we've done three, because this is the chain rule. And we're about halfway through our diagram, halfway to the point where we can do something fun.

But we still have a couple more concepts to deal with, and the next concept is independence. All this stuff up here-- oops. All this stuff here is the definition of conditional probability. And now I want to go to the definition of independence. So that's another definitional deal. But it's another definitional deal that makes some sense with a diagram as well. So the definition goes like this. We say that P of a given b is equal to P of a if a is independent of b. So that says that the probability of a doesn't depend on what's going on with b. It's the same either way. So it's independent. b doesn't matter. So what does that look like if we try to do an intuitionist diagram? Well, let's see. Here's a. Here's b. Now, the probability of a given b-- let's see, have I got this right? The probability of a given b is the probability of the stuff in the intersection-- so that's both ways-- divided by the probability of the stuff in b, which is going this way. And the probability of a, not conditioned on anything except being in this universe, is all these hash marks, like so, divided by the universe. So when we say that something's independent, it means that those two ratios are the same. That's all it means from the intuitionist's point of view. So it says that this little area here divided by this whole area is the same as this whole area for a divided by the size of the universe. So that's what independence means.

Now, that's quite a lot of work. But we're not done with independence, because we've got conditional independence to deal with. And that, too, can be viewed as a definition. And what we're going to say is that the probability of a given b and z is equal to the probability of a given z. What does that mean? It means that if you know that we're dealing with z, then the probability of a doesn't depend on b. b doesn't matter anymore once you're restricted to being in z. So you can look at it this way. Here's a, and here's b, and here is z. What we're saying is that we're restricting the world to being in this part of the universe where z is. So the probability of a given b and z is this piece in here-- the part of a that's inside both b and z, divided by the part of b that's inside z. And the probability of a given z is this part here divided by all of z. So we're saying that the ratio of this little piece here to this part, which I'll mark that way, is the same as the ratio of that to that.
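In symbols, the two definitions just illustrated with the circle diagrams:

```latex
\begin{aligned}
  &P(a \mid b) = P(a) && \text{($a$ independent of $b$)} \\
  &P(a \mid b, z) = P(a \mid z) && \text{($a$ conditionally independent of $b$, given $z$)}
\end{aligned}
```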
So that's conditional independence. And you can infer from these things, with a little bit of algebra, that P of a and b given z is equal to P of a given z times P of b given z. Boy, that's been quite a journey, but we got all the way through one, two, three, four, and five.

And now the next thing is belief nets, and I'm going to ask you to forget everything I've said for a minute or two. And we'll come back to it. I want to talk about the dog and the burglar and the raccoon again. And now, forgetting about probability, I can say, look, the dog barks if a raccoon shows up. The dog barks if a burglar shows up. A burglar doesn't show up because the dog is barking. A raccoon doesn't show up because the dog is barking. So the causality flows from the burglar and the raccoon to the barking. So we can make a diagram of that. And our diagram will look like this. Here is the burglar, and here is the raccoon. And these have causal relations to the dog barking. So that's an interesting idea, because now I can say that-- well, I can't say anything yet, because I want to add a little more complexity to it. I'm going to add two more variables. You might call the police, depending on how vigorously the dog is barking, I guess. And the raccoon has a propensity for knocking over the trash can. So now, I've got five variables. How big a joint probability table am I going to need to keep my tallies straight? Well, it'll be 2 to the 5th. That's 32.

But what I'm going to say is that this diagram is a statement that every node in it depends on its parents and nothing else that's not a descendant. Now, I need to say that about 50 times, because you've got to say it right. Every node there is independent of every non-descendant other than its parents. No, that's not quite right. Given its parents, every node is independent of all other non-descendants. Well, what does that mean? Here's the deal with calling the police. Here's its one and only parent. So given this parent, the probability that we're going to call the police doesn't depend on anything like B, R, or T, because all of the causality is flowing through this dog barking. I'm not going to call the police in a way that's dependent on anything other than whether the dog is barking or not. Because this guy has this as a parent, and these are not descendants of calling the police, this is independent of B, R, and T.

So let's walk through the others. Here's the dog. The dog's parents are the burglar appearing and the raccoon appearing. So the probability that the dog barks is independent of that trash can over there, because that's not a descendant. It is dependent on these parents. How about the trash can? It depends only on the raccoon. It doesn't depend on any other non-descendant, so therefore, it doesn't depend on D, B, or P. How about B? It has no parents. So given its parents, it's independent of R and T, because they're not descendants. It's interesting that B might depend on D and P, because those are descendants. So it's important to understand that there's this business of being independent, given the parents, of all other non-descendants. And you'll see why that funny, strange language is important in a minute. But now, let's see-- I want to make a model of what's going to happen here. So let me see what kind of probabilities I'm going to have to figure out. This guy doesn't depend on anything upstream.
So we could just say that all we need there is the probability that a burglar is going to appear. Let's say it's a fairly high-crime neighborhood-- 1 chance in 10. 1 day in 10, a burglar appears. The raccoon doesn't depend on anything other than its own propensity, so its probability, we'll say, is 0.5. Raccoons love the place, so one shows up about 1 day in 2. So what about the dog barking? That depends on whether there's a burglar, and the other parent is whether there's a raccoon. So we need to keep track of the probability that the dog will bark for all four combinations. So this will be the burglar, and this will be the raccoon. This will be false, false, true, true-- oops-- false, false, true, false, false, true, true, true. So let's say it's a wonderful dog, and it always barks if there's a burglar. So that would say that the probability here is 1.0, and the probability here is 1.0. And if there's neither a burglar nor a raccoon, the dog still likes to bark just for fun. So we'll say that's a chance of 1 in 10. And then for the remaining case-- there's no burglar, but there is a raccoon-- he's tired of the raccoons, so he only barks half the time.

Do these numbers, by the way, have to add up to 1? They clearly don't. What adds up to 1 is this: this column is the probability that the dog barks, and the other, phantom probability-- that he doesn't-- is out here. And those pairs have to add up to 1. So that would be 0.9, that would be 0.0, that would be 0.5, and this would be 0.0. Because those are just 1 minus the numbers in these columns, I don't bother to write them down.

Well, we still have a couple more things to do. The probability that we'll call the police depends only on the dog. So we'll have a column for the dog, and then we'll have a probability of calling the police. There's a probability for the dog being false and a probability for it being true. If the dog doesn't bark, there's really hardly any chance we'll call the police. So make that 0.001. If the dog is barking-- if he barks vigorously enough-- maybe 1 chance in 10. Then we have the trash can-- the final thing we have to think about. There's the trash can; rather, the raccoon. And here's the trash can probability. It depends on the raccoon being either present or not present. If the raccoon is not present, the probability the trash can is knocked over by, say, the wind is 1 in 1,000. If the raccoon is there, oh man, that guy always likes to go in there, so that's 0.8.

So now I'm done specifying this model. And the question is, how many numbers did I have to specify? Well, let's see. I have to specify that one, that one, that one, that one, that one, that one-- that's 6, 7, 8, 9, 10. So I had to specify 10 numbers. If I'd just tried to build myself a joint probability table straightaway, how many numbers would I have had to supply? Well, it's 2 to the n. So it's 2 to the 5th-- that's 32. A considerable saving. By the way, how do you suppose I made that table? Not by doing all those numbers-- by making this belief network and then using the belief network to calculate those numbers. And that's why this is a miracle, because with these numbers, I can calculate those numbers instead of making them up or making a whole lot of tally-type measurements. So I'd like to make sure that that's true. And I can use this stuff here to calculate the full joint probability table. So here's how this works.
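In symbols, what is about to happen: with the variables ordered so that nothing is conditioned on a descendant, the chain rule collapses into a product over parents-- for this network, exactly the ten numbers specified above:

```latex
P(p, d, b, t, r) = P(p \mid d)\,P(d \mid b, r)\,P(b)\,P(t \mid r)\,P(r),
\qquad\text{in general}\qquad
P(x_1, \ldots, x_n) = \prod_{i} P\big(x_i \mid \mathrm{parents}(x_i)\big).
```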
I have the probability of some combination-- let's say the police, the dog, the burglar, the trash can, and the raccoon. All the combinations that are possible there will give me an entry in the table-- one row. But let's see-- there's some miracle here. Oh, the chain rule. Let's use the chain rule. We can write that as the probability that we call the police given d, b, t, and r. And then the next one in my chain is the probability of d given b, t, and r. Then the next one in the chain is the probability of b given t and r. And the next one in my chain is P of t given r. And the final one in my chain is P of r.

Now, we have some conditional independence knowledge, too, don't we? We know that this probability here depends only on d, because given that parent, none of the others is a descendant. So therefore, we don't have to think about those, and all the numbers we need here are produced by this table. How about this one here? The probability that the dog barks depends only on its parents, b and r, so it doesn't depend on t. So b, in turn, depends on-- what does it depend on? It doesn't depend on anything. So we can scratch those. The probability of t given r-- yeah, there's a probability there, but we can get that from the table. And finally, P of r. So that's why I went through all that probability junk: because if we arrange things in this expansion, from bottom to top, then none of these guys depends on a descendant in this formula, and each one has a limited number of things it depends on above it. So that's the way we can calculate back the full joint probability table.

And that brings us to the end of the discussion today. But the thing we're going to think about is, how much saving do we really get out of this? In this particular case, we only had to devise 10 numbers instead of 32. What if we had 10 properties or 100 properties? How much saving would we get then? That's what we'll take up next time, after the quiz on Wednesday.
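As a sanity check on that claim, here is a minimal sketch in Python that rebuilds all 2 to the 5th, or 32, rows of the joint table from the ten numbers in the belief net. The CPT values are the ones given in the lecture; the variable names and the `bern` helper are mine.

```python
from itertools import product

# The ten numbers specified in the lecture's belief net.
P_burglar = 0.1                       # P(B)
P_raccoon = 0.5                       # P(R)
P_dog = {(False, False): 0.1,         # P(D | B, R)
         (False, True):  0.5,
         (True,  False): 1.0,
         (True,  True):  1.0}
P_police = {False: 0.001, True: 0.1}  # P(P | D)
P_trash  = {False: 0.001, True: 0.8}  # P(T | R)

def bern(p_true, value):
    """Probability that a binary variable with P(True) = p_true takes `value`."""
    return p_true if value else 1.0 - p_true

# Chain rule, ordered so each factor is conditioned only on parents:
# P(p, d, b, t, r) = P(p | d) P(d | b, r) P(b) P(t | r) P(r)
joint = {}
for b, r, d, p, t in product([False, True], repeat=5):
    joint[(b, r, d, p, t)] = (bern(P_burglar, b) * bern(P_raccoon, r) *
                              bern(P_dog[(b, r)], d) *
                              bern(P_police[d], p) *
                              bern(P_trash[r], t))

assert abs(sum(joint.values()) - 1.0) < 1e-9  # a proper joint distribution
print(len(joint), "rows reconstructed from 10 specified numbers")
```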