Lecture 5: Conditioning Continued, Law of Total Probability | Statistics 110

So last time we proved a lot of theorems, right? Bayes' rule, at least n factorial plus 3 theorems or something like that, for any n. So that was a very productive day, and I wanna continue with conditional probability. Thinking conditionally: we did Bayes' rule, but I want to do some examples of conditional probability and some more material on conditional probability. So basically the topic for today is not just probability, it's thinking. Probability is how to think about uncertainty and randomness, right? That's the topic for this entire course. So this is not just a statistics course, this is a thinking course. The math we were doing last time is extremely easy, right? I multiply both sides by something and then there's our theorem. So it looked really easy, but to actually think about it and how to apply it is not always easy; in fact, it's often subtle. So I wanna do some examples and a few more theorems along those lines. I like to say that thinking conditionally is one of the biggest themes in this whole course: using conditional probability, conditional thinking. It is a condition for thinking. That is, you can't really think clearly except under the condition that you understand how to think conditionally. Okay, so that's kind of a general statement.

Now, here's a general way to solve a problem. This is also a course in problem solving, okay? So I wanna just say, in general, how do you solve a problem? There are different strategies for solving problems. One strategy that we already talked a little bit about is to try simple and extreme cases; that is extremely useful in a wide variety of problems. I did my undergrad at Caltech, and at Caltech everyone's hero is Richard Feynman, who was one of the greatest physicists of the 20th century. And people like to say that Feynman had an algorithm, a general problem-solving algorithm that Feynman used. Does anyone know the Feynman algorithm? Okay, the Feynman algorithm is: write down the problem, think hard about it, write down the solution. So that worked really well for Richard Feynman, but it doesn't work for anyone else that I know, okay? So we need something better. Simple and extreme cases is one strategy, and it's one we'll be using over and over again in the course.

A second strategy that we'll be using over and over again, one that's useful in statistics, in computer science, in math, in econ, useful all over the place, is to try to break the problem up into simpler pieces. If you have a problem that seems too difficult and complicated, try to decompose it into smaller pieces, then try to solve the smaller pieces; it's recursive, right? If the smaller pieces are still too difficult, break the smaller pieces into simpler problems. So you have more problems, but each problem is easier, and hopefully you'll eventually reach a point where you can solve each of those problems and put the pieces back together. So that's a very, very general strategy for solving problems: break up the problem into simpler pieces.

Okay, so that's just a general strategy, but let me write down what it means in the context of what we're doing right now. Let's draw a picture. Here's one of our Venn diagrams; here's the sample space S. And suppose we want to find the probability of B, which is a blob, B for blob.
So suppose that our problem is still generic, but less generic than before: we have this complicated-looking blob B, and we want to find its probability. We don't know how to do it directly because it's this complicated blob. So instead of giving up, what we do is break B up into pieces, find the probability of each piece, and add them up. That's a very simple idea, right? Just break it up into pieces and add up the pieces. It's a very, very powerful idea.

So we're going to break this up into pieces. Let's say A1 is this rectangle, then A2, A3, and here's A4. A4 doesn't actually intersect B; that's fine. We're gonna let A1 through An be a partition of S. The word partition just means that these sets, which I drew as rectangles, are disjoint and their union is all of S. So we're just chopping up the space into disjoint pieces. They don't have to look like rectangles; chop it up however you want, as long as the pieces are disjoint and their union is the whole space. That's called a partition of S.

Then I don't need to do any kind of calculation; I can immediately write down what the decomposition says. Just by the second axiom of probability, if you partition a set and you want its probability, you can add up all the pieces. So I can write

P(B) = P(B ∩ A1) + ... + P(B ∩ An).

I don't need to write a proof for this; it's immediate from the axioms of probability, because I broke B up into disjoint pieces. And then remember that long list of theorems last time that all followed immediately from the definition of conditional probability. So here's another way to write this: for P(B ∩ A1), I can either write P(A1 | B)P(B) or the other way around. Let's do it the other way, P(B | A1)P(A1); that's what we did last time, so this is a quick review. If I want the probability of this and this, I can take one thing, and then the other thing given the first thing, and I can do it in either order. That's why I said we had n factorial theorems, because you could do it in any order. Doing that for all the pieces gives

P(B) = P(B | A1)P(A1) + ... + P(B | An)P(An).

This equation is called the law of total probability. That's the name for it, but I prefer to just think of it as breaking a problem up into simpler pieces. The proof is already written here; it's immediate, okay? It's not like we have to spend 20 minutes trying to prove it.

Now, whether this is useful or not depends on how well you chose the partition, okay? So statistics is both a science and an art, and it takes a lot of practice. If I had chosen the partition badly, then where I had one problem, I've now multiplied it into n problems, and it could be that each of those n problems is just as hard as or harder than what I started with. That would be a nightmare. But for a lot of examples that we'll see later in the course, and that you'll see in section and on the homework, P(B) is complicated but each of these pieces is really easy. So we split it up into easy problems. That's what we're looking for; you just need experience with it. The more problems you do, the better you'll get at guessing what would be a useful partition and what would be a useless one, okay? So that's the general idea of why we condition.
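To see the law of total probability in action on a tiny concrete case, here's a minimal sketch in Python (my own illustration, not from the lecture): partition the sample space of a fair die roll and check that summing P(B | Ai)P(Ai) over the partition recovers P(B).

```python
from fractions import Fraction

# Sample space: one roll of a fair die, each outcome with probability 1/6.
S = {1, 2, 3, 4, 5, 6}
P = {s: Fraction(1, 6) for s in S}

def prob(event):
    """P of an event, i.e., of a subset of the sample space."""
    return sum(P[s] for s in event)

B = {2, 3, 5}                       # B: the roll is prime
partition = [{1, 2, 3}, {4, 5, 6}]  # A1, A2: a partition of S

# Law of total probability: P(B) = sum_i P(B | A_i) * P(A_i)
lotp = sum(prob(B & A) / prob(A) * prob(A) for A in partition)

assert lotp == prob(B) == Fraction(1, 2)
print(lotp)  # 1/2
```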
Basically, there are two main reasons conditional probability is very important. One is that it's important in its own right, because like I was talking about last time, conditional probability says: we get some evidence, how do we update our probabilities based on the evidence? That's a very general, important problem. The second reason is that even if we want an unconditional probability, like P(B) here, a lot of times we still need to use conditional probability to break it up into simpler pieces.

All right, so let's do some examples. I'll start with one that seems really simple, but actually is kind of surprising, I think. Suppose we get a random 2-card hand from a standard 52-card deck, and let's compute two different conditional probabilities, okay? So suppose we're given that this hand has an ace, and we want the probability that both cards are aces. Let's do two things. First, let's find the probability that both cards are aces, given that we have an ace; I'm just gonna write this in words ("have ace"), though we could also define notation for the events. If you write "have ace" mathematically, it would be a union. We don't necessarily have an ordering of a first card and a second card, but we could imagine we got dealt one card first and then a second card; then this would be the union of the event that the first card is an ace and the event that the second card is an ace. That is, we have at least one ace. All right, so that's one problem; it seems like a pretty simple problem, but there's actually a lot going on here. The second problem is to find the probability that both cards are aces, given that we have the ace of spades. Okay, so that's what we wanna do.

Let's do the first one first. The probability of both aces, given that we have an ace: this is just practice with the definition of conditional probability. The probability of A given B is the probability of A intersect B, divided by the probability of B. In this case, I'll write it out and then simplify. The numerator is the probability that both cards are aces and we have an ace. But if you're already told that both cards are aces, it's redundant to say we have an ace; the intersection of those two events is just the event that both cards are aces. So the "have an ace" part is redundant and I can cross it out. Divide by the probability that we have an ace. This is a quick review of the naive definition, and we can use the naive definition here cuz we're assuming all two-card hands are equally likely.

The probability that both cards are aces: you can choose to do this with order or without order, but let's do it without order, cuz I don't really care about the order of the hand. If the hand consists of two aces, there are 4 choose 2 possibilities, right, choose two of the four aces; and the denominator is 52 choose 2, cuz we're just picking two cards out of 52, naive definition. For the denominator of the conditional probability, the probability that we have an ace, there's more than one way to do it. We could break it up into cases: either we have two aces, and we just found that, or we have one ace and one non-ace. Those are two disjoint cases; we could add them up.
I think it's a little bit easier to use the complement, but you'll get the same thing either way if you do it correctly. With the complement, we're saying it's 1 minus the probability that neither card is an ace. The probability that neither card is an ace: well, there are 48 non-aces in the deck, 52 - 4, and we can choose any two of them, divided by 52 choose 2. So

P(both aces | have ace) = [(4 choose 2)/(52 choose 2)] / [1 - (48 choose 2)/(52 choose 2)],

and if you simplify this you get 1/33. So about a 3% chance of this happening.

Okay, now let's do the second problem: what's the probability that both cards are aces, given that we have the ace of spades? Again, there's more than one way to do this. We could plug into the definition: the numerator would be the probability that we have the ace of spades and some other ace, and then you could work out the denominator. For practice it's good to see that you get the same answer either way, but I'd rather think of this more directly. What does it mean? We have two cards here, and we learn that we have the ace of spades; I'll abbreviate it AS. Of course, if we want, we can put the ace of spades on the left and the other card on the right; that's how I'm holding these. I have the ace of spades here, and the mystery card here, okay? The second card could be any card other than the ace of spades, and by symmetry, it's equally likely to be any of the 51 other cards in the deck. There's no reason it's more likely to be the jack of diamonds than the ace of clubs, right? Completely symmetric. Therefore, we can immediately write down the answer: 3/51, because if this card is any of the other three aces, then we have both cards aces, and otherwise we don't. So it's immediately 3/51 by symmetry. You could write something more analogous to the first calculation, and if you do it correctly you'll get the same thing, but this is simpler.

Now 3/51 simplifies to 1/17, and 1/17 is basically double 1/33; if the first answer were 1/34 it would be exactly double. So it's about twice as likely. Let's think about that for a second. Does that make sense? In the first problem we have an ace and we haven't specified its suit. Here, we specified that it's the ace of spades, and suddenly our probability almost doubled. Now, if I had said ace of hearts instead, that's not gonna change anything; it's still gonna be 1/17. Same for the ace of clubs or the ace of diamonds; the problem is symmetric. It doesn't actually matter that it's the ace of spades; we don't care what suit it is. And yet in the first problem, where we didn't specify the suit, it's 1/33. So I'm gonna let you think about this for a while; this is a problem you could spend hours and hours trying to build intuition about. I'm mainly doing it for practice with the definition, and to show you that even something that sounds like a very simple problem can have something very surprising going on. Conditional probability can be very, very subtle.
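If you want to convince yourself numerically, here's a quick Monte Carlo sketch (my own addition, not part of the lecture): deal many random two-card hands and estimate both conditional probabilities by counting.

```python
import random

RANKS = "A23456789TJQK"
SUITS = "shdc"
DECK = [r + s for r in RANKS for s in SUITS]  # 52 cards; "As" is the ace of spades

def estimate(trials=10**6, seed=0):
    rng = random.Random(seed)
    have_ace = both_and_ace = 0
    have_as = both_and_as = 0
    for _ in range(trials):
        hand = rng.sample(DECK, 2)
        n_aces = sum(card[0] == "A" for card in hand)
        if n_aces >= 1:           # condition: have an ace
            have_ace += 1
            both_and_ace += (n_aces == 2)
        if "As" in hand:          # condition: have the ace of spades
            have_as += 1
            both_and_as += (n_aces == 2)
    return both_and_ace / have_ace, both_and_as / have_as

p1, p2 = estimate()
print(f"P(both aces | have ace)      ~ {p1:.4f}  (exact: 1/33 ~ 0.0303)")
print(f"P(both aces | ace of spades) ~ {p2:.4f}  (exact: 1/17 ~ 0.0588)")
```

Both estimates land right on the exact answers derived above.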
One hint for thinking about it: here we're saying we have an ace, which means at least one ace. Now, we could say we have at least one ace of spades, but saying there's at least one ace of spades is the same as saying we have the ace of spades, cuz there's only one ace of spades. So in the second problem I can specifically say: here's the ace of spades, and here's something else. In the first problem it's more complicated; I can't draw the same picture, because I'm just saying there's at least one ace. So there's a real difference between "at least one" and "a specific one". All right, so that's just a fun problem to think about with conditional probability.

Let's do another example. This kind of thing actually is important in gambling, but let's do one that's important in daily life: testing for a disease. This is a problem that comes up every day, everywhere in the world, in a medical context. So suppose that a patient comes in and is getting tested for a certain disease. This is a good problem for illustrating that you have to think very carefully about what you're conditioning on and what your goal is. And that's a hint for the homework as well: try to clearly specify what you're trying to find. Make up some clear notation and then say very, very explicitly, our goal is to find P of what given what, that kind of thing.

Okay, so a patient gets tested for a disease. Suppose that this particular disease afflicts 1% of the population, or maybe we wanna say 1% of people who are demographically similar to this patient, same age and so on. 1% of similar patients have the disease; that's the assumption. And suppose that the patient tests positive; that's the result. Even though testing positive sounds good, it's actually bad: testing positive means the test is asserting that the patient has the disease. Now, the test could be wrong, right? But if the test is correct, it means the patient has the disease, okay? So that's what a positive test means.

So far I haven't said anything about how reliable the test is. Some diseases are easy to detect, some are hard to detect, and some tests work better than others. So we need an assumption there. Suppose the test is advertised as being "95% accurate". You might see some marketing from the people who manufacture this particular test saying it's 95% accurate. What does that actually mean? I'm putting it in quotes because it's not yet precise enough to actually do anything with; there's more than one way to interpret that phrase. It's ambiguous right now, so to be able to solve the problem, I'm gonna make a specific assumption about what it means.

Now we start needing notation for the different events. Let D be the event that the patient has the disease, D for disease. When you're defining events, try to write them out as carefully as possible. It's tempting to just write "D = disease", but that's confusing: disease is not an event. The event is that the patient has the disease. Now, if it's very obvious from context what you mean, that's fine, but a lot of times, if you just write down "B is blood, D is disease" and so on, those aren't really events.
It's gonna be very confusing for anyone to understand what you're doing, okay? So D is the event that this specific patient has the disease. And let T be the event that the patient tests positive. I would normally wanna use P for positive, but it's very confusing to use P both for probability and for positive, so don't call your events P, okay? Those are the two events we need; I don't think we need any more notation at this point.

So let's suppose that "95% accurate" means

P(T | D) = 0.95 = P(T^c | D^c).

That's an assumption, an interpretation of "95% accurate". What it says is: if the patient has the disease, then 95% of the time the test will correctly report positive; and if the patient does not have the disease, then 95% of the time the test will correctly report negative. So conditional on whether or not the patient has the disease, the test gives the correct answer 95% of the time. Does that make sense? So that's the assumption.

But that's not actually what we want to know. What do you think the patient cares about? The patient wants to know whether he or she has the disease, right? So what the patient cares about is not P(T | D), it's P(D | T). That's our goal: find P(D | T). One of the most common mistakes in statistics as applied in real life is confusing P(T | D) with P(D | T), although they're completely different concepts. Luckily, we know how they're connected; we don't just say this is different from that, we actually know how they're related. They're related by Bayes' rule, which we proved in one line on Wednesday, or Monday. Bayes' rule says

P(D | T) = P(T | D) P(D) / P(T).

We already know P(T | D), and we know P(D); that's just from the population, 0.01. The only thing left is P(T). We don't yet know P(T), so here's a little trick. Sometimes books won't state Bayes' rule this way; they'll do something more complicated with a sum in the denominator. But I don't consider that Bayes' rule; this is Bayes' rule. Often the denominator is the tricky part, and for the denominator, that's when we use the law of total probability. It's very common to use Bayes' rule and the law of total probability in tandem. If we expand the denominator using the law of total probability, it's just

P(T) = P(T | D) P(D) + P(T | D^c) P(D^c).

Our partition is a very obvious one: either the patient has the disease or does not, so we break into those two cases. It's hard to immediately see what P(T) is, but it's easy as soon as we break into the two cases. It's kind of neat also that the first term of the denominator, written this way, is the same as the numerator, plus the corresponding term for D complement; it has a nice structure to it. At this point we know all of these numbers and can just plug them in. I guess I didn't mention, let's see, what's P(T | D^c)?
Well, we know P(T^c | D^c) = 0.95, so P(T | D^c) = 0.05, right? It's just the complement, conditionally. If you plug all that in, you get approximately 0.16. So even though the test is supposedly 95% accurate in this sense, there's only about a 16% chance that the patient has the disease. That seems surprising, at first, both to most patients and to most doctors. In fact, there was a study done at Harvard where they asked something like 60 doctors a question very similar to this, basically asking them to guess what this number should be. Something like 80% of the doctors said numbers like 95%, very, very high numbers, and didn't realize it was so small.

So what's the moral of the story? One thing is that you should get a second opinion, right, do another test. And there are some subtleties there too, because maybe the second test is not independent of the first: if something caused the test to be wrong the first time, maybe the same thing would happen again. So it'd be a good idea to get a different kind of test. Secondly, what's really going on here is that most people's intuition on this problem is completely wrong. Maybe they guess 95%, or maybe they conservatively lower it to 70% or 80%, but hardly anyone will say 16%. Why is people's intuition so completely wrong? I think the reason is that they focus on the 95% part of the problem, but an equally important part is the 1%, and that's what gets ignored. There's a tradeoff here: it's fairly rare that the test is wrong, but it's also fairly rare that someone has the disease. So there's a competition between how rare the disease is versus how rarely the test is wrong. For some psychological reason, most people focus on the part about the test being wrong, and don't focus on the fact that the disease itself is rare.

Here's another quick intuition for this problem. Remember we talked about the frequentist world last time, where we repeat the same experiment over and over again? Suppose we didn't just do this with one patient; suppose we had 1000 patients. I'm not going to write this down, I'm just giving you a quick intuition, speaking roughly. Of 1000 patients, about 10 will have the disease, right, that's 1%. Maybe not exactly 10, but roughly. Let's suppose that for those 10, the test is correct every time, so all 10 test positive. Now what about the other 990 people, who do not have the disease? They all get tested, and about 5% of them are gonna test positive: roughly 50 people. So about 50 people test positive who don't have the disease, and about 10 people test positive who do have the disease. That's a ratio of 5 to 1, and 0.16 is about one-sixth: of the roughly 60 positives, only about 10 actually have the disease. So that's what's going on.
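Here's the whole calculation as a short sketch (the code is my addition; the numbers are the ones from the lecture): Bayes' rule on top, the law of total probability in the denominator.

```python
p_d = 0.01           # prior P(D): 1% of similar patients have the disease
p_t_given_d = 0.95   # P(T | D): positive for 95% of diseased patients
p_t_given_dc = 0.05  # P(T | D^c) = 1 - P(T^c | D^c) = 1 - 0.95

# Law of total probability: P(T) = P(T|D)P(D) + P(T|D^c)P(D^c)
p_t = p_t_given_d * p_d + p_t_given_dc * (1 - p_d)

# Bayes' rule: P(D|T) = P(T|D)P(D) / P(T)
p_d_given_t = p_t_given_d * p_d / p_t
print(f"P(D | T) = {p_d_given_t:.4f}")  # ~0.1610, about 16%
```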
Question? >> [INAUDIBLE] >> Yeah, so that is an extremely important question. In case anyone couldn't hear, the question is: won't the prior usually be higher, because the patient who's getting tested maybe came in because they're having symptoms and things like that? So the calculation would completely change if the patient already has information, like certain symptoms that are consistent with the disease. The calculation would change, but the principle is the same. Maybe it's initially 1% if you don't have any evidence; as you start getting symptoms, that's evidence, and you update your probabilities. Maybe the 1% would change to something higher, and that would change the numbers around. But the principle is the same: you get evidence and you update your probabilities using Bayes' rule.

And here's one really crucial and beautiful fact about Bayes' rule: it has a certain coherency property. I work out the math of this in one of the strategic practice problems, but let me just tell you the intuition right now. Suppose that you get two pieces of evidence, not just one; say you're investigating a crime, and there are two possibilities. Either you get two clues and you update using Bayes' rule with the intersection of those two events, okay? Or you get one clue, then you go off and get lunch and take a break, come back after lunch, and get another clue. So suppose you updated your probabilities using the first clue, and then you come back later, get the second clue, and update again. You'll get the same thing. You can update in more than one step or all at once, in either order; it's always completely consistent. So Bayes' rule is coherent in the sense that whatever evidence you get, one piece at a time or all at once or some combination, the end result is the probability of the thing you're interested in given all the evidence you have. So that's really nice.
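Here's a little sketch of that coherency property with made-up numbers (my own illustration, not the strategic practice problem itself): the two clues are assumed conditionally independent given the hypothesis and given its complement, and updating sequentially gives exactly the same posterior as updating on both clues at once.

```python
from fractions import Fraction

prior = Fraction(1, 100)  # hypothetical prior P(H) for some hypothesis H
# Hypothetical likelihood pairs (P(E_i | H), P(E_i | H^c)) for two clues,
# assumed conditionally independent given H and given H^c.
clue1 = (Fraction(9, 10), Fraction(2, 10))
clue2 = (Fraction(8, 10), Fraction(3, 10))

def update(p_h, p_e_h, p_e_hc):
    """One Bayes step: P(H|E) = P(E|H)P(H) / [P(E|H)P(H) + P(E|H^c)P(H^c)]."""
    num = p_e_h * p_h
    return num / (num + p_e_hc * (1 - p_h))

# Sequentially: update on clue 1, then use that posterior as the prior for clue 2.
sequential = update(update(prior, *clue1), *clue2)
# All at once: under conditional independence, the likelihoods multiply.
batch = update(prior, clue1[0] * clue2[0], clue1[1] * clue2[1])

assert sequential == batch  # exact equality, since we used fractions
print(sequential)
```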
So, just a couple of quick warnings about common mistakes with conditional probability, and then we can do another example. I call these biohazards; they're basically just common mistakes, but they're hazardous to your health if you make them, so you have to be really careful.

Biohazard 1 is the one we just talked about: confusing P(A|B) with P(B|A). We know that Bayes' rule is how you connect these two, but they're not the same thing. Sometimes this is called the prosecutor's fallacy, and if you wanna find examples, you can just search online for "prosecutor's fallacy". That's kind of unfair to prosecutors, because defense attorneys make the same mistake, doctors make the same mistake, people make it all the time in everyday life. The reason it's called the prosecutor's fallacy is that it comes up a lot in criminal cases; maybe sometimes the prosecutor does it deliberately, and maybe sometimes it's cuz they don't know this stuff. If you're deciding a criminal case, what you care about is the probability that the defendant is guilty given all the evidence, right? The mistake is focusing entirely on the probability of the evidence given innocence, when what you want is the probability of innocence given the evidence; those two things get confused.

Let me mention one legal example, which is a very sad true story: the Sally Clark case. This is an extreme example, but there are many other legal cases that are similar in flavor, in varying degrees, similar in spirit to this one. Sally Clark was a British woman whose two babies both died for mysterious, unexplained reasons. They were calling it SIDS, Sudden Infant Death Syndrome, which basically just meant the two babies died and there was no explanation. She was convicted of murdering her two children, and here's basically the total evidence that was put forward against her. The prosecution got some so-called expert to testify, and that expert said: assuming she's innocent, the probability of a baby just spontaneously dying for no apparent reason is 1 in 8500. I don't know where he got that number from, but let's just accept it. So there's supposedly only a 1 in 8500 chance that a baby just dies mysteriously, assuming it's not murdered. But she had two babies and they both died, which is obviously a terrible tragedy for her. So he multiplied: 1/8500 times 1/8500, which is about 1 in 73 million.

The first thing that's wrong with this ties in with what we were doing last time, about independence: it assumes independence, which is very questionable here. Maybe there's some genetic factor that caused the first baby to die, and the second baby had similar genetic characteristics that led to the same thing. There's absolutely no justification for assuming independence here. But let's assume independence for the sake of argument, and let's assume we accept these numbers, so it's 1 in 73 million. Even then, she was convicted based on this; they were basically saying there's a 1 in 73 million chance that she's innocent, so that's beyond a reasonable doubt, or whatever the British equivalent of that is, and she went to jail. But that's the prosecutor's fallacy, right? The relevant thing we want to compute is P(innocence | evidence), not the other way around. Even if you accept this number, it's P(evidence | innocence). How are those related? By Bayes' rule:

P(innocence | evidence) = P(evidence | innocence) P(innocence) / P(evidence).

I'm not gonna try to do a calculation with that, but notice that if you write it down, you're gonna have a term P(innocence): the prior probability of innocence, the probability of innocence before we have any evidence. Now, there are billions of mothers in the world, and the vast, vast majority of them do not murder their babies. So the prior probability of innocence is extremely close to 1. There's a trade-off between the prior probability and how extreme this number is, and that completely got ignored, and she went to prison. Later it got exposed that this was wrong, and the conviction was overturned, but by then she'd spent years in prison. And I think she died shortly after being released, just from the misery of it; it's an unimaginable trauma. So that was one extreme case, but there are other cases like that.

In fact, I brought two books today. The first one is called Statistical Science in the Courtroom; this is a really, really good book.
The other one is called Statistics for Lawyers; this is also a really, really good book. I'm not gonna talk much about them, but if any of you are interested, you can come look at them after class. There are a lot of important connections between statistics and the law. In fact, we have a student in this class who's organizing a group at Harvard to focus on statistics and the law, so if anyone is interested, let me know and I'll put you in touch with him. Okay, so that's called the prosecutor's fallacy, although it's not restricted to prosecutors.

The second biohazard is one I was just leading into when I mentioned the idea of priors, so I wanna tell you what the words prior and posterior mean. The second mistake is confusing P(A), which is called the prior, with P(A|B), which is called the posterior. Prior means before we have evidence; posterior means after we have evidence. One instance of this: sometimes, if the problem says that A occurs, students are very tempted to write P(A) = 1. And I ask, why did you write P(A) = 1? And the student says, because it's given that A occurred, okay? But that's completely wrong and leads to completely wrong arguments. The key thing to keep in mind is that P(A|A) = 1; that's always true. Given that A occurred, the conditional probability of A is 1. But if we just write P(A), we're not conditioning on anything. So when the problem says we observe that A occurred, that means we're interested in computing things given A; it does not mean we just write P(A) = 1. It's P(A|A) that equals 1. When you're writing down conditional probability calculations, you have to be very, very careful about what goes to the left of the bar and what goes to the right of the bar, okay?

One more common mistake with probability is confusing independence with conditional independence. I haven't yet defined the term conditional independence, so I'll do that now, but I'm actually kind of assuming you can already guess what it means; I'll write it down and you can see if it agrees with your guess. This is a more subtle mistake than the first two, but it also comes up a lot in practice and can lead you to completely wrong results. So I want to talk about this distinction: independence versus conditional independence.

First, the definition of conditional independence; it should be kind of an intuitive definition. We say that events A and B are conditionally independent... but we can't just say "conditionally independent" unless it's clear what we're conditioning on. So to say it precisely: A and B are conditionally independent given some other event C, and then the definition should be obvious. Just write down the definition of independence with everything conditional on C:

P(A ∩ B | C) = P(A | C) P(B | C).

So that's the definition; it looks just like independence, except with "given C" everywhere. A natural question, then: does independence imply conditional independence, and does conditional independence imply independence?
So let's talk about those two questions. First of all: does conditional independence, say given C, but given whatever, imply unconditional independence? The answer is no, not in general. If they're conditionally independent, they may or may not be independent. There's an example that I worked out in detail in the strategic practice, so I'll just talk about it briefly; you can read it if you haven't already. That was the chess player example: a chess opponent of unknown strength. This is a very common situation in real life; it doesn't have to be chess, obviously, you can make up your own example, but that's one I like since I like chess.

So you're playing chess with someone you've never met before. You don't know that person's rating or any information at all, so you have no idea how good that person is. Suppose you play a series of games with that same person. Now, it's possible that after you win the first few, that person gets demoralized and collapses and starts playing badly, or it's possible they get mad and think harder and wanna take revenge, things like that. Ignore all that, cuz I'm just constructing an example. Let's assume that conditional on how strong a player that person is, all the games are independent. That's reasonable. But that does not imply that the games are unconditionally independent, if you don't condition on how strong that person is. Because if you think about it: suppose you win the first five games or something. At that point you'd be pretty confident that you're a better chess player than that person, right? So the earlier games give you evidence that helps you assess the strength of your opponent, and that evidence is relevant for predicting the later outcomes. Independence would mean the earlier games give you no information whatsoever for predicting the outcomes of the later games, okay? But actually the earlier games give you a good sense of how strong that person is. That's the distinction.

So it may happen that the game outcomes are conditionally independent given the strength of your opponent, but not independent unconditionally. It would be good practice for you to try to come up with your own examples. This is a fairly common structure: there's something unknown, and if we knew that thing, it might be reasonable to assume independence conditionally; but without knowing that thing, the earlier trials give us data that we can use, okay? So that's the distinction there, as the sketch below illustrates.
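Here's a simulation sketch of the chess example (the 50/50 strength split and the win probabilities are my own toy numbers, not from the lecture): given the opponent's strength, the games are independent by construction, yet unconditionally, winning game 1 makes winning game 2 much more likely.

```python
import random

def simulate(trials=10**6, seed=0):
    rng = random.Random(seed)
    win1 = win2 = both = 0
    for _ in range(trials):
        # Unknown opponent strength: half the time weak (you win 90% of
        # games), half the time strong (you win only 10%).
        p_win = 0.9 if rng.random() < 0.5 else 0.1
        g1 = rng.random() < p_win  # games independent given the strength
        g2 = rng.random() < p_win
        win1 += g1
        win2 += g2
        both += g1 and g2
    return win1 / trials, win2 / trials, both / trials

p1, p2, p12 = simulate()
print(f"P(win game 1) ~ {p1:.3f}, P(win game 2) ~ {p2:.3f}")
print(f"P(win both)   ~ {p12:.3f}, but P(win1)*P(win2) ~ {p1 * p2:.3f}")
# ~0.41 vs ~0.25: unconditionally, the games are far from independent.
```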
All right, the last thing for today is the reverse question: does independence imply conditional independence? So if A and B are independent without any conditions, are they conditionally independent given C, in general? If you had to guess, how many of you would guess yes? How many would guess no? Well, the answer is no. So let's talk about why. It's not obvious; unconditional independence sounds like the stronger condition, so surely independence should imply conditional independence. But this direction is more subtle, and it's also a very, very common phenomenon in real life, whenever you have something with multiple possible causes.

So, to give a counterexample: suppose the fire alarm goes off. Let A be the event that the alarm goes off; hopefully it won't go off right now, so we can finish this example. And suppose, just for simplicity, that only two things can cause the fire alarm to go off: either F, there actually is a fire, or someone is making popcorn. Again, do not call your events P, so I'll call the popcorn event C, for corn. Suppose that making popcorn is completely independent of there being a fire. That's an assumption, but I'm constructing an example, so I can assume it. So F and C are independent, but either of them will cause the alarm to go off.

Now, the key observation: what's the probability that there's a fire, given that the alarm goes off and no one's making popcorn? According to what I just said, that's 1:

P(F | A ∩ C^c) = 1,

because if the alarm goes off, there are only two explanations, and just like Sherlock Holmes said, if you eliminate all the other explanations, in this case the popcorn explanation, it must be that there's a fire. So F and C are independent, but they're not conditionally independent given that the alarm goes off. Initially they're independent; as soon as the alarm goes off, you wanna explain that, and they become dependent given A.
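As a final sketch (the probabilities of fire and popcorn are my own toy values), here's the alarm example computed by brute-force enumeration: F and C are independent by construction, but once we condition on the alarm, eliminating popcorn forces a fire.

```python
from fractions import Fraction
from itertools import product

p_f, p_c = Fraction(1, 100), Fraction(1, 10)  # toy P(fire), P(popcorn)

def prob(event):
    """Sum the probability of the outcomes (f, c) where the event holds."""
    return sum(
        (p_f if f else 1 - p_f) * (p_c if c else 1 - p_c)
        for f, c in product([True, False], repeat=2)
        if event(f, c)
    )

alarm = lambda f, c: f or c  # the alarm goes off iff fire or popcorn

# Independent by construction: P(F and C) = P(F)P(C).
assert prob(lambda f, c: f and c) == p_f * p_c

# But not conditionally independent given A:
p_f_given_a = prob(lambda f, c: f and alarm(f, c)) / prob(alarm)
p_f_given_a_no_c = (prob(lambda f, c: f and not c and alarm(f, c))
                    / prob(lambda f, c: not c and alarm(f, c)))
print(p_f_given_a)       # 10/109 ~ 0.09: P(F | A)
print(p_f_given_a_no_c)  # exactly 1: P(F | A, no popcorn)
```

All right, so that's all for today.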
Info
Channel: Harvard University
Views: 112,613
Rating: 4.9175825 out of 5
Keywords: harvard, statistics, stat, math, probability, conditional probability, total probability
Id: JzDvVgNDxo8
Length: 50min 1sec (3001 seconds)
Published: Mon Apr 29 2013