So we proved a lot of theorems last time, right? Bayes' rule, at least n factorial plus 3 theorems or something like that, for any n. So that was a very productive day and I wanna continue with
conditional probability. Thinking conditionally, we did Bayes Rule. But I want to do some examples
of conditional probability and some more stuff on
conditional probability. So basically the topic for today is not
just probability, but it's thinking. So probability is how to think about uncertainty and randomness, right? That's the topic for this entire course. So this is not just a statistics course,
this is a thinking course. And you know, the math we were doing last time, the math is extremely easy, right? It's like, multiply both sides by something and then there's our theorem. So it looked really easy, but
to actually think about it and how to apply it is not always easy,
in fact it's often subtle. So I wanna do some examples and
a few more theorems along those lines. So I like to say that thinking conditionally is one of the biggest themes in this whole course: using conditional probability, conditional thinking. It is a condition for thinking. That is, you can't really think clearly except under the condition that you
understand how to think conditionally. Okay, so
that's kind of a general statement. Now, here's a general
way to solve a problem. This is also a course in problem solving,
okay? So I wanna just say, in general,
how do you solve a problem? And there are different strategies for
solving problems, right? Now one strategy that we already talked a little bit about is to try simple and extreme cases. That is extremely useful in a wide variety of problems. So I did my undergrad at Caltech. And at Caltech,
everyone's hero is Richard Feynman, who was one of the greatest
physicists of the 20th century. And people like to say that
Feynman also had an algorithm, a general problem solving
algorithm that Feynman used. Does anyone know the Feynman algorithm? Okay the Feynman algorithm is,
write down the problem, think hard about it,
write down the solution. So that worked really well for Richard Feynman, but it doesn't work for anyone else that I know, okay? So we need better strategies. So
this is one strategy, and this is a strategy that we'll be using
over and over again in the course. A second strategy that we'll be using over and over again, one that's useful in statistics but also in computer science, in math, in econ, all over the place, is to try to break the problem
up into simpler pieces. If you have a problem that
seems too difficult and complicated try to decompose
it into smaller pieces. Try to solve the smaller pieces; it's recursive, right? If the smaller pieces
are still too difficult, then break up the smaller
pieces into simpler problems. So you have more problems,
but each problem is easier. And hopefully eventually you'll reach
a point where you can solve each of those problems, put the pieces back together. So that's a very, very general
strategy for solving problems. So break up problem into simpler pieces. Okay, so that's just a general strategy. But let me write down, what does that mean in the context
of what we're doing right now? Well, let's draw a picture,
here's one of our Venn diagrams, here's the sample space S. And suppose we want to find the probability of B, which is a blob; B for blob. So this is a very general strategy, but suppose that our problem is still generic, just less generic than before: we have this complicated-looking blob, B, and we want to find the probability of B. We don't know how to do it because
it's this complicated blob. So instead of giving up, what we would
do is to break B up into pieces, find the probability of each piece and
add them up. So that's a very simple idea, right? Just break it up into pieces,
add up the pieces. It's a very, very powerful idea. So we are going to break
this up into pieces. Let's say A1 is this rectangle, then A2, A3, and here's A4. A4 doesn't actually intersect B,
that's fine. We're gonna let A1 through
An be a partition of S. The word partition just means that these sets, which here are rectangles, are disjoint and their union is all of S. So we are just chopping up
the space into disjoint pieces. They don't have to look like rectangles. Chop it up however you want,
as long as those pieces are disjoint and their union is the whole space,
that's called a partition of S. So that's a partition of S. Then I don't need to do
any kind of calculation, I can just immediately write down
what the decomposition says. P(B), just by the second axiom of
probability says, if you partition a set, and you want its probability,
then you can just add up all the pieces. That's all we're doing. So I can write P(B) = P(B ∩ A1) + ... + P(B ∩ An). I don't need to write a proof for this. This is just immediate, just
from the axioms of probability, it's immediate, right? Because I broke it up
into disjoint pieces. So that's immediate, and then remember we
have this long list of theorems last time that all just followed immediately from
the definition of conditional probability. So another way to write this would be, remember, to expand P(B ∩ A1): I can either do P(A1 | B)P(B) or the other way round, so let's do P(B | A1)P(A1). That's what we did last time, so that's a quick review. If I want the probability of this and this, I can take one thing, and then the other thing given the first thing, and I can do it in either order. That's why I said we had n factorial theorems, because you could do it in any order. Doing that for all the pieces gives P(B) = P(B | A1)P(A1) + ... + P(B | An)P(An). And this equation here is called
the law of total probability. That's the name for it, but I prefer to just think of it as breaking
up a problem into simpler pieces. So the proof is already just written here. It's immediate, okay? It's not like we have to spend
20 minutes trying to prove this. It's just immediately true. So whether this is useful or not depends on how well you chose this partition, okay? So statistics is both a science and
an art. It takes a lot of practice. That if I had chosen this partition
in certain ways, then I multiplied, here I have only one problem and I've
multiplied into n problems And it could be that each of those n problems is just
as hard or harder than when I started. And that would be a nightmare. But for a lot of examples that
we'll see later on in the course, and you'll see in section
on the homework and so on, this P(B) is complicated but
each of these is really easy. So we split it up into easy problems. So that's what we are looking for,
you just need experience with that. The more problems you do, then the better
you'll get at kind of guessing what would be a useful partition and
what would be a useless partition, okay? So that's just the general idea of why we condition. Basically there's two main reasons
conditional probability is very important. One is that it's important
in its own right, right? Because like I was talking about last
time, conditional probabilities just says we get some evidence, how do we update
our probabilities based on the evidence? So, that's just a very general,
important problem. The second reason it's extremely important
is that even if we wanted an unconditional probability like here
P(B) is unconditional. Still a lot of times we need to use
conditional probability to break it up into simpler pieces, all right.
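By the way, if you like seeing this numerically, here's a minimal sketch of the law of total probability in Python. The two-urn setup and all of the numbers are made up purely for illustration; the point is just that conditioning on the partition, which urn we picked, recovers P(B).

```python
# A made-up two-stage experiment: pick urn 1 with probability 0.6, else urn 2,
# then draw a ball. B = "drew a red ball"; the partition is which urn we picked.
import random

p_A = {1: 0.6, 2: 0.4}                # P(A_i): probability of picking urn i
p_B_given_A = {1: 3 / 10, 2: 8 / 10}  # P(B | A_i): chance of red from urn i

# Law of total probability: P(B) = sum over i of P(B | A_i) P(A_i)
p_B = sum(p_B_given_A[i] * p_A[i] for i in p_A)

# Check against a simulation of the full experiment.
trials = 10**6
hits = 0
for _ in range(trials):
    urn = 1 if random.random() < p_A[1] else 2   # stage 1: pick an urn
    hits += random.random() < p_B_given_A[urn]   # stage 2: draw a ball
print(p_B, hits / trials)  # both should be about 0.5
```

The exact sum and the simulated frequency agree, and that agreement is all the law of total probability is claiming.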
So let's do some examples. I'll start with one that seems really simple, but actually is kind of surprising, I think. Suppose we have two random cards from a standard deck: we get a random 2-card hand out of a 52-card deck. And then let's compute two different
conditional probabilities, okay? So suppose we're given that
this hand has an ace, okay? And we want the probability
that both cards are aces. So let's do two things,
first of all let's find the probability. I'm just gonna write this in words, but
we could also define some notation for the different events, but
I'll just write this in words. We wanna find the probability
that both cards are aces, given that we have an ace, that is,
I'm just gonna write have ace. If you write it mathematically though,
this would be a union, it'll be the union of the first card. We don't necessarily have an ordering
of a first card and a second card, but we could imagine we got dealt one card
first and then got dealt a second card. This would be the union of
the first card being an ace and the second card being an ace. That is we have at least one ace. All right so that's one problem. Seems like a pretty simple problem okay,
but there's actually a lot going on here. And then a second problem is to find
the probability that both cards are aces, given that we have the ace of spades. Okay, so that's what we wanna do. So let's do the first thing first and
then the second thing second. The probability of both aces,
this one, given that we have an ace. This is just practice with the definition
of conditional probability. So the probability of A given B is the probability of A intersect B, divided by the probability of B. Now in this case, I'll write it out but
then I'll simplify it. The numerator: both cards are aces, and we have an ace. But if you're already told that both cards are aces, then it's redundant to say we have an ace. So the intersection of this event and this event is just this event. So that's redundant, so
that we have an ace. This is just a quick review with the naive
definition, which needs the naive definition here, cuz we're assuming
all two-card hands are equally likely. The probability that both cards are aces, well, you can choose to do this problem using order or without order, but let's just do this without order, cuz I don't really care about the order of the hand. If the hand consists of two aces, we're choosing two out of the four aces, so the numerator is 4 choose 2, and the denominator, we know, is 52 choose 2, cuz we're just picking two cards out of 52, naive definition. The denominator of the conditional probability, the probability that we have an ace: there are two ways to do that. Either we could break it up into cases, so there are two cases: either we have two aces, and we just did that, or we have one ace and one non-ace. Those are two disjoint cases,
we could add them up. I think it's a little bit easier
to do the complement, but you'll get the same thing either
way if you do it correctly. If we do the complement, then what we're saying is it's 1 minus the
probability that neither card is an ace. The probability that
neither card is an ace. Well, there are 48 non-aces in a deck, right, 52 minus 4. We can choose any two of them, divided by 52 choose 2. And if you simplify this, you get 1 over 33. So about a 3% chance of this happening, after simplification.
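If you want to verify that arithmetic, here's the same computation done with exact fractions (a sketch added here for checking; the lecture just simplifies by hand):

```python
# Exact check of P(both aces | have an ace) using the naive definition.
from fractions import Fraction
from math import comb

p_both = Fraction(comb(4, 2), comb(52, 2))           # P(both aces) = 6/1326
p_have_ace = 1 - Fraction(comb(48, 2), comb(52, 2))  # 1 - P(no aces)
print(p_both / p_have_ace)  # prints 1/33
```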
Okay, now let's do the second problem: what's the probability that both are aces, given that we have the ace of spades? So, P of both aces,
given that we have the ace of spades. Again, there's more than one
way we can do this problem. We could just plug into the definition. The numerator would be,
what's the probability that we have both, this and this would say we have the ace of
spades and we have some other ace, right? And then you could do the denominator,
and you can work that out. I would prefer to think of
this in a simpler way but for practice it's good to see that
you get the same answer either way. I'd rather think of this more directly. Just what does this mean? We have two cards here,
now I didn't say anything about order or unordered, what we're given,
that is we learn this information, we learned that we have the ace of spades. So here's my two card hand, and we have the ace of spades,
I'll abbreviate it to AS, Ace of Spades. Of course, if we want, we can put the ace of spades on the left and the other card on the right; that's how I'm holding these. I have the ace of spades here, and this one here, okay. And so this card is the mystery card,
right? Now this second card could be any card
other than the ace of spades, right? By symmetry,
this card is equally likely to be any of the 51 cards in the deck
other than the ace of spades. There's no reason that this is more likely
to be the jack of diamonds than the ace of clubs, right? Completely symmetric. Therefore, we can immediately just write
down the answer, 3/51 because if this is any of the other three aces, then we
have both aces and otherwise, we don't. So, it's immediately 3
over 51 by symmetry. You could write something
more analogous to this and if you do it correctly, you'll get
the same thing, but that's simpler. 3 over 51 simplifies to 1/17, right? And 1/17 is basically double 1/33; if this were 1/34 it would be exactly double. It's about twice as likely. So let's think about that for a second. Does that make sense? Here we have an ace and
we haven't specified what suit it is. Here, we specify that it's the Ace of Spades, and suddenly our probability almost doubled. Now, if I had said Ace of Hearts here,
that's not gonna change, right? It's still gonna be one-seventeenth. If I had said Ace of Clubs
here that's not gonna change. If I said Ace of Diamonds
here that's not gonna change. The problem is symmetric. It doesn't actually matter
that it's the Ace of Spades. We don't care what suit it is, right? It could say hearts, clubs,
diamonds, spades here, it's still gonna be one-seventeenth. And yet here, where we didn't specify it,
it's 1 over 33. So I'm gonna let you think
about this problem for a while. This is a problem that
you could spend hours and hours trying to build intuition about. I'm mainly doing it just for
practice with the definition. And to show you that even
something that sounds, this sounds like a very simple problem,
okay? But already something very
surprising is going on here. So conditional probability can be very,
very subtle. One hint for thinking about it is, here we're saying we have an ace
that means at least one, right? Now here, well we could say we have
at least one Ace of Spades, but if we say there's at
least one Ace of Spades, we're just saying we have the Ace of
Spades cuz there's only one Ace of Spades. So here I can specifically say, here I
have the Ace of Spades, something else. Here, it's complicated. I can't draw the same
picture here as there, because I'm just saying
there's at least one ace. So there's a difference here in terms
of talking about at least one or a specific one. All right, so that's just a fun problem to think about with conditional probability.
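Here's a quick simulation sketch (an added check, not something done in lecture) confirming both answers: about 1/33 given that we have at least one ace, and about 1/17 given that we have the ace of spades.

```python
# Simulate random 2-card hands and estimate both conditional probabilities.
import random

deck = [(rank, suit) for rank in range(13) for suit in range(4)]
ACE, SPADES = 0, 0  # label rank 0 as "ace" and suit 0 as "spades"

both_ace = have_ace = both_as = have_as = 0
for _ in range(10**6):
    a, b = random.sample(deck, 2)
    n_aces = (a[0] == ACE) + (b[0] == ACE)
    if n_aces >= 1:                 # condition: at least one ace
        have_ace += 1
        both_ace += (n_aces == 2)
    if (ACE, SPADES) in (a, b):     # condition: have the ace of spades
        have_as += 1
        both_as += (n_aces == 2)

print(both_ace / have_ace, 1 / 33)  # both about 0.0303
print(both_as / have_as, 1 / 17)    # both about 0.0588
```

Okay, let's do another example. This kind of thing actually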
is important in gambling, but let's do one that's
important in daily life. So, suppose that we're testing for
a disease. So this is a problem
that comes up every day everywhere in the world
in a medical context. So suppose that a patient comes in, and
is getting tested for a certain disease. Okay, this is a good problem and
just illustrating. You have to think very carefully
about what you're conditioning on, and what's your goal. And that's a hint for the homework as well: try to clearly
specify what are you trying to find. Make up some clear notation and
then say very, very explicitly, our goal is to find p
of what given what, that kind of thing. Okay, so patient gets tested for disease. And suppose that that particular disease
afflicts 1% of the population, or maybe we wanna say, 1% of people who are demographically similar to this patient, same age, and so on. 1% of similar patients have the disease, okay, that's the assumption. And suppose that the patient tests
positive, that's the result. And even though testing positive
sounds good, that's actually bad. Tests positive means that the test is asserting that
the patient has the disease. Now, the test could be wrong, right? But if the test is correct, then that
means the patient has the disease, okay? So that's what a positive test means. Now, so far I haven't said anything
about how reliable the test is, right? Some diseases are easy to detect, some are hard to detect. And some tests work better
than other tests, right? Okay, so we need some assumption there. And so
suppose that the test is advertised, As being 95% accurate. You might see some marketing
that the people who manufacture this particular test and
they say it's 95% accurate. And what does that actually mean? I"m putting that in quotes
because that's not yet precise enough to actually
do anything with. So we're gonna have to interpret what
does it mean for it to be 95% accurate? There's more than one ways you
can interpret that phrase. That's ambiguous right now. So to be able to solve the problem
I'm gonna make a specific assumption. So suppose, that's an assumption,
that this means, Now we start needing some notation for
the different events. So let's define our events here. Let's let D be the event that
the patient has the disease. Patient has disease, D for disease. Okay, when you're defining events, try to
write it out as carefully as possible. It will be tempting here to
just write D for disease, okay? But that's confusing, right? Disease is not an event. The event is that
the patient has a disease. Now if it's very obvious in the context
what you mean, then that's fine, but a lot of times, if you're just writing
down, B is blood, and D is disease and so on, and they're not really events. It's gonna be very confusing for anyone
to understand what you're doing, okay? So, D is the event that this patient, we're assuming we have a specific
patient that we're talking about. That patient has the disease,
that's event D. And let's let T be the event
that the patient tests positive. I would normally wanna use P for positive,
but it's very confusing to use P for probability and P for positive. So don't call your events P, okay? So I'll just say patient tests positive. Okay, so
those are our two events that we need. I don't think we need any
more notation at this point. So let's suppose that 95% accurate means that P(T given D) = 0.95 = P(T complement given D complement). So that's an assumption; it's one interpretation
of 95% accurate. And what this says,
is if the patient has the disease, then 95% of the time the test will correctly report positive. If the patient does not have the disease, then 95% of the time the test
will correctly report negative. So conditional on whether
the patient has the disease or not, 95% of the time the test
gives the correct answer. Does that make sense? So that's the assumption, okay. But that's not actually
what we want to know. What do you think the patient cares about? The patient doesn't care about that. The patient wants to know whether he or
she has the disease, right? So what the patient cares
about is not P of T given D, it's P of D given T, okay? So that's our goal. Our goal is to find P of D given T. Now, so one of the most common
mistakes in statistics as applied in real life is confusing P of
T given D with P of D given T. Although they're completely
different concepts. Luckily, we know how they're connected,
right? We don't just say these are different and stop there; we actually know how they're related. They're related by Bayes' rule, right? So Bayes' rule, what does that say? Bayes' rule says P(D given T)
is P(T given D)P(D) over P(T). That's just Bayes' rule,
which we proved in one line Wednesday or Monday, that's Bayes' rule. And we already know P(T given D), and we know P(D), that's just from the population, that's 0.01. The only thing left is P(T); we don't yet know P(T), so here's a little trick. And sometimes if you look in books
they won't state Bayes' rule this way. They'll do something more complicated
with the sum in the denominator. But I don't consider that Bayes' rule,
this is Bayes' rule. Okay, and now often the denominator
is the tricky part, and for doing the denominator, that's when
we do this law of total probability. Okay, so
it's very common to use Bayes' rule and the law of total probability in tandem. So if we expand the denominator using
the law of total probability, well, that's just gonna be P(T given D)P(D) +
P(T given D complement)P(D complement). So our partition is a very obvious one, the partition is just saying either
the patient has the disease or does not have the disease, so
breaking up into those two cases. Right, so it's hard to
immediately see what P(T) is, but it's easy as soon as we
break into two cases. So we have two cases, it's kind of neat
also that this thing in the numerator, when we write it this way,
it's the same thing here. And then plus the corresponding term for D complement; it kind of has a nice structure to it. All right, so at this point,
we know all of these numbers, we can just plug them in. I guess I didn't mention, let's see,
what's P(T given D complement)? Well, we know P(T complement
given D complement) is 0.95, so therefore P(T given D complement) is 0.05,
right? Cuz given D complement, it's just the complement: those two conditional probabilities have to add to 1. So if you plug all that in, you get approximately 0.16.
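Written out as a small sketch, that calculation is just Bayes' rule with the denominator expanded by the law of total probability, plugging in the numbers we assumed:

```python
# P(D | T) via Bayes' rule, with P(T) from the law of total probability.
p_D = 0.01           # prior: 1% of similar patients have the disease
p_T_given_D = 0.95   # P(T | D), from the "95% accurate" assumption
p_T_given_Dc = 0.05  # P(T | D^c) = 1 - P(T^c | D^c) = 1 - 0.95

p_T = p_T_given_D * p_D + p_T_given_Dc * (1 - p_D)  # law of total probability
p_D_given_T = p_T_given_D * p_D / p_T               # Bayes' rule
print(round(p_D_given_T, 3))  # 0.161
```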
So even though the test is supposedly 95% accurate in this sense, there's only a 16% chance that the patient has the disease. And that seems surprising both to most
patients and to most doctors at first. And in fact, there was a study done at Harvard where they asked something like 60 Harvard doctors a question very similar to this; they were basically asked to guess what this number should be. And something like 80% of the doctors said numbers like 95%, or very, very high numbers, and didn't realize that it was so small. So what's the moral of the story? Well,
one thing is, you should get a second opinion,
right, do another test. And there's some subtleties
that come up there too, because maybe the second test is
not independent of the first test. In that if there was something that was
causing the test to be wrong the first time, maybe the same
thing would happen again. So it'd be a good idea to get
a different kind of test. Secondly, what's really going on here, most people's intuition on this
problem is completely wrong. Maybe they guess 95%, or maybe they would conservatively
lower it down to 70% or 80%. But hardly anyone,
if you ask this problem, will say 16%. So why is people's intuition so
completely wrong about this? Well, I think the reason is that they
focus on this part of the problem. But an equally important part
of the problem is this 1% here, and that's what gets ignored. So there's a tradeoff here,
it's fairly rare that the test is wrong, but it's also
fairly rare that someone has the disease. So there's kind of a competition
between how rare is the disease versus how rarely is the test wrong. And for some psychological reason, most people focus on the part
about the test being wrong. And don't focus on the fact that
the disease itself is rare. So most people's intuition
is wrong because of that. Here's another, just if you want another
quick intuition into this problem. Remember we talked about
frequentists' world last time? Where we were repeating the same
experiment over and over again, so suppose we didn't just do
this with one patient. Suppose we had 1000,
I'm not going to write this, I'm just giving you a quick intuition. Suppose we repeated this 1000 times,
as we have 1000 patients. And just speaking roughly,
if we have 1000 patients, about 10 of them will have the disease,
right, that's 1%. Maybe not exactly 10, but just roughly,
intuitively speaking, we'd imagine, 1000 patients, 10 have the disease. And let's suppose that for those 10,
the test is correct every time, so all 10 of them test positive. Now what about the other 900 and
whatever people? 10 people have the disease, so
990 people do not have the disease. They all get tested, but something like 5%
of those people are gonna test positive. Just roughly speaking in that example,
I'm not gonna compute exact numbers, I'm just giving you some intuition. Roughly speaking, 50,
that's 5% of 1000, so about 50 people would test positive
who don't have the disease, and about 10 people would test
positive who do have the disease. So that's in a ratio of 5 to 1,
and 0.16 is about one-sixth, so that's what's going on. You have 50 people who tested positive
who don't have the disease, and 10 who did, question? >> [INAUDIBLE]
>> Yeah. >> [INAUDIBLE]
>> Yeah, so that is an extremely
important question. In case anyone couldn't hear, the question
is, will they usually be higher? Because if the patient
who's getting tested, maybe they came in because they
are having symptoms and things like that. So the calculation would completely change
if that patient already has information, like certain symptoms that
are consistent with the disease. The calculation would change,
but the principle is the same. The principle is: whatever evidence you get, you have some initial probability; maybe it's initially 1% if you don't have any evidence. As you start getting symptoms,
then that's evidence, and you update your probabilities. And maybe this 1% would
change to something higher, and that would change the numbers around. But the principle is the same,
you get evidence and you update your probabilities
using Bayes' rule. And here's one really crucial and beautiful fact about Bayes' rule,
it has a certain coherency property. And I kinda work out the math of this in
one of the strategic practice problems, but let me just tell you
the intuition right now. Suppose that you get two pieces of
evidence, not just one, suppose you are investigating a crime and
there's two possibilities. Either you get two clues and you update using Bayes' rule using the
intersection of those two events, okay? But another possibility would be,
you get one clue, and then you go off and get lunch and take a break. You come back after lunch,
you get another clue, okay? So suppose that you updated your
probabilities using the first clue, and then you come back later and
get the second clue, update again. You'll get the same thing, so as long as,
so you can update it more than one step, you can update all at once. You can update in either order, it's
always gonna be completely consistent. So Bayes' rule is coherent in that sense,
that whatever evidence you get, one piece at a time or
all at once or some combination. The end result is gonna
be what's the probability of the thing you're interested
in given the evidence you have. So that's really nice.
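To see that coherency concretely, here's a small sketch with made-up likelihoods (my construction; the lecture doesn't compute this): two clues that are conditionally independent given guilt or innocence, updated one at a time versus all at once.

```python
# Updating on two clues sequentially vs. jointly gives the same posterior,
# assuming the clues are conditionally independent given guilt/innocence.
p_G = 0.30                    # made-up prior probability of guilt
like1 = {"G": 0.8, "I": 0.1}  # P(clue 1 | guilty), P(clue 1 | innocent)
like2 = {"G": 0.7, "I": 0.2}  # same for clue 2

def bayes_update(prior, lik_g, lik_i):
    # Posterior probability of guilt after one piece of evidence.
    num = lik_g * prior
    return num / (num + lik_i * (1 - prior))

one_at_a_time = bayes_update(bayes_update(p_G, like1["G"], like1["I"]),
                             like2["G"], like2["I"])
all_at_once = bayes_update(p_G, like1["G"] * like2["G"],
                           like1["I"] * like2["I"])
print(one_at_a_time, all_at_once)  # identical, about 0.923
```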
So just a couple quick warnings about common mistakes with conditional probability, and then we can do another example. So I call these biohazards,
basically just common mistakes. But they are hazardous to your
health if you make them, so you have to be really careful. These are common mistakes
with conditional probability. So 1 is the one we just talked about, confusing P(A|B) with P(B|A). And we know that Bayes' rule is
how you can connect these two, but they're not the same thing. Sometimes this is called
the prosecutor's fallacy. And if you wanna find examples of this,
you can just search online for prosecutor's fallacy. That's kind of unfair to prosecutors, because defense attorneys
make the same mistake. Doctors make the same mistake, people
make that mistake all the time in everyday life, so
it's unfair to just pick on prosecutors. The reason it's called
the prosecutor's fallacy is that it comes up commonly in that situation. And maybe sometimes the prosecutor
does it deliberately, and maybe sometimes it's cuz
they don't know this stuff. But it's the mistake of,
if you're deciding a criminal case, what you care about is
the probability that the defendant is guilty given all the evidence, right? And the mistake would be focusing entirely
on the probability of the evidence, given innocence. And you want the probability of
innocence given evidence, right, so those two things get confused. Let me mention one legal example, which is a very sad true story
called the Sally Clark case. This is an extreme example, but there are
many other legal cases that are similar in flavor, in varying degrees,
similar in spirit to this one. This is an especially sad, true story. So Sally Clark was a British
woman who had, two of her babies both died for mysterious,
unexplained reasons. So they were calling it like SIDS,
Sudden Infant Death Syndrome. That basically just meant the two babies
died and they had no explanation. She was convicted of
murdering her two children. And here's basically the total evidence
that was put forward against her, okay? So the prosecution got some so-called
expert to come up to testify. And that expert considered, assuming that she's innocent, the probability of a baby just spontaneously dying for no apparent reason. The so-called expert said that the probability is 1 in 8500; I don't know where he got that number from, but let's just accept that. He said there is only a 1 in 8500 chance
that the baby is just going to die mysteriously, assuming it's not murdered. But she had two babies and they both died, which is obviously a terrible tragedy for
her. So he said, okay, well,
there was another one. So that's 1 over 8500 times 1 over 8500, which is about 1 in 73 million. So the first thing that's wrong with this
ties in with what we were doing last time,
which is very questionable here. Because it's not really, maybe there's some genetic factor that
caused the first baby to die, and then the second baby had similar genetic
characteristics that led to that. So that assumes independence, which
there's absolutely no justification for assuming independence here. But even, let's just assume independence
for the sake of argument, and let's assume that we agree with these
numbers, so it's 1 in 73 million. Even then, they said, basically, that there's a 1 in 73 million chance that she's innocent. So that's beyond a reasonable doubt, or whatever the British equivalent of that
is, so she was convicted and went to jail. But that's the prosecutor's fallacy,
right? Because the relevant thing we want to compute is P(innocence|evidence), not
the other way around. So even if you accept this number,
that's evidence given innocence, right? How are those things related, well,
they're related by Bayes' rule. So we could do P(evidence|innocence)
times P(innocence) over P(evidence), and I'm not gonna try to
do a calculation with that. But notice that if you write that down, you're gonna have a term
that's P(innocence). That's the prior probability of innocence, that's the probability of innocence
before we have any evidence, okay? Now there are billions of
mothers in the world, a vast, vast majority of them do not
murder their babies, okay? So the prior probability of innocence
is extremely close to 1, okay? So there's a trade-off between
the prior probability, versus how extreme this number is, that completely
got ignored, and she went to prison. And later it sort of got
exposed that this was wrong and it sort of got overturned, but
by then she'd spent years in prison. And basically, I think she died shortly
after being released from prison, just because she was too
miserable from that. I mean, it's just an unimaginable trauma. So that was one extreme case, but
there are other cases like that. In fact, I brought two books today. The first one is called
Statistical Science in the Courtroom, this is a really, really good book. The other one is called Statistics for
Lawyers, this is also a really, really good book. I'm not gonna talk much about them,
but if any of you are interested, you can come look at them after class. So there are a lot of important
connections between statistics and the law. And in fact,
we have a student in this class here, who's organizing a group at Harvard
to focus on statistics and the law. So if anyone is interested, you can let me
know and I'll put you In touch with him. Okay, so
that's called the prosecutor's fallacy, although it's not
restricted to prosecutors. Now, another one that I wanted to
emphasize is one that I was just leading into when I mentioned
this idea of priors. So I wanna tell you what the word
prior and the word posterior mean. So the second mistake is confusing P(A), that's called the prior. Prior means before we have evidence,
posterior means after we have evidence. With P(A|B),
which is called the posterior. So one instance of that is,
sometimes students, if the problem says that A occurs, then they would be very tempted to write P(A) = 1. And I ask, why did you write P(A) = 1? And the student says, because it's given that A occurred, okay? But that's completely wrong and leads to completely wrong arguments. P(A|A) = 1, that's the key thing to keep in mind. That's always gonna be true: given that A occurred, the probability of A is 1. But if we just write P(A), we're not assuming that. So if the problem says we observe that A occurred, that's saying we're interested in computing stuff given A; it doesn't mean we just write P(A) = 1, it means P(A|A) = 1. So when you're writing down conditional probability calculations, you have to be very, very careful about what should you
okay? And one more common mistake
with probability is confusing independence with
conditional independence. And I haven't yet defined the term
conditional independence, so I'll do that now. But I'm actually kind of assuming that
you already understand what conditional independence means. That is,
you should be able to guess what it means. I'll write it down. And you can see if that
agrees with your guess. Confusing independence with
conditional independence. This is a more subtle mistake
than these first two, but also comes up a lot in practice and
can lead you to completely wrong results. So I want to talk about this distinction. Independence versus
conditional independence. So, first let me write the definition
of conditional independence. So we say that,
this is the definition, but it should be kind of
an intuitive definition. We say that A and B, if we have events A and B, are conditionally independent.
conditionally independent unless it's clear what
we're conditioning on. So to say this precisely, we would
say conditionally independent given some other event, we'll just call that C,
so C is what we're conditioning on. We say they're conditionally independent
given C, and then the definition should be obvious, just write down
the definition of independence, it's just assume everything
is conditional on C. So it's just completely analogous,
everything's conditioned on C. So I can write that down immediately. P(A intersect B given C) =
P(A given C)P(B given C). Okay, so that's the definition,
just looks like independence, except I put given C everywhere. And so a natural question then,
is does independence imply conditional independence, and does conditional
independence imply independence? So, let's talk about those two questions. So, first of all,
does conditional independence, let's say given C, but given whatever event, imply unconditional independence? And the answer is no, not in general. So if they're conditionally independent
they may or may not be independent. And there's an example that I worked out
in detail on the strategic practice, so I'll just talk about
that example briefly but you can read that example
if you haven't already. That was the chess player example: a chess opponent of unknown strength. This is a very common situation in real life, right? It doesn't have to be chess obviously, that's just an example, so you can make up your own example. And that's one that I
like since I like chess. Unknown strength. So you're playing chess with
someone you've never met before. You assume you don't know that person's
rating or any information at all, so you have no idea about how
good that person is, right? Well, suppose you play a series of chess games with that same person, multiple games. Now it's possible that
after you win the first few then that person gets demoralized and
collapses and starts playing badly. Or it's possible they get mad and
they think harder and they wanna take revenge and
things like that. Ignore all that cuz I'm just
coming up with an example. So let's assume that conditional on how strong of a player that person is,
all the games are independent. That's reasonable. But that does not imply that the, That doesn't imply that the games
are unconditionally independent if you don't condition on how
strong that person is. Because if you think about it, suppose
that you win the first five games or something, okay. At that point you'd be pretty confident
that you are a better chess player than that person, right. So therefore the earlier
games give you evidence. So even though the games are seemingly
independent, the earlier games give you evidence that helps you assess
the strength of your opponent, right. So that gives you evidence that's relevant
for predicting the later outcomes. Independence would mean
that the earlier games that you played give you no information
whatsoever that helps you predict the outcomes of the later games, okay? But actually the earlier games give you
a good sense of how strong that person is, so that's the distinction. So it may happen that the games
are conditionally independent. I'll say game outcomes are conditionally
independent, but not independent. And when I say conditionally independent, I mean conditionally independent
given the strength of your opponent. So, that's one example, but it would be good practice for you to try to come up with your own examples. Conditionally independent given strength of opponent, but not independent. So that's an example in this direction.
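Here's a simulation sketch of that chess situation (a toy construction with made-up strengths): conditional on the opponent's strength, the games are independent coin flips, yet marginally, winning the first game makes winning the second more likely.

```python
# Conditionally independent given strength, but not independent marginally.
import random

def play_two_games():
    strength = random.choice([0.2, 0.8])  # unknown opponent: your win probability
    return (random.random() < strength, random.random() < strength)

trials = 10**6
games = [play_two_games() for _ in range(trials)]
p_win1 = sum(g[0] for g in games) / trials
p_win2_given_win1 = sum(g[0] and g[1] for g in games) / sum(g[0] for g in games)
print(p_win1, p_win2_given_win1)  # about 0.5 vs about 0.68
```

Winning game 1 is evidence that the opponent is the weak one, which is exactly why the games are dependent once you don't condition on strength.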
The last thing for today, though, would be: what about the converse question? So let's talk about that. So they're conditionally independent
given strength of opponent, but not independent unconditionally. This is a fairly common structure for this kind of thing where there's
something that's unknown, okay? And if we knew that thing, it may be reasonable to assume
independence conditionally, but without that thing, then the earlier
trials give us data that we can use, okay? So that's the distinction there. All right, last thing is what
about the reverse question? So I wanna know does independence
imply conditional independence? So if A and
B are independent without any conditions, does independence imply
conditional independence given C? Okay, so that's a natural question. Given C in general, is this true? So this is unconditional independence,
does that imply conditional independence? Okay, and if you have to guess, how many of you would guess the answer is yes? Okay, how many of you would guess no? Okay, well, the answer is no. So, let's talk about why. It's not obvious;
it sounds like unconditional independence is a stronger condition; if it's unconditional, surely it should be conditional. It's more subtle than that; try to come up with your own example. But this is also a very,
very common phenomenon in real life when you have some phenomenon
with multiple causes. Okay, so if you have something that's gonna occur and is caused by multiple things, then you can see this kind of thing. So just to give a counterexample, suppose the fire alarm goes off, so
let's say F is fire alarm goes off. Hopefully it won't go off right now so
we can finish doing this example. Okay, now just for simplicity, suppose there are only two things that can cause the fire alarm to go off. Actually, let me call this A for alarm, and let's use F for fire. So, it's caused by one of two things: either F, that is, there actually is a fire, right. Or, well, let's say the other possible
cause is someone making popcorn. And again, do not call your
events P, so I'll call it C for corn. And suppose that making popcorn is completely independent of there being a fire.
two things will cause the fire alarm to go off. So suppose, that's an assumption, but
I'm just constructing an example so I can assume that. Suppose that F and
C are independent, okay? But here's the key observation: what's the probability that there's a fire, given that the alarm goes off,
that's 1. Because if the alarm goes off, then there
are two explanations, either this or this, just like Sherlock Holmes said,
right. If you eliminate all
the other explanations, so you eliminate the popcorn explanation,
it must be that there's a fire. So they are independent, but they're not conditionally independent given that the alarm goes off. Initially these are independent. As soon as the alarm goes off,
then you wanna try to explain that, and they become dependent given A.
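And as a final sketch, here's the alarm example in code, with made-up probabilities for fire and popcorn (the lecture picks no numbers): F and C are independent by construction, but conditioning on the alarm makes them dependent.

```python
# Independent but not conditionally independent: A = (F or C).
import random

trials = 10**6
alarm_no_popcorn = fire_count = 0
for _ in range(trials):
    F = random.random() < 0.01  # fire; probability made up
    C = random.random() < 0.10  # popcorn; independent of F by construction
    A = F or C                  # alarm goes off iff fire or popcorn
    if A and not C:
        alarm_no_popcorn += 1
        fire_count += F
print(fire_count / alarm_no_popcorn)  # exactly 1.0: P(F | A, no popcorn) = 1
```

All right, so that's all for today.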