Some of you may have heard this paradoxical
fact about medical tests. It's very commonly used to introduce the topic of Bayes rule in
probability. The paradox is that you could take a test which is highly accurate, in the sense
that it gives correct results to a large majority of the people taking it, and yet under the right
circumstances when assessing the probability that your particular test result is correct, you can
still land on a very low number. Arbitrarily low in fact. In short, an accurate test is
not necessarily a very predictive test. When people think about math and formulas they
don't often think of it as a design process. I mean maybe in the case of notation it's easy
to see that different choices are possible, but when it comes to the structure of the
formulas themselves and how we use them that's something that people
typically view as fixed. In this video you and I will dig into this
paradox, but instead of using it to talk about the usual version of Bayes rule, I'd like to motivate
an alternate version, an alternate design choice. Now what's up on the screen is a little
bit abstract, which makes it difficult to justify that there really is a substantive
difference here, especially when I haven't explained either one yet. To see what I'm
talking about though, we should really start by spending some time thinking a little more concretely
and just laying out what exactly this paradox is. Picture one thousand women and suppose that one
percent of them have breast cancer. And let's say they all undergo a certain breast cancer
screening and that nine of those with cancer correctly get positive results and there's one
false negative. And then suppose that among the remainder without cancer 89 get false positives
and 901 correctly get negative results. So if all you know about a woman is that she does
the screening and she gets a positive result, you don't have information about symptoms or
anything like that, you know that she's either one of these 9 true positives or one of these 89 false
positives. So the probability that she's in the cancer group given the test result is 9 divided
by (9 + 89) which is approximately 1 in 11. In medical parlance you would call this the
"Positive Predictive Value" of the test, or PPV. The number of true positives divided by
the total number of positive test results. You can see where the name comes from: to
what extent does a positive test result actually predict that you have the disease? Now hopefully, as I've presented it this
way where we're thinking concretely about a sample population, all of this makes
perfect sense. But where it comes across as counterintuitive is if you just
look at the accuracy of the test, present it to people as a statistic, and then ask
them to make judgments about their test result. Test accuracy is not actually one number but
two. First you ask how often the test is correct for those with the disease; this is known as
the test sensitivity. As in, how sensitive is it at detecting the presence of the disease? In
our example test sensitivity is 9 in 10 or 90%. Another way to say the same fact would
be to say the false negative rate is 10%. And then a separate not-necessarily-related
number is how often it's correct for those without the disease, which is known as the
test specificity. As in, are positive results caused specifically by the disease or are there
confounding triggers giving false positives? In our example the specificity is about 91%. Or
another way to say the same fact would be to say the false positive rate is 9%. So the paradox
here is that in one sense the test is over 90% accurate, it gives correct results to
over 90% of the patients who take it. And yet if you learn that someone gets a
positive result without any added information, there's actually only a 1 in 11 chance
that that particular result is accurate.
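Just to ground that arithmetic, here is a minimal Python sketch of the sample-population tally (the variable names are my own, purely for illustration):

```python
# A minimal tally of the sample population described above.
population = 1000
with_cancer = population // 100              # 1% prevalence -> 10 women
true_positives = round(0.9 * with_cancer)    # 90% sensitivity -> 9 true positives
false_positives = 89                         # ~9% of the 990 without cancer

ppv = true_positives / (true_positives + false_positives)
print(ppv)                                   # ~0.092, i.e. about 1 in 11
```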
This is a bit of a problem, because of all the places for math to be counterintuitive, medical tests are one area where it matters a lot.
In 2006 and 2007 the psychologist Gerd Gigerenzer gave a series of statistics seminars to practicing
gynecologists, and he opened with the following example: A 50-year-old woman, no symptoms,
participates in a routine mammography screening. She tests positive, is alarmed and wants to know
from you whether she has breast cancer for certain or what her chances are. Apart from the screening
result you know nothing else about this woman. In that seminar the doctors were then told
that the prevalence of breast cancer for women of this age is about 1%, and then to
suppose that the test sensitivity is 90% and that its specificity is 91%. You might
notice these are exactly the same numbers from the example that you and I just
looked at, this is where I got them, so having already thought it through you and
I know the answer: it's about 1 in 11. However the doctors in this session were not primed
with the suggestion to picture a concrete sample of one thousand individuals the way that
you and I had. All they saw were these numbers. They were then asked: "How many women who
test positive actually have breast cancer? What is the best answer?", and they
were presented with these four choices. In one of the sessions over half the doctors
present said that the correct answer was 9 in 10, which is way off. Only a fifth
of them gave the correct answer, which is worse than what it would have
been if everybody had randomly guessed! It might seem a little extreme to be calling
this a paradox. I mean it's just a fact, it's not something intrinsically self-contradictory.
But as these seminars with Gigerenzer show, people (including doctors) definitely find it
counterintuitive that a test with high accuracy can give you such a low predictive value. We might call this a "veridical paradox", which
refers to facts that are provably true but which nevertheless can feel false when phrased a certain
way. It's sort of the softest form of a paradox, saying more about human psychology than about
logic. The question is how we can combat this. Where we're going with this by the way is that I
want you to be able to look at numbers like this and quickly estimate in your head that it means
the predictive value of a positive test should be around 1 in 11. Or if I changed things and
asked what if it was 10% of the population who had breast cancer, you should be able to quickly turn
around and say that the final answer would be a little over 50%. Or if I said imagine a really low
prevalence, something like 0.1% of patients having cancer, you should again quickly estimate that the
predictive value of the test is around 1 in 100, that is, 1 in 100 of those with positive test
results in that case would have cancer. Or let's say we go back to the 1% prevalence but I make the test more accurate: I tell
you to imagine the specificity is 99%. There you should be able to relatively quickly
estimate that the answer is a little less than 50%. The hope is that you're doing all of
this with minimal calculations in your head.
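If you first want to verify those target answers the long way, here is a small Python sketch (the helper `exact_ppv` is my own naming, not a standard function) that runs the exact computation for each variation:

```python
# Exact predictive values for each variation mentioned above.
# `exact_ppv` is an illustrative helper, not anything official.

def exact_ppv(prior, sensitivity, false_positive_rate):
    true_pos = prior * sensitivity
    false_pos = (1 - prior) * false_positive_rate
    return true_pos / (true_pos + false_pos)

print(exact_ppv(0.01,  0.9, 0.09))   # ~0.092: about 1 in 11
print(exact_ppv(0.10,  0.9, 0.09))   # ~0.526: a little over 50%
print(exact_ppv(0.001, 0.9, 0.09))   # ~0.010: about 1 in 100
print(exact_ppv(0.01,  0.9, 0.01))   # ~0.476: a little under 50%
```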
Now the goals of quick calculations might feel very different from the goals of addressing
but they actually go hand-in-hand. Let me show you what I mean. On the side of addressing
misconceptions, what would you tell the people in that seminar who answered 9 in 10?
What fundamental misconception are they revealing? What I might tell them is that in much the
same way that you shouldn't think of tests as telling you deterministically
whether you have a disease, you shouldn't even think of them as telling
you your chances of having a disease. Instead, the healthy view of what tests
do is that they *update* your chances. In our example before taking the test a
patient's chances of having cancer were 1 in 100. In Bayesian terms we call this the "prior
probability". The effect of this test was to update that prior by almost an order of
magnitude, up to around 1 in 11. The accuracy of a test is telling us about the strength of this
updating, it's not telling us a final answer. What does this have to do with quick
approximations? Well a key number for those approximations is something called the
Bayes factor, and the very act of defining this number serves to reinforce this central
lesson about reframing what it is the tests do. You see, one of the things that makes test
statistics so very confusing is that there are at least four numbers that you'll hear associated
with them. For those with the disease there's the sensitivity and the false negative rate, and then
for those without there's the specificity and the false positive rate. And none of these numbers
actually tell you the thing you want to know! Luckily if you want to interpret a positive test
result you can pull out just one number to focus on from all this. Take the sensitivity divided
by the false positive rate. In other words how much more likely are you to see the positive test
result with cancer versus without. In our example this number is 10. This is the Bayes factor,
also sometimes called the likelihood ratio. A very handy rule of thumb is
that to update a small prior, or at least to approximate the answer, you simply
multiply it by the Bayes factor. So in our example where the prior was 1 in 100, you would estimate
that the final answer should be around 1 in 10, which is in fact slightly above the true correct
answer. So based on this rule of thumb, if I asked you what would happen if the prior from our
example was instead 1 in 1,000, you could quickly estimate that the effect of the test should
be to update those chances to around 1 in 100. And in fact take a moment to check yourself
by thinking through a sample population. In this case you might picture 10,000 patients
where only 10 of them really have cancer. Then based on that 90% sensitivity we would expect
9 of those cancer cases to give true positives. And on the other side, a 91% specificity means
that 9% of those without cancer are getting false positives, so we'd expect nine percent
of the remaining patients, which is around 900, to give false positive results. Here, with such
a low prevalence, the false positives really do dominate the true positives, so the probability
that a randomly chosen positive case from this population actually has cancer is only around one
percent, just like the rule of thumb predicted.
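As a quick numerical check, here's a sketch comparing the rule of thumb against the exact count (again, the names are just illustrative):

```python
# Rule of thumb vs. an exact sample-population count, for a 1-in-1,000 prior.
bayes_factor = 0.90 / 0.09                 # sensitivity / false positive rate = 10

prior = 1 / 1000
print(prior * bayes_factor)                # rule of thumb: 0.01, about 1 in 100

true_pos = 10 * 0.90                       # 9 of the 10 with cancer (in 10,000 people)
false_pos = 9990 * 0.09                    # ~899 of the 9,990 without
print(true_pos / (true_pos + false_pos))   # ~0.0099, just under 1 in 100
```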
Now, this rule of thumb clearly cannot work for higher priors. For example, it would predict that a prior of 10% gets updated all the way
to 100% certainty, but that can't be right. In fact take a moment to think through what the
answer should be again using a sample population. Maybe this time we picture 10 out of 100 having
cancer again. Based on the 90% sensitivity of the test we'd expect 9 of those cancer cases to
get positive results, but what about the false positives? How many do we expect there? About 9
of the remaining 90, or about 8. So upon seeing a positive test result it tells you that you're
either one of these 9 true positives or one of the 8 false positives. So this means the chances are
a little over 50%, roughly 9 out of 17, or 53%. At this point, having dared to dream that Bayesian
updating could look as simple as multiplication, you might resign yourself
and pragmatically acknowledge that sometimes life is just more complicated than that. Except, it's not. This rule of thumb
turns into a precise mathematical fact as long as we shift away from talking about
probabilities to instead talking about odds. If you've ever heard someone talk about the
chances of an event being "1-to-1" or "2-to-1", things like that, you already know about odds.
With probability we're taking the ratio of the number of positive cases out of all possible
cases, right? Things like "1 in 5" or "1 in 10". With odds what you do is take the ratio of all
positive cases to all negative cases. You commonly see odds written with a colon to emphasize the
distinction, but it's still just a fraction, just a number. So an event with a 50% probability
would be described as having one-to-one odds. A 10% probability is the same as 1-to-9 odds.
An 80% probability is the same as 4-to-1 odds, you get the point. It's the same information, it
still describes the chances of a random event, but is presented a little differently,
like a different unit system. Probabilities are constrained between 0
and 1, with even chances sitting at 0.5, but odds range from 0 up to infinity with
even chances sitting at the number 1.
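If it helps, the conversion between the two unit systems is completely mechanical; a tiny sketch, with helper names of my own choosing:

```python
# Converting between probability and odds. Helper names are illustrative.

def prob_to_odds(p):
    return p / (1 - p)        # positive cases divided by negative cases

def odds_to_prob(o):
    return o / (1 + o)

print(prob_to_odds(0.5))      # 1.0    -> 1-to-1 odds
print(prob_to_odds(0.1))      # ~0.111 -> 1-to-9 odds
print(prob_to_odds(0.8))      # 4.0    -> 4-to-1 odds
print(odds_to_prob(4.0))      # 0.8
```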
The beauty here is that a completely-accurate, not-even-approximating-things way to frame Bayes rule is to say: express your
prior using odds, then just multiply by the Bayes factor. Think about what the prior odds are really
saying, it's the number of people with cancer divided by the number without it. Here, let's just
write that down as a normal fraction for a moment so we can multiply it. When you filter down just
to those with positive test results, the number of people with cancer gets scaled down by
the probability of seeing a positive test result given that someone has cancer, and then similarly
the number of people without cancer also gets scaled down, this time by the probability of
seeing a positive test result in that case. So the ratio between these two counts, the
new odds upon seeing the test, looks just like the prior odds, except multiplied by this
term here, which is exactly the Bayes factor.
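If you'd like to see that argument compressed into symbols (the notation here is mine, with $D$ for having the disease and $+$ for a positive result), it reads:

$$
O(D \mid +) \;=\; \frac{P(D \mid +)}{P(\neg D \mid +)} \;=\; \frac{P(D)\,P(+ \mid D)}{P(\neg D)\,P(+ \mid \neg D)} \;=\; O(D) \cdot \underbrace{\frac{P(+ \mid D)}{P(+ \mid \neg D)}}_{\text{Bayes factor}}
$$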
Look back at our example where the Bayes factor was 10. And as a reminder, this came from the 90% sensitivity divided by the 9% false positive rate;
how much more likely are you to see a positive result with cancer versus without. If the prior
is 1%, expressed as odds this looks like 1-to-99. So by our rule this gets updated to
10-to-99, which if you want you could convert back to a probability. It would be
10 divided by (10 + 99), or about 1 in 11. If instead the prior was 10%, which was the
example that tripped up our rule of thumb earlier, expressed as odds this looks like 1-to-9. By
our simple rule this gets updated to 10-to-9, which you can already read off pretty
intuitively. It's a little above even chances, a little above 1-to-1. If you prefer you
can convert it back to a probability, you would write it as 10 out of 19, or
about 53%. And indeed that is what we already found by thinking things
through with a sample population.
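In code, the whole update really is a single multiplication; a minimal sketch, assuming the same test statistics as before (the helper name is my own):

```python
# Bayes rule in odds form: posterior odds = prior odds * Bayes factor.

def update_odds(prior_odds, bayes_factor):
    return prior_odds * bayes_factor

bf = 0.90 / 0.09                      # = 10 for our test

print(update_odds(1 / 99, bf))        # 1-to-99 prior -> ~0.101, i.e. 10-to-99
print(update_odds(1 / 9, bf))         # 1-to-9 prior  -> ~1.11,  i.e. 10-to-9

posterior = update_odds(1 / 9, bf)
print(posterior / (1 + posterior))    # back to a probability: ~0.53
```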
Let's say we go back to the 1% prevalence, but I make the test more accurate. Now what if I told you to imagine that the
false positive rate was only 1% instead of 9%. What that would mean is that our
Bayes factor is 90 instead of 10, the test is doing more work for us. In this
case, with the more accurate test it gets updated to 90-to-99, which is a little less than
even chances, something a little under 50%. To be more precise you could make the conversion back
to probability and work out that it's around 48%, but honestly if you're just going for a
gut feel, it's fine to stick with the odds. Do you see what I mean about how just
defining this number helps to combat potential misconceptions? For anybody who's a
little hasty in connecting test accuracy directly to your probability of having a disease, it's
worth emphasizing that you could administer the same test with the same accuracy
to multiple different patients who all get the same exact result, but if
they're coming from different contexts, that result can mean wildly different
things. However the one thing that does stay constant in every case is the factor by
which each patient's prior odds get updated. And by the way this whole time we've been using
the prevalence of the disease, which is the proportion of people in a population who have it,
as a substitute for the prior, the probability of having the disease before you see a test result. However, prevalence
isn't necessarily the right prior! If there are other known factors, things like symptoms, or in the case of
a contagious disease things like known contacts, those also factor into the prior, and they
could potentially make a huge difference. As another side note, so far we've only
talked about positive test results, but way more often you would be seeing
a negative test result. The logic there is completely the same but the Bayes factor
that you compute is going to look different. Instead you look at the probability of seeing
this negative test result with the disease versus without the disease. So in our cancer example
this would have been the 10% false negative rate divided by the 91% specificity, or about
1 in 9. In other words seeing a negative test result in that example would reduce your
prior odds by about an order of magnitude.
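A quick sketch of that negative-result computation, using our example's numbers:

```python
# Bayes factor for a negative result:
# P(negative | disease) / P(negative | no disease).
bf_negative = 0.10 / 0.91             # false negative rate / specificity
print(bf_negative)                    # ~0.11, about 1 in 9

# A 1% prior (1-to-99 odds) drops to roughly 1-to-900 odds:
print((1 / 99) * bf_negative)         # ~0.0011
```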
When you write it all out as a formula, here's how it looks. It says your odds of having a disease given a test result equals your odds
before taking the test, the prior odds, times the Bayes factor. Now let's contrast this
with the usual way that Bayes rule is written, which is a bit more complicated. In case you haven't seen it before it's
essentially just what we were doing with sample populations, but you wrap it all up symbolically.
Remember how every time we were counting the number of true positives and then dividing it
by the sum of the true positives and the false positives? We do just that, except instead of
talking about absolute amounts we talk of each term as a proportion. So the proportion of true
positives in the population comes from the prior probability of having the disease multiplied
by the probability of seeing a positive test result in that case, and then we copy that term
down again into the denominator and then the proportion of false positives comes from the prior
probability of not having the disease times the probability of a positive test in that case.
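In symbols (again, my notation choices), the whole thing reads:

$$
P(D \mid +) \;=\; \frac{P(D)\,P(+ \mid D)}{P(D)\,P(+ \mid D) \;+\; P(\neg D)\,P(+ \mid \neg D)}
$$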
If you want you could also write this down with words instead of symbols, if terms like sensitivity
and false positive rate are more comfortable. This is one of those formulas where once you
say it out loud it seems like a bit much, but it really is no different from what
we were doing with sample populations. If you wanted to make the whole thing
look simpler you often see this entire denominator written just as the probability of
seeing a positive test result overall. While that does make for a really elegant little expression,
if you intend to use this for calculations, it's a little disingenuous because in
practice every single time you do this you need to break down that denominator into
two separate parts, breaking down the cases. So taking this more honest representation of it,
let's compare our two versions of Bayes rule. And again maybe it looks nicer if we use the
words sensitivity and false positive rate; if nothing else it helps emphasize which parts
of the formula are coming from statistics about the test accuracy. I mean this actually
emphasizes one thing I really like about the
which is that it cleanly factors out the parts that have to do with the prior and the
parts that have to do with the test accuracy. But over in the usual formula all of those are
thoroughly intermingled. And this has a very practical benefit: it's really nice if you want
to swap out different priors and easily see their effects. This is what we were doing earlier.
But with the other formula, to do that you have to recompute everything each time, you can't
leverage a pre-computed Bayes factor the same way. The odds framing also makes things really
nice if you want to do multiple different Bayesian updates based on multiple pieces of
evidence. For example, let's say you took not one test but two. Or you wanted to think about
how the presence of symptoms plays into it. For each piece of new evidence you see you
always ask the question: How much more likely would you be to see that with the disease
versus without the disease? Each answer to that question gives you a new Bayes factor,
a new thing that you multiply by your odds.
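As a sketch of that chaining, with an admittedly made-up Bayes factor for the symptom, and assuming the pieces of evidence are independent given disease status:

```python
# Chaining updates: one Bayes factor per piece of evidence, all multiplied in.
# The second test and the symptom's Bayes factor are hypothetical numbers.

prior_odds = 1 / 99                   # 1% prevalence

bf_test = 0.90 / 0.09                 # our test: Bayes factor 10
bf_second_test = 0.90 / 0.09          # hypothetical second, similar test
bf_symptom = 3.0                      # hypothetical: symptom 3x likelier with disease

posterior_odds = prior_odds * bf_test * bf_second_test * bf_symptom
print(posterior_odds)                 # ~3.03, i.e. roughly 3-to-1 odds
print(posterior_odds / (1 + posterior_odds))   # ~0.75 as a probability
```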
Beyond just making calculations easier, there's something I really like about attaching a number to test accuracy that doesn't even look like a
probability. I mean, if you hear that a test has, for example, a 9% false positive rate, that's
just such a disastrously ambiguous phrase! It's so easy to misinterpret it to mean there's a
9% chance that your positive test result is false. But imagine if instead the number that
we heard tacked on to test results was that the Bayes factor for a positive test
result is, say, 10. There's no room to confuse that for your probability of having a disease,
the entire framing of a Bayes factor is that it's something that acts on a prior;
it forces you to acknowledge the prior as something entirely separate and
highly necessary to drawing any conclusion. All that said, the usual formula is definitely
not without its merits. If you view it not simply as something to plug numbers into, but as an
encapsulation of the sample population idea that we've been using throughout, you could
very easily argue that that's actually much better for your intuition. After all, it's
what we were routinely falling back on in order to check ourselves that the Bayes factor
computation even made sense in the first place. Like any design decision there is
no clear-cut objective best here. But it's almost certainly the case that
giving serious consideration to that question will lead you to a better
understanding of Bayes rule. Also, since we're on the topic
of kind of paradoxical things, a friend of mine Matt Cook recently wrote a book
all about paradoxes. I actually contributed a small chapter to it with thoughts on the question
of whether math is invented or discovered, and the book as a whole is this really nice collection
of thought-provoking paradoxical things ranging from philosophy to math and physics. You can of
course find all the details in the description.