PETER SZOLOVITS: OK. Today's topic is
differential diagnosis. And so I'm just
quoting Wikipedia here. Diagnosis is the identification
of the nature and cause of a certain phenomenon. And differential diagnosis
is the distinguishing of a particular
disease or condition from others that present
similar clinical features. So doctors typically talk
about differential diagnosis when they're faced
with a patient and they make a list of what
are the things that might be wrong with this patient. And then they go
through the process of trying to figure out
which one it actually is. So that's what we're
going to focus on today. Now, just to scare you,
here's a lovely model of human circulatory physiology. So this is from Guyton's
textbook of cardiology. And I'm not going to hold
you responsible for all of the details of this
model, but it's interesting, because this is, at least
as of maybe 20 years ago, the state of the art of how
people understood what happens in the circulatory system. And it has various
control inputs that determine things like
how your hormone levels change various aspects of the
cardiovascular system and how the interactions
between different components of the cardiovascular
system affect each other. And so in principle, if I
could tune this model to me, then I could make all kinds
of pretty good predictions that say if I increase my
systemic vascular resistance, then here's what's
going to happen as the rest of the
system adjusts. And if I get a blockage
in a coronary artery, then here's what's going to
happen to my cardiac output and various other things. So this would be terrific. And if we had this
kind of model for not just the cardiovascular system,
but the entire body, then we'd say, OK, we've solved medicine. Well, we don't have this kind
of model for most systems. And also, there's
this minor problem that if I give you this
model and say, "How does this relate to a
particular patient?", how would you figure that out? This has hundreds of
differential equations that are being represented
by this diagram. And they have many
hundreds of parameters. And so we were joking when we
started working with this model that you'd really have to
kill the patient in order to do enough measurements to
be able to tune this model to their particular physiology. And of course, that's probably
not a good practical approach. We're getting a little better
by developing more non-invasive ways of measuring these things. But that's moving
along very slowly. And I don't expect that I
or maybe even any of you will live long enough that
sort of this approach to doing medical reasoning
and medical diagnosis is actually going to happen. So what we're going
to look at today is what simpler models are
there for diagnostic reasoning. And I'm going to take
the liberty of inflicting a bit of history on you, because
I think it's interesting where a lot of these ideas came from. So the first idea was
to build flowcharts. Oh, and by the way,
the signs and symptoms, I've forgotten if we've talked
about that in the class. So a sign is something
that a doctor sees, and a symptom is something
that the patient experiences. So a sign is objective. It's something that can
be observed from outside your body. A symptom is something
that you feel. So if you're feeling
dizzy, then that's a symptom, because it's not
obvious to somebody outside you that you're dizzy, or that you
have a pain, or such things. Normally, we talk about
manifestations or findings, which is sort of a super
category of all the things that are determinable
about a patient. So we'll talk about
flowcharts, models based on associations
between diseases and these manifestations. Then there are some issues
about whether you're trying to diagnose a single
disease or a multiplicity of diseases, which makes the
models much more complicated whether you're trying to
do probabilistic diagnosis or definitive or categorical. And then we'll talk about some
utility theoretic methods. And I'll just mention some
rule-based and pattern-matching kinds of approaches. So this is kind of cute. This is from 1973. And if you were a woman and
walked into the MIT Health Center and complained
of potentially a urinary tract infection,
they would take out this sheet of paper, which
was nicely color-coded, and they would check
a bunch of boxes. And if you hit a red box,
that represented a conclusion. And otherwise, it
gave you suggestions about what further tests to do. And this was essentially
a triage instrument. It said, does this woman
have a problem that requires immediate attention, so that we should call an ambulance and take her to a hospital, or is it something where we can just tell her to come back the next day and see a doctor,
or is it in fact some self-limited thing where
we say, take two aspirin, and it'll go away. So that was the attempt here.
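Just to illustrate the kind of branching logic such a flow sheet encodes, here is a tiny Python sketch. The questions and the triage rules are invented for illustration; they are not the actual MIT Health Center protocol.

```python
def uti_triage(fever_over_101, flank_pain, pregnant, symptoms_over_2_days):
    """Toy branching logic in the spirit of those color-coded flow sheets."""
    if fever_over_101 and flank_pain:
        return "urgent: possible kidney infection, send to the emergency room"
    if pregnant:
        return "see a clinician today"
    if symptoms_over_2_days:
        return "book a next-day appointment"
    return "self-care advice; return if symptoms persist"

print(uti_triage(fever_over_101=False, flank_pain=False,
                 pregnant=False, symptoms_over_2_days=True))
```

Each check corresponds to one of the boxes on the sheet, and hitting a "red box" corresponds to returning one of the urgent conclusions.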
Now, interestingly, if you look at the history of this project between the
Beth Israel Hospital and Lincoln Laboratories, it started
off as a computer aid. So they were building
a computer system that was supposed to do this. But you can
imagine, in the late 1960s, early 1970s, computers
were pretty clunky. PCs hadn't been invented yet. So this was like mainframe
kinds of operations. It was very hard to use. And so they said, well, this
is a small enough program that we can reduce it to
about 20 flow sheets-- 20 sheets like this, which
they proceeded to print up. And I was amused,
because in the-- around 1980, I was working
in my office one night. And I got this
splitting headache. And I went over to MIT medical. And sure enough,
the nurse pulled out one of these sheets
for headaches and went through it
with me and decided that a couple of
Tylenols should fix me. But it was interesting. So this was really
in use for a while. Now, the difficulty with
approaches like this, of which there have
been many, many, many in the medical world,
is that they're very fragile. They're very specific. They don't take account
of unusual cases. And there's a lot of effort
in coming to consensus to build these things. And then they're not necessarily
useful for a long time. So MIT actually stopped
using them shortly after my headache experience. But if you go over
to a hospital and you look on the bookshelf
of a junior doctor, you will still find
manuals that look kind of like this
that say, how do we deal with tropical diseases? So you ask a bunch of
questions, and then depending on the branching
logic of the flowchart, it'll tell you whether
this is serious or not. And the reason is because if
you do your medical training in Boston, you're
not going to see very many tropical diseases. And so you don't have
a base of experience on the basis of which
you can learn and become an expert at doing it. And so they use this as
a kind of cheat sheet. I mentioned that the association
between diseases and symptoms is another important
way of doing diagnosis. And I swear to you, there was
a paper in the 1960s, I think, that actually proposed this. So if any of you have hung
around ancient libraries, libraries used to have
card catalogs that were physical pieces
of paper, cardboard. And one of the things they
did with these was each card would be a book. And then around the edges
were a bunch of holes, and depending on
categorizations of the book along various dimensions,
like its Dewey decimal number, or the top digits of
its Library of Congress number or something, they would
punch out holes in the borders. And this allowed
you to do a kind of easy sorting of these books. So suppose people had returned their books and you had pulled together a pile of cards, and you wanted to find all the math books. What you would do is stick a needle through the hole that represented math books and shake the pile, and all the math books would fall out, because their cards had that hole punched out. So somebody seriously proposed
this as a diagnostic algorithm, and in fact implemented it, and was even trying to make money on it. I think this was an attempt at
a commercial venture, where they were going to provide doctors
with these library cards that represented diseases. And the holes now
represented not mathematics versus literature,
but they represented shortness of breath versus
pain in the left ankle versus whatever. And again, as people
came in and complained about some condition,
you'd stick a needle through that condition
and you'd shake, and up would come the cards that
had that condition in common. So one of the obvious
problems with this approach is that if you had two
things wrong with you, then you would wind up with no cards
very quickly, because nothing would fall out of the pile. So this didn't go anywhere. But interestingly,
even in the late 1980s, I remember being asked by the
board of directors of the New England Journal of Medicine
to come to a meeting where they had gotten a pitch
from somebody who was proposing essentially exactly
this diagnostic model, except implemented in a computer
now and not in these library cards. And they wanted to know
whether this was something that they ought to get
behind and invest in. And I and a bunch
of my colleagues assured them that this was
probably not a great idea and they should stay away
from it, which they did. Well, a more sophisticated model
is something like a Naive Bayes model that says if
you have a disease-- where is my cursor? If you have a disease, and you
have a bunch of manifestations that can be caused
by the disease, we can make some
simplifying assumptions that say that you
will only ever have one disease at a
time, which means that the values of that node D
form an exhaustive and mutually exclusive set of values. And we can assume that
the manifestations are conditionally independent
observables that depend only on the disease that you
have, but not on each other or not on any other factors. And if you make that
assumption, then you can apply good old
Thomas Bayes's rule. This, by the way, is
the Reverend Bayes. Do you guys know his history? So he was a nonconformist
minister in England. And he was not a professional mathematician, though he was an amateur one. But he decided that
he wanted to prove to people that God existed. And so he developed
Bayesian reasoning in order to make this proof. And so his argument
was, well, suppose you're completely in doubt. So you have 50/50
odds that God exists. And then you say,
let's look at miracles. And let's ask, what's the
likelihood of this miracle having occurred if God exists
versus if God doesn't exist? And so by racking up
a bunch of miracles, you can convince people more
and more that God must exist, because otherwise
all these miracles couldn't have happened. So he never publish this in his
lifetime, but after his death one of his colleagues
actually presented this as a paper at the Royal
Society in the UK. And so Bayes became
famous as the originator of this notion of how to do
probabilistic reasoning about at least fairly
simple situations, like in his case, the existence
or nonexistence of God. Or in our case, the
cause of some disease, the nature of some disease. And so you can draw these trees. And Bayes's rule is very simple. I'm sure you've all seen it. One thing that, again,
makes contact with medicine is that a lot of times,
you're not just interested in the impact of one
observable on your probability distribution, but you're
interested in the impact of a sequence of observations. And so one thing you
can do is you can say, well, here is my
general population. So let's say disease 2 has
37% prevalence and disease 1 has 12%, et cetera. And now I make some observation. I apply Bayes's rule. And I revise my
probability distribution. So this is the equivalent of
finding a smaller population of patients who have
all had whatever answer I got for symptom 1. And then I just keep doing that. And so this is the sequential
application of Bayes's rule. And of course, it does depend
on the conditional independence of all these symptoms. But in medicine,
people don't like to do much math, even arithmetic. And they prefer doing addition
rather than multiplication, because it's easier. And so what they've
done is they said, well, instead of
representing all this data in a probabilistic framework,
let's represent it as odds. And if you represent it
as odds, then the odds of some disease given
a bunch of symptoms, given the independence
assumption, is just the prior
odds of the disease times the conditional
odds, the likelihood ratio of each of the symptoms
that you've observed. So you've just got to
multiply these together. And then because they like
adding more than multiplying, they said, let's take
the log of both sides. And then you can just add them. And so if you remember when I
was talking about medical data, there are things
like the Glasgow Coma score, or the APACHE
score, or various measures of how badly or well a patient
is doing that often involve adding up numbers corresponding
to different conditions that they have. And what they're
doing is exactly this. They're applying Bayes's rule sequentially with these independence assumptions, in the form of log odds that get added rather than probabilities that get multiplied, and that's how they're doing it.
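Just to make that arithmetic concrete, here is a minimal Python sketch of the sequential update and the equivalent additive log-odds form. The two diseases, two symptoms, and all of the numbers are made up for illustration; they do not come from any of the slides.

```python
import math

# Made-up priors and symptom likelihoods for a two-disease naive Bayes toy.
priors = {"D1": 0.12, "D2": 0.37, "other": 0.51}
# P(symptom present | disease), assumed conditionally independent given the disease.
likelihood = {
    "S1": {"D1": 0.80, "D2": 0.30, "other": 0.10},
    "S2": {"D1": 0.40, "D2": 0.70, "other": 0.05},
}

def bayes_update(p, symptom):
    """One sequential application of Bayes's rule for an observed symptom."""
    post = {d: p[d] * likelihood[symptom][d] for d in p}
    z = sum(post.values())                       # P(symptom) under current beliefs
    return {d: v / z for d, v in post.items()}   # renormalized posterior

p = dict(priors)
for s in ("S1", "S2"):           # observe the symptoms one at a time
    p = bayes_update(p, s)

# Equivalent additive form: log posterior odds of D1 vs D2 equals the log prior
# odds plus the sum of the log likelihood ratios of the observed symptoms.
log_odds = math.log(priors["D1"] / priors["D2"]) + sum(
    math.log(likelihood[s]["D1"] / likelihood[s]["D2"]) for s in ("S1", "S2"))

print(p["D1"] / p["D2"], math.exp(log_odds))     # the two agree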
Very quickly. Somebody in a previous lecture was wondering about receiver operating characteristic curves. And I just wanted to give you a
little bit of insight on those. So if you do a test on two
populations of patients-- the red ones are sick patients. The blue ones are
not sick patients. You do some test. What you expect is that
the result of that test will be some continuous
number, and it'll be distributed something
like the blue distribution for the well patients
and something like the red distribution
for the ill patients. And typically, we
choose some threshold. And we say, well,
if you choose this to be the threshold between
a prediction of sick or well, then what
you're going to get is that the part of the
blue distribution that lies to the right is
the false positives and the part of the
red distribution that lies to the left is
the false negatives. And often people will choose the point at which these two curves intersect as the threshold, but that, of course, isn't necessarily the right choice. Now, if I give
you a better test, one like this, that's terrific,
because there is essentially no overlap. Very small false negative
and false positive rates. And as I said, you
can choose to put the threshold in
different places, depending on how you
want to trade off sensitivity and specificity. And we measure this by
this receiver operating characteristic curve,
which has the general form that if you get a
curve like this, that means that there's an
exact trade-off for sensitivity and specificity, which is the
case if you're flipping coins. So it's random. And of course, if you manage
to hit the top corner up there, that means that there
would be no overlap whatsoever between
the two distributions, and you would get
a perfect result. And so typically you get
something in between. And so normally,
if you do a study and your AUC, the area
under this receiver operating characteristic
is barely over a half, you're pretty
close to worthless, whereas if it's
close to 1, then you have a really good
method for distinguishing these categories of patients.
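For anyone who wants to see the mechanics, here is a small sketch in Python with simulated test values rather than any real data: sweep a threshold over two overlapping score distributions, and compute the area under the ROC curve by the rank (Mann-Whitney) shortcut.

```python
import numpy as np

rng = np.random.default_rng(0)
well = rng.normal(0.0, 1.0, 1000)   # test values for the not-sick ("blue") group
sick = rng.normal(1.5, 1.0, 1000)   # test values for the sick ("red") group

# Sweep the threshold: everything above it gets called "sick".
for t in (0.0, 0.75, 1.5):
    sensitivity = (sick > t).mean()          # true positive rate
    specificity = (well <= t).mean()         # 1 - false positive rate
    print(f"threshold {t:4.2f}: sens {sensitivity:.2f}, spec {specificity:.2f}")

# AUC is the probability that a random sick patient scores higher than a
# random well patient (ties count half); ~0.5 is chance, ~1.0 is perfect.
auc = (sick[:, None] > well[None, :]).mean() \
      + 0.5 * (sick[:, None] == well[None, :]).mean()
print(f"AUC ~ {auc:.3f}")
```

Moving the threshold trades sensitivity against specificity; the AUC summarizes the whole trade-off curve in one number.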
Next topic. What does it mean to be rational? I should have a
philosophy course here. AUDIENCE: Are you
talking about pi? PETER SZOLOVITS: Sorry. AUDIENCE: Are you
talking about pi? Pi is-- PETER SZOLOVITS:
Pi is irrational, but that's not what
I'm talking about. Well, so there is this
principle of rationality that says that what you want
to do is to act in such a way as to maximize your
expected utility. So for example, if
you're a gambler and you have a choice of various
ways of betting in some poker game or something, then if
you were a perfect calculator of the odds of getting a
queen on your next draw, then you could make
some rational decision about whether to
bet more or less, but you'd also have to take into
account things like, "How could I convince my opponent that I am
not bluffing if I am bluffing?" and "How could I convince them
that I'm bluffing if I'm not bluffing?" and so on. So there is a
complicated model there. But nevertheless,
the idea is that you should behave in a
way that will give you the best expected outcome. And so people joke that
this is Homo economicus, because economists make
the assumption that this is how people behave. And we now know that that's
not really how people behave. But it's a pretty common
model of their behavior, because it's easy
to compute, and it has some appropriate
characteristics. So as I mentioned,
every action has a cost. And utility measures the
value or the goodness of some outcome, which is the
amount of money you've won, or whether you live or die, or
quality adjusted life years, or various other
measures of utility-- how much it costs for
your hospitalization. So let me give you an example. This actually comes from a
decision analysis service at New England
Medical Center Tufts Hospital in the late 1970s. So this was an elderly
Chinese gentleman whose foot had gangrene. So gangrene is an infection
that people with poor circulation often get. And what he was
facing was a choice of whether to amputate
his foot or to try to treat him medically. To treat him medically
means injecting antibiotics into his system and hoping that
his circulation is good enough to get them to the
infected areas. And so the choice becomes
a little more complicated, because if the medical treatment
fails, then, of course, the patient may
die, a bad outcome. Or you may have to now
amputate the whole leg, because the gangrene has spread
from his foot up the leg, and now you're
cutting off his leg. So what should you do? And how should you
reason about this? So Pauker's staff came up
with this decision tree. By the way, decision tree in
this literature means something different from decision
tree in like C4.5. So your choices here
are to amputate the foot or start with medical care. And if you amputate
the foot, let's say there is a 99% chance
that the patient will live. There's a 1% chance that maybe
the anesthesia will kill him. And if we treat
him medically, they estimated that there is a
70% chance of full recovery, a 25% chance that he'd
get worse, a 5% chance that he would die. If he got worse, you're now
faced with another decision, which is, do we
amputate the whole leg or continue pushing medicine? And again, there are various
outcomes with various estimated probabilities. Now, the critical thing here
that this group was pushing was the idea that
these decisions shouldn't be based on what the
doctor thinks is good for you. They should be based on what
you think is good for you. And so they worked
very hard to try to elicit individualized
utilities from patients. So for example, this guy said
that having your foot amputated was worth 850 points on a scale
of 0 to 1,000, where being healthy was 1,000 and being dead was 0. Now, you could imagine
that that number would be very different
for different individuals. If you asked LeBron
James how bad would it be to have your
foot amputated, he might think that it's
much worse than I would, because it would be a pain
to have my foot amputated, but I could still do
most of the things that I do professionally,
whereas he probably couldn't as a star basketball player. So how do you solve
a problem like this? Well, you say, OK,
at every chance node I can calculate the expected
value of what happens here. So here it's 0.6 times 995 plus 0.4 times 0. That gets me a value
for this decision. Do the same thing here. I compare the values here
and choose the best one. That gives me a value
for this decision. And so I fold back
this decision tree. And my next slide should have-- yeah, so these are the
numbers that you get. And what you discover
is that the utility of trying medical treatment
is somewhat higher than the utility of
immediately amputating the foot if you believe these
numbers and those utilities, these probabilities
and those utilities. Now, the difficulty is that
these numbers are fickle. And so you'd like to do some
sort of sensitivity analysis. And you say, for example,
what if this gentleman valued living with an amputated foot at 900 rather than 850? And now you discover
that amputating the foot looks like a slightly better
decision than the other.
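A small sketch of that fold-back calculation is below. Only the numbers quoted in the lecture (the 99/1 split, the 70/25/5 split, the 0.6 times 995 sub-node, and the 850 versus 900 utilities) are taken from the slides; the remaining probabilities and utilities are placeholders I invented so the example runs end to end.

```python
# A decision node picks the branch with the maximum expected utility;
# a chance node averages its branches by probability.
def fold_back(node):
    if node["type"] == "outcome":
        return node["utility"]
    if node["type"] == "chance":
        return sum(p * fold_back(child) for p, child in node["branches"])
    return max(fold_back(child) for _, child in node["branches"])   # decision node

def options(u_foot_amputated):
    # 0.99/0.01, 0.70/0.25/0.05, 0.6 x 995, and 850/900 are from the lecture;
    # everything else is a placeholder.
    amputate_leg = {"type": "chance", "branches": [
        (0.9, {"type": "outcome", "utility": 600}),
        (0.1, {"type": "outcome", "utility": 0})]}
    push_medicine = {"type": "chance", "branches": [
        (0.6, {"type": "outcome", "utility": 995}),
        (0.4, {"type": "outcome", "utility": 0})]}
    worse = {"type": "decision", "branches": [
        ("amputate leg", amputate_leg), ("continue medicine", push_medicine)]}
    medicine = {"type": "chance", "branches": [
        (0.70, {"type": "outcome", "utility": 995}),   # full recovery
        (0.25, worse),                                  # got worse: second decision
        (0.05, {"type": "outcome", "utility": 0})]}     # died
    amputate_foot = {"type": "chance", "branches": [
        (0.99, {"type": "outcome", "utility": u_foot_amputated}),
        (0.01, {"type": "outcome", "utility": 0})]}
    return {"amputate foot": amputate_foot, "medical treatment": medicine}

for u in (850, 900):   # simple sensitivity analysis on the amputated-foot utility
    values = {name: fold_back(node) for name, node in options(u).items()}
    best = max(values, key=values.get)
    print(u, best, round(values[best], 2))
```

With these placeholder values the same qualitative flip appears: medical treatment wins when living with an amputated foot is valued at 850, and immediate amputation wins at 900.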
So this is actually applied in clinical medicine. And there are now
thousands of doctors who have been trained
in these techniques and really try to work through
this with individual patients. Of course, it's used much more
on an epidemiological basis when people look at
large populations. AUDIENCE: I have a question. PETER SZOLOVITS: Yeah. AUDIENCE: How are the
probabilities assessed? PETER SZOLOVITS:
So the service that did this study would
read the literature, and they would
look in databases. And they would try to
estimate those probabilities. We can do a lot better today
than they could at that time, because we have a lot more
data that you can look in. But you could say,
OK, for people-- men of this age who have
gangrenous feet, what fraction of them have
the following experience? And that's how
these are estimated. Some of it feels like 5%. OK. So I just said this. And then the
question of where do you get these utilities
is a tricky one. So one way is to do the
standard gamble, which says, OK, Mr. Szolovits,
we're going to play this game. We're going to roll a fair die
or something that will come up with some continuous
number between 0 and 1, and then I'm going to play the
game where either I chop off your foot, or I roll
this die and if it exceeds some threshold,
then I kill you. Nice game. So now if you find the point
at which I'm indifferent, if I say, well, 0.8, that's
a 20% chance of dying. It seems like a lot. But maybe I'll go
for 0.9, right? Now you've said,
OK, well, that means that you value
living without a foot at 0.9 of the value
of being healthy. So this is a way of doing it. And this is typically done. Unfortunately, of course,
it's difficult to elicit these utilities reliably. And they're also not stable. So people have done experiments
number as a hypothetical, and then when that
person winds up actually faced with such a
decision, they no longer will abide with that number. So they've changed their mind
when the situation is real. AUDIENCE: But it's nice, because
there are two feet, right? So you could run this
experiment and see. PETER SZOLOVITS: They
didn't actually do it. It was hypothetical. OK. Next program I want to
tell you about, again, the technique for this was
developed as a PhD thesis here at MIT in 1967. So this is hot off the presses. But it's still used,
this type of idea. And so this was a program that
was published in the American Journal of Medicine, which is
a high impact medical journal. I think this was
actually the first sort of computational program
that journal had ever published as a medical journal. And it was addressed
at the problem of the diagnosis of acute
oliguric renal failure. Oliguric means you're
not peeing enough. Renal is your kidney. So this is something's gone
wrong with your kidney, and you're not
producing enough urine. Now, this is a good problem to
address with these techniques, because if something
happens to you suddenly, it's very likely that
there is one cause for it. If you are 85 years old
and you have a little heart disease and a little
kidney disease and a little liver disease
and a little lung disease, there's no guarantee that there
was one thing that went wrong with you that caused all these. But if you were OK yesterday
and then you stopped peeing, it's pretty likely that there's
one thing that's gone wrong. So it's a good
application of this model. So what they said is there
are 14 potential causes. And these are exhaustive
and mutually exclusive. There are 27 tests or
questions or observations that are relevant
to the differential. These are cheap
tests, so they didn't involve doing anything
either expensive or dangerous to the patient. It was measuring
something in the lab or asking questions
of the patient. But they didn't want to
have to ask all of them, because that's pretty tedious. And so they were
trying to minimize the amount of
information that they needed to gather
in order to come up with an appropriate decision. Now, in the real problem, there
were three invasive tests that are dangerous
and expensive, and then eight
different treatments that could be applied. And I'm only going to tell
you about the first part of this problem. This 1973 article shows you
what the program looked like. It was a computer terminal
where it gave you choices, and you would type in an answer. And so that was the state
of the art at the time. But what I'm going to
do is, god willing, I'm going to demonstrate
a reconstruction that I made of this program. So these guys are the potential
causes of stopping to pee-- acute tubular necrosis,
functional acute renal failure, urinary tract obstruction, acute
glomerulonephritis, et cetera. And these are the
prior probabilities. Now, I have to warn
you, these numbers were, in fact, estimated by
people sticking their finger in the air and figuring out
which way the wind was blowing, because in 1973, there were not
great databases that you could turn to. And then these are
the questions that were available to be asked. And what you see in
the first column, at least if you're sitting
close to the screen, is the expected entropy of
the probability distribution if you answered this question. So this is basically saying, if
I ask this question, how likely is each of the possible answers,
given my disease distribution probabilities? And then for each
of those answers, I do a Bayesian revision,
then I weight the entropy of that resulting distribution
by the probability of getting that answer. And that gets me
the expected entropy for asking that question. And the idea is that the
lower the expected entropy, the more valuable the question. Makes sense.
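Here is a minimal Python sketch of that expected-entropy calculation. The three candidate diagnoses and the conditional probabilities for a single yes/no question are invented numbers, not the ones in the acute renal failure program.

```python
import math

def entropy(p):
    return -sum(q * math.log2(q) for q in p.values() if q > 0)

def posterior_and_evidence(prior, cond, answer):
    """Bayesian revision for one answer; also returns P(answer)."""
    post = {d: prior[d] * cond[d][answer] for d in prior}
    z = sum(post.values())
    return {d: v / z for d, v in post.items()}, z

def expected_entropy(prior, cond, answers):
    """Weight the entropy of each possible posterior by the chance of that answer."""
    return sum(p_ans * entropy(post)
               for post, p_ans in (posterior_and_evidence(prior, cond, a)
                                   for a in answers))

# Invented three-way differential and one yes/no question.
prior = {"ATN": 0.25, "FARF": 0.40, "obstruction": 0.35}
cond = {"ATN": {"yes": 0.8, "no": 0.2},
        "FARF": {"yes": 0.1, "no": 0.9},
        "obstruction": {"yes": 0.3, "no": 0.7}}

print(entropy(prior), expected_entropy(prior, cond, ("yes", "no")))
# Ask the question whose expected entropy is lowest.
```

Computing this for each of the 27 candidate questions and picking the minimum is the question-selection heuristic the program used.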
So if we look, for example, the most valuable question is, what was the blood pressure
at the onset of oliguria? And I can click on this
and say it was, let's say, moderately elevated. And what this little
colorful graph is showing you is that if you
look at the initial probability distribution, acute tubular
necrosis was about 25%, and has gone down to
a very small amount, whereas some of these
others have grown in importance considerably. So we can answer more
questions, we can say-- let's see. What is the degree-- is there proteinuria? Is there protein in the urine? And we say, no, there isn't. I think we say, no, there isn't. 0. And that revises the
probability distribution. And then it says the next most
important thing is kidney size. And we say-- let's say
the kidney size is normal. So now all of a sudden functional acute renal failure comes to the top-- which, by the way, is one of these funny catch-all medical categories that says the kidney doesn't work well but doesn't explain why it doesn't work well; it's sort of a generic thing. And sure enough. We can keep answering
questions about, are you producing
less than 50 ccs of urine, which is a tiny
amount, or somewhere between 50 and 400? Remember, this is for people
who are not producing enough. So normally you'd be over 400. So these are the only choices. So let's say it's moderate. And so you see the probability
distribution keeps changing. And what happened in
the original program is they had an
arbitrary threshold that said when the probability of one
of these causes of the disease reaches 95%, then we
switch to a different mode, where now we're actually
willing to contemplate doing the expensive tests and
doing the expensive treatments. And we build the
decision tree, as we saw in the case of the
gangrenous foot, that figures out which of those
is the optimal approach. So the idea here was
because building a decision tree with 27 potential questions
becomes enormously bushy, we're using a
heuristic that says information maximization
or entropy reduction is a reasonable way of
focusing in on what's wrong with this patient. And then once we focused
in pretty well, then we can begin to do
more detailed analysis on the remaining more
consequential and more costly tests that are available. Now, this program didn't
work terribly well, because the numbers
were badly estimated, and also because the utility model that they had for the decision analytic part was particularly terrible. It didn't really reflect
anything in the real world. They had an incremental utility
model that said the patient either got better, or stayed
the same, or got worse. Those are obviously in the right order of utility, but they didn't correspond to how much better or how much worse he got. And so it wasn't
terribly useful. So nevertheless, in the 1990s,
I was teaching a tutorial at a Medical
Informatics conference, and there were a bunch of
doctors in the audience. And I showed them this program. And one of the doctors came
up afterwards and said, wow, it thinks just the way I do. And I said, really? I don't think so. But clearly, it
was doing something that corresponded
to the way that he thought about these cases. So I thought that
was a good thing. All right. Well, what happens
if we can't assume that there's just a single
disease underlying the person's problems? If there are
multiple diseases, we can build this kind
of bipartite model that says we have
a list of diseases and we have a list
of manifestations. And some subset of
the diseases can cause some subset
of the symptoms, of the manifestations. And so the manifestations
depend only on the diseases that are
present, not on each other. And therefore, we have
conditional independence. And this is a type of
Bayesian network, which can't be solved exactly
because of the computational complexity. So a program I'll
show you in a minute had 400 or 500 diseases and
thousands of manifestations. And the computational complexity
of exact solution techniques for these networks tends
to go exponentially with the number of undirected
cycles in the network. And of course, there are
plenty of undirected cycles in a network like that. So there was a program
developed originally in the early 1970s
called Dialog. And then they got sued, because
somebody owned that name. And then they
called it Internist, and they got sued because
somebody owned that name. And then they
called it QMR, which stands for Quick
Medical Reference, and nobody owned that name. So around 1982, this program
had about 500 diseases, which they estimated
represented about 70% to 75% of major diagnoses in
internal medicine, about 3,500 manifestations. And it took about 15 man
years of manual effort to sit there and read medical
textbooks and journal articles and look at records of
patients in their hospital. The effort was led by
a computer scientist at the University of Pittsburgh
and the chief of medicine at UPMC, the University of
Pittsburgh Medical Center, who was just a fanatic. And he got all the
medical school trainees to spend hours and hours
coming up with these databases. By 1997, they had
commercialized it through a company that had
bought the rights to it. And that
company had expanded it to about 750 diagnoses and
about 5,500 manifestations. So they made it
considerably larger. Details are-- I've tried to put
references on all the slides. So here's what data
in QMR looks like. For each diagnosis,
there is a list of associated
manifestations with evoking strengths and frequencies. So I'll explain
that in a minute. On average, there are about
75 manifestations per disease. And for each disease-- for each manifestation
in addition to the data you see here, there is
also an important measure that says how critical is it to
explain this particular symptom or sign or lab value
in the final diagnosis. So for example, if
you have a headache, that could be
incidental and it's not that important to explain it. If you're bleeding from your
gastrointestinal system, that's really
important to explain. And you wouldn't
expect a diagnosis of that patient that
doesn't explain to you why they have that symptom. And then here is an example
of alcoholic hepatitis. And the two numbers here are
a so-called evoking strength and a frequency. These are both on scales-- well, evoking strength
is on a scale of 0 to 5, and frequency is on
a scale of 1 to 5. And I'll show you what
those are supposed to mean. And so, for example,
what this says is that if you're
anorexic, that should not make you think about alcoholic
hepatitis as a disease. But you should expect
that if somebody has alcoholic hepatitis, they're
very likely to have anorexia. So that's the frequency number. This is the evoking
strength number. And you see that there
is a variety of those. So much of that many,
many years of effort went into coming
up with these lists and coming up with
those numbers. Here are the scales. So the evoking strength-- 0 means nonspecific. 5 means it's pathognomonic. In other words, just
seeing the symptom is enough to convince
you that the patient must have this disease. Similarly, frequency 1
means it occurs rarely, and 5 means that it occurs
in essentially all cases with scaled values in between. And these are kind
of like odds ratios. And they add them kind of as
if they were log likelihood ratios. And so there's been
a big literature on trying to figure out exactly
what these numbers mean, because there's no formal
definition in terms of you count the number of this and
divide by the number of that, and that gives you
the right answer. These were sort of the
impressionistic kinds of numbers. So the logic in the system
was that you would come to it and give it a list of the
manifestations of a case. And to their credit, they went
after very complicated cases. So they took clinical
pathologic conference cases from The New England
Journal of Medicine. These are cases selected to be
difficult enough that doctors are willing to read these. And they're typically presented
at Grand Rounds at MGH by somebody who is often
stumped by the case. So it's an opportunity to watch
people reason interactively about these things. And so you evoke
the diagnoses that have a high evoking strength
from the given manifestations. And then you do a
scoring calculation based on those numbers. The details of this
are probably all wrong, but that's the way
they went about it. And then you form a
differential around the highest scoring diagnosis.
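The real INTERNIST-1/QMR scoring function mapped evoking strengths, frequencies, and the import of each finding through hand-tuned nonlinear tables, so the sketch below is only meant to convey the flavor: additive credit for present findings the disease evokes, and a penalty for findings the disease usually causes that are absent. The two diseases, the findings, and all of the weights are invented, not taken from the real knowledge base.

```python
# Toy knowledge base: disease -> {finding: (evoking_strength 0-5, frequency 1-5)}.
kb = {
    "alcoholic hepatitis": {"anorexia": (0, 4), "jaundice": (2, 3), "ascites": (2, 2)},
    "viral hepatitis":     {"anorexia": (1, 4), "jaundice": (3, 4)},
}

def score(disease, present, absent):
    """Credit for explained present findings, penalty for expected-but-absent ones."""
    profile = kb[disease]
    s = 0
    for f in present:
        if f in profile:
            s += profile[f][0]          # evoking strength of an explained finding
    for f in absent:
        if f in profile:
            s -= profile[f][1] - 1      # high-frequency findings that are absent count against
    return s

present, absent = {"anorexia", "jaundice"}, {"ascites"}
ranked = sorted(kb, key=lambda d: score(d, present, absent), reverse=True)
print([(d, score(d, present, absent)) for d in ranked])
```

The evoked diseases get ranked by this kind of score, and the differential is then formed around the top-scoring one.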
Now, this is actually an interesting idea. It's a heuristic idea, but it's
one that worked pretty well. So suppose I have two diseases. D1 can cause
manifestations 1 through 4. And D2 can cause 3 through 6. So are these competing to
explain the same case or could they be complementary? Well, until we know what
symptoms the patient actually has, we don't know. But let's trace through this. So suppose I tell you that
the patient has manifestations 3 and 4. OK. Well, you would say,
there is no reason to think that the patient
may have both diseases, because either of them can
explain those manifestations, right? So you would consider
them to be competitors. What about if I add M1? So here, it's getting
a little dicier. Now you're more likely
to think that it's D1. But if it's D1, that could
explain all the manifestations, and D2 is still viewable
as a competitor. On the other hand,
if I also add M6, now neither disease can
explain all the manifestations. And so it's more likely,
somewhat more likely, that there may be
two diseases present. So what Internist had was
this interesting heuristic, which said that when you get
that complementary situation, you form a differential around
the top ranked hypothesis. In other words, you retain
all those diseases that compete with that hypothesis. And that defines
a subproblem that looks like the acute
renal failure problem, because now you have
one set of factors that you're trying to
explain by one disease. And you set aside all of
the other manifestations and all of the
other diseases that are potentially complementary. And you don't worry about
them for the moment. Just focus on this
cluster of things that are competitors
to explain some subset of the manifestations.
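Here is a sketch of the competitor test implied by that example: treat two diseases as competitors when the observed findings explained by one are a subset of those explained by the other, and as potentially complementary otherwise. This is my paraphrase of the heuristic, not code from Internist.

```python
def competes(leader_findings, other_findings, observed):
    """True if one disease's explained observed findings contain the other's."""
    a = leader_findings & observed      # observed findings the leader explains
    b = other_findings & observed       # observed findings the challenger explains
    return a >= b or b >= a             # one set is a superset of the other

D1 = {"M1", "M2", "M3", "M4"}   # findings D1 can cause
D2 = {"M3", "M4", "M5", "M6"}   # findings D2 can cause

print(competes(D1, D2, {"M3", "M4"}))               # True  -> competitors
print(competes(D1, D2, {"M1", "M3", "M4"}))         # True  -> D2 still a competitor
print(competes(D1, D2, {"M1", "M3", "M4", "M6"}))   # False -> possibly complementary
```

The three calls reproduce the three stages of the example above: M3 and M4 alone, then adding M1, then adding M6.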
And then there are different questioning strategies. So depending on the scores
within these things, if one of those diseases
has a very high score and the others have
relatively low scores, you would choose a pursue
strategy that says, OK, I'm interested
in asking questions that will more
likely convince me of the correctness of
that leading hypothesis. So you look for the things
that it predicts strongly. If you have a very large
list in the differential, you might say, I'm
going to try to reduce the size of the differential
by looking for things that are likely in some of the
less likely hypotheses so that I can rule them out
if that thing is not present. So different strategies. And I'll come back to
that in a few minutes. So their test, of course,
based on their own evaluation was terrific. It did wonderfully well. The paper got published
in The New England Journal of Medicine, which was an
unbelievable breakthrough to have an AI program that
the editors of The New England Journal considered interesting. Now, unfortunately, it
didn't hold up very well. And so there was this
paper by Eta Berner and her colleagues in
1994 where they evaluated QMR and three other programs. DXplain is very similar
in structure to QMR. Iliad and Meditel
are Bayesian network, or almost naive
Bayesian types of models developed by other groups. And they looked for
results, which is coverage. So what fraction of the real
diagnoses in these 105 cases that they chose to test on could
any of these programs actually diagnose? So if the program didn't
know about a certain disease, then obviously it wasn't
going to get it right. And then they said, OK, of
the program's diagnoses, what fraction were considered
correct by the experts? What was the rank order
of that correct diagnosis among the list of diagnoses
that the program gave? The experts were asked to list
all the plausible diagnoses from these cases. What fraction of those showed
up in the program's top 20? And then did the program have
any value added by coming up with things that the experts
had not thought about, but that they
agreed when they saw them were reasonable
explanations for this case? So here are the results. And what you see is that the
diagnoses in these 105 test cases, 91% of them appeared
in the DXplain program, but, for example, only 73%
of them in the QMR program. So that means that
right off the bat it's missing about a quarter
of the possible cases. And then if you look
at correct diagnosis, you're seeing numbers like
0.69, 0.61, 0.71, et cetera. So these are-- it's like the
dog who sings, but badly, right? It's remarkable that
it can sing at all, but it's not something
you want to listen to. And then rank of the correct
diagnosis in the program is at like 12 or
10 or 13 or so on. So it is in the top 20, but it's
not at the top of the top 20. So the results were
a bit disappointing. And depending on where
you put the cut off, you get the proportion of cases
where a correct diagnosis is within the top end. And you see that at 20,
you're up at a little over 0.5 for most of these programs. And it gets better if you extend
the list to longer and longer. Of course, if you
extended the list to 100, then you reach 100%, but it
wouldn't be practically very useful. AUDIENCE: Why didn't
they somehow compare it to the human decision? PETER SZOLOVITS:
Well, so first of all, they assumed that their
experts were perfect. So they were the gold standard. So they were comparing
it to a human in a way. AUDIENCE: Yeah. PETER SZOLOVITS: OK. So the bottom line is that
although the sensitivity and specificity
were not impressive, the programs were
potentially useful, because they had interactive
displays of signs and symptoms associated with diseases. They could give you
the relative likelihood of various diagnoses. And they concluded
that they needed to study whether a program like this actually helped a doctor practice medicine better. So just here's an example. I did a reconstruction
of this program. This is the kind of
exploration you could do. So if you click on
angina pectoris, here are the findings that
are associated with it. So you can browse
through its database. You can type in an example
case, or select an example case. So this is one of those
clinical pathological conference cases, and then
the manifestations that are present and
absent, and then you can get an
interpretation that says, OK, this is our differential. And these are the
complementary hypotheses. And therefore these
are the manifestations that we set aside, whereas
these are the ones explained by that set of diseases. And so you could watch how the
program does its reasoning. Well, then a group at
Stanford came along when belief networks
or Bayesian networks were created, and
said, hey, why don't we treat this database as if
it were a Bayesian network and see if we can
evaluate things that way? So they had to fill
in a lot of details. They wound up using
the QMR database with a binary interpretation. So a disease was
present or absent. The manifestation was
present or absent. They used causal independence,
or a leaky noisy-OR, which I think you've
seen in other contexts. So this just says if there are
multiple independent causes of something, how likely is it
to happen depending on which of those is present or not. And there is a simplified way
of doing that calculation, which corresponds to sort
of causal independence and is computationally
reasonably fast to do.
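For reference, the leaky noisy-OR combination looks roughly like this: each present parent disease independently fails to cause the finding, and a small leak term stands in for causes outside the model. The particular probabilities below are invented for illustration.

```python
def noisy_or(present_cause_probs, leak=0.01):
    """P(finding present | the listed parent diseases are present).

    present_cause_probs[i] is the probability that disease i alone
    would produce the finding; leak covers causes outside the model.
    """
    p_absent = 1 - leak
    for p in present_cause_probs:
        p_absent *= 1 - p          # every cause must independently fail
    return 1 - p_absent

# e.g. two present diseases that cause the finding 30% and 60% of the time
print(noisy_or([0.3, 0.6]))        # ~0.72
```

This is what lets the bipartite network stay tractable: each finding needs only one parameter per parent disease plus a leak, rather than a full conditional table over all disease combinations.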
And then they also estimated priors on the various diagnoses from national health statistics,
because the original data did not have priors. They wound up not using
the evoking strengths, because they were doing a
pretty straight Bayesian model where all you need is the
priors and the conditionals. They took the frequency as a
kind of scaled conditional, and then built a
system based on that. And I'll just show
you the results. So they took a bunch of
Scientific American medicine cases and said, what are the
ranks assigned to the reference diagnoses of these 23 cases? And you see that like
in case number one, QMR ranked the correct
solution as number six, but their two methods,
TB and iterative TB ranked it as number one. And then these are attempts to
do a kind of ablation analysis to see how well the program
works if you take away various of its clever features. But what you see is that
it works reasonably well, except for a few cases. So case number 23, all variants
of the program did badly. And then they excused
themselves and said, well, there's actually
a generalization of the disease that was in the
Scientific American medicine conclusion, which the
programs did find, and so that would have been
number one across the board. So they can sort of make a
kind of handwavy argument that it really got
that one right. And so these were pretty good. And so this validated the
idea of using this model in that way. Now, today you can go out and
go to your favorite Google App store or Apple's app store
or anybody's app store and download tons and tons
and tons of symptom checkers. So I wanted to give you a demo
of one of these if it works. OK. So I was playing earlier
with having abdominal pain and headache. So let's start a new one. So type in how
you're feeling today. Should we have a cough, or runny
nose, abdominal pain, fever, sore throat, headache, back
pain, fatigue, diarrhea, or phlegm? Phlegm? Phlegm is the winner. Phlegm is like coughing
up crap in your throat. AUDIENCE: Oh, luckily,
they visualize it. PETER SZOLOVITS: Right. So tell me about your phlegm. When did it start? AUDIENCE: Last week. PETER SZOLOVITS: Last week? OK. I signed in as Paul,
because I didn't want to be associated
with any of this data. So was the phlegm bloody
or pus-like or watery or none of the above? AUDIENCE: None of the above. PETER SZOLOVITS:
None of the above. So what was it like? AUDIENCE: I don't know. Paul? PETER SZOLOVITS: Is it
any of these colors? AUDIENCE: Green. PETER SZOLOVITS: I think
I'll make it yellow. Next. Does it happen in the morning,
midday, evening, nighttime, or a specific time of year? AUDIENCE: Specific time of year. AUDIENCE: Yeah. Specific time of year. PETER SZOLOVITS:
Specific time of year. And does lying down or physical
activity make it worse? AUDIENCE: Well, it's
generally not worse. So that's physical activity. PETER SZOLOVITS:
Physical activity. How often is this a problem? I don't know. A couple times a week maybe. Did eating suspect food
trigger your phlegm? AUDIENCE: No. PETER SZOLOVITS: I don't know. I don't know what
a suspect food is. AUDIENCE: [INAUDIBLE] food. PETER SZOLOVITS: Yeah. This is going to
kill most of my time. AUDIENCE: Is it getting better? PETER SZOLOVITS:
Is it improving? Sure, it's improving. Can I think of another
related symptom? No. I'm comparing your case
to men aged 66 to 72. A number of similar
cases gets more refined. Do I have shortness of breath? No. That's good. All right. Do I have a runny nose? Yeah, sure. I have a runny nose. It's-- I don't know--
a watery, runny nose. AUDIENCE: Does it say you've
got to call [INAUDIBLE]?? PETER SZOLOVITS: Well,
I'm going to stop, because it will just take-- it takes too long to go through
this, but you get the idea. So what this is
doing is actually running an algorithm
that is a cousin of the acute renal failure
algorithm that I showed you. So it's trying to optimize the
questions that it's asking, and it's trying to come up
with a diagnostic conclusion. Now, in order not
to get in trouble with things like
the FDA, it winds up wimping out at the
end, and it says, if you're feeling really
bad, go see a doctor. But nevertheless, these kinds
of things are now becoming real, and they're getting
better because they're based on more and more data. Yeah. AUDIENCE: [INAUDIBLE] PETER SZOLOVITS: Well,
I can't get to the end, because we're only at 36%. [INTERPOSING VOICES] Yeah. Here. All right. Somebody-- AUDIENCE: Oh, I think
I need your finger. PETER SZOLOVITS: Oh. OK. Just don't drain
my bank account. So The British Medical
Journal did a test of a bunch of symptom checkers,
of 23 symptom checkers like this about four years ago. And they said, well, can it
on 45 standardized patient vignettes can it find at least
the right level of urgency to recommend whether you should
go to the emergency room, get other kinds of care, or
just take care of yourself. And then the goals were
that if the diagnosis is given by the program, it should
be in the top 20 of the list that it gives you. And if triage is
given, then it should be the right level of urgency. The correct diagnosis was
first in 34% of the cases. It was within the top
20 in 58% of the cases. And the correct triage
was 57% accurate. But notice it was more accurate
in the emergent cases, which is good, because those are the
ones where you really care. So we have-- OK. So based on what
he said about me, I have an upper respiratory
infection with 50% likelihood. And I can ask what to do next. Watch for symptoms like
sore throat and fever. Physicians often
perform a physical exam, explore other treatment
options, and recovery for most cases like this is
a matter of days to weeks. And I can go back and
say, I might have the flu, or I might have
allergic rhinitis. So that's actually reasonable. I don't know exactly
what you put in about me. AUDIENCE: What is
the less than 50? PETER SZOLOVITS: What is what? AUDIENCE: The less than 50. [INTERPOSING VOICES] AUDIENCE: Patients have to
be the same demographics. PETER SZOLOVITS: Yeah. I don't know what the less
than 50 is supposed to mean. AUDIENCE: It started
with 200,000 or so. PETER SZOLOVITS:
Oh, so this is based on a small number of patients. So what happens, of course,
is as you slice and dice a population, it gets
smaller and smaller. So that's what we're seeing. OK. Thank you. OK. So two more topics I'm
going to rush through. One is that-- as I mentioned in
one of the much earlier slides, every action has a cost. It at least takes time. And sometimes it induces
potentially bad things to happen to a patient. And so people began studying
a long time ago what does it mean to be rational
under resource constraints rather than rational just in
this Homo economicus model. And so Eric Horvitz, who's
now a big cheese guy, he's head of Microsoft
Research, but used to be just a lowly graduate
student at Stanford when he started doing this work. He said, well,
utility comes not only from what happens
to the patient, but also from the
reasoning process from the computational
process itself. And so consider-- do
you guys watch MacGyver? This is way out of date. So if MacGyver is defusing
some bomb that's ticking down to zero and he runs out of
time, then his utilities take a very sharp
drop at that point. So that's what this work is
really about, saying, well, what can we do when we don't
have all the time in the world to do the computation as well
as having to try to maximize utility to the patient? And Daniel Kahneman, who won the Nobel Prize in economics a few years ago for work related to this notion of bounded rationality, the idea that the way we would like to be rational is not actually the way we behave, wrote this popular book that I really like called Thinking, Fast and Slow. It says that if you're trying to
figure out which house to buy, you have a lot of time to do it,
so you can deliberate and list all the advantages and
disadvantages and costs and so on of different houses and take
your time making a decision. If you see a car barreling
toward you as you are crossing in a crosswalk, you
don't stop and say, well, let me figure out the pluses and
minuses of moving to the left or moving to the right, because
by the time you figure it out, you're dead. And so he claims that
human beings have evolved in a way where we have a kind of
instinctual very fast response, and that the deliberative
process is only invoked relatively rarely. Now, he bemoans this
fact, because he claims that people make too
many decisions that they ought to be deliberative about based
on these sort of gut instincts. For example, our
current president. But never mind. So what Eric and his
colleagues were doing was trying really to look at
how this kind of meta level reasoning about how
much reasoning and what kind of reasoning is worth doing
plays into the decision making process. So the expected
value of computation as a fundamental
component of reflection about alternative
inference strategies. So for example, I
mentioned that QMR had these alternative
questioning methods depending on the length
of the differential that it was working on. So that's an example of a kind
of meta level reasoning that says that it may be
more effective to do one kind of question asking
strategy than another. The degree of
refinement, people talk about things like
anytime algorithms, where if you run out of time
to think more deliberately, you can just take the best
answer that's available to you now. And so taking the
value of information, the value of computation, and
the value of experimentation into account in doing
this meta level reasoning is important to come up with
the most effective strategies. So he gives an example
of a time pressure decision problem where you have
a patient, a 75-year-old woman in the ICU, and she develops
sudden breathing difficulties. So what do you do? Well, it's a challenge, right? You could be very
deliberative, but the problem is that she may die because
she's not breathing well, or you could
impulsively say, well, let's put her on a
mechanical ventilator, because we know that
that will prevent her from dying in the
short term, but that may be the wrong decision,
because that has bad side effects. She may get an infection, get
pneumonia, and die that way. And you certainly don't want
to subject her to that risk if she didn't need
to take that risk. So they designed an
architecture that says, well, this is the decision
that you're trying to make, which they're modeling
by an influence diagram. So this is a Bayesian network
with the addition of decision nodes and value nodes. But you use Bayesian network
techniques to calculate optimal decisions here. And then this is kind of
the background knowledge of what we understand
about the relationships among different things in
the intensive care unit. And this is a representation
of the meta reasoning that says, which utility
model should we use? Which reasoning
technique should we use? And so on. And they built an
architecture that integrates these various approaches. And then in my last
2 minutes, I just want to tell you about
an interesting-- this is a modern view,
not historical. So this was a paper presented at
the last NeurIPS meeting, which said the kinds of problems
that we've been talking about, like the acute renal
failure problem or like any of these others,
we can reformulate this as a reinforcement
learning problem. So the idea is that if you
treat all activities, including putting somebody on a
ventilator or concluding a diagnostic conclusion or
asking a question or any of the other things
that we've contemplated, if you treat those
all in a uniform way and say these are
actions, we then model the universe as a
Markov decision process, where every time that you
take one of these actions, it changes the state
of the patient, or the state of our
knowledge about the patient. And then you do
reinforcement learning to figure out what
is the optimal policy to apply under all
possible states in order to maximize
the expected outcome. So that's exactly the
approach that they're taking. The state space is the set of
positive and negative findings. The action space is
to ask about a finding or conclude a diagnosis. The reward is the correct or
incorrect single diagnosis. So once you reach a
diagnosis, the process stops, and you get your reward. It's finite horizon
because they impose a limit on the number of questions. If you don't get an
answer by then, you lose. You get a minus one reward. There is a discount factor so
that the further away a reward is, the less value it has to you
at any point, which encourages shorter question sequences.
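To make that formulation concrete, here is a toy sketch of such an environment, not the authors' actual code: the state is a vector of finding values (+1 present, -1 absent, 0 not yet asked), the actions are either asking about a finding or committing to a diagnosis, and an episode ends on a diagnosis or when the question budget runs out. The sizes, budget, and discount below are all placeholder values.

```python
import random

N_FINDINGS, N_DISEASES, MAX_TURNS, GAMMA = 10, 3, 5, 0.99

class DiagnosisEnv:
    """State: +1 finding present, -1 absent, 0 not yet asked."""
    def __init__(self, true_disease, true_findings):
        self.true_disease = true_disease
        self.true_findings = true_findings      # indices of findings actually present
        self.state = [0] * N_FINDINGS
        self.turns = 0

    def step(self, action):
        kind, idx = action
        self.turns += 1
        if kind == "ask":
            self.state[idx] = 1 if idx in self.true_findings else -1
            done = self.turns >= MAX_TURNS
            reward = -1.0 if done else 0.0      # penalty for exhausting the budget
        else:                                   # "diagnose": episode ends either way
            done = True
            reward = 1.0 if idx == self.true_disease else -1.0
        return list(self.state), reward, done

# Random-policy rollout, just to exercise the interface; a Q network would
# replace the random choice and be trained on these transitions.
env = DiagnosisEnv(true_disease=1, true_findings={2, 5, 7})
done, discounted_return, t = False, 0.0, 0
while not done:
    if random.random() < 0.8:
        action = ("ask", random.randrange(N_FINDINGS))
    else:
        action = ("diagnose", random.randrange(N_DISEASES))
    state, reward, done = env.step(action)
    discounted_return += (GAMMA ** t) * reward
    t += 1
print(discounted_return)
```

The discount factor is what makes shorter question sequences preferable, since a correct diagnosis reached sooner is worth more.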
And they use a pretty standard Q learning framework, or at least
using a double deep neural network strategy. And then there are two
pieces of magic sauce that make this work better. And one of them
is that they want to encourage asking
questions that are likely to have
positive answers rather than negative answers. And the reason is
because in their world, there are hundreds and
hundreds of questions. And of course,
most patients don't have most of those findings. And so you don't want to ask
a whole bunch of questions to which the answer is no, no,
no, no, no, no, no, no, no, because that doesn't give
you very much guidance. You want to ask questions
where the answer is yes, because that helps you clue
in on what's really going on. So they actually
have a nice proof that they do this thing
they call reward shaping, which basically adds
some incremental reward for asking questions that
will have a positive answer. But they can prove that
an optimal policy learned from that reward
function is also optimal for the reward function
that would not include it. So that's kind of cool. And then the other
thing they do is to try to identify a
reduced space of findings by what they call
feature rebuilding. And this is essentially a
dimension reduction technique where they're co-training. In this dual network
architecture, they're co-training
the policy model. It's, of course, the
neural network model, this being the 2010s. And so they're generating a
sequence, a deep layered set of neural networks that
generate an output, which is the m questions and the n
conclusions that can be made. And I think there's
a soft max over these to come up with the right policy
for any particular situation. But at the same time, they
co-train it in order to predict all of the manifestations from
what they've observed before. So it's using-- it's learning
a probabilistic model that says if you've answered
the following questions in the following ways, here
are the likely answers that you would give to the
remaining manifestations. And the reason they
can do that, of course, is because they really
are not independent. They're very often co-varying. And so they learn
that covariance, and therefore can predict which
answers are going to get yes answers, which questions are
going to get yes answers. And therefore, they can bias
the learning toward doing that. So last slide. So this system is called REFUEL. It's been tested
on a simulated data set of 650 diseases
and 375 symptoms. And what they show is that the
red line is their algorithm. The yellow line uses only
this reward reshaping. And the blue line is just
a straight reinforcement learning approach. And you can see
that they're doing much better after many
fewer epochs of training in doing this. Now, take this with a grain of
salt. This is all fake data. So they didn't have real
data sets to test this on. They got statistics on
what diseases are common and what symptoms are
common in those diseases. And then they had
a generative model that generated this fake data. And then they learned from
that generative model. So of course it would be
really important to redo the study with real data,
but they've not done that. This was just published
a few months ago. So that's sort of where we
are at the moment in diagnosis and in differential diagnosis. And I wanted to
start by introducing these ideas in a kind
of historical framework. But it means that there are a
tremendous number of papers, as you can imagine, that have
been written since the 1990s and '80s that I was
showing you that are essentially elaborations
on the same themes. And it's only in the
past decade of the advent of these neural network
models that people have changed strategy,
so that instead of learning explicit
probabilities, for example, like you do in a Bayesian
network, you just say, well, this is simply
a prediction task. And so we'll predict
the way we predict everything else with neural
network models, which is we build a CNN, or an RNN,
or some combination of things, or some attention
model, or something. And we throw that at it. And it typically does a slightly better job than any of the previous learning methods that we've used, but not always. OK. Peace.