LILY PENG: Hi, everyone. My name is Lily, and I am
a physician by training. At Google I am a
product manager, and I work on a team
with physicians, research scientists, and software engineers that applies AI to health care problems. So today I'm going
to go over the three common myths in building
AI models for health care. So AI has been shown to have
huge potential for many very challenging problems,
including ones in health care. So, for example, we have seen
some very interesting work in the realm of applied
AI for eye disease, for breast cancer, skin
cancer, colon cancer. And the best part is
that this technology seems to work in the hands of
not just research scientists, but undergraduates, business
owners, and even high school students. And in the recent years,
we've seen a huge increase in the number of papers at the
intersection of deep learning and health care. And given the adoption of deep
learning based technologies in consumer products and
these exciting studies, one would expect that we would
have many enabled AI products in the health care space. However, the translation into
product has been quite slow. And why this gap between
expectation and reality? So the translation of
AI into health care is obviously more
challenging than it seems. And today I'm going to
cover three common myths in building and translating
AI models that might be contributing to this gap. There are clearly more
than three blockers, but these are the
ones that we've been able to identify as
we're working in this space. So the first myth
is more data is all you need for a better model. And what we found with our
work is that it's not just about the quantity of
data, but it's really about the quality of data. So I'm going to go over
one example of this. But there are a ton
of other examples of how data quality really
impacts algorithm performance. So this particular example
is rooted in our work in diabetic retinopathy. And so a few years
ago, we set out to see if we could train a
model to classify photographs for this disease. This is a complication
of diabetes that leads to vision loss. And the way that
we screen for it is to take pictures of
the back of the eye, and then read the images to
see if there are lesions that are consistent with either
mild disease or really, really severe disease, which we
call proliferative DR. So in this case, we started
with 130,000 images, and we worked with
54 ophthalmologists to produce 880,000 labels,
or ground truth diagnoses. We then took this data and trained a model using an existing CNN architecture called Inception, and produced a fairly accurate model, one whose performance rivaled that of the general eye doctors that were part of this study.
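For illustration only, here is a minimal Python sketch of that kind of training setup, assuming a TensorFlow/Keras environment; the directory path, image size, class count, and training settings are placeholder assumptions rather than the team's actual pipeline.

```python
# Minimal sketch (not the team's actual code): fine-tuning an Inception-style
# backbone for 5-class diabetic retinopathy grading.
import tensorflow as tf

NUM_CLASSES = 5          # e.g., no DR, mild, moderate, severe, proliferative
IMAGE_SIZE = (299, 299)  # InceptionV3's native input resolution

# Hypothetical directory layout: one subfolder per grade.
train_ds = tf.keras.utils.image_dataset_from_directory(
    "fundus_images/train", image_size=IMAGE_SIZE, batch_size=32)

base = tf.keras.applications.InceptionV3(
    include_top=False, weights="imagenet", pooling="avg")
base.trainable = True  # fine-tune the whole backbone

model = tf.keras.Sequential([
    tf.keras.layers.Rescaling(1.0 / 127.5, offset=-1),  # Inception-style preprocessing
    base,
    tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(optimizer=tf.keras.optimizers.Adam(1e-4),
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(train_ds, epochs=10)
```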
in the "Journal of the American Medical Association"
a few years back. So the punch line
of this paper was that we were able to train
a very accurate model. But there were also lots of
really interesting figures I think as a part of this
that actually told you a lot more about the process and
how to do this in the future. So a particularly useful
figure in this paper that doesn't get
as much attention is this one, figure four, where we tested how the size of the data set and the number of labels affect algorithm performance. What we find here, which I'll go into in more detail on the next slide, is that while in general more data is better, the key is actually high-quality data and an efficient labeling strategy. So in panel A, we looked at how
algorithm performance varies with the number of
images in the data set. The question was,
what would happen if you used a smaller
data set for training? So in this graph,
performance is on the y-axis, and the number of
images is on the x-axis. And each of these dots represents a different algorithm trained on a data set of a different size. We started off with
a few hundred images, and then used the full data set. And as you can see
from the figure, the performance plateaus
around 50,000 or 60,000 images. This means that for
this particular problem, we didn't actually have
to go up to 130,000 images to get comparable performance. This also means that for similar types of problems in the future, a data set of this size with accurate labels would be a good starting point.
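For illustration, here is a minimal sketch of that kind of learning-curve experiment; the train_model and evaluate_auc helpers are assumed, user-supplied functions, not the paper's actual protocol.

```python
# Minimal sketch: estimate how performance scales with training-set size by
# training on nested random subsets and evaluating on a fixed, well-labeled
# tune set. 'images' and 'labels' are assumed to be NumPy arrays.
import numpy as np

def learning_curve(images, labels, sizes, train_model, evaluate_auc, seed=0):
    """Return {subset_size: tune-set AUC} for each requested size."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(images))
    results = {}
    for n in sizes:
        subset = order[:n]                 # nested subsets keep the curve comparable
        model = train_model(images[subset], labels[subset])
        results[n] = evaluate_auc(model)   # evaluated on the same fixed tune set
    return results

# e.g. sizes = [500, 2_000, 10_000, 30_000, 60_000, 130_000]
```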
In panel B, we measured performance against the number of grades per image. The development data set had an average of 4 and 1/2 labels per image. And this was because we
found that multiple opinions from several doctors
provided a better ground truth than an opinion
from a single doctor. So we asked, what would happen to algorithm performance if you had noisy or imperfect labels for the images in the development set? So using the full data set, we trained models using a subsample of labels, either from the train set or the tune set. The train set is 80% of the images, and the tune set, sometimes called the validation set, is the other 20%. What we found was that decreasing the number of labels per image on the train set seemed to have little impact on performance. However, the algorithm's performance did depend a lot on the accuracy of the labels in the tune set, which is the orange line there. So the takeaway here is, given limited resources, invest in labeling the tune set.
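As a rough sketch of the kind of label-subsampling experiment described here, the snippet below simulates using fewer grader opinions per image; the majority-vote rule and helper names are illustrative assumptions, not the method from the paper.

```python
# Minimal sketch: aggregate a random subsample of grader opinions per image,
# to compare the effect of fewer labels on the train set versus the tune set.
import random
from collections import Counter

def aggregate_grade(grades):
    """Majority vote over the available grades; ties broken by the higher grade."""
    counts = Counter(grades)
    top = max(counts.values())
    return max(g for g, c in counts.items() if c == top)

def subsample_labels(all_grades_per_image, k, seed=0):
    """Keep at most k randomly chosen opinions per image before aggregating."""
    rng = random.Random(seed)
    return [
        aggregate_grade(rng.sample(grades, min(k, len(grades))))
        for grades in all_grades_per_image
    ]

# e.g. build labels with k = 1, 2, 3 opinions per image and compare the
# resulting models (or evaluation metrics) for the train set vs. the tune set.
```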
So we took these learnings and applied them to subsequent papers. In this paper, we leveraged a much smaller data set, a tuning set of a few thousand images whose labels were derived through an adjudication process with retina specialists. And then here we
were able to increase the overall performance
from a generalist level in the original paper
to a specialist level, leveraging this
smaller tune set. So that's just one example
of how data quality really impacts performance. And we can go into more later
on in our fireside chat. So the second myth is that
an accurate model is all you need for a useful product. And what we find here is that it's not just about the accuracy of the model, but its usability. So again, going
into one example. But there will be a ton more
color during the fireside chat. So building a machine
learning model is just one step toward preventing blindness or other diseases using AI. This model really needs
to be incorporated into a product that is
usable by doctors and nurses. So it's critical
that we study how AI can fit into clinical workflow. So going into an example
of our work in Thailand with some of our partners
at Rajavithi Hospital, we conducted a retrospective
validation study to make sure that the
model is generalizable. And it is. So this was the first step,
the retrospective study. Then we launched a prospective study to evaluate the performance
and feasibility of deploying AI into existing DR screening
clinics across the country. And earlier this year,
we closed recruitment with about 7,600 participants, all of whom were screened using AI
across nine different sites. And we're currently
in the process of analyzing the data, both
quantitative and qualitative. Now, to give you a sneak peek, what we've learned is that a human-centered approach is really critical to building useful products. Here we worked with HCI experts
and user experience experts to understand the feasibility
of implementing this product. We actually started
off with mapping out each step of the patient
journey from the minute they present at the
clinic to when they exit. And this helps us identify
potential inefficiencies and bottlenecks as we
implement the software. And as you can see here, it's not just the patient's journey that we follow, but also that of the nurses, technicians, and doctors. So we published this methodology in a recent paper in the CHI proceedings, where we cover not just product functionality, but the workflow design needed to maximize the product's potential. So this brings us
to the last myth, that a good product is
sufficient for clinical impact. In fact, what we found is
that while a good product is necessary for clinical
impact, we also have to address the product's
impact on the entire health care system. So taking a step
back, the truth is we can have the best
product in the world. But patients have to
have access to it. So, for example, one of the
reasons that people don't show up for screening at specialty eye hospitals has nothing to do with the product at all. For many patients in India
or in rural Thailand, the trek to the hospital
could take an entire day. And that means finding someone
to take care of their children, coping with lost wages. And it's just very, very
inconvenient, to the point that it's actually very difficult to do. So screening in facilities closer to where patients live, whether it's AI enabled or not, means that they can easily and efficiently access this kind of care. And this means that
they don't have to choose between getting
care for themselves and providing for
their loved ones. And, of course, access is not the only critical issue here. We also have to look at the cost effectiveness of these interventions. And this includes
how a product should be implemented to take into
account the downstream effects. This includes not just
the cost of screening, but also follow
up and treatment. A good example of
this is the work that Professor Wong and his team
at SERI in Singapore have done. They published a paper
recently in "The Lancet" sharing the results of
an economic modeling study comparing
two deep learning approaches, an automated
and a semi-automated one, so with humans in the loop. And what was interesting was that the semi-automated approach was actually the most cost saving, more so than AI alone or humans alone. So it's really exciting to see
some of this research come out. And I think there's going
to be a ton more that will be needed so
that we can adopt these technologies at scale. In summary, three common myths
in building and translating AI models. The first myth is
more data is all you need for a better model. And what we find is that it's
not just about data quantity, but also about data quality. The second myth,
an accurate model is all you need for
a useful product. And what we really find is that
a human center approach is also required to build
that useful product. And the third myth is
that a good product is sufficient for clinical impact. And here what we find is that
implementation in health care economic research is critical to
adoption of these AI products, and critical to adoption
of these products at scale. All right, thank you. So I'm here with
Kira Whitehouse, who's our lead for a lot
of the deployment work that we've done in
Thailand and in India. Kira is kindly joining
us today to talk about what happens when
you have a great AI model, and you're ready, you think, to
deploy it into the real world. KIRA WHITEHOUSE:
Thanks for having me. I'm so excited to chat with you
today about AI and health care. LILY PENG: So Kira, why don't
you tell the audience a little bit about yourself,
kind of what you do, and what goes into building
a product from an ML model. KIRA WHITEHOUSE: Absolutely. So I am a software engineer. I joined the team
about four years ago. And I've helped the
group get a CE marking for our medical device. And in terms of what goes
into the actual development of that device, once you
have a brilliant AI model, it comes down to going
through this process that we call design controls. So in the first
phase you're going to be thinking about
who your users are and what their needs
are, the intended use of the device,
that kind of thing. And then from there, you'll
map those requirements into software requirements. So you'll think about
how you're actually going to implement
those in your software. And then the next
stage is really thinking about risks, and
potential harm to your users, whether they're patients or
other folks who are interacting with the device. So in our case, right, we
have a screening device, which means that we're going
to be telling patients either to see an ophthalmologist,
a specialist, or we're telling
them, OK, you're fine. You're healthy for now. You can go home and
come back in a year. So there's different
risks associated with that kind of device
than with other kinds of devices, like assisted-read or second-read devices. And then to sort of
wrap up that process, you're going to be doing
something typically called verification
and validation, which just means you're
going to be making sure you built the right thing. You built it to spec,
and that it's actually going to be helping the users
in the way you think it should. So that whole end
to end process is what we think of as developing a medical device. And then obviously, you're going
to be working with partners to deploy that and get it
into the hands of patients and physicians. LILY PENG: So it sounds
like this process that you're describing,
is it specific for AI? Or is it just like
any kind of device? KIRA WHITEHOUSE:
Great question, yeah. This is pretty much
any kind of device. It could be software as
a medical device, which is typically abbreviated
SaMD, or it could be hardware. So in certain cases, again,
if you think about risks and how your device is going to
be used in hardware scenarios, you actually will
think about things like how is shipment going to
damage my device potentially? Like the process of actually
getting your camera, let's just say fundus camera,
over from the United States to Europe, as an example. And for us, for the
software medical devices, we have different kinds
of considerations. So we decided to develop
an API only device. So we don't have our
own user interface. And that means that we
rely on our partners to actually display the results
of the device in their UI, whether it's an EMR or a
PACS, right? And with that there's
some benefits, which is we have this
seamless integration. So we don't have to force
the health care system to change their workflow. They're already using
their own user interface. They can just
display our results. And that can be really
useful in some contexts, as you've seen, Lily. In other contexts, if we're
trying to deploy in a setting where there's no existing
infrastructure, like we've experienced in Thailand,
that can actually be really challenging. They're only on a
paper-based workflow. In terms of AI and
some differences here, so if you use
YouTube as an example, you're going to see the latest
and greatest videos there will be suggested to you. So you'll get results
from today or yesterday maybe at the latest. And with our medical
device software, we are also deploying our
device on a regular basis. So just like with
YouTube, we're going to be deploying our
software maybe every day, every week, something like that. With the AI component,
though, of the medical device, we typically don't roll that out
more than once every six months or something like that, because
that's a more major update. So when you think about
the risk of the device and what components are
lending to that risk, a lot of the sort of serious
logic that could cause harm is within the AI, because
the AI is the thing that's taking in the image
and then predicting, do you have disease or not? Does that make sense? So it is kind of interesting to
think about the counterexample of YouTube, where that AI
might deploy on a daily basis, whereas in our situation, we
have this much longer time frame. So Lily, we were talking about
how in the design input phase you want to figure out
what your intended use is, who your users are,
what their needs are. And when we had
this AI model, we could have done a number
of things with it, right? We could have used it in
an assisted read context. We could have used
it in a second read context or something else. We ended up choosing screening. Can you talk a little
bit about why that was the decision that we made? LILY PENG: One of the really
amazing things about AI is the ability to move
care closer to patients. So what I mean is that there
are a lot of restrictions on how a person can get care. And a lot of it has to do with
whether or not the health care professional is actually physically located within a certain area, right? What technology
allows you to do, whether it's through
telemedicine or AI enabled telemedicine, is to bring
care closer to patients. And so one of the things that
we've seen with screening, in particular, is that access
issue is a really big barrier. So if you put screening
closer to a patient, your screening rates will go up. Whether it's AI enabled or not,
screening closer to a patient means higher adherence rates. And so with screening, that's
actually what matters the most. It's like you're really
trying to catch everybody. And so this is where we thought
AI could be the most useful in terms of accessibility. Also, it probably is pretty
well adapted to the problem, just because AI is really good at high-volume, repetitive tasks. And screening tends to fit that mold quite a bit. So the two aspects of screening that make AI really applicable to this problem space are the accessibility requirements, as well as the scale and sheer volume of screening procedures that are required. One of the myths
that I talked about was that all you need
for a useful product is a more accurate model. And so what do you think
about that statement? Like where are the
caveats to that statement? KIRA WHITEHOUSE: I guess one of
the topics that we can touch on is image quality. That's been a big problem
in our actual deployment sort of in the field. So our device, right, we
take in an image that's off of the back of your eye. So if you've gotten your
retinal exam before, you might have done a slit
lamp examination in person. But if you've gotten
a fundus photograph, they'll be shining a
light through your pupil, and take a photo of
the back of your eye. And our algorithm
takes that as input. Sometimes the images themselves
are really low quality. So maybe half of the
image is obscured. Maybe it's just really blurry. Sometimes there will be
dust on the camera lens, and that will cause
either a lesion to be obscured by
the piece of dust, or potentially there'll be
something on the image that looks like a lesion that's not. It's just a dust spot, right? And those problems, Lily,
from your perspective, because you're a physician,
if you got an image like that, right, what would you do? In what cases would you
decide that an image was associated with a patient
who is diseased versus not? Would you be able
to make that call? Cause those are the
kinds of challenges that our AI has as well, right? LILY PENG: Yeah, yeah, I
mean, as a physician we would probably just look at the other pictures from the same day and see if, let's say, the dust spot was still there. So how would we address this on the AI level? KIRA WHITEHOUSE: So we could
do actually something similar, which is interesting. If we had a very
tight integration with the camera itself,
we could actually provide signal to users
when they're actually taking the photo. So when the camera is put
up to the patient's eye, we could have a little
bubble that pops up and says, hey, it looks like there's
a dust spot on your camera. Can you clean it? We could also maybe help
them if, for instance, we see that the patient
maybe has cataracts, there's some media
opacity that's preventing the light from
getting to the back of the eye and getting a good photo. We could tell them, hey, it looks like this patient has cataracts. Maybe try these different things to capture a good photo.
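As one illustrative example of that kind of capture-time signal, the sketch below flags very blurry photos with a simple variance-of-the-Laplacian check in OpenCV; the threshold, and the choice of this particular check, are assumptions rather than a description of the actual device.

```python
# Minimal sketch: a crude gradability check that flags very blurry fundus
# photos before they are sent to the model or accepted by the operator.
import cv2

BLUR_THRESHOLD = 100.0  # illustrative; tune on known-gradable vs. ungradable images

def is_probably_gradable(image_path):
    gray = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    if gray is None:
        return False  # unreadable file counts as ungradable
    sharpness = cv2.Laplacian(gray, cv2.CV_64F).var()  # variance of the Laplacian
    return sharpness >= BLUR_THRESHOLD

# In a real deployment this would be one of several checks (field of view,
# exposure, artifacts), ideally surfaced to the operator at capture time.
```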
We've also seen, though, that a lot of the time it ends up being really simple solutions that have nothing to do with technology, right? So in some cases, it's just that they need to install better curtains in the clinic. Because by having
a darker setting, the pupil's going
to be more dilated and you can get a better photo. In other cases, we've seen
that in certain clinics patients will often come
with personal belongings. And then they'll be
sitting at the camera trying to get a good photo
with their purse or handbag. And maybe it's hard to
get a good position. So even installing
something like a shelf could potentially
be helpful there. LILY PENG: So it sounds
like image capture, or just getting
the right pictures to put into the system
in itself is challenging. KIRA WHITEHOUSE: Right, that's
one problem that we see. Another problem
that's interesting, and also it'd be great to
hear from you about why this is challenging, but even
if we have an amazing AI, and let's say we can
get good quality images, we still see these problems of
patient recruitment and patient follow up, which means
we're not actually getting the patient base that
we want into the clinic to get screened. And then when the
patients are there, and we give them an
output from the device, and we tell them
they're diseased, we actually see that a
lot of these patients don't even come back
to see the specialist. And I'd love to hear from
your perspective, what are the obstacles there
that you've experienced, or that in talking with other
health care professionals you've seen? LILY PENG: Yeah,
yeah, I think it's really interesting that you
can get the best images. And then you can
have the best model. And then you can give
people the information. But if you don't make that
information actionable and easily actionable, you
may have lost the game, so to speak, right there. Right? And so what we found sometimes
is that the information comes too late, right? So a lot of times
patients are told, hey, we'll send you a
mail, or we'll give you a call if anything's wrong. No news is good news, right? And so then they miss
the call, or they don't get the piece of mail. And they think, well, no
news is good news, right? So the default is no follow-up. So the timeliness of that information, in itself, can be problematic. And that's why an automated system can be really helpful: you can get that information to the patient almost instantaneously. Now, that instantaneous delivery of information then enables a bunch
of other things, right? So same day follow up
is yet another thing that we've seen that seems
to make a big difference in adherence rates. And we've actually interviewed
a lot of patients to ask why. And a lot of times our
studies and other studies have shown that the number
one reason is transportation. Right, it's not I
don't understand, I didn't realize
I was sick, or I didn't believe I
was going to get better, which are also reasons. But number one is
I can't get a ride, or I can't take time
off my schedule. So a lot of times it feels crazy that we're the AI people, and yet we're not the ones able to provide the solution here. It's actually quite common-sense solutions that need to be implemented well. So we've covered what it
takes to take an AI model and kind of verify it, validate
it, and then potentially put this into a clinic. What are the things that
you have to do after that? Are you done once you kind of
sell that piece of software or install that
piece of software? What else is there to do? KIRA WHITEHOUSE:
One of the things that's kind of exciting
about software is that you can monitor it, right? And before you go and
deploy the medical device, when you're getting regulatory
approval from getting a CE marking, or
FDA approval, you're going to go through some
sort of validation study. So you'll be validating
that your device works against some population. And it's usually not possible
to have representation from every single population,
thinking of sex, age, ethnicity, and whatnot. So one of the things
that's really important is to make sure
that your device is functioning as intended whenever
and wherever you deploy it, right? So we actually have kind of
a fun, creative, post market monitoring solution
where we take a subset of the images
that are captured during clinical
workflows, and we actually have doctors adjudicate
them in-house to see what the grade should be. And we compare that
result to [INAUDIBLE]. So we can see the performance of the device when it's actually live and impacting real patients. So that's been really exciting to see.
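For illustration, here is a minimal sketch of that kind of post-market comparison, assuming a hypothetical "referable at grade 2 or worse" cutoff; it is not the team's actual monitoring pipeline.

```python
# Minimal sketch: compare live device grades against in-house adjudicated
# grades on a sampled subset of clinical images.
def referable(grade):
    """Assumption: grades 2+ (moderate or worse) count as 'referable'."""
    return grade >= 2

def monitoring_metrics(device_grades, adjudicated_grades):
    pairs = list(zip(device_grades, adjudicated_grades))
    tp = sum(referable(d) and referable(a) for d, a in pairs)
    fn = sum(not referable(d) and referable(a) for d, a in pairs)
    tn = sum(not referable(d) and not referable(a) for d, a in pairs)
    fp = sum(referable(d) and not referable(a) for d, a in pairs)
    return {
        "sensitivity": tp / (tp + fn) if tp + fn else None,
        "specificity": tn / (tn + fp) if tn + fp else None,
        "n": len(pairs),
    }

# Tracked over time, a drop in these numbers would flag a problem with the
# deployed device or a shift in the incoming image population.
```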
The other things involved with post market are handling customer feedback, whether feature requests or complaints. Making sure that if
they have complaints, there's no defects
with the device. Or if there are, we follow
up and address them. And then for feature
requests, that's kind of an exciting thing to
see that our device has been useful, and how we
can make it even more impactful to our users. LILY PENG: So Kira,
it sounds like there are a lot of
expectations around AI. What are some common
misconceptions of what people think AI
can do, where it maybe isn't able to right away? KIRA WHITEHOUSE: So
one example of this, our device takes as input a
45 degree field of view image. So again, when you think of the
light going through your pupil and taking a photograph
of the back of your eye, it's going to get
some portion of it. And there's different field
of views that you can capture. So you can get something called
an ultra wide field image, which is up to 200 degrees. And we received some feedback
from partners at some point that they were expecting our
AI could take the smaller field of view, the
45 degree image, and actually predict
what you can see or what humans could
typically see only using the bigger field
of view, the 200 degree field of view image. And that's really
interesting feedback, right? If we had the data, if we
had a bunch of paired data that was ultra
wide field with 45, we might be able to train a
model to do that, potentially. And you can see why people might
have that sort of misconception if you think about the risk of
heart attack, the cardio paper that our team published. AI being able to
take a fundus image and then predict the
likelihood of having a heart attack or some
other cardiac event, compared to going from a 45 degree to a 200 degree field of view, it kind of makes sense that people might think, oh, you could just do a little bit more using this smaller image, right? So that's been really
interesting to see. As I mentioned,
data can sometimes be an obstacle to actually developing a model that can do things
like predict cardio, or predict disease that's
only seen from these wider field of view images. So Lily, I have to ask you,
if you could have any data, snap of your fingers,
what kinds of models would you want to train? LILY PENG: That is a
fantastic question. And I think it goes to the core
of problem selection, right? What are the things
that you want to train a model to
do that actually is helpful for the patient, right? I generally find that
picking a problem where there is an intervention,
it's a lower priority. Like I think there
are some things where you can predict
a risk of progression of a certain disease,
or another one. But if there isn't
an intervention that you can take because
of that prediction, then it's probably not
going to do as much good as if you would change
your course based on that prediction, right? So the way that I
think about it is that the prediction needs to
be actionable in some way, first and foremost. So what I mean by
actionable is for screening, if we find that this person
needs to be followed up at a shorter time
interval, three months versus a year,
that is an action that we would do differently. And that makes that problem
a good problem to tackle. If it's no change, we
could say, no matter what, this person's going
to be followed up in a year. That becomes a less
interesting problem I think. So that would be the first
criteria of problem selection is the actionability
of what you're doing. The next one is I would
think about scale, which is how many people would
benefit from this task being done correctly. And within that,
there's two components. One is, how many people are
getting the procedure done, but also how many
people wouldn't get the procedure done, or would
be potentially misdiagnosed if you didn't do this. Right, so I think that's
the second component of it. So I think one of the
interesting things about these particular
selection criteria is that there are
actually already programs in the medical
community where we do this. And those are called
screening programs. So screening for breast cancer,
screening for lung cancer, screening for colon
cancer, or screening for diabetic complications. And it's because
overall we found that if we screen
people early, we're able to help them live
happier, longer lives. And so I think screening
programs is actually a really big deal. And within screening programs,
the more hard outcomes, as we say, that we have, the better the data for that particular problem. So what I mean by hard
outcomes is really survival rates for
cancers, for example, vision loss rates for
diabetic retinopathy, so things that really
matter a lot to patients, rather than kind of
other proxy outcomes. So the more concrete
you are about how that affects patients, the
better the problem for ML. KIRA WHITEHOUSE: Your
background in health care, Lily, I'm really curious. We, Google, have come a long
way in the last four years getting this project from
just a machine learning model, to actually getting
a CE marked device, and having that device being
used by health care providers and impacting patients. I'm curious, have
you seen a shift in health care providers'
perception of AI, either from us or from
everyone else in the industry, in tech, and in
medicine that are making these sorts of devices? LILY PENG: Yeah, I think
there has definitely been a shift in how we
think about AI in medicine. I think when we first
started, the question was if, if AI would have
an impact on medicine. I think now the conversation
has changed a little bit to the how. How will AI impact medicine? And how do we do it responsibly? How do we do it in a way that
safeguards patient privacy, but also maximizes
patient impact, right? So really we've gotten to how
the implementation is done, because, honestly, there's
a lot of research out there that shows the potential
of AI to make big changes, and ensure better, more
equitable care for lots of folks, if
implemented correctly. And so I think that's
where lots of folks are spending a lot
of time, obviously external to Google,
and within Google as well is
understanding the how. And so some of the work that
Kira and her team are doing is really helping us gather
information on the how, right? How does this ML model
fit into a product? How does this product-- how is it verified to
be safe and effective? How do we put it into
a health care system such that patients are
actually benefiting from it? And then how do we monitor
the products in real time, or near real time
so that we make sure that the
diagnoses are accurate, and we find out if anything
goes wrong quickly. So I think the how here is now
kind of the next big mountain to scale. And I think we've made some
really, really good progress there. KIRA WHITEHOUSE:
Absolutely, yeah. It's been so exciting to
see really what we've done. And also, I love
the way you framed that of how health
care is maybe moving from an if AI can help to how. That's a very,
very exciting time. LILY PENG: So I am here
with Scott, one of our ML leads on the team. And Scott is the lead for
our [INAUDIBLE] paper, which was published in
"Nature" recently, as well as has done
a lot of modeling with other types of
radiological images and data. So Scott, obviously
you've trained a lot of models in your life. So do you have any tips or
rules of thumb for people listening to this cast? SCOTT MCKINNEY:
Absolutely, I'm hoping to help people avoid all
the mistakes that I've made along the way. And there have certainly
been many of them. So the first one that I'd
encourage people to do is visualize their data. The second tip is question the
construction of your data set. And the third is
really make sure that you're trying
to solve a problem with genuine clinical
utility, rather than something that's just easy to model. LILY PENG: OK, so what I
hear is visualize your data, question the construction
of your data sets, and solve a problem with
genuine clinical utility. So for our audience, can you
elaborate a little bit more about what you mean by each? SCOTT MCKINNEY: I think that
in building machine learning models for medicine,
people often blind themselves to the data. And they're thinking
that they're not going to be able to make sense
of it in the first place. And it's intimidating,
because the experts who interpret these images
may have trained for years to be able to do this well. But I think that if you
don't actually get in there and look at the data, you
can miss important patterns. So I encourage people to get
familiar with the examples and inspect the
images when you can, because there may be
obvious things wrong that you don't need a
medical degree to notice. So I can give you an
example of this in practice. We were building a model to
find pulmonary nodules, which are potential lung
cancers in chest x-rays. And we had built a model that
was performing astonishingly well. And obviously this
is very exciting. But when we looked at some
of the true positives, these are the cancers that the
model was supposedly catching, we found bright circles
overlaid on the images. LILY PENG: Oh. SCOTT MCKINNEY: And so
these digitized x-rays had had pen markings on them
from prior interpretations. They put them up on the
lightbox and circled the nodules that they were worried about. And so this obviously
invalidated a lot of our work, because all of this
effort had gone into building a very
sophisticated circle detector. So clearly this would have been
easy to detect if we had just spent some time browsing the data and noticing that it was contaminated in this way.
LILY PENG: Yeah, yeah, it totally sounds like for some of these first-line passes, you don't really need a medical degree to make sense of the data and, for example, find this circle sign. SCOTT MCKINNEY: Absolutely. Yeah, so these sorts
of patterns can be really stark and
easy to see just by thumbing through some of
them without having fellowship training in radiology. LILY PENG: Yeah, yeah. And so tell me a little bit
more about rule number two, about the construction
of your data sets. Where have you
seen this go right? And where have you seen
this gone terribly wrong? SCOTT MCKINNEY: Yeah, so I think
that it's really easy to assume that whoever's
curating this data, especially if they are
on the health care side, knows what they're doing. And they're going to
deliver you something that is machine learning
friendly out of the gate. But unfortunately, when
constructing a data set for machine learning,
it's really easy to introduce
confounding variables that the model can
then use to cheat. And obviously, models that
cheat won't generalize. So it's really important to
interact with the curators. These are maybe the IT folks who
are putting together the data sets, gain an understanding
of how the data is sourced, and be on the alert for spurious
correlations between some of the inputs, whether they're
images or medical records and some of the labels
that might enable the model to cheat. So an example that
I think has probably occurred in many domains,
but for us occurred when training a model
to identify tuberculosis from chest x-rays. So we worked with a
partner who had given us a bunch of positives and
a bunch of negatives. And we did the
obvious thing, which is train a classifier
to distinguish between those positives
and negatives. And the first model we
trained was fantastic. And so we were excited,
but again, cautious. And so we investigated,
and discovered that all the positives were from
one hospital, all the negatives from another. Now, these images
looked quite different, coming from the
different hospitals, using different scanners
with different parameters. And so to detect
tuberculosis, all the model would have to do is identify
which hospital it came from. And this information
is encoded in the image through its pixels in
a pretty obvious way, and has nothing to do with
any anatomy or physiology. And it's just kind of patterns
in the texture of image, or the contrast of the image,
or even in some of the markings that they might put in the
image when setting it up. And these models are lazy. And they're going to
cheat if given the chance. And so we've definitely been stumped by this phenomenon in more than one spot.
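A minimal sketch of the kind of sanity check that can catch this, assuming each record carries a 'site' and a binary 'label' field (hypothetical names):

```python
# Minimal sketch: check whether a metadata field such as the source hospital
# can predict the label by itself, which would let a model "cheat" on site
# rather than pathology.
from collections import Counter, defaultdict

def label_rate_by_site(records):
    """records: iterable of dicts with 'site' and 'label' (0/1) keys (assumed)."""
    by_site = defaultdict(Counter)
    for r in records:
        by_site[r["site"]][r["label"]] += 1
    for site, counts in sorted(by_site.items()):
        total = sum(counts.values())
        print(f"{site}: n={total}, positive rate={counts[1] / total:.2f}")

# If one site is ~100% positive and another ~0%, the split (or the data
# collection itself) needs to change before the labels mean anything.
```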
LILY PENG: Yeah, yeah. It sounds like what I'm hearing is that for folks who are clinical, who have spent their whole lives not in machine learning, we could actually do a lot in terms of educating them, or sharing knowledge with them, about how to construct data sets and how these models work. Because I think a
lot of clinicians, if we tell them we want X
number of positives and Y number of negatives, they kind
of find those things. But they're also not vigilant
about all these other things that machine learning experts almost have as background knowledge and don't even think about. It's like these things that you just do. But clinicians don't
necessarily know that yet. And it's actually
quite helpful to let them know the underpinnings
of how machine learning models cheat. SCOTT MCKINNEY: Yeah, I think
that's really well said. I think there is a general mutual unfamiliarity: data scientists
tend to be aloof from some of the
clinical aspects, and likewise, some
of the clinicians might be a little naive
to some of the phenomena that we're familiar with
in machine learning. And so when we
might specify data set characteristics
that will help us, those are taken very literally. And there are certain
dimensions that could be ignored and that kind of
stymie the endeavor. So yeah, there has to be
a lot of communication and a lot of conversation to
make sure things are done well. LILY PENG: Yeah, yeah, sounds
like lots of talking involved. SCOTT MCKINNEY: Yeah, which
is hard for a lot of us data scientists. [LAUGHS] LILY PENG: Yeah, yeah. OK, so Scott, the third rule,
tell me a little bit more about it. Solve a problem with
genuine clinical utility. What does that mean? SCOTT MCKINNEY:
It's easy, when thinking about tackling a problem in diagnostics, to overly generalize a label and think that if you can identify that condition, then it's always useful to a clinician. In fact, you may need to narrow that definition in order to surface cases that are actually clinically relevant and actually actionable. So the example that
comes to mind here is when looking
for pneumothorax, which is also described as
collapsed lung in chest x-rays. Now, this is a life threatening
condition that, importantly, can be treated very easily. You stick a tube in
the chest, and the lung will be able to reinflate. Now, we had a big data
set of chest x-rays. And we labeled them as
having pneumothorax or not having pneumothorax. And we turned the
crank, and we learned to classify these x-rays and
find those with pneumothorax. Now, the problem is that for
every case of pneumothorax that's found,
you'll also acquire x-rays to watch the condition
resolve once the chest tube has been placed. And that means that in a
retrospective data set, most of the x-rays that
supposedly have pneumothorax have an already
treated pneumothorax. And, in fact, the already
treated pneumothorax is easy to spot because there's
a big chest tube in the image. Now, obviously, these
are not the cases that need to be identified,
because the doctors are already aware of them. And so when you
define labels, you want to make sure that the positives are really the positives you care about, and that you don't have an overly broad definition that encompasses mostly cases that are already being treated.
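As an illustration of narrowing the positive definition, here is a minimal sketch that drops already-treated studies from the positive class; the 'has_chest_tube' and 'label' fields are hypothetical, not from the actual data set.

```python
# Minimal sketch: keep the clinically useful positives (untreated pneumothorax)
# and exclude already-treated studies, whose chest tubes give the label away.
def relabel_for_screening(studies):
    """Return (image_path, label) pairs where positives exclude treated cases."""
    examples = []
    for s in studies:
        if s["label"] == "pneumothorax" and s["has_chest_tube"]:
            continue  # already treated and known to clinicians; exclude or audit separately
        label = 1 if s["label"] == "pneumothorax" else 0
        examples.append((s["image_path"], label))
    return examples
```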
And in this case, the fact that the treated pneumothoraces have a very overt signal that enables the model to cheat is doubly bad, because
not only are your metrics maybe off because of the
composition of the data set, but it also means that
the computer vision model you built
probably isn't doing what you think it's doing. LILY PENG: Yeah, yeah. So it almost sounds
like in this case there's like two clinical
problems, both called pneumothorax, right? The first one is
undetected, doctors don't know about it, untreated. And the second one
is a treated one. And we had inadvertently
solved the latter that had limited clinical utility,
rather than the former, which had genuine clinical utility. SCOTT MCKINNEY: That's right. We collapsed the positive
category into one. And unfortunately, the
treated pneumothoraces that have little clinical
utility overwhelm the sort of needle
in the haystack, undiagnosed pneumothorax, which
is the one that we really do want to target with
machine learning. LILY PENG: Got it. Got it. OK, so the three rules. The first one is,
visualize your data. Definitely look at the
thing, even if you do not have a medical degree. If you have friends with a
medical degree, co-workers, even better. But definitely
look at your data. The second rule I'm hearing
is question the construction of your data set. And then the third
rule is solve a problem with genuine clinical
utility, including thinking about the
timing in which you're getting this information
to the clinician. So I feel like those
are the three rules. Do you have any kind
of overarching thing that you always kind of have
at the back of your head that helps you
train better models or helps you tackle this space? SCOTT MCKINNEY: Yeah,
I think the thing that ties these
together is skepticism. Be really skeptical
of good results. Machine learning in
medicine is really hard. The combination of small
and messy data sets, coupled with the perceptual
challenge of the task means that easy wins
are really elusive. And so you should dig into
these and see what you can find, because there's probably a bug. LILY PENG: Yeah, yeah. SCOTT MCKINNEY:
At least at first. LILY PENG: Yeah, for sure. So if it feels easy,
it's probably too easy. That's what I'm hearing. SCOTT MCKINNEY: That's right. I think people have been
working in this field long enough to pick all
the low hanging fruit. So yeah, be skeptical. LILY PENG: So
Scott, thank you so much for talking to me, and
for sharing your knowledge with the rest of the audience. Thank you for the three
rules and the special sauce. And hopefully this will help
everyone train better models, and be a little more vigilant
for when models like to cheat. SCOTT MCKINNEY: Absolutely,
it was a pleasure. And I hope everybody else has
an easier time than we did. [CHUCKLES] LILY PENG: All right, thanks. [MUSIC PLAYING]