LILY PENG: Hi, everyone. My name is Lily, and I am
a physician by training. At Google I am a
product manager, and I work on a team
with physicians, research scientists, and software engineers that applies AI to health care problems. So today I'm going
to go over the three common myths in building
AI models for health care. So AI has been shown to have
huge potential for many very challenging problems,
including ones in health care. So, for example, we have seen
some very interesting work in the realm of applied
AI for eye disease, for breast cancer, skin
cancer, colon cancer. And the best part is
that this technology seems to work in the hands of
not just research scientists, but undergraduates, business
owners, and even high school students. And in the recent years,
we've seen a huge increase in the number of papers at the
intersection of deep learning and health care. And given the adoption of deep
learning based technologies in consumer products and
these exciting studies, one would expect that we would
have many enabled AI products in the health care space. However, the translation into
product has been quite slow. And why this gap between
expectation and reality? So the translation of
AI into health care is obviously more
challenging than it seems. And today I'm going to
cover three common myths in building and translating
AI models that might be contributing to this gap. There are clearly more
than three blockers, but these are the
ones that we've been able to identify as
we're working in this space. So the first myth
is more data is all you need for a better model. And what we found with our
work is that it's not just about the quantity of
data, but it's really about the quality of data. So I'm going to go over
one example of this. But there are a ton
of other examples of how data quality really
impacts algorithm performance. So this particular example
is rooted in our work in diabetic retinopathy. And so a few years
ago, we set out to see if we could train a
model to classify photographs for this disease. This is a complication
of diabetes that leads to vision loss. And the way that
we screen for it is to take pictures of
the back of the eye, and then read the images to
see if there are lesions that are consistent with either
mild disease or really, really severe disease, which we
call proliferative DR. So in this case, we started
with 130,000 images, and we worked with
54 ophthalmologists to produce 880,000 labels,
or ground truth diagnoses. We then took this data and trained a model using an existing CNN architecture called Inception, and produced a fairly accurate model, one whose performance rivaled that of the general eye doctors that were part of this study.
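For illustration only, here is a minimal Python sketch of that kind of training setup, assuming a TensorFlow/Keras environment; the directory path, image size, class count, and training settings are placeholder assumptions rather than the team's actual pipeline.

```python
# Minimal sketch (not the team's actual code): fine-tuning an Inception-style
# backbone for 5-class diabetic retinopathy grading.
import tensorflow as tf

NUM_CLASSES = 5          # e.g., no DR, mild, moderate, severe, proliferative
IMAGE_SIZE = (299, 299)  # InceptionV3's native input resolution

# Hypothetical directory layout: one subfolder per grade.
train_ds = tf.keras.utils.image_dataset_from_directory(
    "fundus_images/train", image_size=IMAGE_SIZE, batch_size=32)

base = tf.keras.applications.InceptionV3(
    include_top=False, weights="imagenet", pooling="avg")
base.trainable = True  # fine-tune the whole backbone

model = tf.keras.Sequential([
    tf.keras.layers.Rescaling(1.0 / 127.5, offset=-1),  # Inception-style preprocessing
    base,
    tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(optimizer=tf.keras.optimizers.Adam(1e-4),
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(train_ds, epochs=10)
```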
in the "Journal of the American Medical Association"
a few years back. So the punch line
of this paper was that we were able to train
a very accurate model. But there were also lots of
really interesting figures I think as a part of this
that actually told you a lot more about the process and
how to do this in the future. So a particularly useful
figure in this paper that doesn't get
as much attention is this one, figure four, where we tested how the size of the data set and the number of labels affect algorithm performance. What we find here, which I'll go into in more detail on the next slide, is that while in general more data is better, the key is actually high-quality data and an efficient labeling strategy. So in panel A, we looked at how
algorithm performance varies with the number of
images in the data set. The question was,
what would happen if you used a smaller
data set for training? So in this graph,
performance is on the y-axis, and the number of
images is on the x-axis. And each of these dots represents a different algorithm trained on a data set of a different size. We started off with
a few hundred images, and then used the full data set. And as you can see
from the figure, the performance plateaus
around 50,000 or 60,000 images. This means that for
this particular problem, we didn't actually have
to go up to 130,000 images to get comparable performance. This also means that for similar types of problems in the future, a data set of this size with accurate labels would be a good starting point.
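For illustration, here is a minimal sketch of that kind of learning-curve experiment; the train_model and evaluate_auc helpers are assumed, user-supplied functions, not the paper's actual protocol.

```python
# Minimal sketch: estimate how performance scales with training-set size by
# training on nested random subsets and evaluating on a fixed, well-labeled
# tune set. 'images' and 'labels' are assumed to be NumPy arrays.
import numpy as np

def learning_curve(images, labels, sizes, train_model, evaluate_auc, seed=0):
    """Return {subset_size: tune-set AUC} for each requested size."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(images))
    results = {}
    for n in sizes:
        subset = order[:n]                 # nested subsets keep the curve comparable
        model = train_model(images[subset], labels[subset])
        results[n] = evaluate_auc(model)   # evaluated on the same fixed tune set
    return results

# e.g. sizes = [500, 2_000, 10_000, 30_000, 60_000, 130_000]
```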
In panel B, we measured performance against the number of grades per image. The development data set had an average of 4 and 1/2 labels per image. And this was because we
found that multiple opinions from several doctors
provided a better ground truth than an opinion
from a single doctor. So we asked, what would happen to algorithm performance if you had noisy or imperfect labels for the images in the development set? So using the full data set, we trained models using a subsample of labels, either from the train set or the tune set. The train set is 80% of the images, and the tune set, sometimes called the validation set, is the other 20%. What we found was that decreasing the number of labels per image on the train set seemed to have little impact on performance. However, the algorithm's performance did depend a lot on the accuracy of the labels in the tune set, which is the orange line there. So the takeaway here is, given limited resources, invest in labeling the tune set.
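As a rough sketch of the kind of label-subsampling experiment described here, the snippet below simulates using fewer grader opinions per image; the majority-vote rule and helper names are illustrative assumptions, not the method from the paper.

```python
# Minimal sketch: aggregate a random subsample of grader opinions per image,
# to compare the effect of fewer labels on the train set versus the tune set.
import random
from collections import Counter

def aggregate_grade(grades):
    """Majority vote over the available grades; ties broken by the higher grade."""
    counts = Counter(grades)
    top = max(counts.values())
    return max(g for g, c in counts.items() if c == top)

def subsample_labels(all_grades_per_image, k, seed=0):
    """Keep at most k randomly chosen opinions per image before aggregating."""
    rng = random.Random(seed)
    return [
        aggregate_grade(rng.sample(grades, min(k, len(grades))))
        for grades in all_grades_per_image
    ]

# e.g. build labels with k = 1, 2, 3 opinions per image and compare the
# resulting models (or evaluation metrics) for the train set vs. the tune set.
```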
So we took these learnings and applied them to subsequent papers. In this paper, we leveraged a much smaller data set, a tuning set of a few thousand images whose labels were derived through an adjudication process with retina specialists. And then here we
were able to increase the overall performance
from a generalist level in the original paper
to a specialist level, leveraging this
smaller tune set. So that's just one example
of how data quality really impacts performance. And we can go into more later
on in our fireside chat. So the second myth is that
an accurate model is all you need for a useful product. And what we find here is that it's not just about the accuracy of the model, but its usability. So again, going
into one example. But there will be a ton more
color during the fireside chat. So building a machine
learning model is just one step toward preventing blindness or other diseases using AI. This model really needs
to be incorporated into a product that is
usable by doctors and nurses. So it's critical
that we study how AI can fit into clinical workflow. So going into an example
of our work in Thailand with some of our partners
at Rajavithi Hospital, we conducted a retrospective
validation study to make sure that the
model is generalizable. And it is. So this was the first step,
the retrospective study. Then we launched a prospective study to evaluate the performance
and feasibility of deploying AI into existing DR screening
clinics across the country. And earlier this year,
we closed recruitment with about 7,600 participants, all of whom were screened using AI
across nine different sites. And we're currently
in the process of analyzing the data, both
quantitative and qualitative. Now, to give you a sneak peek, what we've learned is that a human-centered approach is really critical to building useful products. Here we worked with HCI experts
and user experience experts to understand the feasibility
of implementing this product. We actually started
off with mapping out each step of the patient
journey from the minute they present at the
clinic to when they exit. And this helps us identify
potential inefficiencies and bottlenecks as we
implement the software. And as you can see here, it's not just the patient's journey that we follow, but also that of the nurses, technicians, and doctors. So we published this methodology in a recent paper in the CHI proceedings, where we cover not just product functionality, but the workflow design needed to maximize the product's potential. So this brings us
to the last myth, that a good product is
sufficient for clinical impact. In fact, what we found is
that while a good product is necessary for clinical
impact, we also have to address the product's
impact on the entire health care system. So taking a step
back, the truth is we can have the best
product in the world. But patients have to
have access to it. So, for example, one of the
reasons that people don't show up for screening at specialty eye hospitals has nothing to do with the product at all. For many patients in India
or in rural Thailand, the trek to the hospital
could take an entire day. And that means finding someone
to take care of their children, coping with lost wages. And it's just very, very
inconvenient, to the point that it's actually very difficult to do. So screening in facilities closer to where patients live, whether it's AI enabled or not, means that they can easily and efficiently access this kind of care. And this means that
they don't have to choose between getting
care for themselves and providing for
their loved ones. And, of course, access is not the only critical issue here. We also have to look at the cost effectiveness of these interventions. And this includes
how a product should be implemented to take into
account the downstream effects. This includes not just
the cost of screening, but also follow
up and treatment. A good example of
this is the work that Professor Wong and his team
at SERI in Singapore have done. They published a paper
recently in "The Lancet" sharing the results of
an economic modeling study comparing
two deep learning approaches, an automated
and a semi-automated one, so with humans in the loop. And what was interesting was that the semi-automated approach was actually the most cost saving, more so than AI alone or humans alone. So it's really exciting to see
some of this research come out. And I think there's going
to be a ton more that will be needed so
that we can adopt these technologies at scale. In summary, three common myths
in building and translating AI models. The first myth is
more data is all you need for a better model. And what we find is that it's
not just about data quantity, but also about data quality. The second myth,
an accurate model is all you need for
a useful product. And what we really find is that
a human center approach is also required to build
that useful product. And the third myth is
that a good product is sufficient for clinical impact. And here what we find is that
implementation in health care economic research is critical to
adoption of these AI products, and critical to adoption
of these products at scale. All right, thank you. So I'm here with
Kira Whitehouse, who's our lead for a lot
of the deployment work that we've done in
Thailand and in India. Kira is kindly joining
us today to talk about what happens when
you have a great AI model, and you're ready, you think, to
deploy it into the real world. KIRA WHITEHOUSE:
Thanks for having me. I'm so excited to chat with you
today about AI and health care. LILY PENG: So Kira, why don't
you tell the audience a little bit about yourself,
kind of what you do, and what goes into building
a product from an ML model. KIRA WHITEHOUSE: Absolutely. So I am a software engineer. I joined the team
about four years ago. And I've helped the
group get a CE marking for our medical device. And in terms of what goes
into the actual development of that device, once you
have a brilliant AI model, it comes down to going
through this process that we call design controls. So in the first
phase you're going to be thinking about
who your users are and what their needs
are, the intended use of the device,
that kind of thing. And then from there, you'll
map those requirements into software requirements. So you'll think about
how you're actually going to implement
those in your software. And then the next
stage is really thinking about risks, and
potential harm to your users, whether they're patients or
other folks who are interacting with the device. So in our case, right, we
have a screening device, which means that we're going
to be telling patients either to see an ophthalmologist,
a specialist, or we're telling
them, OK, you're fine. You're healthy for now. You can go home and
come back in a year. So there's different
risks associated with that kind of device
than with other kinds of devices, like assisted-read or second-read devices. And then to sort of
wrap up that process, you're going to be doing
something typically called verification
and validation, which just means you're
going to be making sure you built the right thing. You built it to spec,
and that it's actually going to be helping the users
in the way you think it should. So that whole end
to end process is what we think of as developing a medical device. And then obviously, you're going
to be working with partners to deploy that and get it
into the hands of patients and physicians. LILY PENG: So it sounds
like this process that you're describing,
is it specific for AI? Or is it just like
any kind of device? KIRA WHITEHOUSE:
Great question, yeah. This is pretty much
any kind of device. It could be software as
a medical device, which is typically abbreviated
SaMD, or it could be hardware. So in certain cases, again,
if you think about risks and how your device is going to
be used in hardware scenarios, you actually will
think about things like how is shipment going to
damage my device potentially? Like the process of actually
getting your camera, let's just say fundus camera,
over from the United States to Europe, as an example. And for us, for the
software medical devices, we have different kinds
of considerations. So we decided to develop
an API only device. So we don't have our
own user interface. And that means that we
rely on our partners to actually display the results
of the device in their UI, whether it's an EMR or a
PACS, right? And with that there's
some benefits, which is we have this
seamless integration. So we don't have to force
the health care system to change their workflow. They're already using
their own user interface. They can just
display our results. And that can be really
useful in some contexts, as you've seen, Lily. In other contexts, if we're
trying to deploy in a setting where there's no existing
infrastructure, like we've experienced in Thailand,
that can actually be really challenging. They're only on a
paper-based workflow. In terms of AI and
some differences here, so if you use
YouTube as an example, you're going to see the latest
and greatest videos there will be suggested to you. So you'll get results
from today or yesterday maybe at the latest. And with our medical
device software, we are also deploying our
device on a regular basis. So just like with
YouTube, we're going to be deploying our
software maybe every day, every week, something like that. With the AI component,
though, of the medical device, we typically don't roll that out
more than once every six months or something like that, because
that's a more major update. So when you think about
the risk of the device and what components are
lending to that risk, a lot of the sort of serious
logic that could cause harm is within the AI, because
the AI is the thing that's taking in the image
and then predicting, do you have disease or not? Does that make sense? So it is kind of interesting to
think about the counterexample of YouTube, where that AI
might deploy on a daily basis, whereas in our situation, we
have this much longer time frame. So Lily, we were talking about
how in the design input phase you want to figure out
what your intended use is, who your users are,
what their needs are. And when we had
this AI model, we could have done a number
of things with it, right? We could have used it in
an assisted read context. We could have used
it in a second read context or something else. We ended up choosing screening. Can you talk a little
bit about why that was the decision that we made? LILY PENG: One of the really
amazing things about AI is the ability to move
care closer to patients. So what I mean is that there
are a lot of restrictions on how a person can get care. And a lot of it has to do with
whether or not the health care professional is actually physically located within a certain area, right? What technology
allows you to do, whether it's through
telemedicine or AI enabled telemedicine, is to bring
care closer to patients. And so one of the things that
we've seen with screening, in particular, is that access
issue is a really big barrier. So if you put screening
closer to a patient, your screening rates will go up. Whether it's AI enabled or not,
screening closer to a patient means higher adherence rates. And so with screening, that's
actually what matters the most. It's like you're really
trying to catch everybody. And so this is where we thought
AI could be the most useful in terms of accessibility. Also, it probably is pretty
well adapted to the problem, just because AI is really good at high-volume, repetitive tasks. And screening tends to fit that mold quite a bit. So the two aspects of screening that make AI really applicable to this problem space are the accessibility requirements, as well as the scale and sheer volume of screening procedures that are required. One of the myths
that I talked about was that all you need
for a useful product is a more accurate model. And so what do you think
about that statement? Like where are the
caveats to that statement? KIRA WHITEHOUSE: I guess one of
the topics that we can touch on is image quality. That's been a big problem
in our actual deployment sort of in the field. So our device, right, we
take in an image that's off of the back of your eye. So if you've gotten your
retinal exam before, you might have done a slit
lamp examination in person. But if you've gotten
a fundus photograph, they'll be shining a
light through your pupil, and take a photo of
the back of your eye. And our algorithm
takes that as input. Sometimes the images themselves
are really low quality. So maybe half of the
image is obscured. Maybe it's just really blurry. Sometimes there will be
dust on the camera lens, and that will cause
either a lesion to be obscured by
the piece of dust, or potentially there'll be
something on the image that looks like a lesion that's not. It's just a dust spot, right? And those problems, Lily,
from your perspective, because you're a physician,
if you got an image like that, right, what would you do? In what cases would you
decide that an image was associated with a patient
who is diseased versus not? Would you be able
to make that call? Cause those are the
kinds of challenges that our AI has as well, right? LILY PENG: Yeah, yeah, I
mean, as a physician we would probably just look at the other pictures from the same day and see if, let's say, the dust spot was still there. So how would we address this on the AI level? KIRA WHITEHOUSE: So we could
do actually something similar, which is interesting. If we had a very
tight integration with the camera itself,
we could actually provide signal to users
when they're actually taking the photo. So when the camera is put
up to the patient's eye, we could have a little
bubble that pops up and says, hey, it looks like there's
a dust spot on your camera. Can you clean it? We could also maybe help
them if, for instance, we see that the patient
maybe has cataracts, there's some media
opacity that's preventing the light from
getting to the back of the eye and getting a good photo. We could tell them, hey, it looks like this patient has cataracts. Maybe try these different things to capture a good photo.
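As one illustrative example of that kind of capture-time signal, the sketch below flags very blurry photos with a simple variance-of-the-Laplacian check in OpenCV; the threshold, and the choice of this particular check, are assumptions rather than a description of the actual device.

```python
# Minimal sketch: a crude gradability check that flags very blurry fundus
# photos before they are sent to the model or accepted by the operator.
import cv2

BLUR_THRESHOLD = 100.0  # illustrative; tune on known-gradable vs. ungradable images

def is_probably_gradable(image_path):
    gray = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    if gray is None:
        return False  # unreadable file counts as ungradable
    sharpness = cv2.Laplacian(gray, cv2.CV_64F).var()  # variance of the Laplacian
    return sharpness >= BLUR_THRESHOLD

# In a real deployment this would be one of several checks (field of view,
# exposure, artifacts), ideally surfaced to the operator at capture time.
```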
We've also seen, though, that a lot of the time it ends up being really simple solutions that have nothing to do with technology, right? So in some cases, it's just that they need to install better curtains in the clinic. Because by having
a darker setting, the pupil's going
to be more dilated and you can get a better photo. In other cases, we've seen
that in certain clinics patients will often come
with personal belongings. And then they'll be
sitting at the camera trying to get a good photo
with their purse or handbag. And maybe it's hard to
get a good position. So even installing
something like a shelf could potentially
be helpful there. LILY PENG: So it sounds
like image capture, or just getting
the right pictures to put into the system
in itself is challenging. KIRA WHITEHOUSE: Right, that's
one problem that we see. Another problem
that's interesting, and also it'd be great to
hear from you about why this is challenging, but even
if we have an amazing AI, and let's say we can
get good quality images, we still see these problems of
patient recruitment and patient follow up, which means
we're not actually getting the patient base that
we want into the clinic to get screened. And then when the
patients are there, and we give them an
output from the device, and we tell them
they're diseased, we actually see that a
lot of these patients don't even come back
to see the specialist. And I'd love to hear from
your perspective, what are the obstacles there
that you've experienced, or that in talking with other
health care professionals you've seen? LILY PENG: Yeah,
yeah, I think it's really interesting that you
can get the best images. And then you can
have the best model. And then you can give
people the information. But if you don't make that
information actionable and easily actionable, you
may have lost the game, so to speak, right there. Right? And so what we found sometimes
is that the information comes too late, right? So a lot of times
patients are told, hey, we'll send you a
mail, or we'll give you a call if anything's wrong. No news is good news, right? And so then they miss
the call, or they don't get the piece of mail. And they think, well, no
news is good news, right? So the default is no follow-up. So the timeliness of that information, in itself, can be problematic. And that's why an automated system can be really helpful: you can get that information to the patient almost instantaneously. Now, that instantaneous delivery of information then enables a bunch
of other things, right? So same day follow up
is yet another thing that we've seen that seems
to make a big difference in adherence rates. And we've actually interviewed
a lot of patients to ask why. And a lot of times our
studies and other studies have shown that the number
one reason is transportation. Right, it's not I
don't understand, I didn't realize
I was sick, or I didn't believe I
was going to get better, which are also reasons. But number one is
I can't get a ride, or I can't take time
off my schedule. So a lot of times it feels crazy that we're the AI people, and yet we're not the ones able to provide the solution here. It's actually quite common-sense solutions that need to be implemented well. So we've covered what it
takes to take an AI model and kind of verify it, validate
it, and then potentially put this into a clinic. What are the things that
you have to do after that? Are you done once you kind of
sell that piece of software or install that
piece of software? What else is there to do? KIRA WHITEHOUSE:
One of the things that's kind of exciting
about software is that you can monitor it, right? And before you go and
deploy the medical device, when you're getting regulatory
approval from getting a CE marking, or
FDA approval, you're going to go through some
sort of validation study. So you'll be validating
that your device works against some population. And it's usually not possible
to have representation from every single population,
thinking of sex, age, ethnicity, and whatnot. So one of the things
that's really important is to make sure
that your device is functioning as intended whenever
and wherever you deploy it, right? So we actually have kind of
a fun, creative, post market monitoring solution
where we take a subset of the images
that are captured during clinical
workflows, and we actually have doctors adjudicate
them in-house to see what the grade should be. And we compare that
result to [INAUDIBLE]. So we can see the performance of the device when it's actually live and impacting real patients. So that's been really exciting to see.
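For illustration, here is a minimal sketch of that kind of post-market comparison, assuming a hypothetical "referable at grade 2 or worse" cutoff; it is not the team's actual monitoring pipeline.

```python
# Minimal sketch: compare live device grades against in-house adjudicated
# grades on a sampled subset of clinical images.
def referable(grade):
    """Assumption: grades 2+ (moderate or worse) count as 'referable'."""
    return grade >= 2

def monitoring_metrics(device_grades, adjudicated_grades):
    pairs = list(zip(device_grades, adjudicated_grades))
    tp = sum(referable(d) and referable(a) for d, a in pairs)
    fn = sum(not referable(d) and referable(a) for d, a in pairs)
    tn = sum(not referable(d) and not referable(a) for d, a in pairs)
    fp = sum(referable(d) and not referable(a) for d, a in pairs)
    return {
        "sensitivity": tp / (tp + fn) if tp + fn else None,
        "specificity": tn / (tn + fp) if tn + fp else None,
        "n": len(pairs),
    }

# Tracked over time, a drop in these numbers would flag a problem with the
# deployed device or a shift in the incoming image population.
```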
The other things involved with post market are handling customer feedback, whether feature requests or complaints. Making sure that if
they have complaints, there's no defects
with the device. Or if there are, we follow
up and address them. And then for feature
requests, that's kind of an exciting thing to
see that our device has been useful, and how we
can make it even more impactful to our users. LILY PENG: So Kira,
it sounds like there are a lot of
expectations around AI. What are some common
misconceptions of what people think AI
can do, where it maybe isn't able to right away? KIRA WHITEHOUSE: So
one example of this, our device takes as input a
45 degree field of view image. So again, when you think of the
light going through your pupil and taking a photograph
of the back of your eye, it's going to get
some portion of it. And there's different field
of views that you can capture. So you can get something called
an ultra wide field image, which is up to 200 degrees. And we received some feedback
from partners at some point that they were expecting our
AI could take the smaller field of view, the
45 degree image, and actually predict
what you can see or what humans could
typically see only using the bigger field
of view, the 200 degree field of view image. And that's really
interesting feedback, right? If we had the data, if we
had a bunch of paired data that was ultra
wide field with 45, we might be able to train a
model to do that, potentially. And you can see why people might
have that sort of misconception if you think about the risk of
heart attack, the cardio paper that our team published. AI being able to
take a fundus image and then predict the
likelihood of having a heart attack or some
other cardiac event, compared to going from a 45 degree to a 200 degree field of view, it kind of makes sense that people might think, oh, you could just do a little bit more using this smaller image, right? So that's been really
interesting to see. As I mentioned,
data can sometimes be an obstacle to actually developing a model that can do things
like predict cardio, or predict disease that's
only seen from these wider field of view images. So Lily, I have to ask you,
if you could have any data, snap of your fingers,
what kinds of models would you want to train? LILY PENG: That is a
fantastic question. And I think it goes to the core
of problem selection, right? What are the things
that you want to train a model to
do that actually is helpful for the patient, right? I generally find that
picking a problem where there is an intervention,
it's a lower priority. Like I think there
are some things where you can predict
a risk of progression of a certain disease,
or another one. But if there isn't
an intervention that you can take because
of that prediction, then it's probably not
going to do as much good as if you would change
your course based on that prediction, right? So the way that I
think about it is that the prediction needs to
be actionable in some way, first and foremost. So what I mean by
actionable is for screening, if we find that this person
needs to be followed up at a shorter time
interval, three months versus a year,
that is an action that we would do differently. And that makes that problem
a good problem to tackle. If it's no change, we
could say, no matter what, this person's going
to be followed up in a year. That becomes a less
interesting problem I think. So that would be the first
criteria of problem selection is the actionability
of what you're doing. The next one is I would
think about scale, which is how many people would
benefit from this task being done correctly. And within that,
there's two components. One is, how many people are
getting the procedure done, but also how many
people wouldn't get the procedure done, or would
be potentially misdiagnosed if you didn't do this. Right, so I think that's
the second component of it. So I think one of the
interesting things about these particular
selection criteria is that there are
actually already programs in the medical
community where we do this. And those are called
screening programs. So screening for breast cancer,
screening for lung cancer, screening for colon
cancer, or screening for diabetic complications. And it's because
overall we found that if we screen
people early, we're able to help them live
happier, longer lives. And so I think screening
programs is actually a really big deal. And within screening programs,
the more hard outcomes, as we say, that we have, the better the data for that particular problem. So what I mean by hard
outcomes is really survival rates for
cancers, for example, vision loss rates for
diabetic retinopathy, so things that really
matter a lot to patients, rather than kind of
other proxy outcomes. So the more concrete
you are about how that affects patients, the
better the problem for ML. KIRA WHITEHOUSE: Your
background in health care, Lily, I'm really curious. We, Google, have come a long
way in the last four years getting this project from
just a machine learning model, to actually getting
a CE marked device, and having that device being
used by health care providers and impacting patients. I'm curious, have
you seen a shift in health care providers'
perception of AI, either from us or from
everyone else in the industry, in tech, and in
medicine that are making these sorts of devices? LILY PENG: Yeah, I think
there has definitely been a shift in how we
think about AI in medicine. I think when we first
started, the question was if, if AI would have
an impact on medicine. I think now the conversation
has changed a little bit to the how. How will AI impact medicine? And how do we do it responsibly? How do we do it in a way that
safeguards patient privacy, but also maximizes
patient impact, right? So really we've gotten to how
the implementation is done, because, honestly, there's
a lot of research out there that shows the potential
of AI to make big changes, and ensure better, more
equitable care for lots of folks, if
implemented correctly. And so I think that's
where lots of folks are spending a lot
of time, obviously external to Google,
and within Google as well is
understanding the how. And so some of the work that
Kira and her team are doing is really helping us gather
information on the how, right? How does this ML model
fit into a product? How does this product-- how is it verified to
be safe and effective? How do we put it into
a health care system such that patients are
actually benefiting from it? And then how do we monitor
the products in real time, or near real time
so that we make sure that the
diagnoses are accurate, and we find out if anything
goes wrong quickly. So I think the how here is now
kind of the next big mountain to scale. And I think we've made some
really, really good progress there. KIRA WHITEHOUSE:
Absolutely, yeah. It's been so exciting to
see really what we've done. And also, I love
the way you framed that of how health
care is maybe moving from an if AI can help to how. That's a very,
very exciting time. LILY PENG: So I am here
with Scott, one of our ML leads on the team. And Scott is the lead for
our [INAUDIBLE] paper, which was published in
"Nature" recently, as well as has done
a lot of modeling with other types of
radiological images and data. So Scott, obviously
you've trained a lot of models in your life. So do you have any tips or
rules of thumb for people listening to this cast? SCOTT MCKINNEY:
Absolutely, I'm hoping to help people avoid all
the mistakes that I've made along the way. And there have certainly
been many of them. So the first one that I'd
encourage people to do is visualize their data. The second tip is question the
construction of your data set. And the third is
really make sure that you're trying
to solve a problem with genuine clinical
utility, rather than something that's just easy to model. LILY PENG: OK, so what I
hear is visualize your data, question the construction
of your data sets, and solve a problem with
genuine clinical utility. So for our audience, can you
elaborate a little bit more about what you mean by each? SCOTT MCKINNEY: I think that
in building machine learning models for medicine,
people often blind themselves to the data. And they're thinking
that they're not going to be able to make sense
of it in the first place. And it's intimidating,
because the experts who interpret these images
may have trained for years to be able to do this well. But I think that if you
don't actually get in there and look at the data, you
can miss important patterns. So I encourage people to get
familiar with the examples and inspect the
images when you can, because there may be
obvious things wrong that you don't need a
medical degree to notice. So I can give you an
example of this in practice. We were building a model to
find pulmonary nodules, which are potential lung
cancers in chest x-rays. And we had built a model that
was performing astonishingly well. And obviously this
is very exciting. But when we looked at some
of the true positives, these are the cancers that the
model was supposedly catching, we found bright circles
overlaid on the images. LILY PENG: Oh. SCOTT MCKINNEY: And so
these digitized x-rays had had pen markings on them
from prior interpretations. They put them up on the
lightbox and circled the nodules that they were worried about. And so this obviously
invalidated a lot of our work, because all of this
effort had gone into building a very
sophisticated circle detector. So clearly this would have been
easy to detect if we had just spent some time browsing the data and noticing that it was contaminated in this way.
LILY PENG: Yeah, yeah, it totally sounds like for some of these first-line passes, you don't really need a medical degree to make sense of the data and, for example, find this circle sign. SCOTT MCKINNEY: Absolutely. Yeah, so these sorts
of patterns can be really stark and
easy to see just by thumbing through some of
them without having fellowship training in radiology. LILY PENG: Yeah, yeah. And so tell me a little bit
more about rule number two, about the construction
of your data sets. Where have you
seen this go right? And where have you seen
this gone terribly wrong? SCOTT MCKINNEY: Yeah, so I think
that it's really easy to assume that whoever's
curating this data, especially if they are
on the health care side, knows what they're doing. And they're going to
deliver you something that is machine learning
friendly out of the gate. But unfortunately, when
constructing a data set for machine learning,
it's really easy to introduce
confounding variables that the model can
then use to cheat. And obviously, models that
cheat won't generalize. So it's really important to
interact with the curators. These are maybe the IT folks who
are putting together the data sets, gain an understanding
of how the data is sourced, and be on the alert for spurious
correlations between some of the inputs, whether they're
images or medical records and some of the labels
that might enable the model to cheat. So an example that
I think has probably occurred in many domains,
but for us occurred when training a model
to identify tuberculosis from chest x-rays. So we worked with a
partner who had given us a bunch of positives and
a bunch of negatives. And we did the
obvious thing, which is train a classifier
to distinguish between those positives
and negatives. And the first model we
trained was fantastic. And so we were excited,
but again, cautious. And so we investigated,
and discovered that all the positives were from
one hospital, all the negatives from another. Now, these images
looked quite different, coming from the
different hospitals, using different scanners
with different parameters. And so to detect
tuberculosis, all the model would have to do is identify
which hospital it came from. And this information
is encoded in the image through its pixels in
a pretty obvious way, and has nothing to do with
any anatomy or physiology. And it's just kind of patterns
in the texture of image, or the contrast of the image,
or even in some of the markings that they might put in the
image when setting it up. And these models are lazy. And they're going to
cheat if given the chance. And so we've definitely been stumped by this phenomenon in more than one spot.
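A minimal sketch of the kind of sanity check that can catch this, assuming each record carries a 'site' and a binary 'label' field (hypothetical names):

```python
# Minimal sketch: check whether a metadata field such as the source hospital
# can predict the label by itself, which would let a model "cheat" on site
# rather than pathology.
from collections import Counter, defaultdict

def label_rate_by_site(records):
    """records: iterable of dicts with 'site' and 'label' (0/1) keys (assumed)."""
    by_site = defaultdict(Counter)
    for r in records:
        by_site[r["site"]][r["label"]] += 1
    for site, counts in sorted(by_site.items()):
        total = sum(counts.values())
        print(f"{site}: n={total}, positive rate={counts[1] / total:.2f}")

# If one site is ~100% positive and another ~0%, the split (or the data
# collection itself) needs to change before the labels mean anything.
```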
LILY PENG: Yeah, yeah. It sounds like what I'm hearing is that for folks who are clinical, who have spent their whole lives not in machine learning, we could actually do a lot in terms of educating them, or sharing knowledge with them, about how to construct data sets and how these models work. Because I think a
lot of clinicians, if we tell them we want X
number of positives and Y number of negatives, they kind
of find those things. But they're also not vigilant
about all these other things that machine learning experts almost have as background knowledge and don't even think about. It's like these things that you just do. But clinicians don't
necessarily know that yet. And it's actually
quite helpful to let them know the underpinnings
of how machine learning models cheat. SCOTT MCKINNEY: Yeah, I think
that's really well said. I think there is a general mutual unfamiliarity: data scientists
tend to be aloof from some of the
clinical aspects, and likewise, some
of the clinicians might be a little naive
to some of the phenomena that we're familiar with
in machine learning. And so when we
might specify data set characteristics
that will help us, those are taken very literally. And there are certain
dimensions that could be ignored and that kind of
stymie the endeavor. So yeah, there has to be
a lot of communication and a lot of conversation to
make sure things are done well. LILY PENG: Yeah, yeah, sounds
like lots of talking involved. SCOTT MCKINNEY: Yeah, which
is hard for a lot of us data scientists. [LAUGHS] LILY PENG: Yeah, yeah. OK, so Scott, the third rule,
tell me a little bit more about it. Solve a problem with
genuine clinical utility. What does that mean? SCOTT MCKINNEY:
It's easy, when thinking about tackling a problem in diagnostics, to overly generalize a label and think that if you can identify that condition, then it's always useful to a clinician. In fact, you may need to narrow that definition in order to surface cases that are actually clinically relevant and actually actionable. So the example that
comes to mind here is when looking
for pneumothorax, which is also described as
collapsed lung in chest x-rays. Now, this is a life threatening
condition that, importantly, can be treated very easily. You stick a tube in
the chest, and the lung will be able to reinflate. Now, we had a big data
set of chest x-rays. And we labeled them as
having pneumothorax or not having pneumothorax. And we turned the
crank, and we learned to classify these x-rays and
find those with pneumothorax. Now, the problem is that for
every case of pneumothorax that's found,
you'll also acquire x-rays to watch the condition
resolve once the chest tube has been placed. And that means that in a
retrospective data set, most of the x-rays that
supposedly have pneumothorax have an already
treated pneumothorax. And, in fact, the already
treated pneumothorax is easy to spot because there's
a big chest tube in the image. Now, obviously, these
are not the cases that need to be identified,
because the doctors are already aware of them. And so when you
define labels, you want to make sure that the positives are really the positives you care about, and that you don't have an overly broad definition that encompasses mostly cases that are already being treated.
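As an illustration of narrowing the positive definition, here is a minimal sketch that drops already-treated studies from the positive class; the 'has_chest_tube' and 'label' fields are hypothetical, not from the actual data set.

```python
# Minimal sketch: keep the clinically useful positives (untreated pneumothorax)
# and exclude already-treated studies, whose chest tubes give the label away.
def relabel_for_screening(studies):
    """Return (image_path, label) pairs where positives exclude treated cases."""
    examples = []
    for s in studies:
        if s["label"] == "pneumothorax" and s["has_chest_tube"]:
            continue  # already treated and known to clinicians; exclude or audit separately
        label = 1 if s["label"] == "pneumothorax" else 0
        examples.append((s["image_path"], label))
    return examples
```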
And in this case, the fact that the treated pneumothoraces have a very overt signal that enables the model to cheat is doubly bad, because
not only are your metrics maybe off because of the
composition of the data set, but it also means that
the computer vision model you built
probably isn't doing what you think it's doing. LILY PENG: Yeah, yeah. So it almost sounds
like in this case there's like two clinical
problems, both called pneumothorax, right? The first one is
undetected, doctors don't know about it, untreated. And the second one
is a treated one. And we had inadvertently
solved the latter that had limited clinical utility,
rather than the former, which had genuine clinical utility. SCOTT MCKINNEY: That's right. We collapsed the positive
category into one. And unfortunately, the
treated pneumothoraces that have little clinical
utility overwhelm the sort of needle
in the haystack, undiagnosed pneumothorax, which
is the one that we really do want to target with
machine learning. LILY PENG: Got it. Got it. OK, so the three rules. The first one is,
visualize your data. Definitely look at the
thing, even if you do not have a medical degree. If you have friends with a
medical degree, co-workers, even better. But definitely
look at your data. The second rule I'm hearing
is question the construction of your data set. And then the third
rule is solve a problem with genuine clinical
utility, including thinking about the
timing in which you're getting this information
to the clinician. So I feel like those
are the three rules. Do you have any kind
of overarching thing that you always kind of have
at the back of your head that helps you
train better models or helps you tackle this space? SCOTT MCKINNEY: Yeah,
I think the thing that ties these
together is skepticism. Be really skeptical
of good results. Machine learning in
medicine is really hard. The combination of small
and messy data sets, coupled with the perceptual
challenge of the task means that easy wins
are really elusive. And so you should dig into
these and see what you can find, because there's probably a bug. LILY PENG: Yeah, yeah. SCOTT MCKINNEY:
At least at first. LILY PENG: Yeah, for sure. So if it feels easy,
it's probably too easy. That's what I'm hearing. SCOTT MCKINNEY: That's right. I think people have been
working in this field long enough to pick all
the low hanging fruit. So yeah, be skeptical. LILY PENG: So
Scott, thank you so much for talking to me, and
for sharing your knowledge with the rest of the audience. Thank you for the three
rules and the special sauce. And hopefully this will help
everyone train better models, and be a little more vigilant
for when models like to cheat. SCOTT MCKINNEY: Absolutely,
it was a pleasure. And I hope everybody else has
an easier time than we did. [CHUCKLES] LILY PENG: All right, thanks. [MUSIC PLAYING]