Big data and public health

Captions
Coming up on Harvard Chan: This Week in Health... big data and public health.

"Every time we go to a doctor, we receive a diagnosis, we start a new treatment, the information goes into a database. And when you put all of us together, that is a database with millions of data points that can be used for research purposes. In fact, chances are that our own health information is being used right now to gain new insights about health."

Researchers are now harnessing vast amounts of information to assess what works in medicine. This new data-driven approach holds promise, but there are some potential risks, and in this week's episode we'll discuss that with an expert in the increasingly important field of causal inference.

Hello and welcome to Harvard Chan: This Week in Health. It's Thursday, January 25th, 2018. I'm Amie Montemurro.

And I'm Noah Leavitt.

Noah, medicine and public health are constantly evolving as new research and technology open the doors to new ways to treat or prevent diseases, and in other cases new findings are challenging our preconceived notions of what works best.

And a key challenge for doctors and scientists is exactly that: figuring out what is best for patients and for the public's health at large. That means asking questions like, when is the best time to start treatment in individuals with HIV, or is it safe to give antidepressants to pregnant women? There are also policy questions that should be answered, such as nutrition recommendations regarding dietary fats.

In an ideal world, those questions would be answered through a randomized controlled trial, the gold standard of scientific research. But in many cases that's not possible, because it may be too expensive, too difficult to enroll the right number of people, or the study itself may be unethical.

And that's where big data, the focus of today's podcast, comes in. Researchers are now able to harness vast amounts of existing information on patients, such as in Medicare databases, to in a sense replicate randomized controlled trials. It's an approach with great promise, but also potential downsides if the research isn't conducted properly.

And that's where Miguel Hernán has focused much of his work. Hernán is the Kolokotrones Professor of Biostatistics in the Department of Biostatistics here at the Harvard Chan School, and he's a leading expert in the field of causal inference, which includes comparative effectiveness research to guide policy and clinical decisions. We spoke with Hernán about how researchers are using big data to answer important questions about health, and about the safeguards that need to be in place to avoid misleading results. I began our conversation by asking Hernán to define big data, a term you've probably been hearing a lot lately. Take a listen.

Big data means different things to different people, and in fact there is not even agreement on how big data need to be to be called big data. But in a health context, we use the term big data to refer to these large databases where our interactions with the health care system are stored. Every time we go to a doctor, we receive a diagnosis, we start a new treatment, the information goes into a database, and when you put all of us together, that is a database with millions of data points that can be used for research purposes. And of course there are very strict protocols to prevent any leaks of personal information. As an example, there is a lot of research that is conducted based on the information of Medicare beneficiaries, or on information about members of private insurance companies.
In fact, chances are that our own health information is being used right now to gain new insights about health. In that sense, we are all part of the research enterprise, which I find very exciting.

And just to follow up on that: from my perspective, this whole field of big data seems to have grown really rapidly over maybe the last decade or so. Has that been the case, or has big data been in use longer than people realize?

That's a very good question. People in health started to use big databases probably in the 70s. At the time, the ones doing that were a minority, and now everybody is using these big databases. I think the term "big data" comes to us from other places, from Google and Facebook and places like that, but health researchers have been using big data for a long time.

So I know much of your research focuses on causal inference. What does that mean when it comes to using these large databases of information?

Causal inference is a term that has become very fashionable among investigators, but what we actually do is try to learn what works and what doesn't work to improve health, and we use big data for that. We ask questions like: how much does screening colonoscopy lower the risk of cancer? What is the best time to start treatment in individuals with HIV? Is it safe to give antidepressants to pregnant women? In the past we had very little data to answer these questions; for each question we had to recruit participants and collect data, which meant that we had relatively small studies. But in the last decades, with the availability of these big databases, we can study these issues, we can ask these questions and try to answer them in a more efficient way and at a fraction of the cost.

So you touched on the idea that the standard for assessing one of these questions might be to do a randomized controlled trial and recruit participants, but big data allows you to measure the effectiveness of an intervention without having to do that. Can you give an example of where that might occur?

Sure. Well, first of all, you mentioned randomized trials, which have historically been the gold standard to learn what works and what doesn't. The idea of a randomized trial, as I'm sure many of our listeners know, is that we assign people to different treatments, and we assign them at random. Then we compare the outcomes between the two groups, and because treatment assignment happened by chance, any differences between the groups have to be due to the treatment they are receiving. So this is the best possible way of making causal inferences.

Now, in the real world there are many practical difficulties to carrying out randomized trials. Some trials would be so expensive that we cannot even consider them; others would not be ethical. Suppose that we want to learn about the risk of birth defects. Well, we cannot conduct a trial in which we intentionally expose pregnant women to dangerous treatments. Other times we are interested in the long-term effects of treatments, maybe after using them for 10 or more years, and again a randomized trial would not be practical. So as much as we love randomized trials, in many cases we are not going to be able to conduct them, and that is when we use big databases. In those cases, our best chance to learn what works is really the use of these big databases. And even when we can conduct a randomized trial, we will have to wait three, four, or five years until we know the results from the trial, and in the meantime we still need to make decisions. For those decisions, again, we need some information, which will come from big databases.

So in a sense, are you basically taking existing data that's out there, modeling what has already happened in the real world, and drawing conclusions from that?

That is exactly what we do, and that is what causal inference is. We take the data that have already been collected and we try to use those data to emulate a randomized trial that we would like to conduct but can't.
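To make the idea of a target trial concrete, here is a minimal sketch, assuming Python, of how a study team might write the protocol down before touching any data. The components listed follow the general target-trial framework described in the interview; every concrete entry (the aspirin question, the age range, the five-year follow-up, the weighted analysis) is an invented illustration, not a detail from the conversation.

```python
# A hypothetical target-trial protocol, written down before any data analysis.
# Every concrete value below is an illustrative assumption.
target_trial_protocol = {
    "eligibility": "adults 50-79 with no aspirin use and no stroke in the prior year",
    "treatment_strategies": ["initiate low-dose aspirin", "do not initiate aspirin"],
    "assignment": "randomized in the target trial; emulated by adjusting for "
                  "baseline confounders recorded at initiation",
    "time_zero": "the visit at which eligibility is met and a strategy is chosen",
    "follow_up": "from time zero until stroke, death, loss to follow-up, or 5 years",
    "outcome": "fatal or non-fatal stroke",
    "causal_contrast": "intention-to-treat and per-protocol effects",
    "analysis_plan": "pooled logistic regression with inverse-probability weights",
}

# Writing the protocol first constrains the analyses to the ones that answer
# this specific question, rather than any association the data happen to show.
for component, specification in target_trial_protocol.items():
    print(f"{component:20s} {specification}")
```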
And so what are some of the benefits of this approach, and on the flip side, what are some of the risks?

Well, the benefit is that being formal about causal inference results in fewer mistakes. Being formal means trying to be very precise about the randomized trial that is our target, the randomized trial that we would actually like to emulate, and then going about trying to emulate it. If we try to do it in a more casual way, in which we have data, we do a data analysis, we find some associations and we try to give them a causal interpretation, it's more likely that we will make mistakes. For example, a naive data analysis will find that cigarette smoking during pregnancy is associated with lower mortality in babies with low birth weight. But that doesn't mean that cigarette smoking during pregnancy lowers the risk of mortality; that is just something that we are guaranteed to find in the data, and a formal causal inference analysis will explain why cigarette smoking really does not lower the risk of mortality in those babies. So by being formal about causal inference we can avoid common biases that we sometimes see in data analyses that are more casual or naive.
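The low birth-weight example can be reproduced with a small simulation. The sketch below is not from the interview and all of its numbers are invented; it assumes a data-generating process in which smoking modestly increases infant mortality, an unmeasured condition (say, a birth defect) strongly increases it, and both smoking and the condition cause low birth weight. Restricting the comparison to low birth-weight babies then makes smoking look protective, which is exactly the naive association described above.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000

# Hypothetical data-generating process (all numbers are illustrative assumptions).
smoking = rng.random(n) < 0.40                      # maternal smoking
defect = rng.random(n) < 0.05                       # unmeasured condition
p_lbw = 0.05 + 0.25 * smoking + 0.60 * defect       # both causes of low birth weight
low_birth_weight = rng.random(n) < np.clip(p_lbw, 0, 1)
p_death = 0.01 + 0.02 * smoking + 0.30 * defect     # smoking mildly harmful, defect very harmful
death = rng.random(n) < np.clip(p_death, 0, 1)

def mortality(mask):
    return death[mask].mean()

print("All babies:")
print("  smokers    :", round(mortality(smoking), 3))
print("  non-smokers:", round(mortality(~smoking), 3))

print("Low birth-weight babies only (conditioning on a common effect):")
print("  smokers    :", round(mortality(smoking & low_birth_weight), 3))
print("  non-smokers:", round(mortality(~smoking & low_birth_weight), 3))
# Overall, smoking raises mortality; within the low birth-weight stratum it
# appears protective, because non-smokers' low birth weight is more often
# explained by the unmeasured condition.
```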
It seems like what you're describing there, with the example of smoking during pregnancy, is one of these seemingly random associations that people can find if they play with the data enough. What are some of the common biases that researchers need to be aware of, that would separate a naive study from a formal causal inference analysis?

Well, you just touched on a very important problem of this type of analysis with big data, which is the problem of multiple comparisons. Because you can compare anything that you want, just by chance you are guaranteed to find some associations, and that is a very serious problem. One way of fighting that problem is precisely to be formal about the question: by pre-specifying the randomized trial that you would like to conduct but can't, and then trying to emulate that trial using the big data, you constrain yourself in terms of the number of analyses that you are going to do. You cannot do just anything; you have to do only the type of analysis that will help you answer that specific question, and not the other million questions that could come up.

But that is only one of the problems. The other problem when trying to make causal inferences with big data is that we have a lot of data, but that doesn't mean that we have the data that we need. Of course we need data on the treatments of interest and data on the outcomes of interest; if we are trying to estimate the effect of aspirin on stroke, we need good data on aspirin and good data on stroke. But besides that, we also need very good data on the reasons why people take aspirin, because people who take aspirin and people who don't take aspirin in the real world are different, so we cannot just compare them. This is not a randomized trial. If we just compared them, people who take aspirin are probably people who have a higher risk of heart disease to start with, and people who don't take aspirin have a lower risk of heart disease, so they will have different risks of stroke, but not because of aspirin; it is just because they are different types of people. That is the problem that randomization solves, and that is a problem that we have in this type of study. So we need very good data on the variables that make the treated and the untreated different, and that is, I would say, a main limitation of many of these analyses: in these large healthcare databases we may have high-quality information on treatments and high-quality information on outcomes, but not always high-quality information on the prognostic factors that are needed for a valid analysis.
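As a rough illustration of why data on the reasons for treatment matter, the following sketch (invented numbers, not study data) simulates an aspirin-and-stroke comparison in which people at high baseline cardiovascular risk are more likely to take aspirin. The crude comparison makes aspirin look harmful even though, by construction, it lowers stroke risk in everyone; standardizing over the measured risk factor recovers the protective effect.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1_000_000

# Invented data-generating process: high-risk people are more likely to take
# aspirin, and aspirin truly lowers stroke risk by 20% in everyone.
high_risk = rng.random(n) < 0.30
aspirin = rng.random(n) < np.where(high_risk, 0.70, 0.20)
baseline_risk = np.where(high_risk, 0.20, 0.05)
stroke = rng.random(n) < baseline_risk * np.where(aspirin, 0.80, 1.00)

# Crude comparison: confounding by indication makes aspirin look harmful.
print("crude risk, aspirin   :", round(stroke[aspirin].mean(), 3))
print("crude risk, no aspirin:", round(stroke[~aspirin].mean(), 3))

# Standardization over the measured risk factor: average the stratum-specific
# risks using the whole population's distribution of the risk factor.
def standardized_risk(treated):
    total = 0.0
    for stratum in (True, False):
        in_stratum = high_risk == stratum
        risk = stroke[in_stratum & (aspirin == treated)].mean()
        total += risk * in_stratum.mean()
    return total

print("standardized risk, aspirin   :", round(standardized_risk(True), 3))
print("standardized risk, no aspirin:", round(standardized_risk(False), 3))
```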
So on some level you're a little bit at the mercy of the data available to you: if the data on a particular intervention or on the right variables isn't there, you might not be able to proceed. Is that a challenge that researchers run into a lot, where they want to do this type of research but the data just doesn't exist yet?

Absolutely, absolutely. That is one of the first decisions that all researchers have to make. They may want to answer a certain important question, they look at the data that they have, and sometimes they just have to decide that there is not enough data there, that they cannot provide an accurate answer. That may be because, again, you have very good data on aspirin and very good data on stroke, but you don't have very good data on the reasons why people take aspirin. When that happens, you probably have to stop there and not try to use the observational data, the big databases. On the other hand, there are many other examples in which we do have enough data to give an approximate answer, and we can also explore the data in ways that give us confidence in our answer. We can do parallel analyses that show that it is unlikely that our results are explained by differences between the groups. Those are what we sometimes refer to as sensitivity analyses, which are a very important part of any analysis of big databases.

And in doing the sensitivity analysis, is that where something like controlling for confounding variables comes in? Or is a confounding variable more what we were talking about with the reasons someone would take aspirin?

There are many different types of sensitivity analyses. A type that we like a lot is something known as negative controls. Let me give you an example of how this works. A few years ago we conducted a study using a large database of electronic medical records, and we wanted to estimate the effect of statins, which are a treatment for high cholesterol, on diabetes. We found that people who initiated statin therapy had a 10% or so increased risk of diabetes compared with people who didn't. Now, this might be due to many reasons, and one reason is that people who start statins are, by definition, seeing their doctors more often. So it is possible that statins do not really increase the risk of diabetes; what happens is that you start a statin, you go to the doctor more often, and you are more likely to be diagnosed with a diabetes that you already had and that would not have been diagnosed otherwise.

OK, so how can we learn from the data whether that is likely to be the explanation or not? We can use a negative control, meaning we can find another outcome, which is not diabetes, that is not expected to be associated in any way with statin therapy but that could also appear increased if you go to the doctor more often. For example, gastric ulcer: some people may have symptoms of ulcer, but they are not diagnosed when the symptoms are mild, unless they go to the doctor for other reasons. So we did the same analysis that we had done for statins and diabetes, but now for statins and ulcer, and we found that there was absolutely no association between statins and ulcer. That gives us some confidence that the association we had found between statins and diabetes was not due to visiting the doctor more often.

Interesting. So you almost have something that's unrelated, to kind of validate what you're doing in the study.

Exactly.
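A negative-control check like the one described can be as simple as repeating the same analysis with the substitute outcome. The sketch below uses made-up counts, not the numbers from the study discussed; it only shows the logic of comparing the risk ratio for the outcome of interest against the risk ratio for an outcome the treatment should not affect.

```python
# Made-up counts for illustration only (not the numbers from the study discussed).
# Each entry: (events, people) among statin initiators and non-initiators.
analyses = {
    "diabetes (outcome of interest)": {"initiators": (2200, 20000), "non_initiators": (2000, 20000)},
    "gastric ulcer (negative control)": {"initiators": (400, 20000), "non_initiators": (400, 20000)},
}

for outcome, groups in analyses.items():
    risk_treated = groups["initiators"][0] / groups["initiators"][1]
    risk_untreated = groups["non_initiators"][0] / groups["non_initiators"][1]
    print(f"{outcome}: risk ratio = {risk_treated / risk_untreated:.2f}")

# If surveillance (more doctor visits after starting a statin) explained the
# diabetes association, the negative-control outcome would be elevated too;
# a risk ratio near 1 for the ulcer outcome argues against detection bias
# as the whole explanation.
```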
That's really interesting. To continue with the statin example: if I'm sitting at home reading a story about statins and cholesterol, I guess the first thing would be to check with your doctor. But if someone is reading about the latest findings on statins, what should they keep in mind when reading news coverage of this type of research, to figure out whether it is something worth paying attention to, worth digging a little deeper into?

There are a few things that you have to pay attention to. One is, of course, whether there is appropriate adjustment for the differences between treatment users and non-users. Another one is how the treatment group is actually defined, because you can define the treatment group in such a way that it guarantees the treatment is going to look good, even though that has nothing to do with the true effect of the treatment.

One example is the use of statins in cancer patients. Imagine that you study the use of statins in cancer patients and say, well, anyone who has a cancer diagnosis and then starts a statin in the next four or five years will be in the statin user group, and everyone who doesn't start will be in the non-user group. Now imagine that someone dies one year after the cancer diagnosis. That person had very little chance of being in the user group, because they died very early, so that person will automatically be put in the non-user group. That means that just by defining users and non-users in that way, we have guaranteed that non-users will have a shorter survival time than users. That is a type of bias that is sometimes known as immortal time bias, because someone who is a user because they started a statin four years after the diagnosis of cancer is, by definition, immortal for those four years. So that type of classification of users and non-users is as important as, or more important than, the proper adjustment for differences between groups, and sometimes it's not given enough attention when reading a paper, or by the media when they report on a paper.
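Immortal time bias is easy to reproduce in a toy simulation. In the sketch below (invented numbers, not data from the example), the treatment has no effect on survival at all, yet classifying as a user anyone who starts treatment at any point in the five years after diagnosis makes users appear to live far longer, because dying early removes any chance of being classified as a user. A simple landmark analysis that fixes the classification time removes the bias.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200_000

# Invented setup: survival after diagnosis is exponential and the treatment
# has NO effect on survival by construction.
survival_years = rng.exponential(scale=3.0, size=n)
planned_start = rng.uniform(0.0, 5.0, size=n)          # when treatment would begin
started = survival_years > planned_start                # can only start if still alive

# Naive classification: "user" = started at any time in the 5 years after diagnosis.
print("naive mean survival, users     :", round(survival_years[started].mean(), 2))
print("naive mean survival, non-users :", round(survival_years[~started].mean(), 2))

# Landmark analysis removes the immortal time: classify by treatment status at
# 6 months, restrict to people alive at 6 months, and count survival from there.
landmark = 0.5
alive_at_landmark = survival_years > landmark
early_user = started & (planned_start <= landmark)
users = alive_at_landmark & early_user
non_users = alive_at_landmark & ~early_user
print("landmark mean survival, users    :", round((survival_years[users] - landmark).mean(), 2))
print("landmark mean survival, non-users:", round((survival_years[non_users] - landmark).mean(), 2))
```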
So it seems like researchers need to be incredibly strict in setting the parameters. What is the process like if you want to conduct one of these studies, in terms of making sure that you are being strict about defining the groups and setting the follow-up times?

Well, the funny thing is that we've always known how to do this, because we conduct randomized trials in which some basic principles of study design and analysis are followed. The problem is that, for some reason, when we started to analyze these big databases, we forgot about those basic principles of design. It turns out that some of the best-known failures of observational research are just the result of not following the same rules that we would follow for a randomized trial. Once we go back to the big databases and we analyze the data making sure, as I said, that we have defined the randomized trial that we would like to mimic, and then mimic it, then we will define our groups correctly and we will define the follow-up correctly, and the only thing that is left, and "only" is in quotes here, the only little thing that is left is to adjust correctly for the differences between the groups. That is always going to be the biggest limitation of observational research from big databases: we don't know whether we have adjusted for all those differences. But all the other problems, like immortal time bias or other types of selection bias, those are just self-inflicted injuries that we can very easily eliminate.

A lot of the examples we've talked about today are questions of effectiveness or safety. So moving forward, how do you see this growing use of big data affecting patients and the care they receive?

Well, it is already affecting patients and the care that they receive, because for many questions, as I said, we're not going to be able to conduct randomized trials, so the only information will be coming from observational data for some time to come. It's possible that in some cases there will be randomized trials in the end, but in the meantime we can only use data from large databases. Again, let me give you an example. A few years ago there were questions about the optimal time to start therapy in patients infected with HIV. There were arguments for and against starting very early in the disease, and there were no randomized trials. All that we had were observational studies in which initiation of HIV therapy was not randomized, but you could compare groups that initiated at different times and adjust for the differences between those groups to mimic the target randomized trial as well as possible. All of those studies found that early initiation was better than delayed initiation, so the clinical guidelines for the treatment of HIV were changed based on the observational studies. A few years later, a couple of randomized trials were conducted that confirmed what the observational studies had found, but for that period of time the only thing we had were the observational estimates.

Is that, in a way, the ideal scenario: you conduct the observational research, maybe it influences policy, and then down the line, when you can conduct a randomized controlled trial, it validates what the original study found? Is that a situation you think will play out more often in the future?

I think so. I think this is going to happen more. Of course, this is the ideal situation. It's possible also that in some cases the randomized trials will not validate what the observational studies found, and in those cases we will learn something about what it is that we did wrong with the observational data. But in the absence of randomized trials, it's either making these decisions based on no information at all, or based on the limited information that we can obtain from big data.

And so the last question: I know you run a MOOC through HarvardX, a free online course focusing on causal inference. If people listen to this podcast and they're really fascinated and want to learn more, can you tell me a little bit about what that course focuses on and what you hope course participants will learn?

Well, that is a course that describes the theory of causal graphs in non-technical terms. Causal graphs are a very helpful tool, because that is how we express the assumptions that we have, the knowledge that we have, about a causal problem, and based on a few graphical rules that you learn in the course, you can then make decisions about how to best analyze the data. The title of the course is "Draw Your Assumptions Before Your Conclusions," and that is exactly what it is about: how to draw causal graphs that summarize your causal assumptions, so that you can then extract conclusions from the data in the best possible way.

That was our interview with Miguel Hernán on big data and public health. As you heard us discuss at the end there, he offers a free online class through HarvardX. If you're interested in registering, you can find a link on our website, hsph.me/thisweekinhealth. Hernán has also written a free book on causal inference, and we'll have a link to that as well. That's all for this week's episode. A reminder that you can always find this podcast on iTunes, SoundCloud, and Stitcher.
Info
Channel: Harvard T.H. Chan School of Public Health
Views: 4,322
Rating: 4.7647057 out of 5
Keywords: Harvard T.H. Chan School of Public Health, HSPH, big data, public health, information, medicine, potential risk, biostats, epidemiology, policy, clinical decisions
Id: 15n9QEUlDxU
Length: 26min 23sec (1583 seconds)
Published: Fri Jan 26 2018