Collider Bias (Berkson’s paradox): how ‘censored’ data leads to flawed conclusions

Video Statistics and Information

Captions
One of the biggest challenges of empirical studies, especially those that try to determine which factors increase the chance of something good or bad happening, is to get data which are representative of the relevant population. If we don't, we can end up with flawed conclusions like: students who skip lectures are more likely to get better grades; attractive people are more likely to be nasty; and smokers are less likely to get COVID-19. These are all examples of collider bias, or Berkson's paradox, that I explain in this video.

For simplicity, let's assume that students either attend regularly or mainly skip lectures, so every student is classified as fitting into one of these two categories: they either attend lectures or they don't attend lectures. Similarly, let's assume they either get good grades or they don't get good grades. Now, in reality we know that students who skip lectures are less likely to get good grades, but we want to show this empirically, so we survey students through the university online forum. The problem with this is that those who are unengaged, meaning those who don't attend lectures and get poor grades, are the ones who are unlikely to respond. So who is it that responds? Well, very few of these; more of these, so students who have attended lectures or got good grades but not necessarily both; and most of these, people who attend lectures and get good grades. So the problem is that these unengaged students are mainly censored out of the data, and this means that most students who don't attend lectures but respond to the forum get good grades.

Now let's make this even more explicit. Here is the full set of students. We've got the students who attend and get good grades; we've got the students who attend and get poor grades, not as many of them; we've got the students who skip lectures and get good grades; and we've got the students who skip lectures and get poor grades. So what we can see is that if we focus on the students who skip lectures, most of
those who skip lectures get poor grades. So that's the reality. But if we focus only on the students who complete the forum survey, we can see that, again, most of those who attend and get good grades are still there; they complete the survey. About half of those who skip lectures and get good grades, and half of those who attend lectures and get poor grades, do the forum survey, but very few of those who skip lectures and get poor grades complete it. So most of the respondents who skip lectures get good grades, and there you see the problem: that's who you're surveying.

Now, assuming we can estimate the probability of a student responding, we can use the biased sample data to determine the true population. So suppose that those who respond, call those R; those who attend lectures, call those A; and those who get a good grade, call them G. We can estimate, let's say, that the probability that they'll respond if they both get good grades and attend lectures is 0.9, ninety percent; that's in here. The probability they respond if they don't attend lectures and don't get good grades is very low, at ten percent, and it's fifty percent for these. Now we can actually look at the responses, that's what we actually observe, and let's suppose that these are the numbers we get. Then we can get the true totals simply by using this information and these observations. So the number of students who get good grades and attend lectures is 300, because we know that the 270 was 0.9 of the true total; and the 27 who didn't attend lectures and didn't get good grades was only 10% of the true total, so there were 270 of those in total. From that we can see that the true total sums to 1000, so we can get these probabilities, and from these we also get these probabilities. So the probability that a student attends lectures is just over 0.5, and the probability of them not attending lectures is therefore just under 0.5. From these we can just calculate the true conditional probabilities. So the probability of G
given A is just, of course, that formula, and we simply plug in these values and we get that the probability of G given A is close to 0.6, and similarly the probability of G given not A is about 0.46.

Okay, so here is the model laid out as a causal Bayesian network. What you can see is that in this node here, if we look at the probability table for that, it's specified exactly according to our estimates: of the students who get good grades and attend lectures, 90% will be in the forum poll, only 10% of those who don't get good grades and don't attend lectures will be in there, and the others have a 50-50 chance of being in there. All I've done for this node here, the probability of a good grade given attending lectures, is enter those values that we learned from the empirical study. So what we can see, if we look at the prior marginal probabilities here, is that about 51% of the students participate in the poll. Now, if they don't attend lectures there's a 45.8% probability that they get a good grade, whereas if they do attend lectures, as we know, it's 59.9%. So that simply reflects what we encoded from the data: overall, attending lectures leads to better grades.

However, suppose we now constrain the model to those who have completed the poll, so this is basically what we get if we only looked at the poll results. First of all, notice that most of those who completed the poll are people who attended lectures and got good grades; okay, we know that. But now, with this constrained study, if we compare: of those who don't attend lectures, about 81% get a good grade, whereas of those who do attend lectures, 72.8% get a good grade. So you can see that if we're fooled into thinking that the respondents to the poll are representative of the whole population, then we've got the collider bias and we get the exact opposite of the true result, namely that if you skip lectures you're more likely to get a good grade.

Okay, here's another
example of collider bias, or Berkson's paradox: why is it that most people believe that attractive people are more likely to be mean than nice? Well, let's assume that people are classified by their looks as either attractive or not attractive, and let's suppose that their personality is either nice or mean. Now, the problem is that you will generally date somebody for either their looks or their personality, at least one of them. So the people that you might date are: well, if you're very lucky, you'll be able in your lifetime to date a small number of attractive, nice people; you'll date lots of attractive and not nice people, and lots of unattractive but nice people. The people you will very rarely date are the unattractive and mean people. So the people you date who are attractive, which are these guys here, are more likely to be mean than nice. The positive association between being attractive and mean is therefore just an illusion; it's just because mean people are underrepresented in the sample of people that you date.

Let's just suppose that in reality there's an equal number of attractive and mean people, attractive and nice people, unattractive and nice people, and unattractive and mean people. So that represents the reality: there is no association between attractiveness and meanness. Now, as far as who you might date is concerned, there'll be some of those, there'll be some of those, very few of those because you're not going to be that lucky, and maybe one of those, very, very rare. These are the attractive people that you date, and so that's your perception of the set of attractive people, and of course most of those are mean; hence you get the paradox.

The final example is this result whereby it seems non-smokers are more at risk of COVID-19 than smokers. So again we have people classified as either smokers or non-smokers, and as having or not having COVID-19. The question is: is there any association between these, and in particular, is it the case
that smokers are less likely to get COVID-19 than non-smokers? Well, the problem is that the empirical studies in which the positive correlation between non-smoking and COVID-19 was discovered were based on a sample of people who were tested. And the problem is: who was tested? Well, it was mainly healthcare workers, who tend not to smoke, and people already suspected of having COVID-19; in many cases they were hospitalized with severe symptoms. We had a few of these, people who were smokers and who didn't have COVID-19, not that many; we had quite a few of these, smokers who had COVID-19 and non-smokers who didn't have COVID-19; but the majority were non-smokers who had COVID-19. And so what you can see is that smokers with no COVID symptoms, these guys, were mainly censored out of the data, and of those who have COVID-19 in the sample a disproportionate number are non-smokers. Hence we get the collider bias and Berkson's paradox.

In conclusion, then, this paradox occurs when we rely on a data set which overrepresents some subjects and underrepresents others, and we end up concluding that the true relationship between two factors is the opposite of what is in reality the case. In the case of skipping lectures, we had a positive relationship in real terms between skipping lectures and getting poorer grades, but Berkson's paradox reversed that. In the case of personality and looks, and also smoking and COVID-19, the paradox simply led us to conclude that, for two unrelated factors, one has a causal influence on the other. Now the key thing is that by using a causal model that explicitly identifies the collider variable, we can overcome the paradox without having to get any new data, which is exactly what I showed you in the case of the students.
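
The correction described for the student survey can be sketched in a few lines of Python. This is only an illustration of the arithmetic, a plain inverse-probability reweighting, and not the causal Bayesian network model shown in the video. The response probabilities (0.9, 0.5, 0.5, 0.1) and the counts 270 and 27 are quoted in the captions; the other two respondent counts (101 and 114) appear only on screen, so the values used here are assumptions chosen to be consistent with the quoted total of 1000 and the quoted percentages. All variable and function names are hypothetical.

```python
# Keys are (attends_lectures, good_grade).
# Response probabilities P(R | A, G) as quoted in the captions.
response_prob = {
    (True, True): 0.9,
    (True, False): 0.5,
    (False, True): 0.5,
    (False, False): 0.1,
}

# Observed respondent counts. 270 and 27 are quoted in the captions;
# 101 and 114 are ASSUMED values (not stated in the captions), picked so
# that the corrected total comes to 1000.
observed = {
    (True, True): 270,
    (True, False): 101,   # assumption
    (False, True): 114,   # assumption
    (False, False): 27,
}


def p_good_given(counts, attends):
    """P(good grade | attendance status) from a table of cell counts."""
    good = counts[(attends, True)]
    poor = counts[(attends, False)]
    return good / (good + poor)


# Undo the censoring: each observed count is the true count times the
# response probability, so divide to estimate the true count.
corrected = {cell: n / response_prob[cell] for cell, n in observed.items()}
total = sum(corrected.values())

# Naive estimates (respondents only) versus corrected estimates.
naive_attend = p_good_given(observed, True)
naive_skip = p_good_given(observed, False)
true_attend = p_good_given(corrected, True)
true_skip = p_good_given(corrected, False)

print(f"estimated population size: {total:.0f}")
print(f"respondents only: P(G|A)={naive_attend:.3f}, P(G|not A)={naive_skip:.3f}")
print(f"after correction: P(G|A)={true_attend:.3f}, P(G|not A)={true_skip:.3f}")
```

With these assumed counts, the respondents-only figures put skippers ahead of attenders, while the corrected figures put attenders ahead. That is the reversal described in the captions.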
Info
Channel: Norman Fenton
Views: 861
Rating: 5 out of 5
Id: eJNPUfO-Raw
Length: 12min 4sec (724 seconds)
Published: Tue Oct 27 2020