Causal Inference in Data Science: From Prediction to Causation by Amit Sharma | DataEngConf NYC '16

Captions
As Adam described, I'm a postdoc researcher at Microsoft Research, and most of my research is about understanding the relationship between prediction and causation. In this talk I'm going to give a very broad overview of why we should care about causal inference, with examples. I'll also try to show that it's not just good in an academic sense, where you feel happier because you know the causes: it's also helpful for evaluating the systems we build, like recommendation systems, and it's useful for making predictions more robust. The main focus will be the first point, why we should care, because I'll show you a lot of examples where you can get really bad results if you're not careful. With that motivation in place, I'll point to tools that can help with the second and third points, but I'll only go over those briefly.

Why should we care, especially when we have increasing amounts of data? We have data about all the properties of all the users of any online system, if we have access to the logs, and we can make really good predictions. At this conference it probably doesn't need to be said that there are lots of machine learning advances helping us make great predictions. Some examples: we can recommend what to buy, what to wear, who to date; we can detect fraudulent activity in money markets; and we can return really good search results. In terms of prediction, if you give me a problem and say "this is what I want to predict," we are very successful.

Let's spend a few seconds on how such a prediction works, in bare-bones form. Suppose your problem is to find out which users will contribute more to your platform in the future. These are Xbox users, so at Microsoft we might want to know which users will play more games in the future and which will not. One answer is to make it a prediction problem: we collect a lot of features about the users, maybe their age, gender, past activity, social network, and so on, and then we say that future activity is simply a prediction function of their earlier activity. And this is what we find in practice: one user has a lot of friends and high activity, while another user with lower activity has no friends. So if I build a prediction algorithm that just takes the number of friends and the logins in the past month, I can predict with really good accuracy who is going to come back next month. This is almost a solved problem; you can get off-the-shelf packages to do it.
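As a rough illustration of that prediction setup, here is a minimal sketch, not code from the talk: the file `xbox_users.csv` and its column names are hypothetical placeholders, and the model choice is mine.

```python
# Minimal sketch of the prediction setup described above (not code from the talk).
# "xbox_users.csv" and the column names are hypothetical placeholders.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

users = pd.read_csv("xbox_users.csv")
X = users[["num_friends", "logins_last_month"]]   # features from the example
y = users["active_next_month"]                    # 1 if the user came back

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
model = LogisticRegression().fit(X_train, y_train)
print("AUC:", roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))
# High accuracy here says nothing about whether *adding* friends would cause
# more play -- that is the causal question raised next.
```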
But the question then becomes: once we've made this prediction, so what? If I took this to my boss, they would ask: "That's good, so we know that people with more friends play more. What do I do with it? How do I now get more people to play more games on Xbox?" Here the causal question comes up: will increasing the number of friends increase activity on Xbox? The number of friends clearly predicts activity with high accuracy, but do we know that changing it will help? Let's think about it. If having more friends makes you much more likely to play because you feel part of the community, then yes. That's what this model is showing: each node is a variable, so here we have number of friends and Xbox activity, and each arrow shows the direction of a causal connection. This graph is saying that number of friends causes Xbox activity. Great.

But what if it's the opposite? What if all that's happening is that people who play a lot of games also make a lot of friends, simply because while playing games they meet new people? If that's the case, our model may not really help that much. And what's insidious is that maybe neither of those is true and something else is happening: maybe there are just some people who like playing games, and if you give an Xbox to them they'll both play a lot of games and make a lot of friends, while other kinds of people with an Xbox will do neither, and any effort to increase their friends or their activity would not affect the outcome at all. So the key question becomes: once we have these models, how do we know what causes what? And this doesn't happen only in Xbox or prediction systems; in the next couple of examples I'll show that this is a problem endemic to online systems and online platforms in general.

Take search engines. If you read any marketing journal or article, it will say that search ads are the best: they get about the highest click-through rate (CTR) compared to display ads and any other ads you can think of. That seems great - search ads really work well. But again, do they really? If we want to predict the most relevant ad to show when a user searches for "Amazon toys," we are very accurate. You can try this for most queries; I'm sure a lot of you have noticed that the ads are really relevant. Once I see an ad for Amazon toys at amazon.com, that's exactly what I wanted, and I click on it. So in some sense you might think, wow, we're getting high click-throughs. But the question again is: what's the counterfactual? What would have happened if Amazon had not paid Bing for those ads and the user had just looked at the organic search results? The answer is that they would have seen the same links in the organic results. So it's an empirical question to what extent these ads are actually helping Amazon gain new customers.

Think of the graphical model again; thinking in terms of graphical models is nice because it reduces these complicated situations to very simple principles. It could be that the search ad causes visits to the toy website; then you should spend a lot of time and money on Bing ads, good for Microsoft too. But it could be that it's just the search query: some people are interested in toys, they issue search queries, they are more likely to click on the search ad, and they are also more likely to go to the toy website anyway. And then there are lots of other people who are not interested in toys - maybe those are the ones you want to show ads to, but with this approach you're not going to reach them, and you'll still see a high CTR. The whole point of this example is that when we look at observational metrics - at how great our models are at predicting what a user will do - they are good at exactly that, but they are not good at the counterfactual: what would the user have done if we had not spent money on this ad, or had not built this new feature?
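To make that point concrete, here is a small simulation of my own, not from the talk; all the probabilities are made up. A hidden "interested in toys" trait drives both ad clicks and visits, so the observed CTR looks healthy even though the ad barely changes total visits.

```python
# Toy simulation (mine, not the speaker's): a hidden "interested in toys" trait
# causes both clicking the search ad and visiting the toy site anyway, so the
# observed CTR looks healthy even though the ad barely changes total visits.
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
interested = rng.random(n) < 0.05             # small fraction of users want toys

# Factual world (ads shown): interested users click the ad and visit the site.
clicked_ad = interested & (rng.random(n) < 0.6)
visited_with_ads = interested & (rng.random(n) < 0.90)

# Counterfactual world (no ads): the same interested users still reach the site
# through organic results at almost the same rate.
visited_without_ads = interested & (rng.random(n) < 0.85)

print("Ad CTR:                 ", clicked_ad.mean())
print("Visit rate with ads:    ", visited_with_ads.mean())
print("Visit rate without ads: ", visited_without_ads.mean())
# The causal lift is the small difference between the last two numbers,
# not the CTR that the observational metric reports.
```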
You might think these are concocted cases specific to search and recommendations, but they apply almost anywhere, even to offline ads like display billboards. Suppose Toys R Us puts billboard ads in the streets and in newspapers everywhere, because it's December: they have a lot of inventory to sell, so they make really creative ads and put them up everywhere. They might well find that the ads did really well and sales shot up compared to the previous month. One explanation is that the creative was really good, so we should reward our designers. Another is that people got interested in toys because of these ads, so ads are really effective. But if you think just a little bit more - it doesn't take much - you realize that December is when most of us buy toys, because it's the holiday season and we give toys to kids. So what might be happening is that irrespective of the intent, the ad, or the creative, if you show ads in December, or do anything in December, your clicks and your purchases are going to go up. You would be better off comparing to December of the previous year, or doing something more careful than just comparing to the previous month or week.

At this point you might still be thinking: fine, we just overestimate. There is an effect, I believe there's an effect, we just overestimate it - maybe that's okay. Well, let me show you something else. It's not just that we overestimate or underestimate; if we are not careful, we can reach completely wrong conclusions. This is the point that surprised me as well. It came from a study we did on Reddit, which I'll show you, that took us weeks to understand; once we got it, it was clear it was another problem of causal inference, and we had simply read too much into the observed data.

How does this happen? Suppose you want to evaluate a new recommendation system, or really any new algorithm, and the only thing you have is past observational data - you can't run an experiment. So you think, okay, maybe there are other ways to figure it out. You have algorithm A and algorithm B, and you carefully sample past impressions of both: the first algorithm produces recommendations, the second algorithm produces recommendations, and you look at the CTR. You might find that the old algorithm, over a thousand impressions, gets a 5% CTR, and the new algorithm gets 5.4%. That's a very nice result telling you that algorithm B is better. So you go to the decision makers and tell them, yes, continue with algorithm B, because in the observed logs of the system it clearly has a bigger effect. But then they might say: that's great, but we also want to know how this algorithm affects different kinds of people. For example, they might know from their domain that higher-activity users contribute a lot more to the platform than lower-activity users, so they want to know how the new algorithm affects different segments of users.
So you say, great, you go back to your data set, you segment the users into higher activity and lower activity, and you look at the CTR. Here's what you find: for lower-activity users, the CTR of the old algorithm is better, and for higher-activity users, the CTR of the old algorithm is again better. At this point you might wonder whether I'm making some error in the computation, but this is really all I'm doing: there were a thousand impressions per algorithm, and the total CTR at the bottom is 50 clicks out of a thousand for the old algorithm versus 54 out of a thousand for algorithm B, so B is better overall. Yet as soon as I condition on higher and lower activity, I find exactly the opposite result: in both segments the old algorithm's CTR is better. Any guesses on what's going on?

One hint is that the impression counts look very different: for the new algorithm, higher-activity users got almost 80% of the views, and for lower-activity users it was the reverse, with roughly double the impressions going to the old algorithm. What was happening was a selection effect: the people who were likely to click more, the higher-activity users, mostly ended up in the condition for the new algorithm, and the people who clicked less mostly ended up with the old algorithm. Just because of this underlying imbalance in the data, you were getting completely nonsensical results.

The unsatisfying part is that even after discovering this, you might be tempted to conclude that the old algorithm is actually better, and I don't think even that is true. Look at the diagram again: we wanted the effect of the recommendation algorithm on CTR, we saw a positive effect for the new algorithm, then we controlled for activity level because we thought it would matter and got the opposite result, that the old algorithm is better. But who knows what else is going on in the data? Here's just one example: maybe there was a time-of-day effect as well. For whatever reason - some bug, some quirk of the engineering pipeline - the new algorithm was shown more in the evenings and the old algorithm more in the mornings. It's entirely possible that if you kept conditioning, your result would flip again, or maybe it would just become insignificant. My point is that these are valid methods - conditioning on things you think matter is good - but after a point it's a matter of knowing your system, and with these kinds of methods the best we can do is approximate the truth.
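The flipped comparison above is a classic Simpson's paradox, and it is easy to reproduce. A small sketch follows; the per-segment click counts are made up by me, chosen only so that they reproduce the aggregate figures quoted in the talk (50 and 54 clicks out of 1,000) and the roughly 80/20 impression skew.

```python
# Simpson's paradox sketch: the new algorithm wins overall but loses in every
# activity segment, purely because of how impressions were allocated.
import pandas as pd

logs = pd.DataFrame({
    "algorithm":   ["old", "old", "new", "new"],
    "segment":     ["low_activity", "high_activity", "low_activity", "high_activity"],
    "impressions": [800, 200, 200, 800],   # new algorithm shown mostly to high-activity users
    "clicks":      [30, 20, 6, 48],        # made-up counts consistent with the talk's totals
})

per_segment = logs.assign(ctr=logs.clicks / logs.impressions)
print(per_segment)            # old beats new within each segment

overall = logs.groupby("algorithm")[["clicks", "impressions"]].sum()
overall["ctr"] = overall.clicks / overall.impressions
print(overall)                # yet new beats old in aggregate (5.4% vs 5.0%)
```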
Here's another example - a personal one that I found while working on Reddit, and as I said, it took us weeks. I was working with a PhD student, and we took the entire Reddit data set from 2008 to 2014. There's no sampling bias and no API restriction; we have the entire comment history of everyone who commented at any point in time. What you find, if you put the years from 2008 to 2014 on the x-axis, is that average comment length is decreasing over time - that's the purple line going down. If you were Reddit and looked at the average comment length in 2008, it was high, around 220 characters; by 2010 or 2011 it's about 180 characters per comment; and by 2014 it's down to about 170. That's a clear cause for concern if you're Reddit, because for some reason people are commenting less and less. We came up with theories: maybe when Reddit was smaller, people felt more of a sense of community, so they commented more, and now it has become such a big mess that people don't feel as attached to Reddit and hence don't comment as much.

[Audience]: Did you take a sample of the comments?

No - that's the interesting point. We had the entire set of Reddit comments, not a sample. There was an API that allowed us to access Reddit comments and we swept through the entire history, so this average is across all comments ever made from 2008 to 2014.

[Audience]: You might have something like a hundred million comments in 2014 compared to a hundred thousand in 2008, so isn't the distribution skewed?

That's an interesting question. The concern is essentially that there were fewer comments early on and more and more as time went on. That would matter if the early sample were small, but even in 2008 and 2009 we already had more than a million users on Reddit, so the error bars on these means are really small; if you plot the means with error bars, they're almost indistinguishable. So that could have been happening, but at Reddit's scale it's very unlikely. And by the way, if you have any questions, feel free to jump in.

[Audience]: Maybe people just switched to the mobile app, which pushes you toward shorter comments.

Yes, that's a good suggestion; maybe that was happening. And this is what is so interesting about these problems: we can come up with a lot of theories, but it's just hard to know what was actually going on. Let me show you another plot that might help. What we did next was, instead of looking at the overall comment history, we picked each user from whenever they joined Reddit and plotted how their average comment length changes over time. What I'm showing you now is time in user-relative terms, which means that whether you joined in 2008 or in 2014, your curve starts from the left of the graph; it just starts at a different level. And what we find is that for every cohort, for every year in which users joined, comment length is increasing over their tenure. The only difference is that people who joined earlier, in 2008, start at a much higher level - they were already writing longer comments - and then increase from there, while the newer users, for whatever reason (maybe they're millennials, maybe they're on the mobile app), start from shorter comments per post and increase from that lower starting point. That's all that was happening. It wasn't that Reddit had to discover a new way of making people want to comment more; whatever level people started from, they kept moving toward longer and longer comments. So the problem wasn't that people didn't want to comment more.
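The per-cohort view described above is straightforward to reproduce if you have comments with an author, a timestamp, and a length. A hedged sketch: `comments.csv` and its column names are hypothetical placeholders, not Reddit's actual schema.

```python
# Cohort view: average comment length by join-year cohort and years since joining.
# Hypothetical schema: one row per comment with author, timestamp, and body length.
import pandas as pd

comments = pd.read_csv("comments.csv", parse_dates=["created"])
comments["year"] = comments["created"].dt.year

# Each user's cohort is the year of their first comment.
cohort = comments.groupby("author")["year"].min().rename("cohort_year")
comments = comments.join(cohort, on="author")
comments["tenure"] = comments["year"] - comments["cohort_year"]

# Overall average by calendar year (the declining purple line) ...
print(comments.groupby("year")["length"].mean())
# ... versus average by cohort and tenure (each cohort rises over time).
print(comments.groupby(["cohort_year", "tenure"])["length"].mean().unstack())
```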
The problem was a selection effect: the new users coming in simply were not in the habit of writing longer comments. Once you discover this, your solution may be very different from what it would be if you thought declining comment length itself was the problem. In simple terms, what people in the social sciences would say is happening is selection bias: the population of people who join each year is changing, and there's only so much you can do except know exactly what kinds of people you're targeting and what kinds of strategies might work for them.

So the first part of the talk really makes the point that making sense of data can be just too complex. Even if you have millions of users, even if you have billions of impressions, there are some fundamental questions that cannot be answered just from a lot of data. The logical next question is: how do we reason about these kinds of paradoxes? The good news is that people across different sciences have spent a lot of time on this, even before we had so much data. Philosophers like Aristotle and Hume thought about what it means to cause something - what is it we want to know when we say we want to discover the cause of, or the scientific explanation for, some effect. This applies in medicine: if you have a drug, you want to know how well it works, but all you get to see is sick patients going to hospitals and receiving drugs. In the social sciences, you want to know how well your economic policies are doing - would making the exchange rate higher or lower make the economy better or not? You'd be happy to run an experiment, but most likely people won't allow you to. Similarly in genetics, you want to find the effect of genes on a disease. So there are lots of examples, not just online systems but almost every part of science, which thankfully have contributed to our understanding.

Without going into the details, there are two very well-established frameworks. The first is causal graphical models, which we were implicitly using when we looked at each of these effects in terms of graphs, and the second is the potential outcomes framework. If you're interested, you can go into the details of how each works; I'm going to stick to examples. For us it's not a philosophical question - we are very practical, this is a data engineering conference - so we use a very practical causal meaning: X causes Y whenever a change in X necessarily produces a change in Y, and all we want to estimate is the magnitude by which Y changes. If X is your new algorithm or your new UI, we just want to know how much changing X caused a change in Y that would not have happened if you hadn't changed X. Simple definition. What it forces us to do is think in terms of counterfactuals, and personally I've found this the most important concept for thinking about these problems: whenever you make a decision, or whenever you're thinking of improving something, ask what would have happened if you had not done it, or if you had done it with higher or lower intensity. Once we ask these what-if questions, it becomes much clearer what exactly we are doing and how much of an effect we practically expect it to have.
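In the notation of the two frameworks just mentioned, this practical definition is usually written as an average over counterfactual (potential) outcomes. This is standard notation, not something specific to the talk:

```latex
% Average causal effect of changing X (e.g., deploying the new algorithm) on outcome Y.
% Potential-outcomes form and graphical-model (do-operator) form:
\mathrm{ATE} \;=\; \mathbb{E}\bigl[\,Y(1) - Y(0)\,\bigr]
            \;=\; \mathbb{E}\bigl[\,Y \mid \mathrm{do}(X=1)\,\bigr] \;-\; \mathbb{E}\bigl[\,Y \mid \mathrm{do}(X=0)\,\bigr]
```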
Next, I'll try to explain the power of these methodologies using two examples. The first is evaluating impact: there's something already in the world, like a recommendation system, and you just want to know how much effect it's having, because that will help you decide how many resources to put into it. The second is that you may want to design a recommender system in a way that takes causal principles into account from the start. So first I'll talk about how you evaluate an existing system, and then about how you design something from causal principles, because I think that's exactly what a lot of us should be doing - it's just very hard to do.

Why is it so hard? I've already given some examples, but I think this cartoon gets to the heart of it. Here's the problem: we have an old algorithm, or an old policy, and it gives us some outcome, a number of clicks. We then change the algorithm, we get new recommendations, we get better clicks - but we haven't answered the question, because the question wasn't to compare the new period to the old one. The question was to compare the world with the new algorithm to a counterfactual world in which you did not have the new algorithm. If this were Toys R Us, they had their clicks in November and their new clicks in December, but what I really want to know is what the clicks would have been in a different, counterfactual world where they cloned their customers and then showed them nothing in December. You can see that it would require changing the laws of physics to measure that directly.

So what people do instead - and I'm sure all of you are familiar with A/B tests - is to say: we cannot clone users, but we can randomly assign what people see in the different conditions. The argument is that as long as your coin is random, if you show some users the new algorithm and some users the old algorithm, then on average there is nothing different between random user one and random user two, so the comparison is valid. And this works brilliantly across all kinds of settings.
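Under randomization the estimate is just a difference in means between the two arms. A minimal sketch, with the assignment and outcomes simulated here for illustration; in practice the `clicked` column would come from your logs.

```python
# A/B test sketch: with truly random assignment, the difference in mean
# outcomes between the two arms estimates the causal effect of the change.
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 50_000
users = pd.DataFrame({"user_id": range(n)})
users["arm"] = rng.choice(["old_algo", "new_algo"], size=n)   # the random "coin"

# Simulated outcome for illustration; in practice this column comes from your logs.
click_prob = np.where(users["arm"] == "new_algo", 0.054, 0.050)
users["clicked"] = rng.random(n) < click_prob

lift = (users.loc[users["arm"] == "new_algo", "clicked"].mean()
        - users.loc[users["arm"] == "old_algo", "clicked"].mean())
print(f"Estimated lift of new over old: {lift:.4f}")
```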
The only problem is that experiments are hard to do, and sometimes they are simply infeasible. Think of an experiment to determine whether being a subscriber on Amazon or Netflix actually makes you shop more or watch more videos. The only way to run that experiment is to artificially make some non-subscribers into subscribers and some subscribers into non-subscribers - imagine going to Amazon and asking them to randomize Amazon Prime, and picture the reaction. So you need some other method. Another issue is that an experiment may just be unethical. Suppose you're Amazon and you want to find the right subscription price for Prime: should it be $99, $80, $50? "Simple, just run an experiment" is what economists would tell you, but a lot of customers may lose faith in you because, purely by random chance, some had to pay much more for membership and some much less. And experiments can also just be inconvenient: if you're testing a UI element, it can drive users away if it doesn't work.

So the natural question becomes: what can you do with just observational data? I showed some examples where you keep conditioning. If you have a very good idea of how your system was built - hopefully you do, because you built it - you can keep conditioning on things you think might matter, like activity level, time of day, and so on. But it's a bit of a rabbit hole: you can always approximate the truth, but you'll never be sure you've reached it. The other approach, and this is recent research work that has been developing, is to find randomized experiments inside the observational data. The basic idea is that we can't run experiments, but we can find as-if-random events that occur in the data set we're interested in. For example, if you want to estimate the effect of a recommendation system, you can exploit some naturally occurring variation, like Jon Stewart suggesting a book on his show. That creates an artificial shock to the book website, and you can then see how much of that traffic flows through the recommendation system to other books. So instead of running an experiment where you artificially force a lot of users to see a book and measure the effect on recommendations, you can use Jon Stewart, Oprah Winfrey, anyone who creates these unusual conditions in your data set, and use them to see the spillover effect on recommendations.

It turns out that these external shocks can be used to estimate the counterfactual, because it's pretty clear that if Jon Stewart had not sent a lot of people to that book page, it is very unlikely they would have clicked through to the recommended books. Without the recommendation system, people would only have seen the first book; because of the recommendations, they saw other books, so you can estimate the effect that way. Here again is the causal graphical model - you can actually prove this, but I'll just give the schematic. In general, visits to a book or an app are confounded: visits can cause recommendation clicks, but underlying demand is correlated with both, which is exactly why the items were recommended in the first place - the visits could have happened anyway. The nice thing is that when there is a sudden shock, visits to the focal product spike. Suppose this was your first product, app one, which had a huge spike in traffic; you can then make a clear case that if demand for app two stayed constant, the additional views of app two would not have happened without the recommendation system that was showing people links to click on. Based on this very intuitive strategy you can define a method that estimates the effect, and - this is just a detail - you can show that this estimate equals the effect of removing the recommendations from that page, comparing those two counterfactual worlds.

The objection, and some of you might be thinking this, is that it's a very odd approach: you're taking these rare, odd things that happen in your data set and computing the effect on that small set of shocks, so in some sense it may not be fully satisfying. You can then try to increase the breadth of what counts as a shock, but the real upshot is this: if randomization is possible, do it; if it's not, consider these creative methods of finding experiments in your data.
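Here is a very rough sketch of the shock idea, not the exact estimator from the research mentioned in the talk; the `daily_visits.csv` file, the column names, and the crude spike detector are all my own illustrative assumptions.

```python
# Very rough sketch of the natural-experiment idea (not the speaker's estimator):
# use an external traffic spike to a focal product to estimate how many
# recommendation click-throughs it causes on a related product.
import pandas as pd

# Hypothetical daily visit counts; column names are illustrative.
traffic = pd.read_csv("daily_visits.csv", parse_dates=["date"])
shock_days = traffic["focal_visits"] > 5 * traffic["focal_visits"].median()  # crude spike detector

baseline_focal = traffic.loc[~shock_days, "focal_visits"].mean()
baseline_rec   = traffic.loc[~shock_days, "rec_clicks_to_related"].mean()

extra_focal = (traffic.loc[shock_days, "focal_visits"] - baseline_focal).sum()
extra_rec   = (traffic.loc[shock_days, "rec_clicks_to_related"] - baseline_rec).sum()

# If demand for the related product is otherwise flat, the extra recommendation
# clicks per extra focal visit approximate the recommender's causal click-through rate.
print("Estimated causal click-through rate:", extra_rec / extra_focal)
```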
In fact, what I've been finding in my recent research - there are some papers listed at the end - is that you can do this in a data-mining way: you can use an algorithm to find these natural shocks automatically. But in some sense it is still unsatisfying, because now we may know that this recommendation system isn't working as well as we thought, or maybe it's working great - but then what do we do with that? How do we come up with a new, better algorithm to replace it? Here I think there has been a lot of work showing that what you should really do is think of systems as continuous experiments.

Let me explain with a gambling example. If you've been to a casino, you've seen the slot machines: you pick one slot at random and you get some reward. Now imagine you are an algorithm. Whenever I pick one slot - that is, I show one recommendation - I get some reward: it's a click or it's not a click. What you can do over time is randomly select different arms and figure out which arm works best. This requires a change in the way we think about algorithms, because it means there is no single golden algorithm, and maybe it's not worthwhile to try to find one. What we should really do is this: we have a list of items to recommend, we randomly recommend each of these items in turn, we learn which ones people like and which kinds of people like which items, and as we learn, we keep showing more of the strategies that work and less of the other strategies we tried. And it keeps going.

The simplest algorithm you could implement with this kind of approach is a multi-armed bandit. The old setup showed you some recommendations; the bandit view says: I don't fully trust this algorithm, and I would not even fully trust a new algorithm that comes in its place. So here's what I do: I show whatever I currently think is the best algorithm to most of the users, and I show something random, completely arbitrary, to the rest of the users. By showing the best-known algorithm to most people, you are exploiting - that is the world we are always in: we predict well and we keep showing people that - but you also keep this exploration piece, which says: that algorithm may be great, but I'm going to keep testing new, random strategies on some people. You can show theoretically that if you always do this, you get a feedback loop in which the exploration keeps informing your current best algorithm, and it keeps improving every day, every second. The simplest version looks like this: with low probability you show a completely random item to a user - this is where you learn - and with high probability you show the current best output of the algorithm. In effect, the system is learning as it goes: instead of a fixed train/test split, it tests new items on its users and feeds the results back into the algorithm.
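The "low probability random, high probability current best" rule described above is the classic epsilon-greedy bandit. A minimal self-contained sketch with a simulated click model follows; the item names, click probabilities, and epsilon value are made up for illustration.

```python
# Epsilon-greedy bandit sketch: explore with small probability, otherwise
# exploit the item with the best click rate observed so far.
import random

true_ctr = {"item_a": 0.04, "item_b": 0.06, "item_c": 0.02}  # unknown to the algorithm
shows = {item: 0 for item in true_ctr}
clicks = {item: 0 for item in true_ctr}
epsilon = 0.1  # fraction of traffic used for exploration

def observed_ctr(item):
    # Optimistic value for never-shown items so each gets tried at least once.
    return clicks[item] / shows[item] if shows[item] else float("inf")

def choose():
    if random.random() < epsilon:
        return random.choice(list(true_ctr))      # explore: completely random item
    return max(true_ctr, key=observed_ctr)        # exploit: current best item

for _ in range(100_000):
    item = choose()
    shows[item] += 1
    clicks[item] += random.random() < true_ctr[item]  # simulated user feedback

print({i: round(observed_ctr(i), 4) for i in true_ctr})
print("Most-shown item:", max(shows, key=shows.get))
```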
And this is also practical: just as the previous method was used successfully on Amazon's recommendations, this one is being used on MSN, and here is the Yahoo homepage - there's a paper showing how this approach works for choosing which news articles to recommend.

Finally, I want to close with a very strong warning. In the last two parts of the talk I presented two methods that work well, and you might get the impression that if you use them you'll mostly be doing causal inference correctly. I would be a little skeptical, and if there's one thing you take away from this talk, I'd prefer you keep this graph in your heads. It's compelling because it shows that the number of computer science doctorates awarded in the US and the revenue generated by arcades are almost perfectly correlated. You might say, that's not right, that doesn't make sense - but we only know that because we have some sense of what's going on here; we know, or at least hope, that computer science PhD students are not motivated by arcades. This could be any two time series, any two variables in your data set, and if you're not careful and don't have an intuition about what's going on, you'll be in trouble trying to interpret these things, and in trouble predicting based on them. So always be skeptical of causal claims from observational data - really, from any data. And one more thing that is quite true: more data does not necessarily improve your causal estimates; it might just make you more sure of the wrong result. That's it, thank you.

[Applause]

[Host]: We have time for one or two questions, if anyone has them; I can bring the mic over to you.

[Audience]: In the recommendation example, how did you define high activity and low activity?

So the question was how we divided people into higher- and lower-activity users in the recommendation example. That was just based on what the people running the property were already using; if I remember correctly, about 10% of the users were classified as high activity and everyone else as low activity. But the larger point is that the specifics don't matter that much, because you can imagine another study where you use 20% or 30%. The insight for us was that the result changes and flips based on what you select; you can use any cutoff.

[Audience]: How long should you run the multi-armed bandit - how long should you explore before exploiting, so you avoid getting stuck?

So the question was: in the multi-armed bandit explore-exploit approach, how long should you keep exploring versus exploiting? I think that really depends on how important getting the right answer is to you. But maybe another way to think about it is that these distributions may keep changing over time, so perhaps the right answer is to say: I'm going to run this always, and the only thing I'm going to change, based on my confidence, is the percentage of time spent exploring versus the percentage spent exploiting. It requires a key philosophical shift in the way we think about algorithms, because first it takes some humility to admit you don't have the best algorithm, and second it's this slightly odd scenario where every day there is an experiment going on and the system keeps learning from it.
But I think it's really important, because in some cases it may not matter much - for example, if you're trying to predict someone's credit history, it's not changing day after day - but think of cases like what music people like, or what people want from news: these things change very quickly, so what's interesting today may not be interesting tomorrow. It doesn't make sense to assume that a prediction algorithm that was working yesterday will keep working as people's preferences change. So maybe the answer is to always keep exploring.
Info
Channel: Data Council
Views: 21,420
Rating: 4.8780489 out of 5
Id: 6SCoaBo1MqU
Length: 39min 48sec (2388 seconds)
Published: Mon Dec 19 2016