Sean Taylor: When do we actually need causal inference?

Captions
Good evening, everyone. I see everyone's filing in, and I see some names as we go by: Jason, Harlan, Paul; Georgette's here. It's good to see some familiar names; too many, I've already lost track of all the names of the people I've seen before, but it's good to have everyone here. We'll just give a few moments as everyone logs in. I see Michael; there are a lot of Michaels, though. I see Mike and Michael, a lot of you I know, and there's somebody with the same name, so I'm just going to say Mike and Michael. Let's see who else we have here. Good to see everyone; we have Tom. Good to see everyone coming in here, and again we'll give everyone just a few moments to file in as the waiting room opens up and we get everyone situated. Very excited, everyone. Hope everyone's summer is going well. We have Katarina and Clifford coming in; hope everyone's having a nice start to the summer, unless you happen to be coming from the southern hemisphere, in which case I hope you're enjoying your winter. I know we do often have people from the southern hemisphere, so hope everyone's enjoying whatever season you're in. All right. Oh, Dusty's here, good to see you too. Not getting to see anybody in person has really, really sucked; hopefully soon enough we will be seeing each other in person. So, first things first: people who used to come in person, back when that was a thing, all know that I like to always start the meetup with who is hiring. Of course, we can't exactly do that right now, because people can't just jump in here and shout. So here's what I'd like everyone to do: if you're looking to hire someone, go into the NYHackR Slack and post your job in the job-postings channel. This meetup has been great at getting people jobs for the past 11 years; 12 years now, actually. It's been great at getting people jobs all throughout the years, and there are some people here in the chat room who I know got jobs through the meetup, I'm talking a decade ago. So if you
want to hire someone, one of the many talented people in this group, go to the job-postings channel and post your job there, and make sure you include a link so people can apply. We'd love to have as many people here getting jobs as possible. One of the lucky things for me is that since I'm on screen, I get to announce that I am hiring, and I'll take that opportunity. I'm hiring for three roles: we are looking for a data scientist, a salesperson, and a Linux sysadmin with familiarity with mostly Ubuntu, though Red Hat is useful too. So if you're looking for any of these three roles, data scientist, sales, or Linux sysadmin, talk to me. You know how to find me: find me on Slack, by email, on Twitter, anywhere; I'm really easy to find. So everyone, post jobs. Next up: those of you who've been coming in person, and even the past few years virtually, know that we're all about the pizza here. Mine has just arrived, so I have this beautiful slice from Dominick's Pizza right here; I'll hold it upside down to show you the cheese lock, and I'll show you the bottom. Hope everyone's enjoying the pizza, wherever you have it coming from. Sound off in the chat and let me know where you got your pizza from, or the other snack you're having instead of pizza. A few other events we have coming up, where there will of course be pizza involved, because you folks know how I am. Starting tomorrow, and it's sold out, but I like telling people about it because we do it every year: the Stan workshop. We've been doing this for, I want to say, seven years or something; it's been a very long time. It's a three-day workshop with the team behind the Stan language on doing Bayesian analysis. Jonah Gabry is going to be leading it, and we have Andrew Gelman appearing, and Rob Trangucci and Scott Spencer all coming to teach this class. We do it every year, and the money goes toward further Stan development, so we are very excited for that. We'll do it again
next year, and maybe sometime in between. That one is sold out, but we are excited to be doing it. I see a question, I think from Sean, about cheese lock, asking what cheese lock is. Sometimes you slice the pizza and the cheese comes sliding right off; it's not attached. Cheese lock is where the cheese stays in place, and believe it or not, the big pizza chains do a lot of chemical research to figure out how to make that work, while smaller pizzerias often get it through the way they handle the sauce, which lets the water evaporate so the cheese holds on tight. So there's cheese lock for you. Next month, still virtual, we have our meetup. I don't have the date in front of me because I forgot to write it down; we have a date, I just forgot to write it down. Ian Cook will be giving a talk, so we're very excited to have Ian Cook virtually with us in August. We will be announcing that hopefully by Friday, maybe by Monday, but either way, stay tuned: someday in August we will be doing that. Then September: we are really, really, really excited for September. It is our return to live, at least partially, depending on what we can find; I'll explain it all. We have our seventh annual New York R Conference coming up September 9th and 10th. It's going to be two days of fun, and we have some really great speakers; some of them are even in this room right now, virtually in this room, speaking. It's going to be a really fun time, September 9th through 10th. We have workshops associated with that conference on September 1st; this is the first year we've split it across two different weeks. So, workshops from the likes of David Robinson, Malcolm Barrett, Lucy D'Agostino (I'm forgetting her last name; I'm sorry, Lucy, if you're here, I forgot your last name), Kaz Sakamoto, Jeroen Janssens; we have a bunch of really fun, cool, good people giving workshops on September 1st, and the conference on September 9th and 10th. We are also trying to host an in-person meetup
September 1st, but unlike the conference, it is very hard to get a space to hold us for the meetup, because a lot of offices are not really set with their reopening plans. So if you have a space that can hold about 100 people on September 1st or 2nd, we would love to hold the meetup there that week, to coincide with the workshops and the conference. If you have an office space, a venue, or any sort of space that can hold about 100 people, send me a message right away, because we would love to have someone host us. If you are looking to attend the conference, or just about any event we put on through this meetup group, use the code NYHACKR for a 20% discount. For any event that we organize, you can use that code, because the events we organize are really for this group; you'll get a 20% discount by using code NYHACKR. I see in the chat that Nicole posted that the next meetup is August 12th with Ian Cook, and if you want to get tickets for the workshops and the conference, you can do so at rstats.ai. By the way, the conference is in person, but it's also virtual, so if for some reason you live somewhere you can't attend from, or you don't feel comfortable attending, you can watch it virtually. It's going to use a platform called Hopin, like we used this past year for the virtual events, which would be really great for those of you worried about attendance; again, if you don't feel comfortable, do it virtually. But we are following the guidelines of Broadway: Broadway is fully opening in September, and I figure if Broadway can do it, we can absolutely do it too. And for those of you who have attended before, it is in a much bigger space; it's going to be much more physically comfortable, you can spread out more, it's going to be a lot of fun, it's going to be really great. For tonight's meetup, we obviously can't take questions as we go; you can't all just shout them out. So if you
have a question for Sean, and I hope there will be a lot of questions (do your best to stump him; that's our goal today, we want to stump Sean), post it either in the chat you see right here in Zoom, or in the meetup Slack. I'm going to post a link directly to the channel in Slack; I hope this works. Oh, I've got to send it to the right place, hold on. If you click that link, it should take you right to the meetup channel; it's called monthly-meetup-chat, and it's inside the NYHackR Slack. So if you have questions, ask them in there or in Zoom, and at the end of the talk I will collate the questions and ask them of Sean. We are hoping for a lot of good questions and a lot of fun, because this is a very fun topic right now. With that: there is a question coming in about whether it's fully-vaccinated-only or anyone can attend. We are following CDC guidelines and New York rules about gatherings, and as far as we can tell, the conference is open to anybody; whatever Broadway's doing is what we're doing, just to keep it safe that way, and we'll follow the rules as they exist, regardless of my proclivities. We want everyone to be safe; that's the most important thing. So, we have a speaker who's been a long-time member of the meetup. He used to live in New York, came to meetups a lot, spoke at the meetups, spoke at the conference; I'm not sure how many times he's spoken across all of our events, I think it's been multiple between everything. So I'm very excited to have this very hot topic, causal inference; it's so hot right now. And I'd like to invite Sean to the stage to give us a rousing talk; I'll virtually applaud for you. [Sean] Thanks, Jared. Really happy to be here, excited to be back. I think my last talk was actually quite a long time ago. [Jared] So you're overdue. [Sean] It's definitely overdue. Let me try to get slides cued up here. All right. So this is a pretty personal talk for me, because
I'm now about 10 years into my official career as a data scientist. I did grad school in New York at NYU, and I was trained with economists; I hung around with a bunch of economists, and they have a certain aesthetic for causal inference, which is basically that we can't even publish something if it's not identified. That's a pretty strong perspective. And then I had 10 years of working on practical problems at companies like Facebook and Lyft, and hearing about what data scientists at other companies are working on, and trying to figure out: when am I supposed to apply this tool? Should I be as rigid as those economists are, or should I be more open-minded about when is the right time to use these tools? I think it's a really interesting topic, and it's very personal to me, because I've become this "causal inference guy": people associate me with forecasting, from Prophet, and now also with causal inference. And I have a very nuanced perspective on it; I'm not a Kool-Aid drinker in any way. I think there are actually a lot of interesting reasons to use it, and also not to use it. That's the spirit of this talk: to be intellectually critical of these ideas and try to explore when we should be thinking this way and when we should be thinking in different ways.

Just to preview my answer to this question: there are two provocative answers you could give. One is that you always need causal inference, and the other is never, and I'm going to do both. That's the spirit of the talk: I'm going to try to take both perspectives, and there's a little bit of intellectual rigmarole to make them both fit. In the "always" section, I'm going to make the claim that all models are wrong but some are useful, and useful models are ones that tell us what we should do. Telling you something useful involves having a causal model; it just has to, because you have to do something as a result of what the model says, and you have to model what the result of that is going to be. So the "always" case is basically about pragmatism, and pragmatism yields a setup where we always need causal inference, because we're always trying to make something happen. In the "never" section, I'm going to be even more provocative: I'm going to take a strong stance that causal inference is just a tool, so maybe there are lots of cases where we don't need it. In fact, maybe all of the useful cases are the ones where we could estimate what causal inference tools would give us through other means. That's maybe a little more provocative than the first section, but I think it also has a lot of pragmatism associated with it, from my experience working in industry.

All right, first section: we always need causal inference. To start this section I want to introduce the data scientist's best friend, which is the conditional expectation function. You may know this as scikit-learn or PyTorch or Keras or whatever your preferred function fitter is, but we use it all the time in data science; it's the workhorse technology of data science, and I'm going to completely abstract away the implementation details of how we get this thing. We use it all the time: even when you work with tabular data and you do a group-by and compute means, that's a conditional expectation function too. We're always conditioning on things and always trying to estimate something from the data. So this is a thing we're going to continue to build on top of, assuming your job is to make one of these functions. And just for the venture capitalists in the audience: this function, its existence and usefulness, suggests a set of businesses you should invest in. Feature stores provide the S's, metric stores provide the Y's, we train models and produce the hats on the E's, and then we store and serve the models. All these things should work together, and they're all pain points for data scientists; almost everybody can relate to one of these problems being a big pain in the ass.

When you work on conditional expectation functions all the livelong day, there's a set of problems you run into as a data scientist, and there's a journey of things that you hit. When you're early in your career you think: oh, I'm not using the right language; maybe I should be using Python instead of R, or Stata doesn't work for this particular thing. Then you move on to caring about packages, or tuning hyperparameters, or better feature engineering; you're climbing up the hierarchy of needs. And then I think really enlightened people say: I just need better data; the data I have isn't adequate for the problem I want to solve. Lately I've been really interested in model evaluation; I think it's really hard, and even after you have all this stuff, "do you have a good model?" is a pretty hard question to answer. But at the very end of that, you just have a model, right? You haven't done anything; you haven't made anything happen in the world. You need the model to make something good happen, and conditional expectation functions are just a tool for enacting some kind of outcome that you would like to see in the real world. So in the next three sections of this talk, I'm going to give stories of how the model can actually make something happen in different data science contexts.
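For concreteness, here's the group-by-and-average version of a conditional expectation function mentioned above, sketched in plain Python. The mini-dataset and state labels are made up for illustration; scikit-learn or PyTorch would ultimately be estimating the same object, E[Y | S].

```python
# A conditional expectation function E[Y | S], estimated the simplest way:
# group observations by state S and average the outcomes Y.
from collections import defaultdict

def fit_cef(states, outcomes):
    """Return a lookup table s -> mean(Y | S = s), i.e. E-hat[Y | S]."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for s, y in zip(states, outcomes):
        sums[s] += y
        counts[s] += 1
    return {s: sums[s] / counts[s] for s in sums}

# Toy data: S = whether the user is "new" or "tenured", Y = retained (0/1).
S = ["new", "new", "tenured", "tenured", "tenured"]
Y = [0, 1, 1, 1, 0]

cef = fit_cef(S, Y)
print(cef["new"])      # 0.5
print(cef["tenured"])  # ≈ 0.667
```

Any fancier learner just replaces the lookup table with a smoother, higher-dimensional estimate of the same conditional mean.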
Story number one is about churn prediction; it's almost the "hello world" for data scientists. What does this model look like? Well, first we have a label, which is maybe that the user cancels their account, or we haven't seen them in 30 days, or whatever. And then S is just anything; it's anything in your database, so you can create all these features about what the user has been doing recently, and you're going to build some great model. (Sorry, I'm getting a bunch of chats; I just want to make sure I'm not... okay, positive feedback, good.) All right, so we have this model, we fitted it, awesome, we're going to use it. And then there's a theory you have when you build a model like this: you're doing the underpants-gnomes business model, where you have a model, there's some step where someone's going to do something using the model, and then you're going to make more money. It's actually a very reasonable theory, because you could just send people coupons, and everyone's got this idea: what are we going to do with the churn prediction model? We're going to do something about the scores, and then there's going to be a great business win, because we're going to prevent all those users from leaving instead of just waving them goodbye, which is really all that model is equipped to tell you how to do. I think that second step should probably be part of the first step, and it's a sub-optimization to think of them as separate. So I'm going to propose, and this is a big thesis of this section of the talk, that these two activities of business and of society are fundamentally intertwined in a way that we have underrated, and that thinking about doing good stuff in the world can inform how we should be estimating stuff at the data science phase. When we tear down this wall, we can create a joint procedure for doing these things, and that would be a better way for us to work: explicitly acknowledging that we're going to need to do both.

So how do we solve that problem? Well, we add a new variable to the model, and that variable contains stuff we can do about the user potentially churning. So we have Y, S, A. S and A are ordered this way because this is the order in which they occur: S is information we have available at the time we're making the prediction, A is a choice that we make, and Y is what we observe afterward. This is a ubiquitous causal architecture for business: we have collected lots of information, we're forced periodically to make some choices, and then we observe outcomes associated with those choices, and we would like to form some conditional expectation of what would happen if we took different kinds of actions. This is a counterfactual model. I didn't put any counterfactual language on it; Judea Pearl would really hate me for not putting a do-operator on the A, but we'll get to that in the next section. All right, so in this model, you are part of the model, and this is a really underrated aspect of modeling. When we're trained as scientists, we're trained in a positivist regime: the universe has certain laws, we're meant to learn them, and we are passive observers of how the universe plays out. Even when you write an academic paper, some people will use passive language, like "the instrument was applied to" whatever, which explicitly removes the language of you being involved in the procedure; you are taken out of the context, and it's meant to be more objective. I'm more on the other side: let's just acknowledge that we take part in the system that we're studying, and that requires us to put variables that represent our own behaviors and actions inside the model. (Oh yeah, I was just joking around that the order of the S and the A could be the other way; we can make some jokes.)

All right, so we have S and A as our two right-hand-side variables, and that yields a very useful kind of visualization: what is the distribution of state-action pairs that we observe in the data we have? I call these state-action diagrams, but that yields a very sad acronym, so we'll call them state-action plots. A very interesting thing about these plots is that they're not meant to show the full joint distribution of (S, A); they're meant to show where we have positivity. This is a big assumption in causal inference: that we have overlap between the actions whose outcomes we would like to predict and the actions we have taken historically. Positivity is a necessary condition for us to be able to infer what would happen under a particular action, given the state: we have to have observed it historically, in the data we have; we have to have the ability to observe it. So these green lines show where we have positive mass: where have we ever done this action, given this state? The diagram on the left is a very boring one, where we always take the same action, like never giving the user a coupon. From that model we cannot learn anything about a world where we gave them coupons; it's just not in the data for us to say. The diagram on the right has mass in both places, which means there's a positive probability, given any state, of getting a coupon or not.
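A positivity check like the one these plots encode can be sketched directly: scan the logged (state, action) pairs and report, per state, which actions have never been tried. The log and the state/action names below are toy data, invented for illustration.

```python
# Positivity check for a state-action dataset: for each state s, have we
# historically taken every action we might want to evaluate? If some (s, a)
# pair has zero mass, no amount of modeling can tell us E[Y | s, a].
from collections import defaultdict

def overlap_report(pairs, actions):
    """Return state -> set of actions never observed in that state."""
    seen = defaultdict(set)
    for s, a in pairs:
        seen[s].add(a)
    return {s: set(actions) - seen[s] for s in seen}

log = [("low_risk", "no_coupon"), ("low_risk", "coupon"),
       ("high_risk", "no_coupon")]
missing = overlap_report(log, ["coupon", "no_coupon"])
print(missing["low_risk"])   # set() -- full overlap, effects identifiable
print(missing["high_risk"])  # {'coupon'} -- never tried, can't be estimated
```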
In that world, for any user, we could tell what would happen if they got a coupon or not, by pure induction: we can gather examples of similar users, some who got a coupon and some who didn't, and then we're able to perform this causal inference. That solves the estimation problem we have, and I'll refer back to these diagrams quite a bit, because they're quite useful. Okay, so when you're in the model, you fit a model of what happens under coupon and under no coupon, and then you use them, and they suggest a course of action. This is commonly called a heterogeneous treatment effect model: there are different effects depending on some context variable. So if we computed some retention score, we might learn where the causal effect is largest; those are the highest-incrementality people to be giving coupons to. By fitting the right kind of model, rather than the churn prediction model, this model tells us what to do, who to give coupons to, and it also tells us when the coupons are actually valuable or not. There are some people for whom they're not very useful; maybe those people are going to leave anyway, so it doesn't matter if you give them a coupon or not, and it's silly to even do it. And there are users where there's a very high impact from giving coupons, so we should give coupons to those people, even though they already look like they have a high probability of retaining; it's the marginal impact that matters. So this is a model that's actionable: it translates the business problem directly into a course of action that can be executed quite readily by an ops team. Whereas if I handed them a set of churn prediction scores, they would just take the top scorers, and we would have to hope that, by coincidence, those are also the people with the highest incrementality.
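One minimal sketch of such a heterogeneous treatment effect estimate is the "two-model" (T-learner) approach: fit E[Y | S] separately for the coupon and no-coupon groups, then score each state by the difference, i.e. the incrementality. Here the "models" are just group means over a toy log; all names and numbers are invented for illustration, and a real system would put any supervised learner in place of the averages.

```python
# T-learner sketch: tau-hat(s) = E-hat[Y | s, coupon] - E-hat[Y | s, no_coupon]
from collections import defaultdict

def group_mean(rows):
    sums, counts = defaultdict(float), defaultdict(int)
    for s, y in rows:
        sums[s] += y
        counts[s] += 1
    return {s: sums[s] / counts[s] for s in sums}

def uplift(data):
    """data: list of (state, action, outcome); returns state -> tau-hat."""
    treated = group_mean([(s, y) for s, a, y in data if a == "coupon"])
    control = group_mean([(s, y) for s, a, y in data if a == "no_coupon"])
    return {s: treated[s] - control[s] for s in treated if s in control}

data = [
    ("casual", "coupon", 1), ("casual", "coupon", 1), ("casual", "no_coupon", 0),
    ("loyal", "coupon", 1), ("loyal", "no_coupon", 1),  # retains either way
]
tau = uplift(data)
print(tau["casual"])  # 1.0 -> high incrementality, worth a coupon
print(tau["loyal"])   # 0.0 -> the coupon is wasted on them
```

Ranking users by tau-hat, rather than by raw churn score, is exactly the "who to give coupons to" decision described above.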
Okay, story number two: forecasting. I have some history doing forecasting in my past job, and now at Lyft my team works on forecasting. Working at Lyft has completely changed the way I think about forecasting, and you'll see that baked in here. It's because we have control over the forecasts, and it's very similar to churn prediction: forecasts are just status-quo forecasts, and there are other possible counterfactuals. We have this normal forecasting model; it's our same conditional expectation function, we've just changed the variable definitions. S is what we have observed in the historical data, and Y is a set of observations that we will observe in the future. We fit models like this all the time, we evaluate their forecasting performance, and we try to build ones that forecast things really well. But there are a couple of important things missing from this model. One is the effects of stuff that we did in the past: we committed to courses of action that might have impacts today, and accounting for the effects of those past actions could improve the model. By not including the things we have historically done as actions, and trying to anticipate what their effects will be in the future, we have built a broken forecasting model. The second thing is stuff that we can plan to do to correct the forecast. If the forecast says things are going to look really bad, we'll clearly do something about it, and then the forecast will be broken as well, because it won't include the effects of the things we chose to do in order to correct the poor forecast and fix what we anticipate happening. So we have two kinds of actions that we'd like to incorporate, and we need a causal model to do this.
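As a toy sketch of folding a past action into a forecast: suppose a price cut's effect on demand phases in over a couple of weeks, because riders take time to notice. The linear ramp shape, the 10% lift, and all the numbers here are invented for illustration; they are not anything Lyft actually uses.

```python
# A baseline demand forecast plus the estimated, slowly-ramping effect of a
# past price change. The effect phases in linearly over `ramp_weeks`, then
# holds at the full `lift`.

def forecast_demand(baseline, price_cut_week, lift=0.10, ramp_weeks=2, horizon=6):
    """Return a weekly demand forecast conditioning on the past price cut."""
    out = []
    for week in range(horizon):
        since_cut = week - price_cut_week
        if since_cut < 0:
            effect = 0.0  # cut hasn't happened yet
        else:
            effect = lift * min(1.0, (since_cut + 1) / ramp_weeks)
        out.append(baseline * (1 + effect))
    return out

f = forecast_demand(baseline=1000.0, price_cut_week=2)
print(f[1])  # 1000.0 -- before the cut, no effect
print(f[2])  # 1050.0 -- half the lift in the week of the cut
print(f[4])  # 1100.0 -- full 10% lift once riders have responded
```

A status-quo forecast that ignores the price-cut term would miss the demand creep entirely, which is the "broken forecasting model" point above.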
So here's a good example of the story I just told. Let's say at Lyft we lowered prices, and that happened in the past, but people take some time to respond to price changes, because not everybody uses Lyft every day; maybe it takes them two weeks before their first ride, when they notice that Lyft prices are now lower. So we lower prices, and then demand slowly creeps up, and this relationship is something we can estimate from historical data: we have made price changes in the past, and it's not impossible to incorporate this information about the price index into our forecast. Now, demand going up means we might not have enough drivers to meet the demand. We'd have way more requests because prices are low, which would be really bad: people would be waiting a really long time for rides. So we'd like to engineer supply conditions that match the demand conditions in some way, and we have a tool to do that, called driver incentives: we send offers to drivers to encourage them to drive more during times when we have peak demand. And we have to make a plan for how we're going to set those driver incentives in advance, because drivers need to know more than just "right now" whether they need to be on the road to help us meet all the demand we have. So both of these causes need to be in the model in order for us to make effective decisions: we need to know that demand is going to go up because we lowered prices, and we need to know what we can do about it in order to fix it. Models that tell you what will happen are great, but they don't tell you what to do; there's still a missing step, you still have to figure out what you're going to do about it, and that's a whole other step you have delegated to someone else. I would say you've taken a very narrow view of your job if you forecast things and then just leave it to somebody else to use the forecast and figure out what to do about it. And I hate to use a Drake meme in 2021; it's just toying with this state-action notation, and in 2021 I don't expect you to be laughing much about it.

Okay, so we had this causal model. Now the last piece, which completes the diagram of all the stuff you need to be an effective data scientist, is the argmax operator. Argmax just says: take the best action given the state; find me the best possible thing I could do in this situation. Just like the conditional expectation operator, this is a beautiful abstraction. It's hard to maximize things, but assume that we can; we're all smart, we know how to run L-BFGS or SGD or whatever. So we have a model already, we can argmax it, we can find the best plan, and this optimal planning problem is baked into the model from the start: it is designed to admit solutions to the optimal planning problem. This is what we do at Lyft: we have goals that we'd like to achieve, we are making forecasts that condition on information about what we will do, and the results of those actions change depending on state. So we can take this and build an optimal planning procedure directly from the model. This is casting the forecasting problem as a planning problem, which is exactly what forecasting was intended to do, if you think about it. Why are we building a forecast? Because we think something's going to happen in the future that we would like to respond to today, and we want to use that information to take action: prevent the bad thing from happening, or make the good thing happen.
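The argmax step can be sketched in a few lines: given a fitted E-hat[Y | s, a], here a plain lookup table with invented values, and per-action costs, planning is just picking, for each state, the action with the best predicted outcome net of cost. Everything below is illustrative, not a real incentive system.

```python
# Planning as argmax over a small discrete action set, given a fitted model.

def best_action(q, state, actions, cost):
    """argmax_a (E-hat[Y | state, a] - cost[a])."""
    return max(actions, key=lambda a: q[(state, a)] - cost[a])

q_hat = {  # illustrative predicted outcomes (e.g. retention probabilities)
    ("high_value", "incentive"): 0.90, ("high_value", "nothing"): 0.60,
    ("low_value", "incentive"): 0.35, ("low_value", "nothing"): 0.30,
}
cost = {"incentive": 0.10, "nothing": 0.0}

print(best_action(q_hat, "high_value", ["incentive", "nothing"], cost))  # incentive
print(best_action(q_hat, "low_value", ["incentive", "nothing"], cost))   # nothing
```

With continuous actions, `max` over a list becomes a numerical optimizer (the L-BFGS or SGD mentioned above), but the abstraction is the same.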
to like make take that action and prevent the bad thing from happening or make make the good thing happen okay so that is like act one act two act three is ranking and recommendations so so we've done turn prediction we've done forecasting and now it's sort of like this is like a you know content recommendation problem so picture you're like netflix and you need to figure out what to recommend people or or your google and you're going to put some search results at the top of the page or you know pinterest is going to list pins this is just a ubiquitous data science problem um and so like it's also a conditional expectation function right so you're you're sort of like even a recommender system fits into this framework there's just sort of like things about the user and things about the item and then we you know user clicking on the item is a pretty bad label here but we might say like user like rates the item highly or likes it or you know thinks you know watches the whole video on netflix or something like that would be i don't i don't know what outcomes they use but it sounds like you'd want to you want to sort of capture something about like that that the the recommendation was valuable to the user as a label um and so this this seems like a perfectly plausible model to fit and it seems like it would be very useful and in practice people do ignore the causal aspect of this model quite a bit sort of like that this this data this seems like a like a reasonable approach to solving this problem but everybody who works on recommender systems knows about position bias um which is that like we oh sorry let me get to the position bias in the next slide i just want to talk about just briefly like why we're worried about overlap and i'm going to get to overlap and a couple more slides for state action space so overlap in the s distribution is something that we have to think about as machine learners so and this this is related to like the cold start problem and recommender 
systems: if there are s's that we have never observed in our training sample and we make a prediction for them, it's going to be a bad prediction, because the model is forced to extrapolate to a new part of the s space. Knowing the distribution of your training data, your testing data, and your prediction data, and how they differ, is a really common problem for data scientists to encounter, and we think about it all the time even when we're working on non-causal models. So we're going to have to think about overlap in state-action space. Let's go ahead and add the action to the model. If you've worked on recommender systems before, you know about position bias: changing the position of an item changes the outcome. Where you display it to users affects whether they attend to it, and whether they're likely to complete the steps needed to actually get the benefits of the recommendation. If we rank something really lowly, even if it's a good match, the outcome might be quite poor. So we have a role in generating the data that we don't really think about. One solution is to just gather the position data while you're training the model and try to create some adjustment for it, but it actually is a causal problem: we have caused something to happen by putting recommendations in certain positions, and the choice of position is an action. We get to choose a whole slate of positions, so really the action is a combinatorial assignment of all items to positions, but we'll abstract that for now and just say the action is whether the item was displayed to the user, or the position it was displayed in.
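This isn't from the talk, but the position-bias adjustment he alludes to can be sketched in a few lines. All the numbers here (examination probabilities per position, item relevance rates) are invented for illustration, and in practice the examination probabilities would themselves have to be estimated rather than known:

```python
import random
random.seed(0)

# Hypothetical setup: item A is actually better (relevance 0.3 vs 0.2), but
# the historical policy always showed item B in position 1, where the user
# is twice as likely to examine the slot at all.
EXAMINE = {1: 1.0, 2: 0.5}   # position -> P(user examines the slot)
REL = {"A": 0.3, "B": 0.2}   # item -> P(click | examined)

logs = []
for _ in range(100_000):
    for item, pos in {"B": 1, "A": 2}.items():   # status-quo ranking
        clicked = random.random() < EXAMINE[pos] * REL[item]
        logs.append((item, pos, clicked))

def ctr(item):
    """Naive click-through rate, ignoring position."""
    clicks = [c for i, p, c in logs if i == item]
    return sum(clicks) / len(clicks)

def adj_ctr(item):
    """Position-adjusted CTR: divide out the (assumed-known) examination
    probability for the position each impression was logged in."""
    rows = [(c, EXAMINE[p]) for i, p, c in logs if i == item]
    return sum(c / e for c, e in rows) / len(rows)

print(ctr("A") < ctr("B"))          # True: naive CTR ranks the items wrong
print(adj_ctr("A") > adj_ctr("B"))  # True: the adjustment recovers A > B
```

The point of the sketch is just that the naive conditional expectation conflates relevance with where the logging policy happened to place things, which is exactly the confounding he's describing.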
Okay, so if we're building a ranking or recommender system, how are we going to actually run that model in production? We're going to score all the items. The score is just a function we've estimated from data, but it's still historical, so it counts as state. Then the action a is how high up in the ranking the item goes, or its probability of being seen by the user, something like that; it's an action we take. If we built a ranking system with no exploration, this is what our state-action diagram would look like: all the high scores get high positions and all the low scores get low positions. And what's the problem with this? The low-scoring items have never been observed in high positions, and the highest-scoring items have never been observed in low positions. So we have a missing data problem: we have never observed that action conditional on that state, and that means we can't know what would have happened. If we fit a model and try to plug in different positions for an item, the variance of those estimates is infinite, since we're drawing on zero historical observations, even though most machine learning models would just spit out a score anyway. It's probably a bad idea to train a causal model on data that doesn't actually have any variation. So how do we solve that problem? We add exploration to our policy, and exploration always looks like more mass in the state-action diagram. Widening this line into a band is local exploration; this is score-position perturbation, where we
add random noise to the score and then re-rank things, and that generates some subtle changes in the ranking, and now we get counterfactual data: state-action pairs we wouldn't have gotten under the status-quo policy. A very clear implication is that this bears a cost. We are purposefully degrading the quality of our system so we can build a better model that can hopefully yield a better policy in the future; we're paying a cost now for a model that will hopefully benefit us later. But you can see what this is doing: exploration is turning our model into something that can support a do-operator. The do-operator means we can arbitrarily set the value of a, at least within some bounds, and if we want a model where we can plug in do(a), then we need to actually have done that thing in the past, in historical data. A pure exploration regime is sort of madness, because every item has a positive probability of being in any position, which would probably look like a really bad recommender, but it would be really easy to model causally, because you'd have positive support for any action given any state. So what does overlap between two distributions look like in state-action space? Now we have to use little rectangles, and I hate these diagrams; they look like the flag of a country I can't name. The way to think about it is that we have a data collection policy, which is the data we've gathered to produce our model, and a target policy, which is the policy we'd like to be able to estimate outcomes for. This task is often called off-policy evaluation. If we've gathered data under the yellow policy, we have no ability to estimate what would happen under the blue policy.
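A minimal sketch of the score perturbation he describes, with hypothetical item names, scores, and noise scale:

```python
import random
random.seed(1)

scores = {"a": 2.0, "b": 1.5, "c": 0.4}   # hypothetical model scores

def rank(scores, noise_sd=0.0):
    """Rank items by score plus Gaussian perturbation.
    noise_sd=0 is the deterministic status-quo policy."""
    noisy = {item: s + random.gauss(0.0, noise_sd) for item, s in scores.items()}
    return sorted(noisy, key=noisy.get, reverse=True)

# Deterministic policy: item "c" is always last, so we never log it
# in any other position -- the missing-data problem from the slide.
assert all(rank(scores) == ["a", "b", "c"] for _ in range(100))

# Local exploration: small noise usually preserves the ranking but
# sometimes swaps neighbors, producing (state, action) pairs we would
# never have observed under the status-quo policy.
orderings = {tuple(rank(scores, noise_sd=0.5)) for _ in range(10_000)}
print(len(orderings) > 1)   # True: counterfactual rankings now appear
```

The noise scale is the knob that trades off how much quality you degrade today against how much of the state-action space you get to observe.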
We've simply never observed those actions in those states. It's very similar to the yes-coupon/no-coupon version of the world. Taking more actions always yields a better dataset, because expanding the set of actions you take lets the data answer more questions about more actions you could have taken. We can produce cases of partial overlap, where we can say something but there are some actions we can't say anything about; there are models like this where we're able to extrapolate a little bit but not all the way. And then in a full overlap situation, we've gathered data in an expansive way, and any target policy that's contained within the set of positive-probability state-action pairs is something we can get an unbiased estimate for. This is causal inference on easy mode, but it is still an important causal inference problem: even with full overlap, we need to figure out how to reweight the data we have in order to mimic the target distribution we want. When I talk about causal inference here, I'm talking about fully observable confounding, so it's a solvable problem, and it's solvable through normal causal inference techniques; it just happens to be one where we have really good answers to questions about how much bias a particular technique will introduce, which, if we do the weighting right, will be zero. Okay, so we've done churn prediction, we've done forecasting, we've done ranking and recommendations. I hope I've made the case that they're all causal models, and that if they're made causal models, we've made explicit an assumption that we had been making implicitly.
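The reweighting under full overlap that he's gesturing at is the classic inverse-propensity estimator for off-policy evaluation. A toy sketch, assuming a world with one binary state and one binary action, and assuming the logging propensities are known exactly (all reward means are invented):

```python
import random
random.seed(2)

def reward(s, a):
    means = {(0, 0): 0.1, (0, 1): 0.5, (1, 0): 0.4, (1, 1): 0.2}
    return means[(s, a)] + random.gauss(0, 0.05)

def p_log(a, s):
    """Logging policy: state-dependent but exploratory (full overlap)."""
    return 0.8 if a == s else 0.2

def p_tgt(a, s):
    """Target policy we never ran: always take action 1."""
    return 1.0 if a == 1 else 0.0

logs = []
for _ in range(50_000):
    s = random.randint(0, 1)
    a = s if random.random() < 0.8 else 1 - s
    logs.append((s, a, reward(s, a)))

# Inverse-propensity weighting: reweight each logged reward by how much
# more (or less) likely the target policy was to take that action.
ipw = sum(r * p_tgt(a, s) / p_log(a, s) for s, a, r in logs) / len(logs)

truth = 0.5 * 0.5 + 0.5 * 0.2   # E[reward | always action 1] = 0.35
print(abs(ipw - truth) < 0.02)  # True: unbiased off-policy estimate
```

Note that the estimator only works because every (state, action) pair the target policy needs has positive probability under the logging policy; if p_log were zero anywhere the target policy puts mass, the weight would be undefined, which is the overlap failure from the diagrams.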
Or we've expanded the model to be able to answer more questions, more interesting and actionable questions than it could before. That's a good thing: it means we've been able to put ourselves in the model and make the model do things we naturally need to do as data scientists, rather than just training something and blindly getting a prediction out of it. So let's try to synthesize what we've learned into a more coherent framework. Fundamentally, what I'm proposing is this: on the left-hand side we have the machine learner's perspective, which is "I can just fit a model with state-action pairs and outcomes and use that model, and that's fine." That's the case where I don't need causal inference, because I can just use whatever conditional expectation function I want. On the right-hand side I have causal inference, where I have to estimate this function with the do-operator, which means I need overlap and I need to choose some weighting algorithm to adjust for the difference in sampling between my training data and the target data; I have to be aware of the state-action distributions and correct for the target distribution I have in mind. So when are these two things equal? That's when you don't need causal inference: if they're equal, the machine learner and the causal inference person get the exact same answer, and causal inference was a total waste of time. What we really want to know is when they're not equal, because that's when you need causal inference. You can derive from everything I've talked about the two conditions under which this equality is trivially true. It's true if you have no actions at all, the action set is the null set, so we don't have any actions to
take. That's a world where you can't do anything, but it's also a world where you don't need causal inference, because the two models are trivially the same. If you have no variables under your control, you don't need causal inference; but I would argue that probably never happens. Then on the right-hand side we have the no-confounding condition. If there's no confounding, meaning a is independent of s, they're completely independent variables, then you can just put them in the same model and get a reasonable answer to the question of what will happen if you do a. That's true for some designed data we might create, like experiments, but it's not true if we used the state to select a, which is extremely common. In what world would you take actions and not use some of the state information to choose them? That would be kind of crazy. So here's the diagram of when you need causal inference. On the x-axis I have control, which is the size of the set of actions you can take. We can operationalize control more precisely as the total effects we could generate from the set of available actions: how much power do we have as people operating within the system, do we have the power to make anything happen? On the y-axis I have the amount of confounding, and here I mean fixable confounding, the kind causal inference methods can correct. When we have confounding, we need something that looks like causal inference to correct for the way the sample we've gathered differs from the sample we'd like to estimate, which is that do-operator again.
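To make the confounding condition concrete, here's a toy simulation (all coefficients invented): when a is chosen using s, the naive difference in means is biased, and conditioning on s, i.e. blocking the backdoor path, recovers the true effect:

```python
import random
random.seed(3)

# Hypothetical DGP: state s confounds action a. The true effect of a on y
# is +1.0, but high-s units are more likely to get a=1 AND have higher
# baseline outcomes.
rows = []
for _ in range(100_000):
    s = random.randint(0, 1)
    a = 1 if random.random() < (0.8 if s == 1 else 0.2) else 0
    y = 2.0 * s + 1.0 * a + random.gauss(0, 0.1)
    rows.append((s, a, y))

def mean_y(pred):
    ys = [y for s, a, y in rows if pred(s, a)]
    return sum(ys) / len(ys)

# Naive difference in means: biased upward by the confounder s.
naive = mean_y(lambda s, a: a == 1) - mean_y(lambda s, a: a == 0)

# Condition on s, then average the within-stratum differences over s.
adjusted = sum(
    0.5 * (mean_y(lambda s_, a: s_ == s and a == 1)
           - mean_y(lambda s_, a: s_ == s and a == 0))
    for s in (0, 1)
)

print(round(naive, 1))     # ~2.2: confounded estimate
print(round(adjusted, 1))  # ~1.0: recovers the true effect
```

If the assignment line is replaced with a coin flip, a independent of s, the naive and adjusted answers coincide: that's exactly the "no confounding" corner of his diagram, where the machine learner and the causal inference person agree.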
When control is very low, there's nothing to do, so I'm going to argue that's not really a relevant case. The case with zero confounding is where we just ran a pure randomized experiment; we could argue about whether you should do that or skip straight to contextual policy optimization, or maybe we should just count randomized experiments as causal inference to begin with. Either way, the rest of the diagram is the realm where you need causal inference: if you're going to do anything, and you're going to use the state to choose what you do, then you need causal inference, and I think that's most of the time. All right, that's the end of the "we always need causal inference" argument. Cool, how am I doing on time? All right, section number two: we never need causal inference. This is going to be more challenging for me, because I was previously more aligned with the first position. So first, let's draw the DAG. You haven't seen a single DAG in any of this so far, and that was on purpose. There's this fun meme on Twitter, "draw the DAG," where you tell people to draw DAGs of things that don't make a lot of sense. But it is genuinely useful to take what I was just showing you and draw DAGs of it; DAGs are a great tool and I think everybody should learn them. So let's go and draw DAGs for what we were just doing. The three DAGs on top are very reasonable data-generating processes you might encounter in practice under the model I just described: we condition on two variables, state and action, and we generate some outcome as a result of the action we take. A randomized experiment is where we just
don't condition on state at all: actions are chosen randomly and they affect the outcome. You don't need causal inference for this; it's a boring DAG. It just says we did something, we observed the outcome, and we can take a difference in means to compute effects. In the second one there's no confounding, because a is chosen independently of s, so this is also a quite boring setup: s is just providing variance reduction for the effect of a on y, and maybe you can fit a heterogeneous treatment effects model or something like that, but there's no confounding, so machine learning will get you a reasonable answer here as well. Then on the right we have the case where we observe the confounders and use them to choose a, because we operate the system that chooses a. We do need the do-operator here, but this DAG is also quite boring: what it tells you is that there's confounding but we can fix it. a is confounded by s, but we can condition on s and block the backdoor path, so there's really no identification problem at all. These are all special DAGs where a is a pure parent node, except in the cases where it has a parent that we also know about: we know the treatment assignment mechanism perfectly, and when we know the treatment assignment mechanism, we essentially know the distribution p(a given s), which is this arrow here. We don't need any fancy causal inference tools for these DAGs; they're all solvable with pretty standard statistical techniques. The DAGs themselves are stories, but they're not very interesting stories, and they don't tell us much about what we should do. They actually end at the point where the interesting
part just gets started, which is: how do we analyze the data that was generated under a DAG like this? Now, if we have a confounded DAG: all observational data has the potential that there was a variable you didn't observe that affected which action was taken. This is super common in the social sciences; humans take actions based on information they have that we can't get out of their heads. We just don't know this u, we don't know what the set u is, so we don't know what to condition on to block the backdoor path and debias the estimate of the effect of a on y. So what do we know? If we don't control the treatment assignment mechanism, we don't know how the action was chosen, which means the confounding could be arbitrary, and we can get an arbitrarily wrong answer. Sure, there's lots of work on bounding how wrong we could be, but it is a case where we don't control it, we couldn't have controlled it, we don't know how it was set, and if we try to analyze the data we're going to get a wrong answer. So the question is: when should you use observational data to take actions? I would argue the answer is potentially never, because we should just use experiments, stick to the DAGs on the top row, and solve the kinds of problems we can get good answers to. Now, this is a super strong perspective, and I don't even fully believe it, but let's talk more about what role the DAGs are playing and why we might think about confounding. One reasonable story is that I'm going to use the understanding I get from an observational analysis to suggest an action that I will then test using one of the top DAGs. That's a reasonable argument: I can generate some knowledge that a
human can then use to go design an experiment that might actually give us a well-identified estimate. People who like DAGs love to talk about mechanisms, because that's one of the great things DAGs can help provide answers about. So in this case we might say: hey, a caused y, and we got a really clean estimate of that from the great experiment we ran; but did it cause y via mechanism m1 or m2? That would matter, because I've got this other treatment lying around that operates through m1 or m2; I have actions a1 and a2 and I need to decide which of the two to work on. Think of ads: this was a super effective ad, but did it appeal to people's sense of belonging in our community, or to their need for technical features? If it was the latter, you would want to develop a different ad. So it's super plausible that you'd want to use knowledge of the mechanism to generate a new action from intuition; maybe DAGs are useful at least in that case. There's an interesting story we can tell here, and I highly recommend reading the article; it's super interesting epistemologically, and it's become a parable at this point: the lost cure for scurvy. Scurvy is a disease you get when you're on a ship for a long time without access to vitamin C. We know this now because we have great research and have done a lot of science on it; people who went on long journeys without access to fresh fruits and vegetables got scurvy, and by all accounts it's a horrible disease. And they learned that consuming lemon juice cured scurvy, could basically prevent anybody from ever getting it.
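The mechanism point can be made concrete with a toy mediation model, anticipating the scurvy story: two treatments look alike, but only one moves the mediator that actually drives the outcome. The treatment names and rates here are purely illustrative:

```python
import random
random.seed(4)

# Toy mediation DAG: treatment -> vitamin_c (mediator) -> cured.
# "lemon" raises the mediator; "lime_in_copper" resembles it but doesn't.
def vitamin_c(treatment):
    return {"none": 0.0, "lemon": 1.0, "lime_in_copper": 0.0}[treatment]

def cured(treatment):
    # The outcome depends only on the mediator, not on the treatment label.
    return random.random() < 0.05 + 0.9 * vitamin_c(treatment)

def cure_rate(treatment, n=20_000):
    return sum(cured(treatment) for _ in range(n)) / n

print(cure_rate("lemon") > 0.9)            # True
print(cure_rate("lime_in_copper") < 0.1)   # True: similar-looking
                                           # treatment, missing mechanism
```

A clean experiment on "lemon" tells you it works, but without the mediator you can't predict that a similar-looking new action will fail, which is exactly the trap in the story that follows.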
And it could be operating through vitamin C levels, or maybe delicious citrus fruits cure scurvy through some other mechanism we don't know about. So a bunch of British sailors go on a trip and they take lime juice, because limes are just like lemons, right? And they store it in copper tanks exposed to the air, which destroys all the vitamin C in the juice. So this lime juice tastes like what they were taking before, but it doesn't have vitamin C anymore, it doesn't operate on the same mechanism, and it doesn't cure scurvy. I should have just removed this arrow from the diagram: that treatment doesn't work, you can't use it. So if they had known the mechanism was important, and known that lemon juice had vitamin C and this lime juice didn't, that would have been valuable knowledge; they would have known this thing wasn't going to help them on their trip at all. But I'm going to take the experimentalist perspective here and say: lime juice is just a new action, and you had no positivity when you put people on a ship with lime juice and said "it's going to be good, just try it out," having never tried it before. That's a bad idea: you're trying a new action without any inferential ability to say what will happen. The experimentalist would get this problem right too: if I were testing new cures for scurvy and I had good experimental design knowledge, I wouldn't try to cure scurvy without experimenting first, without some kind of positivity in the part of the distribution where we tried lime juice in
estimating this outcome. And fun fact: vitamin C wasn't isolated until 1932, so we didn't even know what vitamin C was until long after this anecdote, which is from around 1911, roughly twenty years before anyone knew what vitamin C was. So even understanding the mechanism required first identifying and measuring the mediator, and in all of these cases you could still have gotten the right answer without knowing any of it, just by running experiments; the alternative was to wait for all that costly research to find out what vitamin C is. Doing the science also buys humans an understanding of why it worked, but you could have gotten the right answer, something that worked, without doing any of that science and without humans being in the loop in any particular way. Kind of an interesting thought experiment. So the big claim of this section is that those DAGs with the action as a parent node that I control are just experimentation: I get to take interventions myself and choose the value of certain variables in the system. Experimentation is a proven architecture for accumulating useful causal knowledge. It doesn't tell us why things happen, but it does tell us what will happen if we do certain things, which is maybe the most important thing in a practical setting. So if we're looking for useful causal knowledge, I would argue experimentation obviates the need for DAGs, and in many cases the DAG is just a story you tell after the fact; the experiment was the thing that gave you the answer, not the DAG. This is a relatively strong perspective, but if you have the ability to experiment, which I hope you do if you have control over the system
anyway, then adding randomness to what you're already doing is a very natural next step, and then you're in a world where DAGs are not super useful anymore. So that's my bear case for causal inference. But I will say: actions have to come from somewhere. Take the coupons: coupons came from a theory in your head that people like discounts. There was a node in the graph in your head, the cheapness of the product, and you had this idea that cheapness would cause people to be less likely to churn from your product. That's a human intuition, so maybe human design is where actions come from, and that's a very reasonable role for causal inference: you have a DAG in your head, it suggests things you want to apply as treatments, and then you go run an experiment. You're using the DAG and causal inference as a hypothesis generation engine. I can get behind that, and I think it's relatively reasonable; it matters most when the feedback loop is really slow. Why do we need human design? Because humans have good priors: they bring a lot more information to the table than the model has available. So by all means, humans design experiments. That's what we do: we pick new product directions, we pick designs, we choose whole bundles of things as actions rather than individual factors; the A and the B in an A/B test are actually super well designed, engineered things that we're comparing against one another. That's a slow feedback loop, and sure, there we need human design, and humans need to solve some sort of speculative causal inference problem to do it. But there is a world where machines can design the action space, and that's
very reasonable as well, and humans have a role there too, which is to create a hypothesis space for machines to explore. This can cover many things that don't come to mind easily: a whole app or website can be parameterized into how big the buttons are, what color they are, what position they're in; text or movies can be parameterized in various ways, like "stars Tom Cruise" or "stars someone who looks like Tom Cruise" as factors about a movie. You could design a movie parametrically if you wanted to, and then it's just about how fast you can test it and get feedback about whether that set of parameters corresponds to a successful outcome. Where the feedback loop is fast, I would argue machine exploration of the action space, machines generating new actions, is probably a much more efficient way to do it, because they can generate candidates more effectively. So if you have a well-parameterized space and you can get fast feedback, then we don't need humans to design actions anymore. This is the world where AI can automatically generate actions and automatically test which ones are good, and humans don't really have a great role in it anymore; we can debate whether that's a good thing or not. Okay, that's a good segue, because I just articulated a causal architecture that actually sounds kind of terrifying, right? We've given machines actions they can take, and we're trusting these systems to operate safely. Maybe we haven't learned anything: if we give AI the ability to reason about causality, then it can do things like articulate "this is why my robot uprising failed" and send a terminator back
in time to prevent that from happening, and then there are lots of additional consequences: Kyle Reese comes back and makes sure the Terminator doesn't kill Sarah Connor, and the machines become self-aware because of it, and... all right, this is all nonsense. The actual problem is not a robot uprising; it's this: we are going to apply this maximization operator, and maximizations are kind of terrifying if you think about them, because they will find everything that's wrong in your model. If you maximize something, you'd better be sure you're maximizing the right thing. Optimization technology is probably the scariest piece of this, because it's hill climbing, and you have to have chosen the hill correctly. I've made this argument quite a few times about metrics, about choosing metrics and whether you're optimizing the right thing; I think it's a relatively ill-considered aspect of data science. And it follows naturally from the data science pathway I was showing you in the first couple of slides: you start with "I need to make something good happen," but the very next step is "I need this to not do something bad," because making the good thing happen might have consequences we didn't anticipate. Controlling and managing the risk of models that act in the real world is a really important part of the process. And if you have bad things and good things, both of them are going to happen sometimes, so you're going to need to manage trade-offs. If you live in a world where you're going to use algorithmic decision-making, it strongly implies that managing trade-offs is a first-order concern for you, because maximization needs something to pull it back from over-exploiting, and
then very naturally someone's got to help you: you have to come to a consensus as an organization, or as a society, about where you should be on the trade-off curve. This is the path we're seeing play out, and we're going to see more of it, because we're building systems that can act and do things in the real world. They're going to be based on causal models, they're going to be able to optimize; these are all known, working technologies, and we really need to start thinking carefully about the next couple of boxes. You started the journey at the very beginning of this talk with "should I choose Python or R?", and the end of the journey is "am I going to enact the collapse of society by creating a model, an optimizer, and an action space that can do that kind of thing?" I do think that's the kind of responsibility you should be thinking about, first and the whole way along. Okay, so what have we omitted from the model, and why are safety and responsibility so hard? It's because the model is misspecified. The actions are perfectly well specified: we know those because we control them. The state space could be missing features, which is a well-known problem. But the y variable we have not talked about at all. What is y? Where did we get it? Is it the right thing to be optimizing for, or estimating, in the first place? To be honest, I don't know many people who think really hard about the y's in their models; it's common to use the thing that's conveniently available, and it's also common to use just one thing and forget about trade-offs. So what I would push you toward, if you're thinking about safety and responsibility: it's all about what you put into that y variable.
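One way to see why the contents of y are first-order: once you optimize, a vector of outcomes has to be scalarized somehow, and the weight you choose is what picks your point on the trade-off curve. A toy sketch with invented policies and outcome values:

```python
# Hypothetical policies, each summarized by two outcomes (y1, y2),
# e.g. engagement vs. some measure of user wellbeing.
policies = [
    ("max engagement", 10.0, 2.0),   # great y1, poor y2
    ("balanced",        7.0, 7.0),
    ("max wellbeing",   2.0, 10.0),  # poor y1, great y2
]

def best_policy(w):
    """Maximize w*y1 + (1-w)*y2. The weight w is the organizational
    decision about where on the trade-off curve to operate."""
    return max(policies, key=lambda p: w * p[1] + (1 - w) * p[2])[0]

print(best_policy(1.0))   # "max engagement": only y1 is in the objective
print(best_policy(0.5))   # "balanced": the trade-off is acknowledged
print(best_policy(0.0))   # "max wellbeing"
```

Measuring only one thing in y is equivalent to silently setting w to 1.0, and the maximizer will happily take you to the corner of the frontier.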
And how many things you consider, and whether they represent the needs of, and risks to, your stakeholders. You can picture how this plays out: if we only measure one thing in y, and it's a good thing, we're going to maximize the hell out of it and end up here; if there's a trade-off between y1 and y2, we end up at this point. This curve is the set of states of the world we can operate in; we get to choose from among these possibilities. It's called a production possibilities frontier in economics: we're able to choose a point on this curve, but we have to choose it, and that's an important question for any business or organization or society. Where do we want to operate? How much of this bad thing will we tolerate, and for whom will we tolerate it? Maybe the value of the trade-off changes depending on the state as well. All right, so, to slowly wrap up: I hope I've convinced you that causal inference is useful in a lot of settings, if you define it a certain way, and also that maybe some of it is a little overblown; maybe you don't need all the fancy causal inference machinery all the time, because there are ways to structure your problems, your data collection, and the way you operate so that you make the problem easier. And that's just the standard story that tools are tools: they're useful in certain settings and not in others, so there's nothing surprising about the fact that you sometimes need causal inference and sometimes don't. But it's the consistency part I want you to take away: you shouldn't always use the same modeling technique
By the same token, you shouldn't be dogmatic about applying causal inference strictly all the time when you don't need it, or about never applying it because you don't think you need it. There's a quote I really like that plays with this: "With consistency a great soul has simply nothing to do." You can invert it: having nothing to do and no action space yields a great amount of consistency. It's actually quite easy to have a really consistent career as a data scientist if you just don't do anything, but I would argue it's really hard to have a productive and successful career if you don't think about the consequences of that simple conditional expectation function you're working all the time to make good.

So to conclude, I'm claiming you can do most of data science with three variables. We can make them tensor-valued or very high-dimensional, but three really ought to be enough for most business problems: we observe a state (stuff we know), there's stuff we can do, and there's stuff that happens. That's the fundamental architecture. If we're careful to design the data-generating process correctly, where we explore actions based on states, then we only need very straightforward causal inference to learn everything we need to know; we don't need anything fancy at all. Some of the modeling techniques are fancy and interesting in their own right, but we don't need DAGs or super-complicated identification arguments in this setting; we're capable of engineering a system that can answer questions with relatively routine statistical analysis. DAGs are useful tools, but they're more useful for humans: for telling stories and helping us reason.
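The three-variable claim can be made concrete with a toy sketch, entirely my own invention with made-up states and outcomes: when actions are explored at random given the state, a plain conditional mean of the outcome already has a causal interpretation.

```python
import random
from collections import defaultdict

random.seed(0)

# S: the state we observe, A: the action we take, Y: the outcome we get.
def outcome(s, a):
    # Hypothetical world: action 1 helps in state 1, action 0 in state 0.
    return (1.0 if s == a else 0.0) + random.gauss(0, 0.1)

totals, counts = defaultdict(float), defaultdict(int)
for _ in range(20000):
    s = random.randint(0, 1)        # observe a state
    a = random.randint(0, 1)        # explore: randomize the action given s
    y = outcome(s, a)               # see what happens
    totals[(s, a)] += y
    counts[(s, a)] += 1

# Because A was randomized given S, this plain conditional mean is causal.
cef = {k: totals[k] / counts[k] for k in totals}
policy = {s: max((0, 1), key=lambda a: cef[(s, a)]) for s in (0, 1)}
print(policy)                       # learns to match the action to the state
```

No DAG or identification argument appears anywhere: the hard work happened in the design of the data-generating process.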
DAGs help us determine what things might be happening in a system and let us reason about them, but in this particular regime, which is very flexible and covers a very broad array of data science problems, we can mostly omit them and get away with not thinking about them.

I refused to do anything super technical in this talk, but I'm going to provide all the pointers you need if you want to know more, because every little detail I touched on is a super interesting problem. How do we estimate the conditional expectation function with a causal interpretation? How do we sample actions given states, and how do we do that adaptively? How do we perform the maximization problem? One of the things we've been doing on my team is using differentiable programming to make models that are easy to optimize, and it's been super fun to see how that plays out. That argmax also requires designing metrics, estimating trade-offs, and eliciting preferences from users, so there's a ton of interesting research in there. And I only mentioned variance once in this whole talk: maybe you also care about the variance of that estimate, and whether you're likely to be right. In a risk-management setting that would really be first order, because you care about minimizing some downside risk. This is all very Bayesian, and I didn't talk about uncertainty or even sample size at all, but how wrong you're likely to be in that conditional expectation function is a super interesting problem.
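One of those pointers, the argmax over actions, can be sketched in miniature. This is my own illustration of the differentiable-programming idea, not the team's code: if the fitted outcome model is differentiable in the action, the best action can be found by gradient ascent instead of grid search.

```python
def y_hat(s, a):
    """Toy differentiable outcome model; pretend it was fit to data."""
    best_a = 0.3 + 0.4 * s          # hypothetical learned relationship
    return -(a - best_a) ** 2

def grad_a(s, a, eps=1e-5):
    # Central finite difference stands in for autograd (PyTorch, etc.).
    return (y_hat(s, a + eps) - y_hat(s, a - eps)) / (2 * eps)

def best_action(s, a=0.0, lr=0.1, steps=200):
    """Choose the action by gradient ascent on the predicted outcome."""
    for _ in range(steps):
        a += lr * grad_a(s, a)
    return a

print(round(best_action(0.0), 3), round(best_action(1.0), 3))  # 0.3 0.7
```

In practice the model would come from a framework with real autograd; the point is only that a differentiable surrogate turns "which action?" into an ordinary optimization.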
So these four topics are fundamental to this nascent field of what I'm calling causal engineering. There are plenty of pointers there, and I'm happy to provide more. With that, I'll stop and take some questions.

Thank you very much, that was terrific. I'm going to give you a golf clap, and I hope everyone else in the chat will too. We miss the times when people got a nice big applause from the crowd; in person this would have gotten a huge applause, so please accept my verbal applause as a big crowd cheering for you. We had some questions come in, but feel free to ask more, folks, as I ask the questions and Sean answers. I think that was Ken, but I'm not sure: "Excellent presentation, not a question, but still nice to hear." Thank you, Ken. Excellent presentation, great slides. And speaking of your slides, that VC-focus slide was really great; I already tweeted a photo of it, and I need to tweet an actual still of it.

I just posted the link to the slides in the chat too; you can share it, since I'm not on the Slack.

Yep, I'll share it in Slack, and we'll also be posting the actual slides and the video at nyhackr.org, hopefully in the next few days. There's a very crappily written presentations page up there; you can search for Sean's name and you'll see this talk and his previous talks. And by the way, if anyone wants to redesign it, it's open source on GitHub. Please redesign it; I can't stress that enough. I designed it, you can see my aesthetic, and it just wasn't working. It was built pre-Hugo, pre-blogdown, in straight-up regular Markdown, so I really want someone to redesign that
website. It's available publicly on GitHub, so anybody who wants to help can make a PR. Perfect.

All right, we had a few questions come in. Here's one from early on, and you did touch on it later, but it was a good question: "Am I wrong in thinking that the last slide has similarities to Bayesian optimization, except now we're trying to build an active causal learning algorithm that can determine the optimal sequence of interventions?"

Yes. A lot of this is inspired by things like bandit algorithms, contextual bandits, and Bayesian optimization. One of the things I didn't touch on at all was how often you change P(A given S), the probability of an action given a state. Bandit algorithms change that as the data streams in, which I would call an online learning algorithm, but you can do these learning updates in batch: just wait a month and then change the distribution. So it's a little bit tied in with the bandit architecture, but it doesn't need to be; you can run sequences of experiments that approximate what a bandit does. And Bayesian optimization is, one way to think about it, bandits for continuous variables: bandits tend to focus on discrete action sets, but in many cases you have continuous ones.

If you want to learn more about bandits, go to that aforementioned poorly designed web page and search for Shane Conway and John Myles White, who have both given bandit talks, and Emily Robinson; her talk wasn't strictly about bandits, but it covered them. Those are the three that come to mind.
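The batch-versus-online point Sean makes can be sketched as batched epsilon-greedy; the action names and reward rates here are made up: run a batch under the current randomized policy, re-estimate, then shift probability toward the winner while keeping some exploration alive.

```python
import random

random.seed(1)

ACTIONS = (0, 1)
TRUE_MEAN = {0: 0.2, 1: 0.5}            # hypothetical reward rates

def run_batch(probs, n=2000):
    """Collect one batch under the current randomized policy P(A)."""
    data = []
    for _ in range(n):
        a = 0 if random.random() < probs[0] else 1
        y = 1.0 if random.random() < TRUE_MEAN[a] else 0.0
        data.append((a, y))
    return data

probs = {0: 0.5, 1: 0.5}                # start uniform: pure exploration
for _ in range(3):                      # a few batched policy updates
    data = run_batch(probs)
    means = {a: sum(y for b, y in data if b == a) /
                max(1, sum(1 for b, _ in data if b == a))
             for a in ACTIONS}
    best = max(ACTIONS, key=means.get)
    eps = 0.1                           # keep exploration probability alive
    probs = {a: 1 - eps if a == best else eps for a in ACTIONS}

print(probs)                            # most mass on action 1
```

An online bandit would update `probs` after every observation instead of every batch; the logged randomization is what keeps the later estimates honest either way.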
I particularly like to mention Shane and John because they were members back when Sean was living in New York with us; big fan of both of them, they're pretty awesome.

You mentioned that Judea Pearl might be mad at you for not using the do notation. Can you get more into that? Why is the do notation such an important thing to have?

Judea Pearl is a really smart guy who has done a lot of great work, and I've found it really inspirational to borrow ideas from his work. The do operator is important: it specifies what can be estimated from the data. Fundamentally, it tells you that you are estimating a different thing when you want to set the value of a variable than when you want to condition on it. Setting means the variable can be set to any arbitrary value given S, which is different from whatever value nature gave you for that thing; it's a fundamentally more useful thing to know. I think he takes a pretty strong view that not putting a do operator in from the start means you are doomed to never be able to solve the hardest problems of causal inference. So would he appreciate this talk? He would say you always need my causal inference, because I will show you causal inference problems that you can't solve using anything but what I have designed. And my argument is that there's a simplified set of data science problems where you really don't need all of that machinery; it's much simpler than that. If you want to solve all those hard problems, sure, but I have not personally encountered them as a data scientist. So that's why I think he might be a little mad about this, but I'm a big fan of his. I do get into arguments with him on Twitter, but I think everybody does.
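The setting-versus-conditioning distinction Sean describes can be seen in a tiny simulation of my own: in a confounded world, the conditional contrast E[Y | A] and the interventional contrast E[Y | do(A)] disagree.

```python
import random

random.seed(2)

def draw(a_fixed=None):
    """One unit from a toy confounded world; true effect of A on Y is 1.0."""
    u = 1 if random.random() < 0.5 else 0            # hidden confounder
    if a_fixed is None:
        a = 1 if random.random() < (0.8 if u else 0.2) else 0
    else:
        a = a_fixed                                   # this is do(A = a_fixed)
    y = 1.0 * a + 2.0 * u + random.gauss(0, 0.1)
    return a, y

# Conditioning: E[Y | A=1] - E[Y | A=0] from observational draws.
obs = {0: [], 1: []}
for _ in range(50000):
    a, y = draw()
    obs[a].append(y)
obs_diff = sum(obs[1]) / len(obs[1]) - sum(obs[0]) / len(obs[0])

# Setting: E[Y | do(A=1)] - E[Y | do(A=0)] by actually intervening.
do_means = {a: sum(draw(a)[1] for _ in range(50000)) / 50000 for a in (0, 1)}
do_diff = do_means[1] - do_means[0]

print(round(obs_diff, 2), round(do_diff, 2))  # confounded vs. close to 1.0
```

The observational contrast is inflated by the confounder; the interventional one recovers the true effect, which is exactly what randomizing your own actions buys you.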
And he's open and engaging about it too. We would love to have him speak here sometime; we'll try to find some way to get him involved. Speaking of Judea Pearl, I'm going to give you a leading one: what are some of the best books you've seen for learning causal inference?

That's a really good question. There are some good ones focused on the scientific application of causal inference. Causal Inference: The Mixtape by Scott Cunningham is a more recent one, Mostly Harmless Econometrics is quite a good one for economists, and there's Pearl's Book of Why; they all give you very different perspectives on it. In a lot of ways this talk is a response to the fact that I have not seen good materials about how to apply these techniques in practice. They often teach you little parables, like smoking causes cancer, so you learn these little stories, but they don't tell you how you're going to use this in your job and get something done. I think we're missing a bit of effort there, and the intersection of causal inference and machine learning is just starting to take shape; some people would argue it's been taking place for a long time, but I would argue practical tools are just starting to emerge. So I can't point to any one specific resource, but there are some really good researchers at the forefront of this. One of my favorites is Nathan Kallus, a researcher at Cornell Tech, who is doing really good work. Some of the keywords you might look for: off-policy evaluation is really important, contextual bandit optimization is fundamental to this, and so is Bayesian optimization. There's also Eytan Bakshy's group at Facebook; he's a
good friend and they do awesome research. You can go to ax.dev and find out a lot about adaptive experimentation, which is really an engineering approach to what I just described.

And Sean is just learning that he'll be giving a code how-to at a future meetup in a few months; we'll choose his favorite language. Folks, if you need those resources: you mentioned Causal Inference: The Mixtape, and there's a recording of this, plus The Book of Why, ax.dev and the Facebook group, and you might have mentioned one more. Sean, could you send us a list afterwards?

Sure, I'll write it down.

Cool, thank you, because I know that will be very popular afterwards. Let me see, here's sort of a follow-up question; you sort of already answered it, but I'll give it to you anyway: is there an accessible reference paper or open-source code for applying causal inference to a forecasting problem specifically?

Yeah, that's a much harder question. Causal ideas in forecasting actually have a historical precedent that goes back quite a long way, to macroeconomic forecasters. I used to work at the Federal Reserve Board, and there a first-order question is: what should we set the interest rate to, to make sure the economy grows fast but not too fast? So putting themselves in the model came very early in macroeconomic forecasting, especially for central banks. I don't have any papers at hand that I can point to, and in fact, while there are a lot of great ideas to borrow there, I don't know how practical they are, because macroeconomic forecasters have to worry about the people in the model optimizing in response to their decisions. It's a game-theoretic problem: you have to solve for an equilibrium.
So it's a little different in nature from settings where we can assume people aren't smart enough to optimize in response to what you're going to do. We have coded up most of our stuff in PyTorch and built forecasting models that have causal components. One of the key concepts there is what we call causal convolution. A convolutional neural network is really great for two-dimensional images, moving a little window across the image, and you also have 1D convolution, which is a sliding window across an input wave. You can think of your policies as input waves and the convolutions as the effects of those waves; that's an idea we've used and gotten a lot of leverage out of. I don't have any public resources; we built a bunch of stuff, but it's all in-house, and I hope we get to write some blog posts about it someday soon.

So it would go beyond putting an exogenous variable into something like Prophet, for instance?

Yeah, Prophet cannot do this.

If only someone could fix that.

Well, I think it would have to be something new; I've said Prophet is close to feature-complete. Maybe there'll be a new package someday that can do that.

You've got to come up with a Prophet in another language, some other language. All right, one more: how do you distinguish between confounders and prognostic variables?

That's a good question. Prognostic variables are predictive of the outcome, so they can be used mostly for what's called variance reduction: you're explaining more of the reason for the outcome from the S's, which makes the variation due to the actions you take easier to see. Confounders are variables that also have an effect on the action itself.
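Those two roles can be illustrated with a small simulation, all numbers invented: a prognostic variable never threatens validity, while an unadjusted confounder biases the naive comparison, and a logged propensity lets inverse-propensity weighting remove that bias.

```python
import random

random.seed(5)

def simulate(n=40000):
    """p is prognostic (affects Y only); c confounds (affects A and Y)."""
    rows = []
    for _ in range(n):
        p = random.random()
        c = random.random()
        a = 1 if random.random() < 0.2 + 0.6 * c else 0   # logged propensity
        y = 1.0 * a + 2.0 * p + 2.0 * c + random.gauss(0, 0.1)
        rows.append((p, c, a, y))
    return rows

rows = simulate()

y1 = [y for p, c, a, y in rows if a == 1]
y0 = [y for p, c, a, y in rows if a == 0]
naive = sum(y1) / len(y1) - sum(y0) / len(y0)
print(round(naive, 2))   # biased upward: treated units have higher c

# Because the propensity P(A=1 | c) was logged, inverse-propensity
# weighting recovers the true effect of 1.0. The prognostic p never
# threatened validity; adjusting for it would only reduce variance.
ipw = sum(y * (a / (0.2 + 0.6 * c) - (1 - a) / (1 - (0.2 + 0.6 * c)))
          for p, c, a, y in rows) / len(rows)
print(round(ipw, 2))     # close to 1.0
```

The correction is only possible because the probability used to choose the action was recorded at decision time, which is the exposure-logging point below.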
And the reality is that a lot of variables are both: if you're using them to choose what to do and they have some influence on the outcome, they're probably a little bit of each. Now, knowing whether something is a confounder is knowing your business. This is a strong perspective, but if you're taking an action in your business and you don't know what contributed to it, and you're not logging it, then you're making a fundamental mistake that's going to prevent you from learning in the future. People call this exposure logging: building a good logging system that captures the why behind the actions you took, and if there's randomness in it, which there should be, you need to have logged that as well. So the answer is: whether a variable is prognostic is estimable from the data, while whether it's a confounder is something you have to know from domain knowledge, and have either engineered or assumed.

Great. And sort of about knowing things: a lot of job postings right now focus on languages, packages, and ML models. Do you think there will be more of a push toward causal inference skills in the near future, or do you think it's going to be more about just implementing?

That's another good question. On the one hand, we're really lacking people with causal inference skills in industry, and it's very hard to hire for; finding people with prior experience is challenging. So I would encourage everybody to learn about it and get experience, because it will make you more marketable. I would like to hire you if you're really good; I know a lot about causal inference, so I hope people are willing to take that on faith. On the other hand, I do think some of this stuff is going to move into the realm of engineering in the future, just like
the same arc as a lot of machine learning systems: do we really need humans building models? Eventually hyperparameter tuning is automated, feature engineering is automated, and I think some of the causal inference work will tend toward engineered systems in the future, so the data science role there will shrink a bit. But we're much earlier in that cycle for causal inference. There's still a lot of human domain knowledge needed to solve these problems in practice, and it's much less automatable, because humans have to supply the assumptions that make these things work, and the modeling architectures are just more complicated. When you're jointly modeling actions and states, you need to treat the actions differently. I abstracted this away completely in the talk, but actions are fundamentally different kinds of variables and need to be fit differently from standard features in a machine learning model; it's very inefficient to fit them as just another kind of feature. There's a large literature on that, and knowing how to do it is very useful, because you get much more precise information about what's likely to happen under certain kinds of actions. So I think it's a really deep area to learn, very useful, and probably going to be a great job skill at a lot of places. Marketplace companies in particular are very concerned with causal inference, because they care a lot about what they can do to make the marketplace function more efficiently. It's not going away, and there are going to be a lot of jobs, I'm pretty sure.

And speaking of jobs, someone wrote in a little bit ago: how did you personally decide between academia and industry?

Oh, I didn't sleep for about three months of my life and was a miserable wreck. No, seriously:
I had a really good experience interning at Facebook when I was in grad school. I was in New York at the time, and I came out to California, and I felt like in industry I was learning very quickly, because I was exposed to this constant stream of problems. Being an internal consultant at Facebook, working on whatever came up, was a bit of a preview of the future: you were seeing the kinds of problems companies were going to have well before academic researchers were thinking about them. So I always felt I was learning much faster there, and I would choose a job where I learn quickly over one where I don't; I had kind of tapped out on what I could learn from reading papers and sitting in seminars. I still miss my academic life, and I do miss sitting down and writing papers; it's pretty tough to get that anywhere else. If you really like open-ended research problems and a lot of space and bandwidth to work on them, that's pretty hard to replicate in industry, so it's not a free lunch. But it was a tough decision, and there's a world where I'm a professor struggling to get tenure right now that I'm a little bit terrified of.

So no looking back with regrets?

No, no.

That's a question I've asked you before in the past; I'm glad someone else asked you again. And calling back to something you mentioned when you worked at the Fed: does causal inference have any difficulty incorporating deliberately adversarial behavior into the model? How do you account for that?

That's a really good question. Typically we're in a world where we don't think about that too much. But okay, here's a good example: Lyft coupons.
We give out coupons, and we use a model to do that, and you might think: I could figure out what the algorithm is and do the kinds of behaviors that cause it to give me coupons. We should be worried about that; that's gaming the system so that you produce the kinds of features that get you benefits. There is some literature on that; the name is escaping me right now, but there are people who work on that kind of problem. I think we have relatively simple fixes for it, like rules-based systems that help make sure people can't exploit things. But I do think there's probably a bunch of open-ended research questions there about how to make what you're doing robust when the user gets to set S in your model: when they have some control over it, you no longer have full control over your system. That might be a reasonable way to think about it. Maybe there are ways to provably fix that problem, but I doubt it; I bet it just adds a lot of waste.

And as you just mentioned, we have a question that was actually asked a while ago: how do you know that your experiment is a good match for the S values that you observe?

So the diagrams I showed, the state-action diagrams, are just theories: they say where you could have acted, but they're not densities; they don't show where you actually collected data. One of the interesting things you can do with that diagram is assess balance. Balance asks: how similar is the distribution under this action to the distribution under that action? You could frame it the other way, as the balance between actions given the state, but really what you care about is the balance in the state given the action.
So we invert that probability distribution and check that those two distributions are similar, and if they're not similar enough, it means something failed in your randomization; people sometimes call those randomization checks. I know a lot of researchers are really interested in producing automated versions of these checks: maybe there are ways to verify that your causal inference system is likely to work just by checking that you gathered the data that could answer the question. Positivity, which I showed there, is actually an empirical quantity you can estimate, so you can ask: could my model actually make a prediction about this? Was there a positive probability of ever observing this action in this state in the past? If it's zero, your model is still going to return an answer, because almost all machine learning models will. The exception is the very honest k-nearest neighbors; sorry, k-nearest neighbors, not k-means, is the most honest model, because if the nearest cluster is really far away it will tell you there's no data nearby. Most models won't be honest; they'll just return an answer. So distribution overlap, and whether the data you collected are capable of answering the questions you have, is a really fundamental thing, and it's mostly checkable; it all has to do with common support and density, which are things you can estimate.

I know in the past you've used Stan and built some stuff with it, particularly Prophet. If you're doing Bayesian stuff today, are you using Stan, or PyTorch? PyTorch has been used for MCMC, sort of.

Yeah, so I had omitted variance from this whole talk. I love Bayesian models; I think it's a really powerful methodology and you can get a lot out of it.
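As an aside, the positivity check described a moment ago is directly computable from an exposure log; the records here are invented for illustration: estimate P(a | s) from counts and flag any state-action cell your model would have to extrapolate for.

```python
from collections import Counter

# Hypothetical exposure log of (state, action) pairs.
log = [("mobile", "promo")] * 40 + [("mobile", "none")] * 60 + \
      [("desktop", "none")] * 100          # desktop never saw the promo

counts = Counter(log)
states = {s for s, _ in log}
actions = {a for _, a in log}
state_totals = Counter(s for s, _ in log)

def positivity_report():
    """Return P_hat(a | s) and the cells with zero empirical support."""
    probs, violations = {}, []
    for s in states:
        for a in actions:
            p = counts[(s, a)] / state_totals[s]
            probs[(s, a)] = p
            if p == 0.0:
                violations.append((s, a))   # no data could answer do(a) here
    return probs, violations

probs, violations = positivity_report()
print(violations)  # [('desktop', 'promo')]
```

A model asked to predict the effect of the promo on desktop would still return a number; this check is what tells you not to believe it.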
It's also really slow; everybody knows that, so you spend a lot of time waiting, and there's a feedback loop for researchers: you want to try a new model change, so you have to move some things around. Bayesian models are also very fragile, in the sense that if you specify things incorrectly they'll find that very quickly and they'll be wrong, but they'll take a long time to tell you that they were. We just decided early on not to pay that cost, to move a little faster and not deal with uncertainty in a principled way. If I could go back in time, I'm not sure I would do that again; there are big trade-offs there. But the hope is that the neural network frameworks like PyTorch start bolting on better inferential capability over time, so you get it for free and don't have to worry about it so much. There are also frequentist approaches to this kind of uncertainty estimation. I had a bullet on my slides about conformal inference, which is a very powerful tool: you don't need a full posterior, and you can still get prediction sets for the predictions you're making. So there are other tools, but Bayesian models are just slow and hard to scale, and we've had some challenges using them in practice.

Makes sense. I do know that a lot of these neural networks run on GPUs, and Stan recently gained the ability to run on the GPU; it's a new feature, and the speedup is dramatic, though probably not as fast as you need.

I'm very excited about that. I think that team is amazing and they do awesome work, and I still think Stan is the only ground-truth estimator: I would always believe a posterior I get from Stan over anything I did myself.
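The conformal inference Sean mentions fits in a few lines; this is a generic split-conformal sketch of my own, not any production system: calibrate residuals on held-out data, then widen point predictions by a residual quantile.

```python
import math
import random

random.seed(4)

def model(x):
    return 2.0 * x                      # pretend this is a fitted predictor

# Calibration residuals on held-out data (toy: noise is standard normal).
calib = [(x, 2.0 * x + random.gauss(0, 1.0))
         for x in (random.random() for _ in range(999))]
scores = sorted(abs(y - model(x)) for x, y in calib)

# Split conformal: for ~90% coverage, take the ceil((n+1)*0.9)-th score.
n = len(scores)
q = scores[min(n - 1, math.ceil((n + 1) * 0.9) - 1)]

def predict_interval(x):
    """Point prediction widened by the calibrated residual quantile."""
    return model(x) - q, model(x) + q

lo, hi = predict_interval(0.5)
print(round(lo, 2), round(hi, 2))       # an interval around 1.0
```

No posterior is computed anywhere: the guarantee comes from exchangeability of the calibration residuals, which is why it is so much cheaper than full Bayes.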
So it's very valuable to see them making progress there, and I do hope that someday it's just as fast as everything else and we don't have to make any compromises. But the world as it is, we have to pick somewhere on the spectrum, and fully Bayesian is the most expensive modeling you can do: both time and effort are very expensive there.

You're right that it's both time and effort; a lot is involved in defining your model and doing the fitting. That's really great. All right, I think that was all of the questions; I've looked around to see if anyone messaged me on any of the screens I have open, and it looks like that's all of them.

Everybody knows where to find me, so I'm happy to answer other questions.

Please tell them where to find you, in case anyone doesn't know: on Twitter, anywhere else?

You can connect with me on Twitter; it's just my name without the period in it. Always happy to chat about this stuff. I think it's exciting because it's a bunch of unsolved problems, and I'm really looking forward to these ideas becoming a more full-fledged part of data science. I'm also really looking forward to people learning the basics of causal inference and getting started on it, because it's a super fun way to think about what you work on and about the world.

So thank you very much, and everyone, please tag him at @seanjtaylor: be in touch, ask him questions, or just say hi and thank you. And I'll say thank you for being here and for all the cool stuff you've built; I look forward to, hopefully soon, seeing you in person. I'm going to give you a golf clap on behalf
of everyone else, and I hope everyone enjoyed this as thoroughly as I did. I see a lot of comments in the comment section saying thank you, so everyone's telling you thank you right now.

Thanks, everybody, I appreciate your time too. Thanks, Jared.

Yep, and everyone, I'll see you next month on August 12th, and in September at the meetup and at the conference in person. Have a good night, everybody.
Info
Channel: Lander Analytics
Views: 1,321
Rating: 5 out of 5
Id: 2dv7NrYExzo
Length: 88min 37sec (5317 seconds)
Published: Thu Jul 15 2021