Talking Bayes to Business: A/B Testing Use Case | Shopify

Captions
Hello everyone. For the next 30 minutes or so I'm going to share with you a use case around A/B testing. I'll talk a little bit about how we started, the questions, how we got to the answers, and maybe generalize a little about how Bayes can help you change the way you talk to your stakeholders. Hopefully by the end of this talk I'll have convinced you that Bayesian statistics can actually give you the answers that you want, and also that it's not as frightening as it might seem at the beginning; at least, I know that's the feedback I sometimes get.

A few words about myself. I have a lot of mathematical training in my background, but as a practicing data scientist of more than 15 years I'm embarrassed to admit I've mostly been practicing frequentist methods, back when I was working for biotech companies, or machine learning techniques. So let's say I'm "Bayes-curious": in the past few years I've kept trying to incorporate Bayesian statistics back into my workflow. "Data scientist"? In my honest opinion the term has been completely overhyped and overloaded with meaning by now; when you say you're a data scientist you can do so many different things. So I like to say I know some mathematics and statistics, and that my production engineer is actually allowing me to put some code into production. That's where I am today (I'm kidding, but only partly). Currently at Shopify I'm focused on forecasting and causality; we do some work around elasticity and optimization, and also problems that have to do with working inside platforms. On a complete tangent, if you're doing NLP (we heard a really interesting talk earlier on recommendations and search), please feel free to hit me up; I'm really interested in that. You can find me in all the usual places: LinkedIn, Twitter, and GitHub.

A little bit about what I'm going to talk about today. I'll start with the motivation: what were the questions we started with? Then the answers we wanted, and are now getting, and how we got to them, mentioning some of the toolkits we looked at and are using today. I'm not going to touch much on the last two points; they're placeholders to tease you into thinking more about these things. One is what you do when you can't really do A/B testing, or when A/B testing is maybe not the right thing for you. The other is a huge topic coming from the world of engineering: problem-forward versus solution-backward. If there's time I'll talk about it; if not, let's discuss it later in office hours.

So let's start. (I'm sorry, the icons look better on my laptop; this is a simplified version.) Meet Nadia. We love Nadia. Nadia is awesome, because every time she has a question about the product, or wants to test a new feature or a new KPI and see what it does to the product or to the users, the first thing she does, of course, is come to you, her beloved data scientist, and talk to you about all the right things: sample sizes, the meaning of the KPIs, the questions we can ask and the answers we can get, what the report is going to look like, and what the decision downstream is. All the right things.
She comes to you first, and starts talking to you about it before anything goes into production, before any code is even written. So we love Nadia. By the way, is anyone here named Nadia? Okay. If you know a Nadia, I'm sure she's awesome. In reality, we don't have full control over all the stages of the process. Maybe you're working with some platform that has testing already hard-coded into it; maybe you don't have exact control over the duration, or the sample size, or the questions. But at least as a data scientist you have this really healthy checkpoint where you can say: okay, given all that we have (not what we want, but what we actually have), we can answer question A and question B, we cannot answer question C, and for some other questions maybe we need to run the test differently. So at least you have this way of steering the conversation, of putting checkpoints and sanity checks on what's going to happen, so you don't end up running a test and then being unable to get any answers from it. This is a pretty realistic scenario: a place to start, and to try to do better from.

In some situations, well, maybe you just started your job as a product data scientist, or you work for the marketing department, and Nadia comes to you with a ton of data from experiments that you didn't run, that were maybe misspecified, or where the KPIs were not designed correctly, or maybe there was no test at all, or no testing done in a proper way, and you need to start wringing out what you can do with it. Bayes can actually help you there too; I'll mention it briefly at the end and point you to some examples and code that might be interesting for you.

But let's talk about reasonable Nadia. When you have a question, when you want to test a new feature, for example, or something else about the product, or a marketing campaign, hopefully you have all of these set up in a satisfactory way. That means having a good understanding of causality: you want to change the feature and you expect user happiness to increase, so you expect revenues to increase, or something to be better. There is a KPI at the end that needs to be impacted, and we want to understand at least the causality of what impacts what; it's an important thing to have. We also need KPIs. We heard some talks today and yesterday; KPIs sometimes sound really easy, right? We just count things. Well, counting things can be really, really hard; you need to go to the other room and talk to the data engineers to hear how hard that can be. But let's say we do have some understanding of that.

The last three are sometimes harder than you think; you might believe you're in a good place, but sometimes you need to think about it a little harder. Volume versus velocity is the classic trade-off. Say I want to ask a question about a really small segment of users, or some feature that's rarely used, where I only get a few clicks every month. If I need to wait six months to accumulate enough data, enough exposure of the control and the treatment, to answer the question, then maybe in six months this is no longer a relevant question. So we need to think about velocity when we design these kinds of things, and also about volume, because we can get nonsignificant results that won't be very interesting if we don't have enough users. It is something to think about.
There's also the flip side, where things change too quickly. Think about examples like ads, clicking on ads or showing ads to users: you sometimes live in a really dynamic environment, and taking three months, or even two weeks, to get to a decision is too slow. By the time you collect the data, build your model or run the test, and update your assumptions, two weeks have passed and things have changed completely; you need something faster. So it is at least something to consider: is A/B testing even relevant here? I'll get to that at the end.

The last point: when I was writing these slides I was thinking about a situation I've encountered before, where you start working for a very small company that already wrote some kind of backend and didn't think at the beginning about integrating with any testing platform or testing service, so running your tests is just technically difficult. That can happen. Maybe you're in a better place than I was in some parts of my career; the platforms today are much better than they used to be and integrate very easily with testing services, so it's not as hard as it used to be. I actually discovered a few months ago that, I think, Google just joined Apple and Mozilla in blocking cross-domain tracking, so it's going to be harder to track users across different domains. Those of you working in the marketing space should be really aware of this, because if you don't have good tracking of users along the entire journey, and you only see parts of it when users move between domains or between different parts of the funnel, you are going to get really skewed results. You need to be really aware of getting the full picture, and getting it correctly.

But let's say, with all of these caveats, we are in a place that we're happy with, and we think we can move forward with the A/B test. So what does Nadia want to know? Nadia wants to know if the new feature is working. That's the basic question; that's what we're here for, right? We want to know if the new feature is better than the old one. We just heard a talk about exactly this: is the new chatbot better than the old one? Is the ad campaign better than doing nothing? Just a little pet peeve of mine: you run a test, you look at the data, and people immediately start talking about significance or p-values or whatever. Just look at the data first. Sometimes all you need is what I love to call the interocular trauma test, meaning the result just hits you between the eyes. If you're there, you're lucky, but I really encourage you, first thing, just to plot the data; sometimes you can simply see things there. I remember working for a biotech company with some researchers, medical doctors, and we had a lot of discussions because we wanted to publish the results in a good journal, and at some point the doctors told us: we can wait with the p-values and the sample sizes and all that, because I can see the results with my eyes; if I need statistics to prove that something's working, we're already in trouble. But life is noisy and complicated, and sometimes you want to measure a really small effect in a really noisy environment.
And this is why we run the test. We run the test, we collect the data, and then we go and meet Nadia again to conclude the whole thing. Nadia wants to know: okay, is it working? Is the new feature better? Is the campaign working? And here starts this little strange song and dance that we as data scientists often do, because, let's be honest, we can't really answer that question directly when we do frequentist A/B testing. What we can say is the following: we saw an increase of 5% in this KPI, or in this conversion rate, with a p-value of 0.05. Okay, this is great, this is really interesting, but it is not the answer to Nadia's question. The problem is that Nadia thinks you answered the question: when people hear "this is statistically significant", they think "okay, this is working", and that is absolutely not the case.

So, and I promise these are the only math equations you're going to see today: what is a p-value? A p-value is the probability of seeing the data conditional on the null hypothesis, that is, conditional on the feature not working. What does that mean in human language? It means: imagine a parallel universe, a galaxy far, far away, where there is no effect, where the new feature is not working. This is not a hypothesis, this is not an idea; it is the absolute reality of things. We're talking philosophy here. And in this parallel universe we run an experiment. There is no effect, we shouldn't see anything, and yet we see the data that we see, and we are surprised. That is what the p-value quantifies: surprise in a universe where there is no effect, where there is no change. And that is not the same as answering the question "is it working?".

What Nadia actually wants is called the posterior distribution: the probability that something is working, or where we think the KPI now lives, having seen the data. This is the answer Nadia wants and needs. In order to calculate this quantity we need to calculate three different things. One is the likelihood, which comes from the model; in the case of A/B testing it's a very simple thing. The second is the probability of the data, which is a little harder to calculate. Luckily the tools we have today are much better at it; one of the reasons Bayesian statistics lagged behind machine learning and frequentist approaches is that this quantity was historically really hard to compute. That's no longer the case; it's getting much better. And the third is the prior. We'll get to the prior shortly.
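To make those three quantities concrete before moving on, here is a minimal sketch (my own illustration, not from the talk) of a conjugate Beta-Binomial A/B test in Python, where the posterior can be written down in closed form and the visitor and conversion numbers are made up:

    import numpy as np

    # Hypothetical data: conversions out of visitors in each variant.
    visitors_a, conversions_a = 1000, 50
    visitors_b, conversions_b = 1000, 65

    # Prior: Beta(1, 1), i.e. uniform over conversion rates.
    alpha_prior, beta_prior = 1.0, 1.0

    # Conjugacy: a binomial rate under a Beta prior has a Beta posterior,
    # i.e. posterior = Beta(alpha + successes, beta + failures).
    post_a = np.random.beta(alpha_prior + conversions_a,
                            beta_prior + visitors_a - conversions_a,
                            size=100_000)
    post_b = np.random.beta(alpha_prior + conversions_b,
                            beta_prior + visitors_b - conversions_b,
                            size=100_000)

    # The question Nadia actually asks: probability that B beats A.
    print("P(B > A) =", (post_b > post_a).mean())

Note that the answer comes out directly as a probability about the effect, not as a statement about a no-effect universe.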
Before getting to the prior, I just wanted to mention what led me to think about Bayesian statistics in the first place. First of all, I just wanted to answer the damn question. I wanted to be able to answer "is it working, yes or no?", or "what's the probability that it's working?". Second, I started realizing (and not just me) that p-values are a source of miscommunication. There is a really cool article someone wrote with really nice examples of huge effects with really high p-values, and no effect with really low p-values; you can get this whole weird mixture in those examples. And if you want a name for it, it's called the replication crisis, which hit social science and biology pretty hard. I also started reading papers by people on growth teams in companies, these teams that come in and try to do a lot of testing and optimization of products or funnels, and they are complaining: okay, I ran a lot of tests, I chose the features I thought were best, and when I put them in production I don't see the results I expected. So I think it just creates a lot of miscommunication. Bayesian statistics is also a good way of thinking about problems (I'll touch on that briefly), and there are really good tools that I think put really good processes in place.

So, the three quantities I promised. Let's talk about the prior, because the prior is maybe the most tricky, maybe the most controversial part of this process. Quoting from Wikipedia, with my own comment in the middle: the prior distribution is the probability distribution that would express one's beliefs (yes, this is a belief!) about the quantity before some evidence is taken into account. Imagine the situation: you have a really good conversion model for the US, and you want to run an experiment, to test something, in France, where you have no information. A prior distribution is me coming to you, before seeing any data from France or Spain, and asking you: where do you think the conversion rate is? This is where the talks yesterday about expert opinion, about integrating subject-matter experts into your team in a good way, come in. Bayesian statistics in a way forces you to do it; the process is built in, because you have to go and ask people what they think is going on before seeing anything. It's part of the process.

Luckily, distributions are flexible things. If, for example, I think the rate is probably around 50% but I'm not sure, then look at the orange line: centered at 50%, but giving a lot of probability mass to other values as well. Maybe I think it's either 0 or 1, but probably not in the middle (people either do the thing or they don't): then look at the red line, where most of the probability sits at either 0 or 1; this is also called a horseshoe prior. Or maybe my expert told me we think it's around 20% but we're not sure, so I'll use that purple line over there to quantify my belief before doing anything.

Choosing a prior is subjective, as I said, but there are really good guidelines. Sometimes you have natural constraints: some KPIs are only positive, so you need to choose a positive distribution. Sometimes, in A/B testing, if you want to be really conservative, you can say the distribution of the effect is centered around zero, and you just need to decide how uncertain you are. You can do a lot of really cool exercises that engage your stakeholders in a really meaningful conversation about what we expect to see, what we think the effect will be, versus what we see after we run the experiment and look at the data. These are really healthy discussions to have with stakeholders about where we think we are and where we think we're going. They also show you, the data scientist, what people think and what the expectations are (you can discover really interesting things sometimes), and they give the stakeholders a better picture of what you're doing. You can do cool things like gamification, the kinds of things you usually do with training sets. If you're lucky, you have some benchmarks to guide you.
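As an illustration (mine, not the speaker's; the actual slide isn't in the transcript, so the specific Beta shapes below are my assumption), the three beliefs described above map naturally onto Beta distributions:

    import numpy as np
    from scipy import stats
    import matplotlib.pyplot as plt

    x = np.linspace(0, 1, 500)

    # "Probably around 50%, but I'm not sure": broad, centered at 0.5.
    plt.plot(x, stats.beta(2, 2).pdf(x), color="orange",
             label="around 50%, uncertain")

    # "Either 0 or 1, probably not the middle": a horseshoe-shaped Beta.
    plt.plot(x, stats.beta(0.5, 0.5).pdf(x), color="red",
             label="0 or 1, not the middle")

    # "My expert says around 20%, but we're not sure": peaked near 0.2.
    plt.plot(x, stats.beta(2, 8).pdf(x), color="purple",
             label="around 20%, uncertain")

    plt.xlabel("conversion rate"); plt.ylabel("prior density"); plt.legend()
    plt.show()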
Sometimes your tool restricts you to some very simple choices, and then you just need to be smart about making those simple choices; sometimes you can only pick some Gaussian form, and you just need to choose which one makes sense. Beyond that, you can go pretty crazy with Bayesian models; you can build really complicated ones, and then choosing priors becomes really difficult. Think about what the prior over a correlation matrix of several different phenomena should be; that gets really tricky. But there are really good guidelines: there's a link to a paper by Michael Betancourt, one of the main contributors to Stan, who wrote a really good paper about this. Recommended reading. Think about it this way: you just got a new job description. You translate business insights into distributions. That's the new role for you, the data scientist.

So let's say we got some prior opinions about where we think the effect should be. In A/B testing it's pretty simple, centered at zero, but we can quantify how uncertain we are about what the effect is going to be. We run the experiment, we collect the data, we calculate the posterior distribution, and then we go sit down with Nadia again. She says: okay, I listened to you talk for hours about Bayesian statistics, you convinced me this is good, we let you build your models and everything; can you please tell me if the new feature is working? And the answer is yes. Look at the posterior distribution there: given everything that we know and everything that we collected, we now think that, if I had to give a single number, the effect, in whatever units I'm measuring, is around two; that's the peak of the distribution, when you have these nice single peaks. But we can do much more, because we can start talking about certainty and uncertainty. We can say: there is 95% probability that the effect is between 0 and 5. That's called a high-density interval (HDI). There's also the ETI, the equal-tailed interval, where you chop two equal tails from both sides; sometimes they're the same, sometimes not. So we can say: this is what we think the effect is, and this is the range we are 95% certain it's in.

This is not the same as a confidence interval; that's not what a confidence interval means, and I hope that's clear now. A confidence interval again talks about that alternate reality, whereas here we say: in this reality, having seen the data, this is where we think the KPI now lives. But we can answer more questions. For example, we talked about running a test, seeing positive results, and putting something in production; maybe you're not really certain about those positive results. What's the risk we're taking here? What's the probability that we're actually going to do damage? We can answer that, because we can look at the probability of the effect being below zero, of the difference actually being negative, and say: look, everything looks great, but there is a 20% chance we're actually doing damage; are you willing to take that chance? Again, a different conversation to have with your stakeholders as a data scientist: not just what we see, but the risks as well. (This is related to what's called a Type S, or sign, error.) The same goes for magnitude: we see two, but what's the probability that the effect is 20, or 200, completely off the scale? We can say how sure we are about that.

A really good thing, especially when you talk about features, is that sometimes there's a budget involved in the conversation that you need to take into account. Sometimes Nadia will tell you: anything below 3% does not justify the changes we need to make in production in order to deploy this; what's the risk we're taking here in terms of not returning the investment? It doesn't have to be zero: for any threshold you want to set, you can state the probability of being above or below that threshold, and then your stakeholders can make really data-driven decisions. This is a really cool feature.
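Continuing the hypothetical example from earlier, here is how those posterior summaries (point estimate, interval, risk of damage, probability of clearing a budget threshold) might be computed from posterior samples; the placeholder samples and the 3% threshold are made-up numbers standing in for Nadia's:

    import numpy as np

    # `lift` would be posterior samples of the relative difference between
    # variants, e.g. lift = (post_b - post_a) / post_a from the earlier sketch.
    rng = np.random.default_rng(0)
    lift = rng.normal(0.05, 0.03, size=100_000)  # placeholder samples

    print("point estimate:", np.median(lift))

    # Equal-tailed 95% interval (ETI): chop 2.5% off each side.
    # (A highest-density interval could be computed with arviz.hdi.)
    print("95% ETI:", np.percentile(lift, [2.5, 97.5]))

    # Risk of doing damage: probability the true effect is negative.
    print("P(effect < 0)  =", (lift < 0).mean())

    # Probability of clearing the (hypothetical) 3% budget threshold.
    print("P(effect > 3%) =", (lift > 0.03).mean())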
I'm not going to go into Bayes factors; there is some argument about them, because they kind of try to go around the prior, but that's maybe for a different talk. So hopefully I've convinced you that this is a really great set of questions to ask and answers to get. How do we get them? There are a lot of tools out there; let me try to put them on a spectrum. First, there are what I call the low-level frameworks. Perhaps my favorite, and one of the most prominent in recent years, is Stan. It's language agnostic, with bindings to Python and R and other languages as well. You do need to learn a new language in a sense, a new way of working, new tools and new ideas, but it's portable, it's cross-platform, and it's super powerful and super flexible. You can really start with a clean slate; you don't have to take on any assumptions, and you can build up a really complicated model. For A/B testing it's maybe a bit of an overkill and you can use some nicer packages, but if you want to go further I highly recommend you at least look into Stan. PyMC3 is a pure Python package, and there are the older ones, BUGS and JAGS and so on, which usually have bindings to R, Python, and other languages.

Somewhere in the middle there are frameworks like bsts, which I like and have used a lot for other use cases; you can do A/B testing with them. These frameworks are actually really flexible and really powerful, but usually designed to solve a specific problem: bsts is designed for time-series problems and causality. They give you a lot of flexibility, but there is a trade-off: you can only choose certain distributions and certain structures, and they carry assumptions behind them.

Last but definitely not least is all the nice free candy you get thanks to the wonderful open-source community: a lot of wrappers around these frameworks. One of the famous ones is Prophet, from Facebook, which made a lot of noise. It's a wrapper around Stan that does time-series forecasting; it's focused more on trends and less on causality, so it's great to use as long as you understand what you're wrapping around, as long as you understand the model. brms and rstanarm may be more useful if you're going for A/B testing; you can actually run the A/B test and calculate things in those packages, and they give you the nice R syntax you know from linear regression. And bsts is actually what's powering Google's CausalImpact, for those of you who have heard of it: a really nice one-liner wrapper that can do really complicated things.
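For a sense of scale, here is what the same Beta-Binomial A/B test might look like written out explicitly in PyMC3. This is a sketch under the same made-up numbers as before; with a conjugate model like this, MCMC is overkill, which is exactly the point about low-level frameworks being more power than this problem needs:

    import pymc3 as pm

    visitors_a, conversions_a = 1000, 50
    visitors_b, conversions_b = 1000, 65

    with pm.Model() as model:
        # Priors: our opinion about the conversion rates before seeing data.
        rate_a = pm.Beta("rate_a", alpha=1, beta=1)
        rate_b = pm.Beta("rate_b", alpha=1, beta=1)

        # Likelihood: the observed conversions for each variant.
        pm.Binomial("obs_a", n=visitors_a, p=rate_a, observed=conversions_a)
        pm.Binomial("obs_b", n=visitors_b, p=rate_b, observed=conversions_b)

        # The quantity Nadia cares about: the difference between variants.
        pm.Deterministic("uplift", rate_b - rate_a)

        trace = pm.sample(2000, tune=1000)

    print("P(B beats A) =", (trace["uplift"] > 0).mean())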
There's also a whole universe of R packages; I just wanted to mention BEST and bayesAB, which are really simple, focused on A/B testing and these kinds of simple questions, and can sometimes help guide you through the process if you're new to this.

Just a few words about things beyond A/B testing, as I promised. A/B testing is wonderful; it's the best way to answer questions, except when it's not. So when isn't it a good way to answer questions? When you're outside what's called in physics the Goldilocks zone: not too hot, not too cold, not too bright, not too dark; there's a really narrow band where life can exist. Too slow, or too fast. When things are too slow, when it takes six months to gather enough information to answer a question, maybe an A/B test is not the right thing to do. With Bayesian statistics you can inject more information through the priors, say "I'm pretty sure this is what's happening", and let the slow trickle of data update your beliefs as you go; you can start from somewhere, and that helps. Also, you can run a lot of A/B tests over a long time, these disjoint tests, and completely miss the fact that there's an underlying phenomenon, because it's happening really slowly: a change in the market, new competition you haven't heard of, a shift in your customer base, something changing underground. When you're able to combine these tests into a single model, you can inject the effect of time into your model across the different A/B tests. This is the idea of pooling, or hierarchical models: there's one trend in the background that affects your tests over time, and you can incorporate that into your models and catch it, whereas if you just run a disjoint set of A/B tests you can stay pretty blind.

Same thing for too fast. Sometimes you need to make a really fast decision, like which ad to show to which customer, and if you need to update your model every week or every two weeks, that's too slow. There's a whole universe of Bayesian tools around that. Those of you who have heard of multi-armed bandits: that's almost the de facto standard today. It's a Bayesian method that helps you make decisions really, really fast, when you can't, or just don't want to, run the cycle of collect data, stop, update the model, stop, deploy a new decision model, stop, collect new data, stop. You need to do things in real time.

And the same pooling idea I mentioned before: if you only have a really small segment that you're interested in, you don't have enough information to make much inference from it alone, but you do want to see something: are they different, and how different are they? Bayesian tools are really great for that, because you can pool. Insurance, for example, is a space I've worked in a lot: you have your portfolio and you want to calculate the risk in order to set the premium, how risky you, the customer, are. Sometimes you only have one Maserati in your portfolio, so how do you calculate the risk for n equals one? There are actually interesting ways of doing that, and I encourage you to investigate. And there are other cases where you simply can't test; it's just not possible, technically, legally, or from a regulatory perspective.
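The talk doesn't show bandit code, so here is my own minimal sketch of Thompson sampling, the classic Bayesian multi-armed bandit strategy: each arm keeps a Beta posterior over its conversion rate, you show the arm whose sampled rate is highest, and the posterior updates with every single observation instead of waiting for a test to finish (the true rates below are made up and unknown to the algorithm):

    import numpy as np

    rng = np.random.default_rng(42)
    true_rates = [0.04, 0.06]          # unknown to the algorithm
    wins = np.ones(2)                  # Beta prior: alpha = 1 per arm
    losses = np.ones(2)                # Beta prior: beta = 1 per arm

    for _ in range(10_000):
        # Sample a plausible conversion rate for each arm from its posterior...
        sampled = rng.beta(wins, losses)
        # ...and show the ad (arm) that currently looks best.
        arm = int(np.argmax(sampled))
        reward = rng.random() < true_rates[arm]
        # Update that arm's posterior immediately; no stop-the-world retraining.
        wins[arm] += reward
        losses[arm] += 1 - reward

    print("posterior means:", wins / (wins + losses))
    print("pulls per arm:", wins + losses - 2)

Over time the algorithm allocates most of the traffic to the better arm on its own, which is exactly the "decide in real time" property described above.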
What do you do then? How do you say something is working, especially when there's a lot of noise? There are really cool concepts from Bayesian statistics here too. This is an example I presented in London two weeks ago, using CausalImpact from Google: we had to release a public campaign, so how can you prove that the campaign had any effect on anything, that what you see is not just noise and seasonality? These models help you break down the noise you see into different components and estimate the effect; the way it works, by the way, is by trying to simulate the control group. It's a really interesting approach, and if you want, come talk to me about it later; I'll be happy to. There's also a link here to a GitHub repository with the YouTube video and some code examples, so feel free to look into that.
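For reference, the CausalImpact workflow described above looks roughly like this. This sketch assumes the Python port (the pycausalimpact package; the original is an R package on top of bsts), and the file name, column layout, and dates are made up for illustration:

    import pandas as pd
    from causalimpact import CausalImpact

    # df has a response column (e.g. daily signups) plus control covariates
    # that were NOT affected by the campaign; the model learns their
    # relationship pre-campaign and simulates a synthetic control afterwards.
    df = pd.read_csv("campaign_data.csv", index_col="date", parse_dates=True)

    pre_period = ["2019-01-01", "2019-05-31"]    # before the campaign
    post_period = ["2019-06-01", "2019-07-31"]   # after the campaign launch

    ci = CausalImpact(df, pre_period, post_period)
    print(ci.summary())   # estimated effect with credible intervals
    ci.plot()             # observed vs. counterfactual ("simulated control")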
One last thing, which I'm not going to touch on much because of time. One of the reasons I'm really excited about going back, I would say, to Bayesian statistics: when you use frequentist tools, and when you use machine learning tools, it's a very tool-oriented approach. It's "what can I solve with neural networks, what can I solve with GBMs?"; a lot of the time you start from the tool. Neural networks are flexible, but there's an interesting trade-off there, especially in terms of explainability; we heard a talk yesterday about how the more complex the model, the less you usually understand the different behaviors and problems that come up from it. And with frequentist tools: okay, I have this really complicated multivariate phenomenon that's changing over time, and I want to infer causality, so I usually try to squeeze it into some model that I know, ARIMA, ARIMAX, maybe some regression model or something else. There's usually a lot of squeezing that needs to happen, and a lot of things you need to chop off along the way, just because you have a package that kind of does what you want, but not exactly. Then you reach a dead end and try another package with a completely different setup and different trade-offs. Not a very flexible set of tools, I would say.

Bayesian tools, on the other hand, I almost want to say are too flexible. It's really easy, when you work with these low-level frameworks, to specify super complicated models, and then the trade-off becomes the time to solve them. I've seen amazing work where people try to understand gas flow in pipes in order to predict problems in infrastructure, and they actually start from thermodynamic equations and build the model up from there; amazing scientific work. Or, in the scientific domain, people looking at the equations of gravity and building a model of the distribution of galaxies. You can go really far out. The problem then becomes the time to estimate the posterior distribution; it becomes computationally heavy. But I really like the fact that you have more room to maneuver, more room to think about your problems the way you believe they are, not the way the tool tells you to think about them. For me it brings a little bit of the science back into data science, which is something I'm really passionate about. So again, there is a trade-off in complexity, but at least you have the flexibility and the ability to choose how complex you want to go. And I'll mention again the paper by Michael Betancourt, which is a really good guideline and way of thinking about complexity as well, not just about priors.

In conclusion: hopefully I've convinced you that the p-value is a really wonderful answer, but to completely the wrong question. Bayesian models can give you the answer that you need, and the answer that you want to give, if you're willing to have an opinion (the prior) and you're willing to change it. Both are maybe not as hard as you might think. Bayesian tools are a really good way to start, if you're willing to put in the investment, on the road to better discussions, better questions, and better understanding with your stakeholders about the problem, and a better understanding by your stakeholders of the answers you're giving them. It's a really good way of opening better communication channels for you, the data scientist, and for the teams you work in.

And finally, a little warning before you go out there and start doing Bayesian statistics, especially if you're using the wrappers and the packages: be aware of what's standing behind them. The packages look really nice; it's almost too easy to run these hyper-complicated models. Just be aware of the assumptions behind them. If you want, I can tell you really good stories about Prophet versus bsts, and how things are not always as they seem. You need to know what's happening underneath the surface, not just run the command and look at the results; be aware of the underlying models before you go ahead. With that said, I really encourage you to try these things when you go back to work tomorrow or on Monday: think about the next test you're going to run, the next question you're going to ask, or the next model you're going to build, and consider going Bayesian this time. This would be a good time for questions. Thank you.

[Audience question] Thanks for your very nice and instructive talk. I'm also at an internet startup, and typically A/B tests are not judged by one single metric but by several. There's a pretty well-established framework that I know of for dealing with multiple hypothesis testing in the frequentist domain; is there an equivalent in the Bayesian framework you describe?

[Answer] It really depends, but the answer is pretty much yes. If you look at multiple KPIs, what you need to say is: okay, I need a better prior, in a sense. You need to say not just where you expect, for example, the conversion rate and the revenues to live after the test; you also need to say something about the correlation between them, which is maybe not as hard as you think (not so easy, but maybe not so hard). And it goes back to what I mentioned about this disciplined workflow: you can start by running without any correlation between them, see what comes out, and then make the model more complicated as you go.
In the frequentist situation you have the problem of multiple comparisons, right? You can still abuse your model when you do Bayesian statistics, but at least multiple comparisons are not exactly a problem in that domain, so you can try different things and complicate the model as much as you like. I think the answer to your question is yes: you just need to come with an opinion about the correlation between your different KPIs, and by the way, there are tools and frameworks out there to do that.
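To make that answer concrete, here is my own sketch (not from the talk) of what "an opinion about the correlation between KPIs" can look like in PyMC3: a joint prior over two effects, with an LKJ prior expressing the belief about their correlation matrix. The daily lift data and all the numbers are made up:

    import numpy as np
    import pymc3 as pm

    # Hypothetical: per-day measured lifts in two KPIs (conversion, revenue),
    # e.g. daily treatment-minus-control differences.
    observed = np.random.default_rng(1).normal(
        [0.02, 0.5], [0.01, 0.3], size=(30, 2))

    with pm.Model() as model:
        # Conservative prior: both effects centered at zero.
        mu = pm.Normal("effects", mu=0.0, sigma=1.0, shape=2)

        # The prior opinion about how the two KPIs move together: an LKJ
        # prior over the correlation matrix (eta=2 mildly favors weak
        # correlations; eta=1 would be uniform over correlation matrices).
        chol, corr, stds = pm.LKJCholeskyCov(
            "chol_cov", n=2, eta=2.0,
            sd_dist=pm.HalfNormal.dist(1.0), compute_corr=True,
        )

        pm.MvNormal("obs", mu=mu, chol=chol, observed=observed)

        trace = pm.sample(2000, tune=1000)

    # A joint question a frequentist correction struggles with:
    print("P(both effects > 0):", (trace["effects"] > 0).all(axis=1).mean())

Starting with an uncorrelated model and adding the LKJ structure later is exactly the "complicate the model as you go" workflow mentioned in the answer.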
Info
Channel: Data Council
Views: 5,035
Rating: 4.9076924 out of 5
Keywords: machine learning, computer vision, AI, big data, technology, engineering, software engineering, software development
Id: J6kqvWnUE2Q
Length: 33min 25sec (2005 seconds)
Published: Wed Oct 16 2019