Cracking A/B Testing Problems in Data Science Interviews | Product Sense | Case Interview

Video Statistics and Information

Captions
Hi, it's Emma, welcome back to my channel. So far the most popular video on my channel is cracking product sense problems, and many people have reached out and told me that this is where they need more help. So in today's video I am going to talk about A/B testing, because A/B testing problems are often asked together with metric questions in data science interviews, and it is hard to provide an insightful and in-depth answer if you don't have much knowledge of A/B testing. In this video I will provide everything you need to know to crack A/B testing problems. Specifically, I will go through six important topics of A/B testing and provide some of the most commonly asked questions and answers. Finally, I will share with you some resources to learn more about the subject. Let's get started.

Since we have lots of things to cover, here's an outline of this video. Feel free to jump to the sections you want to learn more about and skip the ones you are already familiar with. The topics we are going to cover are: what is A/B testing, how long to run an A/B test, the multiple testing problem, novelty and primacy effects, interference between variants, and dealing with interference.

First and foremost, let me briefly explain what A/B testing is. A/B tests, also known as controlled experiments, are used widely in industry to make product launch decisions. In the simplest form there are two variants: control (A) and treatment (B). Typically the control group uses the existing feature while the treatment group uses the new feature. A/B testing allows tech companies to evaluate a feature with a subset of users to infer how it may be received by all users. A/B testing is one of a data scientist's core competencies, so A/B testing questions appear frequently in data science interviews. They are typically asked together with metric questions, and they can touch any component of A/B testing, including developing new hypotheses, designing an A/B test, evaluating test results, and making ship-or-no-ship decisions.

The second topic is about designing an A/B test, specifically how long to run it. This is a commonly asked question during interviews. To decide the duration of the test we need to obtain the sample size, and three parameters are needed to get it. These parameters are the type II error (or the power, because power equals 1 minus the type II error, so if you know one of them you know the other), the significance level, and the minimum detectable effect. The rule of thumb is that the sample size approximately equals 16 multiplied by the sample variance divided by delta squared (n ≈ 16σ²/δ²), where delta is the difference between treatment and control. I know some of you may be interested in learning how we come up with this rule-of-thumb formula, so I have another video that explains it step by step; feel free to check out the link in the description. During the interview you are not required to derive the formula, but you do want to talk about how each parameter influences the sample size. For example, we need more samples if the sample variance is larger, and we need fewer samples if delta is larger. The sample variance can be obtained from the data, but how do we estimate the difference between treatment and control? We actually don't know it before we run the experiment, and this is where we use the third parameter, the minimum detectable effect. It is the smallest difference that would matter in practice. For example, we may consider a 0.1 percent increase in revenue as the minimum detectable effect. In reality, this value is decided by multiple stakeholders.

Once we know the sample size, we can obtain the number of days to run the experiment by dividing the sample size by the number of users in each group per day. If the result is less than 14 days, we would typically still run the test for 14 days to capture the weekly pattern.
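To make the arithmetic concrete, here is a minimal Python sketch of the rule-of-thumb calculation described above; the input numbers are made up for illustration and are not from the video.

```python
import math

def sample_size_per_group(sample_variance: float, delta: float) -> int:
    """Rule of thumb: n ≈ 16 * variance / delta^2 users per group.

    The constant 16 assumes a significance level of 0.05 and power of 0.8.
    """
    return math.ceil(16 * sample_variance / delta ** 2)

def test_duration_days(n_per_group: int, daily_users_per_group: int) -> int:
    """Days needed to collect the sample, floored at 14 days
    so the test captures the weekly pattern."""
    return max(math.ceil(n_per_group / daily_users_per_group), 14)

# Hypothetical inputs: metric variance, minimum detectable effect,
# and daily traffic per group.
n = sample_size_per_group(sample_variance=0.25, delta=0.01)
print(n)                                                  # 40000 users per group
print(test_duration_days(n, daily_users_per_group=5000))  # 14 (8 days, floored)
```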
Sometimes we run tests with multiple variants to see which one is the best among all the candidate features. This can happen when we want to test multiple colors of a button or different home pages; then we'll have more than one treatment group. A sample interview question is: we are running 10 tests at the same time, trying different versions of our landing page; in one of them the test wins and the p-value is less than 0.05; would you make the change? The answer is no. In this case we should not simply use the same significance level of 0.05 to decide whether the test is significant, because we are dealing with more than two variants, and in such a scenario the probability of false discoveries increases. For example, if we have three groups to compare, what is the chance of observing at least one false positive, assuming a significance level of 0.05? We can first get the probability that there is no false positive, which is 0.95 to the power of 3, and then obtain the probability that there is at least one false positive: 1 minus 0.95³, which is about 14.3%. So with only three groups, the probability of a false positive (a type I error) is already over 14 percent. This is called the multiple testing problem.

There are several ways to deal with the multiple testing problem. One commonly used method is the Bonferroni correction: it divides the significance level by the number of tests. For the interview question, since we are running 10 tests, the significance level for each test should be 0.05 divided by 10, which is 0.005. Basically, only if a test shows a p-value less than 0.005 do we claim it is significant. The drawback of this method is that it tends to be too conservative. Another method is to control the false discovery rate (FDR). FDR is the expected value of the number of false positives divided by the number of rejections. It measures, out of all the rejections of the null hypothesis (that is, all the metrics you declare to have a statistically significant difference), how many have a real difference as opposed to being false positives. This only makes sense if you have a huge number of metrics, say hundreds. Suppose you have 200 metrics and cap FDR at 0.05: this means you are okay with false positives 5% of the time, so you will observe at least one false positive among those 200 metrics almost every time.
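Here is a minimal Python sketch of the ideas above: the probability of at least one false positive across independent tests, the Bonferroni per-test threshold, and a basic Benjamini-Hochberg procedure. The video does not name a specific FDR-controlling procedure; Benjamini-Hochberg is one standard choice, and the p-values below are made-up examples.

```python
def family_wise_error_rate(alpha: float, n_tests: int) -> float:
    """P(at least one false positive) across independent tests."""
    return 1 - (1 - alpha) ** n_tests

def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test significance level under the Bonferroni correction."""
    return alpha / n_tests

def benjamini_hochberg(p_values: list[float], q: float = 0.05) -> list[int]:
    """Indices of hypotheses rejected while controlling FDR at level q."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    k_max = 0
    for rank, i in enumerate(order, start=1):
        # Largest rank k with p_(k) <= (k / m) * q determines the cutoff.
        if p_values[i] <= rank / m * q:
            k_max = rank
    return sorted(order[:k_max])

print(round(family_wise_error_rate(0.05, 3), 3))  # 0.143, the ~14% above
print(bonferroni_threshold(0.05, 10))             # 0.005
print(benjamini_hochberg([0.001, 0.008, 0.039, 0.041, 0.20]))  # [0, 1]
```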
When there is a change in the product, people react to it differently. Some are used to the way it works and are reluctant to change; this is called the primacy effect, or change aversion. Others welcome changes, and the new feature attracts them to use the product more; this is called the novelty effect. But neither effect lasts long: people's behavior stabilizes after a certain amount of time. So if an A/B test shows a larger or smaller initial effect, it is probably due to the novelty or primacy effect. This is a common problem in practice, and many interview questions are about this topic. A simple question is: we ran an A/B test on a new feature and the test won, so we launched the change to all users; however, a week after launching the feature, we found the treatment effect quickly declined; what is happening? The answer is the novelty effect: over time, as the novelty wears off, repeat usage shrinks, so we observe a declining treatment effect.

Now that you understand both the novelty and primacy effects, how do we deal with them? One way is to completely rule out the possibility of those effects: we could run tests only on first-time users, because the novelty effect and the primacy effect obviously do not affect such users. If we already have a test running and we want to check for a novelty effect, we could compare the results of first-time users versus existing users in the treatment group to get an actual estimate of the impact of the novelty effect, and we can do the same for the primacy effect.
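As one way to make that check concrete, here is a sketch in Python (pandas) of a common variant of it: comparing the treatment lift for first-time versus existing users. The DataFrame and its columns (`is_new_user`, `group`, `metric`) are hypothetical placeholders, not from the video.

```python
import pandas as pd

def lift_by_segment(df: pd.DataFrame) -> pd.Series:
    """Treatment-vs-control lift of the metric, computed separately for
    first-time and existing users. A large gap between the two segments
    hints at a novelty (or primacy) effect."""
    means = df.groupby(["is_new_user", "group"])["metric"].mean().unstack("group")
    return (means["treatment"] - means["control"]) / means["control"]

# df = pd.read_csv("experiment_results.csv")  # hypothetical experiment log
# print(lift_by_segment(df))
# is_new_user
# False    0.09   <- existing users: inflated lift suggests a novelty effect
# True     0.02   <- first-time users: closer to the lasting effect
```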
We have just covered two effects that make test results unreliable. Interference between the control and treatment groups can also lead to unreliable results. Typically, we split the control and treatment groups by randomly selecting users. In the ideal scenario, each user is independent, and we expect no interference between the control and treatment groups. However, sometimes this assumption does not hold. This can happen when testing social networks such as Facebook, or two-sided markets such as Uber and Lyft.

Let's look at a sample interview question: company X has tested a new feature with the goal of increasing the number of posts created per user; they assign each user randomly to either the control or treatment group; the test won by one percent in terms of the number of posts; what do you expect to happen after the new feature is launched to all users? Would it be the same one percent? If not, would it be more or less? Assume there is no novelty effect. The answer is that we will see a value different from one percent. Let me explain why. In social networks such as Facebook, LinkedIn, and Twitter, a user's behavior is likely impacted by other people in their social circles: a user tends to use a feature or product more often if their friends use it. This is called a network effect. So if we use the user as the randomization unit and the treatment has an impact on users, the effect may spill over to the control group; that is, people in the control group are influenced by those in the treatment group. In that case, the difference between the control and treatment groups underestimates the real benefit of the treatment. So, back to the question: the post-launch effect will be more than one percent. That is how network effects influence social networks.

For two-sided markets such as Uber, Lyft, and Airbnb, interference between the control and treatment groups can also lead to biased estimates of the treatment effect. This is mainly because resources are shared between the control and treatment groups, meaning the two groups compete for the same resources. For example, if a new product attracts more drivers to the treatment group, fewer drivers will be available for the control group, so we will not be able to estimate the treatment effect accurately. But unlike social networks, where the measured treatment effect underestimates the real benefit of a new product, in two-sided markets the actual post-launch effect will be smaller than the measured treatment effect.

Now you understand why interference between control and treatment can cause the post-launch effect to differ from the measured treatment effect. This leads us to the next question: how do we design the test to prevent spillover between control and treatment? A sample interview question is: we are launching a new feature that provides coupons to our riders, with the goal of increasing the number of rides by decreasing the price of each ride; outline a testing strategy to evaluate the effect of the new feature.

There are many ways to deal with spillover between groups; the main idea is to isolate the users in the control and treatment groups. Here I will list a few commonly used solutions. For two-sided markets, we could use geo-based randomization: instead of splitting by users, we split by geographic locations. For example, we could put the New York metropolitan area in the control group and the San Francisco Bay Area in the treatment group. This allows us to isolate the users in each group, but it introduces large variance, since each market is unique in certain ways. The other method, though used less commonly, is time-based randomization. Basically, we select a random time, for example a day of the week, and assign all users to either the treatment or the control group during that time. It works when the treatment effect only lasts for a short amount of time, for example when testing whether a new surge-pricing algorithm works better. It does not work when the treatment effect takes a long time to materialize, for example in a referral program, where it can take some time for a user to refer his or her friends.

For social networks, one way is to create network clusters that represent groups of users who are more likely to interact with people within the group than with people outside of it. Once we have those clusters, we can split them into control and treatment groups. Another way is called ego-network randomization; the idea originated at LinkedIn. A cluster is composed of an "ego" (a focal individual) and her "alters" (the individuals she is immediately connected to). It focuses on measuring the "one-out" network effect, meaning the effect of my immediate connections' treatment on me. Each user either has the feature or does not, and no complicated modeling of interactions between users is needed, so this approach is simpler and more scalable than the previous one.
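To illustrate the cluster-based approach, here is a minimal sketch using networkx: detect communities in the friendship graph, then randomize at the cluster level rather than the user level. The graph (a built-in toy network standing in for a real social graph) and the cluster-level assignment are illustrative assumptions, not code from the video.

```python
import random
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

# Hypothetical friendship graph: nodes are users, edges are connections.
G = nx.karate_club_graph()  # stand-in for a real social graph

# Group users into clusters that interact mostly within themselves.
clusters = greedy_modularity_communities(G)

# Randomize at the cluster level so connected users share a variant,
# which limits spillover between control and treatment.
assignment = {}
for cluster in clusters:
    variant = random.choice(["control", "treatment"])
    for user in cluster:
        assignment[user] = variant

print(assignment)
```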
To summarize: the methods we just mentioned apply in different scenarios, and all of them have limitations. In reality, we want to evaluate which method works best in a given scenario, and we could even combine more than one method to get reliable results.

So those are the six topics that I promised to share with you; I hope you've learned something new. Finally, let me recommend two resources to learn more about A/B testing. The first one is an online course from Udacity; it is completely free and covers all the fundamentals of A/B testing. My friend Kelly has a great post summarizing the content of that course; check it out if you are interested. The other one is the book Trustworthy Online Controlled Experiments; it has more in-depth knowledge on how to run A/B tests in practice, including some of the potential pitfalls and solutions. It has lots of useful material, so I actually plan to make a video summarizing the content of this book; stay tuned if you are interested. So that is all for today. Thank you so much for being here. Let me know if you have any questions, and I will see you soon.

Info
Channel: Data Interview Pro
Views: 46,326
Rating: 4.9480519 out of 5
Keywords: data science interview, data science interviews, ab testing, a/b testing, ab testing interview, data science interview questions, ab testing interview questions, hypothesis testing, data science, data interview, data interview pro, data scientist interview questions
Id: X8u6kr4fxXc
Length: 16min 36sec (996 seconds)
Published: Wed Jan 13 2021