Cracking A/B Testing Problems in Data Science Interviews | Product Sense | Case Interview

Video Statistics and Information

Captions
Hi, it's Emma, welcome back to my channel. So far the most popular video on my channel is cracking product sense problems, and many people have reached out and told me that this is where they need more help. So in today's video I am going to talk about A/B testing, because A/B testing problems are often asked together with metric questions in data science interviews, and it is hard to provide an insightful and in-depth answer if you don't have much knowledge of A/B testing. In this video I will provide everything you need to know to crack A/B testing problems. Specifically, I will go through six important topics of A/B testing and provide some of the most commonly asked questions and answers. Finally, I will share with you some resources to learn more about the subject. Let's get started.

Since we have lots of things to cover, here's an outline of this video. Feel free to jump to the sections you want to learn more about and skip the ones you are already familiar with. The topics we are going to cover are: what is A/B testing, how long to run an A/B test, the multiple testing problem, novelty and primacy effects, interference between variants, and dealing with interference.

First and foremost, let me briefly explain what A/B testing is. A/B tests, also known as controlled experiments, are used widely in industry to make product launch decisions. In the simplest form there are two variants: control (A) and treatment (B). Typically the control group uses the existing feature while the treatment group uses the new feature. A/B testing allows tech companies to evaluate a feature with a subset of users to infer how it may be received by all users. A/B testing is one of a data scientist's core competencies, so A/B testing questions appear frequently in data science interviews. They are typically asked together with metric questions, and they can touch any component of A/B testing, including developing new hypotheses, designing an A/B test, evaluating test results, and making ship-or-no-ship decisions.

The second topic is about designing an A/B test, specifically how long to run it. This is a commonly asked question during interviews. To decide the duration of the test we need to obtain the sample size, and three parameters are needed to get it. These parameters are the type II error (or the power, because power equals 1 minus the type II error, so if you know one of them you know the other), the significance level, and the minimum detectable effect. The rule of thumb is that the sample size approximately equals 16 multiplied by the sample variance divided by delta squared (n ≈ 16σ²/δ²), where delta is the difference between treatment and control. I know some of you may be interested in learning how we come up with this rule-of-thumb formula, so I have another video that explains it step by step; feel free to check out the link in the description. During the interview you are not required to derive the formula, but you do want to talk about how each parameter influences the sample size. For example, we need more samples if the sample variance is larger, and we need fewer samples if delta is larger. The sample variance can be obtained from the data, but how do we estimate the difference between treatment and control? We actually don't know it before we run the experiment, and this is where we use the third parameter, the minimum detectable effect. It is the smallest difference that would matter in practice. For example, we may consider a 0.1 percent increase in revenue as the minimum detectable effect. In reality, this value is decided by multiple stakeholders.

Once we know the sample size, we can obtain the number of days to run the experiment by dividing the sample size by the number of users in each group per day. If the result is less than 14 days, we would typically still run the test for 14 days to capture the weekly pattern.
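To make the arithmetic concrete, here is a minimal Python sketch of the rule-of-thumb calculation described above; the input numbers are made up for illustration and are not from the video.

```python
import math

def sample_size_per_group(sample_variance: float, delta: float) -> int:
    """Rule of thumb: n ≈ 16 * variance / delta^2 users per group.

    The constant 16 assumes a significance level of 0.05 and power of 0.8.
    """
    return math.ceil(16 * sample_variance / delta ** 2)

def test_duration_days(n_per_group: int, daily_users_per_group: int) -> int:
    """Days needed to collect the sample, floored at 14 days
    so the test captures the weekly pattern."""
    return max(math.ceil(n_per_group / daily_users_per_group), 14)

# Hypothetical inputs: metric variance, minimum detectable effect,
# and daily traffic per group.
n = sample_size_per_group(sample_variance=0.25, delta=0.01)
print(n)                                                  # 40000 users per group
print(test_duration_days(n, daily_users_per_group=5000))  # 14 (8 days, floored)
```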
Sometimes we run tests with multiple variants to see which one is the best among all the candidate features. This can happen when we want to test multiple colors of a button or different home pages; then we'll have more than one treatment group. A sample interview question is: we are running 10 tests at the same time, trying different versions of our landing page; in one of them the test wins and the p-value is less than 0.05; would you make the change? The answer is no. In this case we should not simply use the same significance level of 0.05 to decide whether the test is significant, because we are dealing with more than two variants, and in such a scenario the probability of false discoveries increases. For example, if we have three groups to compare, what is the chance of observing at least one false positive, assuming a significance level of 0.05? We can first get the probability that there is no false positive, which is 0.95 to the power of 3, and then obtain the probability that there is at least one false positive: 1 minus 0.95³, which is about 14.3%. So with only three groups, the probability of a false positive (a type I error) is already over 14 percent. This is called the multiple testing problem.

There are several ways to deal with the multiple testing problem. One commonly used method is the Bonferroni correction: it divides the significance level by the number of tests. For the interview question, since we are running 10 tests, the significance level for each test should be 0.05 divided by 10, which is 0.005. Basically, only if a test shows a p-value less than 0.005 do we claim it is significant. The drawback of this method is that it tends to be too conservative. Another method is to control the false discovery rate (FDR). FDR is the expected value of the number of false positives divided by the number of rejections. It measures, out of all the rejections of the null hypothesis (that is, all the metrics you declare to have a statistically significant difference), how many have a real difference as opposed to being false positives. This only makes sense if you have a huge number of metrics, say hundreds. Suppose you have 200 metrics and cap FDR at 0.05: this means you are okay with false positives 5% of the time, so you will observe at least one false positive among those 200 metrics almost every time.
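Here is a minimal Python sketch of the ideas above: the probability of at least one false positive across independent tests, the Bonferroni per-test threshold, and a basic Benjamini-Hochberg procedure. The video does not name a specific FDR-controlling procedure; Benjamini-Hochberg is one standard choice, and the p-values below are made-up examples.

```python
def family_wise_error_rate(alpha: float, n_tests: int) -> float:
    """P(at least one false positive) across independent tests."""
    return 1 - (1 - alpha) ** n_tests

def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test significance level under the Bonferroni correction."""
    return alpha / n_tests

def benjamini_hochberg(p_values: list[float], q: float = 0.05) -> list[int]:
    """Indices of hypotheses rejected while controlling FDR at level q."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    k_max = 0
    for rank, i in enumerate(order, start=1):
        # Largest rank k with p_(k) <= (k / m) * q determines the cutoff.
        if p_values[i] <= rank / m * q:
            k_max = rank
    return sorted(order[:k_max])

print(round(family_wise_error_rate(0.05, 3), 3))  # 0.143, the ~14% above
print(bonferroni_threshold(0.05, 10))             # 0.005
print(benjamini_hochberg([0.001, 0.008, 0.039, 0.041, 0.20]))  # [0, 1]
```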
When there is a change in the product, people react to it differently. Some are used to the way it works and are reluctant to change; this is called the primacy effect, or change aversion. Others welcome changes, and the new feature attracts them to use the product more; this is called the novelty effect. But neither effect lasts long: people's behavior stabilizes after a certain amount of time. So if an A/B test shows a larger or smaller initial effect, it is probably due to the novelty or primacy effect. This is a common problem in practice, and many interview questions are about this topic. A simple question is: we ran an A/B test on a new feature and the test won, so we launched the change to all users; however, a week after launching the feature, we found the treatment effect quickly declined; what is happening? The answer is the novelty effect: over time, as the novelty wears off, repeat usage shrinks, so we observe a declining treatment effect.

Now that you understand both the novelty and primacy effects, how do we deal with them? One way is to completely rule out the possibility of those effects: we could run tests only on first-time users, because the novelty effect and the primacy effect obviously do not affect such users. If we already have a test running and we want to check for a novelty effect, we could compare the results of first-time users versus existing users in the treatment group to get an actual estimate of the impact of the novelty effect, and we can do the same for the primacy effect.
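As one way to make that check concrete, here is a sketch in Python (pandas) of a common variant of it: comparing the treatment lift for first-time versus existing users. The DataFrame and its columns (`is_new_user`, `group`, `metric`) are hypothetical placeholders, not from the video.

```python
import pandas as pd

def lift_by_segment(df: pd.DataFrame) -> pd.Series:
    """Treatment-vs-control lift of the metric, computed separately for
    first-time and existing users. A large gap between the two segments
    hints at a novelty (or primacy) effect."""
    means = df.groupby(["is_new_user", "group"])["metric"].mean().unstack("group")
    return (means["treatment"] - means["control"]) / means["control"]

# df = pd.read_csv("experiment_results.csv")  # hypothetical experiment log
# print(lift_by_segment(df))
# is_new_user
# False    0.09   <- existing users: inflated lift suggests a novelty effect
# True     0.02   <- first-time users: closer to the lasting effect
```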
We have just covered two effects that make test results unreliable. Interference between the control and treatment groups can also lead to unreliable results. Typically, we split the control and treatment groups by randomly selecting users. In the ideal scenario, each user is independent, and we expect no interference between the control and treatment groups. However, sometimes this assumption does not hold. This can happen when testing social networks such as Facebook, or two-sided markets such as Uber and Lyft.

Let's look at a sample interview question: company X has tested a new feature with the goal of increasing the number of posts created per user; they assign each user randomly to either the control or treatment group; the test won by one percent in terms of the number of posts; what do you expect to happen after the new feature is launched to all users? Would it be the same one percent? If not, would it be more or less? Assume there is no novelty effect. The answer is that we will see a value different from one percent. Let me explain why. In social networks such as Facebook, LinkedIn, and Twitter, a user's behavior is likely impacted by other people in their social circles: a user tends to use a feature or product more often if their friends use it. This is called a network effect. So if we use the user as the randomization unit and the treatment has an impact on users, the effect may spill over to the control group; that is, people in the control group are influenced by those in the treatment group. In that case, the difference between the control and treatment groups underestimates the real benefit of the treatment. So, back to the question: the post-launch effect will be more than one percent. That is how network effects influence social networks.

For two-sided markets such as Uber, Lyft, and Airbnb, interference between the control and treatment groups can also lead to biased estimates of the treatment effect. This is mainly because resources are shared between the control and treatment groups, meaning the two groups compete for the same resources. For example, if a new product attracts more drivers to the treatment group, fewer drivers will be available for the control group, so we will not be able to estimate the treatment effect accurately. But unlike social networks, where the measured treatment effect underestimates the real benefit of a new product, in two-sided markets the actual post-launch effect will be smaller than the measured treatment effect.

Now you understand why interference between control and treatment can cause the post-launch effect to differ from the measured treatment effect. This leads us to the next question: how do we design the test to prevent spillover between control and treatment? A sample interview question is: we are launching a new feature that provides coupons to our riders, with the goal of increasing the number of rides by decreasing the price of each ride; outline a testing strategy to evaluate the effect of the new feature.

There are many ways to deal with spillover between groups; the main idea is to isolate the users in the control and treatment groups. Here I will list a few commonly used solutions. For two-sided markets, we could use geo-based randomization: instead of splitting by users, we split by geographic locations. For example, we could put the New York metropolitan area in the control group and the San Francisco Bay Area in the treatment group. This allows us to isolate the users in each group, but it introduces large variance, since each market is unique in certain ways. The other method, though used less commonly, is time-based randomization. Basically, we select a random time, for example a day of the week, and assign all users to either the treatment or the control group during that time. It works when the treatment effect only lasts for a short amount of time, for example when testing whether a new surge-pricing algorithm works better. It does not work when the treatment effect takes a long time to materialize, for example in a referral program, where it can take some time for a user to refer his or her friends.

For social networks, one way is to create network clusters that represent groups of users who are more likely to interact with people within the group than with people outside of it. Once we have those clusters, we can split them into control and treatment groups. Another way is called ego-network randomization; the idea originated at LinkedIn. A cluster is composed of an "ego" (a focal individual) and her "alters" (the individuals she is immediately connected to). It focuses on measuring the "one-out" network effect, meaning the effect of my immediate connections' treatment on me. Each user either has the feature or does not, and no complicated modeling of interactions between users is needed, so this approach is simpler and more scalable than the previous one.
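To illustrate the cluster-based approach, here is a minimal sketch using networkx: detect communities in the friendship graph, then randomize at the cluster level rather than the user level. The graph (a built-in toy network standing in for a real social graph) and the cluster-level assignment are illustrative assumptions, not code from the video.

```python
import random
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

# Hypothetical friendship graph: nodes are users, edges are connections.
G = nx.karate_club_graph()  # stand-in for a real social graph

# Group users into clusters that interact mostly within themselves.
clusters = greedy_modularity_communities(G)

# Randomize at the cluster level so connected users share a variant,
# which limits spillover between control and treatment.
assignment = {}
for cluster in clusters:
    variant = random.choice(["control", "treatment"])
    for user in cluster:
        assignment[user] = variant

print(assignment)
```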
To summarize: the methods we just mentioned apply in different scenarios, and all of them have limitations. In reality, we want to evaluate which method works best in a given scenario, and we could even combine more than one method to get reliable results.

So those are the six topics that I promised to share with you; I hope you've learned something new. Finally, let me recommend two resources to learn more about A/B testing. The first one is an online course from Udacity; it is completely free and covers all the fundamentals of A/B testing. My friend Kelly has a great post summarizing the content of that course; check it out if you are interested. The other one is the book Trustworthy Online Controlled Experiments; it has more in-depth knowledge on how to run A/B tests in practice, including some of the potential pitfalls and solutions. It has lots of useful material, so I actually plan to make a video summarizing the content of this book; stay tuned if you are interested. So that is all for today. Thank you so much for being here. Let me know if you have any questions, and I will see you soon.

Info
Channel: Data Interview Pro
Views: 46,326
Rating: 4.9480519 out of 5
Keywords: data science interview, data science interviews, ab testing, a/b testing, ab testing interview, data science interview questions, ab testing interview questions, hypothesis testing, data science, data interview, data interview pro, data scientist interview questions
Id: X8u6kr4fxXc
Length: 16min 36sec (996 seconds)
Published: Wed Jan 13 2021