Data Scientists Must Know: A/B Testing Fundamentals

Captions
Hey guys, welcome back to my channel. In this video I want to talk about the fundamentals of A/B testing. Different from my previous videos on A/B testing, where I just shared commonly asked interview questions and answers, in this video and the next few I will start from the very basics and dive deep into some practical problems of running A/B tests in reality. So if you want to learn A/B testing in depth, whether to prepare for an interview or to expand your knowledge, this video is definitely for you.

Today's video is going to be slightly different from what I normally do, where I just share what I already know from my own experience. Instead, I want to share not only what I already knew but also a few new things I learned from a book I have been reading over the last several weeks: Trustworthy Online Controlled Experiments: A Practical Guide to A/B Testing. It was published in 2020 and written by three industry professionals, all of whom have lots of experience leading A/B testing in big tech companies. I have recommended this book in my blog posts and videos, and it is, in my opinion, one of the most practical books on A/B testing. If you want to learn A/B testing in depth, this book is a great resource.

Since A/B testing covers many things, it's hard to cover everything in a single video, so I plan to make a series of videos going from the very basics to advanced topics. Each topic will be independent, so it's easier for you to rewatch a video on that topic. Okay, that's a long introduction; now let's jump right into what A/B tests are.

An A/B test is an experiment in which all elements are held constant except for one variable. Typically it compares a control group against a treatment group: all variables are identical between the two groups except for the one factor being tested. The different versions of a product or user experience are formally referred to as the variants. Variants can be as simple as the colors of a button or as complicated as different back-end algorithms for displaying search results. When there are two variants, one control and one treatment, the experiment is called an A/B test; if there are more than two variants, it's called an A/B/n test, though in practice "A/B test" is often used to refer to experiments with multiple variants as well. I sometimes get this question: what is the difference between A/B tests and controlled experiments? They are the same thing. A/B tests are sometimes called A/B/n tests, controlled experiments, randomized controlled experiments, or split tests, but they all refer to the same idea.

Now let me give you an example of an A/B test. The book mentions an interesting one: Google tested 41 gradations of blue on its search result pages, with a different color in each treatment group. Even though the test frustrated Google's visual design lead at the time, the results showed that the color scheme significantly changed user engagement. A/B tests are widely adopted in the industry for evaluating new product ideas. In fact, when you are browsing a website or using a mobile app, you might be part of an experiment running behind the scenes.

But why do we need to run experiments? Why do companies run experiments instead of simply rolling out a new feature? The goal of running A/B tests is to make data-driven decisions: only when the results are reliable and repeatable can we make the right decision. To make the results reproducible, an important requirement is that the factor we are testing is the cause of the change in the metric, so that when the feature is launched to all traffic, the impact can be predicted from the treatment effect measured in the experiment. For example, a change of color could cause a change in user engagement, assuming other things stay the same, and running an A/B test is the scientific way to establish that. In the book, the authors claim that randomized controlled experiments are the gold standard for establishing causality. They describe online controlled experiments as: the best scientific way to establish causality with high probability; able to detect small changes that are harder to detect with other techniques, such as changes over time; and able to detect unexpected changes, which is often underappreciated, since many experiments uncover surprising impacts on other metrics.

Now you know what an A/B test is, as well as the importance of running A/B tests. Let's dive into the major steps involved in running them. In general, there are five major steps involved in running a test correctly; I have drawn a diagram to help you understand it clearly. Let's go through each step one by one.

Before running experiments, a few things need to be ready. First of all, we need to define key metrics to measure the goal of an experiment. The key metric is formally known as the overall evaluation criterion, or OEC. It should be agreed upon by different stakeholders and should be practically measurable. For example, if we want to test whether changing the color of the checkout button could impact revenue, the key metric, or OEC, could be revenue per user per month. The second requirement is that changes are easy to make. This should be obvious, because we need to compare different variants and find the one that has the highest positive impact on the OEC; if changes are very hard to make, it introduces complexity in generating the variants. For example, it would be very difficult to redesign the whole website and consider that redesign a single variant. The last requirement is to have enough randomization units to assign to the different variants. But what is a randomization unit? It's simply the "who" or "what" that is randomly allocated to the different groups; the most commonly used randomization unit is the user. So how much is enough? The recommendation in the book is to have thousands of randomization units, because the larger the number, the smaller the effects that can be detected.

After these requirements are fulfilled, we can move on to designing the experiment, and the book touches on a few things that need to be considered. What population of randomization units do we want to select? Basically, do we want to target a specific population or all users? Sometimes it's helpful to run an experiment on a specific segment because the change only affects that segment, for example a new feature that is only available to users in a particular geographic region. Another factor to consider is the size of the experiment: we need to compute the sample size required to achieve the desired statistical power, and detecting a small change needs more users. If you are interested in learning how to get the sample size, I have a video that derives the formula step by step. The last important consideration is how long to run the experiment. To determine the duration, we need to consider seasonality, the day-of-week effect, and primacy and novelty effects; all of them influence the decision on how long to run an experiment.

After all those decisions are made, we can run the experiment and collect the data. In this process, data scientists typically work with engineers to instrument logging and obtain the logged data; for companies that have built their own experimentation platform, this is done automatically. After running the experiment for the required amount of time, we need to check and interpret the results and use them to make a decision. In reality, this is where data scientists spend most of their time and energy. Once we obtain the data, the very first step is to run sanity checks to make sure the data are reliable. We can only continue the analysis once the sanity checks pass; if not, we need to discard the results, look into the root cause, and possibly re-run the experiment. I won't dive into those checks here, but I will explain them in detail in an upcoming video.

Once the sanity checks pass, we can use the results to make a launch decision, and there are many factors to consider. The book recommends examining at least these factors. The first one is the trade-off between different metrics. This refers to the scenario where different metrics move in opposite directions, for example user engagement goes up but revenue goes down; how do we make the decision? The other factors can be summarized as the cost of launching a change, for example the cost of engineering maintenance after launch: since new code may introduce complexity and bugs into the code base, the maintenance effort can be costly. There are also opportunity costs: the time and effort we spend launching one change could instead be spent on a different idea. If those costs are high, we need to ensure that the expected benefits outweigh them. In fact, that's why we typically set a practical significance boundary to reflect those costs, and we only launch a product if the result is practically significant. On the contrary, if the cost is low, we can choose to launch any change that is positive; in other words, as long as the result is statistically significant, we can launch the change. If you're not familiar with the concept of a practical significance boundary, I highly recommend checking out this video, which covers an analysis using both statistical and practical significance boundaries to make a launch decision.

At this point you might think we're done with the experiment, because we have made a decision. Well, we're getting close, but we're not done yet. If we decide to launch a new product based on the results of an experiment, we need to monitor the long-term effect after launch, because the short-term effect can differ from the long-term effect for various reasons. Measuring long-term effects also has a few benefits, such as insights into long-term impacts that can help improve future iterations.

Alright guys, this is the first part of the tutorial on A/B testing. In the next video we will dive into an end-to-end example and talk about the whole process of running an experiment in reality: how to select the right metric and randomization units, and how to decide how long to run an experiment. Stay tuned for upcoming videos. I will see you soon!
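To make the control-versus-treatment comparison described above concrete, here is a minimal sketch of a two-proportion z-test on conversion counts. The function name and the example numbers are my own illustration, not from the video or the book:

```python
import math

def two_proportion_ztest(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test comparing conversion rates of control (A) and treatment (B)."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)  # pooled rate under the null hypothesis
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Normal CDF via erf; two-sided p-value
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value

# Hypothetical counts: 5.0% vs 5.6% conversion over 20,000 users per group
z, p = two_proportion_ztest(conv_a=1000, n_a=20000, conv_b=1120, n_b=20000)
print(f"z = {z:.2f}, p = {p:.4f}")
```

A small p-value here only establishes statistical significance; as the video discusses later, the launch decision also weighs practical significance.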
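The video notes that the randomization unit, most commonly the user, is randomly allocated to variants. One common way to implement this in practice (an implementation choice of mine, not something the video prescribes) is to hash the user ID together with an experiment name, so assignment is stable across sessions and independent across experiments:

```python
import hashlib

def assign_variant(user_id, experiment, variants=("control", "treatment")):
    """Deterministically map a user to a variant for a given experiment.

    The experiment name salts the hash, so the same user can land in
    different buckets in different experiments.
    """
    digest = hashlib.md5(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % len(variants)
    return variants[bucket]

print(assign_variant("user_42", "checkout_button_color"))
```

Because the mapping is deterministic, a returning user always sees the same variant, which is essential for measuring effects like revenue per user per month.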
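The sample-size consideration mentioned in the design step can be sketched with the standard two-proportion power approximation. The speaker derives the formula in a separate video, so treat this as a generic textbook approximation rather than their exact derivation:

```python
import math
from statistics import NormalDist

def sample_size_per_group(p_baseline, mde_abs, alpha=0.05, power=0.8):
    """Approximate users needed per variant for a two-proportion test.

    Uses n = (z_{alpha/2} + z_beta)^2 * 2 * p(1-p) / delta^2, with the
    variance approximated at the baseline rate.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # ~1.96 for alpha = 0.05
    z_beta = NormalDist().inv_cdf(power)           # ~0.84 for power = 0.8
    var = p_baseline * (1 - p_baseline)
    return math.ceil((z_alpha + z_beta) ** 2 * 2 * var / mde_abs ** 2)

# Hypothetical: 5% baseline conversion, want to detect a 0.5pp absolute lift
print(sample_size_per_group(p_baseline=0.05, mde_abs=0.005))
```

Note how halving the detectable effect roughly quadruples the required sample, which is why the book recommends thousands of randomization units.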
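The launch logic discussed above, comparing the confidence interval for the observed lift against a practical significance boundary, can be sketched as follows. The decision labels and thresholds are illustrative assumptions, not rules from the video or the book:

```python
from statistics import NormalDist

def launch_decision(lift, se, practical_boundary, alpha=0.05):
    """Sketch of a launch rule using statistical AND practical significance.

    lift: observed treatment effect; se: its standard error;
    practical_boundary: smallest lift worth the launch/maintenance cost.
    """
    z = NormalDist().inv_cdf(1 - alpha / 2)
    lo, hi = lift - z * se, lift + z * se  # confidence interval for the lift
    if lo > practical_boundary:
        return "launch"                    # significant both ways
    if hi < 0:
        return "do not launch"             # statistically significant harm
    if hi < practical_boundary and lo > 0:
        return "positive but not practically significant"
    return "inconclusive: consider more data"

print(launch_decision(lift=0.02, se=0.004, practical_boundary=0.01))
```

When launch costs are low, the boundary can be set near zero, which reduces this rule to "launch anything statistically significant and positive," matching the low-cost case in the transcript.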
Info
Channel: Data Interview Pro
Views: 16,572
Keywords: data science, data scientist, ab testing, ab test, a/b testing, a/b test, trustworthy online controlled experiments, a practical guide to a/b testing, trustworthy online controlled experiments book summary, data interview, data interview pro
Id: VpTlNRUcIDo
Length: 11min 38sec (698 seconds)
Published: Mon May 31 2021