Choosing the Right Metrics for A/B Testing | Success Metric, Driver Metric, Guardrail Metric

Video Statistics and Information

Captions
Hey guys, welcome back to my channel. In this video we will do a deep dive on metrics. Selecting the right metrics is super important for running an A/B test in practice, because we want to be clear about the goal, as well as how to measure the results, before running it. Not only that, metric-related questions often appear in data science interviews. It can be a straightforward question that asks you about the pros and cons of a specific metric, or a question that asks you to formulate a metric for an online experiment. So as a data scientist, knowledge of metrics is fundamental. In this video we will start with business metrics, including goal metrics, driver metrics, and guardrail metrics. Then we'll talk about how to formulate metrics for online experiments. The content of this video is based not only on my own knowledge of metrics but also on a few new things I've learned from reading the book Trustworthy Online Controlled Experiments. I learned quite a few things from that book, and I recommend it to anyone who is interested in learning about A/B testing. Okay, let's get started.

There are three kinds of operational metrics that companies use to measure success and progress and to understand areas for improvement. The first kind is the goal metric, also known as a success metric, true north metric, north star metric, OKR metric, or primary metric. This kind of metric reflects a company's long-term vision, and it always ties to the company's mission. Goal metrics are a small set of metrics that the company truly cares about. I know it may sound abstract: how do we translate such a mission or vision into a set of metrics? Let me give you an example. Facebook's mission is to give people the power to build community and bring the world closer together, and its goal metrics include advertising revenue, daily active users, and monthly active users. While the translation from its mission to its goal metrics isn't perfect, the goal metrics do reflect what the company ultimately cares about, and they are simple enough to be easily communicated to different stakeholders such as investors, customers, and employees. A goal metric should also be stable over long periods of time, to allow the whole organization to work towards improving it.

While goal metrics are critical for measuring the overall success of a company, they may not be suitable for online experiments, because they can be difficult to measure, or may not be sensitive enough to product changes. For example, Facebook cares about ad revenue, but not every team could use it for A/B testing. There are teams focusing on improving user engagement, and teams focusing on website or native app performance. What those teams do definitely contributes to the company's overall success, but they don't use those company-level goal metrics to measure their performance. Compared with goal metrics, which are about the long-term vision, we also need metrics that reflect short-term progress. Driver metrics, also known as surrogate, indirect, or predictive metrics, are often used to measure short-term objectives. They align with the goal metrics, but they are more sensitive and actionable, so they can measure short-term progress and drive teams to work on it. That's also why they are better than goal metrics for A/B testing. Now let me give you a concrete example of a driver metric: a marketing team's goal is to acquire new users, and one of its driver metrics could be the number of new users registered per day.

The distinction between the goal metric and the driver metric is actually something new I learned from the book. Before reading it, I thought I knew what a success metric is; I have developed such metrics in practice and have used them to run online experiments. But after reading the book, I realized I was wrong: what I thought of as success metrics were actually driver metrics. In fact, a success metric is the same as a goal metric, about the long-term vision, while a driver metric is used to measure short-term progress and is more suitable for online experiments. Check out this blog written by my friend Rob and me; it covers several metric frameworks, which can be very helpful for understanding what metrics are used in different business domains.

The last category of metrics is guardrail metrics. As the name suggests, guardrail metrics guard us against harming the business and against violating important experiment assumptions. The book categorizes guardrail metrics into two groups, which I think is very helpful for understanding their different roles. The first is the organizational guardrail metric: if this kind of metric shifts in the negative direction, the business will suffer a significant loss. For example, if the loading time of a web page increases by even a few milliseconds, there can be a significant loss of customers and revenue. In practice, page loading latency is often used as a guardrail metric when new features are developed and tested through A/B testing. A few other commonly used organizational guardrail metrics include errors per page and client crashes. The other kind of guardrail metric is trustworthiness-related metrics. They are used to monitor the trustworthiness of an experiment, that is, to check whether any of its assumptions are violated. One commonly used check is whether the randomization units assigned to each variant are truly random. When the numbers of units in the different groups don't match the designed split, the authors refer to this as a sample ratio mismatch. We then need to perform a t-test or a chi-square test to check whether the assignment ratio matches what was designed.

Now you know the definitions of goal metrics, driver metrics, and guardrail metrics. In practice, we need to be clear about the context when talking about a specific metric, because the same metric can be used differently by different teams: one team's driver metric can be another team's guardrail metric. For example, a front-end team's goal is to improve web performance, so reducing latency is their goal, and time to interactive (TTI) can be one of their driver metrics. A product team may use the same metric as a guardrail metric, to make sure product changes don't increase latency.

Next, let's look at the attributes of a good metric. The blog post I mentioned earlier gives a few general rules for formulating metrics. First, a good metric should be simple: easy to understand and calculate, and people should be able to remember and discuss it easily. If you cannot explain a metric in one sentence, it's not simple. Secondly, the definition of a good metric is clear, with no ambiguity in its interpretation. Thirdly, a good metric should be actionable: it can be shifted by changes in the product, and it offers insights into how to improve. Finally, it should not be easily gamed. Gaming means a metric makes you feel like you are getting results, but it offers no insight into actual business health or growth. Short-term revenue is an example of such a metric: increasing product prices may increase short-term revenue, but the business may lose customers in the long run.

We have talked about operational metrics, which are critical for measuring a company's or a team's performance. However, not all of them are suitable for online controlled experiments. Next, let's go over the requirements for metrics that can be used directly for experimentation. In the book, the authors summarize three attributes of metrics that are suitable for experimentation. Measurable: we should be able to calculate the metrics with data collected during the experiment period. Attributable: we should be able to attribute metric values to the experiment variants, meaning the metrics can be calculated separately for each variant in the experiment. Sensitive and timely: experiment metrics should be sensitive enough to detect changes in a timely manner. In online experiments, we typically select a few driver metrics as the key metrics, plus some guardrail metrics to monitor impacts on other aspects of the business.

Now I want to share one question I get constantly: since we have multiple metrics for an experiment, how do we make the launch decision when one metric goes up and another goes down? It's a very reasonable question, and this scenario happens often in practice. Many organizations have a mental model of the trade-offs they are willing to accept when they see a particular result, for example trade-offs between user acquisition and revenue. How should the company strike the optimal balance between revenue maximization and user acquisition? Acquiring new users can always be done through expensive campaigns, such as providing large discounts or gifts, but that will degrade revenue. This kind of trade-off is not something that can be determined by a single data scientist or a single data science team; it's something that is discussed and aligned among various stakeholders, such as product teams, marketing teams, engineering teams, and the leadership team.

Finally, I want to share two practical suggestions from the book on formulating metrics for experiments. One is to combine a few target metrics into an overall evaluation criterion (OEC), a weighted combination of the most important driver metrics, and use it as the only criterion for the experiment. If coming up with such an OEC is difficult, the authors recommend choosing no more than five metrics as the target metrics. There are two main disadvantages of having too many metrics. One is that too many metrics may confuse people and possibly lead to ignoring the key metrics. The other downside is that including too many metrics may complicate the decision-making process and increase the chances of false discoveries.

Alright guys, I hope you have learned something new about metrics in this video. In the next few videos we will continue talking about interesting topics in A/B testing. Stay tuned, I will see you soon.
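The sample ratio mismatch check mentioned in the captions (a chi-square test on the assignment counts) can be sketched in a few lines. This is a minimal illustration, not from the video: the function name, the 50/50 default split, and the 0.001 threshold are my own choices.

```python
import math

def srm_check(control_n, treatment_n, expected_ratio=0.5, alpha=0.001):
    """Chi-square goodness-of-fit test (1 degree of freedom) for a
    Sample Ratio Mismatch between two experiment variants."""
    total = control_n + treatment_n
    expected_control = total * expected_ratio
    expected_treatment = total * (1 - expected_ratio)
    stat = ((control_n - expected_control) ** 2 / expected_control
            + (treatment_n - expected_treatment) ** 2 / expected_treatment)
    # Survival function of chi-square with 1 dof: P(X > stat) = erfc(sqrt(stat / 2))
    p_value = math.erfc(math.sqrt(stat / 2))
    return p_value, p_value < alpha  # a True flag means a likely SRM

# A small imbalance is expected noise; a large one signals broken randomization.
p_ok, flag_ok = srm_check(50_000, 50_300)    # roughly balanced
p_bad, flag_bad = srm_check(50_000, 52_000)  # 2,000 extra units in treatment
```

A conservative alpha such as 0.001 is often used for SRM checks because a confirmed mismatch invalidates the whole experiment, so false alarms are expensive to chase; the exact threshold here is an assumption.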
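To make the "attributable" requirement concrete: metric values have to be computable separately per variant from the logged events. A minimal sketch, where the event schema and field names (`variant`, `clicks`) are hypothetical:

```python
from collections import defaultdict

def clicks_per_user_by_variant(events):
    """Compute a simple driver metric (average clicks per user)
    separately for each experiment variant."""
    clicks = defaultdict(float)
    users = defaultdict(int)
    for event in events:
        clicks[event["variant"]] += event["clicks"]
        users[event["variant"]] += 1
    return {variant: clicks[variant] / users[variant] for variant in clicks}

events = [
    {"variant": "control", "clicks": 2},
    {"variant": "control", "clicks": 4},
    {"variant": "treatment", "clicks": 5},
]
metrics = clicks_per_user_by_variant(events)  # {'control': 3.0, 'treatment': 5.0}
```

If a metric cannot be split by variant like this (for example, company-wide revenue that mixes all users), it fails the attributable requirement and cannot serve as an experiment metric.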
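The overall evaluation criterion idea, a weighted combination of the most important driver metrics, can be sketched as a weighted sum of relative lifts. The metric names and weights below are hypothetical; as the captions note, in practice the weights encode trade-offs agreed among stakeholders, not chosen by a single data scientist.

```python
def oec_score(treatment, control, weights):
    """Overall Evaluation Criterion: a weighted combination of the
    relative lifts of the chosen driver metrics."""
    score = 0.0
    for metric, weight in weights.items():
        lift = (treatment[metric] - control[metric]) / control[metric]
        score += weight * lift
    return score

# Engagement up 10%, revenue down 5%; the weights encode the accepted trade-off,
# so a single number decides whether the combined movement is a win.
score = oec_score(
    treatment={"dau": 110_000, "revenue": 95_000},
    control={"dau": 100_000, "revenue": 100_000},
    weights={"dau": 0.6, "revenue": 0.4},
)
```

A positive score here means the weighted gains outweigh the weighted losses; guardrail metrics would still be monitored separately, since the OEC only combines the target driver metrics.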
Info
Channel: Data Interview Pro
Views: 9,971
Keywords: data science, data scientist, ab testing, ab test, a/b testing, a/b test, trustworthy online controlled experiments, a practical guide to a/b testing, trustworthy online controlled experiments book summary, data science interview, data science interview questions, ab testing metrics, data interview, data interview pro
Id: SuXc5ckvlJ8
Length: 11min 13sec (673 seconds)
Published: Mon Jun 14 2021