Cracking Hypothesis Testing Problems in Data Science Interviews | Binomial test, z-test and t-test

Video Statistics and Information

Captions
Hey guys, welcome back to my channel. In this video I want to dive into hypothesis testing problems in data science interviews. The hypothesis tests we are going to cover include the binomial test, the z-test, and the t-test. By the end of this video you will understand the different kinds of interview questions on hypothesis testing, the differences between these tests, and when to use each of them. The knowledge you gain from this video will help you solve those interview problems easily and accurately. Let's get started.

There are three kinds of hypothesis testing questions in interviews. The first kind is very basic. These questions may appear in any part of a technical phone screen or an on-site interview to evaluate whether a candidate has fundamental knowledge of hypothesis testing. Here are some examples: What are the differences between a z-test and a t-test? When do you use a z-test versus a t-test? Given the data, how do you calculate the t-statistic or z-statistic?

The second type of hypothesis testing question is typically asked together with A/B testing questions. A typical example: the interviewer gives you a test result and asks you to calculate whether the result is significant and how you would make a launch decision based on it. This type of question requires you to understand both A/B testing and hypothesis testing, as well as how to use hypothesis testing in practice. If you are not familiar with A/B testing or you need a refresher on the topic, I have a great video covering the commonly asked A/B testing questions and answers.

The last type of question uses a SQL query to calculate metrics and test statistics. For example: given a table containing user behavior data, write a SQL query to get the average number of likes in the control and treatment groups, then obtain the test statistic and tell whether it's significant or not. The query itself is not difficult to write, but you will need a clear understanding of the formula to
calculate the test statistic to solve the problem.

Now you understand the three types of interview questions on hypothesis testing. Let's dive into when to use each test and what the differences between them are. When I first learned these concepts, I felt it was pretty confusing and there seemed to be lots of things to remember. I then realized it's easier to understand and explain these concepts with a flow chart, so here it is. This chart summarizes when to use a particular test.

First of all, we want to know the metric we want to evaluate. If the metric we are interested in follows a Bernoulli distribution, we need to further check the sample size. For those of you who don't know, the Bernoulli distribution is the distribution of a random variable that takes the value 1 with probability p and the value 0 with probability 1 minus p. A practical example is click-through probability: the proportion of users who click a button on a web page is p, and the proportion who don't is 1 minus p. Similarly, click-through rate and conversion rate can also be considered as following a Bernoulli distribution. Another way to understand it is to ask whether what we want to test is a proportion, for example a percentage of users or pages. If we want to compare proportions of two groups, such as whether click-through probability changes when we change the color of a button, we go this route.

If the metric does not follow a Bernoulli distribution, for example when we want to find out how different two sample means are from each other, then the first thing we want to check is the sample size. The magic number here is usually 30: 30 or above is considered a large sample, and below that is considered a small sample. If it's a small sample, we need to make sure the population is normally distributed in order to use a z-test or a t-test. You may wonder: do we care whether the population is normally distributed for a large sample? We don't, because the central limit theorem tells us that the sample mean
follows an approximately normal distribution. We don't need to worry about whether or not the population is normally distributed for a large sample.

If we have a large sample, or a small sample from a normally distributed population, we also need to consider whether the population variance is known to us. If it is, we can use a z-test; otherwise we choose a t-test. That's why in reality the z-test is not used as commonly as the t-test: it requires the population variance to be known, and in lots of cases it isn't.

Now that it's clear when to select a particular test, I want to highlight two things to provide more clarity. The first is the difference between the Student's t-distribution and the normal distribution. We use the t-test when the test statistic follows a Student's t-distribution under the null hypothesis. How is that different from a z-distribution? Here's a diagram comparing the t-distribution and the z-distribution, or standard normal distribution. We observe that the t-distribution carries more uncertainty: it is more spread out and has thicker tails than the normal distribution. This makes sense because we do not know the standard deviation of the population. It also means that the t-distribution produces a wider confidence interval than the corresponding standard-normal-based interval, because when we don't know the standard deviation and have to estimate it, we are less certain about our estimate. Note that the shapes are different for n less than 30 and n close to 30.
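
The thicker tails can be checked numerically. The following is a minimal sketch, not something shown in the video: it uses only the Python standard library, and the helper names `t_pdf` and `t_two_sided_tail` are made up for illustration. It integrates the Student's t density with Simpson's rule to get the probability of landing beyond 1.96 in either tail, and compares that with the standard normal for a few degrees of freedom.

```python
import math
from statistics import NormalDist


def t_pdf(x: float, df: float) -> float:
    """Density of Student's t with `df` degrees of freedom (log-gamma avoids overflow)."""
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_norm - (df + 1) / 2 * math.log1p(x * x / df))


def t_two_sided_tail(x: float, df: float, steps: int = 2000) -> float:
    """P(|T| > x) for x > 0, using Simpson's rule on [0, x] (steps must be even)."""
    h = x / steps
    total = t_pdf(0.0, df) + t_pdf(x, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central_half = total * h / 3  # P(0 < T < x)
    return 1.0 - 2.0 * central_half


# Tail mass beyond +/-1.96 under the standard normal (roughly 5%).
normal_tail = 2 * (1 - NormalDist().cdf(1.96))

for df in (5, 30, 1000):
    print(f"df={df:>4}: P(|T| > 1.96) = {t_two_sided_tail(1.96, df):.4f}")
print(f"normal : P(|Z| > 1.96) = {normal_tail:.4f}")
```

The tail probability shrinks toward the normal value as the degrees of freedom grow, which is the same pattern the diagram shows.
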
This is related to another concept, degrees of freedom. It is the number of pieces of information that can vary freely without violating any given restrictions, that is, the number of independent pieces of information available to estimate another piece of information. For example, if we have n data points, there are n minus 1 independent values once we know the mean. We can see that as n increases, the t-distribution better approximates the normal distribution. In fact, for large sample sizes the t-test gives almost the same p-values and confidence intervals as a z-test.

The second thing I want to clarify is why we don't use t-tests for proportions. We mentioned the z-test and the binomial test for comparing proportions, and we didn't mention the t-test. Why? The reason is that the test statistic doesn't have a t-distribution; it does approximately have a z-distribution. Let me explain. In a typical t-test, the t-statistic takes the form d over s, where d is the difference between means and s is the estimated standard error of d. Because of the central limit theorem, when sample sizes are sufficiently large, a statistic like d, which is the difference between means, is asymptotically normally distributed, and the standardized version of d, d over sigma of d, is asymptotically standard normal. Another theorem, Slutsky's theorem, states that as long as the denominator s converges in probability to the unknown standard error sigma of d, d over s converges in distribution to a standard normal. The typical one-sample and two-sample proportion tests are of this form, so we have some justification for treating them as asymptotically normal, but we don't have any justification for treating them as following a t-distribution. In practice, as long as np and n(1 minus p) are not too small, specifically when both are larger than 10, the asymptotic normality of the proportion test kicks in rapidly. So theoretically we don't use t-tests to test proportions, and there's no good argument
that the t-distribution should be better than the z-distribution as an approximation to the distribution of the test statistic. But many people do use t-tests for testing proportions. Academically speaking they are wrong, but in practice the approximation obtained by using a t-test on Bernoulli data seems to be very good. Also, as we mentioned earlier, as the sample becomes larger, a t-test generates almost identical confidence intervals and p-values compared with a z-test.

So to summarize: using a z-test for testing proportions is theoretically correct, while using a t-test is not. But the results from a z-test and a t-test do not differ much, especially when the sample is large, so it's okay to use a t-test for proportions in reality.

This is part one of Cracking Hypothesis Testing Problems in Data Science Interviews. In part two I will dive into some practical examples and show you how to solve them step by step. Stay tuned if you're interested in learning how to apply hypothesis tests in reality. As always, guys, I appreciate you taking the time to watch this video. Let me know if you have any questions. I will see you in the next video.
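
The closing claim, that a z-test and a t-test give nearly the same answer for proportions on large samples, can be illustrated with a small sketch. Everything here is an assumption for illustration: the click counts are made up, the z-test uses the usual pooled standard error, the t-test is Welch's unequal-variance version applied to the raw 0/1 data, and the t p-value comes from a hand-rolled Simpson integration (hypothetical helper `t_two_sided_p`) so that only the Python standard library is needed.

```python
import math
from statistics import NormalDist


def t_two_sided_p(t: float, df: float, steps: int = 2000) -> float:
    """P(|T| > |t|) for Student's t, via Simpson's rule on [0, |t|]."""
    x = abs(t)
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))

    def pdf(u: float) -> float:
        return math.exp(log_norm - (df + 1) / 2 * math.log1p(u * u / df))

    h = x / steps
    total = pdf(0.0) + pdf(x)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * pdf(i * h)
    return 1.0 - 2.0 * (total * h / 3)


# Hypothetical A/B test result (made-up numbers): users and clicks per group.
n_c, clicks_c = 1000, 100   # control
n_t, clicks_t = 1000, 130   # treatment
prop_c, prop_t = clicks_c / n_c, clicks_t / n_t
d = prop_t - prop_c         # difference between the two proportions

# Two-proportion z-test with a pooled standard error.
p_pool = (clicks_c + clicks_t) / (n_c + n_t)
se_z = math.sqrt(p_pool * (1 - p_pool) * (1 / n_c + 1 / n_t))
z_stat = d / se_z
pval_z = 2 * (1 - NormalDist().cdf(abs(z_stat)))

# Welch t-test on the same data treated as 0/1 observations.
# The sample variance of 0/1 data is n / (n - 1) * p * (1 - p).
var_c = n_c / (n_c - 1) * prop_c * (1 - prop_c)
var_t = n_t / (n_t - 1) * prop_t * (1 - prop_t)
a, b = var_c / n_c, var_t / n_t
t_stat = d / math.sqrt(a + b)
df = (a + b) ** 2 / (a ** 2 / (n_c - 1) + b ** 2 / (n_t - 1))
pval_t = t_two_sided_p(t_stat, df)

print(f"z-test: statistic={z_stat:.4f}, p-value={pval_z:.4f}")
print(f"t-test: statistic={t_stat:.4f}, p-value={pval_t:.4f}, df={df:.0f}")
```

With about a thousand users per group, the two statistics and the two p-values agree to roughly three decimal places, so the launch decision would be the same either way. On much smaller samples the gap can be larger, which is where the theoretical distinction matters more.
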
Info
Channel: Data Interview Pro
Views: 21,102
Keywords: data science interview, hypothesis testing, t test, data science, z test, a/b testing, ab testing, data science interview questions, data interview, data interview pro, data scientist interview questions
Id: IY7y-t30UJc
Length: 9min 53sec (593 seconds)
Published: Wed Feb 17 2021