5 Statistics Concepts in Data Science Interviews | Power, Errors, Confidence Interval, P value

Captions
Hi guys, welcome back to my channel. In this video I want to dive into five statistical concepts that are commonly asked about in data science interviews: power of a statistical test, type 1 error, type 2 error, confidence interval, and p-value. Sometimes the interviewer will ask you to explain these concepts to a non-technical audience, and that requires you to not only have a good understanding of all these terms but also deliver them in a very intuitive way. If you ever find it difficult to answer this kind of question, this video is definitely for you. By the end of this video you will learn how to showcase your knowledge of these five concepts to both technical and non-technical audiences. The methods I'm going to teach you apply not only to these five concepts but to other concepts as well. So if you're ready to dive in with me, keep watching.

To start off, I'd like to share a few steps to follow when explaining technical terms to a technical person such as a data scientist. You may think this is pretty trivial: if the audience is technical, then the person is expected to understand everything you say, right? But the fact is, if your answer is disorganized or obscure, it's very hard for people, even technical people, to follow. So here are the steps I recommend. We can start by talking about where or when a term is used. Then we provide a definition of that term. Even when we are explaining it to a technical person, this should be easy to understand; it should not be obscure like what you see on Wikipedia. Afterwards, if the concept can be represented by numbers, we can explain what changes in its value mean: basically, what does a larger or smaller value mean? The final step is optional: we could talk about the application of the term in practice, such as why the concept is widely used and why it is important in data science.

Sometimes you are asked to explain technical concepts in layman's terms or to a non-technical audience. It requires you to explain
things in a very intuitive and understandable way. In such cases, using examples is a good way to explain a term. I will show you later what examples to provide for each concept. Also, it's important to avoid introducing more technical terms. For example, when explaining the power of a test, you don't want to introduce hypothesis testing, null hypothesis, or alternative hypothesis. This will confuse the audience even more.

Now that you've learned the theory, let's put it into practice. For the rest of the video I will go through the five statistical concepts (power, type 1 error, type 2 error, confidence interval, and p-value) to show you how to explain them to both technical and non-technical audiences.

The first concept is power. First, let's explain it to a technical person. I will follow the steps I shared earlier. Statistical power is used in a binary hypothesis test. It is the probability that a test correctly rejects the null hypothesis when the alternative hypothesis is true. To put it another way, statistical power is the likelihood that a test will detect an effect when an effect is present. The higher the statistical power, the better the test is. It is commonly used in experiment design to calculate the minimum sample size required so that one can reasonably detect an effect.

The next term is type 1 error. Type 1 error, also known as a false positive, is used to categorize errors in a binary hypothesis test. It occurs when we mistakenly reject a true null hypothesis. It means that we conclude our findings are significant when in fact they have occurred by chance. The larger the type 1 error rate, the less reliable a test is, meaning that we want to minimize the type 1 error of a test. Type 1 error commonly comes up in A/B testing, where it describes observing a difference between two groups when in reality there is no difference.

The third one is type 2 error. Type 2 error, also known as a false negative, is used to categorize errors in a binary hypothesis test. Type 2 error refers to
a false negative. It occurs when we fail to reject a null hypothesis that is in fact false. Basically, we conclude there is not a significant effect when there really is one. The larger the type 2 error rate, the less reliable the test results, meaning we want to minimize the type 2 error of the test. It commonly comes up in A/B testing, where it describes not observing a difference between two groups when in reality there is a difference.

We have just explained three concepts in a technical way. Now let's see how we can explain them to a non-technical audience. For example, suppose a person wants to test whether he is infected by the coronavirus or not. There are three scenarios we care about. The first scenario is that the person is indeed infected by the virus and the test result shows the same. That is the power of a test: basically, it is the chance that the test result tells us a person is infected when he truly is. The second scenario is that the person is not infected but the test result shows he is. That is a type 1 error. This can be really bad because the person may take some medical treatment that is completely unnecessary. The third scenario is that a person is indeed infected but the test result tells us he's not. This is a type 2 error. It is also very bad because the person may miss the best time to get treatment that he really needs.

The next concept is confidence interval. Let's explain it to a technical person first. Again, I will follow the steps that I mentioned earlier. A confidence interval is used when we want to get an idea of how variable the sample results might be. The confidence interval is for the true value, but we never know what the true value is, and the purpose of having samples and observations is to estimate the true value. The confidence interval is a range of numbers. It tells us how often it would contain the true value, and the probability of it covering the true value is the confidence level. A commonly used value is 95 percent. The wider the interval, the more
uncertain we are about the sample result. So the more confident we want to be and the less data we have, the wider we make the confidence interval to be confident enough of capturing the true value. In short, the higher the level of confidence, the wider the interval, and the smaller the sample, the wider the interval.

Okay, that's how we can explain confidence interval during an interview. I want to highlight a common misconception: the belief that the confidence interval answers the question, "What is the probability that the true value lies within a certain range?" Well, this is not what the confidence interval is answering, because the misconception assumes the true value is a variable and the confidence interval is deterministic. The correct understanding is just the opposite. The true value is determined by nature but is unknown to us. It will not change at all. The things that can change are the boundaries of the confidence interval, which are estimated from the samples and the level of confidence we set. Basically, for a specific confidence interval, the true value is either 100 percent within it or not. The 95 percent refers to how often 95 percent confidence intervals computed from many samples would cover the true value.

Now let's try to explain confidence interval to a non-technical person. Confidence interval measures the level of uncertainty when we try to estimate a value. For example, we want to know the average height of men in the U.S. We can randomly select some men and measure their heights, and let's say we get a 95 percent confidence interval of 168 to 185 centimeters. The confidence interval we have means that it is likely to cover the true average height of all men in the U.S. But how likely? If we repeat the process over and over again, we expect the confidence intervals we construct to cover the true value 95 percent of the time.

The next term is p-value. Similarly, let's explain it to a technical audience first. P-value is commonly used in hypothesis testing to
connect the dots between observation and conclusion. It is a conditional probability: it measures the probability of getting test results at least as extreme as the observed results, given that the null hypothesis is true. A low p-value indicates less support for the null hypothesis. In practice we often choose 0.05 as a cutoff value. A p-value less than 0.05 denotes strong evidence against the null hypothesis, which means the null hypothesis can be rejected, and a p-value larger than 0.05 denotes weak evidence against the null hypothesis, which means the null hypothesis cannot be rejected. It is commonly used in A/B testing, when we have a treatment group and a control group and we want to test whether a metric is different in those two groups. Suppose we have done the experiment and obtained the measurements from the two groups. The smaller the p-value, the more we are convinced there is a difference between the two.

I have just shared with you how to describe p-value during an interview. I want to point out one common mistake people make when interpreting p-value. Very often we have observations and we would like to prove there is a difference between two groups. The mistake people make is to define p-value as, given the observation, the probability that there is at least such a difference between the two groups. In other words, the belief is that the p-value captures the probability that the null hypothesis is true given the data observed. It may sound reasonable at first, but it's almost the opposite of the true meaning of p-value, which is: given that the null hypothesis is true, the probability of obtaining differences at least as large as those in the data we observed.

Now that you understand why the misconception about p-value is wrong, let's try to explain p-value in layman's terms. We could reuse the example from when we explained confidence interval, that is, we want to get the average height of men in the U.S. We randomly select 30 people and measure their heights. But now the question is, we want to know if the average
value is the same as a fixed value, say 175 centimeters. The p-value connects the dots between what data we observe and what conclusion we could draw. It tells us, assuming the true value (i.e., the average height) is 175 centimeters, how likely we are to observe the data. A very small p-value, let's say less than 0.05, means that, assuming the true average height is 175, the chance that we observe the data is very low, or the data we observe is very extreme. So we believe the true value should not be 175 centimeters.

So that's how we can explain p-value to a non-technical person. Note that we did not introduce any technical terminology and we used a very simple example to explain it. During interviews it can be hard to come up with good examples quickly, so I recommend you prepare some examples for the commonly asked concepts. If you're interested in learning more about how to answer real questions in data science interviews, stay tuned for more videos to come. As always, I appreciate you watching this video. Let me know if you have any questions or feedback. I will see you in the next video.
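
As a rough illustration of the five concepts in the captions (this is not shown in the video), the sketch below repeats the height example many times and counts how often a 95 percent confidence interval covers the true mean and how often the test rejects "the average is 175 cm". The sample size of 30 and the 175 cm value come from the example above; everything else is an assumption made for illustration: heights are taken to be normally distributed with a known standard deviation of 7 cm, 178 cm is a hypothetical alternative true mean, and the function names are made up. Because the standard deviation is assumed known, the sketch uses a z-interval and a z-test rather than a t-test.

```python
import random
from statistics import NormalDist, fmean

STD_NORMAL = NormalDist()  # standard normal, used for critical values and p-values


def z_interval(sample, sigma, level=0.95):
    """Confidence interval for the mean, assuming the population sd is known."""
    half_width = STD_NORMAL.inv_cdf(0.5 + level / 2) * sigma / len(sample) ** 0.5
    center = fmean(sample)
    return center - half_width, center + half_width


def z_test_p_value(sample, mu0, sigma):
    """Two-sided p-value for H0: true mean == mu0, assuming the sd is known."""
    z = (fmean(sample) - mu0) / (sigma / len(sample) ** 0.5)
    return 2 * (1 - STD_NORMAL.cdf(abs(z)))


def simulate(true_mean, mu0=175.0, sigma=7.0, n=30, trials=5000, alpha=0.05, seed=0):
    """Repeat the experiment and return (CI coverage rate, rejection rate)."""
    rng = random.Random(seed)
    covered = rejected = 0
    for _ in range(trials):
        sample = [rng.gauss(true_mean, sigma) for _ in range(n)]
        low, high = z_interval(sample, sigma, level=1 - alpha)
        covered += low <= true_mean <= high          # did the interval cover the truth?
        rejected += z_test_p_value(sample, mu0, sigma) < alpha
    return covered / trials, rejected / trials


# Null hypothesis is true (true mean really is 175): rejections are type 1 errors.
coverage, type1_rate = simulate(true_mean=175.0)

# Null hypothesis is false (hypothetical true mean of 178): rejections are correct.
_, power = simulate(true_mean=178.0)
type2_rate = 1 - power  # failing to reject a false null

print(f"CI coverage:       {coverage:.3f}")
print(f"Type 1 error rate: {type1_rate:.3f}")
print(f"Power:             {power:.3f}")
print(f"Type 2 error rate: {type2_rate:.3f}")
```

Under these assumptions the coverage should land close to 0.95 and the type 1 error rate close to 0.05, since both are fixed by the choice of alpha. The power depends on the assumed effect size, standard deviation, and sample size; for the values used here a normal-theory calculation gives roughly 0.65, so the type 2 error rate is roughly 0.35. Increasing `n` in `simulate` is one way to see power rise, which is the link to sample size calculation mentioned in the captions.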
Info
Channel: Data Interview Pro
Views: 21,991
Keywords: data science interview, data science interview questions, statistics interview, statistics interview questions, stats interview, stats interview questions, interview questions and answers, data interview, data interview pro
Id: Allap_hrjyo
Length: 13min 10sec (790 seconds)
Published: Wed Feb 03 2021