Facebook Statistics Interview Question | Google Data Scientist | DataInterview

Captions

Hey guys, it's Dan, the founder of datainterview.com and an ex-Google and PayPal data scientist. In this video, I'd like to cover a question that was asked in Facebook's data scientist interview. The question is the following: if you sample 10,000 users multiple times, what would the distribution of false positives look like? Briefly pause the video and think about the following choices. Could it be exponential, normal, binomial, or none of the above? Think about your response and drop a comment below.

Throughout the video, I'd like to present three things that you'd be very interested in hearing. The first is a preview of what this solution looks like in the monthly subscription course on datainterview.com. Second, I'll provide an intuitive explanation of what the solution is. Third, I'll provide a Colab, with the link in the description below, that gives you a hands-on simulation of what this distribution of false positives looks like. If you like this type of content, make sure you hit the like button, subscribe, and check out datainterview.com, which contains courses and coaching services that are going to be really helpful for your data scientist interview prep.

Now let me show you a preview of the monthly subscription course. It contains multiple courses, including A/B testing, product, SQL, Case in Point, and so forth. The Case in Point course contains case questions covering statistics, machine learning, and product cases, and each case is broken down into four sections: problem statement, hints, solution, and assessment. Let's take a look at what this interview question looks like in the case course. The case lesson starts with a problem statement, along with information about which company asked the question and additional details that are going to be helpful when you're preparing. It contains information about the duration and the level of
difficulty. The next section is tips and hints, and you'll see that it's broken down into a practice tip and a solution hint. The practice tip provides insight into how to practice this interview question effectively. Keep in mind that a case question in an interview is not a written exercise but a verbal one, so I provide some tips on how to practice this question orally. Along with the practice tip, I also cover solution hints, so if you get stuck while trying to answer this question, you can take a peek and use some of the hints to converge toward the right solution.

Now let's go to the solution section. The solution contains a dialogue between the candidate and the interviewer. It is done in this form because in actual case interviews you don't just provide a single paragraph of response; there's an interaction between the interviewer and the candidate, so I emulate this style in the solution itself. You'll see that there's a back-and-forth exchange between the candidate and the interviewer, and I also provide commentary that helps you understand what the interviewer is thinking as the candidate responds. There are follow-up questions embedded in the solution, which is a chance for you to get engaged and practice these additional questions. This solution style is designed to give you an experience of what an actual case interview looks like.

Finally, let's jump to the assessment section. The assessment covers two elements: a grading rubric, basically a grading scale with information about how the ratings are assigned, and the actual evaluation itself. You'll see that the candidate is evaluated on statistics and communication, and the evaluation contains the rating score along with the justification for why the candidate received that score. This can be really helpful for two reasons. First
of all, this is a way for you to self-assess. Secondly, this gives you inside information on how the interviewer will evaluate the candidate, so that you're better prepared for your upcoming data scientist interviews.

Now let's go through the solution to this question: if you sample 10,000 users multiple times, what would the distribution of false positives look like? To solve this problem, we have to think about it in terms of steps. The first step is to clarify the problem. Right off the bat, you can see that there's a bit of ambiguity in this problem statement, so we have to break down what it's really asking. The second step is to identify what the distribution of a single sample with false positives looks like. Once you understand what a single sample of 10,000 users with false positives looks like, this is going to help you understand the entire question, which asks what the distribution of multiple samples with false positives looks like.

Okay, so let's go through one step at a time. The first thing to do is understand the meaning of false positives. You have to think about false positives in the context of statistical testing: you commit a false positive if you reject the null hypothesis when it is actually true. In statistical testing, there's something called the false positive rate, sometimes called the type I error rate, and it is equal to your alpha, or significance level. So suppose that you set your alpha at five percent, or 0.05; then your false positive rate, or type I error rate, is going to be the exact same value. Understanding the meaning of false positives is going to be helpful, as you will see throughout the video.

The next question is: what is the population distribution? The question doesn't provide that. It just says, "if you sample 10,000 users multiple times," but where are you
drawing these 10,000 users from? What does this distribution look like? Is it uniform, exponential, normal, or some other distribution? So the next thing to do is try to understand what this distribution looks like, and you may want to ask this in the form of a clarifying question to the interviewer. One last thing is: what does the distribution of a single sample of false positives look like? As I mentioned in the beginning, this is step two of solving this problem. Once you understand what a single sample of 10,000 users with false positives looks like, this is a stepping stone to understanding what multiple samples look like.

Now let's dive into step two of the solution, which is understanding what a single sample looks like. To solve this problem, we're going to make an underlying assumption. The population distribution can be anything, but for the sake of simplicity we're going to assume that it's a standard normal distribution, with the mean centered at zero and the variance being one. As I mentioned earlier, the false positive rate is equal to alpha, or the significance level, and for the sake of simplicity we're going to set it at a typical value of alpha, which is 0.05. This 0.05 is essentially the region under the curve to the right of this red line, which is the location of the alpha threshold. Based on the standard normal distribution, the location of this alpha threshold is about 1.64. I just want to let you know that it really doesn't matter what the alpha value is. It can be 0.10, it can be 0.20, but for this exercise we're just going to assume that it is 0.05. Now that you've established where that threshold lies, what this basically means is that any observation to the right of this red line is going to be deemed a false positive.

Now that we have this assumption about what the population distribution can look like, let's go back to the
problem statement: if you sample 10,000 users multiple times. We're going to put aside the meaning of "multiple times" for a second and just try to understand a single sample of 10,000 users. If you draw one observation from this normal distribution, you're going to get some value. You'll see that the first observation has a value of 1.88, which is just an arbitrary value that you might sample. Because 1.88 is greater than the alpha threshold, this value is going to be a false positive, so it is labeled as one. What happens if you draw another data point? This time the value could be 1.0, and because this value is less than the alpha threshold, it is labeled as zero for false positive. Now let's draw another observation. Once again, you'll see that this value is less than the alpha threshold, so it is labeled as zero for false positive.

If you keep repeating this process until you have 10,000 observations, or 10,000 users, what you ultimately end up with is a distribution. This single sample of 10,000 users is turned into counts: the count of false positives and the count of true negatives. And if you think about what this count distribution is, it's basically a binomial distribution. I have a hunch that most of your initial answers would be that it is a binomial distribution. But if you read the question carefully, it asks: if you sample 10,000 users multiple times, what would the distribution of false positives look like? So let's go back to this binomial distribution. We know that it can be expressed in a different way, as a proportion. The proportion is the number of false positives divided by your sample size, which is 10,000.
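The single-sample step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code from the video's Colab: the names `ALPHA`, `N_USERS`, `THRESHOLD`, and `label` are hypothetical, the population is assumed to be standard normal, and the cutoff is one-sided with 5% of the area to its right.

```python
import math
import random
from statistics import NormalDist

ALPHA = 0.05        # false positive rate assumed in the walkthrough
N_USERS = 10_000    # users in a single sample

# One-sided cutoff with ALPHA of the standard normal's area to its right.
THRESHOLD = NormalDist().inv_cdf(1 - ALPHA)   # about 1.645


def label(value, threshold=THRESHOLD):
    """Return 1 if the observation counts as a false positive, else 0."""
    return 1 if value > threshold else 0


# The two worked values from the walkthrough.
print(label(1.88), label(1.0))   # 1 0

rng = random.Random(0)  # fixed seed so the run is repeatable
labels = [label(rng.gauss(0.0, 1.0)) for _ in range(N_USERS)]
false_positives = sum(labels)
proportion = false_positives / N_USERS

# If every user is an independent draw, the count is Binomial(N_USERS, ALPHA).
expected_count = N_USERS * ALPHA
count_sd = math.sqrt(N_USERS * ALPHA * (1 - ALPHA))
print(false_positives, proportion, expected_count, round(count_sd, 1))
```

The count in one sample should land near 500, give or take a few multiples of the binomial standard deviation of roughly 22.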
So what happens if you keep repeating this procedure multiple times, meaning you collect samples multiple times, as the question states? For each sample, you get one proportion after another, and if you keep repeating this procedure, the distribution of sample proportions is going to start to look like a normal distribution. Why? Because of the central limit theorem. If you recall from Statistics 101, this is something you probably stumbled upon in one of the first weeks of the course. The central limit theorem states that the distribution of sample means approximates a normal distribution as the sample size gets larger, regardless of the population's distribution. A sample proportion is just the mean of the zero-or-one false positive labels, so the theorem applies to it. In this case we started with the population distribution being normal, but given that the CLT doesn't depend on what the initial population distribution is, it could have been exponential, bimodal, or whatever distribution. The bottom line is that if you take samples of 10,000 users with false positives, express each one as a proportion, and keep repeating the sampling, the distribution of sample proportions is ultimately going to approximate a normal distribution.

Now, if you got to this part, that is awesome. But we don't want to stop here; we want to provide a perfect solution. So what other information could you convey? Well, we can provide the mean and variance of this sample proportion, using the following formulas: the mean is equal to p, and the variance is equal to p times the complement of p, which is 1 minus p, divided by n. Here p represents the probability of success, which in this case is the probability of a false positive, and n represents the sample size. And we already
know what the probability of false positives and the sample size are. Given that we have set the alpha, which is equal to the probability of false positives, we know that the value of p is 0.05. For n, we know that the sample size is 10,000; there are 10,000 observations in total per sample. If you plug the values into these formulas, for the mean you get a value of 0.05, and for the variance you get a value of 0.00000475.

Now, I'm a huge proponent of the idea that whenever you're practicing by yourself, it is very important to self-assess the quality of your response. So I have provided a rubric that you can use to assess yourself. It's got four rating scores: inadequate, borderline, good, and excellent. If you provided no response because you're not sure, obviously it is going to be deemed inadequate. If you mentioned that it is binomial, I would say it's borderline. But it doesn't stop there; you have to think about what the next step is, and the next step is to bring up the central limit theorem, along with the fact that the distribution is ultimately going to be normal. Lastly, the excellent response contains all of the required elements: binomial, CLT, normality, sample mean, and sample variance. So self-assess and drop a comment below with what your rating was.

Now, I have a hunch that there might be some of you out there who are still not convinced that you ultimately get a normal distribution no matter which population distribution you start from. So I decided to create a Colab for you, where you can run through a simulation and see that the solution I presented is correct. This Colab has the following setup. First of all, you set the parameters for the simulation. You'll see that there are four
parameters. The first is the population distribution: do you want it to be normal, exponential, or uniform? Once you have a copy of this Colab, you're free to change the code itself, and maybe you want to test out a different distribution, bimodal or whatever. By default, I've already set the false positive rate to 0.05, the sample size to 10,000, and the number of samples collected to 1,000.

Step one is to generate the population data. When you run the cell, you'll get a plot of what this population distribution looks like, along with the alpha threshold. Step two shows what the distribution looks like if you take a single sample with false positives from this population. It provides the sample size, which is 10,000, the proportion of false positives, which is 0.0495, and the number of false positives found in this case, which is 495.
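The notebook itself is not reproduced in this transcript, so the following is only a guess at how such a parameterized setup could look. Every name here (`POPULATION`, `make_population`, `one_sample`, and so on) is hypothetical, and the exponential rate of 1 and the uniform range of 0 to 1 are assumptions. Exact counts will differ from the 495 quoted above because the random draws differ.

```python
import math
import random
from statistics import NormalDist

# Hypothetical parameter names; the defaults mirror the values quoted in the video.
POPULATION = "normal"        # "normal", "exponential", or "uniform"
FALSE_POSITIVE_RATE = 0.05
SAMPLE_SIZE = 10_000
NUM_SAMPLES = 1_000          # listed for completeness; not used in this block


def make_population(name, alpha, rng):
    """Return (draw, threshold): a sampler and its upper-alpha cutoff."""
    if name == "normal":         # standard normal
        return (lambda: rng.gauss(0.0, 1.0)), NormalDist().inv_cdf(1 - alpha)
    if name == "exponential":    # rate 1 (an assumption)
        return (lambda: rng.expovariate(1.0)), -math.log(alpha)
    if name == "uniform":        # on [0, 1) (an assumption)
        return rng.random, 1 - alpha
    raise ValueError(f"unknown population: {name}")


def one_sample(name=POPULATION, alpha=FALSE_POSITIVE_RATE, n=SAMPLE_SIZE, seed=1):
    """Draw one sample of n users and count values above the cutoff."""
    rng = random.Random(seed)
    draw, threshold = make_population(name, alpha, rng)
    count = sum(draw() > threshold for _ in range(n))
    return threshold, count, count / n


for name in ("normal", "exponential", "uniform"):
    threshold, count, proportion = one_sample(name)
    print(f"{name:12s} threshold={threshold:.3f} count={count} proportion={proportion:.4f}")
```

Whatever the population, the cutoff is placed at its upper 5% quantile, so the proportion above it should come out near 0.05 in each case.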
The last step demonstrates what the distribution looks like if you sample 10,000 users with false positives multiple times. Because of the central limit theorem, it ultimately becomes a normal distribution with the following statistics: the mean is approximately 0.05, and the sample variance is 4.78 times 10 to the power of negative six, which is very close to the variance that was calculated earlier.

Now, just for the heck of it, let's see what it looks like if we repeat this procedure, but this time around, instead of using the normal distribution as the population, let's use the exponential and see what the result looks like. I'm going to swap the population distribution to exponential and run all of the cells. When I run the cells, I get the following outputs. The population distribution is exponential, and the alpha is still 0.05, because that's the initial parameter we set, but this time around the alpha threshold is located at 2.99. If you take a single sample of 10,000 users, you'll see that the proportion of false positives is 0.0491, very similar to the alpha value that we initially set, and the number of false positives is 491.
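The repeated-sampling step with an exponential population can be sketched as follows. This is a hedged illustration, not the video's notebook: the names are hypothetical, the exponential rate is assumed to be 1 (which puts the 5% cutoff at about 2.996), and `NUM_SAMPLES` is cut from the video's 1,000 to 200 so that pure Python finishes quickly.

```python
import math
import random
from statistics import fmean, variance

ALPHA = 0.05
SAMPLE_SIZE = 10_000
NUM_SAMPLES = 200   # the video uses 1,000; reduced here to keep pure Python quick

# Upper-ALPHA cutoff of an exponential population with rate 1 (an assumption).
THRESHOLD = -math.log(ALPHA)


def sample_proportion(rng, n=SAMPLE_SIZE, threshold=THRESHOLD):
    """Share of n exponential draws that land above the cutoff."""
    return sum(rng.expovariate(1.0) > threshold for _ in range(n)) / n


rng = random.Random(42)  # fixed seed so the run is repeatable
proportions = [sample_proportion(rng) for _ in range(NUM_SAMPLES)]

empirical_mean = fmean(proportions)
empirical_var = variance(proportions)                  # sample variance across samples
theoretical_var = ALPHA * (1 - ALPHA) / SAMPLE_SIZE    # p(1 - p) / n

# Rough normality check: about 68% of values should sit within one standard deviation.
sd = math.sqrt(theoretical_var)
within_one_sd = sum(abs(p - ALPHA) <= sd for p in proportions) / NUM_SAMPLES

print(f"mean={empirical_mean:.4f} var={empirical_var:.3e} theory={theoretical_var:.3e}")
print(f"share within 1 sd: {within_one_sd:.2f}")
```

With only 200 samples the empirical variance is noisier than in the video, but it should still sit in the neighborhood of p(1 - p)/n, and a histogram of `proportions` should look roughly bell-shaped.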
We started with the initial distribution being exponential, but if we repeat the sampling multiple times, what does the final distribution look like? Not surprisingly, because of the central limit theorem, what you get is once again a normal distribution, with the sample mean and the sample variance very similar to the ones we saw previously in the video.

So there you have it, guys. This is how you answer this Facebook interview question, which is very tricky. But with some clarification, and by thinking through each of the crucial steps, understanding what a single sample looks like and then translating that into what multiple samples look like, you will be able to answer this question correctly. If you like this type of content, definitely check out datainterview.com. There are courses and coaching services that are going to be helpful for your interview prep. I look forward to providing more insights like this in upcoming YouTube videos.

Info

Channel: DataInterview

Views: 2,804

Id: O-2l2Dy4XJM

Length: 17min 51sec (1071 seconds)

Published: Thu Oct 21 2021