Google SQL Interview Question - Calculate Correlation Coefficient

Video Statistics and Information

Video
Captions Word Cloud
Reddit Comments
Captions
hey everyone this is dan from datantv.com x google and paypal data scientists in this video we're going to go through an sql table manipulation that takes a statistical concept um and actually applies it in this form of table manipulation this is the type of problem that you would often encounter if you are in a product data science interview um specific like the sql round where you might be asked a question like you know how do you calculate the variance or the mean or even correlation coefficient uh based on the tables that are provided to you um now i just want to sort of give you a heads up saying that you know as i go through this video itself um it is like a raw format and um at the same time i've been somewhat experiencing some cold the past few weeks or so definitely not covered just because i've already been tested twice and it came out to be negative but i'm gonna cough here and there as i'm sort of talking so sort of bear with me as we go through this exercise together really this is really designed to help you prepare for your upcoming interviews um and so hopefully you know if you like this type of content please give me a like and subscribe and comment below if you have any sort of questions or any sort of suggestions in terms of what other topics i should cover in the future all right so now let's get started with the actual problem itself um all right so given that the table below um uh compute the correlation coefficient of x1 and x2 columns so in this table you see that there are some values here x1 has some values and x2 has some values now we have to think about how we're going to actually solve this problem um so a really good response when it comes to sql is sort of like a sandwich method so um start by first of all just brainstorming out loud you know maybe take about a minute or two ask clarifying questions to the interviewer and kind of brainstorm you know what's your initial thought process about how you're going to solve this um and then spend a bulk of your time actually you know taking a crack at the problem and explain your thought process out loud um and that's sort of important just because you know you don't just want to be this robot that just sort of gets a problem and then you just solve it without actually engaging the interviewer um the more you engage the interviewer the more points you're sort of earning um especially if you know given that any of these technical exercises are not just about whether you can actually solve a problem or not but also the interviewers are sort of looking for you know if you were an actual colleague that you were working with together how would you actually um how what kind of collaboration style would you provide right uh so try to be engaging um you know to to an extent now the last part is after you have basically you know solve the problem uh from there just like kind of summarize you know how you would actually like how this um the script that you wrote would actually take the raw data and then manipulate it to get the output that you desire um and if you don't end up doing that sometimes the interviewer will actually ask you to do that you know can you take this raw table and kind of walk me through the procedure from the beginning to the end so those are the three step approach as a way to really provide a comprehensive sql solution all right so now let's get started itself and as i'm going through this once again this is somewhat of like a raw format i'm kind of thinking this out loud and um and this is really to demonstrate you know what it feels like to be in the shoes of a um candidate um you know as you're going through this problem right so so let's kind of think about you know um how um i would go on about solving this so first of all you have to kind of think about you know what is correlation coefficient right so you have two variables and you want to see to what extent they are correlated to each other um it's sort of like the analogy of beer and diaper sales of if you know when one is increasing when diaper cells are increasing um it's it's correlated with um you know beer cells as well um and vice versa so you have these two columns and um you know to what extent the x1 and x2 are correlated to each other and so we now have to think about you know what is the statistical formula for um calculating this this function and and the formula for this um as i recall is essentially um it's the covariance of x1 x2 divided by the the standard deviation of x1 x1 times the standard deviation of x2 all right so we have essentially established what is the um formula for the correlation coefficient now we have to kind of further break down you know how do we calculate the covariance and how do we calculate the standard deviation of a column so covariance is essentially um it is the um it is essentially the the average of the following which is going to be x1 minus x1 mu times zx2 minus x2 mu so this is the covariance formula whoops kobit um and then the standard deviation you know that's fairly straightforward so this is essentially the average of the um x1 minus and i'm going to put a parenthesis here um minus x1 mu and then you take the power of this by two and then this gives you the whoops so this gives you the variance formula and so the standard deviation formula is going to be the um the square root of this um this various formula so this is the um so these are all of the individual components of the correlate correlation coefficient that we have to first of all calculate um and then you know take that raw table manipulate it and then put them together and then eventually we end up with the final output which is the correlation coefficient um now it's really important to kind of lay this out from the get-go because if you don't really know what the formula is um then you're gonna have you're definitely gonna have a problem with this um you know with you know with this question so make sure you do kind of you know make sure you review the basic statistical concepts and also um you know if if it's not something that you readily know one thing you could potentially do is ask the interviewer to see whether you can google search it or perhaps get a hint in terms of what the formula is and sometimes the interviewer is going to allow you to do that um so make sure you you know just kind of pulse check with the interviewer and then see you know what kind of information you could get all right so now that we have the formula um so now let's think about how we're going to actually solve this problem step by step so uh so right off the bat i'm kind of thinking you know i should definitely use with clauses a series of weight clauses just because you know um i have all these various functions i need to be able to calculate separately and then put them together now the common denominator across all of these is that i need what i need a mean i need a mean in order to calculate the covariance i needed mean in order to calculate the variance and the standard deviation so what i'm going to do is i'm going to start off by calculating what the mean are for both the x1 and x2 all right so i'll start with with clause mean as so calculate the mean select x1 x2 average x1 over as mean x1 and um gonna do the same thing for the x2 column and this is from the table all right so why did i instead of just doing a group by average why did i calculate average over clause um so for those who are not too familiar with you know what the over clause does so over clause what this would do is essentially calculate the x1 and um it will take what is the average of the entire um column here um and let's just say arbitrarily you know let's just say that the entire the average of this entire column here is let's just say you know 5.0 okay so it's going to actually repeat the 5.0 across all of the rows for this given column and we want to preserve this instead of just collapsing into a single mean just because um because keep in mind that when we're calculating when we're calculating the variance we have to be able to take the individual individual value of the x1 and subtract it by the mean so it makes perfect sense to preserve this row by row so that in the next step we can go ahead and calculate the variance so now this is the next phase of the width clause and i'm going to go ahead and calculate the variance here so calculate the variance and i'll just select and i'll take the um average of this power x1 mean x1 2 as variance of x1 um just to kind of repeat you know what i'm getting right here so um so this x1 minus mean of x1 um you know that's somewhat self-explanatory right so basically this is the part that i'm trying to calculate here um and then this power of two this is basically you know this part where i take the difference of the two and then i square it um and then i need to be able to take the average of this um of you know of these uh these squared differences um as a way to calculate the variance of x1 and now i need to do the same thing for x2 as well so this is going to be x2 minus mean x2 and this is from the mean by the way yeah all right so now that we have the variance the next logical thing to do is standard deviation and we definitely need the standard deviation of x1 and x2 because that's what the correlation coefficient requires in the denominator so we're going to go ahead and calculate the standard deviation and i'm just going to go ahead and copy this and just call it std def so calculate the cd and um so this is fairly straightforward so we already know what the variance is and all we need to do is just apply this power function um and then in the second argument just replace this with 0.5 replace this variance x1 and that's just going to square root the current variance and i'm going to call this stdx1 and then just replace the x1 with x2 to get the standard deviation for the x2 value and this is going to be called this from uh variance yep all right so we're not too far away from actually completing this so now we have the denominator now we need to get the numerator we need to get the covariance all right so how do we get the covariance um well it's it's very straightforward i mean all we need to do is just covariance as so we're going to go ahead and calculate the co variance here so select the um in order to calculate the covariance what we need to do is just take the average of uh x one minus the um mean x1 times the x2 minus mean x 2 and then we'll call this the covariance of x1 x2 and this is going to come from this mean table mean and now we have the standard deviations and we had the covariance now we can put these two pieces together and calculate the correlation coefficient all right so final step is calculate the correlation coefficient and so this is going to be select um we already have the covariance here so covariance of x1 x2 divided by the um the multiplication of the standard deviation of x1 times the standard deviation of x2 um and apologize for the lighting issue um i get i think my um light box battery just you know ran out that's why it's sort of dark all of a sudden um but i'm just gonna go ahead and proceed with the video anyways all right so uh so you go ahead and calculate this and this is going to be the correlation of x1 x2 um and um and where do i and i need to be able to get the covariance from um from this covariance stereo table uh so i'll call it the covariance and i need to get the standard deviations from the standard deviation um temporary table so i'll call this stdev so i'm using i'm essentially using um cross join as a way to get the covariance from this covariance uh temporary table and standard deviations from this standard deviation table so all in all this is going to calculate the correlation coefficient of the x1 and x2 and so if i had to just basically repeat the process of this you know first of all i'll just explain that you know this these are the functions that you need to have in order to calculate the correlation coefficient then you're going to individually try to calculate this um using a series of with clauses so i'll basically explain you know how i would go on about you know uh get getting the mean by demonstrating what the values look like in this table in the problem set um i'll talk about how to calculate the variance and how the variance can be used to calculate the standard deviation and then eventually the covariance and so when you put together the covariance divided by the standard deviations that's going to give you the correlation coefficient so i i hope you um you know like this type of style of video um you know so basically this is the type of problem that you might often encounter um whenever you're you have these product data science interviews um specifically in the sql round they're not just simply you know take these two tables and do jump some joins your your sometimes you're going to be asked these type of problems where they ask you um a problem based on like a statistical concept you know for instance like can you take um a table and then try excuse me try to calculate the cross-validation of something or calculate the variance or correlation coefficient so these are additional type of exercises you could definitely do um and i just want to let you know that this is also like a preview of the future course that's going to be launched on data tv.com i intend on releasing product sql um problem set along with lessons that can be really helpful for you um as you're doing the preparation all right and truly sorry for the coughs uh it's just it's just been non-stop you know for me but but hope you like this video uh definitely give me a like subscribe and comment below if you have any other questions and i'll see you in the next video very soon all right thank you bye
Info
Channel: DataInterview
Views: 475
Rating: 5 out of 5
Keywords:
Id: yliRaLGzBfI
Channel Id: undefined
Length: 18min 34sec (1114 seconds)
Published: Thu Oct 07 2021
Related Videos
Note
Please note that this website is currently a work in progress! Lots of interesting data and statistics to come.