Three Tricky Analytics Interview Questions with Andrew

Video Statistics and Information

Video
Captions Word Cloud
Reddit Comments
Captions
[Music] hi everyone this is jay from interview query and today i'm here with andrew from data leap tech another youtube channel and also a coach on interview query and today we're going to go over a mock interview on sql and so before we get started i'd love for you andrew to talk about kind of your background and a quick summary about how you got into data science absolutely uh first things first thank you jay he is one of a kind i feel like there is not many ceos out there helping this direct niche of people try to find data science interviews and master them at this level uh you've built out a very unique product i feel like um i feel like i'm talking to someone i've looked up to for a long time so awesome that i'm here thank you for having me yes the flattery is uh much appreciated andrea it's all deserved okay so my background in data science is pretty diverse very non-linear background starting in natural language startups and then going into data engineering for a hedge fund uh moving around to silicon valley starting my own businesses and ultimately uh ending up in fintech and crypto where i currently reside and i feel like there is a lot of academic research that can be done in this field especially where natural language and fintech intersect so that's the kind of data science expertise that i try to provide a little context a little flavor to my mentees but the kind of mentor that i am is one that tries to paint a picture with your candidacy every single mentee has a trajectory that needs to be told right and nobody is exactly the same and that's what a lot of people don't realize about the interview process is that when you go to a recruiter or a hiring manager they're actually trying to sell you they're trying to sell the package that you are to the person further up the ladder so if you can have a good understanding of where you're coming from where your strengths are and where you can continue to challenge yourself that's the kind of mentorship that we would undergo if we were to do a mentorship or a coaching session awesome cool and i think people are really excited to try to jump on that as well as we've seen a lot more interest in just the different kind of learning styles within data science and getting people up to speed based on where they're at and just trying to figure out where they are and you know if they can pass the interview and get that job as well cool so now i'm going to move on to coderpad as we go over the sql question okay so today we're going over the sql question that's pretty standard and it requires two tables that are pretty common employees and projects right so the employees table holds all of the employees and then the projects table holds all the projects that the employees could work on right and so the first question that we have is take the five lowest paid employees who have done at least 10 projects so andrew how would you tackle this problem okay i would say so the first things that before i do any code is i would have to first ask about the nature of the tables and then maybe write a little bit of outlining in pseudocode before we jump in okay so the first thing i want to ask is would i be correct to assume that the employees table is a bit of a directory so like a rolodex that you would find on your contacts app or a collection of every single employee that exists at this company yes okay and then my the nature of the second table seems like that it is more of a ledger so something that uh only gets updated when a new project is in the pipeline um and any project that's never existed or isn't worthy of being in this table wouldn't even exist exactly yep and then i have to think about how these two tables are related so it seems like the employee id in table two is the foreign key into table one is that correct yes okay in that case i have a good understanding of the nature of these two tables and then i can move forward communication i think is a big part of these kind of interviews so definitely the the take the five lowest paid employees who have done at least 10 projects well the first thing i would imagine is i would need a filter for employees who have done at least 10 projects and then i always indicate to the interviewer that i have an understanding of how to filter this so i would put in parentheses this seems like an aggregate filter so we would need having and then i would need to take a ordered list of employees based on their salaries so i would then order by salaries this of course i have to do order by and then three i would consider limiting this to the only the five lowest paid employees which means that i need to then limit and then i would limit to the five lowest then i would go back to interviewer either i talked out loud during this entire process or i didn't and hopefully i did it really quick and i said interviewer does this look like the the three steps that i should take to solve this question yep okay so then i would start coding so let's get rid of this so we don't accidentally run it the the way that i would start i always start with the select statement and then i know a lot of interviewees get bundled into getting the wrong select statement for some reason we have to verify what we mean by lowest paid employees so we're selecting employee id and that's the only column we want to select is that right yes it looks like there isn't anything else that's identifying for the employees so yeah probably the employee id perfect so then i would expect that we're going to alias employees as e so then eid is all we're going to select where we're going to select this from well we're going to select it from the employees table whereas that's where e is so you can put as e or you can just put e this is shorthand for an alias and then we would want to join this now what kind of join am i thinking of well uh like we established the nature of these two tables is that this first table is a ledger so any employees that have done 0 projects wouldn't even exist in this table but we need people who have a lower bound of 10 projects so we don't need to do a left join let's just do an inner join here people just go ahead and go like their projects right without really understanding what the comma it stands there for make sure in most interviews you want to be able to indicate exactly what kind of join you're using okay so let's enter join projects alias that is p on e dot id equals p dot employee id okay that's pretty simple and then we're going to do our very first step filter for employees who have done at least 10 projects all right that means that we need having but before having we need group by so we group by what are we grouping by let's run group by the employee id since we're looking at how many projects each employee has done and then we are going to have to do having and this having is going to look at the definition of done and i would go to interviewer and i say does done mean that the end date needs to exist and is in the past uh let's say that the end date does have to exist does it have to be in the past so let's say that it could be in the future because we're projecting the end date of a project excellent in that case i don't have to do a more complicated case when i can just say having a count of end date greater than or equal to 10. that way we are counting each end date and if it's a null end date then we won't count that so we're counting only end dates that exist with for each employee that have ended greater than 10. and then we are going to order by let's order by their salary this is p dot end date e dot salary now descending and or ascending now ascending means that the lowest salaries go first so that's what we want ascending is also of course the uh default but you know adding it doesn't hurt and then we're going to limit five we want to make sure that this is correct we did the first step we filtered we did the second step we ordered by and we did the third step we limited and it seems like that this is going to be correct so then i would turn around and i'd say i think this is good do you have any questions about what i wrote yeah i think it does look good i would say for your having count statement right here does it matter if you put in the end date or can you use any row and if so then why good question i would say that by the definition that i just established with you if you put any other row well actually you can only put these three rows right because employee id is something that we're trying to a group by if you take any of these other rows then you are risking the idea that sometimes objects could exist it could be populated with the id get populated with the start but it doesn't have an end date and in that case we can't count it as the definition of done gotcha okay cool that sounds good do you want to do a part two today let's do it let's do a part two i have an idea for a interview question just from this okay let's look at the part two of this question essentially in part one we looked at the number of employees by specifically lowest paid and then 10 projects so the second part is what is the sum of the salaries of employees that have not finished a project okay so now we go back to the definition of not finished so in this case i can already imagine that there's three different at least three different definitions of not finished okay so let me define them here unfinished definitions the first way that a project could be not finished is that a project could have no end date so the end date is known second way that a project could be not finished is that the end date is in the future so that means that it is greater than current date and then the third way that the pro the project cannot be finished is that if it never happened at all but if it never happened at all we've established that the table two is a ledger so therefore if there is no project id in table two or a better way to say it is there is there exists an employee id in table one that has no uh project ids associated in table two okay so given these three conditions and it's good for you to lay this out let's say that we're going with number one the end date is null that probably seems like the most reliable way to collect data in this case than how we kind of set it up and then we can erase that idea of the end date in the future from question number one okay what about number three if an employee has never had a single project assigned to them then they haven't even done any projects or they can't finish a project if they've never started one that is true but then they could have been assigned one and then not finished it right and we'd still want to record that somewhere yes uh i think that would be good i think that would be covered by the first condition if they had a start date but no end date then their end date would be null yes yeah exactly yep so but if they have no project id associated with the employee id in table two then they would just have never been assigned a project to begin with correct so you um just to clarify that doesn't fit this definition yes so the definition is then what are the sum of the salaries of employees that have started have not finished a project but were assigned one okay let's get started so why would select this time and remember always be sure what you're selecting so i'm going to select sum of salary indicate this is in the employees table employees now incredibly important is that now we have to consider what kind of join we are using so last time we were able to use a an inner join but this time we know that all of the employees who have never had a project would also be here would not would be entirely excluded from the output a project if we wanted to include them then we would do a left join and they would exist in the final output and they wouldn't have an end date so therefore we would be including them anyway but as of right now we don't want to include those people because we want to include only employees that have started a project so this is again going to be an inner join okay i could just copy it from above but you know get those fingers typing okay so now that we have the join complete we're going to now think about this condition right if we filter for end date equals null we can do that right here the problem with that is that it's going to take every single person every single employee who currently has a project that they haven't finished so let's say you have had a thousand projects and your next project isn't finished you would still be collected by this where statement so your boss will say you're a slacker even though the data scientist was the one that actually made the mistake so we don't want to do that so what can we do well there's two ways we can we can overcome this problem we could one having count of project id equal one that means that it's they've only had one project ever and that one project ever has end date equals null okay but this doesn't collect the second problem which is what if they've had two projects and they're both unfinished if they're both in the future so we now need to essentially filter all of the projects that they have that are unfinished and see if that equals their total number of projects so this is going to get a little tricky okay so one is actually not okay we have to actually find out if having count of project id equals the count of null end date and what and this and this column currently doesn't exist but what we can do is we could do a case when so what we do here is we have either a cte or a sub query but let's do a cte so let's do with let's call this null ended actually that's too confusing because i'm going to call null and date for the column name so i'm going to call this cte as this is going to select the employee id uh let's think about this let's only collect project i let's do everything x and then do a case one so we're gonna select star and then do a case when date is null then one else zero so this is going to give us the products table but with one additional column which is this table let's call it as and i'll end date this is going to tell us exactly how many null end dates there are for each employee so now instead of joining to projects let's join to this cte still call it p though for a naming convention and the id equals employee p dot employee id and now we need to look at where having having count project id equals count null end date we can't have happen without group by group by employee id again we have the cte include everything from the projects table except in addition to one additional column that tells us whether or not the end date is known for each row and then we have a sum of salary from employees inner join the cte group by employee id and having count of project id equal to sum of null end date we need to that be sum because this is ones and zeros instead of nulls and not nulls gotcha this looks right so this is the same idea as saying end date equals null but this time it's saying um how many project ids do they have and how many null end dates do they have if they're equal then they're never finished the project i'm just saying i had another idea for this one um i was thinking that if you did let's say you know all this stuff with the join inner join on this one potentially you could actually make it so that you grabbed all of the employees where the uh where they did not where they finished did not where they had a project that was finished so it'd be like uh where e dot id is not in select you know employee id from projects where end date is not null right so i like that a lot now i think the only problem with this is that it falls into the same trap as before because let's imagine that you've done a thousand projects um well actually let me think about this any of them right yes okay so this is a condition for has done at least one project okay no i totally agree now i i have seen through uh the error of my ways this is totally a legit uh way to go about it as well um i think actually you don't even need this inner join on projects sorry so because then this would make it de-duped right so now you would just select some salary like that and that would get beautiful that's the beauty of sql you can do it so many ways yeah it is true uh cool i have one last interesting question for you but i don't know if you want to add it oh i do okay this one is like not as relevant but i'll i'll paste it up here let's say that uh it's time to give an annual raise to assert uh to a few employees how would you determine which employees should get the raise and so it's a little bit ambiguous but i'd love to understand how you might approach this problem i love ambiguity so let's consider let's let's jot it down this is starting to look like my desktop in terms of um in terms of messiness but let me let's just put some uh notes down here so um when we consider people who should get raised that is an incremental like a marginal right uh increase in compensation and why do we usually give raises well sometimes it's because inflation right sometimes it's because the value of the their salary is no longer matching the uh either the value of the dollar or the value in the market but also it could so since we can't look at this in these two tables what we can look at is uh marginal increases in performance yep we think about who should get raises we should really think about what they what did they do last year compared to what they do this year yeah so what we could do is we could group by year or depending on how often we give annuals so we can group by year and look at how how many projects they have done per year and then and then do a case win to determine how many people have done like a five percent increase in projects 10 15 20 all the way up to let's say 100 if there's a considerable number of people that have done more than a hundred percent increase in projects then we could continue that case when but i imagine that ultimately we're going to start thinning down the crowd until a couple of star performers are going to shine um and then what we can do is compare their salaries to start performance already right um if projects is the only metric that we're looking at then we would imagine that okay people who are getting paid more for doing the same amount of project increase why is that but then we should also remember that there is a probably like a diminishing return in number of projects completed so we might have to do some sort of penalty for people with a larger amount of projects which as easy as doing another case win on top of that which is um for people who have done more than uh 100 projects and then 200 projects and 300 projects at a penalty that's like their increase has less weight than someone who is just starting out yeah so i would say there's a few factors here you touched on one right it's projects uh and then there's like project length right so like if you do 10 projects that are you know a week versus one guy who does like five projects that are like a month long you know then i would assume that the second one is maybe more deserving of a raise and then to even add more complexity to it it could be like how many people were on the project than you also did right like and uh you know we don't have outcome of how good the projects are but we could say that they're all equal weight but you know let's say that um there are two people on this project versus like 10 people on this other project uh and essentially if you're assuming that all projects are equal then the people who have done the the two people on that one project it probably did better that's kind of an interesting concept that we can maybe touch on uh but yeah in general there's no correct answer i think but it's just about maybe for these types of questions of choosing one metric to calculate and then seeing if we could write the query for it right very cool uh i totally agree this is this is one of the questions that i love throwing out because um i think a couple of like a trend i see in mentees is that they're very straight and narrow um kind of like me back in the day i really wanted to be like getting to the heart of the question as soon as possible but then it's more of a conversation i feel is that you kind of want to get out your ideas and then open up to the coaching of the interviewer because they are essentially trying to lead you to the right answer in their own way yeah so we don't have to dive into it though because it seems like it'd be pretty long all right thanks for joining me today andrew and uh if you want to check out more of andrew's stuff go to his channel on data leap tech where he'll be juggling sql problems python problems and more uh data science content for you there i appreciate it jay jay is the man uh subscribe to this channel if you haven't yet like this video and if you want to make a six figure salary as a data scientist in six months hop on over to my channel i'll see you there all right bye
Info
Channel: Data Science Jay
Views: 27,585
Rating: 4.9303336 out of 5
Keywords: data science jay, interview query, data science, data scientist, sql mock interview, data science mock interview, data science sql, data science sql interview questions, data science sql interview, sql, sql interview questions and answers, sql interview questions, analytics interview questions, data science analytics, data analytics interview questions, analytics interview, sql tutorial, sql beginner, sql questions, data science interview, data analyst
Id: uLCFCzVLi4Q
Channel Id: undefined
Length: 25min 3sec (1503 seconds)
Published: Tue Nov 03 2020
Related Videos
Note
Please note that this website is currently a work in progress! Lots of interesting data and statistics to come.