Advanced Data Science SQL Interview Question [Amazon] (window functions & aliasing)

Video Statistics and Information

Video
Captions Word Cloud
Reddit Comments

Here's my latest video on defining window aliases when you're working with window functions.

For background, I create weekly videos on coding tips usually related to data science type problems. I usually code in SQL so I'll end up sharing here if I think it's helpful information.

๐Ÿ‘๏ธŽ︎ 7 ๐Ÿ‘ค๏ธŽ︎ u/analytics_science ๐Ÿ“…๏ธŽ︎ Feb 17 2021 ๐Ÿ—ซ︎ replies

These vids are great man, keep it up!

๐Ÿ‘๏ธŽ︎ 2 ๐Ÿ‘ค๏ธŽ︎ u/sirdeionsandals ๐Ÿ“…๏ธŽ︎ Feb 18 2021 ๐Ÿ—ซ︎ replies

Fantastic video, really enjoy seeing the moderate to advanced skill level videos

๐Ÿ‘๏ธŽ︎ 1 ๐Ÿ‘ค๏ธŽ︎ u/deakaii ๐Ÿ“…๏ธŽ︎ Feb 18 2021 ๐Ÿ—ซ︎ replies
Captions
hey guys it's nate here today i have an advanced data science question that's going to test your knowledge on date manipulation and also advanced window functions so what i like about this question is that it's a common analysis that you'd be performing on the job but it's also a perfect interview question because it tests your data manipulation skills and it tests your communication skills as you guide the interviewer through your logic and through your implementation to the solution so what i'm trying to show you guys through my examples and through my videos are just relevant questions that you would find in industry whether on the job or on an interview date manipulations and window functions are considered advanced sql skills but the combination of these two technical concepts on this question make this question much more difficult so let's break down this question i'm going to guide you step by step on how to implement a solution and then show you exactly what's going on underneath the hood at each step alright so let's start coding but before i show you the question if you like content like this please subscribe to this channel thanks all right so this question is from amazon it's called monthly percentage difference the question reads given a table of purchases by date calculate the month over month percentage change in revenue the output should include the year month in this format right here and then also the percentage change rounded to the second decimal point so the percentage change column will be populated from the second month forward and can be calculated using this formula here which is essentially this month's revenue divided by last month's revenue so that gives you the difference divided by last month's revenue as well all right so that is the the formula for percentage change so now that we know what the question is let's go through a framework on how to approach this solution and i basically use the same approach for every single question whether on the job or on an interview so this framework has four steps to it the first step is to explore the underlying data to study the columns and also the behavior of that data so in order to do that what i can do is either run this code select star from sf transactions or just hit the preview button here i'm going to view the output in a separate browser so what i have is essentially four columns and what columns that i think are important are the created ad column which is a date as well as the value column here but when we take a look at the created add column we see that it's an object right it's actually not even a date data type so this object is really a varchar or a text data type in sql this is actually using a python executor here which changes the data type to python data types but for all intents and purposes this uh data type is actually a barchar or a text data type all right so what we're going to have to do is convert or cast this created add column to a date data type so that's the first thing the second is the value column here and it looks like what we want to do is sum these values up by your month right now what i'm seeing is that all of these rows are actually on a granularity of day not your month so the second step in my framework is to identify the required columns that are going to be needed to code up the solution so we sort of already did that in the first step we identified that the credit at column as well as the value column are the columns that are needed to answer this question so what i'll write down is that the created at and the value columns are our required columns to answer this question so that step is relatively easy all right so the third step in the framework is to visualize the output and to identify what columns i need in the output itself and then also what aggregations i need to implement in my solution so according to the question we need the date to be in a year month format and then we also need to implement a month over month percentage change formula using the values in the value column so again there are two columns that we need and we need a few aggregations to actually get to our answer so the fourth step is to start coding and to build your solution step by step and test out the query every time you add logic or a business rule all right so i'm going to guide you through the fourth step right now so let's start coding okay so the first step is to convert this created at column right here to a date data type because we see that it's an object right here it's really a varchar in sql terms or a text data type in sql terms but what we want to do is cast this column so that it becomes a date data type so what i can do is this so in order for me to cast this column to a date data type i can just implement these double colons here and then type out date to convert that into a date data type this is possible because this sql engine here is postgres and postgres allows us to use double colons in order to do quick casting and so when i run this code here i get exactly what i'm expecting i'm getting a date display i'm getting the date here the time stamp and then the time zone and so this is very unique to stratiscratch you might not see this in other ides uh but it tells me that created at this column is actually now a date data type if you're not on a postgres database maybe you're on hive or mysql what you can do is use a cast function and so it will just basically look like this i'm going to use the cast function call the column that i want to convert and say create it at as date and then this will also cast the created ad column and convert that to a date all right and so now that this column is a date what i want to do is change it to a year month format so i want the format to basically just be like this year and then month right now i have it on essentially a day granularity which i don't necessarily want so what i can do is use this to char function here in sql type in to char and basically say that what i want is this created at column i want it to be this format here and i'll name the column year month and then if i run this query i now get the date column to be converted into the format that i want year month so this operation here is extremely common in analytics it's very common to aggregate your data on a month level especially when you are spanning several years and it makes sense to format the date to a year month to basically separate out the years but now what we're going to do is calculate the month over month percentage change in revenue here and we're going to implement this formula that's basically this month's revenue minus last month's revenue so that difference divided by last month's revenue so this is going to require a window function so let's build out this calculation step by step so that you know exactly what's going on underneath the hood so the first step is to calculate the total revenue of the current month but in this case uh what we have is is actually the total revenue or the revenue of that day so what i did was essentially added a sum value and we're going to aggregate that and and sum it by the month and what we're going to do later on is aggregate that to your month so it's the sum value so that it's the total value for that year month but in this case i added a created add column here so that we can understand what's actually going on so if we add created that and some value then also add this group by what we get in the output is essentially your month the day itself and then the revenue the sum of that revenue for that day all right so that's the first step just to take a look at what's going on so the next thing we want to do is calculate the last month's revenue so for this we need to use a window function and we're going to use a function called lag so this lag function will take the previous record because we have one as a parameter so it's actually just taking the previous row and summing that value so this parameter here basically tells us which record to take one means just the row above it two means two rows above the current row so in this case because we're just we want last month's total revenue we're gonna use one because we want to take the the previous row and then to finish it off i have the window so over and then we're ordering the window by the credit at day and so when i run this code what i essentially have is the current total sum and then the previous month's total sum right so what you're not actually seeing in this output is the aggregation of your month just yet we're keeping everything as day so that we can start with the raw values and then start aggregating things later on so i'm just showing you what's underneath the hood in this window function and so just to explain what this window function is doing or what this entire query is doing you have the current day's revenue here and so january 5th 2019 corresponds to this revenue number and then the previous row or the previous day's revenue number is right here and that corresponds to the previous row like i mentioned and then if you take another example this revenue number corresponds to january 9th and then the previous revenue number corresponds with the row right above it all right so that is what the lag one is doing okay so now we have all of the components we need to implement this month over month percentage difference formula so now we can actually aggregate the dates to become your month so what we need to do is just change around and revise this code just a little bit we can first remove this created at field right here so that we don't get values by day the group by should actually change to your month because that's the level of granularity we want to aggregate things to and then let's actually change this format so that this window function is also on your month and actually just to do that we can just copy this chain that change that over and if i run this query here we finally get an output on uh your month aggregation so we have your month we have the current month's revenue it's the sum of all of that month's revenue and we have the sum of the previous month's revenue on the third column now the fourth step finally implementing the month over month difference formula so that formula is actually just right here we have all of the components right so all we really need to do is take the difference between the current month which is what i'm highlighting here and the previous month which is what i'm highlighting here so if i just do that i'll just name this the numerator and then the denominator to that formula is really going to be the last month's revenue right so i'm just going to copy and paste uh what i have here and i'm going to name this column the denominator so if i run this what i have is still components of this formula here right the numerator and the denominator but i know it works so what i can do next is actually finally create that ratio so instead of naming that column i'm just going to have a division sign right there and i'm going to call this revenue diff for revenue difference right so this is the difference right here the month over month difference divided by the previous month's total revenue so if i run this query i get the year month and i get the revenue difference and this is exactly what i want you know that we're almost to the final answer here but one thing i'll say is that this code is actually really hard to read and it also doesn't necessarily apply the best coding fundamentals or principles because we have some duplicate code here which i feel like we can actually implement the dry principle where we don't repeat yourself we don't repeat ourselves so in order to sort of make this code more readable and to apply some basic you know software fundamentals or basic coding fundamentals let's actually apply a window alias a window alias is for window functions and an analogy of what a window alias is is essentially a variable that stores an expression or store some code all right so this is how i'm going to implement the window alias what i'm going to do is define my window alias and call it w and then what i want to do is actually replace what's in this over window statement here with the window alias w so my window alias w i'm going to define as order by two char created at and another parentheses here and then this code right here can also just be the window alias all right let's run and see if it works so if i run this query i get the exact same output but things are just a little bit more readable and cleaner and our code does not repeat so the last thing we want to do is just clean up the formatting we just want to change this ratio to a percentage so to change this ratio to a percentage just multiply it by 100 i'll run the query so you can see that it's multiplied by 100 and then we're going to use this round function here and round it off to the second decimal spot run this query that's exactly what we get and just to be explicit i'm going to add this order by so that we're going to order the year month by earliest to latest so if i run this query i basically get the same thing because that was the default behavior to begin with but i just like being a little bit more explicit all right so if i check the solution the solution is correct so that wasn't too hard was it sometimes it's really nice to be able to take your time and break down the problem into its individual components so you're dealing with things one step at a time and then you're executing or running the query as you're building out the solution so you completely understand what's being processed and how the data is actually being manipulated so again the date manipulation that we did where we changed a date to a format of year month is extremely common in analytics you're almost always going to be asked to reformat dates so it's really important to be able to cast or convert your columns to a date data type and then also be able to reformat it to whatever format you want window functions are really important but as you saw in this example it can get pretty messy so window aliases are very helpful to improve readability but then to also reduce the amount of code that you're duplicating so both of these concepts that we covered today are advanced sql concepts but taking both of these concepts together in one question makes this question much more difficult so it's really important to be able to slow things down and to build the solution one step at a time and one logic or one business rule at a time this will help you troubleshoot and debug if you're on the job but if you're on an interview it will also help bring the interviewer along so they know exactly what you're doing and how you're thinking to actually solve this problem so i hope this was helpful if you have any questions or feedback for me please leave a comment in the comment section below i hope to see you guys in the next video thanks [Music] you
Info
Channel: Nate at StrataScratch
Views: 12,045
Rating: undefined out of 5
Keywords: data science, data science interviews, data science interview questions, data science interview questions and answers, coding questions, interview coding questions, interview technical questions, coding questions and answers, sql coding questions, sql coding questions and answers, amazon data science interview, amazon data science interview questions
Id: QenwDm5oWdU
Channel Id: undefined
Length: 16min 39sec (999 seconds)
Published: Wed Feb 17 2021
Related Videos
Note
Please note that this website is currently a work in progress! Lots of interesting data and statistics to come.