SQL for data Analysis 101 | Basics Data Extraction and Aggregation

Video Statistics and Information

Video
Captions Word Cloud
Reddit Comments
Captions
hey there while you're trying to figure out what should you study for data analytics and data science python or r there is one programming language that's actually level zero on the scale of languages that you should know for data analytics and it's sql there is definitely no doubt a lot of value in knowing how to use sql and in some cases more than 50 of your programming or even more than that can come from sql and writing a lot of queries trying to get the data from the data sources and databases that are available in the company that you're working for so i'm gonna try to break down one-on-one the most basic and the most important things to know in sql for data analytics and data science if you google for sql interview questions you get something like this what is the difference between different types of databases what is the primary key what is a foreign key what are the subset of sql what kind of sql languages are there or dialects of a scale are there what are constraints what are the joints and some of the interview suggestions are not very practical of course it's really good to know what are the joints and what does it mean to have a primary key in your table and how do you use them but those are not particularly practical and they are more for people who will work with databases like data warehouse engineers but data scientists actually use the products of the data warehouse engineers in their day-to-day work so the questions there tend to be much more practical on the level zero on the level like the basic sql knowledge you need to be able to comfortably do the same aggregations the same calculations that you are comfortable to do in excel so if an excel data source you have some information about you know purchases for user activity or user type and if you can calculate their daily active users monthly active users spending patterns and things like that if you can put pull up the charts there then you can and should be able to do the same things in sql sql is immensely powerful and very helpful in data science in a lot of cases you will not only have to just extract the data from the databases using a scale but you will have to do a lot of data cleaning and manipulations there as well especially if you work with the bigger data sets with hundreds of thousands or hundreds of millions of records in that case it might be very inefficient to transfer all this data into python or r because they will just generally run much slower than many data platforms like amazon web services or google cloud platform i will go through an exercise now with a very basic things that pop up during interviews i plan to do more videos like that so here definitely this is not a comprehensive overview of all the sql questions that you can see but it's a start we will go through an exercise with bigquery that's the database that i use the most they also have a very efficient querying language that they built based on the one that they internally used in google for many years i think and it's really really really fast for ginormous data sources and very flexible it's very similar to general and common sql dialect but there are some things that are a little bit different but don't worry actually bigquery has one of the best resources libraries that you can find i really like to search for answers for questions that i don't know directly there rather than i don't want stuck overflow or generally in the internet the their reference guide is very easy to understand so without further ado let's go ahead usually when you come to an interview you will have a couple of questions about how to manipulate data and the interviewer will give you a couple of tables that you need to extract this information from let's start with um thinking that we'll have let's say an app where we have some users they can be either free subscribers or premium subscribers or paying subscribers and we'll have that information about them stored in a table when they are first logged in so first time when they create their profile they can either choose the free version of our app or the paid version of our app and we will get some information about that we'll call this table user type and we'll have here user id created add that will be a timestamp and is a subscriber then this is our metadata about our users and now we need to create a table where we store information about their activity usually in the companies those are like real-time logs that information gets added to when the user actually creates this activity so pretty much when you click a button in some app the record about how you did it and when you did it and some metadata about your device and location is sent to store to the company that produced this app so those are like real-time logs and we'll have this log as user activity and we'll have here some information as timestamp that's usually the one that's always there because it obviously helps to identify when the thing happened that we're interested in a user id because we need to know who did the thing that we're interested in and let's say we have activity id it could be anything that we might be interested in like an id of a button that the user clicked in or maybe a string just like in the name of the action like subscribe or login or whatever this is a very simplified version when bigger companies with developed data infrastructure those are of course much more granular but we're going to start with that so there are plenty questions that we can already answer with these with this amount of data even though it seems like we only have just two tables with just a few values we can start and look at calculating daily active users or monthly active users of the product that we have or the app that we have we can look at comparing the activity of the users based on their subscription type we can calculate how many daily new installs we have for monthly new installs based on the first table user type we can calculate how many daily new paying subscribers we have as a subset of the daily new installs new paying subscribers and then we can look at the activity so how many of those activities in the user activity table happen per user different different times during different times of day so because we have the timestamp we can extract this information activity by the time of the day and pretty much there is plenty plenty plenty already of other opportunities and how you can break down different kinds of users what kind of information you can extract you can just look at overall how many users we have how many users have been deleted if we have this information just plenty of the things that you can do with very simple amount with very small amount of data and actually encourage you to just like think about maybe look at these tables or when you see some interview questions with the table example of sql think about how many different questions you can answer with this data because as a data scientist or data analyst that's something you will have to do you will have a limited amount of resources and data to answer pretty much unlimited variety of questions to support this example i created a couple of tables those actual tables in bigquery if you have a google account you can create yourself this console for google cloud platform where you can actually write queries in bigquery but be careful there is a free tier you can run a certain amount of query acquiring a certain amount of data but at some point when you exceed maybe 200 a month or something you will have to pay for those queries and the data storage so check out in your i think it's somewhere under billing yes check out under your billing tab whether you're exceeding your query limits just so that you don't have to pay yeah i don't have to pay for now i don't think it's expensive and i think it's really nice to actually be able to run the queries when you're preparing for interviews but anyway so here we have the sql one on one data so data set that i have created for this exercise and there are two tables user type we can preview it and see that we have a few user ids and their created timestamp and boolean flag whether they're a subscriber or not and then we have user activity table which we can also preview and see that there is a lot of activity records for different user ids for different time steps and for two different activity ids first question was to calculate the daily active users or monthly active users of the app and we for that we need to extract information from user activity so we can click here and see if we can query that so first we're gonna look into yeah what we have in this table user ids and activity id so to calculate daily active users and monthly active users generally we want to look at how many unique users were active in any way in this day when you're going through an interview of course you need to discuss with the interviewer how do they define activities so if they just tell you calculate the daily active users of this product don't just go ahead and write select count you know distinct user id for date of the timestamp for date of the timestamp because maybe that's not how they define daily active users maybe that's something that you should that should discuss with your interviewer so let's say yes the daily and monthly activity will be whatever it is then we run this query by pressing command enter and yeah we can see that we have quite a few days it's not sorted so let's let's order by so our daily active users are pretty much two or one or three i didn't have enough time to add there too much information and also it's easier to double check yourself so that would be daily active users we might want to calculate average daily active users per each day of the month on top of this query we would want to write another query where we write select average deu [Music] per month um from this query now we don't really need an order by anymore and then we group by two and we run this query and we receive a a weird result yeah i should have named those columns but yeah we have on average 1.87 daily active users in the month of december however that's not how the monthly active users are calculated for the monthly active users we want to look at all users who were active at least once during the month of december and then it's pretty much just a little change of our inner query where we're gonna instead of the date here we will extract month again and we should call it so in the month of december we had five unique users active i don't know if you can hear but i have a little doggo eating her dinner in the corner of my room so i don't i hope you don't hear too much of the crunching well yeah let's go so we calculated daily active users and monthly active users now let's say we want to compare the activity of those users based on the subscription type so let's look again into the monthly active users but this time we need to join this table with the table with the user type we'll do it using user id and we're doing an inner join we're making it under the assumption that all the users in our user type table are to some extent present in our user activity table and vice versa so everyone who's ever logged in into our product our app has their record and user type and user activity if we wouldn't have that user in the user activity table then in this case we would just exclude them from our activity calculations and if we wouldn't have this user in the user type table we would have them in the user activity same we would exclude them from the calculation so this is under the assumption that we have records for all the users in both of the tables if we don't and we want to reflect that we want to we need to do a different kind of join for example left join would be beneficial if we have users in user activity that are not present in user type for some reason we can do a left join and assign them an unknown subscription type or is subscriber value because we don't have a record in user type this is all very important first of all to think when you're actually working but also to talk through during your interview process so in this case we not only want to know monthly active users and daily active users but we also need to look at the is subscriber and here we'll get our answer so we have nmr month of december we have three um non-paying subscribers active and to paying subscribers active we answered questions about daily and monthly activity by the type of the app we can do the same for daily new installs for subscribers we don't even need to use the activity table here we can just look at the user type and extract the date from the create date let's look at the activity by the time of the day this is something that's not so frequently i think asked or taught during sql courses but it's something that will come up so many times during your work because you will always need to parse timestamps for date time values to be able to cluster the activity per day per year per month per week for different kinds of week iso week or a week that starts on sunday week that's on monday all this syntax i don't necessarily remember all the time i need to sometimes google it and here i want to show you how easy it is to google this stuff for bigquery for example we want to format timestamp and bigquery and here we have reference informations for timestamp functions in sql bigquery we just look at the format timestamp so we just want to look at format timestamp for example yeah we have this timestamp in utc pretty much we just can well actually here i'd want to have um only hour and a minute so let's copy that um so now instead of date we will look at the hour and instead of extract timestamp we will look at the so we have ts here we don't want to add this information and then back going back to the reference part in the bottom part here you can see the reference guide for the support of elements for timestamp i constantly use it because most of the things i don't remember by heart here we only care about hours so the 24 hour clock and minute which is capital m so percent uh age percent um let's add this part hour minute then and here active users we can still do the join on subscriber type so let's see what this turns back to us this kind of looks better if we have like ability to chart this data do we want to do that yeah let's do that a little trick that i can show you is that here you can save it into google sheets which is one of the super fastest and easiest way to look at it as a as a chart to look at this data so we have this information in the google sheets so we're gonna we're gonna add here a pivot table where is it data to the table yes we created a new sheet and the rows we're gonna have our minute it's sorted perfect in the columns we're going to have value oh no in the columns we're going to have subscriber and the values we're going to have the active number no we don't need this here yes so now it looks kind of pretty good we don't really need grand total here and here and then we can add a chart just easily on top of this it doesn't look fantastic but it's something maybe if we use the no this kind of blind chart or maybe probably the bar chart would be better yeah so here we can see the activity for paid subscribers and non-paying subscribers um not a lot of [Laughter] not a lot of insight that we can derive from that because of the limited data set but let's say yeah just a lot of them are active at two at night and a bit of the subscribers are active at one o'clock yeah if we have more data of course that's more information that we can derive from this um data source so i hope this was useful i'm gonna try to continue doing those series and go more into detail of like some functions that you need to do how to do window functions how to use case when how to use different kinds of joins how to do running total calculations if you don't want to wait just google that yourself it's really useful and does come up a lot on the interviews but also something that you'll have to do almost like every day or every other day when you work with data sql is very valuable skill that's not that fancy for some reason not that well known compared to python r or machine learning even though sql is like the basic the basis that you need to know to be able to work with data hope that was useful thank you for listening let me know if you have some specific questions about sql queries maybe some examples that you want me to look through maybe i'll do that in the next videos and have a nice day
Info
Channel: Anastasia K
Views: 27,556
Rating: 4.9592528 out of 5
Keywords: data analysis, sql, sql 101, sql interview questions and answers, basic sql, basic sql for data analysis, sql for data science, sql queries example, google bigquery, GCP, BigQuery, data analysis interview, sql for data analysis tutorial, sql basics, anastasia kuznetsova, data analyst, data science sql, sql queries interview questions and answers, sql questions asked in interview, writing sql queries, basic sql tutorial for beginners, basic sql queries
Id: DmkswBf8SOk
Channel Id: undefined
Length: 22min 4sec (1324 seconds)
Published: Sat Apr 03 2021
Related Videos
Note
Please note that this website is currently a work in progress! Lots of interesting data and statistics to come.