Intro to Data Calculations in Spreadsheets and SQL | Google Data Analytics Certificate

Video Statistics and Information

Video
Captions Word Cloud
Reddit Comments
Captions
this video is part of the google data analytics certificate providing you with job-ready skills to start or advance your career in data analytics get access to practice exercises quizzes discussion forums job search help and more on casera and you can earn your official certificate visit grow.google datacert to enroll in the full learning experience today [Music] as a data analyst you'll use key tools and processes over and over but you'll also learn new things as you grow in your job it could be anything from building a new kind of analysis to a time saving shortcut when i first got to google i relied on just a couple of programs and tools to access data and do my analysis but i soon realized that i wasn't working as efficiently as i wanted to once i got comfortable pulling data and analyzing it using sql it allowed me to be a lot more efficient than before and the better i got at sql and pulling the data from data tables the faster i completed my analysis i was hooked over the next few videos i'll show you some ways to be as efficient as possible while completing calculations during your analysis we'll start by revisiting spreadsheets where we'll look at formulas for basic calculations then we'll move into conditional formulas that use the if function to check whether a condition is met through a calculation after that we'll explore the multifunctional sum product function try saying that five times quickly some product adds and multiplies all in one step so it's very useful next we'll take another look at pivot tables if you've skipped around and it's your first time learning about them you'll get to know all about them pivot tables have tons of uses including organizing your calculations we'll then pivot to sql pun intended we'll show how queries and calculations go hand in hand in sql we'll also look at temporary tables in sql which are helpful for temporarily storing your data during analysis we'll be covering lots of new concepts in these videos so feel free to hit the pause button at any time to think through the problem or steps to try it on your own and you can always review the videos as much as you need to so to recap we'll have a little bit of a review and then cover some all new concepts all about calculations are you ready good me too you probably do a lot of calculations in your daily life maybe it's figuring out how to tip someone or balancing your budget you might do some of these calculations in your head or with paper and pencil or the calculator on your phone you might even have shortcuts you use to make the calculations easier you'll perform a lot of these calculations as a data analyst too but they'll involve more numbers and a wider range of calculations that's where you will put your data analyst tools to work we'll show you how you can use formulas in a spreadsheet to complete some of the more basic calculations formulas are one of the many shortcuts that data analysts use but rest assured even though they're shortcuts they still calculate with complete accuracy we've covered a lot of these calculations earlier in the program but if you skip that part and won a refresher we'll review them here these calculations will also be more advanced than the ones we've covered so far but they will also be closer to what you might use on the job we'll be using google sheets and this video but you can also use excel the steps might look a little different in excel but the outcomes will be the same okay let's try out some calculations with sales data from a discount store chain we'll look at data from one of the stores in the chain our objective use the existing sales data to find any trends this is a great way to see a lot of the ways formulas can be useful in your analysis we'll start by finding annual sales over the years 2011 to 2020. the data is already organized in columns by month and in rows by year but we don't have the total sales for each year yet we can use a sum function to help us figure that out we'll add the sales for 2011 first we'll add a heading for the annual sales column then we can type our sum function in a formula all formulas begin with an equal sign so we'll type that first followed by sum and then an open parenthesis after the open parenthesis we need to tell the formula which cells are being added in this case we need data from the whole row which begins in cell b2 b2 is the cell reference we'll use instead of typing each cell one by one we can put them in the formula quickly by selecting cell b2 and dragging the fill handle across the row to the last cell with sales data m2 the fill handle is a tiny box in the corner of each cell you can use it for lots of things like selecting multiple cells for a formula or continuing a pattern across several cells the fill handle definitely qualifies as a shortcut all right now we'll complete the formula by closing the parentheses and pressing enter and just like that we've calculated the total sales for 2011. here's another shortcut we worked on in an earlier video we can use the formula we created to calculate the total sales for the other years in the data set all we have to do is drag the fill handle down the other cells in the annual sales column and we'll have total sales for the rest of the years in the data set let's say we also need to find the growth in annual sales from year to year this would be a good time to think through the problem before we try to solve it do we have the data we need to solve this not yet thinking backwards like this helps us plan out the steps to move forward the first step we'll need to do is calculate the total sales by year then we'll measure the rate of change between years we'll start by labeling a new column in this case we won't need to use a function or parentheses since we're only using data from two cells we can just use the name of those cells we'll type an equal sign and then click in cell n3 which automatically populates that cell in the formula next we'll add a minus sign to the formula because we're subtracting to find the difference between two consecutive years clicking in cell into gives us the total from 2011 which we can then subtract from the total from 2012. then we hit enter and get our sales growth from 2011 to 2012. we're definitely getting some useful data here so let's keep going we can also use our sales growth to find the growth rate between the two years we'll show this as a percentage so we'll head our column with the percent sign and growth to do this we'll divide the total in cell o2 by the annual sales from 2011 and sell into a slash is a symbol that a formula recognizes as division so we'll place that between the two cell references presto there's the growth rate growth rates are usually shown as percentages which can be easier than a decimal to read and understand so let's change this number to a percentage time for another shortcut all we have to do is click the percent style button and our growth rate will become a percentage we can select the sales for both the total growth and the growth rate to populate the rest of the two columns we have some negative numbers but that just means that there was negative growth from one year to the next okay we've got just a few more things to calculate for our stakeholders next up is finding the average sales we want to compare sales between months to learn if there's a trend so we'll add this in a row instead of a column this will line up our averages under each month to find our averages we'll calculate the total and then divide that total by the number of values added to get it we can do this by using the average function between our parentheses we'll select the cells that contain the sales data for january b2 through b11 we'll duplicate that formula across the row through december to look for trends right away we know that summer months and december have the highest average sales since our stakeholders will want to understand our findings quickly and easily we'll add a little visualization to the data with conditional formatting conditional formatting is a spreadsheet tool that changes how cells appear when values meet specific conditions so let's apply conditional formatting to the cells with the average sales by month we'll use a color scale to show the range of averages with the lowest monthly average remaining as white and we'll apply shades of green to the rest of the values the brighter the green the higher the average now when we share our analysis with our stakeholders they'll be able to tell right away which months have the highest average sales okay just a couple more steps to complete our analysis now we need to find the minimum and maximum for average monthly sales with the data set this small it might be easy to find the minimum and maximum values without a formula but it's still good practice to use one not to mention using a formula helps prevent human error we'll again rely on formulas with functions to do these calculations we'll start with the lowest monthly average our function here is men followed by the cells with the average month b12 through m12 after we press enter the lowest monthly average is calculated we can repeat the same steps to find the highest monthly average in this formula we'll use the same data but we'll replace min with max for maximum so for this store location sales are strongest in december and weakest in january we could share these findings with stakeholders if they've met our objectives if they haven't we might need to continue with our analysis either way i hope you've learned how spreadsheet formulas can be valuable tools when doing calculations coming up we'll check out more formulas see you soon [Music] one of the first calculations most kids learn how to do is counting soon after they learn adding and that doesn't go away no matter what age we are we're always counting or adding something whether it's change at the grocery store or measurements in a recipe data analysts do a lot of counting and adding two and with the amount of data you'll come across as a data analyst you'll be grateful to have functions that can do the counting and adding for you so let's learn how these functions countif and sumif can help you do calculations for your analysis more easily and accurately we'll start with the countif function you might remember countif from some of the earlier videos about data cleaning countif returns the number of cells that match a specified value earlier we showed how countif can be used to find and count errors in a data set here we'll only be counting just a reminder though while we won't be actively searching for errors in this video you'll still want to watch out for any data that doesn't look right when doing your own analysis as a data analyst you'll look for and fix errors every step of the way for this example we'll look at a sample of data from an online kitchen supplies retailer our stakeholders have asked us to answer a few questions about the data to understand more about customer transactions including the revenue they're bringing in we've added the questions we need to answer to the spreadsheet we'll set up a simple summary table which is a table used to summarize statistical information about data we'll use the questions to create the attributes for our table columns count revenue total and average revenue per transaction each of our questions asks about transactions with one item or transactions with more than one item so those will be the observations for our rows we'll make quantity the heading for our observations we'll also add borders to make the summary table nice and clear the first question asks how many transactions include exactly one item to answer this we'll add a formula using the countif function in cell g11 we'll begin with an equal sign countif and an open parentheses column b has data about quantity so we'll select cells b3 through b50 followed by a comma next we need to tell the formula the value that we're looking for in the cells we've selected we want to tell the data to count the number of transactions if they equal one in this case between quotation marks we'll type an equal sign and the number one because that's the exact value we need to count when we add a close parenthesis and press enter we get the total count for transactions with only one item which is 25. we can follow the same steps to count values greater than one but this time because we only want values greater than 1 we'll type a greater than sign in our formula instead of an equal sign getting this information helps us compare the data about quantity okay now we need to find out how much total revenue each transaction type brought in since the data isn't organized by quantity we'll use the sum if function to help us add the revenue for transactions with one item and with one more item separately sum if is a function that adds numeric data based on one condition building a formula with some if is a bit different than one with countif they both start the same way with an equal sign and the function but a sum if formula contains the range of cells to be evaluated by your criteria and the criteria in other words some if has a list of cells to check based on the criteria you set in the formula then the range where we want to add the numbers is placed in the formula if that range is different from the range being evaluated there's commas between each of these parts adding a space after each comma is optional so let's try this in cell h11 we'll type our formula the range to be evaluated is in column b so we'll select those cells the condition we want the data to meet is for the values in the column to be equal to one so we'll type a comma and then inside quotes an equal sign and the number one then we'll select the range to be added based on whether the data from our first range is equal to one this range is in column c which lists the revenue for each transaction so every amount of revenue earned from a transaction with only one item will be added together and there's our total since this is revenue we'll change the format of the number to currency so it shows up as dollars and cents so the transactions with exactly one item earned one thousand five hundred and fifty five dollars in revenue let's see how much the transactions with more than one item earned foreign okay let's check out the results just like with our countif examples the second sum if formula will be the same as the first except for the condition which will make greater than one when we run the formula we discover that the revenue total is much higher four thousand seven hundred and thirty five dollars this makes sense since the revenue is coming from transactions with more than one item good news to complete our objective we'll do two more quick calculations first we'll find the average revenue per transaction by dividing each total by its count this will show our stakeholders how much of a difference there is in revenue per transaction between one item and multiple item transactions this information could be useful for lots of reasons for example figuring out whether to add a discount on purchases with more than one item to encourage customers to buy more we'll put these calculations in the last column of our summary table you might remember that we use a slash in a formula as the operator for division calculations the average revenue for transactions with one item is 62.20 and the average revenue for transactions with more than one item is 205.87 and that's it for our analysis our summary table now gives stakeholders and team members a snapshot of the analysis that's easy to understand our countif and some if functions played a big role here using these functions to complete calculations especially in large data sets can help speed up your analysis they can also make counting and adding a little more interesting nothing wrong with that and coming up we'll explore more functions to make your calculations run smoothly bye for now data analysts love discovering new ways to work on their analysis especially when those new ways simplify their work i know i'm a big fan of learning new tricks to complete tricky tasks instead of trying to find a new way to do something every time i do an analysis i try to learn from other people by asking questions and getting help when i need it the people i work with like to use the phrase stealing with pride all this means is that you should feel no shame for using a process in your analysis that you learn from someone or somewhere else fellow team members message board posts online searches i've used all of these resources for ideas with pride of course i always cite my sources when i do that's a super important step to remember the sum product function is also one of those tricks that analysts come across either on their own or from another source you can also think of it as a shortcut for doing more complex calculations we'll show you how some product works and when you might use it to make your work life simpler some product is a function that multiplies arrays and returns the sum of those products and here's what the sum product formula looks like equal sign the sum product followed by an open parenthesis and arrays being multiplied and then added together each array is separated by a comma an array is kind of like a range in a spreadsheet but keep in mind an array is a collection of values in cells not the cells themselves so when added to a formula the sum product function multiplies each of the values in two or more arrays together for example each value in the array of cells b3 through b7 can be multiplied by its corresponding value in the array of cells c3 through c7 so b3 times c3 b4 times c4 and so on it'll then return the sum of all of those multiplications let's check out an example using data from a kitchen supplies company you might remember this example from our countif and sumif video we've been given some data about a product order including the quantity of each product that was sold in the order and the unit price which tells how much one of each product cost our job is to use the data in these two columns to find out the total revenue for this order and that's where sum product comes in to find the total revenue we need to do both addition and multiplication calculations so first off we need to find the revenue that each item brought in separately if we did this without some product we'd have to multiply each quantity by its unit price 50 times a dollar 25 25 times five dollars and so on then we'd have to add all of those revenue amounts together to get the total revenue fortunately the sum product function does all of that for us let's add the label total revenue in cell g5 and then click g6 to input our formula we'll then start our formula with an equal sign and the function followed by an open parenthesis it's good to remind ourselves that the arrays we add to our formula should always be inside the parentheses next we'll select cells b3 through b7 for the first array followed by a comma the comma acts as a separator between the two arrays and the formula then we'll select cell c3 through c7 for the second array followed by a closed parenthesis to complete our formula we don't need to include the brackets in our actual formula we included them in the syntax example to clearly define each array for you then we press enter to get our total revenue since we're dealing with revenue we'll format the number as currency so we've learned the total revenue is 655 dollars but that's not the actual profit from the sales of these kitchen supplies because we haven't included the profit margin in our calculations the profit margin is a percentage that indicates how many cents a profit have been generated for each dollar of sale so in our data set product number 789 has a profit margin of 20 meaning each product sold earns a total profit of 20 cents for every dollar and just like the calculation for revenue we can save time finding profit margin by using the sum product function there's only one difference between the formula for profit margin and revenue in this spreadsheet but it's an important difference to start in cell g7 we type the same first part of the formula then we include the two arrays in the same way as well but instead of ending our formula we add another comma followed by another array this time we'll select the cells with the profit margin d3 through d7 we'll finish our formula and our calculation is complete the sum product function saved us from having to multiply each individual revenue amount by each profit margin percentage then add each profit margin amount together using some product for calculations is a time saver and helps you avoid making mistakes definitely a trick worth remembering and there's more worth remembering about calculations coming up next by now we've learned a lot about functions and formulas they are very helpful tools for your toolbox and great for finding shortcuts to complete calculations but there's another tool out there that does some of the same things in a spreadsheet the pivot table as a quick reminder pivot tables let you view data in multiple ways to find insights and trends we've talked before about how pivot tables help with cleaning and organizing your data including sorting and grouping data but pivot tables can also help with calculations for example they're great for quickly calculating sums and averages let's revisit our movie data set to show you how pivot tables and calculations work hand in hand earlier we summarized and organized this data in pivot tables we'll do that here too but in this case the organization is a bonus to using pivot tables for calculations you can do these steps in excel as well though some of the steps might look different in this example your manager asks you to find some trends to help them think through new movie ideas using revenue calculations this spreadsheet has data about movies from several years ago so it probably wouldn't be as useful right now still the steps we take to analyze the data absolutely apply then and now so let's get into it first we need to find out how much revenue was generated each year a pivot table is a good way to organize this so we'll build a pivot table to show this in our pivot table we can also find the average revenue per movie we can then check our findings for some possible trends we'll start by finding the revenue generated each year this gives us the release date for each movie in column b and the box office revenue and column in instead of organizing the table by year and building a formula to calculate the revenue per year we'll create a pivot table we'll add the pivot table in a new sheet keeping the data range from cell a1 to cell n509 adding a new sheet is especially helpful when working in a large data set like this it helps keep our calculations together in one place and separate from the rest of the data we'll rename this new sheet revenue to call out where our calculations are both for ourselves and for anyone else on our team who might need our analysis now we can build our pivot table starting with the rows we'll sort the rows by release date to find the revenue for each year you might notice this creates a row for every date on which one or more movies in this data set was released since we only need the years we'll right-click in one of the cells in the release date column to create a pivot date group and we'll group by year now we have one row for each year in which these movies were released next we want to work with values we'll add the box office revenue data here this populates the column next to the release dates with a total box office revenue in each year these calculations are automatic because the pivot table is already set to summarize the data using the sum function so no need to change this setting there's other functions in the summarize by menu though like min for the minimum amount of revenue for each year and count for the number of movies that generated revenue in each year okay let's check out what we've got here this data shows that 2014 had the highest box office revenue while 2016 had the lowest this might be useful information but finding the average revenue per movie would most likely be more useful since there was a number of different movies released each year so we'll add another column for the average revenue earned by each year's movies we can do this in the same pivot table we'll add another value and change the function that we use to summarize from sum to average the average function gives us the average revenue per year for the movies in the data set we can see that the average revenue in 2015 was much lower than in other years since this data stands out so much let's keep exploring to find out why taking your analysis to the next level like this is a sign of a great analyst when you're in your job you want to answer the questions that your managers and stakeholders ask but you also want to answer the ones that come up while you're doing your analysis so let's try to figure this out first we'll note how many movies from each year were included in the data set we'll add a new value and use the count function this time this shows us that there's more movies in the data set from 2015 than from any other year but 2015 still has the second lowest total box office revenue this could mean a few things it's likely that a lot of the movies from 2015 just didn't earn much revenue compared to the other years which would bring down the overall average revenue even if the total revenue remained on par with the other years we'll explore just this one possibility here but you can always go further when you analyze data in your own job it'll depend on your objectives and the questions you need to answer for now let's copy and paste our pivot table so we can test our hypothesis we'll rename the columns and our copy table to differentiate them from our original table we'll name them based on the data we'll be looking at which i'll explain in a moment another great pivot table feature that will help us here is the filter option earlier you learn how this lets us filter data so we can look at only the values we need so we'll select a cell in our copied pivot table and add a filter to the box office revenue the filter will then be applied to the entire table when we open the status menu we can choose to filter the data to show specific values but in our case we want to filter by condition so we can figure out how many movies in each year earned less than 10 million dollars of revenue this is why we renamed our columns with a less than sign and an abbreviation for 10 million dollars the condition we'll use in our filter is less than and our value will be 10 million dollars we'll type our number in a dollar and cents format so the condition matches the data in our pivot table this might not be necessary but it prevents potential errors from happening now we know that 20 movies released in 2015 made less than 10 million dollars this seems like a high number compared to the other years but keep in mind there were more movies from our data set released in 2015 than the other years it's still hard to tell how much of an impact these lower earning movies had on the average revenue so let's run a quick formula to find the percentage of movies for each year that earned less than 10 million dollars this will make it easier to compare them from year to year we'll create a customized column called a calculated field using our values menu a calculated field is a new field within a pivot table that carries out certain calculations based on the values of other fields you can do this in excel too using field settings and the create formula to get our average we'll divide the count from our movies under 10 million dollars by the count for all movies in our original pivot table after count and an open parenthesis between single quotes we'll type a less than sign and then the rest of our column name in a closed parenthesis then we'll type a slash as our division operator and we'll repeat the previous steps from our box office revenue count then we'll select custom from our summarize buy menu since we're not using the sum function hmm that didn't work we have an error message in the calculated field it might be because we're using values from two different pivot tables let's first check that a calculated field will work in our copied pivot table we'll change our formula to prove that calculated fields work and to verify that the values in our average column are accurate for our new formula we'll use the sum function and divide the sum of the box office revenue data from our original table by the count of the same data since we applied our filter for this pivot table earlier the formula will only return the average revenue of movies under 10 million okay that worked so we know that it's not a problem with the calculated field and as an added bonus we were able to check the accuracy of some data always a good thing but we still need to find the percentage of movies for each year that earned less than 10 million dollars so instead of a calculated field we'll add a formula in a new column then we can use the values from both of the pivot tables we'll need for this calculation we'll add a heading for our table and cell g10 we'll name it percent of total movies then we'll add our formula to the next cell in the column we'll divide the number of movies in the copy table by the number of movies in the original table next we'll use the fill handle in the cell with our formula and drag it to apply the formula to the rest of the years and to cap it off we'll format these numbers as percentages now our analysis shows that 16 percent of the movies released in 2015 earn less than 10 million dollars of revenue the other years are all close to 10 percent this is one possible explanation for why the average revenue is comparatively low in 2015. in real life we'd most likely need to take our analysis even further and explore the data even more depending on our objectives but for this video our objectives are complete you've learned how you can use pivot tables to perform data calculations it'll take practice but pivot tables are worth it because they do more than calculate they organize and filter data too pivot tables are also used to form the framework for lots of data visualizations like graphs and other charts you'll learn more about data visualization soon for now though take a second to breathe we've covered a lot of complex ideas in the last few videos functions formulas and pivot tables are all great tools to use in your analysis and with practice and experience it'll feel like you've used them forever just take your time getting to know how they work use these videos and other resources like the readings we've included to explore them further in the meantime we'll discover how the spreadsheet calculations you've been learning about connect to calculations using sql stay tuned by now you probably know that there's more than one way to do the daily task of a data analyst calculations are no exception as we've shown in earlier videos you can complete the same calculations in lots of different ways and spreadsheets you can also complete them using sql in this video we'll give you an overview of how sql calculations compare to spreadsheet calculations let's look at the arithmetic operators that are used in both spreadsheets and sql an operator is a symbol that names the type of operation or calculation to be performed in a formula as you learned earlier the four basic arithmetic operators in spreadsheet formulas are the plus sign for addition the minus or hyphen for subtraction the asterisk for multiplication and the forward slash for division these same operators calculate data in the same way when writing queries in sql the operators are embedded in the queries when pulling data from a database just like spreadsheet formulas there's a few different ways to perform calculations using queries let's look at the syntax for one possible query the syntax of a query is its structure it should include all the specific details of the data you want to pull into a new table where those details should be placed so if you want to add values from two columns of a table you start with the select command followed by the name of your first column then the name of your second column then you'd add the names of both columns with a plus sign between them after that you type as followed by the name you'd like to give the column with the added totals finally you then complete your query by typing from and then the name of the table that you're pulling the data from running this query would get you a table showing the two columns whose values are being added together plus a new column showing the sums of those values the operator in this query is a plus sign since values are being added if you needed to subtract multiply or divide you'd follow the same steps using the appropriate operators if you need to use more than one arithmetic operator in a calculation you'd use parentheses to control the order of the calculations so if we included column c in our query we could place parentheses around column a plus column b we then add an asterisk if we're multiplying followed by column c this query would return in a new column the sum of the values in column a and b multiplied by the values in column c all right now let's say you only wanted the remainder from a division calculation well you need a different operator for this the modulo operator the modulo operator is represented by the percent symbol this is an operator that returns the remainder when one number is divided by another in a spreadsheet you could complete the same calculation using the mod function this brings us to another similarity between calculations in spreadsheets and sql a lot of times you can use functions instead of operators to complete calculations for example the sum function can complete addition problems in spreadsheets and sql the average function in a spreadsheet is the same as the avg function in sql they both return the average value of a set of numbers in sql these functions are considered aggregate functions because they perform a calculation on one or more values and return a single value you'll learn more about how they're used with the group by command and a query soon so those are the basics of sql calculations knowing how to write a query for a calculation is a good first step stay with us and you'll learn more about calculations in sql bye for now earlier we showed you how to complete calculations in sql while there's a few different ways embedding them in queries is a very useful one when you include a calculation in a query with other commands you can do more work faster here's a basic query syntax that we talked about we start with select and then the names of the columns we want to use in our calculations then we add in calculation details including an operator like a forward slash for division next we type as followed by the new column's name to label the column with the calculated values finally we end our query with the from command and the name of the table that we're pulling data from now let's take it to the next level with some embedded calculations that use a syntax like this one even better we'll do this with some data about avocados sorry to those of you who aren't avocado lovers like me okay let's get started feel free to continue watching as we show you the steps using bigquery if you're joining us open up your tool of choice for using sql be sure to look through the instructions and the reading right before this video to help you get started okay date is already loaded so we can jump right in our goal is to find out the total number of bags of avocados sold on each date at each location using this data there's already a column that shows us the total but we want to make sure we understand how that total is calculated we want to make sure that the total column is just small large and extra large bags add it together so we'll add the values in those three columns together in our query and then compare them to the total bags column in the data set we'll start with the select command which we'll use to pull certain columns from the table we are selecting several columns so we'll press enter after select and after the comma after each column name next we'll type those column names date region small underscore bags large underscore bags extra large underscore bags in total underscore bags underscores are the lines used to underline words and connect text characters using spaces can confuse certain servers and applications so using underscores instead helps avoid potential issues while keeping the names readable now we'll add the calculation to the query using the names of the three columns with plus signs between them small bags plus large bags plus extra large bags since we want this calculation in a new column we'll use the as command to name the column total underscore bags underscore calc we've added the word calc so we can compare the columns to each other after we calculate our results now we'll finish our query with from and the name of the data set and subset we're pulling from avocado underscore data dot avocado underscore prices okay let's run the query in the total bags calc column the data shows the sum of each date for the number of small large and extra large bags of avocados that were sold at each location if we quickly compare the two columns showing the total number of bags we learn that the values are the same this lets us know that the data we want to use is the right data now that we have verified the total number of bags we can use those values in another query we need to find what percent of the total number of bags were small bags finding this out might help stakeholders make decisions on how to package avocados or which size bag to run a sale on our job is to get that information to the stakeholders so we'll set up a new query we'll select the date region total bags and small bag columns for this query next we'll set up a new column starting with our calculation to find the percentage of small bags we need to first divide the number of small bags by the number of total bags using a slash as the operator we'll put this part of the calculation in parentheses to let the server know that this calculation should be performed first then we'll multiply this total by a hundred using an asterisk as our operator multiplying by 100 gives us a value that's a percentage instead of a decimal percentages usually make it easier for people to understand quickly when you share results with them we'll use the as command to name this new column small bags percent then we'll add from and the name of the set we're pulling from and we'll run our query oops we got an error in our results it says that we can't divide by zero since we're finding percentages dividing by zero won't work this means that somewhere in the data set there's a total bags equal to zero we'll have to fix this in our query we can fix this using the where command where lets us add a condition to our calculation so after we type where we'll type total underscore bags followed by a less than sign and then a greater than sign these symbols tell the server that the values we're calculating should not be equal to the value we specify in this case that value's zero so we'll add a zero to our query now when we run the query you'll notice our new column shows the percent of small bags in the total bags count we'd get the same results if we use an exclamation mark followed by an equal sign in place of the less than and greater than signs note that this is one way for doing it but there are functions such as safe divide that also allow you to avoid this error those are just a couple of examples to get you started but with sql you can complete just about any calculation you want during your analysis and embedding the calculations in your queries will help you keep your analysis organized while getting your results the calculation methods we showed you here are just the beginning so look for more coming up see you soon good to see you as a data analyst you'll find that your calculations come in all shapes and sizes earlier we showed you how to do some of the more basic calculations in sql while basic calculations are great sometimes you'll need to group data before completing calculations the group by and order by commands help you do this these commands are usually paired with aggregate functions like sum or count we'll show you how you can use these commands and functions to calculate and summarize data from groups of rows in a table let's explore the group by command first group by is a command that groups rows that have the same values from a table into summary rows the group by command is used with select statements in a basic select from or select from where query group by comes at the end of the query all right let's try using group by we'll work with a database with data from a bike sharing system we want to find out how many rides people took on these bikes per year this data has several columns but for this task we only need the start time column since this data set isn't organized by date and the start time column isn't organized by year we'll need to include steps in our code to organize it we also want a total number of rides each year so we'll need to include a calculation in our query for this and depending on the questions we've been tasked with answering this might be the first step of many in our analysis we'll start our query with the select command we'll then add extract to our query the extract command lets us pull one part of a given date to use we'll extract the year from the start time column to do this we'll add an open parenthesis followed by year which lets the server know the part of the date we need then we'll add the from command and start time so we can get the year from all of the start times in that column we'll close the parentheses and then use as and the word year to name the column we're creating on the next line of the query we'll use the aggregate function count followed by an asterisk in parentheses this will count the bike rides in the start time column using the asterisk make sure that all the start times are counted in the data then we'll name our column number of rides with underscores between each word instead of spaces we'll add from and the database we're pulling on the next line in this case that's bigquery dash public dash data dot new underscore york dot city bike underscore trip and here's our group by command we'll use this to group the data by year so we'll type group by followed by year we can further organize our results by using the order by command adding this after group by orders the results we'll add year to order the data by year it's good to note that by default order by sorts data in ascending order now we can run our query to get the results the years are ordered starting with 2013 and ending with 2016. if we want to change this to descending order we can add the keyword d e s c to the end of the query and run it again but whichever order you use the group by and order by commands are great for helping us complete and organize a calculation for our analysis this is one way to include calculations when aggregating data and it's just one of many ways that sql helps keep your analysis running smoothly and moving forward there's still more to come on calculations in sql coming up we'll learn more about data validation see you soon [Music] earlier we covered data validation a spreadsheet function that adds drop-down lists to cells using data validation lets you control what can and can't be entered into your worksheet one of its uses is protecting structured data and formulas in your spreadsheets but as useful as it is the data validation function is just one part of a larger data validation process this process involves checking and rechecking the quality of your data so that it is complete accurate secure and consistent while the data validation process is a form of data cleaning you should use it throughout your analysis if this all sounds familiar to you that's good ensuring you have good data is super important and in my opinion it's kind of fun because you can pair your knowledge of the business with your technical skills this will help you understand your data check that it's clean and make sure you're aligning with your business objectives in other words it's what you do to make sure your data makes sense keep in mind you'll build your business knowledge with time and experience and here's a pro tip asking as many questions as possible whenever you need to will make this much easier okay let's say we're analyzing some data for a furniture retailer we want to check that the values in the purchase price column are always equal to the number of items sold times the product price so we'll add a formula and a new column to recalculate the purchase prices using a multiplication formula now comparing the totals there's at least one value that doesn't match the value in the purchase price column we need to find an answer to help us move forward with our analysis by doing some research and asking questions we find that there is a discount of 30 when customers buy five or more certain items if we hadn't run this check we could have missed this completely you've learned that as an analyst calculations are a big part of your job so it's important that whenever you do calculations you always check to make sure you've done them in the right way sometimes you'll run data validation checks that are common sense checks for example let's say you're working on an analysis to figure out the effectiveness of in-store promotions for a business that's only open on weekdays you check to make sure that there's no sales data for saturday and sundays if your data does show sales on weekends it might not be a problem with the data itself it might not even be a problem at all there might be a good reason like maybe your business hosts special events on saturdays and sundays then you would have sales for those weekends you still might want to leave out the weekend sales in your analysis if your objective is only to look at the weekdays but doing this data validation might save you from miscalculations and other errors in your analysis you should always do data validation no matter what analysis tool you're using in an earlier video we used sql to analyze some data about avocados one of the queries was a check to make sure the data showing the total number of bags was the sum of small large and extra large bags by running this query we were able to determine that the total number column was accurate we compared our two columns briefly in that video but to be absolutely sure that there's no issues with the data values in those columns we could have also run another query in this query we would select all using the asterisk and from the avocado prices data set in our where clause we'd also type out where our calculated total does not equal the total bags column if no values are returned we can be sure that the values in the total bags column are accurate and that led us to continue our analysis but when we tried to find what percent of the total number of bags was small we ran into a small problem we received an error message about dividing by zero we fixed that error by adjusting our query if we had linked that query to a presentation that went to our stakeholders they'd show us the divide by zero error instead of the figures we wanted by building in these types of checks as part of your data validation process you can avoid errors in your analysis and complete your business objectives to make everyone happy and trust me it's a great feeling when you do and another great feeling is knowing that you've made it through another video and learned something new and we have more where that came from coming soon see ya [Music] now if you're like me you always have sticky notes available nearby to write a reminder or figure out a quick math problem sticky notes are useful and important but they're also disposable since you usually only need them for a short time before you recycle them data analysts have their own version of sticky notes when they're working in sql they're called temporary tables and we're here to find out what they're all about a temporary table is a database table that is created and exists temporarily on a database server temp tables as we call them store subsets of data from standard data tables for a certain period of time then they're automatically deleted when you end your sql database session since temp tables aren't stored permanently they're useful when you only need a table for a short time to complete analysis tasks like calculations for example you might have a lot of tables you're performing calculations on at the same time if you have a query that needs to join seven or eight of them you could join the two or three tables having the fewest number of rows and store their output in a temp table you could then join this temp table to one of the other bigger tables another example is when you have lots of different databases you're running queries on you can run these initial queries in each separate database and then use a temp table to collect the results of all of these queries the final report query would then run on the temporary table you might not be able to make use of this reporting structure without temporary tables they're also useful if you've got a large number of records in a table and you need to work with a small subset of those records repeatedly to complete some calculations or other analysis so instead of filtering the data over and over to return the subset you can filter the data once and store it in a temporary table then you can run your queries using the temporary table you've created imagine that you've been asked to analyze data about the bike sharing system we looked at earlier you only need to analyze the data for bike trips that were over 60 minutes or longer but you have several questions to answer about the specific data using a temporary table will let you run several queries about this data without having to keep filtering it there's different ways to create temporary tables in sql depending on the relational database management system you're using we'll explore some of these options soon for this scenario we'll use bigquery we'll apply a with clause to our query the with clause is a type of temporary table that you can query from multiple times the with clause approximates a temporary table basically this means it creates something that does the same thing as a temporary table even if it doesn't add a table to the database you're working in for others to see you can still see your results and anyone who needs to review your work can see the code that led to your results let's get this query started we'll start this query with the with command we'll then name our temp table trips underscore over underscore 1 underscore hr then we'll type the as command in an open parenthesis on a new line we'll use the select from where structure for our subquery we'll type select followed by an asterisk you might remember the asterisk means you're selecting all the columns in the table now we'll type the from command and name the database set we're pulling from bigquery dash public dash data dot new underscore york dot city bike underscore trips next we'll add a where clause with the condition that the length of the bike trips we need in our temp table are greater than or equal to 60 minutes in the query it goes like this trip duration space greater than sign equal sign space 60. finally we'll add a closed parenthesis on a new line to end our sub query and that sets up our temporary table now we can run queries that'll only return results for trips that lasted 60 minutes or longer let's try one since we're working in our version of a temp table we don't need to open a new query instead we'll label our queries before we add our code to describe what we're doing for this query we'll type two hashtags this tells the server that this is a description and not part of the code next we'll add the query description count how many trips are 60 plus minutes long and then we'll add our query select then on a new line count with an asterisk in parentheses and as followed by cnt to name the column with our count next we'll add from and the name we're using for our version of a temporary table trips over one hour when we run our query the results show the total number of bike trips from the data set that lasted 60 minutes or longer we can keep running queries on this temp table over and over as long as we're looking to analyze bike trips that were 60 minutes and over and if you need to end your session and start a new runtime later most servers store the code used in temp tables you'll just need to recreate the table by running the code when you use temporary tables you make your own work more efficient naming and using temp tables can help you deal with a lot of data in a more streamlined way so you don't get lost repeating query after query with the same code that you could just include in a temp table and here's another bonus to using temp tables they can help your fellow team members too with temp tables your code is usually less complicated and easier to read and understand which your team will appreciate once you start to explore temporary tables on your own you might not be able to stop don't say i didn't warn you coming up we'll explore even more things you can do with temp tables see you soon [Music] earlier we introduced you to temporary tables they're a great resource to use during your analysis because they help you keep your sql code organized and efficient you learned how to use a with clause to create a type of temporary table and now we'll get into some other ways you can create temp tables along with the pros and cons they present and that's one of the great things about data analytics there's almost always more than one way to get your analysis done the select into statement is a good example of how to get a temp table done this statement copies data from one table into a new table but it doesn't add the new table to the database it's useful if you want to make a copy of a table with a specific condition like a query with a where clause so far we've been using bigquery to show you how sql works but bigquery doesn't currently recognize the select into command instead here's an example of how a select into statement might look in another rdbms in this statement a new table named africa sales is created using the data from the global sales database about the african region using select into is a good practice when you want to keep the database uncluttered and you don't need other people using the table now if lots of people will be using the same table then the create table statement might be the better option this statement does add the table into the database so if everyone needs access to the africa sales table your query would start with create table followed by the same select from where query as in the select into statement in most relational database management systems or rdbms you can add metadata to describe the data that's contained in the table you've created this can help make the table easier to understand for anyone using it the create table statement is also useful for tables that are more complex for example if the code is difficult to replicate the making attempt table in this way means it'll be safe for you to access later the way you create a temporary table using the with clause or a select into or create table statement is usually up to you and your needs the more you work in sql the more you might have preferences as well especially since there's more than one way to create temporary tables you may also find that you're working in an rdbms that uses a different syntax for example you might need to use a create temp table statement instead of create table here's some good news the syntax that you need for each unique rdbms is usually pretty easy to find with a quick online search but no matter how or where you create temporary tables there isn't much downside to them it's good to note though that sometimes building a temp table can interrupt your workflow again that'll depend on your objectives and your preferences you can repeat your code over and over instead of making a temp table but that usually leaves your queries less readable and more vulnerable to typos as you continue exploring the world of data analytics you'll find that temporary tables are just one of the many resources you'll be able to use and the more you use them the easier it'll be to navigate that world congratulations on finishing this video from the google data analytics certificate access the full experience including job search help and start to earn the official certificate by clicking the icon or the link in the description watch the next video in the course by clicking here and subscribe to our channel for more from upcoming google career certificates
Info
Channel: Google Career Certificates
Views: 2,197
Rating: 5 out of 5
Keywords: Grow with Google, Career Change, Tech jobs, Google Career Certificate, Google Career Certificates, Job skills, Coursera, Certification, Google, professional certificates, professional certificate program, Data analyst, Data analytics, Data analysis, Data analytics for beginners, What is data analytics, Sql, Data, R Programming, Spreadsheets, pivot table, conditional formatting, sumproduct, SQL tutorial for beginners, pivot table excel, excel pivot table tutorial
Id: WutHuhEWG1g
Channel Id: undefined
Length: 72min 59sec (4379 seconds)
Published: Fri Jun 18 2021
Related Videos
Note
Please note that this website is currently a work in progress! Lots of interesting data and statistics to come.