How to Clean Raw Data in SQL | Google Data Analytics Certificate

Video Statistics and Information

Video
Captions Word Cloud
Reddit Comments
Captions
this video is part of the google data analytics certificate providing you with job-ready skills to start or advance your career in data analytics get access to practice exercises quizzes discussion forums job search help and more on casera and you can earn your official certificate visit grow.google datacert to enroll in the full learning experience today so before we go over all the ways data analysts use sql to clean data i want to formally introduce you to sql we've talked about sql a lot already you've seen some databases and some basic functions in sql and you've even seen how sql can be used to process data but now let's actually define sql sql is structured query language that analysts use to work with databases data analysts usually use sql to do with large data sets because it can handle huge amounts of data and i mean trillions of rows that's a lot of rows to wrap your head around so let me give you an idea about how much data that really is imagine a data set that contains the name of all 8 billion people in the world it would take the average person 101 years to read all 8 billion names sql can process this in seconds personally i think that's pretty cool other tools like spreadsheets might take a really long time to process that much data which is one of the main reasons data analysts choose to use sql when dealing with big data sets let me give you a short history on sql development on sql actually began in the early 70s in 1970 edgar f codd developed the theory about relational databases you might remember learning about relational databases a while back this is a database that contains a series of tables that can be connected to form relationships at the time ibm was using a relational database management system called system r well ibm computer scientists were trying to figure out a way to manipulate and retrieve data from ibm system r their first query language was hard to use so they quickly moved on to the next version sql in 1979 after extensive testing sql now just spelled sql was released publicly by 1986 sql had become the standard language for relational database communication and it still is this is another reason why data analysts choose sql it's a well-known standard within the community the first time i used sql to pull data from a real database was for my first job as a data analyst i didn't have any background knowledge about sql before that i only found out about it because it was a requirement for that job the recruiter for that position gave me a week to learn it so i went online and researched it and ended up teaching myself sql they actually gave me a written test as part of the job application process i had to write sql queries and functions on a whiteboard but i've been using sql ever since and i really like it and just like i learned sql on my own i wanted to remind you that you can figure things out yourself too there's tons of great online resources for learning so don't let one job requirement stand in your way without doing some research first now that we know a little more about why analysts choose to work with sql when they're handling a lot of data and a little bit about the history of sql we'll move on and learn some practical applications for it coming up next we'll check out some of the tools we learned in spreadsheets and figure out if any of those apply to working in sql spoil alert they do see you soon so far we've learned about both spreadsheets and sql while there's lots of differences between spreadsheets and sql you'll find some similarities too so let's check out what spreadsheets and sql have in common and how they're different spreadsheets and sql actually have a lot in common specifically there's tools you can use in both spreadsheets and sql to achieve similar results we've already learned about some tools for cleaning data in spreadsheets which means you already know some tools that you can use in sql for example you can still perform arithmetic use formulas and join data when you're using sql so we'll build on the skills we've learned in spreadsheets and use them to do even more complex work in sql here's an example of what i mean by more complex work if we were working with health data for a hospital we need to be able to access and process a lot of data we might need demographic data like patients names birthdays and addresses information about their insurance or past visits public health data or even user-generated data to add to their patient records all of this data is being stored in different places maybe even in different formats and each location might have millions of rows and hundreds of related tables this is way too much data to input manually even for just one hospital that's where sql comes in handy instead of having to look at each individual data source and record it in our spreadsheet we can use sql to pull all this information from different locations in our database now let's say we want to find something specific in all this data like how many patients with a certain diagnosis came in today in a spreadsheet we can use the countif function to find that out or we can combine the count and where queries in sql to find out how many rows match our search criteria this will give us similar results but works with a much larger and more complex set of data next let's talk about how spreadsheets and sql are different first it's important to understand that spreadsheets and sql are different things spreadsheets are generated with the program like excel or google sheets these programs are designed to execute certain built-in functions sql on the other hand is a language that can be used to interact with database programs like oracle mysql or microsoft sql server the differences between the two are mostly in how they're used if a data analyst was given data in the form of a spreadsheet they'll probably do their data cleaning and analysis within that spreadsheet but if they're working with a large data set with more than a million rows or multiple files within a database it's easier faster and more repeatable to use sql sql can access and use a lot more data because it can pull information from different sources in the database automatically unlike spreadsheets which only have access to the data you input this also means the data is stored in multiple places a data analyst might use spreadsheets stored locally on their hard drive or their personal cloud when they're working alone but if they're on a larger team with multiple analysts who need to access and use data stored across a database sql might be a more useful tool because of these differences spreadsheets and sql are used for different things as you already know spreadsheets are good for smaller data sets and when you're working independently plus spreadsheets have built-in functionalities like spell check that can be really handy sql is great for working with larger data sets even trillions of rows of data and because sql has been the standard language for communicating with databases for so long it can be adapted and used for multiple database programs sql also records changes in queries which makes it easy to track changes across your team if you're working collaboratively next we'll learn more queries and functions in sql that will give you some new tools to work with you might even learn how to use spreadsheet tools in brand new ways see you next time [Music] so far we've learned that sql has some of the same tools as spreadsheets but on a much larger scale in this video we'll learn some of the most widely used sql queries that you can start using for your own data cleaning and eventual analysis let's get started we've talked about queries as requests you put into the database to ask it to do things for you queries are a big part of using sql it's structured query language after all queries can help you do a lot of things but there's some common ones that data analysts use all the time so let's start there first i'll show you how to use the select query i've called this one out before but now i'll add some new things for us to try out right now the table viewer is blank because we haven't pulled anything from the database yet for this example the store we're working with is hosting a giveaway for customers in certain cities we have a database containing customer information that we can use to narrow down which customers are eligible for the giveaway so let's do that now we can use select to specify exactly what data we want to interact with in a table and if we combine select with from we can pull data from any table in this database as long as we know what the columns and rows are named we might want to pull the data about customer names and cities from one of the tables to do that we can input select name comma city from customer underscore data dot customer underscore address to get this information from the customer underscore address table which lives in the customer underscore data data set so select and from help specify what data we want to extract from the database and use we can also insert new data into a database or update existing data for example maybe we have a new customer that we want to insert into this table we can use the insert into query to put that information in so let's start with where we're trying to insert this data the customer underscore address table we also want to specify which columns we're adding this data to by typing their names in the parentheses that way sql can tell the database exactly where we're inputting new information then we'll tell it what values we're putting in run the query and just like that it added it to our table for us now let's say we just need to change the address of a customer well we can tell the database to update it for us to do that we need to tell it we're trying to update the customer underscore address table then we need to let it know what value we're trying to change but we also need to tell it where we're making that change specifically so that it doesn't change every address in the table there now this one customer's address has been updated if we want to create a new table for this database we can use the create table if not exists statement keep in mind just running a sql query doesn't actually create a table for the data we extract it just stores it in our local memory to save it we'll need to download it as a spreadsheet or save the results into a new table as a data analyst there is a few situations where you might need to do just that it really depends on what kind of data you're pulling and how often if you're only using a total number of customers you probably don't need a csv file or new table in your database if you're using the total number of customers per day to do something like track a weekend promotion in a store you might download that data as a csv file so you can visualize it in a spreadsheet but if you're being asked to pull this trend on a regular basis you can create a table that will automatically refresh with the query you've written that way you can directly download the results whenever you need them for a report another good thing to keep in mind if you're creating lots of tables within a database you'll want to use the drop table if exists statement to clean up after yourself it's good housekeeping you probably won't be deleting existing tables very often after all that's the company's data and you don't want to delete important data from their database but you can make sure you're cleaning up the tables you've personally made so that there aren't old or unused table with redundant information cluttering the database there now you've seen some of the most widely used sql queries in action there's definitely more query keywords for you to learn and unique combinations that will help you work within databases but this is a great place to start coming up we'll learn even more about queries in sql and how to use them to clean our data see you next time [Music] now that we know some basic sql queries and spent some time working in a database let's apply that knowledge to something else we've been talking about preparing and cleaning data you already know that cleaning and completing your data before you analyze it is an important step so in this video i'll show you some ways sql can help you do just that including how to remove duplicates as well as four functions to help you clean string variables earlier we covered how to remove duplicates in spreadsheets using the remove duplicates tool in sql we can do the same thing by including distinct in our select statement for example let's say the company we work for has a special promotion for customers in ohio we want to get the customer ids of customers who live in ohio but some customer information has been entered multiple times we can get these customer ids by writing select customer underscore id from customer underscore data dot customer underscore address this query will give us duplicates if they exist in the table if customer id 9080 shows up three times in our table our results will have three of that customer id but we don't want that we want a list of all unique customer ids to do that we add distinct to our select statement by writing select distinct customer underscore id from customer underscore data dot customer underscore address now the customer id 9080 will show up only once in our results you might remember we talked before about text strings as a group of characters within a cell commonly composed of letters numbers or both these text strings need to be cleaned sometimes maybe they've been entered differently in different places across your database and now they don't match in those cases you'll need to clean them before you can analyze them so here are some functions you can use in sql to handle string variables you might recognize some of these functions from when we talked about spreadsheets now it's time to see them work in a new way pull up the data set we shared right before this video and you can follow along step by step with me during the rest of this video the first function i want to show you is length which we've encountered before if we already know the length our string variables are supposed to be we can use length to double check that our string variables are consistent for some databases this query is written as len but it does the same thing let's say we're working with the customer underscore address table from our earlier example we can make sure that all country codes have the same length by using length on each of these strings so to write our sql query let's first start with select and from we know our data comes from the customer underscore address table within the customer underscore data data set so we add customer underscore data dot customer underscore address after the from clause then under select we'll write length and then the column we want to check country to remind ourselves what this is we can label this column in our results as letters underscore in underscore country so we add as letters underscore in underscore country after length parentheses country the result we get is a list of the number of letters in each country listed for each of our customers it seems that almost all of them are twos which means the country field contains only two letters but we notice one that has three that's not good we want our data to be consistent so let's check out which countries were incorrectly listed in our table we can do that by putting the length parentheses country parenthesis function that we created into the where clause because we're telling sql to filter the data to show only customers whose country contains more than two letters so now we'll write select country from customer underscore data dot customer underscore address where length parentheses country parentheses greater than two when we run this query we now get the two countries where the number of letters is greater than the two we expect to find the incorrectly listed countries show up as usa instead of us if we created this table then we could update our table so that this entry shows up as us instead of usa but in this case we didn't create this table so we shouldn't update it we still need to fix this problem so we can pull a list of all the customers in the u.s including the two that have usa instead of us the good news is that we can account for this error in our results by using the substring function in our sql query to write our sql query let's start by writing the basic structure select from where we know our data is coming from the customer underscore address table from the customer underscore data data set so we type in customer underscore data dot customer underscore address after from next we tell sql what data we want it to give us we want all the customers in the us by their ids so we type in customer underscore id after select finally we want sql to filter out only american customers so we use the substring function after the where clause we're going to use the substring function to pull the first two letters of each country so that all of them are consistent and only contain two letters to use the substring function we first need to tell sql the column where we found this error country then we specify which letter to start with we want sql to pull the first two letters so we're starting with the first letter so we type in one then we need to tell sql how many letters including this first letter to pull since we want the first two letters we need sql to pull two total letters so we type in two this will give us the first two letters of each country we want us only so we'll set this function to equals u.s when we run this query we get a list of all customer ids of customers whose country is the u.s including the customers that had usa instead of u.s going through our results it seems like we have a couple duplicates where the customer id is shown multiple times remember how we get rid of duplicates we add distinct before customer underscore id so now when we run this query we have our final list of customer ids of the customers who live in the u.s finally let's check out the trim function which you've come across before this is really useful if you find entries with extra spaces and need to eliminate those extra spaces for consistency for example let's check out the state column in our customer underscore address table just like we did for the country column we want to make sure the state column has the consistent number of letters so let's use the length function again to learn if we have any state that has more than two letters which is what we would expect to find in our data table we start writing our sql query by typing the basic sql structure of select from where we're working with the customer underscore address table in the customer underscore data data set so we type in customer underscore data dot customer underscore address after from next we tell sql what we want it to pull we want it to give us any state that has more than two letters so we type in state after select finally we want sql to filter for states that have more than two letters this condition is written in the where clause so we type in length parentheses state parentheses and that it must be greater than two because we want the states that have more than two letters we want to figure out what the incorrectly listed states look like if we have any when we run this query we get one result we have one state that has more than two letters but hold on how can this state that seems like it has two letters o and h for ohio have more than two letters we know that there are more than two characters because we use the length parenthesis state parentheses greater than two statement in the where clause when filtering our results so that means the extra characters that sql is counting must then be a space there must be a space after the h this is where we would use the trim function the trim function removes any spaces so let's write a sql query that accounts for this error let's say we want a list of all customer ids of the customers who live in oh for ohio we start with the basic sql structure select from where we know the data comes from the customer underscore address table and the customer underscore data data set so we type in customer underscore data dot customer underscore address after from next we tell sql what data we want we want sql to give us the customer ids of customers who live in ohio so we type in customer underscore id after select since we know we have some duplicate customer entries we'll go ahead and type in distinct before customer id to remove any duplicate customer ids from appearing in our results finally we want sql to give us the customer ids of the customers who live in ohio we're asking sql to filter the data so this belongs in the where clause here's where we'll use the trim function to use the trim function we tell sql the column we want to remove spaces from which is state in our case and we want only ohio customers so we type in equals oh that's it we have all customer ids of the customers who live in ohio including that customer with the extra space after the h making sure that your string variables are complete and consistent will save you a lot of time later by avoiding errors or miscalculations that's why we clean data in the first place hopefully functions like length substring and trim will give you the tools you need to start working with string variables in your own data sets next up we'll check out some other ways you can work with strings and more advanced cleaning functions then you'll be ready to start working in sql on your own see you soon [Music] so far we've gone over some basic sql queries and functions that can help you clean your data we've also checked out some ways you can deal with string variables in sql to make your job easier get ready to learn more functions for dealing with strings in sql trust me these functions will be really helpful in your work as a data analyst in this video we'll check out strings again and learn how to use the cast function to correctly format data when you import data that doesn't already exist in your sql tables the data types from the new data set might not have been imported correctly this is where the cast function comes in handy basically cast can be used to convert anything from one data type to another let's check out an example imagine we're working with lauren's furniture store the owner has been collecting transaction data for the past year but she just discovered that they can't actually organize their data because it hadn't been formatted correctly so we'll help her by converting her data to make it useful again for example let's say we want to sort all purchases by purchase underscore price in descending order that means we want the most expensive purchase to show up first in our results to write the sql query we start with the basic sql structure select from where we know the data is stored in the customer underscore purchase table in the customer underscore data data set so we write customer underscore data dot customer underscore purchase after from next we tell sql what data to give us in the select clause we want to see the purchase underscore price data so we type purchase underscore price after select next is the where clause we are not filtering out any data since we want all purchase prices shown so we can take out the where clause finally to sort the purchase underscore price in descending order we type order buy purchase underscore price d e s c at the end of our query let's run this query we see that 89.85 shows up at the top with 799.99 below it but we know that 799.99 is a bigger number than 89.85 the database doesn't recognize that these are numbers so it didn't sort them that way if we go back to the customer underscore purchase table and take a look at its schema we can see what data type the database thinks purchase underscore price is it says here the database thinks purchase underscore price is a string when in fact it is a float which is a number that contains a decimal that is why 89.85 shows up before 799.99 when we start letters we start from the first letter before moving on to the second letter so if we want to sort the words apple and orange in descending order we start with the first letters a and o since o comes after a orange will show up first then apple the database did the same with 89.8 and 799.99 it started with the first letter which in this case was a 8 and 7 respectively since 8 is bigger than 7 the database sorted 89.85 first and then 799.99 because the database treated these as text strings the database doesn't recognize these strings as floats because they haven't been typecast to match that data type yet typecasting means converting data from one type to another which is what we'll do with the cast function we use the cast function to replace purchase underscore price with a new purchase underscore price that the database recognizes as float instead of string we start by replacing purchase underscore price with cast then we tell sql the field we want to change which is the purchase underscore price field next is the data type we want to change purchase underscore price to which is the float data type we also need to sort by this new field so we change purchase underscore price after order by to cast parentheses purchase underscore price as float 64. this is how we use the cast function to allow sql to recognize the purchase underscore price column as floats instead of text strings now we can sort our purchases by purchase underscore price and just like that lauren's furniture store has data that can actually be used for analysis as a data analyst you'll be asked to locate and organize data a lot which is why you want to make sure you convert between data types early on businesses like our furniture store are interested in timely sales data and you need to be able to account for that in your analysis the cast function can be used to change strings into other data types too like date and time as a data analyst you may find yourself using data from various sources part of your job is to make sure the data from those sources is recognizable and usable in your database so that you won't run into any issues with your analysis and now you know how to do that the cast function is one great tool you can use when you're cleaning data and coming up we'll cover some other advanced functions that you can add to your toolbox see you soon great to see you again so far we've seen some sql functions in action in this video we'll go over more uses for cast and then learn about concat and coalesce let's get started so earlier we talked about the cast function which lets us type cast text strings into floats i called out that the cast function can be used to change into other data types too so let's check out another example of how you can use cast in your own data work we've got the transaction data we were working with from our lawrence furniture store example but now we'll check out the purchase date field the furniture store owner has asked us to look at purchases that occurred during their sales promotion period in december let's write a sql query that will pull date and purchase underscore price for all purchases that occurred between december 1st 2020 and december 31st 2020 we start by writing the basic sql structure select from where we know the data comes from the customer underscore purchase table in the customer underscore data data set so we write customer underscore data dot customer underscore purchase after from next we tell sql what data to pull since we want date and purchase underscore price we add them into the select statement finally we want sql to filter for purchases that occurred in december only so we type date between 2020-12-01 and 2020-12-31 in the where clause let's run the query four purchases occurred in december but the date field looks odd that's because the database recognizes the date field as date time which consists of the date and time our sql query still works correctly even if the date field is date time instead of date but we can tell sql to convert the date field into the date data type so we see just the date and not the time to do that we use the cast function again so we'll use a cast function to replace the date field in our select statement with the new date field that will show the date and not the time we can do that by typing cast and adding the date as the field we want to change then we tell sql the data type we want instead which is the date data type there now we can have cleaner results for purchases that occurred during the december sales period cast is a super useful function for cleaning and sorting data which is why i wanted you to see it in action one more time next up let's check out the concat function concat lets you add strings together to create new text strings that can be used as unique keys going back to our customer underscore purchase table we see that the furniture store sells different colors of the same product the owner wants to know if customers prefer certain colors so the owner can manage store inventory accordingly the problem is the product underscore code is the same regardless of the product color we need to find another way to separate products by color so we can tell if customers prefer one color over the others so we'll use concat to produce a unique key that'll help us tell the products apart by color and count them more easily let's write our sql query by starting with the basic structure select from where we know our data comes from the customer underscore purchase table in the customer underscore data data set so we type customer underscore data dot customer underscore purchase after from next we tell sql what data to pull we use the concat function here to get that unique key of product and color so we type concat the first column we want product underscore code and the other column we want product underscore color finally let's say we want to look at couches so we filter for couches by typing product equals couch in the where clause now we can count how many times each couch was purchased and figure out if customers preferred one color over the others with concat the furniture store can find out which color couches are the most popular and order more i've got one last advanced function to show you coalesce coalesce can be used to return non-null values in a list null values are missing values if you have a field that's optional in your table it'll have null in that field for rows that don't have appropriate values to put there let's open the customer underscore purchase table so i can show you what i mean in the customer underscore purchase table we can see a couple rows where product information is missing that is why we see nulls there but for the rows where product name is null we see that there is product underscore code data that we can use instead we'd prefer sql to show us the product name like bed or couch because it's easier for us to read but if the product name doesn't exist we can tell sql to give us the product underscore code instead that is where the coalesce function comes into play let's say we wanted a list of all products that were sold we want to use the product name column to understand what kind of product was sold so we write our sql query with the basic sql structure select from where we know our data comes from customer underscore purchase table in the customer underscore data data set so we type customer underscore data dot customer underscore purchase after from next we tell sql the data we want we want a list of product names but if names aren't available then give us the product code here is where we type coalesce then we tell sql which column to check first product and which column to check second if the first column is null product underscore code we'll name this new field as product underscore info finally we are not filtering out any data so we can take out the where clause this gives us product information for each purchase now we have a list of all products that were sold for the owner to review coalesce can save you time when you're making calculations too by skipping any null values and keeping your math correct those are just some of the advanced functions you can use to clean your data and get it ready for the next step in the analysis process you'll discover more as you continue working in sql but that's the end of this video and this module great work we've covered a lot of ground you learn the different data cleaning functions and spreadsheets and sql and the benefits of using sql to deal with large data sets we also added some sql formulas and functions to your toolkit and most importantly we got to experience some of the ways that sql can help you get data ready for your analysis after this you'll get to spend some time learning how to verify and report your cleaning results so that your data is squeaky clean and your stakeholders know it but before that you've got another weekly challenge to tackle you've got this some of these concepts might seem challenging at first but they'll become second nature to you as you progress in your career it just takes time and practice speaking of practice feel free to go back to any of these videos and re-watch or even try some of these commands on your own good luck and i'll see you again when you're ready congratulations on finishing this video from the google data analytics certificate access the full experience including job search help and start to earn the official certificate by clicking the icon or the link in the description watch the next video in the course by clicking here and subscribe to our channel for more from upcoming google career certificates
Info
Channel: Google Career Certificates
Views: 6,505
Rating: 4.9478259 out of 5
Keywords: Grow with Google, Career Change, Tech jobs, Google Career Certificate, Google Career Certificates, Job skills, Coursera, Certification, Google, professional certificates, professional certificate program, Data analyst, Data analytics, Data analysis, Data analytics for beginners, What is data analytics, Sql, Data, R Programming, Spreadsheets, SQL tutorial, Query, Queries, How to use SQL, SQL tutorial for beginners, Learn SQL, Spreadsheet, What is SQL
Id: r7GDQt3lZxs
Channel Id: undefined
Length: 42min 43sec (2563 seconds)
Published: Fri Jun 11 2021
Related Videos
Note
Please note that this website is currently a work in progress! Lots of interesting data and statistics to come.