Understanding Clean Data | Google Data Analytics Certificate

Captions
This video is part of the Google Data Analytics Certificate, providing you with job-ready skills to start or advance your career in data analytics. Get access to practice exercises, quizzes, discussion forums, job search help, and more on Coursera, where you can earn your official certificate. Visit grow.google/datacert to enroll in the full learning experience today.

Can you guess what inaccurate or bad data costs businesses every year? Thousands of dollars? Millions? Billions? Well, according to IBM, the yearly cost of poor-quality data is $3.1 trillion in the US alone. That's a lot of zeros! Now, can you guess the number one cause of poor-quality data? It's not a new system implementation or a computer technical glitch. The most common factor is actually human error.

Here's a spreadsheet from a law office. It shows customers, the legal services they bought, the service order number, how much they paid, and the payment method. Dirty data can be the result of someone typing in a piece of data incorrectly, inconsistent formatting, blank fields, or the same piece of data being entered more than once, which creates duplicates. Dirty data is data that's incomplete, incorrect, or irrelevant to the problem you're trying to solve. When you work with dirty data, you can't be sure that your results are correct. In fact, you can pretty much bet they won't be.

Earlier, you learned that data integrity is critical to reliable data analytics results, and clean data helps you achieve data integrity. Clean data is data that's complete, correct, and relevant to the problem you're trying to solve. When you work with clean data, you'll find that your projects go much more smoothly.

I remember the first time I witnessed firsthand how important clean data really is. I had just started using SQL, and I thought it worked like magic. I could have the computer sum up millions of numbers, saving me tons of time and effort. But I quickly discovered that only works when the data is clean. If there was even one accidental letter in a column that should only have numbers, the computer wouldn't know what to do, so it would throw an error, and suddenly I was stuck. There's no way I could add up millions of numbers by myself, so I had to clean up that data to make it work. The good news is that there are plenty of effective processes and tools to help you do that.

Coming up, you'll gain the skills and knowledge you need to make sure the data you work with is always clean. Along the way, we'll dig deeper into the difference between clean and dirty data and why clean data is so important. We'll also talk about different ways to clean your data and common problems to look for during the process. Ready to start? Let's do it!

Clean data is incredibly important for effective analysis. If a piece of data is entered into a spreadsheet or database incorrectly, or if it's repeated, or if a field is left blank, or if data formats are inconsistent, the result is dirty data. Small mistakes can lead to big consequences in the long run.

Now, I'll be completely honest with you: data cleaning is like brushing your teeth. It's something you should do, and do properly, because otherwise it can cause serious problems. For teeth, that might be cavities or gum disease. For data, that might be costing your company money, or an angry boss. But here's the good news: if you keep brushing twice a day, every day, it becomes a habit. Soon you don't even have to think about it. It's the same with data, and trust me, it'll make you look great when you take the time to clean up that dirty data.

As a quick refresher, dirty data is incomplete, incorrect, or irrelevant to the problem you're trying to solve. It can't be used in a meaningful way, which makes analysis very difficult, if not impossible. On the other hand, clean data is complete, correct, and relevant to the problem you're trying to solve. This allows you to understand and analyze information: identify important patterns, connect related information, and draw useful conclusions. Then you can apply what you learn to make effective decisions.
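The stray-letter problem described above is easy to reproduce. Here is a minimal Python sketch of the same idea (the transcript's anecdote involved SQL, and the column values below are invented): a single letter in a numeric column stops the whole sum, and cleaning the data first fixes it.

```python
# A toy numeric column where one cell has an accidental letter:
# "93O" contains the letter O, not a zero.
raw_column = ["1200", "875", "93O", "410"]

def strict_sum(values):
    # Mirrors what SQL or a spreadsheet does: any non-number is an error.
    return sum(int(v) for v in values)

def clean_then_sum(values):
    # One possible cleanup: fix the known O-for-0 typo, then convert.
    return sum(int(v.replace("O", "0")) for v in values)

try:
    strict_sum(raw_column)
    failed = False
except ValueError:
    failed = True  # the stray letter stopped the whole calculation

total = clean_then_sum(raw_column)  # 1200 + 875 + 930 + 410
```

The point is the same one made in the video: the computation itself is trivial, but it only works once every value in the column really is a number.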
In some cases, you won't have to do a lot of work to clean data. For example, when you use internal data that's been verified and cared for by your company's data engineers and data warehouse team, it's more likely to be clean.

Let's talk about some people you'll work with as a data analyst. Data engineers transform data into a useful format for analysis and give it a reliable infrastructure. This means they develop, maintain, and test databases, data processors, and related systems. Data warehousing specialists develop processes and procedures to effectively store and organize data. They make sure that data is available, secure, and backed up to prevent loss. When you become a data analyst, you can learn a lot by working with the person who maintains your databases to learn about their systems. If data passes through the hands of a data engineer or data warehousing specialist first, you know you're off to a good start on your project. There are a lot of great career opportunities as a data engineer or data warehousing specialist. If this kind of work sounds interesting to you, maybe your career path will involve helping organizations save lots of time, effort, and money by making sure their data is sparkling clean.

But even if you go in a different direction with your data analytics career and have the advantage of working with data engineers and warehousing specialists, you're still likely to have to clean your own data. It's important to remember: no dataset is perfect. It's always a good idea to examine and clean data before beginning analysis.

Here's an example. Let's say you're working on a project where you need to figure out how many people use your company's software program. You have a spreadsheet that was created internally and verified by a data engineer and a data warehousing specialist. Check out the column labeled "username." It might seem logical that you could just scroll down and count the rows to figure out how many users you have, but that won't work, because one person sometimes has more than one username. Maybe they registered from different email addresses, or maybe they have a work account and a personal account. In situations like this, you would need to clean the data by eliminating any rows that are duplicates. Once you've done that, there won't be any more duplicate entries, and your spreadsheet is ready to be put to work.

Okay, so far we've discussed working with internal data, but data cleaning becomes even more important when working with external data, especially if it comes from multiple sources. Let's say the software company from our example surveyed its customers to learn how satisfied they are with its software product. But when you review the survey data, you find that you have several nulls. A null is an indication that a value does not exist in a dataset. Note that it's not the same as a zero. In the case of our survey, a null would mean the customer skipped that question; a zero would mean they provided zero as their response.

To do your analysis, you would first need to clean this data. Step one would be to decide what to do with those nulls. You could either filter them out and communicate that you now have a smaller sample size, or you could keep them in and learn from the fact that the customers did not provide responses. There are lots of reasons why this could have happened. Maybe your survey questions weren't written as well as they could be; maybe they were confusing or biased, something we learned about earlier.

Okay, we've touched on the basics of cleaning internal and external data, but there's lots more to come. Soon we'll learn about the common errors to be aware of to ensure your data is complete, correct, and relevant. See you soon!

In this video, we'll focus on common issues associated with dirty data. These include spelling and other text errors; inconsistent labels, formats, and field lengths; missing data; and duplicates. This will help you recognize problems more quickly and give you the information you need to fix them when you
encounter something similar during your own analysis. This is incredibly important in data analytics.

Okay, let's go back to our law office spreadsheet. As a quick refresher, we'll start by checking out the different types of dirty data it shows. Sometimes someone might key in a piece of data incorrectly. Other times, they might not keep data formats consistent. It's also common to leave a field blank; that's also called a null, which we learned about earlier. And if someone adds the same piece of data more than once, that creates a duplicate.

So let's break that down; then we'll learn about a few other types of dirty data and strategies for cleaning it. Misspellings, spelling variations, mixed-up letters, inconsistent punctuation, and typos in general happen when someone types in a piece of data incorrectly. As a data analyst, you'll also deal with different currencies. For example, one dataset could be in US dollars and another in euros, and you don't want to get them mixed up. We want to find these types of errors and fix them, like this. You'll learn more about this soon.

Clean data depends largely on the data integrity rules that an organization follows, such as spelling and punctuation guidelines. For example, a beverage company might ask everyone working in its database to enter data about volume in fluid ounces instead of cups. It's great when an organization has rules like this in place; it really helps minimize the amount of data cleaning required. But it can't eliminate the need completely. Like we discussed earlier, there's always the possibility of human error.

The next type of dirty data our spreadsheet shows is inconsistent formatting. In this example, something that should be formatted as currency is shown as a percentage. Until this error is fixed, like this, the law office will have no idea how much money this customer paid for its services. We'll learn about different ways to solve this and many other problems soon.

We discussed nulls previously, but as a reminder, nulls are empty fields.
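The null-versus-zero distinction above is worth making concrete. Here's a small Python sketch with made-up survey scores, using None for a skipped question; it shows both options mentioned earlier, filtering nulls out (and reporting the smaller sample) or keeping them and measuring the non-response rate.

```python
# Invented satisfaction scores: None marks a skipped question,
# while 0 is a real answer the customer actually gave.
responses = [4, 0, None, 5, None, 3]

answered = [r for r in responses if r is not None]

# Option 1: filter the nulls out and report the smaller sample size.
average = sum(answered) / len(answered)
sample_size = len(answered)  # 4 responses, not 6

# Option 2: keep them in and learn from the non-responses.
skip_rate = responses.count(None) / len(responses)
```

Notice that treating the skipped questions as zeros would have dragged the average down, which is exactly the kind of silent error nulls can introduce.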
This kind of dirty data requires a little more work than just fixing a spelling error or changing a format. In this example, the data analyst would need to research which customer had a consultation on July 4, 2020. Then, when they find the correct information, they'd have to add it to the spreadsheet.

Another common type of dirty data is a duplicate. Maybe two different people added this appointment on August 13, not realizing that someone else had already done it. Or maybe the person entering the data hit copy and paste by accident. Whatever the reason, it's the data analyst's job to identify this error and correct it by deleting one of the duplicates.

Okay, now let's continue on to some other types of dirty data. The first has to do with labeling. To understand labeling, imagine trying to get a computer to correctly identify panda bears among images of all different kinds of animals. You need to show the computer thousands of images of panda bears, all labeled as "panda bear." Any incorrectly labeled picture, like the one here that's labeled just "bear," will cause a problem.

The next type of dirty data is an inconsistent field length. You learned earlier that a field is a single piece of information from a row or column of a spreadsheet. Field length is a tool for determining how many characters can be keyed into a field. Assigning a certain length to the fields in your spreadsheet is a great way to avoid errors. For instance, if you have a column for someone's birth year, you know the field length is four, because all years are four digits long. Some spreadsheet applications have a simple way to specify field lengths and make sure users can only enter a certain number of characters into a field. This is part of data validation. Data validation is a tool for checking the accuracy and quality of data before adding or importing it. Data validation is a form of data cleansing, which you'll learn more about soon. But first, you'll get familiar with more techniques for cleaning data. This is a very important part of the data analyst's job, and I look forward to sharing these data cleaning strategies with you.

Now that you're familiar with some of the most common types of dirty data, it's time to clean them up. As you've learned, clean data is essential to data integrity and reliable solutions and decisions. The good news is that spreadsheets have all kinds of tools you can use to get your data ready for analysis. The techniques for data cleaning will differ depending on the specific dataset you're working with, so we won't cover everything you might run into, but this will give you a great starting point for fixing the types of dirty data analysts find most often. Think of everything that's coming up as a teaser trailer of data cleaning tools: I'm going to give you a basic overview of some common tools and techniques, and then we'll practice them again later on.

Here we'll discuss how to remove unwanted data, clean up text to remove extra spaces and blanks, fix typos, and make formatting consistent. However, before removing unwanted data, it's always good practice to make a copy of the dataset. That way, if you remove something that you end up needing in the future, you can easily access it and put it back in the dataset. Once that's done, you can move on to getting rid of duplicates or data that isn't relevant to the problem you're trying to solve. Typically, duplicates appear when you're combining datasets from more than one source or using data from multiple departments within the same business.

You've already learned a bit about duplicates, but let's practice removing them once more, now using this spreadsheet, which lists members of a professional logistics association. Duplicates can be a big problem for data analysts, so it's really important that you can find and remove them before any analysis starts. Here's an example of what I'm talking about. Let's say this association has duplicates of one person's $500 membership in its database. When the data is summarized, the analyst would think $1,000 had been paid by this member and would make decisions based on that incorrect data, when in reality this member only paid $500. These problems can be fixed manually, but most spreadsheet applications also offer lots of tools to help you find and remove duplicates.

Now, irrelevant data, which is data that doesn't fit the specific problem that you're trying to solve, also needs to be removed. Going back to our association membership list example, let's say a data analyst was working on a project that focused only on current members. They wouldn't want to include information on people who are no longer members, or who never joined in the first place. Removing irrelevant data takes a little more time and effort, because you have to figure out the difference between the data you need and the data you don't. But believe me, making those decisions will save you a ton of effort down the road.

The next step is removing extra spaces and blanks. Extra spaces can cause unexpected results when you sort, filter, or search through your data, and because these characters are easy to miss, they can lead to unexpected and confusing results. For example, if there's an extra space in a member ID number, when you sort the column from lowest to highest, that row will be out of place. To remove these unwanted spaces or blank cells, you can delete them yourself, or you can rely on your spreadsheet application, which offers lots of great functions for removing spaces and blanks automatically.

The next data cleaning step involves fixing misspellings, inconsistent capitalization, incorrect punctuation, and other typos. These types of errors can lead to some big problems. Let's say you have a database of emails that you use to keep in touch with your customers. If some emails have misspellings, a period in the wrong place, or any other kind of typo, not only do you run the risk of sending an email to the wrong people, you also run the risk of spamming random people.
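The $500 duplicate scenario above can be sketched in a few lines of Python. The member records are invented; the point is that a naive sum double-counts the duplicated member, while de-duplicating on the member ID first (roughly what a spreadsheet's "Remove duplicates" tool does) gives the correct total.

```python
# Invented membership rows, with one accidental duplicate entry.
rows = [
    {"member_id": "JT2PM1", "dues": 500},
    {"member_id": "JT2PM1", "dues": 500},  # duplicate of the row above
    {"member_id": "QX9RL4", "dues": 100},
]

naive_total = sum(r["dues"] for r in rows)  # double-counts the $500 member

# Keep only the first row seen for each member_id.
seen, unique_rows = set(), []
for r in rows:
    if r["member_id"] not in seen:
        seen.add(r["member_id"])
        unique_rows.append(r)

clean_total = sum(r["dues"] for r in unique_rows)  # the correct figure
```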
Think about our association membership example again. A misspelling might cause the data analyst to miscount the number of professional members if they sorted by membership type and then counted the number of rows. Like the other problems we've come across, you can fix these manually, or you can use spreadsheet tools such as spell check, autocorrect, and conditional formatting to make your life easier. There are also easy ways to convert text to lowercase, uppercase, or proper case, which is one of the things we'll check out again later.

All right, we're getting there. The next step is removing formatting. This is particularly important when you get data from lots of different sources. Every database has its own formatting, which can cause the data to seem inconsistent. Creating a clean and consistent visual appearance for your spreadsheets will help make them a valuable tool for you and your team when making key decisions. Most spreadsheet applications also have a "clear formats" tool, which is a great time saver.

Cleaning data is an essential step in increasing the quality of your data. Now you know lots of different ways to do that. In the next video, you'll take that knowledge even further and learn how to clean up data that comes from more than one source.

So far, you've learned a lot about dirty data and how to clean up the most common errors in a dataset. Now we're going to take that a step further and talk about cleaning up multiple datasets. Cleaning data that comes from two or more sources is very common for data analysts, but it does come with some interesting challenges.

A good example is a merger, which is an agreement that unites two organizations into a single new one. In the logistics field, there have been lots of big changes recently, mostly because of the e-commerce boom. With so many people shopping online, it makes sense that the companies responsible for delivering those products to their homes are in the middle of a big shake-up. When big things happen in an industry, it's common for two organizations to team up and become stronger through a merger.

Let's talk about how that would affect our logistics association. As a quick reminder, this spreadsheet lists association member ID numbers, first and last names, addresses, how much each member pays in dues, when the membership expires, and the membership type. Now let's think about what would happen if the International Logistics Association decided to get together with the Global Logistics Association in order to help their members handle the incredible demands of e-commerce.

First, all the data from each organization would need to be combined using data merging. Data merging is the process of combining two or more datasets into a single dataset. This presents a unique challenge, because when two totally different datasets are combined, the information is almost guaranteed to be inconsistent and misaligned. For example, the Global Logistics Association spreadsheet has a separate column for a person's suite, apartment, or unit number, but the International Logistics Association combines that information with the street address. This needs to be corrected to make the number of address columns consistent.

Next, check out how the Global Logistics Association uses people's email addresses as their member IDs, while the International Logistics Association uses numbers. This is a big problem, because people in a certain industry, such as logistics, typically join multiple professional associations, so there's a very good chance that these datasets include membership information on the exact same person, just recorded in different ways. It's super important to remove those duplicates.

Also, the Global Logistics Association has many more member types than the other organization. On top of that, it uses the term "young professional" instead of "student associate," but both describe members who are still in school or just starting their careers. If you're merging these two datasets, you need to work with your team to fix the fact that the two associations describe memberships very differently.
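The merge problems just described, mismatched ID schemes and different membership labels, can be sketched in Python. All the records, field names, and the label map below are invented for illustration; real merges would involve many more fields and agreed-upon rules.

```python
# Two tiny, invented member lists: one association keys members by
# email, the other by a numeric ID, and they label the same
# membership type differently.
global_assoc = [
    {"id": "pat@example.com", "email": "pat@example.com",
     "type": "young professional"},
]
intl_assoc = [
    {"id": "100537", "email": "pat@example.com",
     "type": "student associate"},
]

# Agree on one vocabulary for member types before merging.
LABEL_MAP = {"young professional": "student associate"}

def normalize(record):
    fixed = dict(record)
    fixed["type"] = LABEL_MAP.get(fixed["type"], fixed["type"])
    return fixed

merged = [normalize(r) for r in global_assoc + intl_assoc]

# De-duplicate on email, since the same person can appear in both lists
# under different ID schemes.
unique = {}
for record in merged:
    unique.setdefault(record["email"], record)

members = list(unique.values())  # one record per person
```

The key design choice is picking a stable field (here, the email address) that identifies the same person across both sources before removing duplicates.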
Okay, now you understand why the merging of organizations also requires the merging of data, and that can be tricky. But there are lots of other reasons why data analysts merge datasets. For example, in one of my past jobs, I merged a lot of data from multiple sources to get insights about our customers' purchases, and the insights I gained helped me identify customer buying patterns.

When merging datasets, I always begin by asking myself some key questions, to help me avoid redundancy and to confirm that the datasets are compatible. In data analytics, compatibility describes how well two or more datasets are able to work together.

The first question I would ask is: do I have all the data I need? To gather customer purchase insights, I wanted to make sure I had data on customers, their purchases, and where they shopped. Next, I would ask: does the data I need exist within these datasets? As you learned earlier in this program, this involves considering the entire dataset analytically. Looking through the data before I start using it lets me get a feel for what it's all about: what the schema looks like, whether it's relevant to my customer purchase insights, and whether it's clean data. That brings me to the next question: do the datasets need to be cleaned, or are they ready for me to use? And because I'm working with more than one source, I'll also ask myself: are the datasets cleaned to the same standard? For example, which fields are regularly repeated? How are missing values handled? How recently was the data updated? Finding the answers to these questions, and understanding whether I need to fix any problems at the start of a project, is a very important step in data merging.

In both the examples we explored here, a data analyst could use either spreadsheet tools or SQL queries to clean up, merge, and prepare the datasets for analysis. Depending on the tool you decide to use, the cleanup process can be simple or very complex. Soon you'll learn how to make the best choice for your situation. And as a final note, programming languages like R are also very useful for cleaning data. You'll learn more about how to use R and the other concepts we covered soon.

As you learned earlier, there are a lot of different ways to clean up data. I've shown you some examples of how you can clean data manually, such as searching for and fixing misspellings or removing empty spaces and duplicates. We also learned that lots of spreadsheet applications have tools that help simplify and speed up the data cleaning process. There are a lot of great efficiency tools that data analysts use all the time, such as conditional formatting, removing duplicates, formatting dates, fixing text strings and substrings, and splitting text into columns. We'll explore those in more detail now.

The first is something called conditional formatting. Conditional formatting is a spreadsheet tool that changes how cells appear when values meet specific conditions. Likewise, it can let you know when a cell does not meet the conditions you've set. Visual cues like this are very useful for data analysts, especially when we're working in a large spreadsheet with lots of data. Making certain data points stand out makes the information easier to understand and analyze, and for cleaning data, knowing when the data doesn't follow a condition is very helpful.

Let's return to the logistics association spreadsheet to check out conditional formatting in action. We'll use conditional formatting to highlight blank cells; that way, we know where there's missing information so we can add it to the spreadsheet. To do this, we'll start by selecting the range we want to search. For this example, we're not focused on Address 3 and Address 5, so the fields will include all the columns in our spreadsheet except for F and H.
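Before clicking through the menus, it may help to see what the blank-cell rule amounts to in code. This Python sketch (with invented member rows, using an empty string to stand in for a blank cell) simply reports which fields are empty in each record, the same information the highlighting gives you visually.

```python
# Invented member rows; an empty string plays the role of a blank cell.
rows = [
    {"member_id": "JT2PM1", "first_name": "Rachel", "address": "123 Main St"},
    {"member_id": "QX9RL4", "first_name": "", "address": "9 Elm Ave"},
]

def blank_fields(row):
    # Return the names of every field whose value is empty.
    return [field for field, value in row.items() if value == ""]

# Map each member with missing data to the fields that need filling in.
flagged = {row["member_id"]: blank_fields(row)
           for row in rows if blank_fields(row)}
```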
Next, we'll go to Format and choose Conditional formatting. Great: our range is automatically indicated in the field. The format rule will be to format cells if the cell is empty. Finally, we'll choose the formatting style. I'm going to pick a shade of bright pink so my blanks really stand out. Then click Done, and the blank cells are instantly highlighted.

Okay, the next spreadsheet tool removes duplicates. As you've learned before, it's always smart to make a copy of the dataset before removing anything, so let's do that now. Great, now we can continue. You might remember that our example spreadsheet has one association member listed twice. To fix that, go to Data and select Remove duplicates. Remove duplicates is a tool that automatically searches for and eliminates duplicate entries from a spreadsheet. Choose "Data has header row," because our spreadsheet has a row at the very top that describes the contents of each column. Next, choose "Select all," because we want to inspect our entire spreadsheet. Finally, click Remove duplicates. You'll notice the duplicate row was found and immediately removed.

Another useful spreadsheet tool enables you to make formats consistent. For example, some of the dates in this spreadsheet aren't in a standard date format. This could be confusing if you wanted to analyze when association members joined, how often they renewed their memberships, or how long they've been with the association. To make all of our dates consistent, first select column J, then go to Format, select Number, then Date. Now all of our dates have a consistent format.

Before we go over the next tool, I want to explain what a text string is. In data analytics, a text string is a group of characters within a cell, most often composed of letters. An important characteristic of a text string is its length, which is the number of characters it contains. You'll learn more about that soon. For now, it's also useful to know that a substring is a smaller
subset of a text string.

Okay, now let's talk about SPLIT. SPLIT is a tool that divides a text string around a specified character and puts each fragment into a new, separate cell. SPLIT is helpful when you have more than one piece of data in a cell and you want to separate them out. This might be a person's first and last name listed together, or it could be a cell that contains someone's city, state, country, and zip code when you actually want each of those in its own column.

So let's say this association wanted to analyze all of the different professional certifications its members have earned. To do this, you want each certification separated out into its own column. Right now, the certifications are separated by a comma. That's the specified text separating each item, also called the delimiter. Let's get them separated. Highlight the column, then select Data and "Split text to columns." This spreadsheet application automatically knew that the comma was the delimiter and separated each certification, but sometimes you might need to specify what the delimiter should be. You can do that here.

"Split text to columns" is also helpful for fixing instances of numbers stored as text. Sometimes values in your spreadsheet will seem like numbers, but they're formatted as text. This can happen when copying and pasting from one place to another, or if the formatting is wrong. For this example, let's check out a new spreadsheet, this one from a cosmetics maker. If a data analyst wanted to determine total profits, they could add up everything in column F, but there's a problem: one of the cells has an error. If you check into it, you learn that the 707 in this cell is stored as text and can't be treated as a number. So when the spreadsheet tries to multiply the cost of the product by the number of units sold, it's unable to make the calculation. But if we select the Orders column and choose "Split text to columns," the error is resolved, because now the value can be treated as a number. Coming up, you'll learn about a tool that does just the opposite.
CONCATENATE is a function that joins multiple text strings into a single string. Spreadsheets are a very important part of data analytics. They save data analysts time and effort and help us eliminate errors each and every day. Here you learned about some of the most common tools we use, but there's a lot more to come. Next, we'll learn even more about data cleaning with spreadsheet tools. Bye for now!

You've learned about some very useful data cleaning tools that are built right into spreadsheet applications. Now we'll explore how functions can optimize your efforts to ensure data integrity. As a reminder, a function is a set of instructions that performs a specific calculation using the data in a spreadsheet.

The first function we'll discuss is called COUNTIF. COUNTIF is a function that returns the number of cells that match a specified value. Basically, it counts the number of times a value appears in a range of cells. Let's go back to our professional association spreadsheet. In this example, we want to make sure the association membership prices are listed accurately, so we'll use COUNTIF to check for some common problems, like negative numbers or a value that's much less or much greater than expected.

To start, let's find the least expensive membership: $100 for student associates. That's the lowest number that should exist in this column, so if any cell has a value less than 100, COUNTIF will alert us. We'll add a few more rows at the bottom of our spreadsheet, then beneath column H, type "Member dues less than $100." Next, type the function in the cell next to it. Every function has a certain syntax that needs to be followed for it to work. Syntax is a predetermined structure that includes all required information and its proper placement. The syntax of a COUNTIF function is: an equals sign, COUNTIF, an open parenthesis, the range, a comma, the specified value in quotation marks, and a closing parenthesis. So it will show up like this: =COUNTIF(I2:I72,"<100"), where I2:I72 is the range and the value is "less than 100." This tells the function to go through column I and return a count of all cells that contain a number less than 100. Turns out there is one. Scrolling through our data, we find that one piece of data was mistakenly keyed in as a negative number. Let's fix that now.

Next, we'll use COUNTIF to search for any values that are more than we would expect. The most expensive membership type is $500 for corporate members. Type the function in the cell. This time it'll appear as =COUNTIF(I2:I72,">500"): I2:I72 is still the range, but the value is "greater than 500." There's one here, too. Check it out: this entry has an extra zero. It should be $100.

Okay, the next function we'll discuss is called LEN. LEN is a function that tells you the length of a text string by counting the number of characters it contains. This is useful when cleaning data if you have a certain piece of information in your spreadsheet that you know must be a certain length. For example, this association uses six-character member identification codes. So if we had just imported this data and wanted to be sure our codes all have the correct number of characters, we'd use LEN. The syntax of LEN is: an equals sign, LEN, an open parenthesis, the range, and a closing parenthesis. We'll insert a new column after Member ID, then type an equals sign, LEN, and an open parenthesis. The range is the first member ID, in A2. Finish the function by closing the parentheses: =LEN(A2). It tells us that there are six characters in cell A2. Let's continue the function through the entire column and find out if any results are not six.

Instead of manually going through our spreadsheet to search for these instances, we'll use conditional formatting. We talked about conditional formatting earlier: it's a spreadsheet tool that changes how cells appear when values meet specific conditions. Let's practice that now. Select all of column B except for the header, then go to Format and choose Conditional formatting. The format rule is to format cells if the value is not equal to 6. Click Done, and the cell with the 7 inside is highlighted.
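The two COUNTIF checks and the LEN check above translate directly into Python. The dues values and member IDs below are invented; the comments show which spreadsheet formula each line mirrors.

```python
# Invented dues column; valid values run from $100 to $500.
dues = [100, 250, -100, 500, 1000, 350]

below_min = sum(1 for d in dues if d < 100)  # like =COUNTIF(I2:I72,"<100")
above_max = sum(1 for d in dues if d > 500)  # like =COUNTIF(I2:I72,">500")

# Invented member IDs; every code should be exactly six characters.
member_ids = ["JT2PM1", "QX9RL4", "HD8WK21"]
wrong_length = [m for m in member_ids if len(m) != 6]  # like =LEN(A2) checks
```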
format cells if not equal to 6. click done and the cell with the 7 inside is highlighted all right now we're going to talk about left and right left is a function that gives you a set number of characters from the left side of a text string right is a function that gives you a set number of characters from the right side of a text string as a quick reminder a text string is a group of characters within a cell commonly composed of letters numbers or both to see these functions in action let's go back to the spreadsheet from the cosmetics maker from earlier this spreadsheet contains product codes each has a five digit numeric code and then a four character text identifier but let's say we only want to work with one side or the other you can use left or right to give you the specific set of characters or numbers you need we'll practice cleaning up our data using the left function first the syntax of left is equals left open parenthesis the range a comma and the number of characters from the left side of the text string we want then we finish it with the close parentheses here our project requires just the five digit numeric codes so in a separate column type equals left open parenthesis then add the range our range is a2 then add a comma and the number five for our five-digit product code finally finish the function with the close parenthesis our function should show up like this press enter and now we have a substring which is the number part of the product code only click and drag this function through the entire column to separate out the rest of the product codes by number only okay now let's say your project only needs the four character text identifier for that we'll use the write function in the next column we'll begin the function the syntax is equals right open parenthesis the range a comma and the number of characters we want then we finish with the close parentheses let's key that in now equals right open parenthesis and the range is still a2 add a comma 
and this time we'll tell it that we want the first four characters from the right. Close up the parentheses and press Enter, then drag the function through the entire column. Now we can analyze the products in our spreadsheet based on either substring: the five-digit numeric code or the four-character text identifier. So hopefully that makes it clear how you can use LEFT and RIGHT to extract substrings from the left and right sides of a string. Now let's learn how you can extract something in between. Here's where we'll use something called MID. MID is a function that gives you a segment from the middle of a text string. This cosmetics company lists all of its clients using a client code. It's composed of the first three letters of the city where the client's located, its state abbreviation, and then a three-digit identifier. But let's say a data analyst needs to work with just the states in the middle. The syntax for MID is: equals, MID, open parenthesis, the range, then a comma. When using MID, you always need to supply a reference point. In other words, you need to set where the function should start. After that, place another comma and how many middle characters you want. In this case, our range is D2. Let's start the function in a new column. Type equals MID, open parenthesis, D2. The first three characters represent a city name, so that means the starting point is the fourth. Add a comma and 4. We also need to tell the function how many middle characters we want. Add one more comma and 2, because the state abbreviations are two characters long. Press Enter, and bam: we get just the state abbreviation. Continue the MID function through the rest of the column. So we've learned about a few functions that help separate out specific text strings. But what if we want to combine them instead? For that, we'll use CONCATENATE, which is a function that joins together two or more text strings. The syntax is: equals, CONCATENATE, then an open parenthesis. Inside, indicate each text string you want to join,
separated by commas. Then finish the function with the close parenthesis. So, just for practice, let's say we needed to rejoin the left and right text strings back into complete product codes. In a new column, let's begin our function. Type equals CONCATENATE, then an open parenthesis. The first text string we want to join is in H2. Then add a comma. The second part is in I2. Add a close parenthesis and press Enter. Drag it down through the entire column, and just like that, all of our product codes are back together. Okay, the last function we'll learn about here is TRIM. TRIM is a function that removes leading, trailing, and repeated spaces in data. Sometimes when you import data, your cells have extra spaces, which can get in the way of your analysis. For example, if this cosmetics maker wanted to look up a specific client name, it won't show up in the search if it has extra spaces. You can use TRIM to fix that problem. The syntax for TRIM is: equals, TRIM, open parenthesis, your range, and close parenthesis. So in a separate column, type equals TRIM and an open parenthesis. The range is C2, as you want to check out the client names. Close the parentheses and press Enter. Finally, continue the function down the column. TRIM fixed the extra spaces. So now we know some very useful functions that can make your data cleaning even more successful. This was a lot of information, so as always, feel free to go back and review the video, and then practice on your own. We'll continue building on these tools soon, and you'll also have a chance to practice. Pretty soon, these data cleaning steps will become second nature, like brushing your teeth. Let's get into it. Motivational speaker Wayne Dyer once said, "If you change the way you look at things, the things you look at change." This is so true in data analytics. No two analytics projects are ever exactly the same, so it only makes sense that different projects require us to focus on different information differently. In this video, we'll explore different methods that data analysts
use to look at data differently, and how that leads to more efficient and effective data cleaning. Some of these methods include sorting and filtering, pivot tables, a function called VLOOKUP, and plotting to find outliers. Let's start with sorting and filtering. As you learned earlier, sorting and filtering data helps data analysts customize and organize the information the way they need for a particular project. But these tools are also very useful for data cleaning. You might remember that sorting involves arranging data into a meaningful order to make it easier to understand, analyze, and visualize. For data cleaning, you can use sorting to put things in alphabetical or numerical order, so you can easily find a piece of data. Sorting can also bring duplicate entries closer together for faster identification. Filters, on the other hand, are very useful in data cleaning when you want to find a particular piece of information. You learned earlier that filtering means showing only the data that meets a specific criteria while hiding the rest. This lets you view only the information you need. When cleaning data, you might use a filter to find only values above a certain number, or just even or odd values. Again, this helps you find what you need quickly and separates out the information you want from the rest. That way, you can be more efficient when cleaning your data. Another way to change the way you view data is by using pivot tables. You've learned that a pivot table is a data summarization tool that is used in data processing. Pivot tables sort, reorganize, group, count, total, or average data stored in a database. In data cleaning, pivot tables are used to give you a quick, clutter-free view of your data. You can choose to look at the specific parts of a data set you need and get a visual in the form of a pivot table. Let's create one now, using our cosmetics maker's spreadsheet again. To start, select the data we want to use. Here, we'll choose the entire spreadsheet. Select Data, then Pivot table. Choose
New sheet, and Create. Let's say we're working on a project that requires us to look at only the most profitable products: items that earn the cosmetics maker at least $10,000 in orders. So the row we'll include is Total, for total profits. We'll sort in descending order to put the most profitable items at the top, and we'll show totals. Next, we'll add another row for Products so that we know what those numbers are about. We can clearly determine that the most profitable products have the product codes 15143Exfo and 32729Masc. We can ignore the rest for this particular project, because they fall below $10,000 in orders. Now, we might be able to use context clues to assume we're talking about exfoliants and mascaras, but we don't know which ones, or if that assumption is even correct. So we need to confirm what the product codes correspond to, and this brings us to the next tool. It's called VLOOKUP. VLOOKUP stands for vertical lookup. It's a function that searches for a certain value in a column to return a corresponding piece of information. When data analysts look up information for a project, it's rare for all of the data they need to be in the same place. Usually, you'll have to search across multiple sheets or even different databases. The syntax of VLOOKUP is: equals, VLOOKUP, open parenthesis, then the data you want to look up. Next is a comma, and where you want to look for that data. In our example, this will be the name of a spreadsheet followed by an exclamation point. The exclamation point indicates that we're referencing a cell in a different sheet from the one we're currently working in. Again, that's very common in data analytics. Okay, next is the range in the place where you're looking for data, indicated using the first and last cell separated by a colon. After one more comma is the column in the range containing the value to return. Next, another comma and the word false, which means that an exact match is what we're looking for. Finally, complete your function by closing the
parentheses. To put it simply, VLOOKUP searches for the value in the first argument in the leftmost column of the specified location. The value of the third argument tells VLOOKUP to return the value in the same row from the specified column. The false tells VLOOKUP that we want an exact match. Soon you'll learn the difference between exact and approximate matches, but for now, just know that VLOOKUP takes a value in one cell and searches for a match in another place. Let's begin. We'll type equals VLOOKUP, then add the data we want to look for, which is the product code. The dollar signs make sure that the corresponding part of the reference remains unchanged, or locked. You can lock just the column, just the row, or both at the same time. Next, we'll tell it to look at Sheet 2, in both columns. We add a 2 to represent the second column. The last term, false, says we want an exact match. With this information, we can now analyze the data for only the most profitable products. Going back to the two most profitable products, we can search for 15143Exfo and 32729Masc. Go to Edit and then Find, type in the product codes, and search for them. Now we know which products we'll be using for this particular project. The final tool we'll talk about is something called plotting. When you plot data, you put it in a graph, chart, table, or other visual to help you quickly see what it looks like. Plotting is very useful when trying to identify any skewed data or outliers. For example, if we wanted to make sure the price of each product is correct, we could create a chart. This would give us a visual aid that helps us quickly figure out if anything looks like an error. So let's select the column with our prices. Then we'll go to Insert and choose Chart. Pick a column chart as the type. One of these prices looks extremely low. If we look into it, we discover that this item has a decimal point in the wrong place: it should be $7.30, not 73 cents. That would have a big impact on our total profits, so it's a
good thing we caught that during data cleaning. Looking at data in new and creative ways helps data analysts identify all kinds of dirty data. Coming up, you'll continue practicing these new concepts so you can get more comfortable with them. You'll also learn additional strategies for ensuring your data is clean and will provide you with effective insights. Great work so far. So far, you've learned about a lot of different tools and functions that analysts use to clean up data for analysis. Now we'll take a step back and talk about some of the really big-picture aspects of clean data. Knowing how to fix specific problems, either manually with spreadsheet tools or with functions, is extremely valuable. But it's also important to think about how your data has moved between systems and how it's evolved along its journey to your data analysis project. To do this, data analysts use something called data mapping. Data mapping is the process of matching fields from one database to another. This is very important to the success of data migration, data integration, and lots of other data management activities. As you learned earlier, different systems store data in different ways. For example, the state field in one spreadsheet might show Maryland spelled out, but another spreadsheet might store it as MD. Data mapping helps us note these kinds of differences, so we know that when data is moved and combined, it will be compatible. As a quick reminder, compatibility describes how well two or more data sets are able to work together. So, the first step to data mapping is identifying what data needs to be moved. This includes the tables and the fields within them. We also need to define the desired format for the data once it reaches its destination. To figure out how this works, let's go back to the merger between our two logistics associations. Starting with the first data field, we'll identify that we need to move both sets of member IDs. And to define the desired format, we'll choose whether to use numbers,
like this spreadsheet, or email addresses, like the other spreadsheet. Next comes mapping the data. Depending on the schema and the number of primary and foreign keys in a data source, data mapping can be simple or very complex. As a reminder, a schema is a way of describing how something is organized. A primary key references a column in which each value is unique, and a foreign key is a field within a table that is a primary key in another table. For more challenging projects, there are all kinds of data mapping software programs you can use. These data mapping tools will analyze, field by field, how to move data from one place to another, then automatically clean, match, inspect, and validate the data. They also create consistent naming conventions, ensuring compatibility when the data is transferred from one source to another. When selecting a software program to map your data, you want to be sure that it supports the file types you're working with, such as Excel, SQL, Tableau, and others. Later on, you'll learn more about selecting the right tool for a particular task. For now, let's practice mapping data manually. First, we need to determine the content of each section to make sure the data ends up in the right place. For example, the data on when memberships expire would be consolidated into a single column. This step makes sure that each piece of information ends up in the most appropriate place in the merged data source. Now, you might remember that some of the data was inconsistent between the two organizations, like the fact that one uses a separate column for suite, apartment, or unit number, but the other doesn't. This brings us to the next step: transforming the data into a consistent format. This is a great time to use CONCATENATE. As you learned before, CONCATENATE is a function that joins together two or more text strings, which is what we did earlier with our cosmetics company example. So we'll insert a new column and then type equals CONCATENATE, then the two text strings we want to make one. Drag
that through the entire column, and now we have consistency in the new merged association's list of member addresses. Okay, now that everything is compatible, it's time to transfer the data to its destination. There are a lot of different ways to move data from one place to another, including querying, import wizards, and even simple drag and drop. Later, you'll learn more about choosing the right method for different situations. All right, here's our merged spreadsheet. It looks good, but we still want to make sure everything was transferred properly, so we'll go into the testing phase of data mapping. For this, you inspect a sample piece of data to confirm that it's clean and properly formatted. It's also a smart practice to do spot checks on things such as the number of nulls. For the test, you can use a lot of the data cleaning tools we discussed previously, such as data validation, conditional formatting, COUNTIF, sorting, and filtering. Finally, once you've determined that the data is clean and compatible, you can start using it for analysis. Data mapping is so important because even one mistake when merging data can ripple throughout an organization, causing the same error to appear again and again. This leads to poor results. On the other hand, data mapping can save the day by giving you a clear road map you can follow to make sure your data arrives safely at its destination. And that's why you learn how to do it. Congratulations on finishing this video from the Google Data Analytics Certificate. Access the full experience, including job search help, and start to earn the official certificate by clicking the icon or the link in the description. Watch the next video in the course by clicking here, and subscribe to our channel for more from upcoming Google Career Certificates.
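The spreadsheet functions covered in this transcript (LEN, LEFT, RIGHT, MID, CONCATENATE, TRIM, COUNTIF, and an exact-match VLOOKUP) have close equivalents in most programming languages. As a rough illustration only, not part of the course itself, here is a minimal Python sketch of the same ideas. All sample values (the product code casing "15143Exfo", the client code "BalMD102", and the client name) are made up for this example:

```python
# Illustrative Python equivalents of the spreadsheet cleaning functions
# discussed above, run against a small made-up sample of data.

product_code = "15143Exfo"  # five-digit numeric code + four-character identifier

# LEN: count the characters in a text string
assert len(product_code) == 9

# LEFT: first N characters; RIGHT: last N characters
numeric_part = product_code[:5]   # like =LEFT(A2, 5)
text_part = product_code[-4:]     # like =RIGHT(A2, 4)

# MID: substring from a starting position, for a given length.
# Client code: 3 city letters + 2-letter state + 3-digit identifier.
client_code = "BalMD102"
state = client_code[3:3 + 2]      # like =MID(D2, 4, 2) -> start at 4th char, take 2

# CONCATENATE: join two or more text strings back together
assert numeric_part + text_part == product_code

# TRIM: remove extra spaces. Note: Python's strip() only removes leading
# and trailing whitespace; spreadsheet TRIM also collapses repeated
# interior spaces, so this is only an approximation.
client_name = "  Kalya Cosmetics "
clean_name = client_name.strip()

# COUNTIF-style checks: count values outside the expected price range
prices = [125, 50, -30, 600, 100]
too_low = sum(1 for p in prices if p < 100)    # like =COUNTIF(I2:I72, "<100")
too_high = sum(1 for p in prices if p > 500)   # like =COUNTIF(I2:I72, ">500")

# VLOOKUP-style exact-match lookup: map a product code to its name,
# the way VLOOKUP with FALSE searches another sheet for an exact match.
lookup_table = {"15143Exfo": "Exfoliant", "32729Masc": "Mascara"}
product_name = lookup_table.get(product_code)

print(numeric_part, text_part, state, clean_name, too_low, too_high, product_name)
```

The slice-based versions mirror the spreadsheet logic directly: LEFT and RIGHT become slices from each end, MID becomes a slice with an explicit start and length, and the exact-match VLOOKUP becomes a dictionary lookup.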
Info
Channel: Google Career Certificates
Views: 6,624
Rating: 4.9633026 out of 5
Keywords: Grow with Google, Career Change, Tech jobs, Google Career Certificate, Google Career Certificates, Job skills, Coursera, Certification, Google, professional certificates, professional certificate program, Data analyst, Data analytics, Data analysis, Data analytics for beginners, What is data analytics, Sql, Data, R Programming, Spreadsheets, Human error, SQL database, Data validation, Conditional formatting, Pivot table, Data mapping, Schema, Foreign Key
Id: kCP-H8VRDCw
Length: 66min 32sec (3992 seconds)
Published: Fri Jun 11 2021