Ways to Ensure Data Integrity | Google Data Analytics Certificate

Video Statistics and Information

Captions
This video is part of the Google Data Analytics Certificate, providing you with job-ready skills to start or advance your career in data analytics. Get access to practice exercises, quizzes, discussion forums, job search help, and more on Coursera, and you can earn your official certificate. Visit grow.google/datacert to enroll in the full learning experience today.

Hi, good to see you! My name's Sally, and I'm here to teach you all about processing data. I'm a Measurement and Analytical Lead at Google. My job is to help advertising agencies and companies measure success and analyze their data, so I get to meet with lots of different people to show them how data analysis helps with their advertising. Speaking of analysis, you did great earlier learning how to gather and organize data for analysis. It's definitely an important step in the data analysis process, so well done. Now let's talk about how to make sure that your organized data is complete and accurate. Clean data is the key to making sure your data has integrity before you analyze it. We'll show you how to make sure your data is clean and tidy. Cleaning and processing data is one part of the overall data analysis process. As a quick reminder, that process is: ask, prepare, process, analyze, share, and act. Which means it's time for us to explore the process phase, and I'm here to guide you the whole way. I'm very familiar with where you are right now. I'd never heard of data analytics until I went through a program similar to this one. Once I started making progress, I realized how much I enjoyed data analytics and the doors it can open. And now I'm excited to help you open those same doors.

One thing I've realized as I've worked for different companies is that clean data is important in every industry. For example, I learned early in my career to be on the lookout for duplicate data, a common problem that analysts come across when cleaning. I used to work for a company that had different types of subscriptions. In our data set, each user would have a new row for each subscription type they bought, which meant users would show up more than once in my data. So if I had counted the number of users in a table without accounting for duplicates like this, I would have counted some users twice instead of once. As a result, my analysis would have been wrong, which would have led to problems in my reports and for the stakeholders relying on my analysis. Imagine if I told the CEO that we had twice as many customers as we actually did! That's why clean data is so important.

So, the first step in processing data is learning about data integrity. You'll find out what data integrity is and why it's important to maintain it throughout the data analysis process. Sometimes you might not even have the data that you need, so you'll have to create it yourself. This will help you learn how sample size and random sampling can save you time and effort. Testing data is another important step to take when processing data. We'll share some guidance on how to test data before your analysis officially begins. Just like you'd clean your clothes and your dishes in everyday life, analysts clean their data all the time, too. The importance of clean data will definitely be a focus here. You'll learn data cleaning techniques for all scenarios, along with some pitfalls to watch out for as you clean. You'll explore data cleaning in both spreadsheets and databases. Building on what you've already learned about spreadsheets, we'll talk more about SQL and how you can use it to clean data and do other useful things, too. When analysts clean their data, they do a lot more than a spot check to make sure it was done correctly. You'll learn ways to verify and report your cleaning results. This includes documenting your cleaning process, which has lots of benefits that we'll explore. It's important to remember that processing data is just one of the tasks you'll complete as a data analyst. Actually, your skills with cleaning data might just end up being something you highlight in your resume when you start job hunting. Speaking of resumes, you'll be able to start thinking about how to build your own from the perspective of a data analyst. Once you're done here, you'll have a strong appreciation for clean data and how important it is in the data analysis process. So let's get started!

In this video, we're going to discuss data integrity and some risks you might run into as a data analyst. A strong analysis depends on the integrity of the data. If the data you're using is compromised in any way, your analysis won't be as strong as it should be. Data integrity is the accuracy, completeness, consistency, and trustworthiness of data throughout its life cycle. That might sound like a lot of qualities for the data to live up to, but trust me, it's worth it to check for them all before proceeding with your analysis. Otherwise, your analysis could be wrong, not because you did something wrong, but because the data you were working with was wrong to begin with. When data integrity is low, it can cause anything from the loss of a single pixel in an image to an incorrect medical decision. In some cases, one missing piece can make all of your data useless.

Data integrity can be compromised in lots of different ways. There's a chance data can be compromised every time it's replicated, transferred, or manipulated in any way. Data replication is the process of storing data in multiple locations. If you're replicating data at different times in different places, there's a chance your data will be out of sync. This data lacks integrity, because different people might not be using the same data for their findings, which can cause inconsistencies. There's also the issue of data transfer, which is the process of copying data from a storage device to memory, or from one computer to another. If your data transfer is interrupted, you might end up with an incomplete data set, which might not be useful for your needs. The data manipulation process involves changing the data to make it more organized and easier to read. Data manipulation is meant to make the data analysis process more efficient, but an error during the process can compromise that efficiency. Finally, data can also be compromised through human error, viruses, malware, hacking, and system failures, which can all lead to even more headaches. I'll stop there; that's enough potentially bad news to digest. Let's move on to some potentially good news: in a lot of companies, the data warehouse or data engineering team takes care of ensuring data integrity. Coming up, we'll learn about checking data integrity as a data analyst, but rest assured, someone else will usually have your back, too. After you've found out what kind of data you're working with, it's important to double-check that your data is complete and valid before analysis. This will help ensure that your analysis and eventual conclusions are accurate. Checking data integrity is a vital step in processing your data to get it ready for analysis, whether you or someone else at your company is doing it. Coming up, you'll learn even more about data integrity. See you soon!

It's good to remember to check for data integrity. It's also important to check that the data you use aligns with the business objective. This adds another layer to the maintenance of data integrity, because the data you're using might have limitations that you'll need to deal with. The process of matching data to business objectives can actually be pretty straightforward. Here's a quick example. Let's say you're an analyst for a business that produces and sells auto parts. If you need to address a question about the revenue generated by the sale of a certain part, then you'd pull up the revenue table from the data set. If the question is about customer reviews, then you'd pull up the reviews table to analyze the average ratings. But before digging into any analysis, you need to consider a few limitations that might affect it. If the data hasn't been cleaned properly, then you won't be able to use it yet; you would need to wait until a thorough cleaning has been done. Now, let's say you're trying to find out how much an average customer spends, and you notice the same customer's data showing up in more than one row. This is called duplicate data. To fix this, you might need to change the format of the data, or you might need to change the way you calculate the average. Otherwise, it will seem like the data is for two different people, and you'll be stuck with misleading calculations. You might also realize there's not enough data to complete an accurate analysis. Maybe you only have a couple of months' worth of sales data. There's a slim chance you could wait for more data, but it's more likely that you'll have to change your process or find alternate sources of data while still meeting your objective.

I like to think of a data set like a picture. Take this picture: what are we looking at? Unless you're an expert traveler or know the area, it may be hard to tell from just these two images. Visually, it's very clear when we aren't seeing the whole picture. When you get the complete picture, you realize you're in London. With incomplete data, it's hard to see the whole picture and get a real sense of what is going on. We sometimes trust data because, if it comes to us in rows and columns, it seems like everything we need is there if we just query it. But that's just not true. I remember a time when I found out I didn't have enough data and had to find a solution. I was working for an online retail company and was asked to figure out how to shorten customer purchase-to-delivery time; faster delivery times usually lead to happier customers. When I checked the data set, I found very limited tracking information; we were missing some pretty key details. So the data engineers and I created new processes to track additional information, like the number of stops in a journey. Using this data, we reduced the time it took from purchase to delivery and saw an improvement in customer satisfaction. That felt pretty great! Learning how to deal with data issues while staying focused on the objective will help set you up for success in your career as a data analyst. And your path to success continues: next up, you'll learn more about aligning data to objectives. Keep it up!

Every analyst has been in a situation where there is insufficient data to help with their business objective. Considering how much data is generated every day, it may be hard to believe, but it's true. So let's discuss what you can do when you have insufficient data. We'll cover how to set limits for the scope of your analysis and what data you should include. At one point, I was a data analyst at a support center. Every day, we received customer questions, which were logged as support tickets. I was asked to forecast the number of support tickets coming in per month to figure out how many additional people we needed to hire. It was very important that we had sufficient data spanning back at least a couple of years, because I had to account for year-to-year and seasonal changes. If I'd had only the current year's data available, I wouldn't have known that a spike in January is common and has to do with people asking for refunds after the holidays. Because I had sufficient data, I was able to suggest we hire more people in January to prepare. Challenges are bound to come up, but the good news is that once you know your business objective, you'll be able to recognize whether you have enough data. And if you don't, you'll be able to deal with it before you start your analysis.

Now let's check out some of those limitations you might come across and how you can handle different types of insufficient data. Say you're working in the tourism industry and you need to find out which travel plans are searched most often. If you only use data from one booking site, you're limiting yourself to data from just one source. Other booking sites might show different trends that you would want to consider for your analysis. If a limitation like this impacts your analysis, you can stop and go back to your stakeholders to figure out a plan. If your data set keeps updating, that means the data is still incoming and might not be complete. So if there's a brand-new tourist attraction that you're analyzing interest and attendance for, there's probably not enough data for you to determine trends. You might want to wait a month to gather data, or you can check in with the stakeholders and ask about adjusting the objective. For example, you might analyze trends from week to week instead of month to month. You could also base your analysis on trends over the past three months and say, here's what attendance at the attraction for month four could look like. You might not have enough data to know if this number is too low or too high, but you would tell stakeholders that it's your best estimate based on the data that you currently have. On the other hand, your data could be older and no longer relevant. Outdated data about customer satisfaction won't include the most recent responses, so you'd be relying on ratings for hotels or vacation rentals that might no longer be accurate. In this case, your best bet might be to find a new data set to work with. Data that's geographically limited could also be unreliable. If your company is global, you wouldn't want to use data limited to travel in just one country; you'd want a data set that includes all countries. So those are just a few of the most common limitations you'll come across and some ways you can address them: you can identify trends with the available data, or wait for more data if time allows; you can talk with stakeholders and adjust your objective; or you can look for a new data set. The need to take these steps will depend on your role in your company and possibly the needs of the wider industry, but learning how to deal with insufficient data is always a great way to set yourself up for success. Your data analyst powers are growing stronger, and just in time: after you learn more about limitations and solutions, you'll learn about statistical power, another fantastic tool for you to use. See you soon!

Okay, so earlier we talked about having the right kind of data to meet your business objective, and the importance of having the right amount of data to make sure your analysis is as accurate as possible. You might remember that, for data analysts, a population is all possible data values in a certain data set. If you're able to use 100 percent of a population in your analysis, that's great, but sometimes collecting information about an entire population just isn't possible; it's too time-consuming or expensive. For example, let's say a global organization wants to know more about pet owners who have cats. You're tasked with finding out which kinds of toys cat owners in Canada prefer. But there are millions of cat owners in Canada, so getting data from all of them would be a huge challenge. Fear not! Allow me to introduce you to sample size. When you use a sample, you use a part of a population that's representative of the population. The goal is to get enough information from a small group within a population to make predictions or conclusions about the whole population. The sample size helps determine the degree to which you can be confident that your conclusions accurately represent the population. So for the data on cat owners, a sample might contain data about hundreds or thousands of people rather than millions. Using a sample for analysis is more cost-effective and takes less time. If done carefully and thoughtfully, you can get the same results using a sample instead of trying to hunt down every single cat owner to find out their favorite cat toys.

There is a potential downside, though. When you only use a small sample of a population, it can lead to uncertainty. You can't really be 100 percent sure that your statistics are a complete and accurate representation of the population. This leads to sampling bias, which we covered earlier in the program. Sampling bias is when a sample isn't representative of the population as a whole. This means some members of the population are being overrepresented or underrepresented. For example, if the survey used to collect data from cat owners only included people with smartphones, then cat owners who don't have a smartphone wouldn't be represented in the data. Using random sampling can help address some of those issues with sampling bias. Random sampling is a way of selecting a sample from a population so that every possible type of sample has an equal chance of being chosen. Going back to our cat owners again, using a random sample of cat owners means cat owners of every type have an equal chance of being chosen. So cat owners who live in apartments in Ontario would have the same chance of being represented as those who live in houses in Alberta. As a data analyst, you'll find that creating sample sizes usually takes place before you even get to the data, but it's still good for you to know that the data you're going to analyze is representative of the population and works with your objective. It's also good to know what's coming up in your data journey: in the next video, you'll have the option to become even more comfortable with sample sizes. See you there!

We've all probably dreamed of having a superpower at least once in our lives. I know I have; I'd love to be able to fly. But there's another superpower you might not have heard of: statistical power. Statistical power is the probability of getting meaningful results from a test. I'm guessing that's not a superpower any of you have dreamed about. Still, it's a pretty great data superpower for data analysts. Your projects might begin with a test or study. Hypothesis testing is a way to see if a survey or experiment has meaningful results. Here's an example. Let's say you work for a restaurant chain that's planning a marketing campaign for their new milkshakes. You need to test the ad on a group of customers before turning it into a nationwide campaign. In the test, you want to check whether customers like or dislike the campaign. You also want to rule out any factors outside of the ad that might lead them to say they don't like it. Using all your customers would be too time-consuming and expensive, so you'll need to figure out how many customers you'll need to show that the ad is effective. Fifty customers probably wouldn't be enough. Even if you randomly chose 50 customers, you might end up with customers who don't like milkshakes at all. And if that happens, you won't be able to measure the effectiveness of your ad in getting more milkshake orders, since no one in the sample would order them. That's why you need a larger sample size: so you can make sure you get a good number of all types of people for your test. Usually, the larger the sample size, the greater the chance you'll have statistically significant results with your test. And that's statistical power. In this case, using as many customers as possible will show the actual differences between the groups who like or dislike the ad versus people whose decision wasn't based on the ad at all.

There are ways to accurately calculate statistical power, but we won't go into them here. You might need to calculate it on your own as a data analyst. For now, you should know that statistical power is usually shown as a value out of one. So if your statistical power is 0.6, that's the same thing as saying 60 percent. In the milkshake ad test, if you found a statistical power of 60 percent, that means there's a 60 percent chance of you getting a statistically significant result on the ad's effectiveness. "Statistically significant" is a term used in statistics; if you want to learn more about the technical meaning, you can search online. But in basic terms, if a test is statistically significant, it means the results of the test are real and not an error caused by random chance. So there's a 60 percent chance that the results of the milkshake ad test are reliable and real, and a 40 percent chance that the result of the test is wrong. Usually, you need a statistical power of at least 0.8, or 80 percent, to consider your results statistically significant.

Let's check out one more scenario. We'll stick with milkshakes because, well, because I like milkshakes. Imagine you work for a restaurant chain that wants to launch a brand-new birthday-cake-flavored milkshake. This milkshake will be more expensive to produce than your other milkshakes. Your company hopes that the buzz around the new flavor will bring in more customers and money to offset this cost. They want to test this out in a few restaurant locations first. So let's figure out how many locations you'd have to use to be confident in your results. First, you'd have to think about what might prevent you from getting statistically significant results. Are there restaurants running any other promotions that might bring in new customers? Do some restaurants have customers that always buy the newest item, no matter what it is? Do some locations have construction that recently started that would prevent customers from even going to the restaurant? To get a higher statistical power, you'd have to consider all of these factors before you decide how many locations to include in your sample for your study. You want to make sure any effect is most likely due to the new milkshake flavor, not another factor. The measurable effects would be an increase in sales, or in the number of customers at the locations in your sample. That's it for now. Coming up, we'll explore sample sizes in more detail so you can get a better idea of how they impact your tests and studies. In the meantime, you've gotten to know a little bit more about milkshakes and superpowers, and, of course, statistical power. Sadly, only statistical power can truly be useful for data analysts, though putting on my cape and flying to grab a milkshake right now does sound pretty good.

If you've ever been to a store that hands out samples, you know it's one of life's little pleasures. For me, anyway. Those small samples are also a very smart way for businesses to learn more about their products from customers without having to give everyone a free sample. A lot of organizations use sample size in a similar way: they take one part of something larger, in this case, a sample of a population. Sometimes they'll perform complex tests on their data to see if it meets their business objectives. We won't go into all the calculations needed to do this effectively; instead, we'll focus on a big-picture look at the process and what it involves. As a quick reminder, sample size is a part of a population that is representative of the population. For businesses, it's a very important tool. It can be both expensive and time-consuming to analyze an entire population of data, so using a sample usually makes the most sense and can still lead to valid and useful findings. There are handy calculators online that can help you find sample size. You need to input the confidence level, population size, and margin of error. We've talked about population size before; to build on that, we'll learn about confidence level and margin of error. Knowing about these concepts will help you understand why you need them to calculate sample size.

The confidence level is the probability that your sample accurately reflects the greater population. You can think of it the same way as confidence in anything else: it's how strongly you feel that you can rely on something or someone. Having a 99 percent confidence level is ideal, but most industries hope for at least a 90 or 95 percent confidence level. Industries like pharmaceuticals usually want a confidence level that's as high as possible when they're using a sample. This makes sense because they're testing medicines and need to be sure they work and are safe for everyone to use. For other studies, organizations might just need to know that the test or survey results have them heading in the right direction. For example, if a paint company is testing out new colors, a lower confidence level is okay. You also want to consider the margin of error for your study. You'll learn more about this soon, but it basically tells you how close your sample results are to what your results would be if you used the entire population that your sample represents.

Think of it like this. Let's say the principal of a middle school approaches you with a study about students' candy preferences. They need to know an appropriate sample size, and they need it now. The school has a student population of 500, and they're asking for a confidence level of 95 percent and a margin of error of 5 percent. We've set up a calculator in a spreadsheet, but you can also easily find this type of calculator by searching "sample size calculator" on the internet. And just like those calculators, our spreadsheet calculator doesn't show any of the more complex calculations for figuring out sample size. So all we need to do is input the numbers for our population, confidence level, and margin of error. When we type 500 for our population size, 95 for our confidence level percentage, and 5 for our margin of error percentage, the result is about 218. That means, for this study, an appropriate sample size would be 218.
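The video treats the calculator as a black box, but most online sample-size calculators use something like Cochran's formula with a finite-population correction. Here's a minimal Python sketch; the function name and the worst-case assumption of a 50 percent population proportion are my choices for illustration, not details stated in the course:

```python
import math

# Standard two-tailed z-scores for common confidence levels
Z_SCORES = {90: 1.645, 95: 1.96, 99: 2.576}

def sample_size(population, confidence_pct, margin_of_error):
    """Minimum sample size: Cochran's formula plus a finite-population
    correction, assuming the most conservative proportion p = 0.5."""
    z = Z_SCORES[confidence_pct]
    p = 0.5  # worst case: maximizes the required sample
    n0 = (z ** 2) * p * (1 - p) / margin_of_error ** 2  # infinite-population estimate
    n = n0 / (1 + (n0 - 1) / population)                # correct for a finite population
    return math.ceil(n)

print(sample_size(500, 95, 0.05))  # → 218, matching the school survey
```

With a 3 percent margin of error instead of 5, the same function returns 341, which matches the adjustment discussed in the video.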
So if we surveyed 218 students and found that 55 percent of them preferred chocolate, then we could be pretty confident that would be true of all 500 students. 218 is the minimum number of people we need to survey, based on our criteria of a 95 percent confidence level and a 5 percent margin of error. And in case you're wondering, the confidence level and margin of error don't have to add up to 100 percent; they're independent of each other. So let's say we change our margin of error from 5 percent to 3 percent. Then we find that our sample size would need to be larger, about 341 instead of 218, to make the results of the study more representative of the population. Feel free to practice with an online calculator. Knowing sample size and how to find it will help you when you work with data. And we've got more useful knowledge coming your way, including learning about margin of error. See you soon!

Earlier, we touched on margin of error without explaining it completely. Well, we're going to right that wrong in this video by explaining margin of error more; we'll even include an example of how to calculate it. As a data analyst, it's important for you to figure out sample size and variables like confidence level and margin of error before running any kind of test or survey. It's the best way to make sure your results are objective, and it gives you a better chance of getting statistically significant results. But if you already know the sample size, like when you're given survey results to analyze, you can calculate the margin of error yourself. Then you'll have a better idea of how much of a difference there is between your sample and your population. We'll start at the beginning with a more complete definition: margin of error is the maximum amount that the sample results are expected to differ from those of the actual population. Let's think about an example of margin of error. It would be great to survey or test an entire population, but it's usually impossible or impractical to do this. So instead, we take a sample of the larger population. Based on the sample size, the resulting margin of error will tell us how different the results might be compared to the results if we had surveyed the entire population. Margin of error helps you understand how reliable the data from your hypothesis testing is. The closer to zero the margin of error, the closer your results from your sample would match results from the overall population.

For example, let's say you completed a nationwide survey using a sample of the population. You asked people who work five-day workweeks whether they like the idea of a four-day workweek. Your survey tells you that 60 percent prefer a four-day workweek. The margin of error was 10 percent, which tells us that between 50 and 70 percent like the idea. So if we were to survey all five-day workers nationwide, between 50 and 70 percent would agree with our results. Keep in mind, our range is between 50 and 70 percent; that's because the margin of error is counted in both directions from the survey result of 60 percent. If you set up a 95 percent confidence level for your survey, there will be a 95 percent chance that the entire population's responses will fall between 50 and 70 percent saying, yes, they want a four-day workweek. Since your margin of error overlaps with the 50 percent mark, you can't say for sure that the public likes the idea of a four-day workweek. In that case, you'd have to say your survey was inconclusive. Now, if you wanted a lower margin of error, say 5 percent, with a range between 55 and 65 percent, you could increase the sample size. But if you've already been given the sample size, you can calculate the margin of error yourself. Then you can decide for yourself how much of a chance your results have of being statistically significant, based on your margin of error. In general, the more people you include in your survey, the more likely your sample is representative of the entire population. Decreasing the confidence level would also have the same effect, but that would also make it less likely that your survey is accurate.

So, to calculate margin of error, you need three things: population size, sample size, and confidence level. And just like with sample size, you can find lots of calculators online by searching "margin of error calculator," but we'll show you in a spreadsheet, just like we did when we calculated sample size. Let's say you're running a study on the effectiveness of a new drug. You have a sample size of 500 participants, whose condition affects one percent of the world's population; that's about 80 million people, which is the population for your study. Since it's a drug study, you need to have a confidence level of 99 percent. You also need a low margin of error. Let's calculate it. We'll put the numbers for population, confidence level, and sample size in the appropriate spreadsheet cells, and our result is a margin of error of close to 6 percent, plus or minus. When the drug study is complete, you'd apply the margin of error to your results to determine how reliable your results might be.

Calculators like this one in the spreadsheet are just one of the many tools you can use to ensure data integrity. It's also good to remember that checking for data integrity and aligning the data with your objectives will put you in good shape to complete your analysis. Knowing about sample size, statistical power, margin of error, and the other topics we covered will help your analysis run smoothly. That's a lot of new concepts to take in. If you'd like to review them at any time, you can find them all in the glossary, or feel free to rewatch the video. Soon, you'll explore the ins and outs of clean data. The data adventure keeps moving, and I'm so glad you're moving along with it. You've got this!

Congratulations on finishing this video from the Google Data Analytics Certificate. Access the full experience, including job search help, and start to earn the official certificate by clicking the icon or the link in the description. Watch the next video in the course by clicking here, and subscribe to our channel for more from upcoming Google Career Certificates.
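The drug-study numbers in the captions can be reproduced with the standard margin-of-error formula for a proportion. This is a sketch of what a typical online calculator does, not necessarily the exact formula in the course spreadsheet; the worst-case 50 percent proportion is an assumption:

```python
import math

# Standard two-tailed z-scores for common confidence levels
Z_SCORES = {90: 1.645, 95: 1.96, 99: 2.576}

def margin_of_error(population, sample, confidence_pct):
    """Margin of error for a surveyed proportion, assuming the worst-case
    p = 0.5, scaled by a finite-population correction factor."""
    z = Z_SCORES[confidence_pct]
    p = 0.5
    moe = z * math.sqrt(p * (1 - p) / sample)
    # Correction is ~1 when the sample is tiny relative to the population
    fpc = math.sqrt((population - sample) / (population - 1))
    return moe * fpc

# Drug study: 500 participants drawn from roughly 80 million people, 99% confidence
print(round(margin_of_error(80_000_000, 500, 99) * 100, 1))  # → 5.8
```

That's the "close to 6 percent, plus or minus" figure from the captions. The same idea works in reverse for the four-day-workweek example: a 60 percent result with a 10 percent margin of error spans 50 to 70 percent, which is why that survey was inconclusive.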
Info
Channel: Google Career Certificates
Views: 1,494
Rating: 5 out of 5
Keywords: Grow with Google, Career Change, Tech jobs, Google Career Certificate, Google Career Certificates, Job skills, Coursera, Certification, Google, professional certificates, professional certificate program, Data analyst, Data analytics, Data analysis, Data analytics for beginners, What is data analytics, Sql, Data, R Programming, Spreadsheets, Sampling, Statistical power, Statistic, Margin of error, SQL tutorial for beginners, Spreadsheet, Pivot table excel, Sql tutorial
Id: 9qCfJv-zoyE
Length: 33min 51sec (2031 seconds)
Published: Fri Jun 11 2021