Process Data from Dirty to Clean - Part 4 of the Google Data Analytics Certificate

Captions
Hi, good to see you! My name's Sally, and I'm here to teach you all about processing data. I'm a measurement and analytical lead at Google. My job is to help advertising agencies and companies measure success and analyze their data, so I get to meet with lots of different people to show them how data analysis helps with their advertising.

Speaking of analysis, you did great earlier learning how to gather and organize data for analysis. It's definitely an important step in the data analysis process, so well done! Now let's talk about how to make sure that your organized data is complete and accurate. Clean data is the key to making sure your data has integrity before you analyze it. We'll show you how to make sure your data is clean and tidy. Cleaning and processing data is one part of the overall data analysis process. As a quick reminder, that process is ask, prepare, process, analyze, share, and act, which means it's time for us to explore the process phase, and I'm here to guide you the whole way.

I'm very familiar with where you are right now. I never heard of data analytics until I went through a program similar to this one. Once I started making progress, I realized how much I enjoyed data analytics and the doors it can open, and now I'm excited to help you open those same doors. One thing I've realized as I've worked for different companies is that clean data is important in every industry. For example, I learned early in my career to be on the lookout for duplicate data, a common problem that analysts come across when cleaning. I used to work for a company that had different types of subscriptions. In our data set, each user would have a new row for each subscription type they bought, which meant users would show up more than once in my data. So if I had counted the number of users in a table without accounting for duplicates like this, I would have counted some users twice instead of once. As a result, my analysis would have been wrong, which would have led to problems in my reports and for the stakeholders relying on my analysis. Imagine if I told the CEO that we had twice as many customers as we actually did! That's why clean data is so important.
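To make that duplicate-counting pitfall concrete, here's a minimal Python sketch. It isn't from the course; the field names and values are invented for illustration.

    # Each subscription adds a row, so the same user can appear twice.
    subscriptions = [
        {"user_id": 101, "plan": "basic"},
        {"user_id": 101, "plan": "premium"},  # same user, second subscription
        {"user_id": 102, "plan": "basic"},
    ]

    naive_count = len(subscriptions)                               # 3 (wrong)
    unique_users = len({row["user_id"] for row in subscriptions})  # 2 (right)
    print(naive_count, unique_users)

Counting rows doubles up the first user; counting distinct user IDs gives the answer the CEO should actually hear.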
So, the first step in processing data is learning about data integrity. You'll find out what data integrity is and why it's important to maintain it throughout the data analysis process. Sometimes you might not even have the data that you need, so you'll have to create it yourself. This will help you learn how sample size and random sampling can save you time and effort. Testing data is another important step to take when processing data. We'll share some guidance on how to test data before your analysis officially begins.

Just like you'd clean your clothes and your dishes in everyday life, analysts clean their data all the time, too. The importance of clean data will definitely be a focus here. You'll learn data cleaning techniques for all scenarios, along with some pitfalls to watch out for as you clean. You'll explore data cleaning in both spreadsheets and databases. Building on what you've already learned about spreadsheets, we'll talk more about SQL and how you can use it to clean data and do other useful things, too. When analysts clean their data, they do a lot more than a spot check to make sure it was done correctly. You'll learn ways to verify and report your cleaning results. This includes documenting your cleaning process, which has lots of benefits that we'll explore.

It's important to remember that processing data is just one of the tasks you'll complete as a data analyst. Actually, your skills with cleaning data might just end up being something you highlight on your resume when you start job hunting. Speaking of resumes, you'll be able to start thinking about how to build your own from the perspective of a data analyst. Once you're done here, you'll have a strong appreciation for clean data and how important it is in the data analysis process. So let's get started!

Welcome back. In this video, we're going to discuss data integrity and some risks you might run into as a data analyst. A strong analysis depends on the integrity of the data. If the data you're using is compromised in any way, your analysis won't be as strong as it should be. Data integrity is the accuracy, completeness, consistency, and trustworthiness of data throughout its life cycle. That might sound like a lot of qualities for the data to live up to, but trust me, it's worth it to check for them all before proceeding with your analysis. Otherwise, your analysis could be wrong, not because you did something wrong, but because the data you were working with was wrong to begin with.

When data integrity is low, it can cause anything from the loss of a single pixel in an image to an incorrect medical decision. In some cases, one missing piece can make all of your data useless. Data integrity can be compromised in lots of different ways. There's a chance data can be compromised every time it's replicated, transferred, or manipulated in any way. Data replication is the process of storing data in multiple locations. If you're replicating data at different times in different places, there's a chance your data will be out of sync. This data lacks integrity, because different people might not be using the same data for their findings, which can cause inconsistencies. There's also the issue of data transfer, which is the process of copying data from a storage device to memory, or from one computer to another. If your data transfer is interrupted, you might end up with an incomplete data set, which might not be useful for your needs. The data manipulation process involves changing the data to make it more organized and easier to read. Data manipulation is meant to make the data analysis process more efficient, but an error during the process can compromise that efficiency. Finally, data can also be compromised through human error, viruses, malware, hacking, and system failures, which can all lead to even more headaches.

I'll stop there; that's enough potentially bad news to digest. Let's move on to some potentially good news: in a lot of companies, the data warehouse or data engineering team takes care of ensuring data integrity. Coming up, we'll learn about checking data integrity as a data analyst, but rest assured, someone else will usually have your back, too. After you've found out what kind of data you're working with, it's important to double-check that your data is complete and valid before analysis. This will help ensure that your analysis and eventual conclusions are accurate. Checking data integrity is a vital step in processing your data to get it ready for analysis, whether you or someone else at your company is doing it. Coming up, you'll learn even more about data integrity. See you soon!

Hey there. It's good to remember to check for data integrity. It's also important to check that the data you use aligns with the business objective. This adds another layer to the maintenance of data integrity, because the data you're using might have limitations that you'll need to deal with. The process of matching data to business objectives can actually be pretty straightforward. Here's a quick example: let's say you're an analyst for a business that produces and sells auto parts. If you need to address a question about the revenue generated by the sale of a certain part, then you'd pull up the revenue table from the data set. If the question is about customer reviews, then you'd pull up the reviews table to analyze the average ratings.

But before digging into any analysis, you need to consider a few limitations that might affect it. If the data hasn't been cleaned properly, then you won't be able to use it yet; you would need to wait until a thorough cleaning has been done. Now, let's say you're trying to find how much an average customer spends, and you notice the same customer's data showing up in more than one row. This is called duplicate data. To fix this, you might need to change the format of the data, or you might need to change the way you calculate the average. Otherwise, it will seem like the data is for two different people, and you'll be stuck with misleading calculations. You might also realize there's not enough data to complete an accurate analysis. Maybe you only have a couple of months' worth of sales data. There's a slim chance you could wait for more data, but it's more likely that you'll have to change your process or find alternate sources of data while still meeting your objective.

I like to think of a data set like a picture. Take this picture: what are we looking at? Unless you're an expert traveler or know the area, it may be hard to pick out from just these two images. Visually, it's very clear when we aren't seeing the whole picture. When you get the complete picture, you realize you're in London. With incomplete data, it's hard to see the whole picture and get a real sense of what is going on. We sometimes trust data because, if it comes to us in rows and columns, it seems like everything we need is there if we just query it. But that's just not true.

I remember a time when I found out I didn't have enough data and had to find a solution. I was working for an online retail company and was asked to figure out how to shorten customer purchase-to-delivery time. Faster delivery times usually lead to happier customers. When I checked the data set, I found very limited tracking information; we were missing some pretty key details. So the data engineers and I created new processes to track additional information, like the number of stops in a journey. Using this data, we reduced the time it took from purchase to delivery and saw an improvement in customer satisfaction. That felt pretty great! Learning how to deal with data issues while staying focused on the objective will help set you up for success in your career as a data analyst. And your path to success continues: next up, you'll learn more about aligning data to objectives. Keep it up!
Every analyst has been in a situation where there is insufficient data to help with their business objective. Considering how much data is generated every day, it may be hard to believe, but it's true. So let's discuss what you can do when you have insufficient data. We'll cover how to set limits for the scope of your analysis and what data you should include.

At one point, I was a data analyst at a support center. Every day we received customer questions, which were logged as support tickets. I was asked to forecast the number of support tickets coming in per month to figure out how many additional people we needed to hire. It was very important that we had sufficient data spanning back at least a couple of years, because I had to account for year-to-year and seasonal changes. If I'd had only the current year's data available, I wouldn't have known that a spike in January is common and has to do with people asking for refunds after the holidays. Because I had sufficient data, I was able to suggest we hire more people in January to prepare.

Challenges are bound to come up, but the good news is that once you know your business objective, you'll be able to recognize whether you have enough data, and if you don't, you'll be able to deal with it before you start your analysis. Now let's check out some of those limitations you might come across and how you can handle different types of insufficient data.

Say you're working in the tourism industry and you need to find out which travel plans are searched most often. If you only use data from one booking site, you're limiting yourself to data from just one source. Other booking sites might show different trends that you would want to consider for your analysis. If a limitation like this impacts your analysis, you can stop and go back to your stakeholders to figure out a plan.

If your data set keeps updating, that means the data is still incoming and might not be complete. So if there's a brand-new tourist attraction that you're analyzing interest and attendance for, there's probably not enough data for you to determine trends. You might want to wait a month to gather data, or you can check in with the stakeholders and ask about adjusting the objective. For example, you might analyze trends from week to week instead of month to month. You could also base your analysis on trends over the past three months and say, "Here's what attendance at the attraction for month four could look like." You might not have enough data to know if this number is too low or too high, but you would tell stakeholders that it's your best estimate based on the data you currently have.

On the other hand, your data could be older and no longer relevant. Outdated data about customer satisfaction won't include the most recent responses, so you'd be relying on ratings for hotels or vacation rentals that might no longer be accurate. In this case, your best bet might be to find a new data set to work with. Data that's geographically limited could also be unreliable. If your company is global, you wouldn't want to use data limited to travel in just one country; you'd want a data set that includes all countries.

So that's just a few of the most common limitations you'll come across and some ways you can adjust for them. You can identify trends with the available data, or wait for more data if time allows. You can talk with stakeholders and adjust your objective, or you can look for a new data set. The need to take these steps will depend on your role in your company and possibly the needs of the wider industry, but learning how to deal with insufficient data is always a great way to set yourself up for success. Your data analyst powers are growing stronger, and just in time: after you learn more about limitations and solutions, you'll learn about statistical power, another fantastic tool for you to use. See you soon!

Okay, so earlier we talked about having the right kind of data to meet your business objective, and the importance of having the right amount of data to make sure your analysis is as accurate as possible. You might remember that, for data analysts, a population is all possible data values in a certain data set. If you're able to use 100% of a population in your analysis, that's great, but sometimes collecting information about an entire population just isn't possible; it's too time-consuming or expensive. For example, let's say a global organization wants to know more about pet owners who have cats. You're tasked with finding out which kinds of toys cat owners in Canada prefer, but there are millions of cat owners in Canada, so getting data from all of them would be a huge challenge. Fear not! Allow me to introduce you to sample size.

When you use a sample size, or sample, you use a part of a population that's representative of the population. The goal is to get enough information from a small group within a population to make predictions or conclusions about the whole population. The sample size helps determine the degree to which you can be confident that your conclusions accurately represent the population. So for the data on cat owners, a sample might contain data about hundreds or thousands of people rather than millions. Using a sample for analysis is more cost-effective and takes less time. If done carefully and thoughtfully, you can get the same results using a sample instead of trying to hunt down every single cat owner to find out their favorite cat toys.

There is a potential downside, though. When you only use a small sample of a population, it can lead to uncertainty. You can't really be 100% sure that your statistics are a complete and accurate representation of the population. This leads to sampling bias, which we covered earlier in the program. Sampling bias is when a sample isn't representative of the population as a whole. This means some members of the population are being overrepresented or underrepresented. For example, if the survey used to collect data from cat owners only included people with smartphones, then cat owners who don't have a smartphone wouldn't be represented in the data.

Using random sampling can help address some of those issues with sampling bias. Random sampling is a way of selecting a sample from a population so that every possible type of the sample has an equal chance of being chosen. Going back to our cat owners again: using a random sample of cat owners means cat owners of every type have an equal chance of being chosen. So cat owners who live in apartments in Ontario would have the same chance of being represented as those who live in houses in Alberta.

As a data analyst, you'll find that creating sample sizes usually takes place before you even get to the data, but it's still good for you to know that the data you're going to analyze is representative of the population and works with your objective. It's also good to know what's coming up in your data journey: in the next video, you'll have the option to become even more comfortable with sample sizes. See you there!
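Here's a quick hedged sketch of simple random sampling in Python. The population is invented, and a real project would sample from actual records, but the key property is the same: random.sample gives every member an equal chance of selection.

    import random

    # An invented population of cat owners, for illustration only
    population = [f"cat_owner_{i}" for i in range(1, 100_001)]

    random.seed(42)  # seeded so the example is repeatable
    sample = random.sample(population, k=1000)  # 1,000 owners, chosen uniformly
    print(len(sample), sample[:3])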
Hey there! We've all probably dreamed of having a superpower at least once in our lives. I know I have; I'd love to be able to fly. But there's another superpower you might not have heard of: statistical power. Statistical power is the probability of getting meaningful results from a test. I'm guessing that's not a superpower any of you have dreamed about. Still, it's a pretty great superpower for data analysts.

Your projects might begin with a test or study. Hypothesis testing is a way to see if a survey or experiment has meaningful results. Here's an example: let's say you work for a restaurant chain that's planning a marketing campaign for their new milkshakes. You need to test the ad on a group of customers before turning it into a nationwide ad campaign. In the test, you want to check whether customers like or dislike the campaign. You also want to rule out any factors outside of the ad that might lead them to say they don't like it.

Using all your customers would be too time-consuming and expensive, so you'll need to figure out how many customers you'll need to show that the ad is effective. Fifty probably wouldn't be enough. Even if you randomly chose 50 customers, you might end up with customers who don't like milkshakes at all, and if that happens, you won't be able to measure the effectiveness of your ad in getting more milkshake orders, since no one in the sample would order them. That's why you need a larger sample size, so you can make sure you get a good number of all types of people for your test. Usually, the larger the sample size, the greater the chance you'll have statistically significant results with your test. And that's statistical power. In this case, using as many customers as possible will show the actual differences between the groups who like or dislike the ad versus people whose decision wasn't based on the ad at all.

There are ways to accurately calculate statistical power, but we won't go into them here. You might need to calculate it on your own as a data analyst. For now, you should know that statistical power is usually shown as a value out of one, so if your statistical power is 0.6, that's the same thing as saying 60%. In the milkshake ad test, if you found a statistical power of 60%, that means there's a 60% chance of you getting a statistically significant result on the ad's effectiveness. "Statistically significant" is a term used in statistics; if you want to learn more about the technical meaning, you can search online. But in basic terms, if a test is statistically significant, it means the results of the test are real and not an error caused by random chance. So there's a 60% chance that the results of the milkshake ad test are reliable and real, and a 40% chance that the result of the test is wrong. Usually, you need a statistical power of at least 0.8, or 80%, to consider your results statistically significant.

Let's check out one more scenario. We'll stick with milkshakes because, well, I like milkshakes. Imagine you work for a restaurant chain that wants to launch a brand-new birthday-cake-flavored milkshake. This milkshake will be more expensive to produce than your other milkshakes. Your company hopes that the buzz around the new flavor will bring in more customers and money to offset this cost. They want to test this out in a few restaurant locations first. So let's figure out how many locations you'd have to use to be confident in your results.

First, you'd have to think about what might prevent you from getting statistically significant results. Are there restaurants running any other promotions that might bring in new customers? Do some restaurants have customers that always buy the newest item, no matter what it is? Do some locations have construction that recently started that would prevent customers from even going to the restaurant? To get a higher statistical power, you'd have to consider all of these factors before you decide how many locations to include in your sample for your study. You want to make sure any effect is most likely due to the new milkshake flavor, not another factor. The measurable effects would be an increase in sales, or in the number of customers at the locations in your sample.

That's it for now. Coming up, we'll explore sample sizes in more detail, so you can get a better idea of how they impact your tests and studies. In the meantime, you've gotten to know a little bit more about milkshakes, superpowers, and, of course, statistical power. Sadly, only statistical power can truly be useful for data analysts, though putting on my cape and flying to grab a milkshake right now does sound pretty good.
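The course skips the power calculation itself, but for the curious, here's a hedged sketch of one standard approach, a two-group comparison using the statsmodels library. The effect size of 0.2 (a small effect) and the two-group setup are my assumptions for illustration, not figures from the course.

    from statsmodels.stats.power import TTestIndPower

    analysis = TTestIndPower()

    # Power with only 50 customers per group: far below the 0.8 rule of thumb
    power = analysis.solve_power(effect_size=0.2, nobs1=50, alpha=0.05)
    print(f"power with 50 per group: {power:.2f}")

    # Sample size per group needed to reach 80% power
    n = analysis.solve_power(effect_size=0.2, power=0.8, alpha=0.05)
    print(f"per-group size for 80% power: {n:.0f}")  # roughly 394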
Great to see you again! In this video, we'll go into more detail about sample sizes and data integrity. If you've ever been to a store that hands out samples, you know it's one of life's little pleasures. For me, anyway. Those small samples are also a very smart way for businesses to learn more about their products from customers without having to give everyone a free sample. A lot of organizations use sample size in a similar way: they take one part of something larger, in this case a sample of a population. Sometimes they'll perform complex tests on their data to see if it meets their business objectives. We won't go into all the calculations needed to do this effectively. Instead, we'll focus on a big-picture look at the process and what it involves.

As a quick reminder, sample size is a part of a population that is representative of the population. For businesses, it's a very important tool. It can be both expensive and time-consuming to analyze an entire population of data, so using a sample usually makes the most sense and can still lead to valid and useful findings. There are handy calculators online that can help you find a sample size. You need to input the confidence level, population size, and margin of error. We've talked about population size before; to build on that, we'll learn about confidence level and margin of error. Knowing about these concepts will help you understand why you need them to calculate sample size.

The confidence level is the probability that your sample accurately reflects the greater population. You can think of it the same way as confidence in anything else: it's how strongly you feel that you can rely on something or someone. Having a 99% confidence level is ideal, but most industries hope for at least a 90% or 95% confidence level. Industries like pharmaceuticals usually want a confidence level that's as high as possible when they are using a sample. This makes sense because they're testing medicines and need to be sure they work and are safe for everyone to use. For other studies, organizations might just need to know that the test or survey results have them heading in the right direction. For example, if a paint company is testing out new colors, a lower confidence level is okay.

You also want to consider the margin of error for your study. You'll learn more about this soon, but it basically tells you how close your sample's results are to what your results would be if you used the entire population that your sample represents.

Think of it like this: let's say the principal of a middle school approaches you with a study about students' candy preferences. They need to know an appropriate sample size, and they need it now. The school has a student population of 500, and they're asking for a confidence level of 95% and a margin of error of 5%. We've set up a calculator in a spreadsheet, but you can also easily find this type of calculator by searching "sample size calculator" on the internet. And just like those calculators, our spreadsheet calculator doesn't show any of the more complex calculations for figuring out sample size, so all we need to do is input the numbers for our population, confidence level, and margin of error. When we type 500 for our population size, 95 for our confidence level percentage, and 5 for our margin of error percentage, the result is about 218. That means, for this study, an appropriate sample size would be 218.

So if we surveyed 218 students and found that 55% of them preferred chocolate, then we could be pretty confident that would be true of all 500 students. 218 is the minimum number of people we need to survey, based on our criteria of a 95% confidence level and a 5% margin of error. And in case you're wondering, the confidence level and margin of error don't have to add up to 100%; they're independent of each other. So let's say we change our margin of error from 5% to 3%. Then we find that our sample size would need to be larger, about 341 instead of 218, to make the results of the study more representative of the population. Feel free to practice with an online calculator. Knowing sample size and how to find it will help you when you work with data, and we've got more useful knowledge coming your way, including learning about margin of error. See you soon!
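If you're curious what those calculators do under the hood, one common method is Cochran's formula with a finite-population correction. This Python sketch is my reconstruction, not the course's spreadsheet, but it reproduces both numbers from the example:

    import math

    def sample_size(population, z, margin_of_error, p=0.5):
        """Cochran's formula with a finite-population correction."""
        n0 = (z ** 2) * p * (1 - p) / margin_of_error ** 2
        return math.ceil(n0 / (1 + (n0 - 1) / population))

    # 95% confidence corresponds to a z-score of about 1.96
    print(sample_size(500, 1.96, 0.05))  # 218, matching the example
    print(sample_size(500, 1.96, 0.03))  # 341, with the tighter 3% margin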
Hey there! Earlier, we touched on margin of error without explaining it completely. Well, we're going to right that wrong in this video by explaining margin of error more; we'll even include an example of how to calculate it. As a data analyst, it's important for you to figure out sample size and variables like confidence level and margin of error before running any kind of test or survey. It's the best way to make sure your results are objective, and it gives you a better chance of getting statistically significant results. But if you already know the sample size, like when you're given survey results to analyze, you can calculate the margin of error yourself. Then you'll have a better idea of how much of a difference there is between your sample and your population.

Let's start at the beginning, with a more complete definition: margin of error is the maximum amount that the sample results are expected to differ from those of the actual population. Let's think about an example. It would be great to survey or test an entire population, but it's usually impossible or impractical to do this, so instead we take a sample of the larger population. Based on the sample size, the resulting margin of error will tell us how different the results might be compared to the results if we had surveyed the entire population. Margin of error helps you understand how reliable the data from your hypothesis testing is. The closer to zero the margin of error, the closer your results from your sample would match results from the overall population.

For example, let's say you completed a nationwide survey using a sample of the population. You asked people who work five-day workweeks whether they like the idea of a four-day workweek. Your survey tells you that 60% prefer a four-day workweek. The margin of error was 10%, which tells us that between 50% and 70% like the idea. So if we were to survey all five-day workers nationwide, between 50% and 70% would agree with our results. Keep in mind, our range is between 50% and 70%; that's because the margin of error is counted in both directions from the survey result of 60%. If you set up a 95% confidence level for your survey, there will be a 95% chance that the entire population's responses will fall between 50% and 70% saying, yes, they want a four-day workweek. Since your margin of error overlaps with that 50% mark, you can't say for sure that the public likes the idea of a four-day workweek. In that case, you'd have to say your survey was inconclusive.

Now, if you wanted a lower margin of error, say 5%, with a range between 55% and 65%, you could increase the sample size. But if you've already been given the sample size, you can calculate the margin of error yourself. Then you can decide for yourself how much of a chance your results have of being statistically significant, based on your margin of error. In general, the more people you include in your survey, the more likely your sample is representative of the entire population. Decreasing the confidence level would also have the same effect, but that would also make it less likely that your survey is accurate.

So, to calculate margin of error, you need three things: population size, sample size, and confidence level. And just like with sample size, you can find lots of calculators online by searching "margin of error calculator," but we'll show you in a spreadsheet, just like we did when we calculated sample size. Let's say you're running a study on the effectiveness of a new drug. You have a sample size of 500 participants whose condition affects 1% of the world's population. That's about 80 million people, which is the population for your study. Since it's a drug study, you need to have a confidence level of 99%. You also need a low margin of error. Let's calculate it. We'll put the numbers for population, confidence level, and sample size in the appropriate spreadsheet cells, and our result is a margin of error of close to 6%, plus or minus. When the drug study is complete, you'd apply the margin of error to your results to determine how reliable your results might be.

Calculators like this one in the spreadsheet are just one of the many tools you can use to ensure data integrity, and it's also good to remember that checking for data integrity and aligning the data with your objectives will put you in good shape to complete your analysis. Knowing about sample size, statistical power, margin of error, and the other topics we covered will help your analysis run smoothly. That's a lot of new concepts to take in. If you'd like to review them at any time, you can find them all in the glossary, or feel free to rewatch the video. Soon you'll explore the ins and outs of clean data. The data adventure keeps moving, and I'm so glad you're moving along with it. You've got this!
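Here's a hedged reconstruction of that drug-study number in Python. It's my sketch of a standard margin-of-error formula (with a finite-population correction), not the course's hidden spreadsheet calculation:

    import math

    def margin_of_error(population, sample, z, p=0.5):
        # Standard error for a proportion, with a finite-population correction
        se = math.sqrt(p * (1 - p) / sample)
        fpc = math.sqrt((population - sample) / (population - 1))
        return z * se * fpc

    # 80 million people, 500 participants, 99% confidence (z of about 2.576)
    moe = margin_of_error(80_000_000, 500, 2.576)
    print(f"{moe:.1%}")  # close to 6%, plus or minus, as in the example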
Can you guess what inaccurate or bad data costs businesses every year? Thousands of dollars? Millions? Billions? Well, according to IBM, the yearly cost of poor-quality data is $3.1 trillion in the US alone. That's a lot of zeros! Now, can you guess the number one cause of poor-quality data? It's not a new system implementation or a computer technical glitch. The most common factor is actually human error.

Here's a spreadsheet from a law office. It shows customers, the legal services they bought, the service order number, how much they paid, and the payment method. Dirty data can be the result of someone typing in a piece of data incorrectly, inconsistent formatting, blank fields, or the same piece of data being entered more than once, which creates duplicates. Dirty data is data that's incomplete, incorrect, or irrelevant to the problem you're trying to solve. When you work with dirty data, you can't be sure that your results are correct. In fact, you can pretty much bet they won't be. Earlier, you learned that data integrity is critical to reliable data analytics results, and clean data helps you achieve data integrity. Clean data is data that's complete, correct, and relevant to the problem you're trying to solve. When you work with clean data, you'll find that your projects go much more smoothly.

I remember the first time I witnessed firsthand how important clean data really is. I had just started using SQL, and I thought it worked like magic. I could have the computer sum up millions of numbers, saving me tons of time and effort. But I quickly discovered that only works when the data is clean. If there was even one accidental letter in a column that should only have numbers, the computer wouldn't know what to do, so it would throw an error, and suddenly I was stuck. And there's no way I can add up millions of numbers by myself! So I had to clean up that data to make it work. The good news is that there are plenty of effective processes and tools to help you do that. Coming up, you'll gain the skills and knowledge you need to make sure the data you work with is always clean. Along the way, we'll dig deeper into the difference between clean and dirty data and why clean data is so important. We'll also talk about different ways to clean your data and common problems to look for during the process. Ready to start? Let's do it!

Clean data is incredibly important for effective analysis. If a piece of data is entered into a spreadsheet or database incorrectly, or if it's repeated, or if a field is left blank, or if data formats are inconsistent, the result is dirty data. Small mistakes can lead to big consequences in the long run. Now, I'll be completely honest with you: data cleaning is like brushing your teeth. It's something you should do, and do properly, because otherwise it can cause serious problems. For teeth, that might be cavities or gum disease. For data, that might be costing your company money, or an angry boss. But here's the good news: if you keep brushing twice a day, every day, it becomes a habit. Soon, you don't even have to think about it. It's the same with data. Trust me, it'll make you look great when you take the time to clean up that dirty data.

As a quick refresher, dirty data is incomplete, incorrect, or irrelevant to the problem you're trying to solve. It can't be used in a meaningful way, which makes analysis very difficult, if not impossible. On the other hand, clean data is complete, correct, and relevant to the problem you're trying to solve. This allows you to understand and analyze information; identify important patterns; connect related information; and draw useful conclusions. Then you can apply what you've learned to make effective decisions.

In some cases, you won't have to do a lot of work to clean data. For example, when you use internal data that's been verified and cared for by your company's data engineers and data warehouse team, it's more likely to be clean. Let's talk about some people you'll work with as a data analyst. Data engineers transform data into a useful format for analysis and give it a reliable infrastructure. This means they develop, maintain, and test databases, data processors, and related systems. Data warehousing specialists develop processes and procedures to effectively store and organize data. They make sure that data is available, secure, and backed up to prevent loss. When you become a data analyst, you can learn a lot by working with the person who maintains your databases to learn about their systems. If data passes through the hands of a data engineer or a data warehousing specialist first, you know you're off to a good start on your project. There are a lot of great career opportunities as a data engineer or data warehousing specialist. If this kind of work sounds interesting to you, maybe your career path will involve helping organizations save lots of time, effort, and money by making sure their data is sparkling clean.

But even if you go in a different direction with your data analytics career, and have the advantage of working with data engineers and warehousing specialists, you're still likely to have to clean your own data. And it's important to remember: no data set is perfect. It's always a good idea to examine and clean data before beginning analysis. Here's an example. Let's say you're working on a project where you need to figure out how many people use your company's software program. You have a spreadsheet that was created internally and verified by a data engineer and a data warehousing specialist. Check out the column labeled "username." Now, it might seem logical that you can just scroll down and count the rows to figure out how many users you have, but that won't work, because one person sometimes has more than one username. Maybe they registered from different email addresses, or maybe they have a work and a personal account. In situations like this, you would need to clean the data by eliminating any rows that are duplicates. Once you've done that, there won't be any more duplicate entries, and your spreadsheet's ready to be put to work.

Okay, so far we've discussed working with internal data, but data cleaning becomes even more important when working with external data, especially if it comes from multiple sources. Let's say the software company from our example surveyed its customers to learn how satisfied they are with its software product. But when you review the survey data, you find that you have several nulls. A null is an indication that a value does not exist in a data set. Note that it's not the same as a zero. In the case of our survey, a null would mean the customer skipped that question; a zero would mean they provided zero as their response. To do your analysis, you would first need to clean this data. Step one would be to decide what to do with those nulls: you could either filter them out and communicate that you now have a smaller sample size, or you can keep them in and learn from the fact that the customers did not provide responses. There are lots of reasons why this could have happened. Maybe your survey questions weren't written as well as they could be; maybe they were confusing or biased, something we learned about earlier. Okay, we've touched on the basics of cleaning internal and external data, but there's lots more to come. Soon we'll learn about the common errors to be aware of, to ensure your data is complete, correct, and relevant. See you soon!
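As a small illustration of those two options for nulls, here's a hedged pandas sketch. The column names and values are invented for illustration.

    import pandas as pd

    survey = pd.DataFrame({
        "customer_id": [1, 2, 3, 4],
        "satisfaction": [5, None, 4, None],  # None marks a skipped question
    })

    print(survey["satisfaction"].isna().sum())  # 2 nulls, not the same as zeros

    # Option 1: filter them out and report the smaller sample size
    answered = survey.dropna(subset=["satisfaction"])

    # Option 2: keep them and study the non-responses themselves
    skipped = survey[survey["satisfaction"].isna()]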
I'm Angie. I'm a program manager of engineering at Google. I truly believe that cleaning data is the heart and soul of data. It's how you get to know your data: its quirks, its flaws, its mysteries. I love a good mystery. I remember one time I found that somebody had purchased, I think it was, one million dollars' worth of chicken sandwiches in one transaction, and this mystery drove me nuts. I had all these questions. Could this have really happened? Maybe it was a really big birthday party? How did they make a million dollars' worth of chicken sandwiches? So I was cleaning my data and trying to figure out where it went wrong, and we ended up finding out that we had been squaring and multiplying all of our transactions for a very specific case. It took us about three days to figure this out. I will never forget the moment when it was like, aha, we got to the bottom of it. The result was that our data was cleaned, and we had this great data set that we could use for analysis. But what I loved was the mystery of it, getting to know all these weird intricacies about my data set. It felt almost like a superpower, like I was a detective, and I'd gone in there and really solved something. So I love cleaning data.

Hey there! In this video, we'll focus on common issues associated with dirty data. These include spelling and other text errors; inconsistent labels, formats, and field lengths; missing data; and duplicates. This will help you recognize problems quicker and give you the information you need to fix them when you encounter something similar during your own analysis. This is incredibly important in data analytics.

Okay, let's go back to our law office spreadsheet. As a quick refresher, we'll start by checking out the different types of dirty data it shows. Sometimes someone might key in a piece of data incorrectly. Other times, they might not keep data formats consistent. It's also common to leave a field blank; that's also called a null, which we learned about earlier. And if someone adds the same piece of data more than once, that creates a duplicate. So let's break that down; then we'll learn about a few other types of dirty data and strategies for cleaning it.

Misspellings, spelling variations, mixed-up letters, inconsistent punctuation, and typos in general happen when someone types in a piece of data incorrectly. As a data analyst, you'll also deal with different currencies. For example, one data set could be in US dollars and another in euros, and you don't want to get them mixed up. We want to find these types of errors and fix them, like this. You'll learn more about this soon. Clean data depends largely on the data integrity rules that an organization follows, such as spelling and punctuation guidelines. For example, a beverage company might ask everyone working in its database to enter data about volume in fluid ounces instead of cups. It's great when an organization has rules like this in place; it really helps minimize the amount of data cleaning required. But it can't eliminate it completely. Like we discussed earlier, there's always the possibility of human error.

The next type of dirty data our spreadsheet shows is inconsistent formatting. In this example, something that should be formatted as currency is shown as a percentage. Until this error is fixed, like this, the law office will have no idea how much money this customer paid for its services. We'll learn about different ways to solve this and many other problems soon.

We discussed nulls previously, but as a reminder, nulls are empty fields. This kind of dirty data requires a little more work than just fixing a spelling error or changing a format. In this example, the data analyst would need to research which customer had a consultation on July 4th, 2020. Then, when they find the correct information, they'd have to add it to the spreadsheet.

Another common type of dirty data is a duplicate. Maybe two different people added this appointment on August 13th, not realizing that someone else had already done it. Or maybe the person entering the data hit copy and paste by accident. Whatever the reason, it's the data analyst's job to identify this error and correct it by deleting one of the duplicates.

Okay, now let's continue on to some other types of dirty data. The first has to do with labeling. To understand labeling, imagine trying to get a computer to correctly identify panda bears among images of all different kinds of animals. You need to show the computer thousands of images of panda bears, all labeled as panda bears. Any incorrectly labeled picture, like the one here that's labeled just "bear," will cause a problem.

The next type of dirty data is an inconsistent field length. You learned earlier that a field is a single piece of information from a row or column of a spreadsheet. Field length is a tool for determining how many characters can be keyed into a field. Assigning a certain length to the fields in your spreadsheet is a great way to avoid errors. For instance, if you have a column for someone's birth year, you know the field length is four, because all years are four digits long. Some spreadsheet applications have a simple way to specify field lengths and make sure users can only enter a certain number of characters into a field. This is part of data validation. Data validation is a tool for checking the accuracy and quality of data before adding or importing it. Data validation is a form of data cleansing, which you'll learn more about soon. But first, you'll get familiar with more techniques for cleaning data. This is a very important part of the data analyst's job, and I look forward to sharing these data cleaning strategies with you.
Hi! Now that you're familiar with some of the most common types of dirty data, it's time to clean them up. As you've learned, clean data is essential to data integrity and reliable solutions and decisions. The good news is that spreadsheets have all kinds of tools you can use to get your data ready for analysis. The techniques for data cleaning will be different depending on the specific data set you're working with, so we won't cover everything you might run into, but this will give you a great starting point for fixing the types of dirty data analysts find most often. Think of everything that's coming up as a teaser trailer of data cleaning tools: I'm going to give you a basic overview of some common tools and techniques, and then we'll practice them again later on.

Here, we'll discuss how to remove unwanted data, clean up text to remove extra spaces and blanks, fix typos, and make formatting consistent. However, before removing unwanted data, it's always good practice to make a copy of the data set. That way, if you remove something that you end up needing in the future, you can easily access it and put it back in the data set. Once that's done, you can move on to getting rid of duplicates, or data that isn't relevant to the problem you're trying to solve.

Typically, duplicates appear when you're combining data sets from more than one source, or using data from multiple departments within the same business. You've already learned a bit about duplicates, but let's practice removing them once more, now using this spreadsheet, which lists members of a professional logistics association. Duplicates can be a big problem for data analysts, so it's really important that you can find and remove them before any analysis starts. Here's an example of what I'm talking about: let's say this association has duplicates of one person's $500 membership in its database. When the data is summarized, the analyst would think there was $1,000 being paid by this member and would make decisions based on that incorrect data. But in reality, this member only paid $500. These problems can be fixed manually, but most spreadsheet applications also offer lots of tools to help you find and remove duplicates.

Now, irrelevant data, which is data that doesn't fit the specific problem that you're trying to solve, also needs to be removed. Going back to our association membership list example, let's say a data analyst was working on a project that focused only on current members. They wouldn't want to include information on people who are no longer members, or who never joined in the first place. Removing irrelevant data takes a little more time and effort, because you have to figure out the difference between the data you need and the data you don't. But believe me, making those decisions will save you a ton of effort down the road.

The next step is removing extra spaces and blanks. Extra spaces can cause unexpected results when you sort, filter, or search through your data, and because these characters are easy to miss, they can lead to unexpected and confusing results. For example, if there's an extra space in a member ID number, when you sort the column from lowest to highest, that row will be out of place. To remove these unwanted spaces or blank cells, you can delete them yourself, or, again, you can rely on your spreadsheets, which offer lots of great functions for removing spaces and blanks automatically.

The next data cleaning step involves fixing misspellings, inconsistent capitalization, incorrect punctuation, and other typos. These types of errors can lead to some big problems. Let's say you have a database of emails that you use to keep in touch with your customers. If some emails have misspellings, a period in the wrong place, or any other kind of typo, not only do you run the risk of sending an email to the wrong people, you also run the risk of spamming random people. Think about our association membership example again: a misspelling might cause the data analyst to miscount the number of professional members, if they sorted this membership type and then counted the number of rows. Like the other problems we've come across, you can fix these problems manually, or you can use spreadsheet tools such as spell check, autocorrect, and conditional formatting to make your life easier. There are also easy ways to convert text to lowercase, uppercase, or proper case, which is one of the things we'll check out again later.

All right, we're getting there! The next step is removing formatting. This is particularly important when you get data from lots of different sources. Every database has its own formatting, which can cause the data to seem inconsistent. Creating a clean and consistent visual appearance for your spreadsheets will help make them a valuable tool for you and your team when making key decisions. Most spreadsheet applications also have a "clear formats" tool, which is a great time-saver. Cleaning data is an essential step in increasing the quality of your data, and now you know lots of different ways to do that. In the next video, you'll take that knowledge even further and learn how to clean up data that's come from more than one source.

Welcome back! So far, you've learned a lot about dirty data and how to clean up the most common errors in a data set. Now we're going to take that a step further and talk about cleaning up multiple data sets. Cleaning data that comes from two or more sources is very common for data analysts, but it does come with some interesting challenges. A good example is a merger, which is an agreement that unites two organizations into a single new one. In the logistics field, there have been lots of big changes recently, mostly because of the e-commerce boom. With so many people shopping online, it makes sense that the companies responsible for delivering those products to their homes are in the middle of a big shake-up. When big things happen in an industry, it's common for two organizations to team up and become stronger through a merger.

Let's talk about how that would affect our logistics association. As a quick reminder, this spreadsheet lists association member ID numbers, first and last names, addresses, how much each member pays in dues, when the membership expires, and the membership types. Now let's think about what would happen if the International Logistics Association decided to get together with the Global Logistics Association in order to help their members handle the incredible demands of e-commerce. First, all the data from each organization would need to be combined using data merging. Data merging is the process of combining two or more data sets into a single data set. This presents a unique challenge, because when two totally different data sets are combined, the information is almost guaranteed to be inconsistent and misaligned.

For example, the Global Logistics Association spreadsheet has a separate column for a person's suite, apartment, or unit number, but the International Logistics Association combines that information with the street address. This needs to be corrected to make the number of address columns consistent. Next, check out how the Global Logistics Association uses people's email addresses as their member IDs, while the International Logistics Association uses numbers. This is a big problem, because people in a certain industry, such as logistics, typically join multiple professional associations, so there's a very good chance that these data sets include membership information on the exact same person, just in different ways. It's super important to remove those duplicates. Also, the Global Logistics Association has many more member types than the other organization. On top of that, it uses the term "young professional" instead of "student associate," but both describe members who are still in school or just starting their careers. If you were merging these two data sets, you'd need to work with your team to fix the fact that the two associations describe memberships very differently.

Okay, now you understand why the merging of organizations also requires the merging of data, and that can be tricky. But there are lots of other reasons why data analysts merge data sets. For example, in one of my past jobs, I merged a lot of data from multiple sources to get insights about our customers' purchases. The kinds of insights I gained helped me identify customer buying patterns. When merging data sets, I always begin by asking myself some key questions, to help me avoid redundancy and to confirm that the data sets are compatible. In data analytics, compatibility describes how well two or more data sets are able to work together.

So the first question I would ask is: do I have all the data I need? To gather customer purchase insights, I wanted to make sure I had data on customers, their purchases, and where they shopped. Next, I would ask: does the data I need exist within these data sets? As you learned earlier in this program, this involves considering the entire data set analytically. Looking through the data before I start using it lets me get a feel for what it's all about: what the schema looks like, whether it's relevant to my customer purchase insights, and whether it's clean data. That brings me to the next question: do the data sets need to be cleaned, or are they ready for me to use? And because I'm working with more than one source, I'll also ask myself: are the data sets cleaned to the same standard? For example, what fields are regularly repeated? How are missing values handled? How recently was the data updated? Finding the answers to these questions, and understanding whether I need to fix any problems at the start of a project, is a very important step in data merging.
In both the examples we explored here, a data analyst could use either spreadsheet tools or SQL queries to clean up, merge, and prepare the data sets for analysis. Depending on the tool you decide to use, the cleanup process can be simple or very complex. Soon, you'll learn how to make the best choice for your situation. And as a final note, programming languages like R are also very useful for cleaning data. You'll learn more about how to use R and the other concepts we covered soon.
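As a rough illustration of those merge issues, here's a hedged pandas sketch. The course demonstrates spreadsheets and SQL; this is my own equivalent, and every name and field in it is invented. Matching on a person's name is a simplification; a real project would need a sturdier matching key.

    import pandas as pd

    ila = pd.DataFrame({"member_id": ["100482"], "name": ["Ana Ruiz"],
                        "type": ["student associate"]})
    gla = pd.DataFrame({"member_id": ["ana@example.com"], "name": ["Ana Ruiz"],
                        "type": ["young professional"]})

    # Reconcile the two associations' differing membership labels
    gla["type"] = gla["type"].replace({"young professional": "student associate"})

    # Combine, then drop the same person recorded under two ID schemes
    combined = pd.concat([ila, gla], ignore_index=True)
    combined = combined.drop_duplicates(subset=["name", "type"])
    print(combined)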
Hi again! As you learned earlier, there are a lot of different ways to clean up data. I've shown you some examples of how you can clean data manually, such as searching for and fixing misspellings, or removing empty spaces and duplicates. We also learned that lots of spreadsheet applications have tools that help simplify and speed up the data cleaning process. There are a lot of great efficiency tools that data analysts use all the time, such as conditional formatting, removing duplicates, formatting dates, fixing text strings and substrings, and splitting text to columns. We'll explore those in more detail now.

The first is something called conditional formatting. Conditional formatting is a spreadsheet tool that changes how cells appear when values meet specific conditions. Likewise, it can let you know when a cell does not meet the conditions you've set. Visual cues like this are very useful for data analysts, especially when we're working in a large spreadsheet with lots of data. Making certain data points stand out makes the information easier to understand and analyze, and for cleaning data, knowing when the data doesn't follow the condition is very helpful.

Let's return to the logistics association spreadsheet to check out conditional formatting in action. We'll use conditional formatting to highlight blank cells. That way, we know where there's missing information, so we can add it to the spreadsheet. To do this, we'll start by selecting the range we want to search. For this example, we're not focused on Address 3 and Address 5, so the range will include all the columns in our spreadsheet except for F and H. Next, we'll go to Format and choose Conditional formatting. Great, our range is automatically indicated in the field. The format rule will be to format cells if the cell is empty. Finally, we'll choose the formatting style. I'm going to pick a shade of bright pink, so my blanks really stand out. Then click Done, and the blank cells are instantly highlighted.

Okay, the next spreadsheet tool removes duplicates. As you've learned before, it's always smart to make a copy of the data set before removing anything, so let's do that now. Great, now we can continue. You might remember that our example spreadsheet has one association member listed twice. To fix that, go to Data and select Remove duplicates. Remove duplicates is a tool that automatically searches for and eliminates duplicate entries from a spreadsheet. Choose "Data has header row," because our spreadsheet has a row at the very top that describes the contents of each column. Next, select all, because we want to inspect our entire spreadsheet. Finally, Remove duplicates. You'll notice the duplicate row was found and immediately removed.

Another useful spreadsheet tool enables you to make formats consistent. For example, some of the dates in the spreadsheet aren't in a standard date format. This could be confusing if you wanted to analyze when association members joined, how often they renewed their memberships, or how long they've been with the association. To make all of our dates consistent, first select column J, then go to Format, select Number, then Date. Now all of our dates have a consistent format.

Before we go over the next tool, I want to explain what a text string is. In data analytics, a text string is a group of characters within a cell, most often composed of letters. An important characteristic of a text string is its length, which is the number of characters in it. You'll learn more about that soon. For now, it's also useful to know that a substring is a smaller subset of a text string.

Okay, now let's talk about Split. Split is a tool that divides a text string around a specified character and puts each fragment into a new and separate cell. Split is helpful when you have more than one piece of data in a cell and you want to separate them out. This might be a person's first and last name listed together, or it could be a cell that contains someone's city, state, country, and zip code, when you actually want each of those in its own column. So let's say this association wanted to analyze all of the different professional certifications its members have earned. To do this, you want each certification separated out into its own column. Right now, the certifications are separated by a comma. That's the specified text separating each item, also called the delimiter. Let's get them separated: highlight the column, then select Data and Split text to columns. This spreadsheet application automatically knew that the comma was the delimiter and separated each certification, but sometimes you might need to specify what the delimiter should be. You can do that here.

Split text to columns is also helpful for fixing instances of numbers stored as text. Sometimes values in your spreadsheet will seem like numbers, but they're formatted as text. This can happen when copying and pasting from one place to another, or if the formatting is wrong. For this example, let's check out a new spreadsheet from a cosmetics maker. If a data analyst wanted to determine total profits, they could add up everything in column F. But there's a problem: one of the cells has an error. If you check into it, you learn that the 707 in this cell is text and can't be changed into a number, so when the spreadsheet tries to multiply the cost of the product by the number of units sold, it's unable to make the calculation. But if we select the Orders column and choose Split text to columns, the error is resolved, because now it can be treated as a number.

Coming up, you'll learn about a tool that does just the opposite: CONCATENATE is a function that joins multiple text strings into a single string. Spreadsheets are a very important part of data analytics. They save data analysts time and effort and help us eliminate errors each and every day. Here you learned about some of the most common tools that we use, but there's a lot more to come. Next, we'll learn even more about data cleaning with spreadsheet tools. Bye for now!
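For comparison, here's a hedged pandas sketch of rough equivalents to three of those spreadsheet tools: removing duplicates, standardizing a date column, and splitting text to columns. The column names and values are invented.

    import pandas as pd

    members = pd.DataFrame({
        "member_id": ["M1", "M2", "M2"],              # M2 entered twice
        "joined": ["2020-01-15", "2021-02-03", "2021-02-03"],
        "certs": ["CPL, CSCP", "CLTD", "CLTD"],
    })

    members = members.drop_duplicates()               # like Remove duplicates
    members["joined"] = pd.to_datetime(members["joined"])  # consistent date type

    # Like Split text to columns, with the comma as the delimiter
    certs_split = members["certs"].str.split(", ", expand=True)
    print(members, certs_split, sep="\n")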
split text to columns is also helpful for fixing instances of numbers stored as text sometimes values in your spreadsheet will seem like numbers but they're formatted as text this can happen when copying and pasting from one place to another or if the formatting is wrong for this example let's check out a new spreadsheet from a cosmetics maker if a data analyst wanted to determine total profits they could add up everything in column f but there's a problem one of the cells has an error if you check into it you learn that the 707 in this cell is text and can't be changed into a number so when the spreadsheet tries to multiply the cost of the product by the number of units sold it's unable to make the calculation but if we select the orders column and choose split text to columns the error is resolved because now it can be treated as a number coming up you'll learn about a tool that does just the opposite concatenate is a function that joins multiple text strings into a single string spreadsheets are a very important part of data analytics they save data analysts time and effort and help us eliminate errors each and every day here you learned about some of the most common tools that we use but there's a lot more to come next we'll learn even more about data cleaning with spreadsheet tools bye for now welcome back you've learned about some very useful data cleaning tools that are built right into spreadsheet applications now we'll explore how functions can optimize your efforts to ensure data integrity as a reminder a function is a set of instructions that performs a specific calculation using the data in a spreadsheet the first function we'll discuss is called countif countif is a function that returns the number of cells that match a specified value basically it counts the number of times a value appears in a range of cells let's go back to our professional association spreadsheet in this example we want to make sure the association membership prices are listed accurately we'll use countif to check for some common problems like negative numbers or a value that's much less or much greater than expected to start let's find the least expensive membership a hundred dollars for student associates so that'll be the lowest number that exists in this column if any cell has a value that's less than 100 countif will alert us we'll add a few more rows at the bottom of our spreadsheet then beneath column h type member dues less than a hundred dollars next type the function in the cell next to it every function has a certain syntax that needs to be followed for it to work syntax is a predetermined structure that includes all required information and its proper placement the syntax of a countif function looks like this equals countif open parenthesis range comma the specified value in quotation marks and a close parenthesis so it will show up like this where i2 through i72 is the range and the value is less than 100 this tells the function to go through column i and return a count of all cells that contain a number less than 100 turns out there is one scrolling through our data we find that one piece of data was mistakenly keyed in as a negative number let's fix that now next we'll use countif to search for any values that are more than we would expect the most expensive membership type is 500 dollars for corporate members type the function in the cell this time it'll appear like this i2 through i72 is still the range but the value is greater than 500
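here's a sketch of both checks as they'd be typed into the sheet using the i2 through i72 range from our example:

=COUNTIF(I2:I72, "<100")
=COUNTIF(I2:I72, ">500")

each formula returns how many cells in the range break the rule so any result above zero means there's something to investigate
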
there's one here too check it out this entry has an extra zero it should be a hundred dollars okay the next function we'll discuss is called len len is a function that tells you the length of a text string by counting the number of characters it contains this is useful when cleaning data if you have a certain piece of information in your spreadsheet that you know must be a certain length for example this association uses six digit member identification codes so if we just imported this data and wanted to be sure our codes are all the correct number of digits we'd use len the syntax of len is equals len open parenthesis the range and a close parenthesis so we'll insert a new column after member id then type an equal sign len and an open parenthesis the range is the first member id number in a2 finish the function by closing the parentheses it tells us that there are six characters in cell a2 let's continue the function through the entire column and find out if any results are not six but instead of manually going through our spreadsheet to search for these instances we'll use conditional formatting we talked about conditional formatting earlier it's a spreadsheet tool that changes how cells appear when values meet specific conditions let's practice that now select all of column b except for the header then go to format and choose conditional formatting the format rule is to format cells if not equal to 6 click done and the cell with the 7 inside is highlighted all right now we're going to talk about left and right left is a function that gives you a set number of characters from the left side of a text string right is a function that gives you a set number of characters from the right side of a text string as a quick reminder a text string is a group of characters within a cell commonly composed of letters numbers or both to see these functions in action let's go back to the spreadsheet from the cosmetics maker from earlier this spreadsheet contains product codes each has a five digit numeric code and then a four character text identifier but let's say we only want to work with one side or the other you can use left or right to give you the specific set of characters or numbers you need we'll practice cleaning up our data using the left function first the syntax of left is equals left open parenthesis the range a comma and the number of characters from the left side of the text string we want then we finish it with a close parenthesis here our project requires just the five digit numeric codes so in a separate column type equals left open parenthesis then add the range our range is a2 then add a comma and then the number five for our five-digit product code finally finish the function with a close parenthesis our function should show up like this press enter and now we have a substring which is the number part of the product code only click and drag this function through the entire column to separate out the rest of the product codes by number only okay now let's say your project only needs the four character text identifier for that we'll use the right function in the next column we'll begin the function the syntax is equals right open parenthesis the range a comma and the number of characters we want then we finish with a close parenthesis let's key that in now equals right open parenthesis and the range is still a2 add a comma and this time we'll tell it that we want four characters from the right close up the parentheses and press enter then drag the function through the entire column
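as a quick sketch here's how those three formulas look in the sheet using the cell references from our walkthrough:

=LEN(A2)
=LEFT(A2, 5)
=RIGHT(A2, 4)

len returns the character count left keeps the five digit numeric code and right keeps the four character text identifier
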
now we can analyze the products in our spreadsheet based on either substring the five digit numeric code or the four character text identifier so hopefully that makes it clear how you can use left and right to extract substrings from the left and right sides of a string now let's learn how you can extract something in between here's where we'll use something called mid mid is a function that gives you a segment from the middle of a text string this cosmetics company lists all of its clients using a client code it's composed of the first three letters of the city where the client's located its state abbreviation and then a three-digit identifier but let's say a data analyst needs to work with just the states in the middle the syntax for mid is equals mid open parenthesis the range then a comma when using mid you always need to supply a reference point in other words you need to set where the function should start after that place another comma and how many middle characters you want in this case our range is d2 let's start the function in a new column type equals mid open parenthesis d2 since the first three characters represent a city name the starting point is the fourth character so add a comma and four we also need to tell the function how many middle characters we want add one more comma and two because the state abbreviations are two characters long press enter and bam we just get the state abbreviation continue the mid function through the rest of the column so we've learned about a few functions that help separate out specific text strings but what if we want to combine them instead for that we'll use concatenate which is a function that joins together two or more text strings the syntax is equals concatenate then an open parenthesis inside indicate each text string you want to join separated by commas then finish the function with a close parenthesis so just for practice let's say we needed to rejoin the left and right text strings back into complete product codes in a new column let's begin our function type equals concatenate then an open parenthesis the first text string we want to join is in h2 then add a comma the second part is in i2 add a close parenthesis and press enter drag it down through the entire column and just like that all of our product codes are back together okay the last function we'll learn about here is trim trim is a function that removes leading trailing and repeated spaces in data sometimes when you import data your cells have extra spaces which can get in the way of your analysis for example if this cosmetics maker wanted to look up a specific client name it won't show up in the search if it has extra spaces you can use trim to fix that problem the syntax for trim is equals trim open parenthesis your range and a close parenthesis so in a separate column type equals trim and an open parenthesis the range is c2 since we want to check out the client names close the parentheses and press enter finally continue the function down the column trim fixed the extra spaces so now we know some very useful functions that can make your data cleaning even more successful this was a lot of information so as always feel free to go back and review the video and then practice on your own we'll continue building on these tools soon and you'll also have a chance to practice pretty soon these data cleaning steps will become second nature like brushing your teeth
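here's a quick sketch of those last three formulas using the cells from our examples:

=MID(D2, 4, 2)
=CONCATENATE(H2, I2)
=TRIM(C2)

mid starts at the fourth character and keeps two concatenate joins the two substrings back into one product code and trim strips the stray spaces
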
hi let's get into it motivational speaker wayne dyer once said if you change the way you look at things the things you look at change this is so true in data analytics no two analytics projects are ever exactly the same so it only makes sense that different projects require us to look at different information in different ways in this video we'll explore different methods that data analysts use to look at data differently and how that leads to more efficient and effective data cleaning some of these methods include sorting and filtering pivot tables a function called vlookup and plotting to find outliers let's start with sorting and filtering as you learned earlier sorting and filtering data helps data analysts customize and organize the information the way they need for a particular project but these tools are also very useful for data cleaning you might remember that sorting involves arranging data into a meaningful order to make it easier to understand analyze and visualize for data cleaning you can use sorting to put things in alphabetical or numerical order so you can easily find a piece of data sorting can also bring duplicate entries closer together for faster identification filters on the other hand are very useful in data cleaning when you want to find a particular piece of information you learned earlier that filtering means showing only the data that meets a specific criteria while hiding the rest this lets you view only the information you need when cleaning data you might use a filter to only find values above a certain number or just even or odd values again this helps you find what you need quickly and separates out the information you want from the rest that way you can be more efficient when cleaning your data another way to change the way you view data is by using pivot tables you've learned that a pivot table is a data summarization tool that is used in data processing pivot tables sort reorganize group count total or average data stored in a database in data cleaning pivot tables are used to give you a quick clutter-free view of your data you can choose to look at the specific parts of a data set you need and get a visual in the form of a pivot table let's create one now using our cosmetics maker's spreadsheet again to start select the data we want to use here we'll choose the entire spreadsheet select data then pivot table choose new sheet and create let's say we're working on a project that requires us to look at only the most profitable products items that earn the cosmetics maker at least ten thousand dollars in orders so the row we'll include is total for total profits we'll sort in descending order to put the most profitable items at the top and we'll show totals next we'll add another row for products so that we know what those numbers are about we can clearly determine that the most profitable products have the product codes 15143exfo and 32729masc we can ignore the rest for this particular project because they fall below 10 000 in orders now we might be able to use context clues to assume we're talking about exfoliants and mascaras but we don't know which ones or if that assumption is even correct so we need to confirm what the product codes correspond to and this brings us to the next tool it's called vlookup vlookup stands for vertical lookup it's a function that searches for a certain value in a column to return a corresponding piece of information when data analysts look up information for a project it's rare for all of the data they need to be in the same place usually you'll have to search across multiple sheets or even different databases
the syntax of vlookup is equals vlookup open parenthesis then the data you want to look up next is a comma and where you want to look for that data in our example this will be the name of a spreadsheet followed by an exclamation point the exclamation point indicates that we're referencing a cell in a different sheet from the one we're currently working in again that's very common in data analytics okay next is the range in the place where you're looking for data indicated using the first and last cell separated by a colon after one more comma is the column in the range containing the value to return next another comma and the word false which means that an exact match is what we're looking for finally complete your function by closing the parentheses to put it simply vlookup searches for the value in the first argument in the leftmost column of the specified location then the value of the third argument tells vlookup to return the value in the same row from the specified column the false tells vlookup that we want an exact match soon you'll learn the difference between exact and approximate matches but for now just know that vlookup takes the value in one cell and searches for a match in another place let's begin we'll type equals vlookup then add the data we want to look for which is the product code the dollar signs make sure that the corresponding part of the reference remains unchanged or locked you can lock just the column just the row or both at the same time next we'll tell it to look at sheet 2 in both columns we add a 2 to represent the second column the last term false says we want an exact match with this information we can now analyze the data for only the most profitable products going back to the two most profitable products we can search for 15143exfo and 32729masc go to edit and then find type in the product codes and search for them now we can learn which products we'll be using for this particular project
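here's a rough sketch of a vlookup like the one in this walkthrough assuming the product code sits in a2 and the lookup table lives in columns a and b of sheet2 (those exact references are just for illustration):

=VLOOKUP(A2, Sheet2!$A:$B, 2, FALSE)

the dollar signs lock the column references the 2 returns the second column of the range and false demands an exact match
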
the final tool we'll talk about is something called plotting when you plot data you put it in a graph chart table or other visual to help you quickly find what it looks like plotting is very useful when trying to identify any skewed data or outliers for example if we wanted to make sure the price of each product is correct we could create a chart this would give us a visual aid that helps us quickly figure out if anything looks like an error so let's select the column with our prices then we'll go to insert and choose chart pick a column chart as the type one of these prices looks extremely low if we look into it we discover that this item has a decimal point in the wrong place it should be seven dollars and 30 cents not 73 cents that would have a big impact on our total profits so it's a good thing we caught that during data cleaning looking at data in new and creative ways helps data analysts identify all kinds of dirty data coming up you'll continue practicing these new concepts so you can get more comfortable with them you'll also learn additional strategies for ensuring your data is clean and will provide you with effective insights great work so far hello so far you learned about a lot of different tools and functions that analysts use to clean up data for analysis now we'll take a step back and talk about some of the really big picture aspects of clean data knowing how to fix specific problems either manually with spreadsheet tools or with functions is extremely valuable but it's also important to think about how your data has moved between systems and how it's evolved along its journey to your data analysis project to do this data analysts use something called data mapping data mapping is the process of matching fields from one database to another this is very important to the success of data migration data integration and lots of other data management activities as you learned earlier different systems store data in different ways for example the state field in one spreadsheet might show maryland spelled out but another spreadsheet might store it as md data mapping helps us note these kinds of differences so we know when data is moved and combined it will be compatible as a quick reminder compatibility describes how well two or more data sets are able to work together so the first step to data mapping is identifying what data needs to be moved this includes the tables and the fields within them we also need to define the desired format for the data once it reaches its destination to figure out how this works let's go back to the merger between our two logistics associations starting with the first data field we'll identify that we need to move both sets of member ids and to define the desired format we'll choose whether to use numbers like this spreadsheet or email addresses like the other spreadsheet next comes mapping the data depending on the schema and number of primary and foreign keys in a data source data mapping can be simple or very complex as a reminder a schema is a way of describing how something is organized a primary key references a column in which each value is unique and a foreign key is a field within a table that is a primary key in another table for more challenging projects there are all kinds of data mapping software programs you can use these data mapping tools will analyze field by field how to move data from one place to another then they automatically clean match inspect and validate the data they also create consistent naming conventions ensuring compatibility when the data is transferred from one source to another when selecting a software program to map your data you want to be sure that it supports the file types you're working with such as excel sql tableau and others later on you'll learn more about selecting the right tool for a particular task for now let's practice mapping data manually first we need to determine the content of each section to make sure the data ends up in the right place for example the data on when memberships expire would be consolidated into a single column this step makes sure that each piece of information ends up in the most appropriate place in the merged data source now you might remember that some of the data was inconsistent between the two organizations like the fact that one uses a separate column for suite apartment or unit number but the other doesn't this brings us to the next step transforming the data into a consistent format this is a great time to use concatenate as you learned before concatenate is a function that joins together two or more text strings which is what we did earlier with our cosmetics company example so we'll insert a new column and then type equals concatenate then the two text strings we want to make one drag that through the entire column and now we have consistency in the new merged association list of member addresses okay now that everything is compatible it's time to transfer the data to its destination there are a lot of different ways to move data from one place to another including querying import wizards and even simple drag and drop
alright here's our merged spreadsheet it looks good but we still want to make sure everything was transferred properly so we'll go into the testing phase of data mapping for this you inspect a sample piece of data to confirm that it's clean and properly formatted it's also a smart practice to do spot checks on things such as the number of nulls for the test you can use a lot of the data cleaning tools we discussed previously such as data validation conditional formatting countif sorting and filtering finally once you've determined that the data is clean and compatible you can start using it for analysis data mapping is so important because even one mistake when merging data can ripple throughout an organization causing the same error to appear again and again this leads to poor results on the other hand data mapping can save the day by giving you a clear roadmap you can follow to make sure your data arrives safely at its destination and that's why you learned how to do it welcome back and great job on that last weekly challenge now that we know the difference between clean and dirty data and some general data cleaning techniques let's focus on data cleaning using sql coming up we'll learn about the different data cleaning functions in spreadsheets and sql and how sql can be used to clean large data sets i'll also show you how to develop some basic search queries for databases and how to apply basic sql functions for transforming data and cleaning strings cleaning your data is the last step in the data analysis process before you can move on to the actual analysis and sql has a lot of great tools that can help you do that but before we start cleaning databases we'll take a closer look at sql and when to use it i'll see you there advertising agencies get money from their clients to advertise their brand right and so these agencies use our products certain google advertising platforms and i help them with how to best use those platforms and the different strategies they can use to be best in class a lot of the folks at the advertising agencies have reports that they have to send out to their clients and these reports take a lot of time to create and visualize so what i do is help the practitioners and the analytics teams use a particular product that enables them to create those reports much faster and much easier if you're going to start off as a data analyst it opens tons of doors because everybody is tracking data is using data needs to use data regardless of industry so anywhere from healthcare to advertising to e-commerce to entertainment anything and everything everybody uses data so everybody needs you as a data analyst sql makes our lives easier when we're analyzing lots of different data it's only somewhat recently that the sql programs we use now can give us instant results for analyzing millions or billions of rows of data years ago maybe about five years ago or so even though we could still analyze those millions of rows we would end up having to wait 15 minutes 30 minutes for the queries to run but now it's instantaneous and so that's really exciting and we can do so much more with that power so sql has helped a lot in my career because it's one of those fundamental things you have to know as a data analyst back in the day not everyone knew sql so knowing sql was definitely a competitive advantage nowadays i would say more and more people maybe most people know it it's a core skill and highly sought after by everybody
so knowing sql as a data analyst makes you quite popular with recruiters i think that's really fun i taught myself sql so my knowledge of sql is something i hold near and dear to my heart since it's something i made for myself and i feel so much satisfaction from it so that's why i really like sql one of the fun things about sql another reason why i really enjoy using it is because when you type something into that query and you hit ctrl enter or once you run the query you get the results almost instantly depending on the platform you use it's fascinating to think conceptually about how much analysis the computer is doing for you based on that little bit of code you wrote it's just so powerful if you think about what's happening behind the scenes so i think that's kind of fun to look at we live in a world of big data and it keeps getting bigger and computing power is also increasing exponentially so the more data we can track the more we need data analysts and so our career prospects are basically skyrocketing i'm sally i'm a measurement and analytical lead at google hello again so before we go over all the ways data analysts use sql to clean data i want to formally introduce you to sql we've talked about sql a lot already you've seen some databases and some basic functions in sql and you've even seen how sql can be used to process data but now let's actually define sql sql stands for structured query language and it's what analysts use to work with databases data analysts usually use sql when working with large data sets because it can handle huge amounts of data and i mean trillions of rows that's a lot of rows to wrap your head around so let me give you an idea about how much data that really is imagine a data set that contains the names of all 8 billion people in the world it would take the average person 101 years to read all 8 billion names sql can process this in seconds personally i think that's pretty cool other tools like spreadsheets might take a really long time to process that much data which is one of the main reasons data analysts choose to use sql when dealing with big data sets let me give you a short history of sql development on sql actually began in the early 70s in 1970 edgar f codd developed the theory about relational databases you might remember learning about relational databases a while back this is a database that contains a series of tables that can be connected to form relationships at the time ibm was using a relational database management system called system r well ibm computer scientists were trying to figure out a way to manipulate and retrieve data from ibm's system r their first query language was hard to use so they quickly moved on to the next version sequel in 1979 after extensive testing sequel now just spelled sql was released publicly by 1986 sql had become the standard language for relational database communication and it still is this is another reason why data analysts choose sql it's a well-known standard within the community the first time i used sql to pull data from a real database was for my first job as a data analyst i didn't have any background knowledge about sql before that i only found out about it because it was a requirement for that job the recruiter for that position gave me a week to learn it so i went online and researched it and ended up teaching myself sql they actually gave me a written test as part of the job application process
i had to write sql queries and functions on a whiteboard but i've been using sql ever since and i really like it and just like i learned sql on my own i wanted to remind you that you can figure things out yourself too there's tons of great online resources for learning so don't let one job requirement stand in your way without doing some research first now that we know a little more about why analysts choose to work with sql when they're handling a lot of data and a little bit about the history of sql we'll move on and learn some practical applications for it coming up next we'll check out some of the tools we learned in spreadsheets and figure out if any of those apply to working in sql spoiler alert they do see you soon hey there so far we've learned about both spreadsheets and sql while there's lots of differences between spreadsheets and sql you'll find some similarities too so let's check out what spreadsheets and sql have in common and how they're different spreadsheets and sql actually have a lot in common specifically there are tools you can use in both spreadsheets and sql to achieve similar results we've already learned about some tools for cleaning data in spreadsheets which means you already know some tools that you can use in sql for example you can still perform arithmetic use formulas and join data when you're using sql so we'll build on the skills we've learned in spreadsheets and use them to do even more complex work in sql here's an example of what i mean by more complex work if we were working with health data for a hospital we'd need to be able to access and process a lot of data we might need demographic data like patients' names birthdays and addresses information about their insurance or past visits public health data or even user generated data to add to their patient records all of this data is being stored in different places maybe even in different formats and each location might have millions of rows and hundreds of related tables this is way too much data to input manually even for just one hospital that's where sql comes in handy instead of having to look at each individual data source and record it in our spreadsheet we can use sql to pull all this information from different locations in our database now let's say we want to find something specific in all this data like how many patients with a certain diagnosis came in today in a spreadsheet we can use the countif function to find that out or we can combine count and where in a sql query to find out how many rows match our search criteria this will give us similar results but works with a much larger and more complex set of data
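just as an illustration a query like that might look something like this the table and field names here are made up for the example:

SELECT COUNT(*)
FROM hospital_data.patient_visits
WHERE diagnosis = 'influenza'
  AND visit_date = '2020-12-01'

count tallies every row that survives the where filter which is roughly what countif does for a range in a spreadsheet
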
next let's talk about how spreadsheets and sql are different first it's important to understand that spreadsheets and sql are different things spreadsheets are generated with a program like excel or google sheets these programs are designed to execute certain built-in functions sql on the other hand is a language that can be used to interact with database programs like oracle mysql or microsoft sql server the differences between the two are mostly in how they're used if a data analyst was given data in the form of a spreadsheet they'll probably do their data cleaning and analysis within that spreadsheet but if they're working with a large data set with more than a million rows or multiple files within a database it's easier faster and more repeatable to use sql sql can access and use a lot more data because it can pull information from different sources in the database automatically unlike spreadsheets which only have access to the data you input this also means the data is stored in multiple places a data analyst might use spreadsheets stored locally on their hard drive or their personal cloud when they're working alone but if they're on a larger team with multiple analysts who need to access and use data stored across a database sql might be a more useful tool because of these differences spreadsheets and sql are used for different things as you already know spreadsheets are good for smaller data sets and when you're working independently plus spreadsheets have built-in functionalities like spell check that can be really handy sql is great for working with larger data sets even trillions of rows of data and because sql has been the standard language for communicating with databases for so long it can be adapted and used for multiple database programs sql also records changes and queries which makes it easy to track changes across your team if you're working collaboratively next we'll learn more queries and functions in sql that will give you some new tools to work with you might even learn how to use spreadsheet tools in brand new ways see you next time hey welcome back so far we've learned that sql has some of the same tools as spreadsheets but on a much larger scale in this video we'll learn some of the most widely used sql queries that you can start using for your own data cleaning and eventual analysis let's get started we've talked about queries as requests you put into the database to ask it to do things for you queries are a big part of using sql it's structured query language after all queries can help you do a lot of things but there are some common ones that data analysts use all the time so let's start there first i'll show you how to use the select query i've called this one out before but now i'll add some new things for us to try out right now the table viewer is blank because we haven't pulled anything from the database yet for this example the store we're working with is hosting a giveaway for customers in certain cities we have a database containing customer information that we can use to narrow down which customers are eligible for the giveaway so let's do that now we can use select to specify exactly what data we want to interact with in a table and if we combine select with from we can pull data from any table in this database as long as we know what the columns and rows are named we might want to pull the data about customer names and cities from one of the tables to do that we can input select name comma city from customer underscore data dot customer underscore address to get this information from the customer underscore address table which lives in the customer underscore data data set so select and from help specify what data we want to extract from the database and use we can also insert new data into a database or update existing data for example maybe we have a new customer that we want to insert into this table we can use the insert into query to put that information in so let's start with where we're trying to insert this data the customer underscore address table we also want to specify which columns we're adding this data to by typing their names in the parentheses that way sql can tell the database exactly where we're inputting new information then we'll tell it what values we're putting in run the query and just like that it added it to our table for us
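written out those two queries look roughly like this the select comes straight from our example while the column names and values in the insert are invented for illustration:

SELECT name, city
FROM customer_data.customer_address

INSERT INTO customer_data.customer_address (customer_id, name, city)
VALUES (2645, 'rachel king', 'boston')

select with from pulls the columns we name and insert into appends a new row with one value for each column listed in the parentheses
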
now let's say we just need to change the address of a customer well we can tell the database to update it for us to do that we need to tell it we're trying to update the customer underscore address table then we need to let it know what value we're trying to change but we also need to tell it where we're making that change specifically so that it doesn't change every address in the table there now this one customer's address has been updated if we want to create a new table for this database we can use the create table if not exists statement keep in mind just running a sql query doesn't actually create a table for the data we extract it just stores it in our local memory to save it we'll need to download it as a spreadsheet or save the results into a new table as a data analyst there are a few situations where you might need to do just that it really depends on what kind of data you're pulling and how often if you're only using a total number of customers you probably don't need a csv file or a new table in your database if you're using the total number of customers per day to do something like track a weekend promotion in a store you might download that data as a csv file so you can visualize it in a spreadsheet but if you're being asked to pull this trend on a regular basis you can create a table that will automatically refresh with the query you've written that way you can directly download the results whenever you need them for a report another good thing to keep in mind if you're creating lots of tables within a database you'll want to use the drop table if exists statement to clean up after yourself it's good housekeeping you probably won't be deleting existing tables very often after all that's the company's data and you don't want to delete important data from their database but you can make sure you're cleaning up the tables you've personally made so that there aren't old or unused tables with redundant information cluttering the database there now you've seen some of the most widely used sql queries in action there's definitely more query keywords for you to learn and unique combinations that will help you work within databases but this is a great place to start coming up we'll learn even more about queries in sql and how to use them to clean our data see you next time
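here's a rough sketch of those statements the specific values and the summary table name are invented for the example and the create table pattern shown is just one common way to save query results:

UPDATE customer_data.customer_address
SET address = '123 main street'
WHERE customer_id = 2645

CREATE TABLE IF NOT EXISTS customer_data.customer_summary AS
SELECT city, COUNT(*) AS customer_count
FROM customer_data.customer_address
GROUP BY city

DROP TABLE IF EXISTS customer_data.customer_summary

the where clause keeps the update from touching every row and the if not exists and if exists checks keep the create and drop statements from failing when the table already exists or is already gone
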
hi i'm evan i'm a learning portfolio manager here at google i don't think of myself as a computer science or super engineering type but i really really like working with numbers so naturally i went into accounting and after about two years of accounting i said wow i really don't want to do all this by hand so i took my first information systems class where they taught me the language sql or sequel and it completely opened up my mind between a working knowledge of spreadsheets where you change one cell and the whole spreadsheet changes because of those amazing calculated fields and sql where i can query billions of rows of data in a matter of seconds i was completely sold on my love for data and i've dedicated my life and my career to just communicating that passion and getting folks excited about the things that they could do with their data why is sql such an amazing first language to pick up well there's so many things that you can do with it i will first caveat and say i am not a computer science major i don't know java and python deep down and i was a little bit apprehensive about learning a computer language it's kind of like a pseudo programming language but in reality you can write your first sql statement as you're going to find out here in just five minutes or less sql honestly is one of those languages that's easy to learn and even more fun to master i've been learning sql for 15 years i've been teaching it for 10 as you're going to see in some of these hands-on labs you'll be working through it's very easy to return data from within a database or a data set just select whatever columns from whichever database you're pulling from and immediately you get the data back now the really fun part is actually teasing it apart and saying hmm i wonder what happens if i change my query add a few more columns filter the data set a different way or share it with my colleagues it's meant to be an interactive querying language and query means asking a question so if i could challenge you with one thing it's that the syntax for picking up sql much like the rules of a chess game is very very easy to learn but the hard part much like with any programming language is actually not writing the syntax it's deciding what question you want to ask of your data so what i would encourage you to do is be super curious about whatever data set you're given spend a lot of time even before you touch your keyboard thinking about what insights you can get from your data and then start having fun there's many different ways to write the same correct sql statement so try one out share with your friends and then start returning that data back for insights good luck it's so great to have you back now that we know some basic sql queries and spent some time working in a database let's apply that knowledge to something else we've been talking about preparing and cleaning data you already know that cleaning and completing your data before you analyze it is an important step so in this video i'll show you some ways sql can help you do just that including how to remove duplicates as well as four functions to help you clean string variables earlier we covered how to remove duplicates in spreadsheets using the remove duplicates tool in sql we can do the same thing by including distinct in our select statement for example let's say the company we work for has a special promotion for customers in ohio we want to get the customer ids of customers who live in ohio but some customer information has been entered multiple times we can get these customer ids by writing select customer underscore id from customer underscore data dot customer underscore address this query will give us duplicates if they exist in the table if customer id 9080 shows up three times in our table our results will have three of that customer id but we don't want that we want a list of all unique customer ids to do that we add distinct to our select statement by writing select distinct customer underscore id from customer underscore data dot customer underscore address now the customer id 9080 will show up only once in our results
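here's that pair of queries written out:

SELECT customer_id
FROM customer_data.customer_address

SELECT DISTINCT customer_id
FROM customer_data.customer_address

the first returns every row including repeats the second collapses repeated customer ids so each one appears only once
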
you might remember we talked before about text strings as a group of characters within a cell commonly composed of letters numbers or both these text strings need to be cleaned sometimes maybe they've been entered differently in different places across your database and now they don't match in those cases you'll need to clean them before you can analyze them so here are some functions you can use in sql to handle string variables you might recognize some of these functions from when we talked about spreadsheets now it's time to see them work in a new way pull up the data set we shared right before this video and you can follow along step by step with me during the rest of this video the first function i want to show you is length which we've encountered before if we already know the length our string variables are supposed to be we can use length to double check that our string variables are consistent in some databases this function is written as len but it does the same thing let's say we're working with the customer underscore address table from our earlier example we can make sure that all country codes have the same length by using length on each of these strings so to write our sql query let's first start with select and from we know our data comes from the customer underscore address table within the customer underscore data data set so we add customer underscore data dot customer underscore address after the from clause then under select we'll write length and then the column we want to check country to remind ourselves what this is we can label this column in our results as letters underscore in underscore country so we add as letters underscore in underscore country after length parenthesis country parenthesis the result we get is a list of the number of letters in each country listed for each of our customers it seems that almost all of them are twos which means the country field contains only two letters but we notice one that has three that's not good we want our data to be consistent so let's check out which countries were incorrectly listed in our table we can do that by putting the length parenthesis country parenthesis function that we created into the where clause because we're telling sql to filter the data to show only customers whose country contains more than two letters so now we'll write select country from customer underscore data dot customer underscore address where length parenthesis country parenthesis is greater than 2 when we run this query we now get the two countries where the number of letters is greater than the two we expect to find the incorrectly listed countries show up as usa instead of us
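typed into the query editor those two steps look like this:

SELECT LENGTH(country) AS letters_in_country
FROM customer_data.customer_address

SELECT country
FROM customer_data.customer_address
WHERE LENGTH(country) > 2

the first query shows the letter count for every row and the second filters down to just the entries longer than the two letters we expect
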
if we had created this table then we could update our table so that this entry shows up as us instead of usa but in this case we didn't create this table so we shouldn't update it we still need to fix this problem so we can pull a list of all the customers in the us including the two that have usa instead of us the good news is that we can account for this error in our results by using the substring function in our sql query to write our sql query let's start by writing the basic structure select from where we know our data is coming from the customer underscore address table in the customer underscore data data set so we type in customer underscore data dot customer underscore address after from next we tell sql what data we want it to give us we want all the customers in the us by their ids so we type in customer underscore id after select finally we want sql to filter out only american customers so we use the substring function after the where clause we're going to use the substring function to pull the first two letters of each country so that all of them are consistent and only contain two letters to use the substring function we first need to tell sql the column where we found this error country then we specify which letter to start with we want sql to pull the first two letters so we're starting with the first letter so we type in one then we need to tell sql how many letters including this first letter to pull since we want the first two letters we need sql to pull two total letters so we type in two this will give us the first two letters of each country we want us only so we'll set this function to equal us when we run this query we get a list of all customer ids of customers whose country is the us including the customers that had usa instead of us going through our results it seems like we have a couple duplicates where the customer id is shown multiple times remember how we get rid of duplicates we add distinct before customer underscore id so now when we run this query we have our final list of customer ids of the customers who live in the us
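here's how that query comes together note that depending on your database the function may be spelled substr or substring:

SELECT DISTINCT customer_id
FROM customer_data.customer_address
WHERE SUBSTR(country, 1, 2) = 'US'

substr grabs two characters starting at position one so both us and usa pass the filter and distinct removes the duplicate customer ids
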
finally let's check out the trim function which you've come across before this is really useful if you find entries with extra spaces and need to eliminate those extra spaces for consistency for example let's check out the state column in our customer underscore address table just like we did for the country column we want to make sure the state column has a consistent number of letters so let's use the length function again to learn if we have any state that has more than two letters since two letters is what we would expect to find in our data table we start writing our sql query by typing the basic sql structure of select from where we're working with the customer underscore address table in the customer underscore data data set so we type in customer underscore data dot customer underscore address after from next we tell sql what we want it to pull we want it to give us any state that has more than two letters so we type in state after select finally we want sql to filter for states that have more than two letters this condition is written in the where clause so we type in length parenthesis state parenthesis and that it must be greater than two because we want the states that have more than two letters we want to figure out what the incorrectly listed states look like if we have any when we run this query we get one result we have one state that has more than two letters but hold on how can this state that seems like it has two letters o and h for ohio have more than two letters we know that there are more than two characters because we used the length parenthesis state parenthesis greater than two statement in the where clause when filtering our results so that means the extra character that sql is counting must be a space there must be a space after the h this is where we would use the trim function the trim function removes those extra spaces so let's write a sql query that accounts for this error let's say we want a list of all customer ids of the customers who live in oh for ohio we start with the basic sql structure select from where we know the data comes from the customer underscore address table in the customer underscore data data set so we type in customer underscore data dot customer underscore address after from next we tell sql what data we want we want sql to give us the customer ids of customers who live in ohio so we type in customer underscore id after select since we know we have some duplicate customer entries we'll go ahead and type in distinct before customer underscore id to remove any duplicate customer ids from appearing in our results finally we want sql to give us the customer ids of the customers who live in ohio we're asking sql to filter the data so this belongs in the where clause here's where we'll use the trim function to use the trim function we tell sql the column we want to remove spaces from which is state in our case and we want only ohio customers so we type in equals oh that's it we have all customer ids of the customers who live in ohio including that customer with the extra space after the h
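written out the ohio query looks like this:

SELECT DISTINCT customer_id
FROM customer_data.customer_address
WHERE TRIM(state) = 'OH'

trim strips the leading and trailing spaces from each state value before the comparison so the entry with a space after the h still matches
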
making sure that your string variables are complete and consistent will save you a lot of time later by avoiding errors or miscalculations that's why we clean data in the first place hopefully functions like length substring and trim will give you the tools you need to start working with string variables in your own data sets next up we'll check out some other ways you can work with strings and more advanced cleaning functions then you'll be ready to start working in sql on your own see you soon hi there and welcome back so far we've gone over some basic sql queries and functions that can help you clean your data we've also checked out some ways you can deal with string variables in sql to make your job easier get ready to learn more functions for dealing with strings in sql trust me these functions will be really helpful in your work as a data analyst in this video we'll check out strings again and learn how to use the cast function to correctly format data when you import data that doesn't already exist in your sql tables the data types from the new data set might not have been imported correctly this is where the cast function comes in handy basically cast can be used to convert anything from one data type to another let's check out an example imagine we're working with lauren's furniture store the owner has been collecting transaction data for the past year but she just discovered that they can't actually organize their data because it hadn't been formatted correctly so we'll help her by converting her data to make it useful again for example let's say we want to sort all purchases by purchase underscore price in descending order that means we want the most expensive purchase to show up first in our results to write the sql query we start with the basic sql structure select from where we know that data is stored in the customer underscore purchase table in the customer underscore data data set so we write customer underscore data dot customer underscore purchase after from next we tell sql what data to give us in the select clause we want to see the purchase underscore price data so we type purchase underscore price after select next is the where clause we are not filtering out any data since we want all purchase prices shown so we can take out the where clause finally to sort the purchase underscore price in descending order we type order by purchase underscore price desc at the end of our query let's run this query we see that 89.85 shows up at the top with 799.99 below it but we know that 799.99 is a bigger number than 89.85 the database doesn't recognize that these are numbers so it didn't sort them that way if we go back to the customer underscore purchase table and take a look at its schema we can see what data type the database thinks purchase underscore price is it says here the database thinks purchase underscore price is a string when in fact it is a float which is a number that contains a decimal that is why 89.85 shows up before 799.99 when we sort letters we start from the first letter before moving on to the second letter so if we want to sort the words apple and orange in descending order we start with the first letters a and o since o comes after a orange will show up first then apple the database did the same with 89.85 and 799.99 it started with the first character which in this case was 8 and 7 respectively since 8 is bigger than 7 the database sorted 89.85 first and then 799.99 because it treated these as text the database doesn't recognize these strings as floats because they haven't been typecast to match that data type yet typecasting means converting data from one type to another which is what we'll do with the cast function we use the cast function to replace purchase underscore price with a new purchase underscore price that the database recognizes as a float instead of a string we start by replacing purchase underscore price with cast then we tell sql the field we want to change which is the purchase underscore price field next is the data type we want to change purchase underscore price to which is the float data type we also need to sort by this new field so we change purchase underscore price after order by to cast parenthesis purchase underscore price as float64 parenthesis
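the full query looks like this float64 is the float type in bigquery other databases may call it float or decimal:

SELECT purchase_price
FROM customer_data.customer_purchase
ORDER BY CAST(purchase_price AS FLOAT64) DESC

cast converts each purchase underscore price string to a number before sorting so 799.99 now correctly lands above 89.85
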
this is how we use the cast function to allow sql to recognize the purchase underscore price column as floats instead of text strings now we can sort our purchases by purchase underscore price and just like that lauren's furniture store has data that can actually be used for analysis as a data analyst you'll be asked to locate and organize data a lot which is why you want to make sure you convert between data types early on businesses like our furniture store are interested in timely sales data and you need to be able to account for that in your analysis the cast function can be used to change strings into other data types too like date and time as a data analyst you may find yourself using data from various sources part of your job is to make sure the data from those sources is recognizable and usable in your database so that you won't run into any issues with your analysis and now you know how to do that the cast function is one great tool you can use when you're cleaning data and coming up we'll cover some other advanced functions that you can add to your toolbox see you soon hey there great to see you again so far we've seen some sql functions in action in this video we'll go over more uses for cast and then learn about concat and coalesce let's get started so earlier we talked about the cast function which lets us typecast text strings into floats i called out that the cast function can be used to change data into other data types too so let's check out another example of how you can use cast in your own data work we've got the transaction data we were working with from our lauren's furniture store example but now we'll check out the purchase date field the furniture store owner has asked us to look at purchases that occurred during their sales promotion period in december let's write a sql query that will pull date and purchase underscore price for all purchases that occurred between december 1st 2020 and december 31st 2020 we start by writing the basic sql structure select from where we know the data comes from the customer underscore purchase table in the customer underscore data data set so we write customer underscore data dot customer underscore purchase after from next we tell sql what data to pull since we want date and purchase underscore price we add them into the select statement finally we want sql to filter for purchases that occurred in december only so we type date between 2020-12-01 and 2020-12-31 in the where clause let's run the query four purchases occurred in december but the date field looks odd that's because the database recognizes the date field as datetime which consists of the date and time our sql query still works correctly even if the date field is datetime instead of date but we can tell sql to convert the date field into the date data type so we see just the date and not the time to do that we use the cast function again so we'll use the cast function to replace the date field in our select statement with a new date field that will show the date and not the time we can do that by typing cast and adding date as the field we want to change then we tell sql the data type we want instead which is the date data type now we have cleaner results for purchases that occurred during the december sales period cast is a super useful function for cleaning and sorting data which is why i wanted you to see it in action one more time next up let's check out the concat function
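here's a sketch of that december query with the cast applied in the select clause:

SELECT CAST(date AS DATE) AS purchase_date, purchase_price
FROM customer_data.customer_purchase
WHERE date BETWEEN '2020-12-01' AND '2020-12-31'

casting the datetime field to date drops the time portion so the results show just the calendar dates
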
Next up, let's check out the CONCAT function. CONCAT lets you add strings together to create new text strings that can be used as unique keys. Going back to our customer_purchase table, we see that the furniture store sells different colors of the same product. The owner wants to know whether customers prefer certain colors, so the owner can manage store inventory accordingly. The problem is that the product_code is the same regardless of the product color, so we need to find another way to separate products by color and tell whether customers prefer one color over the others. We'll use CONCAT to produce a unique key that'll help us tell the products apart by color and count them more easily. Let's write our SQL query, starting with the basic structure: SELECT, FROM, WHERE. We know our data comes from the customer_purchase table in the customer_data dataset, so we type customer_data.customer_purchase after FROM. Next, we tell SQL what data to pull. We use the CONCAT function here to get that unique key of product and color, so we type CONCAT, then the first column we want, product_code, and the other column we want, product_color. Finally, let's say we want to look at couches, so we filter for couches by typing product = 'couch' in the WHERE clause. Now we can count how many times each couch was purchased and figure out whether customers preferred one color over the others. With CONCAT, the furniture store can find out which color couches are the most popular and order more.

I've got one last advanced function to show you. COALESCE can be used to return non-null values in a list. Null values are missing values: if a field in your table is optional, it'll hold null in the rows that don't have an appropriate value. Let's open the customer_purchase table so I can show you what I mean. In the customer_purchase table, we can see a couple of rows where product information is missing — that's why we see nulls there. But for the rows where the product name is null, there is product_code data we can use instead. We'd prefer SQL to show us the product name, like bed or couch, because it's easier to read, but if the product name doesn't exist, we can tell SQL to give us the product_code instead. That's where the COALESCE function comes into play. Let's say we want a list of all products that were sold, and we want to use the product name column to understand what kind of product was sold. We write our query with the basic SQL structure: SELECT, FROM, WHERE. We know our data comes from the customer_purchase table in the customer_data dataset, so we type customer_data.customer_purchase after FROM. Next, we tell SQL the data we want: a list of product names, but if a name isn't available, the product code instead. Here's where we type COALESCE, then tell SQL which column to check first, product, and which column to check second if the first column is null, product_code. We'll name this new field product_info. Finally, since we're not filtering out any data, we can take out the WHERE clause. This gives us product information for each purchase — a list of all products that were sold, ready for the owner to review. COALESCE can save you time when you're making calculations, too, by skipping any null values and keeping your math correct. Those are just some of the advanced functions you can use to clean your data and get it ready for the next step in the analysis process. You'll discover more as you continue working in SQL.
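As with CAST, here's a hedged sketch of how these two queries might look. The table and column names match the walkthrough; treat the exact syntax as illustrative rather than the course's verbatim solution:

    -- Build a product + color key so couch colors can be counted
    SELECT CONCAT(product_code, product_color) AS product_color_key
    FROM customer_data.customer_purchase
    WHERE product = 'couch';

    -- Prefer the readable product name; fall back to product_code when it is null
    SELECT COALESCE(product, product_code) AS product_info
    FROM customer_data.customer_purchase;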
But that's the end of this video and this module. Great work — we've covered a lot of ground! You learned about different data cleaning functions in spreadsheets and SQL, and the benefits of using SQL to deal with large datasets. We also added some SQL formulas and functions to your toolkit, and most importantly, we got to experience some of the ways SQL can help you get data ready for your analysis. After this, you'll spend some time learning how to verify and report your cleaning results, so that your data is squeaky clean and your stakeholders know it. But before that, you've got another weekly challenge to tackle. You've got this! Some of these concepts might seem challenging at first, but they'll become second nature as you progress in your career; it just takes time and practice. Speaking of practice, feel free to go back to any of these videos and re-watch, or even try some of these commands on your own. Good luck, and I'll see you again when you're ready.

Hi there, great to have you back. You've been learning a lot about the importance of clean data, and you've explored some tools and strategies to help you throughout the cleaning process. In these videos, we'll cover the next step in the process: verifying and reporting on the integrity of your clean data. Verification is a process to confirm that a data cleaning effort was well executed and the resulting data is accurate and reliable. It involves rechecking your clean dataset, doing some manual cleanups if needed, and taking a moment to sit back and really think about the original purpose of the project. That way, you can be confident that the data you collected is credible and appropriate for your purposes. Making sure your data is properly verified is so important because it lets you double-check that the work you did to clean up your data was thorough and accurate. For example, you might have referenced an incorrect cell phone number or accidentally keyed in a typo. Verification lets you catch mistakes before you begin analysis. Without it, any insights you gain from analysis can't be trusted for decision-making; you might even risk misrepresenting populations or damaging the outcome of a product you're actually trying to improve. I remember working on a project where I thought the data I had was sparkling clean, because I'd used all the right tools and processes. But when I went through the steps to verify the data's integrity, I discovered a semicolon I had forgotten to remove. Sounds like a really tiny error, I know, but if I hadn't caught that semicolon during verification and removed it, it would have led to some big changes in my results, and that, of course, could have led to different business decisions. So there's an example of why verification is so crucial. But that's not all: the other big part of the verification process is reporting on your efforts. Open communication is a lifeline for any data analytics project, and reports are a super effective way to show your team that you're being 100% transparent about your data cleaning. Reporting is also a great opportunity to show stakeholders that you're accountable, build trust with your team, and make sure you're all on the same page about important project details. Coming up, you'll learn different strategies for reporting, like creating data cleaning reports, documenting your cleaning process, and using something called a change log. A change log is a file containing a chronologically ordered list of modifications made to a project. It's usually organized by version, and includes the date followed by a list of added,
improved, and removed features. Change logs are very useful for keeping track of how a dataset evolved over the course of a project. They're also another great way to communicate and report on data to others. Along the way, you'll see some examples of how verification and reporting can help you avoid repeating mistakes, and save you and your team time. Ready to get started? Let's go!

In this video, we'll discuss how to begin the process of verifying your data cleaning efforts. Verification is a critical part of any analysis project; without it, you have no way of knowing that your insights can be relied on for data-driven decision-making. Think of verification as a stamp of approval. To refresh your memory, verification is a process to confirm that a data cleaning effort was well executed and the resulting data is accurate and reliable. It also involves manually cleaning data to compare your expectations with what's actually present. The first step in the verification process is going back to your original, unclean dataset and comparing it to what you have now. Review the dirty data and try to identify any common problems. For example, maybe you had a lot of nulls; in that case, you'd check your clean data to ensure no nulls are present. To do that, you could search through the data manually or use tools like conditional formatting or filters. Or maybe there was a common misspelling, like someone keying in the name of a product incorrectly over and over again; in that case, you'd run a find in your clean data to make sure no instances of the misspelled word occur. Another key part of verification involves taking a big-picture view of your project. This is an opportunity to confirm that you're actually focusing on the business problem you need to solve and the overall project goals, and to make sure your data is actually capable of solving that problem and achieving those goals. It's important to take the time to reset and focus on the big picture, because projects can sometimes evolve or transform over time without us even realizing it. Maybe an e-commerce company decides to survey a thousand customers to get information that would be used to improve a product, but as responses begin coming in, the analysts notice a lot of comments about how unhappy customers are with the e-commerce website platform altogether, so the analysts start to focus on that. While the customer buying experience is, of course, important for any e-commerce business, it wasn't the original objective of the project. The analysts in this case need to take a moment to pause, refocus, and get back to solving the original problem. Taking a big-picture view of your project involves doing three things. First, consider the business problem you're trying to solve with the data. If you've lost sight of the problem, you have no way of knowing what data belongs in your analysis; taking a problem-first approach to analytics is essential at all stages of any project. You need to be certain that your data will actually make it possible to solve your business problem. Second, you need to consider the goal of the project. It's not enough just to know that your company wants to analyze customer feedback about a product; what you really need to know is that the goal of getting this feedback is to make improvements to that product. On top of that, you also need to know whether the data you've collected and cleaned will actually help your company achieve that goal. And third, you need to consider whether your data is capable of solving the problem and meeting the project objectives. That means thinking about where the data came from and testing your data collection and cleaning processes.
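As a quick illustration of that first verification step — confirming that a known problem is really gone — here's a minimal sketch, assuming the same customer_data.customer_purchase table and product column from the earlier examples. The idea is that a count of zero confirms the nulls were handled:

    -- Verify that no nulls remain in the product column after cleaning
    SELECT COUNT(*) AS remaining_nulls
    FROM customer_data.customer_purchase
    WHERE product IS NULL;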
Sometimes data analysts can be too familiar with their own data, which makes it easier to miss something or make assumptions. Asking a teammate to review your data from a fresh perspective, and getting feedback from others, is very valuable at this stage. This is also the time to notice if anything sticks out as suspicious or potentially problematic in your data. Again: step back, take a big-picture view, and ask yourself, do the numbers make sense? Let's go back to our e-commerce company example. Imagine an analyst is reviewing the cleaned-up data from the customer satisfaction survey. The survey was originally sent to 1,000 customers, but what if the analyst discovers there are more than a thousand responses in the data? This could mean that some customers figured out a way to take the survey more than once, or it could mean that something went wrong in the data cleaning process and a field was duplicated. Either way, it's a signal that it's time to go back to the data cleaning process and correct the problem. Verifying your data ensures that the insights you gain from analysis can be trusted. It's an essential part of data cleaning that helps companies avoid big mistakes — another place where data analysts can save the day. Coming up, we'll go through the next steps in the data cleaning process. See you there!

Hey there. In this video, we'll continue building on the verification process. As a quick reminder, the goal is to ensure that our data cleaning work was done properly and the results can be counted on. You want your data to be verified so you know it's 100% ready to go — sort of like car companies running tons of tests to make sure a car is safe before it hits the road. You learned that the first step in verification is returning to your original, unclean dataset and comparing it to what you have now. This is an opportunity to search for common problems. After that, you clean up the problems manually, for example by eliminating extra spaces or removing an unwanted quotation mark. But there are also some great tools for fixing common errors automatically, such as TRIM and remove duplicates. Earlier you learned that TRIM is a function that removes leading, trailing, and repeated spaces in data, and remove duplicates is a tool that automatically searches for and eliminates duplicate entries from a spreadsheet. Sometimes, though, an error shows up repeatedly and can't be resolved with a quick manual edit or a tool that fixes the problem automatically. In these cases, it's helpful to create a pivot table. A pivot table is a data summarization tool used in data processing; pivot tables sort, reorganize, group, count, total, or average data stored in a database. We'll practice that now using the spreadsheet from a party supply store. Let's say this company was interested in learning which of its four suppliers is most cost-effective, so an analyst pulled this data on the products the business sells: how many were purchased, which supplier provides them, the cost of the products, and the ultimate revenue. The data's been cleaned, but during verification we notice that one of the supplier names was keyed in incorrectly. We could just correct the word to Plus, but that might not solve the problem, because we don't know whether this was a one-time occurrence or the error is repeated throughout the spreadsheet. There are two ways to answer that question. The first is using find and replace.
Find and replace is a tool that looks for a specified search term in a spreadsheet and allows you to replace it with something else. We'll choose Edit, then Find and replace. We're trying to find P-L-O-S, the misspelling of Plus in the supplier's name. In some cases you might not want to replace the data — you just want to find something. No problem: type the search term, leave the rest of the options at their defaults, and click Done. Right now, though, we do want to replace it with Plus, so we'll type that in here, then click Replace all, and Done. There we go — our misspelling has been corrected. That was, of course, the goal, but for now let's undo our find and replace so we can practice another way to determine whether errors are repeated throughout a dataset: the pivot table. We'll begin by selecting the data we want to use — choose column C — then select Data, then Pivot table, choose New sheet, and Create. We know this company has four suppliers, so if we count the suppliers and the number doesn't equal four, we know there's a problem. First, add a row for suppliers. Next, we'll add a value for our suppliers and summarize by COUNTA. COUNTA counts the total number of values within a specified range; here, we're counting the number of times a supplier's name appears in column C. Note that there's also a function called COUNT, which counts only the numerical values within a specified range. If we used it here, the result would be zero — not what we have in mind — but in other situations, COUNT would give us exactly the information we want. As you continue learning more about formulas and functions, you'll discover more interesting options; if you want to keep learning, search online for spreadsheet formulas and functions — there's a lot of great information out there. Okay, so our pivot table has counted the number of misspellings, and it clearly shows that the error occurs just once; otherwise, our four suppliers are accurately accounted for in our data. Now we can correct the spelling, and we've verified that the rest of the supplier data is clean.

This is also useful practice when querying a database. If you're working in SQL, you can address misspellings using a CASE statement. The CASE statement goes through one or more conditions and returns a value as soon as a condition is met. Let's discuss how this works in real life, using our customer_name table. Check out how our customer Tony Magnolia shows up as both Tony and Tnoy — Tony's name was misspelled. Let's say we want a list of our customer IDs and the customers' first names so we can write personalized notes thanking each customer for their purchase. We don't want Tony's note to be addressed incorrectly to Tnoy, so here's where we can use the CASE statement. We'll start our query with the basic SQL structure: SELECT, FROM, WHERE. We know the data comes from the customer_name table in the customer_data dataset, so we add customer_data.customer_name after FROM. Next, we tell SQL what data to pull in the SELECT clause: we want customer_id and first_name. We can go ahead and add customer_id after SELECT, but for our customers' first names, we know Tony was misspelled, so we'll correct that using CASE. We add CASE, then WHEN, and type first_name = 'Tnoy'. Next, we use the THEN command and type 'Tony', followed by the ELSE command, where we type first_name, followed by END AS, and then cleaned_name. Finally, we're not filtering our data, so we can eliminate the WHERE clause. As I mentioned, a CASE statement can cover multiple cases; if we wanted to search for a few more misspelled names, our statement would look similar to the original, with some additional conditions, like this:
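(A minimal sketch: 'Tnoy' comes from the example itself, while the second misspelling, 'Tonny', is a hypothetical placeholder added for illustration.)

    -- Standardize misspelled first names into a cleaned_name field
    SELECT
      customer_id,
      CASE
        WHEN first_name = 'Tnoy' THEN 'Tony'
        WHEN first_name = 'Tonny' THEN 'Tony'  -- hypothetical extra misspelling
        ELSE first_name
      END AS cleaned_name
    FROM customer_data.customer_name;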
And there you go! Now that you've learned how you can use spreadsheets and SQL to fix errors automatically, we'll explore how to keep track of our changes next.

Hi again. Now that you've learned how to make your data squeaky clean, it's time to address all the dirt you've left behind. When you clean your data, all the incorrect or outdated information is gone, leaving you with the highest-quality content. But all those changes you made to the data are valuable, too. In this video, we'll discuss why keeping track of changes is important to every data project, and how to document all your cleaning changes to make sure everyone stays informed. This involves documentation, which is the process of tracking the changes, additions, deletions, and errors involved in your data cleaning effort. You can think of it like a crime TV show: evidence is found at the scene and passed on to the forensics team, who analyze every inch of the scene and document every step so they can tell a story with the evidence. A lot of times, the forensic scientist is called to court to testify about that evidence, and they have a detailed report to refer to. The same thing applies to data cleaning: data errors are the crime, data cleaning is gathering the evidence, and documentation is detailing exactly what happened, for peer review or for court. Having a record of how a dataset evolved does three very important things. First, it lets us recover from data cleaning errors. Instead of scratching our heads trying to remember what we might have done three months ago, we have a cheat sheet to rely on if we come across the same errors again later. It's also a good idea to create a clean table rather than overwriting your existing table; that way, you still have the original data in case you need to redo the cleaning. Second, documentation gives you a way to inform other users of the changes you've made, and if you ever go on vacation or get promoted, the analyst who takes over for you will have a reference sheet to check in with. Third, documentation helps you determine the quality of the data to be used in analysis. The first two benefits assume the errors aren't fixable, but if they are, a record gives the data engineer more information to refer to. It's also a great warning for ourselves that a dataset is full of errors and should be avoided in the future; if the errors were time-consuming to fix, it might be better to check out alternative datasets instead. Data analysts usually use a change log to access this information. As a reminder, a change log is a file containing a chronologically ordered list of modifications made to a project. You can use and view a change log in spreadsheets and in SQL to achieve similar results. Let's start with the spreadsheet. We can use Sheets' version history, which provides a real-time tracker of all the changes and who made them, from individual cells to the entire worksheet. To find this feature, click the File tab and then select Version history. In the right panel, choose an earlier version; we can see who edited the file, with the changes they made shown in the color next to their name. To return to the current version, go to the top left and click Back. If you want to check out changes in a specific cell, right-click it and select Show edit history. Also, if you want others to be able to browse the sheet's version history, you'll need to assign them permission.
Okay, now let's switch gears and talk about SQL. The way you create and view a change log with SQL depends on the software program you're using; some companies even have their own separate software that keeps track of change logs and important SQL queries. This gets pretty advanced, but essentially, all you have to do is specify exactly what you did and why when you commit a query to the repository as a new and improved query. This lets the company revert to a previous version if something you've done crashes the system — which has happened to me before! Another option is to just add comments as you go while you're cleaning data in SQL; this will help you construct your change log after the fact. For now, we'll check out BigQuery's query history, which tracks all the queries you've run. You can click on any of them to revert to a previous version of your query, or to bring up an older version and find what you've changed. Here's what we've got: I'm in the Query history tab, and listed on the bottom right are all the queries I've run, by date and time. You can click the icon to the right of each individual query to bring it up in the query editor. Change logs like these are a great way to keep yourself on track, and they let your team get real-time updates when they want them. But there's another way to keep the communication flowing, and that's reporting. Stick around, and you'll learn some easy ways to share your documentation — and maybe impress your stakeholders in the process. See you in the next video!

Great, you're back. Let's set the stage: the crime is dirty data. We've gathered the evidence; it's been cleaned, verified, and cleaned again. Now it's time to present our evidence. We'll retrace the steps and present our case to our peers. As we discussed earlier, data cleaning, verifying, and reporting is a lot like a crime drama, and now it's our day in court. Just like a forensic scientist testifies on the stand about the evidence, data analysts are counted on to present their findings after a data cleaning effort. Earlier, we learned how to document and track every step of the data cleaning process, which means we have solid information to pull from. As a quick refresher, documentation is the process of tracking the changes, additions, deletions, and errors involved in a data cleaning effort. Change logs are a good example of this: because a change log is saved chronologically, it provides a real-time account of every modification. Documenting will be a huge time-saver for you as a future data analyst — basically a cheat sheet you can refer to if you're working with a similar dataset or need to address similar errors. While your team can view change logs directly, stakeholders can't, and they have to rely on your report to know what you did. Let's check out how we might document our data cleaning process, using an example we worked with earlier. In that example, we found that an association had two instances of the same $500 membership in its database, and we decided to fix this manually by deleting the duplicate info. There are plenty of ways we could go about documenting what we did. One common way is to just create a doc listing out the steps we took and the impact they had. For example, first on your list would be that you removed the duplicate instance, which decreased the number of rows from 33 to 32 and lowered the membership total by $500. If we were working with SQL, we could include a comment in the statement describing the reason for a change, without affecting the execution of the statement — something a bit more advanced that we'll talk about later.
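To make that concrete, here's a minimal, hypothetical sketch of what such a commented cleanup statement could look like. The date, table, and column names below are placeholders for illustration, not the course's actual schema:

    -- 2020-12-15: removed the duplicate $500 membership row found during
    -- verification (row count 33 -> 32; membership total reduced by $500)
    DELETE FROM membership_data.memberships  -- hypothetical table name
    WHERE row_id = 33;                       -- hypothetical id of the duplicate row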
Regardless of how we capture and share our change logs, we're setting ourselves up for success by being 100% transparent about our data cleaning. This keeps everyone on the same page and shows project stakeholders that we're accountable for effective processes. In other words, it helps build our credibility as witnesses who can be trusted to present all the evidence accurately. During testimony for dirty data, it's an open-and-shut case.

Welcome back. By now, it's safe to say that verifying, documenting, and reporting are valuable steps in the data cleaning process. You have proof to give stakeholders that your data is accurate and reliable, and that the effort to attain it was well executed and documented. The next step is getting feedback about the evidence and using it for good, which we'll cover in this video. Clean data is important to the task at hand, but the data cleaning process itself can reveal insights that are helpful to a business. The feedback we get when we report on our cleaning can transform data collection processes and, ultimately, business development. For example, one of the biggest challenges of working with data is dealing with errors. Some of the most common errors involve human mistakes, like mistyping or misspelling; flawed processes, like the poor design of a survey form; and system issues, where older systems integrate data incorrectly. Whatever the reason, data cleaning can shine a light on the nature and severity of error-generating processes. With consistent documentation and reporting, we can uncover error patterns in data collection and entry procedures, and use the feedback we get to make sure common errors aren't repeated. Maybe we need to reprogram the way the data is collected, or change specific questions on the survey form. In more extreme cases, the feedback we get can even send us back to the drawing board to rethink expectations and possibly update quality control procedures. For example, sometimes it's useful to schedule a meeting with the data engineer or data owner to make sure the data is brought in properly and doesn't require constant cleaning. Once errors have been identified and addressed, stakeholders have data they can trust for decision-making, and by reducing errors and inefficiencies in data collection, the company just might discover big increases to its bottom line. Congratulations — you now have the foundation you need to successfully verify and report on your cleaning results. Stay tuned to keep building on your new skills!

Hey there, thanks for stopping by once again. Earlier, we checked out some potential career paths that might open up for you once you complete the program, and you might also have explored the advantages of networking and building an online presence. I want to tell you: just by being here now, you've shown you're committed, and you're taking a big step toward your future career. Coming up, we'll spend some time building your resume. You might already have a resume that you've used or been saving, and that's great — there's a good chance you'll still be able to use it, even if you're planning to switch careers. Together, we'll find out what kinds of changes to your resume you might want to make. But before that, we'll figure out what the whole application process is like. Then we'll explore the best way to write or adjust your resume to make it as professional-looking as possible and ready for your role as a data analyst. We'll also take a peek at some examples of other resumes. After that, we'll have you do a little self-analysis as we review the different types of data analyst jobs out there,
so you can think about which ones might be best for you. While I'm definitely not a career counselor, we can still think of this as a kind of career counseling session: you'll get a better idea of how to build your resume while thinking about your bigger career picture at the same time. So let's get started!

Hi again. Right now seems like the perfect time to take a step back from learning about data analytics so you can get excited about what comes after you're done here. The road to finding a job can be challenging, but you're building up your skill set and learning what it takes to be a data analyst. In this video, we'll cover what you can expect from your job search, plus some tips for using your newfound skills and knowledge to make your search easier. I remember when I first started out: I reached out to as many people as I could to learn about their career paths, their companies, and their roles. I wanted to get a good idea of what to expect, and that's what we're doing now — giving you an idea of what to expect during your own job search. It's important to remember that everyone's search will be different. It might depend on where you live, what your interests are within the field, and personal preferences, like the type of work environment you feel comfortable in. This is all part of making this journey your own as you hunt for a job that's perfect for you. The most common way to start is by checking out available jobs. There are a lot of job sites built specifically for people seeking employment, and you can also go to company websites, where they usually post job listings too. These sites might even have an option to send you an alert when a role matching your search becomes available. Once you find a few that you like, do some research to learn more about the companies and the details of the specific positions you'll be applying for. Then you can update your resume or create a new one. You'll want it to be specific and reflect what each company is looking for, but you can definitely have a master resume that you tweak for each position. It can also help to create a spreadsheet with all of your experiences and accomplishments, to help you decide what to include in your resume for each position. If you're using a professional networking site like LinkedIn, you might already have connections who can help you with your job search. Maybe you know someone who can write a referral for you, or who knows of a job within their company that would suit you. And even if you don't have any luck with your connections, you can reach out to employees of the companies you're interested in; they might be able to give you some insight into the best ways to highlight your skills and experience when applying. And it's okay if they don't write back — keep trying. This is probably a good time to tell you about the most challenging part of a job search: hearing the word no. You'll probably hear it a lot, and that's 100% okay. It's part of everyone's experience, especially when switching career paths. People you reach out to might not be able to help you, companies you'd love to work for might not have any openings, and jobs you apply for might be filled by someone else. That's all part of the process. The key is to stay focused, don't get discouraged, and above all else, believe in yourself. Okay, speech over — but don't forget it, or I'll be forced to give more speeches! So, back to your search. If the company you're applying to is interested, your first point of contact might be a recruiter. A recruiter might also reach out to you based on their own
research; they may find your professional profile online and think you're a good match for a position. Speaking of which, that's another reason to keep building and refreshing your online profile. Recruiters are there to make sure you're a legitimate candidate for the job that's been posted, so when you talk with a recruiter — whether on the phone, online, or in person — be professional and personable. It's natural to feel nervous here, so it can help to refer back to your resume to wow them with your knowledge of the data analytics industry. And remember, recruiters are looking for someone, and they're hoping it'll be you. Here's another tip: using technical terms like SQL and clean data will show recruiters that you know what you're doing. Recruiters probably won't go into too much detail about the ins and outs of the job, but they want to see that you know what you're talking about. They might also give you prep materials or other recommendations — take advantage of these, because recruiters want you to do well. Next up is usually the hiring manager. This is the most important step. The hiring manager's job is to evaluate whether you have the ability to do the work and whether you'd be a good fit for their team; your job is to convince them that yes, you do, and yes, you would. A good thing you can do here is use LinkedIn or other professional sites to research the hiring managers, or even other analysts who have a role similar to the one you're applying for. The more information you have about the job, the better your chances of actually getting it. You should also use this opportunity to ask lots of questions, to help you figure out whether the company's a good fit for you — you can do this when you talk to recruiters, too. Now, if the hiring manager sees you as a fit, it's very possible you'll have at least one more interview. The point of these interviews is to give your future stakeholders and teammates a chance to decide whether you're the best candidate for the position. The next step is the best step: if all goes well, you'll get an official offer, usually by phone first, maybe followed by an official letter. At this point, feel free to celebrate — call everyone, and celebrate some more! But even if it's your dream job, make sure it's a competitive offer before you sign. Remember, if they reach out to you with an offer, that means they want you as much as you want them. If you're interviewing at other places, you can leverage this to figure out whether negotiating a more competitive offer is possible. You should also research salaries, benefits, vacation time, and any other factors that are important to you for similar jobs. If you can show specific research — like, company X gives Y amount more for the same role — there's usually some room to negotiate your salary, vacation days, or something else. Keep in mind, you'll need to find a balance between what you want, what they want to give you, and what's fair. Know your own worth, but also understand that the company hiring you has already placed a certain value on your role. Okay, let's say everything works out: you're happy with the negotiated deal and excited to join your new team. Even then, hit pause and give yourself at least two weeks before you officially start. Why? Well, if you're already employed somewhere else during your job search, it's customary and polite to give at least two weeks' notice at your old job before starting at the new one. Plus, it's good to give yourself a break before starting your exciting new adventure — you've earned it. By now, you should have a pretty good idea
of what to expect when you start your data analyst job search. Coming up, we'll talk more about building your resume. See you in the next video!

Great, you're back. When you take a picture, you usually try to capture lots of different things in one image. Maybe you're taking a picture of the sunset and want to capture the clouds, the tree line, and the mountains — basically, you want a snapshot of that entire moment. You can think of building your resume in the same way: you want your resume to be a snapshot of all that you've done, both in school and professionally. In this video, we'll go through the process of building a resume, which you'll be able to add your own details to. Keep in mind, this is a snapshot, so when managers and recruiters look at what you've included in your resume, they should be able to tell right away what you can offer their company. The key here is to be brief. Try to keep everything to one page, and each description to just a few bullet points — two to four bullet points is enough, and remember to keep your bullet points concise. Sticking to one page will help you stay focused on the details that best reflect who you are, or who you want to be, professionally. One page might also be all that hiring managers and recruiters have time to look at; they're busy people, so you want to get their attention with your resume as quickly as possible. Now let's talk about actually building your resume. This is where templates come in. They're a great way to build a brand-new resume or reformat one you already have. Programs like Microsoft Word or Google Docs, and even some job search websites, all have templates you can use. A template has placeholders for the information you'll need to enter, and its own design elements to make your resume look inviting. You'll have a chance to explore this option a little later. For now, we'll go through the steps you can take to make your resume professional, easy to read, and error-free. If you already have a resume document, you can use these steps to tweak it. Now, there's more than one way to build a resume, but most have contact information at the top of the document: your name, address, phone number, and email address. If you have multiple email addresses or phone numbers, use the ones that are most reliable and sound professional. It's also great if you can use your first and last name in your email address, like janedoe17@email.com. You should also make sure your contact information matches the details you've included on professional websites. And while most resumes have contact information in the same place, it's up to you how you organize that info. A format that focuses more on skills and qualifications and less on work history is great for people who have gaps in their work history; it's also good for those who are just starting out in their career or making a career change — and that might be you. If you do want to highlight your work history, feel free to include details of your work experience, starting with your most recent job. If you have lots of jobs related to a new position you're applying for, this format makes sense. If you're editing a resume you already have, you can keep it in the same format and adjust the details; if you're starting a new one, or building a resume for the first time, choose the format that makes the most sense for you. There are lots of resume resources online, and you should browse through a bunch of different resumes to get an idea of the formats you think work best for you. Once you've decided on your format, you can start adding your details.
Some resumes begin with a summary, but this is optional. A summary can be helpful if you have experience that isn't traditional for a data analyst, or if you're making a career transition. If you decide to include a summary, keep it to one or two sentences that highlight your strengths and how you can help the company you're applying to. You'll also want to make sure your summary includes positive words about yourself, like dedicated and proactive, and you can support those words with data — like the number of years you've worked or the tools you're experienced in, like SQL and spreadsheets. A summary might start off with something like: hard-working customer service representative with over five years of experience. And once you've completed this program and have your certificate, you'll be able to include that too, which could sound like this: entry-level data analytics professional, recently completed the Google Data Analytics Professional Certificate. Sounds pretty good, doesn't it? Another option is leaving a placeholder for your summary while you build the rest of your resume, and then writing it after you finish the other sections. This way, you can review the skills and experience you've mentioned and grab two or three of the highlights to use in your summary. It's also good to note that your summary might change a little as you apply for different jobs. If you're including a work experience section, there are lots of different types of experience you could add beyond jobs with other companies: you could also include volunteer positions you've held and any freelance or side work you've done. The key here is the way you describe those experiences — try to describe the work you did in a way that relates to the position you're applying for. Most job descriptions have minimum qualifications or requirements listed. These are the experiences, skills, and education you'll need to be considered for the job, so it's important to clearly state them in your resume if you're a good match. The next step is checking out the preferred qualifications, which lots of job descriptions also include. These aren't required, but every additional qualification you match makes you a more competitive candidate for the role. Including any part of your skills and experience that matches a job description will help your resume rise above the competition. So if a job listing describes a job responsibility as effectively managing data resources, you'll want to have your own description that reflects that responsibility. For example, if you volunteered or worked at a local school or community center, you might say that you effectively managed resources for after-school activities. Later on, you'll learn more ways to make your work history work for you. It's helpful to describe your skills and qualifications in the same way. For example, if a listing talks about organization and partnering with others, try to think about relevant experiences you've had — maybe you helped organize a food drive, or partnered with someone to start an online business. In your descriptions, you want to highlight the impact you've had in your role, as well as the impact the role had on you. If you helped a business get started or reach new heights, talk about that experience and how you played a part in it; or if you worked at a store when it first opened, you can say you helped launch a successful business by ensuring quality customer service. If you used data analytics in any of your jobs, you'll definitely want to include that as well. We'll cover how to add specific data analysis skills
a little bit later. One way to do this is to follow a formula in your descriptions: accomplished X, as measured by Y, by doing Z. Here's an example of how this might read on a resume: selected as one of 275 participants nationwide for this 12-month professional development program for high-achieving talent, based on leadership potential and academic success. And if you've gained new skills in one of your experiences, be sure to highlight them all and how they helped. This is probably as good a spot as any to bring up data analytics. Even if this program is the first time you've really thought about data analytics, now that you're equipped with some knowledge, you'll want to use that to your benefit. So if you've ever managed money, maybe that means you helped a business analyze future earnings, or maybe you created a budget based on your analysis of previous spending. Even if it was for your own or a friend's small business, it's still data that you've analyzed, and now you can reflect on when and how, and use it in your resume. After you've added work experience and skills, you should include a section for any education you've completed — and yes, this course absolutely counts. You can add this course as part of your education, and you can also refer to it in your summary and skills sections. Depending on the format of your resume, you might want to add a section for technical skills you've acquired, both in this course and elsewhere. Besides technical skills like SQL, you could also include language proficiencies in this section; having some ability in a language other than English can only help your job search. So now you have an idea of how to make your resume look professional and appealing. As you move forward, you'll learn even more about how to make your resume shine, and by the end, you'll have a resume you can be proud of. Next up, we'll talk about how to make your resume truly unique. See you soon!

Great to see you again. Building a strong resume is a great way to find success in your job hunt. You've had the chance to start building your resume, and now we'll take the next step by showing you how to refine it for data analytics jobs. Let's get started. For data analysts, one of the most important things your resume should do is show that you're a clear communicator. Companies looking for analysts want to know that the people they hire can do the analysis, but can also explain it to any audience in a clear and direct way. Your first audience as a data analyst will most likely be hiring managers and recruiters, so being direct and coherent in your resume will go a long way with them as well. Let's start with the summary section. While you won't go into too much detail here about any of your work experiences, it's a good spot to point out if you're transitioning into a new career role. You might add something like: transitioning from a career in the auto industry and seeking a full-time role in the field of data analytics. One strategy you can use in your summary, and throughout your resume, is PAR statements. PAR stands for problem, action, result. This is a great way to help you write clearly and concisely. So instead of saying something like, was responsible for writing two blogs a month, you'd say, earned little-known website over 2,000 new clicks through strategic blogging. The website being little known is the problem, the strategic blogging is the action, and the 2,000 new clicks are the result. Adding PAR statements to your job descriptions or skills section can help with the organization and consistency of your resume — they
definitely helped me when I changed jobs. Speaking of the skills section: make sure you include any skills and qualifications you've acquired through this course and on your own. You don't need to be super technical, but talking about your experience with spreadsheets, SQL, Tableau, and R — a programming language we'll get to later — will enhance your resume and your chances of getting a job. So if you're listing qualifications or skills, you might include a spot for programming languages and then list SQL and R, which are both part of the Google Data Analytics Certificate. You might even add the top functions, packages, or formulas you're comfortable with in each. It also makes sense to include skills you've acquired in spreadsheets, like pivot tables. Pivot tables, SQL, R, and lots of the other terms we've covered here might get you noticed by hiring managers and recruiters, but you definitely want your resume to accurately represent your skills and abilities, so only add these skills after you've completed this certificate. Once you start applying the ideas we talked about here to your resume, you'll be well on your way to setting yourself apart from other candidates. And after you've completed your final course, you'll have the opportunity to complete a case study and link it on your resume — a great opportunity to show recruiters and hiring managers the skills you've learned while earning your certificate. Before you know it, you'll have a pretty great resume that you can update quickly whenever you're searching for a data analyst job. Nothing wrong with that! Up next, we'll talk more about adding experience to your resume. Bye for now!

Hello, my name is Joseph, and I'm a people analyst at Google. As a people analyst, my job is to work with executives and HR business partners to use data to make informed people decisions. Inclusion is essential to the work that we do. As you know, sometimes you can tell a story with data and have your own bias in it, so for us, in a field that is very sensitive, it's important that we have a diverse set of people with different backgrounds bringing their own lens to the data. Being a Black professional, I can better tell a story about people of color that is a lot more personal to me. Being an analyst requires me to take data and tell a story with it. On a personal level, I'm very passionate about increasing representation in the tech industry. For example, outside of work I run a non-profit called San Carvertech, and our whole goal is essentially to help develop the next generation of Black engineers who can be in this field and represent our experience, using data as a foundation and technology as the driving factor. Going forward, it's critical that we have more Black people in the technology sector. As you know, over the next 10 or 20 years, AI and machine learning will be like just speaking English in this country, or even the entire world. So the more Black people we have in this field, the more we can be represented in the products that these companies build, and the more our experiences can influence them. It's definitely critical that we have more Black engineers, more Black data scientists to do the analysis, and more Black data analysts to help tell a story that's more inclusive of our experience as well. So it's essential that we have people from different backgrounds, colors, and creeds to really understand data, bring their own lens
to it, and tell the story and make it very personal to our audience.

Welcome back. Everyone out there has their own personal work history; we all started somewhere, whether part-time or full-time. What matters for your resume is how you present the work you've done. In this video, we'll home in on work history and how you can translate yours effectively for your data analyst resume. If you don't have a specific section for work history in your resume, that's okay — you can use the same basic ideas to adjust your skills and qualifications section. The good news is that you already have a lot of the skills recruiters and hiring managers look for when they hire data analysts; you've probably used lots of them in previous jobs. We call these transferable skills: skills and qualities that can transfer from one job or industry to another. So think about all the positions you've held — associate, owner, team member, manager — and how they might be used as a data analyst. Let's start with a big one we talked about before: communication. When job descriptions say they want strong communication skills for a data analyst, it usually means they want someone who can speak about what they do to people who aren't as technical or analytical. If someone who's not familiar with data analytics can understand what you're talking about when you explain it to them, your communication skills are usually good to go. You've probably had to communicate in other jobs you've had, whether with employees, customers or clients, team members, or managers. You might have had to give presentations too, whether formal or informal. In your work experience section, you can highlight how your effective communication skills have helped you. You can also refer to specific presentations you've made and the outcomes of those presentations, and you can even include the audience for your presentations, especially if you presented to large groups or people in senior positions. After listing job details like the place and length of employment, you might add something like: effectively implemented and communicated daily workflow to fellow team members, resulting in an increase in productivity. Here, you'd change the details based on the work you did. Since you'll be working in the world of data, including quantitative data would be ideal — for example, the increase in productivity might have been a 15% increase. As long as you have a way to back up your data (hopefully with more data), you can put it in your resume. This example brings us to the next transferable skill: data analysts are problem solvers. When problems arise in a database or in lines of code, data analysts need to be able to find and troubleshoot them. If you have no prior experience working with data, you can still talk about your problem-solving skills. That last example we shared does a great job of showing an ability to problem-solve. It's actually written as a PAR — problem, action, result — statement, which we talked about earlier. The problem is that daily workflow procedures were not in place; the action is that you put the procedures into effect and communicated them to your team; and the result is that productivity increased by 15%. This makes it clear that there was a problem and you solved it. We can also use this statement to point out teamwork as an important quality to bring to the data analyst world. While you might have plenty of work to do on your own, it will always be for the benefit of the team — and team means not only the data team you're a part of, but the whole company as
well. So that's a few skills you can add to your work experience and skills and qualifications sections. All of these are known as soft skills: non-technical traits and behaviors that relate to how you work. Being detail-oriented and demonstrating perseverance are two more examples of soft skills that anyone hiring a data analyst will look for. Companies want to know that you'll do your analysis carefully and to completion, no matter what setbacks you might face along the way. If you worked at a retail job, you can talk about how your attention to detail helped you find discrepancies while handling a high volume of money, and you could add how you continued to practice customer service at a high level despite a high turnover rate at the management level. These are just some examples to think about and apply to your own work details. Take a moment and think back to your last job — or maybe it's your current job. What soft skills do you use to find success? Are you starting to understand how those are transferable to the world of data analytics? Using PAR statements and focusing on your transferable soft skills can really add to the power of your resume. So now you can keep powering on to the next step, to continue learning about the data analytics field and your future job in it. See you in the next video!

Hi, I'm Kate. I'm a senior product analyst at Google. I have always been a perhaps annoyingly curious person. Even as a child, I remember I would take things apart just to see how they worked. I just love seeing how things work together, and I love asking new questions — I love having more information. I think that makes me a more well-rounded person, and definitely a better analyst. At every step in my career, including the first step in the Army, I always picked out what I could in terms of trying to self-teach on things like databases and data. One of my first forays into data came after I'd been deployed: when I came back, I worked with a personnel office, and we had to do things like track where everybody was, what their pay was, their rank, and whether they were getting awards — and there wasn't a single system to work through for that. So I used an Access database. It took me forever to learn what a foreign key was and what a primary key was, and I'll be totally honest, I did really poorly; I ended up going back to Excel. But it was a really good learning experience, and after my time in the Army, I really had a sense of what I wanted to do. I had been doing personnel work, but I still enjoyed the technology piece, and I somehow spun my Army career into logistics, getting a job doing logistics for what they call the roundhouse, which is where they work on the locomotive engines. I did a lot of database maintenance. When I left the railroad, I went to a welding company, where I started out as a logistics person working on trucks and truck parts, but then I was able to transition into a more database- and data-focused role. After my time at the welding company, I was ready to try something a lot more technical. I ended up working for a small, boutique-style consulting firm that did a lot of work with Tableau, where we worked with companies and taught them how to do data visualization. I did Tableau training for a while. I was there for over six years, and through my time there, I did database engineering, I did data engineering, I got to run a team of analysts, and I got to teach people how to do
consulting. There was a lot of growth for me in that six-year period. After that, I decided to come to Google. I get to work with stakeholders on translations throughout the Google world: if anybody wants to translate something from one language to another, I get to work on the analytics of that. So if you take, you know, 500 different languages, or 40 different languages — what does it cost? How many words do we translate? What does that translation quality look like? If I look back on my career, I would have told myself five or ten years ago to focus on something, and not to feel too overwhelmed. The important thing is to be able to ask the right question and know how to answer it. I have confidence, and confidence is really important, because people are coming to me for answers. That's my job: to think really hard about the questions and give them answers that make them better and make the company better. The fact that I know I can do this now, after putting that kind of time and effort into it, is really, really rewarding.

Hello! So, if you haven't done a search for a data analyst job yet, give it a try. One thing you might notice is how many variations of data analyst jobs there are. You'll find some that just say data analyst in the job title, and others that include more details, like market research analyst or digital data analyst. This variety is a good thing: it means that, as a data analyst, you'll have a pretty wide range of job opportunities available. So while you might not be the right fit for every position that's posted, every position that's posted might not be the right fit for you, either. As you continue moving forward, it's important to keep your own interests in mind. There might be certain topics we've covered, or will cover, that you find yourself especially interested in. When you're job hunting, you might want to tailor your search to find jobs that focus on or include your areas of interest. For example, if a job description lists data cleaning as a job responsibility and you think you'd really enjoy that process, you could make that job your top priority. At the same time, think about your other interests. If you have a background in retail, medicine, or finance and have had a good experience with it, you might apply for jobs that match your background; as an added bonus, your experience will look great on your resume. But it's also okay to search for jobs in an area of personal interest where you have no professional experience. If you've always loved cars, check out what positions the auto industry has; if you're fascinated by how utility companies work, hunt for jobs in the energy and utilities industry. Finding a job is great — finding a job you love is even better. Always keep in mind that data analytics is constantly evolving within lots of different industries, so job titles and hiring needs might also change, but the opportunities, whatever they are when you're searching, will be there. So now let's preview some of the many kinds of data analyst jobs that are out there. The certificate you earn here will be most applicable to junior or associate data analyst positions, but that doesn't mean you have to limit your job search to postings for junior or associate analysts only. Job titles come in all shapes and sizes, and new analysts work in a wide range of industries. Healthcare analysts gather and interpret data from sources like electronic health records and patient surveys; their work helps organizations improve the quality of their care. Healthcare analysts might
also look for ways to lower the cost of care and improve the patient experience. Data analysts in marketing complete quantitative and qualitative market analyses; they identify important statistics and interpret and present their findings to help stakeholders understand the data behind their marketing strategies. Business intelligence analysts help companies use the data they've collected to increase their efficiency and maximize their profits; these analysts usually work with large amounts of data to identify trends and generate business insights. Financial analysts also work with lots of data — really, all analysts do — but financial analysts use the data to identify and potentially recommend business and investment opportunities. If you're a junior analyst in this field, you'll probably start off doing a lot of data gathering and financial modeling, as well as spreadsheet maintenance. This is just a small taste of the types of data analyst jobs out there, and each type we've covered can branch out into other industries as well; for example, business intelligence analysts can work in healthcare, government, e-commerce, and more. It's exciting to think about the possibilities. There's more work for you to do, of course, but there's nothing wrong with looking ahead, and when you get to that place you're looking ahead to, you'll be able to take charge and find the best job for you. For the time being, we'll keep exploring your resume. See you soon!

Great job on finishing this course! You've covered a lot of ground already and learned so much. There's only one thing left to do: keep going. But if you ever need to check back on what you've learned, these videos will still be available to you. Right now, I'm excited for you to meet your instructor for the next course, Ayanna. She's ready to guide you through the next part of the program as you continue your journey to becoming a data analyst. You've learned how to prepare and process data, and in the next course, Ayanna will show you how to analyze it. We'll explore how to make sense of all the data you've collected and cleaned, and you'll learn how to ask the right questions and use data to find answers. We'll also show you how to organize and format the data once again, so it's completely ready for analysis. We'll talk about aggregation and joins, two processes that allow you to gather all the data you need and summarize it for your stakeholders. Both spreadsheets and SQL will make an appearance too; we'll give you some more practice using them for calculations, and we'll explain how temporary tables in SQL work. Awesome job so far, and best of luck on what's next!