Data Management - Data Quality

Video Statistics and Information

Captions
[Music] Welcome to lesson two of the Introduction to Data Management course. This lesson covers the data quality management capability. After this lesson you will be able to understand fundamental data quality concepts such as data quality profiling and assessment, data quality dimensions, and so on. In addition to fundamental concepts, data quality is covered from the people, process, and technology aspects.

At the beginning, let's define data quality. Data quality refers to the methodical approach, policies, and processes by which an organization manages the accuracy, validity, timeliness, completeness, uniqueness, and consistency of its data in data systems and data flows.

Speaking of data quality, the subject of data quality is the critical data element (CDE). Let's say we have a CDE named Date of Birth. The Date of Birth CDE is defined as the day on which a person is born. Now, in the context of quality of data, here are some questions we can ask about a critical data element: Is it accurate? Is it valid? Is it provided on time? Is it complete? Is it unique? Is it consistent? Each of these questions addresses a certain data quality aspect of the Date of Birth CDE. Those aspects are also called data quality dimensions. There are six fundamental data quality dimensions: accuracy, validity, timeliness, completeness, uniqueness, and consistency.

[Music] Not all data quality dimensions are applicable to every CDE. For instance, for date of birth it makes sense to look at the quality of data against the validity and completeness dimensions.

[Music] Now let's learn more about data quality dimensions, starting with the definition, of course. A data quality dimension refers to an aspect or feature of information that can be assessed and used to determine the quality of data. As we mentioned, there are six key data quality dimensions.

Accuracy: data accurately represent the real world. Typical example: incorrect spellings of product or person names or addresses.
Validity: data conforms to the syntax of its definition, such as format, type, or range. Typical example: incorrect classification values for gender or customer type.
Timeliness: data represents reality from the required point in time. Typical example: a customer address change which is effective on July 1st is entered into the system on July 15th.
Completeness: data are complete in terms of the required potential of data. Typical example: a customer address missing a zip code.
Uniqueness: data are properly identified and recorded only once. Typical example: a single customer is recorded twice in the database with different identifiers.
Consistency: data are represented consistently across the data set. Typical example: a customer account is closed but there is a new order associated with that account.

[Music] The next important concept in data quality is the data quality rule. Data quality rules are business rules intended to ensure the quality of data in terms of accuracy, validity, timeliness, completeness, uniqueness, and consistency.

[Music] For instance, for the Date of Birth CDE we can define the following data quality rules. Validity: birth date must be a valid date in the range from year 1900 to the current date. Completeness: date of birth must be entered for each individual; empty fields are not allowed. Please note that each data quality rule is associated with a particular data quality dimension. Also, multiple data quality rules can be associated with one data quality dimension.

[Music] The next step is to discuss the data quality process. The data quality process consists of four activities: one, define data quality requirements; two, conduct data quality assessment; three, resolve data quality issues; and four, monitor and control data quality.

Define data quality requirements. This activity is described as follows. First, perform data profiling, which will help us discover value frequencies and formats of data. Data profiling can be performed by using a specialized tool or the query languages that are supported by the data source. Although some data quality problems can be discovered during the data profiling activity, the purpose of data profiling is to give insight for the data quality assessment.

[Music] Conduct data quality assessment consists of the following steps: one, define data quality rules for accuracy, validity, timeliness, etc., and also quality thresholds; two, perform the data quality assessment by enforcing data quality rules on the existing data set; three, identify data quality issues and update the issue log.

Resolve data quality issues consists of the following steps: one, for data quality issues which are identified during the data quality assessment, conduct root cause analysis in order to determine the root cause of the issue; two, conduct issue resolution by eliminating the root cause of the issue; three, review data policies and procedures if necessary.

Finally, monitor and control data quality consists of the following steps: one, define and populate data quality scorecards; and two, monitor data quality.

[Music] Before we move forward, a brief reminder about the course objective: to cover each capability from the people, process, and technology aspects. This slide refers to the process aspects of the data quality capability.

[Music] Same as metadata management, the data quality process should be a part of the system development life cycle, or SDLC. SDLC refers to the process of planning, creating, testing, and deploying an information system. In other words, during application design and development you should ensure the quality of data in all application parts. In addition, same as the previous slide, this slide refers to the process part of metadata management capability.

[Music] So far we have introduced data quality concepts and the data quality process. Now let's define the key role in data quality management. The data quality analyst is the key role in data management responsible for performing the activities associated with the data quality process. Although this is the only role specific to data quality, the data quality analyst will work closely with business owners, data stewards, technical owners, and data custodians. That includes, but is not limited to, defining data quality rules, analyzing results of data quality profiling and assessments, investigating root causes of data quality issues, etc. Once again, a brief reference to the course objectives: this slide covers the people aspects of data quality management.

[Music] Now we will take an example to see how data quality really works, starting with the use case definition. The use case is to conduct a data quality assessment for the employee data set, specifically for the data element Date of Birth. The threshold for quality of data is 99% for each data quality dimension. This threshold means that 99% of employee data set records must pass the data quality rules in order to consider data quality as good. The CDE named Date of Birth is defined as the day on which the person is born. The data quality rules are defined as follows: one, validity: birth date must be a valid date in the range from year 1900 to the current date; and two, completeness: date of birth must be entered for each individual, no blanks allowed. So we have defined a data quality use case based on one CDE and two data quality rules associated with that CDE. Also, we have defined a data quality threshold that will determine the level of data quality for the defined CDE.

[Music] The very first step in the data quality process is to do the data profiling. Data profiling refers to the technique of surveying data in the database in order to get information about a specific data set. On the screen you can see a sample data set for employee data. One of the columns is date of birth, which is the target for this data quality exercise. Now let's do the data profiling and see what we have inside the data set. Here are the results of the data profiling: the data set contains eight records in total; seven out of eight records are unique; one out of eight records contains a blank field for the date of birth column. As you can see from the example, data profiling gives a big picture of the data in the data set. Once we complete the data profiling, the next step is to assess data against the data quality rules which are defined in the use case. That step is known as data quality assessment.

[Music] Data quality assessment is one of the key steps in data quality management. In reference to this example, the input for the data quality assessment is the employee data set and the data quality rules for date of birth which were previously defined in the use case. Now let's assess each record from the data set against the validity and completeness data quality rules.

Here are the results for the validity rule: six out of eight records passed the rule and two records failed the rule. One record failed because it contains an invalid date and the other contains a blank value. The score for the validity rule is 75%. That is below the threshold of 99% that we set in the use case definition. Therefore the actual result of 75% is colored red in order to visually emphasize the problem with data quality for the employee data set.

For the completeness rule, seven out of eight records passed the rule and one record failed the rule. The score for the completeness rule is 88%. That is still below the threshold of 99%. Same as in the previous example, the actual result of 88% is colored red in order to visually emphasize the problem with data quality for the employee data set.

The format we use to present data quality assessment results is known as a data quality scorecard. Data quality scorecards are an effective tool to monitor data quality in the organization.

Data quality assessment discovers data issues. The next steps refer to identification of the root cause of the issues and performing issue remediation. In this example the root causes of the issues are as follows. Validity: data entered in this field are not validated against the valid date format. Completeness: date of birth is not set as a mandatory field in the database. The next step is to perform issue remediation by making corrections in the data set. After making corrections we should update the data quality scorecard.

Now the data issues are identified and remediated. The final step is to ensure that this kind of issue will not happen again. In order to achieve that goal we need to perform the following actions: one, implement a validation control for the date of birth field against valid dates within the defined range; and two, set the date of birth field as mandatory in the database. By implementing remediation we will ensure that data issues like an invalid or empty date of birth will not happen again.

Same as metadata management, the data quality management capability requires technology tools to support the data quality process. A data quality tool should have a certain set of features in order to support the data quality process in an effective way. Here is the list of the key features: one, ability to conduct data profiling, including statistical analysis of data sets; two, ability to define and execute data quality rules for critical data elements which are subjects of the data quality check; three, ability to store data quality profiling and assessment results; four, ability to conduct the issue resolution process and discover issue patterns; and five, ability to create and visualize data quality scorecards. Before we move forward, a brief reminder about the course objective: to cover each capability from the people, process, and technology perspective. This slide refers to the technology perspective.

Another important milestone in the data management journey: the data quality management capability is almost completed. We have covered fundamental concepts of data quality as well as the people, process, and technology aspects. Before we move to the next capability, here are the key takeaways from this lesson.

The subjects of data quality are critical data elements (CDEs).
Data quality is about different aspects or features of CDEs. These aspects or features are also called data quality dimensions.
The six key data quality dimensions are accuracy, validity, timeliness, completeness, uniqueness, and consistency.
The data quality process consists of four major steps: one, define the key requirements; two, conduct DQ assessment; three, resolve DQ issues; and four, monitor and control.
All activities are performed by data quality analysts in cooperation with the business owner, data steward, technical owner, and data custodian.
Data quality profiling is about getting statistical information about data.
Data quality assessment is about assessing data quality and identifying data quality issues. It is based on data quality rules and thresholds.
Data quality rules are associated with data quality dimensions.
Data quality scorecards visualize the results of the assessment.
Root cause analysis is about finding the root cause of a particular data quality issue.
Data quality tools provide technology support to the data quality process to ensure the quality of data in accordance with business requirements.

[Music] That's all for lesson two and the data quality management capability. Now have some fun and complete the quiz. The next lesson is about data governance.
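The Date of Birth use case from the lesson (profiling, then assessing the validity and completeness rules against the 99% threshold) can be sketched in a few lines of Python. This is only an illustrative sketch, not the tooling used in the video: the eight records, the ISO date format, and the names RECORDS, is_complete, is_valid, profile, and assess are all assumptions made up for the example. The records are arranged so that they reproduce the counts quoted in the lesson (eight records, seven unique, one blank, one invalid date).

```python
from datetime import date, datetime

THRESHOLD = 0.99             # 99% pass rate, taken from the use case
MIN_DATE = date(1900, 1, 1)  # lower bound of the validity rule

# Hypothetical (name, date_of_birth) records; NOT the data shown in the video.
RECORDS = [
    ("Ann", "1985-03-14"),
    ("Bob", "1990-07-01"),
    ("Cid", "1978-11-30"),
    ("Dee", "1978-02-31"),  # invalid: February has no 31st day
    ("Eve", ""),            # blank date of birth
    ("Fay", "1969-12-05"),
    ("Gus", "2001-09-09"),
    ("Ann", "1985-03-14"),  # exact duplicate of the first record
]


def is_complete(value):
    """Completeness rule: date of birth must be entered, no blanks."""
    return value is not None and value.strip() != ""


def is_valid(value, today=None):
    """Validity rule: a real calendar date between 1900 and the current date."""
    today = today or date.today()
    if not is_complete(value):
        return False  # the lesson counts the blank record as a validity failure
    try:
        parsed = datetime.strptime(value, "%Y-%m-%d").date()
    except ValueError:
        return False
    return MIN_DATE <= parsed <= today


def profile(records):
    """Data profiling: basic counts that give the big picture of the data set."""
    return {
        "total": len(records),
        "unique": len(set(records)),
        "blank_dob": sum(1 for _, dob in records if not is_complete(dob)),
    }


def assess(records, rule):
    """Data quality assessment: apply one rule and score it against the threshold."""
    passed = sum(1 for _, dob in records if rule(dob))
    score = passed / len(records)
    return {
        "passed": passed,
        "failed": len(records) - passed,
        "score": score,
        "status": "green" if score >= THRESHOLD else "red",
    }


# A minimal data quality scorecard: one entry per rule (dimension).
scorecard = {
    "validity": assess(RECORDS, is_valid),
    "completeness": assess(RECORDS, is_complete),
}

if __name__ == "__main__":
    print(profile(RECORDS))
    for dimension, result in scorecard.items():
        print(f"{dimension}: {result['score']:.1%} ({result['status']})")
```

With these made-up records, six of eight pass the validity rule and seven of eight pass the completeness rule, so both scores fall below the 99% threshold and are marked red, as in the lesson's scorecard. The preventive actions described in the lesson (a validation control on input and a mandatory field in the database) would sit in the application and schema rather than in a check like this one.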
Info
Channel: Global Data Store LLC
Views: 46,775
Rating: 4.7839851 out of 5
Keywords: Data Management, Data Quality, Data Governance, Free Course
Id: kDOelMaTOuM
Length: 19min 59sec (1199 seconds)
Published: Sat Feb 11 2017