ISO/TS 8000-81:2021 Data quality — Part 81: Data quality assessment: Profiling
1 Scope

This document specifies a procedure for data profiling to generate the foundation for performing data quality assessment. This profiling is applicable to data sets that are either originally in a structure of tables and columns or are the output from a transformation to create such a structure.

NOTE 1 Data profiling is applicable to all types of database technology.

The following are within the scope of this document:
— performing structure analysis to determine data element concepts;
— performing column analysis to identify relevant data elements, including statistics about a data set;
— performing relationship analysis to identify dependencies in a data set.

The following are outside the scope of this document:
— methods for extracting and sampling data to be profiled from a data set;
— deriving data rules;
— measuring the extent of nonconformities in a data set.

NOTE 2 ISO 8000-8 specifies approaches to measuring data and information quality.

This document can be used in conjunction with, or independently of, quality management systems standards.

2 Normative references

The following documents are referred to in the text in such a way that some or all of their content constitutes requirements of this document. For dated references, only the edition cited applies. For undated references, the latest edition of the referenced document (including any amendments) applies.

ISO 8000-2, Data quality — Part 2: Vocabulary

3 Terms and definitions

For the purposes of this document, the terms and definitions given in ISO 8000-2 apply.

ISO and IEC maintain terminological databases for use in standardization at the following addresses:
— ISO Online browsing platform: available at https://www.iso.org/obp
— IEC Electropedia: available at http://www.electropedia.org/
4 Data profiling

The purpose of data profiling is to characterize the structure, columns and relationships of a data set. This characterization is a data profile that serves as the basis on which an organization can address data quality issues. The improvement can include creation of rules to enforce appropriate requirements on the data.

Data profiling consists of the following processes (see Figure 1):
— perform structure analysis (see Clause 5);
— perform column analysis (see Clause 6);
— perform relationship analysis (see Clause 7).
5.3 Outputs

The output from structure analysis is a data element concept.

6 Column analysis

6.1 Inputs

The inputs to column analysis are a data set and a corresponding data element concept from structure analysis (see Clause 5).

6.2 Scope of activities

Column analysis consists of:
— extracting data elements from the data element concept;
— comparing the data elements with the values in the data set;
— determining the value domain.

NOTE The methods for extracting data elements include discovery, assertion testing and visual inspection. These methods can be supported by automated tools.

6.3 Outputs

The output from column analysis is a list of constraints on the value domain. These constraints include the following (see Annex B for more details):
— cardinalities: count of rows, range of values, nulls, count of distinct values and uniqueness;
— storage: data type, length of values and decimals;
— valid values: discrete value list, permissible range, skip-over rules, pattern and domain.

7 Relationship analysis

7.1 Inputs

The inputs for relationship analysis are a data set and the corresponding data elements from column analysis (see Clause 6).

NOTE Relationship analysis extracts relationships between columns not only within a single table but also across multiple tables.

7.2 Scope of activities

Relationship analysis consists of:
— comparing the extracted data elements with any supporting information in the data set;
— determining dependency.

NOTE When performing relationship analysis, a key requirement is to understand the correspondence between the data structure (tables and columns) and items in the real world. This understanding arises from data profiling practitioners collaborating with experts who work with the core processes of the organization. These experts are familiar with the details of the items represented by the data.
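
As an informal illustration, not part of the specification, the kinds of constraint listed in 6.3 and the dependency determination in 7.2 could be sketched in Python as follows. The function names `profile_column` and `determines`, the sample table, and the choice to represent nulls as `None` are all assumptions made for this example. The sketch covers only some of the listed constraints (for instance, it omits patterns, decimals and skip-over rules), and it assumes that the non-null values in a column are mutually comparable.

```python
from collections import Counter
from typing import Any, Optional


def profile_column(values: list[Optional[Any]]) -> dict[str, Any]:
    """Summarize one column: cardinalities, storage and observed values."""
    non_null = [v for v in values if v is not None]
    distinct = Counter(non_null)
    lengths = [len(str(v)) for v in non_null]
    return {
        # Cardinalities
        "row_count": len(values),
        "null_count": len(values) - len(non_null),
        "distinct_count": len(distinct),
        "is_unique": len(distinct) == len(non_null),
        "min": min(non_null) if non_null else None,
        "max": max(non_null) if non_null else None,
        # Storage
        "types": sorted({type(v).__name__ for v in non_null}),
        "max_length": max(lengths) if lengths else 0,
        # Valid values: only what was observed, not what is permitted
        "value_list": sorted(distinct),
    }


def determines(rows: list[dict[str, Any]], lhs: str, rhs: str) -> bool:
    """Return True if every lhs value maps to exactly one rhs value."""
    seen: dict[Any, Any] = {}
    for row in rows:
        key, value = row[lhs], row[rhs]
        if key in seen and seen[key] != value:
            return False  # same lhs value, two different rhs values
        seen.setdefault(key, value)
    return True


# Hypothetical sample table, used only to exercise the two functions.
rows = [
    {"order_id": 1, "country": "DE", "currency": "EUR"},
    {"order_id": 2, "country": "FR", "currency": "EUR"},
    {"order_id": 3, "country": "DE", "currency": "EUR"},
    {"order_id": 4, "country": None, "currency": "USD"},
]

country_profile = profile_column([r["country"] for r in rows])
country_to_currency = determines(rows, "country", "currency")  # True
currency_to_country = determines(rows, "currency", "country")  # False
```

A dependency that holds in the profiled data is only a candidate: as the NOTE in 7.2 indicates, confirming that it reflects the real-world items still depends on collaboration with experts in the organization's core processes.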