Toolkit
VSV Training
Chapter 04: Cleaning and Conforming
Prepared by: Vinh Tao
Date: 2/9/2008
4.0 Introduction
versus
How important is getting the data verifiably correct?
4.1.3 Balancing Conflicting Priorities (cont)
Corrective versus Transparent
Data Quality Can Learn From Manufacturing Quality
Data quality can learn a great deal from manufacturing
quality. Most of the issues that come from ETL
4.1.4 Formulate a Policy
most/least data-quality
Are there interesting patterns or trends revealed in scrutinizing data-quality issues over time?
Is there any correlation observable between data quality and the performance of the organization as a whole?
4.2.2 Cleaning Deliverable #1: Error Event Table
4.2.2 Cleaning Deliverable #1: Error Event Table (cont)
The attributes of the screen dimension are as follows:
The ETL Stage describes the stage in the overall ETL process in which the data-quality screen is applied.
The Processing Order Number is a primitive
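The relationship between screens and the error event table can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the field names, the `run_screens` helper, and the sample screens are hypothetical, and treating the processing order number as "lower runs first" is an assumption made for the example.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Callable


@dataclass(frozen=True)
class Screen:
    """One row of a hypothetical screen dimension."""
    screen_key: int
    name: str
    etl_stage: str                 # ETL stage in which the screen is applied
    processing_order_number: int   # assumption: lower numbers run first
    check: Callable[[dict], bool]  # returns True when the record passes


@dataclass(frozen=True)
class ErrorEvent:
    """One row of a hypothetical error event table."""
    screen_key: int
    record_id: object
    event_time: datetime


def run_screens(records, screens, now=None):
    """Apply every screen to every record and log one event per failure."""
    now = now or datetime.now()
    events = []
    for screen in sorted(screens, key=lambda s: s.processing_order_number):
        for record in records:
            if not screen.check(record):
                events.append(ErrorEvent(screen.screen_key, record["id"], now))
    return events


# Hypothetical screens and records, for illustration only.
screens = [
    Screen(2, "amount_non_negative", "cleaning", 20, lambda r: r["amount"] >= 0),
    Screen(1, "customer_present", "extract", 10, lambda r: r["customer"] is not None),
]
records = [
    {"id": "a", "customer": "Acme", "amount": 12.5},
    {"id": "b", "customer": None, "amount": -3.0},
]
events = run_screens(records, screens)
print([(e.screen_key, e.record_id) for e in events])
```

Each failure becomes its own row, so one bad record can produce several error events, one per screen it fails.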
4.2.3 Cleaning Deliverable #2: Audit Dimension
The audit dimension is literally attached to each fact record in the data warehouse and captures important ETL-processing milestone timestamps and outcomes.
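A minimal Python sketch of attaching an audit record to fact rows follows. The column names (`extract_completed`, `cleaning_completed`, `screens_failed`, `outcome`) and the `attach_audit` helper are assumptions chosen for illustration, not a prescribed layout.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class AuditRow:
    """One row of a hypothetical audit dimension."""
    audit_key: int
    extract_completed: datetime   # milestone timestamp (assumed name)
    cleaning_completed: datetime  # milestone timestamp (assumed name)
    screens_failed: int           # outcome measure (assumed name)
    outcome: str


def attach_audit(fact_rows, audit):
    """Return copies of the fact rows, each carrying the audit key."""
    return [{**row, "audit_key": audit.audit_key} for row in fact_rows]


audit = AuditRow(
    audit_key=101,
    extract_completed=datetime(2008, 2, 9, 1, 0),
    cleaning_completed=datetime(2008, 2, 9, 1, 30),
    screens_failed=0,
    outcome="passed",
)
facts = attach_audit(
    [{"order_id": 1, "amount": 10.0}, {"order_id": 2, "amount": 7.5}],
    audit,
)
print(facts)
```

Because every fact row carries the key, a query can join back to the audit row to see when and how that row was processed.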
4.3.2 Types of Enforcement
It is useful to divide the various kinds of
data-quality checks into four broad
categories:
Column property enforcement
Structure enforcement
Data enforcement
Value enforcement
ranges
Columns whose lengths are unexpectedly short or long
Columns that contain values outside of discrete valid value sets
Adherence to a required pattern or member of a set of patterns
Hits against a list of known wrong values where the list of acceptable values is too long
Spell-checker rejects
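Several of the column checks listed above can be written as small predicates. The sketch below is illustrative only: the column names, limits, valid-value set, patterns, and known-wrong list are all hypothetical and would come from the data profile of the actual source.

```python
import re


def out_of_range(value, low, high):
    """True when a numeric value falls outside the expected range."""
    return value is not None and not (low <= value <= high)


def bad_length(value, min_len, max_len):
    """True when a string is unexpectedly short or long."""
    return value is not None and not (min_len <= len(value) <= max_len)


def not_in_valid_set(value, valid_values):
    """True when the value is outside a discrete valid value set."""
    return value not in valid_values


def breaks_pattern(value, patterns):
    """True when the value matches none of the required patterns."""
    return not any(re.fullmatch(p, value) for p in patterns)


def known_wrong(value, wrong_values):
    """True when the value appears in a list of known wrong values."""
    return value in wrong_values


def column_property_errors(record):
    """Run the hypothetical screens and describe each failure."""
    failures = []
    if out_of_range(record["age"], 0, 120):
        failures.append("age: outside expected range")
    if bad_length(record["name"], 2, 40):
        failures.append("name: unexpected length")
    if not_in_valid_set(record["state"], {"CA", "NY", "TX"}):
        failures.append("state: not in valid value set")
    if breaks_pattern(record["zip"], [r"\d{5}", r"\d{5}-\d{4}"]):
        failures.append("zip: does not match a required pattern")
    if known_wrong(record["name"], {"N/A", "UNKNOWN", "TEST"}):
        failures.append("name: known wrong value")
    return failures


bad = {"age": 212, "name": "N/A", "state": "ZZ", "zip": "9410"}
good = {"age": 41, "name": "Ada Lovelace", "state": "CA", "zip": "94105-1234"}
print(column_property_errors(bad))
print(column_property_errors(good))
```

Keeping each check as a separate predicate makes it easy to register them one by one as screens and to report exactly which rule a record failed.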
4.3.4 Structure Enforcement
Whereas column property enforcement
focuses on individual fields, structure
enforcement focuses on the relationship of
columns to each other.
Structure enforcement also checks
hierarchical parent-child relationships to make
sure that every child has a parent or is the
supreme parent in a family.
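The parent-child check described above can be sketched as follows. It assumes, purely for illustration, that the hierarchy is held as a mapping from each member to its parent, with `None` marking a supreme parent; the function name and sample data are hypothetical.

```python
def orphaned_children(parent_of):
    """Return members whose declared parent does not exist in the family.

    parent_of maps each member to its parent; None marks a supreme parent.
    """
    return sorted(
        child
        for child, parent in parent_of.items()
        if parent is not None and parent not in parent_of
    )


# Hypothetical hierarchy: "lyon" points at a parent that was never loaded.
hierarchy = {
    "corp": None,
    "emea": "corp",
    "paris": "emea",
    "lyon": "france",
}
print(orphaned_children(hierarchy))
```

An empty result means every child has a parent or is itself a supreme parent.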
4.3.6 Measurements Driving Screen Design
4.3.9 Screens
Providing a history of record counts by day for tables to be extracted
Providing a history of totals of key business metrics by day
Identifying required columns
Identifying column sets that should be unique
Identifying columns permitted (and not permitted) to be null
Determining acceptable ranges of numeric fields
Determining acceptable ranges of lengths for character columns
Determining the set of explicitly valid values for all columns where this can be defined
Identifying frequently appearing invalid values in columns that do not have explicit valid value sets
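A few of the measurements above (null usage, numeric ranges, character lengths) can be gathered with a simple column profile. The sketch below is a hypothetical helper: it reports only what was observed in a sample, which is a starting point for deciding acceptable ranges rather than the decision itself.

```python
def profile_column(values):
    """Summarize observed nullability, numeric range, and string lengths."""
    non_null = [v for v in values if v is not None]
    profile = {"nullable": len(non_null) < len(values)}
    numbers = [v for v in non_null if isinstance(v, (int, float))]
    strings = [v for v in non_null if isinstance(v, str)]
    if numbers:
        profile["numeric_range"] = (min(numbers), max(numbers))
    if strings:
        lengths = [len(s) for s in strings]
        profile["length_range"] = (min(lengths), max(lengths))
    return profile


# Hypothetical sample columns.
print(profile_column([3, 18, None, 7]))
print(profile_column(["CA", "NY", "Texas"]))
```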
Standardizing
Matching and deduplication
Surviving
4.4.1 Conformed Dimensions
4.4.4 Permissible Variations of Conformed Dimensions
4.4.12 Delivering
Delivering is the final essential ETL step.
In the smallest data warehouses consisting of
4.5 Summary
The objectives.
Data-quality techniques.
Data-quality metadata.
Data-quality measurements.