Data Quality for Analytics Using SAS
About this ebook
Gerhard Svolba
Dr. Gerhard Svolba is a senior solutions architect and analytic expert at SAS Institute Inc. in Austria, where he specializes in analytics in different business and research domains. His project experience ranges from business and technical conceptual considerations to data preparation and analytic modeling across industries. He is the author of Data Preparation for Analytics Using SAS and teaches a SAS training course called "Building Analytic Data Marts."
The correct bibliographic citation for this manual is as follows: Svolba, Gerhard. 2012. Data Quality for Analytics Using SAS®. Cary, NC: SAS Institute Inc.
Data Quality for Analytics Using SAS®
Copyright © 2012, SAS Institute Inc., Cary, NC, USA
ISBN 978-1-60764-620-4 (Hardcopy)
ISBN 978-1-62959-802-4 (EPUB)
ISBN 978-1-62959-803-1 (MOBI)
ISBN 978-1-61290-227-2 (PDF)
All rights reserved. Produced in the United States of America.
For a hard-copy book: No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, or otherwise, without the prior written permission of the publisher, SAS Institute Inc.
For a Web download or e-book: Your use of this publication shall be governed by the terms established by the vendor at the time you acquire this publication.
The scanning, uploading, and distribution of this book via the Internet or any other means without the permission of the publisher is illegal and punishable by law. Please purchase only authorized electronic editions and do not participate in or encourage electronic piracy of copyrighted materials. Your support of others’ rights is appreciated.
U.S. Government License Rights; Restricted Rights: The Software and its documentation is commercial computer software developed at private expense and is provided with RESTRICTED RIGHTS to the United States Government. Use, duplication or disclosure of the Software by the United States Government is subject to the license terms of this Agreement pursuant to, as applicable, FAR 12.212, DFAR 227.7202-1(a), DFAR 227.7202-3(a) and DFAR 227.7202-4 and, to the extent required under U.S. federal law, the minimum restricted rights as set out in FAR 52.227-19 (DEC 2007). If FAR 52.227-19 is applicable, this provision serves as notice under clause (c) thereof and no other notice is required to be affixed to the Software or documentation. The Government's rights in Software and documentation shall be only those set forth in this Agreement.
SAS Institute Inc., SAS Campus Drive, Cary, North Carolina 27513-2414
April 2012
SAS® and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration.
Other brand and product names are trademarks of their respective companies.
For my three teenage sons, for their constant effort
to let me share in the many wonderful moments of their lives,
and without whose help
this book would probably have been completed a year earlier.
You are the quality of my life.
Acknowledgments
Martina, for supporting me, and even the crazy idea of writing this book, during what is probably the busiest period of our family life.
My parents, for providing me with so many possibilities.
The following persons, who contributed to this book by taking the time to discuss data quality topics with me. It is a pleasure to work and talk with you: Albert Tösch, Andreas Müllner, Bertram Wassermann, Bernadette Fabits, Christine Hallwirth, Claus Reisinger, Franz Helmreich, Franz König, Helmut Zehetmayr, Josef Pichler, Manuela Lenk, Mihai Paunescu, Matthias Svolba, Nicole Schwarz, Peter Bauer, Phil Hermes, Stefan Baumann, Thomas Schierer, and Walter Herrmann.
The reviewers, who took the time to review my manuscript and provided constructive feedback and suggestions. I greatly appreciate your effort: Anne Milley, David Barkaway, Jim Seabolt, Mihai Paunescu, Mike Gilliland, Sascha Schubert, and Udo Sglavo.
The nice and charming SAS Press team, for their support throughout the entire process of creating this book: Julie Platt, Stacey Hamilton, Shelley Sessoms, Kathy Restivo, Shelly Goodin, Aimee Rodriguez, Mary Beth Steinbach, and Lucie Haskins.
The management of SAS Austria, for supporting the idea of writing this book: Dietmar Kotras and Robert Stindl.
August Ernest Müller, my great-grandfather, designed one of the very early construction plans for a helicopter in 1916 and filed a patent in the Austro-Hungarian monarchy. However, he found no sponsor to realize his project during World War I. I discovered his construction plans and documents by chance around the time I started writing this book. His work impressed and motivated me greatly.
Contents
Introduction
Part I Data Quality Defined
Chapter 1 Introductory Case Studies
1.1 Introduction
1.2 Case Study 1: Performance of Race Boats in Sailing Regattas
Overview
Functional problem description
Practical questions of interest
Technical and data background
Data quality considerations
Case 1 summary
1.3 Case Study 2: Data Management and Analysis in a Clinical Trial
General
Functional problem description
Practical question of interest
Technical and data background
Data quality considerations
Case 2 summary
1.4 Case Study 3: Building a Data Mart for Demand Forecasting
Overview
Functional problem description
Functional business questions
Technical and data background
Data quality considerations
Case 3 summary
1.5 Summary
Data quality features
Data availability
Data completeness
Inferring missing data from existing data
Data correctness
Data cleaning
Data quantity
Chapter 2 Definition and Scope of Data Quality for Analytics
2.1 Introduction
Different expectations
Focus of this chapter
Chapter 1 case studies
2.2 Scoping the Topic Data Quality for Analytics
General
Differentiation of data objects
Operational or analytical data quality
General data warehouse or advanced analytical analyses
Focus on analytics
Data quality with analytics
2.3 Ten Percent Missing Values in Date of Birth Variable: An Example
General
Operational system
Systematic missing values
Data warehousing
Usability for analytics
Conclusion
2.4 Importance of Data Quality for Analytics
2.5 Definition of Data Quality for Analytics
General
Definition
2.6 Criteria for Good Data Quality: Examples
General
Data and measurement gathering
Plausibility check: Relevancy
Correctness
Missing values
Definitions and alignment
Timeliness
Adequacy for analytics
Legal considerations
2.7 Conclusion
General
Upcoming chapters
Chapter 3 Data Availability
3.1 Introduction
3.2 General Considerations
Reasons for availability
Definition of data availability
Availability and usability
Effort to make data available
Dependence on the operational process
Availability and alignment in the time dimension
3.3 Availability of Historic Data
Categorization and examples of historic data
The length of the history
Customer event histories
Operational systems and analytical systems
3.4 Historic Snapshot of the Data
More than historic data
Confusion in definitions
Example of a historic snapshot in predictive modeling
Comparing models from different time periods
Effort to retrieve historic snapshots
Example of historic snapshots in time series forecasting
3.5 Periodic Availability and Actuality
Periodic availability
Actuality
3.6 Granularity of Data
General
Definition of requirements
3.7 Format and Content of Variables
General
Main groups of variable formats for the analysis
Considerations for the usability of data
Typical data cleaning steps
3.8 Available Data Format and Data Structure
General
Non-electronic format of data
Levels of complexity for electronically available data
Availability in a complex logical structure
3.9 Available Data with Different Meanings
Problem definition
Example
Consequences
3.10 Conclusion
Chapter 4 Data Quantity
4.1 Introduction
Quantity versus quality
Overview
4.2 Too Little or Too Much Data
Having not enough data
Having too much data
Having too many observations
Having too many variables
4.3 Dimension of Analytical Data
General
Number of observations
Number of events
Distribution of categorical values (rare classes)
The number of variables
Length of the time history
Level of detail in forecast hierarchies
Panel data sets and repeated measurement data sets
4.4 Sample Size Planning
Application of sample size planning
Sample size calculation for data mining?
4.5 Effect of Missing Values on Data Quantity
General
Problem description
Calculation
Summary
4.6 Conclusion
Chapter 5 Data Completeness
5.1 Introduction
5.2 Difference between Availability and Completeness
Availability
Completeness
Categories of missing data
Effort to get complete data
Incomplete data are not necessarily missing data
Random or systematic missing values
5.3 Random Missing Values
Definition
Handling
Consequences
Imputing random missing values
5.4 Customer Age Is Systematically Missing for Long-Term Customers
Problem definition
Systematic missing values
Example
Consequences
5.5 Completeness across Tables
Problem description
Completeness in parent-child relationships
Completeness in time series data
5.6 Duplicate Records: Overcompleteness of Data
Definition of duplicate records
Consequences
Reasons for duplicate records
Detecting and treating duplicates
5.7 Conclusion
Chapter 6 Data Correctness
6.1 Introduction
6.2 Correctness of Data Transfer and Retrieval
General
Data entry
Data transfer
Minimize the number of data transfer steps
Comparing data between systems
6.3 Plausibility Checks
General
Categorical values
Interval values
Business rules
Plausibility checks in relationships between tables
Process step where plausibility is checked
6.4 Multivariate Plausibility Checks
General
Multivariate definition of outliers
Outlier definition in the case of trends
6.5 Systematic and Random Errors
General
Random errors
Systematic errors
6.6 Selecting the Same Value in Data Entry
Problem definition
Using a default value in data entry
Example
Consequences
6.7 Psychological and Business Effects on Data Entry
Problem definition
Example
Consequences
6.8 Interviewer Effects
Problem description
Geographic or interviewer effect
A similar example
Consequences
6.9 Domain-Specific and Time-Dependent Correctness of Data
General
Correctness from a domain-specific point of view
Correctness from a time-dependent point of view
6.10 Conclusion
Chapter 7 Predictive Modeling
7.1 Introduction
A widely used method
Data quality considerations
7.2 Definition and Specifics of Predictive Models
The process of predictive modeling
7.3 Data Availability and Predictive Modeling
General
Historic snapshot of the data
Illustrative example for need to separate data over time
Multiple target windows
Data availability over time
7.4 Stable Data Definition for Future Periods
General
Requirements for regular scoring
Categories and missing values
Change in distributions
Checking distribution changes
Output data quality
7.5 Effective Number of Observations in Predictive Modeling
General
The vast reduction of observations
Pre-selected subsets of analysis subjects
7.6 Conclusion
Chapter 8 Analytics for Data Quality
8.1 Introduction
8.2 Correlation: Problem and Benefit
Problem description
Imputing missing values based on other variables
Substituting the effect of unavailable or unusable variables
Multicollinearity or the need for independent variables
Sign inversion
Derived variables from transactional data
Derived variables for customer behavior
8.3 Variability
General
Statistical variability and the significance of p-values
Introducing variability
Instability of the business background and definitions
Undescribed variability
8.4 Distribution and Sparseness
General
Missing values
Distribution of interval variables
Outliers
Categorical variables and rare events
Grouping sparse categories
Sparse values in time series forecasting
Clustering
8.5 Level of Detail
Detailed data and aggregated data
Data structures for analytics
Sampling
8.6 Linking Databases
Linking and combining data
Multivariate plausibility checks
Checking parent/child relationships
Project time estimates
Complex regulations
8.7 Conclusion
Chapter 9 Process Considerations for Data Quality
9.1 Introduction
9.2 Data Relevancy and the Picture of the Real World
Technical data quality and business data quality
Relevancy
Intent of the data retrieval system
Possible consequences
Reformulation of the business questions
Conversion of real-world facts into data
Conclusion
9.3 Consequences of Poor Data Quality
Introduction
Analysis projects are not started
Analysis results are not trusted
Analysis projects take longer than expected
Wrong decisions can be made
Loss of company or brand image
Regulatory fines or imprisonment
The desired results are not obtained
No statistical significance is reached
Conclusion
Different consequences for reporting and analytics
Required degree of accuracy
9.4 Data Quality Responsibilities
General
Responsible departments
Data quality responsibilities separated from business projects
Process features that trigger good data quality
9.5 Data Quality as an Ongoing Process
General
Maintaining the status
Short-term fixing or long-term improvement
9.6 Data Quality Monitoring
General
Example KPIs for data quality monitoring
Dimensions for analyzing the data quality
Analysis over time
Outlook
9.7 Conclusion
Part II Data Quality—Profiling and Improvement
Chapter 10 Profiling and Imputation of Missing Values
10.1 Introduction
More than simple missing value reporting
Profiling missing values
Imputing missing values
SAS programs
10.2 Simple Profiling of Missing Values
General
Counting missing values with PROC MEANS
Using a macro for general profiling of missing values
10.3 Profiling the Structure of Missing Values
Introduction
The %MV_PROFILE_CHAIN macro
Data to illustrate the %MV_PROFILE_CHAIN macro
Simple reports based on the missing value profile chain
Advanced reports based on the missing value profile chain
Usage examples
Usage information for the %MV_PROFILING macro
10.4 Univariate Imputation of Missing Values with PROC STANDARD
General
Replacement values for the entire table
Replacement values for subgroups
10.5 Replacing Missing Values with the Impute Node in SAS Enterprise Miner
Overview
Available methods for interval variables
Available methods for categorical variables
References
10.6 Performing Multiple Imputation with PROC MI
Single versus multiple imputation
Example data for multiple imputation and analysis
Performing multiple imputation with PROC MI
Analyze data with PROC LOGISTIC
Combine results with PROC MIANALYZE
10.7 Conclusion
The SAS offering
Business and domain expertise
Chapter 11 Profiling and Replacement of Missing Data in a Time Series
11.1 Introduction
General
SAS programs
11.2 Profiling the Structure of Missing Values for Time Series Data
Missing values in a time series
Example data
The TS_PROFILE_CHAIN
Reports for profiling a time series
Macro %PROFILE_TS_MV
11.3 Checking and Assuring the Contiguity of Time Series Data
Difference between transactional data and time series data
Example of a non-contiguous time series
Checking and assuring contiguity
PROC TIMESERIES
Macro implementation and usage
11.4 Replacing Missing Values in Time Series Data with PROC TIMESERIES
Overview
Functionality of PROC TIMESERIES
Examples
Changing zero values to missing values
11.5 Interpolating Missing Values in Time Series Data with PROC EXPAND
Introduction
Statistical methods in PROC EXPAND
Using PROC EXPAND
11.6 Conclusion
General
Available methods
Business knowledge
Chapter 12 Data Quality Control across Related Tables
12.1 Introduction
General
Relational model
Overview
12.2 Completeness and Plausibility Checks
General
Completeness check of records
Plausibility check of records
12.3 Implementation in SAS
Merging tables
Other methods in SAS
12.4 Using a SAS Hash for Data Quality Control
Example data
Completeness control in parent-child relationships
Plausibility checks in parent-child relationships
12.5 Conclusion
Chapter 13 Data Quality with Analytics
13.1 Introduction
13.2 Benefit of Analytics in General
Outlier detection
Missing value imputation
Data standardization and de-duplication
Handling data quantity
Analytic transformation of input variables
Variable selection
Assessment of model quality and what-if analyses
13.3 Classical Outlier Detection
Ways to define validation limits
Purpose of outlier detection
Statistical methods
Implementation
Outlier detection with analytic methods
13.4 Outlier Detection with Predictive Modeling
General
Methods in SAS
Example of clinical trial data
Extension of this method
13.5 Outlier Detection in Time Series Analysis
General
Time series models
Outlier detection with ARIMA(X) models
Decomposition and smoothing of time series
13.6 Outlier Detection with Cluster Analysis
General
Conclusion
13.7 Recognition of Duplicates
General
Contribution of analytics
13.8 Other Examples of Data Profiling
General
Benford’s law for checking data
13.9 Conclusion
Chapter 14 Data Quality Profiling and Improvement with SAS Analytic Tools
14.1 Introduction
14.2 SAS Enterprise Miner
Short description of SAS Enterprise Miner
Data quality correction
Assessing the importance of variables
Gaining insight into data relationships
Modeling features for data quality
Quick assessment of model quality and what-if analyses
Handling small data quantities
Text mining
Features for modeling and scoring in SAS Enterprise Miner
14.3 SAS Model Manager
14.4 SAS/STAT Software
14.5 SAS Forecast Server and SAS Forecast Studio
Short description of SAS Forecast Server
Data preprocessing
Outlier detection
Model output data quality
Data quantity
14.6 SAS/ETS Software
14.7 Base SAS
14.8 JMP
General
Detecting complex relationships and data quality problems with JMP
Missing data pattern
Sample size and power calculation
14.9 DataFlux Data Management Platform
Short description of DataFlux Data Management Platform
Data profiling
Data standardization and record matching
Defining a data quality process flow
14.10 Conclusion
Part III Consequences of Poor Data Quality—Simulation Studies
Chapter 15 Introduction to Simulation Studies
15.1 Rationale for Simulation Studies for Data Quality
Closing the loop
Investment to improve data quality
Further rationale for the simulation studies
15.2 Results Based on Simulation Studies
Analytical domains in the focus of the simulation studies
Data quality criteria that are simulated
15.3 Interpretability and Generalizability
Simulation studies versus hard-fact formulas
Illustrate the potential effect
15.4 Random Numbers: A Core Ingredient for Simulation Studies
The simulation environment
Random number generators
Creation of random numbers in SAS
Random numbers with changing start values
Code example
15.5 Downloads
Chapter 16 Simulating the Consequences of Poor Data Quality for Predictive Modeling
16.1 Introduction
Importance of predictive modeling
Scope and generalizability of simulations for predictive modeling
Overview of the functional questions of the simulations
16.2 Base for the Business Case Calculation
Introduction
The reference company Quality DataCom
16.3 Definition of the Reference Models for the Simulations
Available data
Data preparation
Building the reference model
Process of building the reference model
Optimistic bias in the models in the simulation scenarios
Detailed definition of the data and the reference model results
16.4 Description of the Simulation Environment
General
Input data source node CHURN
START GROUPS and END GROUPS node
DATA PARTITION node
INSERT MISSING VALUES node
IMPUTE MISSING VALUES node
REGRESSION node
MODEL COMPARISON node
STORE ASSESSM. STATISTICS node
16.5 Details of the Simulation Procedure
Validation method
Process of building the scenario models
Validation statistic
Data quality treatment in training data and scoring data
Box-and-whisker plots
16.6 Downloads
16.7 Conclusion
Chapter 17 Influence of Data Quality and Data Availability on Model Quality in Predictive Modeling
17.1 Introduction
General
Data quantity
Data availability
17.2 Influence of the Number of Observations
Detailed functional question
Data preparation
Simulation settings
Results
Business case
17.3 Influence of the Number of Events
Detailed functional question
Data preparation
Simulation settings
Results
Business case
17.4 Comparison of the Reduction of Events and the Reduction of Observations
17.5 Effect of the Availability of Variables
Alternate predictors
Availability scenarios
Results
Interpretation
Business case
17.6 Conclusion
Chapter 18 Influence of Data Completeness on Model Quality in Predictive Modeling
18.1 Introduction
General
Random and systematic missing values
Missing values in the scoring data partition
18.2 Simulation Methodology and Data Preparation
Inserting random missing values
Inserting systematic missing values
Replacing missing values
Process flow
Simulation scenarios
18.3 Results for Random Missing Values
Random missing values only in the training data
Random missing values in the training and scoring data
Business case
18.4 Results for Systematic Missing Values
Systematic missing values only in the training data
Systematic missing values in the training and scoring data
18.5 Comparison of Results between Different Scenarios
Introduction
Graphical comparison
Differentiating between types of missing values
Multivariate quantification
18.6 Conclusion
Chapter 19 Influence of Data Correctness on Model Quality in Predictive Modeling
19.1 Introduction
General
Non-visible data quality problem
Random and systematic bias
Biased values in the scoring data partition
19.2 Simulation Methodology and Data Preparation
Standardization of numeric values
Inserting random biases in the input variables
Inserting systematic biases in the input variables
Inserting a random bias in the target variable
Inserting a systematic bias in the target variable
Simulation scenarios
19.3 Results for Random and Systematic Bias in the Input Variables
Scenario settings
Bias in the input variables in the training data only
Bias in the input variables in the training and scoring data
Comparison of results
19.4 Results for Random and Systematic Bias in the Target Variables
General
Examples of biased target variables
Detecting biased target variables
Results for randomly biased target variables
Results for systematically biased target variables
19.5 Conclusion
General
Treatment of biased or incorrect data
19.6 General Conclusion of the Simulations for Predictive Modeling
Increasing the number of events and non-events matters
Age variable is important and there are compensation effects between the variables
It makes a difference whether data disturbances occur in the training data only or in both the training and scoring data
Random disturbances affect model quality much less than systematic disturbances
Chapter 20 Simulating the Consequences of Poor Data Quality in Time Series Forecasting
20.1 Introduction
General
Purpose and application of time series forecasting
Methods to forecast time series
Scope and generalizability of simulations for time series forecasting
20.2 Overview of the Functional Questions of the Simulations
20.3 Base for the Business Case Calculation
Introduction
The reference company Quality DataCom
20.4 Simulation Environment
General
Available data for the simulation environment
Time series methods
20.5 Simulation Procedure
Basic simulation procedure
Insertion of disturbances for the data in the simulation procedure
Loop over TIME HISTORIES
Loop over shifts
Qualification of time series for the simulations
Assessment of forecast accuracy
20.6 Downloads
20.7 Conclusion
Chapter 21 Consequences of Data Quantity and Data Completeness in Time Series Forecasting
21.1 Introduction
21.2 Effect of the Length of the Available Time History
General
Simulation procedure
Graph results
Interpretation
Results in numbers
Business case calculation
21.3 Optimal Length of the Available Time History
General
Results
Interpretation
21.4 Conclusion
General
Data relevancy
Self-assessment of time series data
Chapter 22 Consequences of Random Disturbances in Time Series Data
22.1 Introduction
General
Simulation procedure
Types of random disturbances
22.2 Consequences of Random Missing Values
General
Insertion and replacement of missing values
Results for missing value imputation with PROC EXPAND
Results for missing value imputation with PROC TIMESERIES
22.3 Consequences of Random Zero Values
General
Results
22.4 Consequences of Random Biases
General
Standard deviation as basis for random bias
Code to insert a random bias
Results
22.5 Conclusion
Chapter 23 Consequences of Systematic Disturbances in Time Series Data
23.1 Introduction
General
Simulation procedure
Systematically selecting observations from the time series
Types of systematic disturbances
23.2 Coding Systematic Disturbances in Time Series Data
Precalculation of values
Systematic selection based on the decile group
Systematic selection based on the calendar month
23.3 Results for the Effect of Systematic Disturbances
General
Systematic disturbances inserted for the top 10% of time series values
Systematic disturbances inserted for three consecutive calendar months
23.4 Interpretation
23.5 General Conclusions of the Simulations for Time Series Forecasting Shown in Chapters 21–23
Increasing length of data history decreases forecast error
The marginal effect of additional forecast months decreases
For many time series, a short time history causes better forecast accuracy
Long time histories can solve data quality problems to some extent
Appendix A: Macro Code
A.1 Introduction
A.2 Code for Macro %COUNT_MV
General
Macro code
A.3 Code for Macro %MV_PROFILING
General
Parameters
Macro code
Comments
A.4 Code for Macro %PROFILE_TS_MV
General
Parameters
Macro code
A.5 Code for Macro %CHECK_TIMEID
General
Parameters
Macro code
Comments
Appendix B: General SAS Content and Programs
B.1 Calculating the Number of Records with at Least One Missing Value
B.2 The SAS Power and Sample Size Application
Appendix C: Using SAS Enterprise Miner for Simulation Studies
C.1 Introduction
C.2 Preparation of SAS Enterprise Miner for SEED=0 Random Numbers
General
Changing the settings
C.3 Simulation Environment
General
Features
Deriving the parameter setting from the node name
Programming details
C.4 Discussion of the Suitability of SAS Enterprise Miner for a Simulation Environment
General
Advantages
C.5 Selected Macros and Macro Variables Available in a SAS Enterprise Miner Code Node
Appendix D: Macro to Determine the Optimal Length of the Available Data History
D.1 Introduction
D.2 Example Call and Results
Preparation of the data
Example call
Results
D.3 Macro Parameters
Parameters for macro %TS_HISTORY_CHECK
Parameters for macro %TS_HISTORY_CHECK_ESM
D.4 Macro Code
Macro code for %TS_HISTORY_CHECK
Macro code for %TS_HISTORY_CHECK_ESM
Comments
Appendix E: A Short Overview of Data Structures and Analytic Data Preparation
E.2 Wording: Analysis Table and Analytic Data Mart
E.3 Normalization and De-normalization
General
Normalization
De-normalization
Example
E.4 Analysis Subjects
Definition
Representation in the data set
E.5 Multiple Observations
General
Examples
Repeated measurements over time
Multiple observations because of hierarchical relationships
E.6 One-Row-per-Subject Data Mart
E.7 The Multiple-Rows-per-Subject Data Mart
E.8 The Technical Point of View
Transposing
Aggregating
References
Index
Introduction
Rationale and Trigger to Write This Book
The first impulse
In November 2005, shortly before I finished the full draft of my first book, Data Preparation for Analytics Using SAS, I was asked whether I wanted to contribute content and knowledge on the topic of data quality for analytics. At that time it was too late to include data quality in my first book. It also would not have been advisable to do so, as this important topic would have gone beyond the scope of a book on data preparation.
When Data Preparation for Analytics Using SAS was published in late 2006, I had already begun thinking about a new book on the topic of data quality. However, it wasn't until 2008 that I started collecting ideas and opinions for the book you are reading now. After I received the green light from SAS Publishing, I started writing at the end of 2009.
Focus on analytics
My intention was not to write another book on data quality in general, but to write the first book that deals with data quality from the viewpoint of a statistician, data miner, engineer, operations researcher, or other analytically minded problem-solver.
Data quality is getting a lot of attention in the market. However, most of the initiatives, publications, and papers on data quality focus on classical data quality topics, such as the elimination of duplicates, the standardization of data, lists of values, value ranges, and plausibility checks. This is not to say that these topics are unimportant for analytics; on the contrary, they build the foundation of data for analysis. However, there are many aspects of data that are specific to analytics, and these aspects determine whether data are suitable for analysis or not.
For classical data quality, books, best practices, and knowledge material are widely available. For the implementation of data quality, SAS offers the DataFlux Data Management Platform, a market-leading solution for typical data quality problems, and many methods in the established SAS modules.
Symbiosis of analytic requirements and analytic capabilities
In many cases, analytics places higher requirements on data quality but also offers many more capabilities and options to measure and improve data quality, like the calculation of representative imputation values for missing values. Thus there is a symbiosis between the analytical requirements and the analytical capabilities in the data quality context.
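As a minimal sketch of such an analytic capability, the following example replaces missing values with the respective variable mean using the STDIZE procedure; the data set WORK.CUSTOMERS and the variables AGE and INCOME are hypothetical names chosen for illustration.

```sas
/* Sketch: mean-value imputation with PROC STDIZE.                   */
/* WORK.CUSTOMERS, AGE, and INCOME are hypothetical example names.   */
proc stdize data=work.customers
            out=work.customers_imputed
            reponly        /* replace missing values only            */
            method=mean;   /* impute with the variable mean          */
   var age income;
run;
```

The REPONLY option ensures that non-missing values remain unchanged; more representative, analytically derived imputation values are discussed in part II of this book.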
Analytics is also uniquely able to close the loop on data quality since it reveals anomalies in the data that other applications often miss. SAS is also perfectly suited to analyze and improve data quality.
• In part II, this book shows software capabilities that are important to measure and improve data quality and, thus, to close the loop in the data quality process.
• In part III, this book shows how SAS can be used as a simulation environment for evaluating the data quality status and the consequences of inferior data quality. This part also shows new and unique simulation results on the consequences of poor data quality.
FOR analytics and WITH analytics
This book deals with data quality topics that are relevant FOR analytics. Data quality is discussed in conjunction with the requirements that analytical methods place on the data. Analytics, however, does not only impose minimum data quality requirements; analytical methods are also used to improve data quality. This book illustrates the demands of analytical methods but in return also shows what can be done WITH analytics in the data quality area.
Data quality for non-analytics
Much literature on general data quality, including publications and discussions, exists from a broad perspective whose focus is not primarily on analytics. The chapters in this book, especially in the first part, include these typical data quality topics and methods insofar as they are important for data quality for analytics.
The idea of this book is not to start the data quality topic from scratch and to introduce every method that exists for the simple profiling of data, like validation lists and validation limits. These methods are introduced in the respective sections, but the focus of the book stays on analytic implications and capabilities.
Cody's Data Cleaning Techniques Using SAS, by Ron Cody [9], shows how to profile the quality of data in general. The book in your hands references some of these techniques; however, it does not repeat all the data quality basics.
Data and Measurement
The term data not only appears in the title of this book but is also used throughout the text to discuss features and characteristics of data and data quality.
Measurement is very close in meaning to the word data in this book and could be used as an alternative expression. Unlike data, measurement also implies a process or an activity and, thus, better illustrates the process around data.
• Some data are passively measured
like transaction data (calls to a hotline, sales in a retail shop) or web data (like social media sites). This compares to an observational study, where measurements are opportunistically collected.
• Other data are actively measured
like vital signs in a medical study or survey data. This is usually the case in a designed experiment, where measurements are prospectively collected.
In research analyses, the manufactured asset for the analysis is usually called measurement instead of data.
The topics, methods, and findings that are discussed in this book thus apply not only to those who receive their data from databases, data warehouses, or external data providers but also to those who perform measurements in experiments.
Things are being measured actively and passively, with many spillover benefits for uses not originally envisioned. Finally, the researcher or analyst has to decide whether the data are fit for the intended use.
Importance of Data Quality for Analytics
Consequences of bad data quality
Data quality for analytics is an important topic. Bad data quality, or the mere perception that data are of bad quality, causes the following:
• Increases project duration and effort
• Reduces the available project time for analysis and intelligence
• Damages trust in the results
• Slows down innovation and research
• Decreases customer satisfaction
• Leads to wrong, biased, outdated, or delayed decisions
• Costs money and time
• Demotivates the analyst, increasing the risk of losing skilled people to other projects
Frequently used expression
Data quality is a frequently used expression. As a 21 September 2011 Google search reveals, data quality ranks, with 10.8 million potential hits, behind terms like data management (30.9 million), data mining (28.1 million), and data warehouse (14.4 million). But it is still more prominent than terms like relational database (8.5 million), regression analysis (8.1 million), data integration (6.9 million), ETL or extraction, transformation, loading (6.2 million), time series analysis (3.6 million), cluster analysis (2.8 million), and predictive analytics (1.3 million).
The frequency of use of the term data quality reinforces the requirement for a clear definition of data quality for analytics. Chapter 2 of this book goes into more detail on this.
Trend in the market
Data quality is currently also an important trend in the market. In his 2010 SAS Global Forum paper, David Barkaway [5] shows the 2009 results of Forrester Research. To the question "Have you purchased any complementary data management solution through your ETL vendor?", 38 percent replied that they had bought data quality management software.
Figure I.1: Complementary data management solutions
image shown here
Source: Forrester survey, November 2009, Global ETL Online Survey, Trends in Enterprise and Adoption
The Layout of the Book
Data quality process steps
There are different ways to define a data quality process. Thus, different vendors of data quality software and different data quality methodologies present processes that differ to some extent.
The DataFlux Data Management Platform, for example, is built around five steps, which are grouped into three main buckets. The steps follow a logical flow that makes sense for the data quality and data cleaning process for which the tool is usually applied. These steps are shown in Figure I.2.
Figure I.2: Data quality process in the DataFlux Data Management Platform
image shown here
• Profiling
o Get a picture of the quality status of the data before beginning a project
o Discover and check relations in the data
• Quality
o Separate information into smaller units
o Standardize, correct, and normalize data
• Integration
o Discover related data
o Remove duplicates
• Enrichment
o Add data from other sources like address data, product data, or geocoding
• Monitoring
o Detect trends in data quality
o Track consequences of bad data quality
Main parts of this book
This book is divided into three main parts. The naming and ordering of these parts and their chapters follow a process as well, but they also reflect a segmentation of the book's content into well-defined parts and a readable sequence of topics and chapters.
The three parts of this book are:
• Data Quality Defined
• Data Quality—Profiling and Improvement
• Consequences of Poor Data Quality—Simulation Studies
These three main parts can be represented as a data quality process, as shown in Figure I.3 and described in the paragraphs that follow.
Figure I.3: Data quality process in this book
image shown here
The logical order here is to first define the requirements and criteria for data quality for analytics. The first part of the book is therefore the conceptual part and contains text, definitions, explanations, and examples. This part is called Data Quality Defined.
Based on these definitions, the second part of the book focuses on how the data quality status can be profiled and how a picture of the data with respect to the criteria that matter for advanced analytic methods can be achieved. The second part also shows ways that data quality can be improved with analytical methods. The name of this part is Data Quality—Profiling and Improvement.
As not all data quality problems can be corrected or solved (or the effort is not justifiable), the last part of the book deals with the consequences of poor data quality. Based on simulation studies, general answers are given about the usability of certain analytical methods and about the effect on model accuracy if data quality criteria are not fulfilled. This part is named Consequences of Poor Data Quality—Simulation Studies.
A cyclic approach
The process in this book thus follows a cyclic approach: after the definition of criteria, the data quality status is assessed and possibly corrected, and the consequences of the actual data quality status are analyzed. Based on the outcome, either the analysis is performed or measures are taken to fulfill the criteria or to relax them by reformulating the business questions.
Selection of data quality criteria
The selection of the set of data quality criteria for this book has been made based on the practical experience of the author. There is no single definition that can be considered the gold standard for all applications, and many definitions overlap considerably.
Gloskin [3], for example, defined the criteria Accuracy, Reliability, Timeliness, and Completeness. Orli [7] gives a longer list of criteria, which is Accuracy, Completeness, Consistency, Timeliness, Uniqueness, and Validity.
The data quality criteria that are defined in chapters 3–9 of this book are the following:
• Chapter 3, Data Availability, starts with the question of whether data are available at all.
• Chapter 4, Data Quantity, examines whether the amount of data is sufficient for the analysis.
• Chapter 5, Data Completeness, deals with the fact that available data fields may contain missing values.
• Chapter 6, Data Correctness, checks whether the available data are correct with respect to their definition.
• Chapter 7, Predictive Modeling, discusses the special requirements of predictive modeling methods.
• Chapter 8, Analytics for Data Quality, shows additional requirements arising from the interdependences of analytical methods and the data.
• Chapter 9, Process Considerations for Data Quality, finally shows the process aspect of data quality and also discusses aspects like data relevancy and possible alternatives.
These criteria are considered to be the most important ones in the context of this book and are shown in part I.
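For the completeness criterion, a first profiling step is often to count how many records contain at least one missing value (the %COUNT_MV macro in appendix A and section B.1 cover this in more detail). The following minimal DATA step is a sketch of the idea; the data set WORK.CUSTOMERS is a hypothetical example name.

```sas
/* Sketch: count the records in WORK.CUSTOMERS (a hypothetical       */
/* example data set) that have at least one missing value.           */
data _null_;
   set work.customers end=eof;
   /* CMISS counts missing values across numeric and character vars */
   if cmiss(of _all_) > 0 then n_mv_records + 1;  /* sum statement retains the counter */
   if eof then put 'Records with at least one missing value: ' n_mv_records;
run;
```

The count is written to the SAS log; in practice, such a figure would be stored in a profiling report or monitored over time.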
The Scope of This Book
Widespread expectations of this book
As already mentioned in a section above, data quality and data quality for analytics are very important topics that are discussed in many circumstances, projects, analysis domains, analytical disciplines, data warehouse communities, and across industries.
As a consequence, the expectations that people from these different areas have of the content of this book are very diverse. Depending on the way people perform their analyses and acquire, prepare, and use data, the expectations may vary. A book titled Data Quality for Analytics thus bears the risk of not meeting every reader's expectations.
Consider the following roles, which have different perspectives on data quality and data quality for analytics and will likely have different expectations:
• An analyst who builds analytical models for customer behavior analysis for a retail bank
• An IT person who is in charge of maintaining the data warehouse and de-duplicating customer records in both the operational and the decision support system
• A researcher who conducts and analyzes clinical trials
• A statistician who works for the statistical office and creates reports based on register data
This section attempts to set expectations correctly. Chapter 2 goes into more detail on the scope of data quality for analytics.
Data cleaning techniques in general
In the book Cody's Data Cleaning Techniques Using SAS, Ron Cody presents many methods for profiling the data quality status and correcting data quality errors. These methods include checking categorical variables, interval variables, and date values, as well as checking for missing values. Other topics include checking for duplicates and for n observations per subject and working with multiple files.
The intention of this book is not to compete with other books but to complement other titles by SAS Publishing by offering a different point of view. The data quality checks presented in Cody's Data Cleaning Techniques Using SAS form an important basis for data quality control and for the improvement of analysis data in general. The emphasis of this book goes beyond typical basic methods: it puts data quality into a more business-focused context and covers more closely the requirements of analytical methods.
The detailed presentation of typical methods to profile the data quality status is not a focus of this book. The methods shown in part II are more advanced and profile specific analytical data requirements.
The DataFlux Data Management Platform
SAS offers the DataFlux Data Management Platform for data quality management. This market-leading solution is well suited to profile data with respect to data quality requirements, to improve data quality by de-duplicating data and enriching data, and to monitor data quality over time.
The solution provides important features and strongly focuses on:
• profiling the distribution of variables of different types, matching predefined patterns, and presenting summary statistics on the data
• standardizing data (for example, address data, product codes, and product names) and the controlled de-duplication of data
These features are important in providing a quality data basis for analysis (see also David Barkaway [5]) and clearly focus on data quality. The aim of this book, however, is also to discuss data quality from a business- and analytical-methods-specific point of view, in terms of necessary data histories, historic snapshots of the data, and the reliability and relevancy of data.
Administrative records and data in statistical offices
In statistical analysis in statistical institutions and in the social sciences, the use of administrative data sources has become an important topic over the last several years. For some parts of the population, administrative data provide more information than any survey data. Consequently, the