Data Quality for Analytics Using SAS

About this ebook

Analytics offers many capabilities and options to measure and improve data quality, and SAS is perfectly suited to these tasks. Gerhard Svolba's Data Quality for Analytics Using SAS focuses on selecting the right data sources and ensuring data quantity, relevancy, and completeness. The book is made up of three parts. The first part, which is conceptual, defines data quality and contains text, definitions, explanations, and examples. The second part shows how the data quality status can be profiled and the ways that data quality can be improved with analytical methods. The final part details the consequences of poor data quality for predictive modeling and time series forecasting.
Language: English
Publisher: SAS Institute
Release date: May 5, 2015
ISBN: 9781629598024
Author

Gerhard Svolba

Dr. Gerhard Svolba is a senior solutions architect and analytic expert at SAS Institute Inc. in Austria, where he specializes in analytics in different business and research domains. His project experience ranges from business and technical conceptual considerations to data preparation and analytic modeling across industries. He is the author of Data Preparation for Analytics Using SAS and teaches a SAS training course called "Building Analytic Data Marts."


    Book preview

    Data Quality for Analytics Using SAS - Gerhard Svolba

    The correct bibliographic citation for this manual is as follows: Svolba, Gerhard. 2012. Data Quality for Analytics Using SAS®. Cary, NC: SAS Institute Inc.

    Data Quality for Analytics Using SAS®

    Copyright © 2012, SAS Institute Inc., Cary, NC, USA

    ISBN 978-1-60764-620-4 (Hardcopy)

    ISBN 978-1-62959-802-4 (EPUB)

    ISBN 978-1-62959-803-1 (MOBI)

    ISBN 978-1-61290-227-2 (PDF)

    All rights reserved. Produced in the United States of America.

    For a hard-copy book: No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, or otherwise, without the prior written permission of the publisher, SAS Institute Inc.

    For a Web download or e-book: Your use of this publication shall be governed by the terms established by the vendor at the time you acquire this publication.

    The scanning, uploading, and distribution of this book via the Internet or any other means without the permission of the publisher is illegal and punishable by law. Please purchase only authorized electronic editions and do not participate in or encourage electronic piracy of copyrighted materials. Your support of others’ rights is appreciated.

    U.S. Government License Rights; Restricted Rights: The Software and its documentation is commercial computer software developed at private expense and is provided with RESTRICTED RIGHTS to the United States Government. Use, duplication or disclosure of the Software by the United States Government is subject to the license terms of this Agreement pursuant to, as applicable, FAR 12.212, DFAR 227.7202-1(a), DFAR 227.7202-3(a) and DFAR 227.7202-4 and, to the extent required under U.S. federal law, the minimum restricted rights as set out in FAR 52.227-19 (DEC 2007). If FAR 52.227-19 is applicable, this provision serves as notice under clause (c) thereof and no other notice is required to be affixed to the Software or documentation. The Government's rights in Software and documentation shall be only those set forth in this Agreement.

    SAS Institute Inc., SAS Campus Drive, Cary, North Carolina 27513-2414

    April 2012

    SAS® and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration.

    Other brand and product names are trademarks of their respective companies.

    For my three teenage sons and their permanent effort

    in letting me share the many wonderful moments of their life,

    and without whose help

    this book would have probably been completed a year earlier.

    You are the quality of my life.

    Acknowledgments

    Martina, for supporting me and even the crazy idea of writing this book during what is probably the busiest period of our family life.

    My parents, for providing me with so many possibilities.

    The following persons, who contributed to the book by spending time to discuss data quality topics. It is a pleasure to work and to discuss with you: Albert Tösch, Andreas Müllner, Bertram Wassermann, Bernadette Fabits, Christine Hallwirth, Claus Reisinger, Franz Helmreich, Franz König, Helmut Zehetmayr, Josef Pichler, Manuela Lenk, Mihai Paunescu, Matthias Svolba, Nicole Schwarz, Peter Bauer, Phil Hermes, Stefan Baumann, Thomas Schierer, and Walter Herrmann.

    The reviewers, who took time to review my manuscript and provided constructive feedback and suggestions. I highly appreciate your effort: Anne Milley, David Barkaway, Jim Seabolt, Mihai Paunescu, Mike Gilliland, Sascha Schubert, and Udo Sglavo.

    The nice and charming SAS Press team for their support throughout the whole process of the creation of this book: Julie Platt, Stacey Hamilton, Shelley Sessoms, Kathy Restivo, Shelly Goodin, Aimee Rodriguez, Mary Beth Steinbach, and Lucie Haskins.

    The management of SAS Austria for supporting the idea to write this book: Dietmar Kotras and Robert Stindl.

    August Ernest Müller, my great-grandfather, designed one of the very early construction plans for a helicopter in 1916 and was able to file a patent in the Austro-Hungarian monarchy. However, he found no sponsor to realize his project during World War I. I accidentally found his construction plans and documents at the time I started writing this book. His work impressed and motivated me a lot.

    Contents

    Introduction

    Part I Data Quality Defined

    Chapter 1 Introductory Case Studies

    1.1 Introduction

    1.2 Case Study 1: Performance of Race Boats in Sailing Regattas

    Overview

    Functional problem description

    Practical questions of interest

    Technical and data background

    Data quality considerations

    Case 1 summary

    1.3 Case Study 2: Data Management and Analysis in a Clinical Trial

    General

    Functional problem description

    Practical question of interest

    Technical and data background

    Data quality considerations

    Case 2 summary

    1.4 Case Study 3: Building a Data Mart for Demand Forecasting

    Overview

    Functional problem description

    Functional business questions

    Technical and data background

    Data quality considerations

    Case 3 summary

    1.5 Summary

    Data quality features

    Data availability

    Data completeness

    Inferring missing data from existing data

    Data correctness

    Data cleaning

    Data quantity

    Chapter 2 Definition and Scope of Data Quality for Analytics

    2.1 Introduction

    Different expectations

    Focus of this chapter

    Chapter 1 case studies

    2.2 Scoping the Topic Data Quality for Analytics

    General

    Differentiation of data objects

    Operational or analytical data quality

    General data warehouse or advanced analytical analyses

    Focus on analytics

    Data quality with analytics

    2.3 Ten Percent Missing Values in Date of Birth Variable: An Example

    General

    Operational system

    Systematic missing values

    Data warehousing

    Usability for analytics

    Conclusion

    2.4 Importance of Data Quality for Analytics

    2.5 Definition of Data Quality for Analytics

    General

    Definition

    2.6 Criteria for Good Data Quality: Examples

    General

    Data and measurement gathering

    Plausibility check: Relevancy

    Correctness

    Missing values

    Definitions and alignment

    Timeliness

    Adequacy for analytics

    Legal considerations

    2.7 Conclusion

    General

    Upcoming chapters

    Chapter 3 Data Availability

    3.1 Introduction

    3.2 General Considerations

    Reasons for availability

    Definition of data availability

    Availability and usability

    Effort to make data available

    Dependence on the operational process

    Availability and alignment in the time dimension

    3.3 Availability of Historic Data

    Categorization and examples of historic data

    The length of the history

    Customer event histories

    Operational systems and analytical systems

    3.4 Historic Snapshot of the Data

    More than historic data

    Confusion in definitions

    Example of a historic snapshot in predictive modeling

    Comparing models from different time periods

    Effort to retrieve historic snapshots

    Example of historic snapshots in time series forecasting

    3.5 Periodic Availability and Actuality

    Periodic availability

    Actuality

    3.6 Granularity of Data

    General

    Definition of requirements

    3.7 Format and Content of Variables

    General

    Main groups of variable formats for the analysis

    Considerations for the usability of data

    Typical data cleaning steps

    3.8 Available Data Format and Data Structure

    General

    Non-electronic format of data

    Levels of complexity for electronically available data

    Availability in a complex logical structure

    3.9 Available Data with Different Meanings

    Problem definition

    Example

    Consequences

    3.10 Conclusion

    Chapter 4 Data Quantity

    4.1 Introduction

    Quantity versus quality

    Overview

    4.2 Too Little or Too Much Data

    Having not enough data

    Having too much data

    Having too many observations

    Having too many variables

    4.3 Dimension of Analytical Data

    General

    Number of observations

    Number of events

    Distribution of categorical values (rare classes)

    The number of variables

    Length of the time history

    Level of detail in forecast hierarchies

    Panel data sets and repeated measurement data sets

    4.4 Sample Size Planning

    Application of sample size planning

    Sample size calculation for data mining?

    4.5 Effect of Missing Values on Data Quantity

    General

    Problem description

    Calculation

    Summary

    4.6 Conclusion

    Chapter 5 Data Completeness

    5.1 Introduction

    5.2 Difference between Availability and Completeness

    Availability

    Completeness

    Categories of missing data

    Effort to get complete data

    Incomplete data are not necessarily missing data

    Random or systematic missing values

    5.3 Random Missing Values

    Definition

    Handling

    Consequences

    Imputing random missing values

    5.4 Customer Age Is Systematically Missing for Long-Term Customers

    Problem definition

    Systematic missing values

    Example

    Consequences

    5.5 Completeness across Tables

    Problem description

    Completeness in parent-child relationships

    Completeness in time series data

    5.6 Duplicate Records: Overcompleteness of Data

    Definition of duplicate records

    Consequences

    Reasons for duplicate records

    Detecting and treating duplicates

    5.7 Conclusion

    Chapter 6 Data Correctness

    6.1 Introduction

    6.2 Correctness of Data Transfer and Retrieval

    General

    Data entry

    Data transfer

    Minimize the number of data transfer steps

    Comparing data between systems

    6.3 Plausibility Checks

    General

    Categorical values

    Interval values

    Business rules

    Plausibility checks in relationships between tables

    Process step where plausibility is checked

    6.4 Multivariate Plausibility Checks

    General

    Multivariate definition of outliers

    Outlier definition in the case of trends

    6.5 Systematic and Random Errors

    General

    Random errors

    Systematic errors

    6.6 Selecting the Same Value in Data Entry

    Problem definition

    Using a default value in data entry

    Example

    Consequences

    6.7 Psychological and Business Effects on Data Entry

    Problem definition

    Example

    Consequences

    6.8 Interviewer Effects

    Problem description

    Geographic or interviewer effect

    A similar example

    Consequences

    6.9 Domain-Specific and Time-Dependent Correctness of Data

    General

    Correctness from a domain-specific point of view

    Correctness from a time-dependent point of view

    6.10 Conclusion

    Chapter 7 Predictive Modeling

    7.1 Introduction

    A widely used method

    Data quality considerations

    7.2 Definition and Specifics of Predictive Models

    The process of predictive modeling

    7.3 Data Availability and Predictive Modeling

    General

    Historic snapshot of the data

    Illustrative example for need to separate data over time

    Multiple target windows

    Data availability over time

    7.4 Stable Data Definition for Future Periods

    General

    Requirements for regular scoring

    Categories and missing values

    Change in distributions

    Checking distribution changes

    Output data quality

    7.5 Effective Number of Observations in Predictive Modeling

    General

    The vast reduction of observations

    Pre-selected subsets of analysis subjects

    7.6 Conclusion

    Chapter 8 Analytics for Data Quality

    8.1 Introduction

    8.2 Correlation: Problem and Benefit

    Problem description

    Imputing missing values based on other variables

    Substituting the effect of unavailable or unusable variables

    Multicollinearity or the need for independent variables

    Sign inversion

    Derived variables from transactional data

    Derived variables for customer behavior

    8.3 Variability

    General

    Statistical variability and the significance of p-values

    Introducing variability

    Instability of the business background and definitions

    Undescribed variability

    8.4 Distribution and Sparseness

    General

    Missing values

    Distribution of interval variables

    Outliers

    Categorical variables and rare events

    Grouping sparse categories

    Sparse values in time series forecasting

    Clustering

    8.5 Level of Detail

    Detailed data and aggregated data

    Data structures for analytics

    Sampling

    8.6 Linking Databases

    Linking and combining data

    Multivariate plausibility checks

    Checking parent/child relationships

    Project time estimates

    Complex regulations

    8.7 Conclusion

    Chapter 9 Process Considerations for Data Quality

    9.1 Introduction

    9.2 Data Relevancy and the Picture of the Real World

    Technical data quality and business data quality

    Relevancy

    Intent of the data retrieval system

    Possible consequences

    Reformulation of the business questions

    Conversion of real-world facts into data

    Conclusion

    9.3 Consequences of Poor Data Quality

    Introduction

    Analysis projects are not started

    Analysis results are not trusted

    Analysis projects take longer than expected

    Wrong decisions can be made

    Loss of company or brand image

    Regulatory fines or imprisonment

    The desired results are not obtained

    No statistical significance is reached

    Conclusion

    Different consequences for reporting and analytics

    Required degree of accuracy

    9.4 Data Quality Responsibilities

    General

    Responsible departments

    Data quality responsibilities separated from business projects

    Process features that trigger good data quality

    9.5 Data Quality as an Ongoing Process

    General

    Maintaining the status

    Short-term fixing or long-term improvement

    9.6 Data Quality Monitoring

    General

    Example KPIs for data quality monitoring

    Dimensions for analyzing the data quality

    Analysis over time

    Outlook

    9.7 Conclusion

    Part II Data Quality—Profiling and Improvement

    Chapter 10 Profiling and Imputation of Missing Values

    10.1 Introduction

    More than simple missing value reporting

    Profiling missing values

    Imputing missing values

    SAS programs

    10.2 Simple Profiling of Missing Values

    General

    Counting missing values with PROC MEANS

    Using a macro for general profiling of missing values

    10.3 Profiling the Structure of Missing Values

    Introduction

    The %MV_PROFILE_CHAIN macro

    Data to illustrate the %MV_PROFILE_CHAIN macro

    Simple reports based on the missing value profile chain

    Advanced reports based on the missing value profile chain

    Usage examples

    Usage information for the %MV_PROFILING macro

    10.4 Univariate Imputation of Missing Values with PROC STANDARD

    General

    Replacement values for the entire table

    Replacement values for subgroups

    10.5 Replacing Missing Values with the Impute Node in SAS Enterprise Miner

    Overview

    Available methods for interval variables

    Available methods for categorical variables

    References

    10.6 Performing Multiple Imputation with PROC MI

    Single versus multiple imputation

    Example data for multiple imputation and analysis

    Performing multiple imputation with PROC MI

    Analyze data with PROC LOGISTIC

    Combine results with PROC MIANALYZE

    10.7 Conclusion

    The SAS offering

    Business and domain expertise

    Chapter 11 Profiling and Replacement of Missing Data in a Time Series

    11.1 Introduction

    General

    SAS programs

    11.2 Profiling the Structure of Missing Values for Time Series Data

    Missing values in a time series

    Example data

    The TS_PROFILE_CHAIN

    Reports for profiling a time series

    Macro %PROFILE_TS_MV

    11.3 Checking and Assuring the Contiguity of Time Series Data

    Difference between transactional data and time series data

    Example of a non-contiguous time series

    Checking and assuring contiguity

    PROC TIMESERIES

    Macro implementation and usage

    11.4 Replacing Missing Values in Time Series Data with PROC TIMESERIES

    Overview

    Functionality of PROC TIMESERIES

    Examples

    Changing zero values to missing values

    11.5 Interpolating Missing Values in Time Series Data with PROC EXPAND

    Introduction

    Statistical methods in PROC EXPAND

    Using PROC EXPAND

    11.6 Conclusion

    General

    Available methods

    Business knowledge

    Chapter 12 Data Quality Control across Related Tables

    12.1 Introduction

    General

    Relational model

    Overview

    12.2 Completeness and Plausibility Checks

    General

    Completeness check of records

    Plausibility check of records

    12.3 Implementation in SAS

    Merging tables

    Other methods in SAS

    12.4 Using a SAS Hash for Data Quality Control

    Example data

    Completeness control in parent-child relationships

    Plausibility checks in parent-child relationships

    12.5 Conclusion

    Chapter 13 Data Quality with Analytics

    13.1 Introduction

    13.2 Benefit of Analytics in General

    Outlier detection

    Missing value imputation

    Data standardization and de-duplication

    Handling data quantity

    Analytic transformation of input variables

    Variable selection

    Assessment of model quality and what-if analyses

    13.3 Classical Outlier Detection

    Ways to define validation limits

    Purpose of outlier detection

    Statistical methods

    Implementation

    Outlier detection with analytic methods

    13.4 Outlier Detection with Predictive Modeling

    General

    Methods in SAS

    Example of clinical trial data

    Extension of this method

    13.5 Outlier Detection in Time Series Analysis

    General

    Time series models

    Outlier detection with ARIMA(X) models

    Decomposition and smoothing of time series

    13.6 Outlier Detection with Cluster Analysis

    General

    Conclusion

    13.7 Recognition of Duplicates

    General

    Contribution of analytics

    13.8 Other Examples of Data Profiling

    General

    Benford’s law for checking data

    13.9 Conclusion

    Chapter 14 Data Quality Profiling and Improvement with SAS Analytic Tools

    14.1 Introduction

    14.2 SAS Enterprise Miner

    Short description of SAS Enterprise Miner

    Data quality correction

    Assessing the importance of variables

    Gaining insight into data relationships

    Modeling features for data quality

    Quick assessment of model quality and what-if analyses

    Handling small data quantities

    Text mining

    Features for modeling and scoring in SAS Enterprise Miner

    14.3 SAS Model Manager

    14.4 SAS/STAT Software

    14.5 SAS Forecast Server and SAS Forecast Studio

    Short description of SAS Forecast Server

    Data preprocessing

    Outlier detection

    Model output data quality

    Data quantity

    14.6 SAS/ETS Software

    14.7 Base SAS

    14.8 JMP

    General

    Detecting complex relationships and data quality problems with JMP

    Missing data pattern

    Sample size and power calculation

    14.9 DataFlux Data Management Platform

    Short description of DataFlux Data Management Platform

    Data profiling

    Data standardization and record matching

    Defining a data quality process flow

    14.10 Conclusion

    Part III Consequences of Poor Data Quality—Simulation Studies

    Chapter 15 Introduction to Simulation Studies

    15.1 Rationale for Simulation Studies for Data Quality

    Closing the loop

    Investment to improve data quality

    Further rationale for the simulation studies

    15.2 Results Based on Simulation Studies

    Analytical domains in the focus of the simulation studies

    Data quality criteria that are simulated

    15.3 Interpretability and Generalizability

    Simulation studies versus hard fact formulas

    Illustrate the potential effect

    15.4 Random Numbers: A Core Ingredient for Simulation Studies

    The simulation environment

    Random number generators

    Creation of random numbers in SAS

    Random numbers with changing start values

    Code example

    15.5 Downloads

    Chapter 16 Simulating the Consequences of Poor Data Quality for Predictive Modeling

    16.1 Introduction

    Importance of predictive modeling

    Scope and generalizability of simulations for predictive modeling

    Overview of the functional questions of the simulations

    16.2 Base for the Business Case Calculation

    Introduction

    The reference company Quality DataCom

    16.3 Definition of the Reference Models for the Simulations

    Available data

    Data preparation

    Building the reference model

    Process of building the reference model

    Optimistic bias in the models in the simulation scenarios

    Detailed definition of the data and the reference model results

    16.4 Description of the Simulation Environment

    General

    Input data source node CHURN

    START GROUPS and END GROUPS node

    DATA PARTITION node

    INSERT MISSING VALUES node

    IMPUTE MISSING VALUES node

    REGRESSION node

    MODEL COMPARISON node

    STORE ASSESSM. STATISTICS node

    16.5 Details of the Simulation Procedure

    Validation method

    Process of building the scenario models

    Validation statistic

    Data quality treatment in training data and scoring data

    Box-and-whisker plots

    16.6 Downloads

    16.7 Conclusion

    Chapter 17 Influence of Data Quality and Data Availability on Model Quality in Predictive Modeling

    17.1 Introduction

    General

    Data quantity

    Data availability

    17.2 Influence of the Number of Observations

    Detailed functional question

    Data preparation

    Simulation settings

    Results

    Business case

    17.3 Influence of the Number of Events

    Detailed functional question

    Data preparation

    Simulation settings

    Results

    Business case

    17.4 Comparison of the Reduction of Events and the Reduction of Observations

    17.5 Effect of the Availability of Variables

    Alternate predictors

    Availability scenarios

    Results

    Interpretation

    Business case

    17.6 Conclusion

    Chapter 18 Influence of Data Completeness on Model Quality in Predictive Modeling

    18.1 Introduction

    General

    Random and systematic missing values

    Missing values in the scoring data partition

    18.2 Simulation Methodology and Data Preparation

    Inserting random missing values

    Inserting systematic missing values

    Replacing missing values

    Process flow

    Simulation scenarios

    18.3 Results for Random Missing Values

    Random missing values only in the training data

    Random missing values in the training and scoring data

    Business case

    18.4 Results for Systematic Missing Values

    Systematic missing values only in the training data

    Systematic missing values in the training and scoring data

    18.5 Comparison of Results between Different Scenarios

    Introduction

    Graphical comparison

    Differentiating between types of missing values

    Multivariate quantification

    18.6 Conclusion

    Chapter 19 Influence of Data Correctness on Model Quality in Predictive Modeling

    19.1 Introduction

    General

    Non-visible data quality problem

    Random and systematic bias

    Biased values in the scoring data partition

    19.2 Simulation Methodology and Data Preparation

    Standardization of numeric values

    Inserting random biases in the input variables

    Inserting systematic biases in the input variables

    Inserting a random bias in the target variable

    Inserting a systematic bias in the target variable

    Simulation scenarios

    19.3 Results for Random and Systematic Bias in the Input Variables

    Scenario settings

    Bias in the input variables in the training data only

    Bias in the input variables in the training and scoring data

    Comparison of results

    19.4 Results for Random and Systematic Bias in the Target Variables

    General

    Examples of biased target variables

    Detecting biased target variables

    Results for randomly biased target variables

    Results for systematically biased target variables

    19.5 Conclusion

    General

    Treatment of biased or incorrect data

    19.6 General Conclusion of the Simulations for Predictive Modeling

    Increasing the number of events and non-events matters

    Age variable is important and there are compensation effects between the variables

    It makes a difference whether data disturbances occur in the training data only or in both the training and scoring data

    Random disturbances affect model quality much less than systematic disturbances

    Chapter 20 Simulating the Consequences of Poor Data Quality in Time Series Forecasting

    20.1 Introduction

    General

    Purpose and application of time series forecasting

    Methods to forecast time series

    Scope and generalizability of simulations for time series forecasting

    20.2 Overview of the Functional Questions of the Simulations

    20.3 Base for the Business Case Calculation

    Introduction

    The reference company Quality DataCom

    20.4 Simulation Environment

    General

    Available data for the simulation environment

    Time series methods

    20.5 Simulation Procedure

    Basic simulation procedure

    Insertion of disturbances for the data in the simulation procedure

    Loop over TIME HISTORIES

    Loop over shifts

    Qualification of time series for the simulations

    Assessment of forecast accuracy

    20.6 Downloads

    20.7 Conclusion

    Chapter 21 Consequences of Data Quantity and Data Completeness in Time Series Forecasting

    21.1 Introduction

    21.2 Effect of the Length of the Available Time History

    General

    Simulation procedure

    Graph results

    Interpretation

    Results in numbers

    Business case calculation

    21.3 Optimal Length of the Available Time History

    General

    Results

    Interpretation

    21.4 Conclusion

    General

    Data relevancy

    Self-assessment of time series data

    Chapter 22 Consequences of Random Disturbances in Time Series Data

    22.1 Introduction

    General

    Simulation procedure

    Types of random disturbances

    22.2 Consequences of Random Missing Values

    General

    Insertion and replacement of missing values

    Results for missing value imputation with PROC EXPAND

    Results for missing value imputation with PROC TIMESERIES

    22.3 Consequences of Random Zero Values

    General

    Results

    22.4 Consequences of Random Biases

    General

    Standard deviation as basis for random bias

    Code to insert a random bias

    Results

    22.5 Conclusion

    Chapter 23 Consequences of Systematic Disturbances in Time Series Data

    23.1 Introduction

    General

    Simulation procedure

    Systematically selecting observations from the time series

    Types of systematic disturbances

    23.2 Coding Systematic Disturbances in Time Series Data

    Precalculation of values

    Systematic selection based on the decile group

    Systematic selection based on the calendar month

    23.3 Results for the Effect of Systematic Disturbances

    General

    Systematic disturbances inserted for the top 10% of time series values

    Systematic disturbances inserted for three consecutive calendar months

    23.4 Interpretation

    23.5 General Conclusions of the Simulations for Time Series Forecasting Shown in Chapters 21–23

    Increasing length of data history decreases forecast error

    The marginal effect of additional forecast months decreases

    For many time series, a short time history causes better forecast accuracy

    Long time histories can solve data quality problems to some extent

    Appendix A: Macro Code

    A.1 Introduction

    A.2 Code for Macro %COUNT_MV

    General

    Macro code

    A.3 Code for Macro %MV_PROFILING

    General

    Parameters

    Macro code

    Comments

    A.4 Code for Macro %PROFILE_TS_MV

    General

    Parameters

    Macro code

    A.5 Code for Macro %CHECK_TIMEID

    General

    Parameters

    Macro code

    Comments

    Appendix B: General SAS Content and Programs

    B.1 Calculating the Number of Records with at Least One Missing Value

    B.2 The SAS Power and Sample Size Application

    Appendix C: Using SAS Enterprise Miner for Simulation Studies

    C.1 Introduction

    C.2 Preparation of SAS Enterprise Miner for SEED=0 Random Numbers

    General

    Changing the settings

    C.3 Simulation Environment

    General

    Features

    Deriving the parameter setting from the node name

    Programming details

    C.4 Discussion of the Suitability of SAS Enterprise Miner for a Simulation Environment

    General

    Advantages

    C.5 Selected Macros and Macro Variables Available in a SAS Enterprise Miner Code Node

    Appendix D: Macro to Determine the Optimal Length of the Available Data History

    D.1 Introduction

    D.2 Example Call and Results

    Preparation of the data

    Example call

    Results

    D.3 Macro Parameters

    Parameters for macro %TS_HISTORY_CHECK

    Parameters for macro %TS_HISTORY_CHECK_ESM

    D.4 Macro Code

    Macro code for %TS_HISTORY_CHECK

    Macro code for %TS_HISTORY_CHECK_ESM

    Comments

    Appendix E: A Short Overview on Data Structures and Analytic Data Preparation

    E.2 Wording: Analysis Table and Analytic Data Mart

    E.3 Normalization and De-normalization

    General

    Normalization

    De-normalization

    Example

    E.4 Analysis Subjects

    Definition

    Representation in the data set

    E.5 Multiple Observations

    General

    Examples

    Repeated measurements over time

    Multiple observations because of hierarchical relationships

    E.6 One-Row-per-Subject Data Mart

    E.7 The Multiple-Rows-per-Subject Data Mart

    E.8 The Technical Point of View

    Transposing

    Aggregating

    References

    Index

    Introduction

    Rationale and Trigger to Write This Book

    The first impulse

    In November 2005, shortly before I finished the full draft of my first book, Data Preparation for Analytics Using SAS, I was asked whether I wanted to contribute content and knowledge to the topic of data quality for analytics. At that time it was too late to include data quality in my first book. It also would not have been advisable to do so, as this important topic would have gone beyond the scope of a book on data preparation.

    When Data Preparation for Analytics Using SAS was published in late 2006, I had already begun thinking about a new book on the topic of data quality. However, it wasn’t until 2008 that I started collecting ideas and opinions for the book you are reading now. After I received the green light from SAS Publishing, I started work at the end of 2009.

    Focus on analytics

    My intention was not to write another book on data quality in general, but to write the first book that deals with data quality from the viewpoint of a statistician, data miner, engineer, operations researcher, or other analytically minded problem-solver.

    Data quality is getting a lot of attention in the market. However, most of the initiatives, publications, and papers on data quality focus on classical data quality topics, such as the elimination of duplicates, standardization of data, lists of values, value ranges, and plausibility checks. This is not to say that these topics are unimportant for analytics; on the contrary, they build the foundation of data for analysis. However, there are many aspects of data that are specific to analytics, and these aspects determine whether data are suitable for analysis or not.
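    To make the discussion concrete, the following is a minimal sketch of one such classical check, the elimination of duplicates, in Base SAS. The data set WORK.CUSTOMERS and the key variable CUSTOMER_ID are hypothetical names used only for illustration.

        /* Sketch of a classical data quality step: de-duplication.      */
        /* WORK.CUSTOMERS and CUSTOMER_ID are hypothetical placeholders. */
        proc sort data=work.customers
                  out=work.customers_dedup   /* one record per key            */
                  dupout=work.duplicates     /* removed duplicates for review */
                  nodupkey;
           by customer_id;
        run;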

    For classical data quality, books, best practices, and knowledge material are widely available. For the implementation of data quality, SAS offers the DataFlux Data Management Platform, a market-leading solution for typical data quality problems, and many methods in the established SAS modules.

    Symbiosis of analytic requirements and analytic capabilities

    In many cases, analytics places higher requirements on data quality, but it also offers many more capabilities and options to measure and improve data quality, such as the calculation of representative imputation values for missing values. Thus, there is a symbiosis between the analytical requirements and the analytical capabilities in the data quality context.
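    As a simple illustration of such an analytical capability, the sketch below replaces missing values with the mean of the non-missing values using PROC STANDARD, a method covered in more detail in Section 10.4. The data set and variable names are hypothetical.

        /* Sketch: univariate imputation of missing values.              */
        /* With REPLACE (and no MEAN= option), PROC STANDARD substitutes */
        /* the variable mean for each missing value.                     */
        proc standard data=work.customers
                      out=work.customers_imputed
                      replace;
           var age income;
        run;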

    Analytics is also uniquely able to close the loop on data quality because it reveals anomalies in the data that other applications often miss. SAS is perfectly suited to analyzing and improving data quality.

    • In part II, this book shows software capabilities that are important to measure and improve data quality and, thus, to close the loop in the data quality process. It thereby shows how analytics can improve data quality.

    • In part III, this book shows how SAS can be used as a simulation environment to evaluate the data quality status and the consequences of inferior data quality. This part also shows new and unique simulation results on the consequences of poor data quality.

    FOR analytics and WITH analytics

    The book deals with data quality topics that are relevant FOR analytics. Data quality is discussed in conjunction with the requirements that analytical methods place on the data. Analytics, however, does not only impose minimum data quality requirements; analytical methods are also used to improve data quality. This book illustrates the demands of analytical methods but in return also shows what can be done WITH analytics in the data quality area.

    Data quality for non-analytics

    A large body of literature, publications, and discussions on general data quality exists from a broad perspective where the focus is not primarily on analytics. The chapters in this book, especially in the first part, include these typical data quality topics and methods as long as they are important for data quality for analytics.

    The idea of this book is not to start from scratch with the data quality topic and to introduce all methods that exist for the simple profiling of data, like using validation lists and validation limits. These methods are introduced in the respective sections, but the focus of the book stays on analytic implications and capabilities.
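    For orientation, the following is a minimal sketch of such simple profiling in Base SAS: counting missing values for interval variables (compare Section 10.2) and frequency counts that reveal values outside a validation list. All data set and variable names are hypothetical.

        /* Sketch: simple profiling of interval variables, including */
        /* the number of missing values per variable.                */
        proc means data=work.customers n nmiss min max mean;
           var age income balance;
        run;

        /* Frequency counts expose invalid codes in categorical variables. */
        proc freq data=work.customers;
           tables gender region / missing;
        run;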

    Cody’s Data Cleaning Techniques Using SAS, by Ron Cody [9], shows how to profile the quality of data in general. The book in your hands references some of these techniques in certain chapters; however, it does not repeat all the data quality basics.

    Data and Measurement

    The term data not only appears in the title of this book but is also used throughout the text to discuss features and characteristics of data and data quality.

    Measurement is very close in meaning to the word data in this book and could possibly be used as an alternative expression. Unlike data, however, measurement also implies a process or an activity and, thus, better illustrates the process around data.

    • Some data are passively measured, like transaction data (calls to a hotline, sales in a retail shop) or web data (like social media sites). This is comparable to an observational study, where measurements are opportunistically collected.

    • Other data are actively measured, like vital signs in a medical study or survey data. This is usually the case in a designed experiment, where measurements are prospectively collected.

    In research analyses, the material gathered for the analysis is usually called measurements instead of data. The topics, methods, and findings that are discussed in this book thus apply not only to those who receive their data from databases, data warehouses, or externally acquired sources but also to those who perform measurements in experiments.

    Things are being measured actively and passively, with many spillover benefits for uses not originally envisioned. Finally, the researcher or analyst has to decide whether the data are fit for the intended use.

    Importance of Data Quality for Analytics

    Consequences of bad data quality

    Data quality for analytics is an important topic. Bad data quality, or even the mere perception that the data are of bad quality, causes the following:

    • Increases project duration and efforts

    • Reduces the available project time for analysis and intelligence

    • Damages trust in the results

    • Slows down innovation and research

    • Decreases customer satisfaction

    • Leads to wrong, biased, outdated, or delayed decisions

    • Costs money and time

    • Demotivates the analyst, increasing the risk of losing skilled people to other projects

    Frequently used expression

    Data quality is a frequently used expression. As a Google search on 21 September 2011 reveals, data quality yields 10.8 million potential hits, ranking behind terms like data management (30.9 million), data mining (28.1 million), and data warehouse (14.4 million). But it is still more prominent than terms like relational database (8.5 million), regression analysis (8.1 million), data integration (6.9 million), ETL or extraction transformation loading (6.2 million), time series analysis (3.6 million), cluster analysis (2.8 million), and predictive analytics (1.3 million).

    The frequency of use of the term data quality reinforces the requirement for a clear definition of data quality for analytics. Chapter 2 of this book goes into more detail on this.

    Trend in the market

    Data quality is currently also an important trend in the market. David Barkaway [5] shows in his 2010 SAS Global Forum paper the 2009 results of Forrester Research. To the question "Have you purchased any complementary data management solution through your ETL vendor?", 38 percent replied that they had bought data quality management software.

    Figure I.1: Complementary data management solutions


    Source: Forrester survey November 2009, Global ETL Online Survey, Trends in Enterprise and Adoption

    The Layout of the Book

    Data quality process steps

    There are different ways a data quality process can be defined. Thus, different vendors of data quality software and different data quality methodologies present processes that differ to some extent.

    The DataFlux Data Management Platform, for example, is built around five steps, which are grouped into three main buckets. The steps follow a logical flow that makes sense for the data quality and data cleaning process for which the tool is usually applied. These steps are shown in Figure I.2.

    Figure I.2: Data quality process in the DataFlux Data Management Platform


    • Profiling

    o Get a picture of the quality status of the data before beginning a project

    o Discover and check relations in the data

    • Quality

    o Separate information into smaller units

    o Standardize, correct, and normalize data

    • Integration

    o Discover related data

    o Remove duplicates

    • Enrichment

    o Add data from other sources like address data, product data, or geocoding

    • Monitoring

    o Detect trends in data quality

    o Track consequences of bad data quality

    Main parts of this book

    This book is divided into three main parts. The naming and ordering of these three parts and their respective chapters follow a process as well, but they also reflect a segmentation of the content into well-defined parts and a readable sequence of topics and chapters.

    The three parts of this book are:

    • Data Quality Defined

    • Data Quality—Profiling and Improvement

    • Consequences of Poor Data Quality—Simulation Studies

    These three main parts can be represented as a data quality process, as shown in Figure I.3 and described in the paragraphs that follow.

    Figure I.3: Data quality process in this book


    The logical order here is to first define the requirements and criteria for data quality for analytics. The first part of the book is therefore the conceptual part and contains text, definitions, explanations, and examples. This part is called Data Quality Defined.

    Based on these definitions, the second part of the book focuses on how the data quality status can be profiled and how a picture of the criteria that are important for advanced analytic methods can be obtained. The second part also shows ways that data quality can be improved with analytical methods. The name of this part is Data Quality—Profiling and Improvement.

    As not all data quality problems can be corrected or solved (or the effort is not justifiable), the last part of the book deals with the consequences of poor data quality. Based on simulation studies, general answers are given about the usability of certain analytical methods and the effect on the accuracy of models when data quality criteria are not fulfilled. The last part is named Consequences of Poor Data Quality—Simulation Studies.

    A cyclic approach

    The process in this book thus follows a cyclic approach: after the definition of criteria, the data quality status is assessed and possibly corrected, and the consequences of the actual data quality status are analyzed. Based on the outcome, either the analysis is performed or measures are taken to fulfill the criteria or to relax them by reformulating the business questions.

    Selection of data quality criteria

    The set of data quality criteria for this book was selected based on the practical experience of the author. There is no single definition that can be considered the gold standard for all applications, and many definitions overlap considerably.

    Gloskin [3], for example, defined the criteria Accuracy, Reliability, Timeliness, and Completeness. Orli [7] gives a longer list of criteria, which is Accuracy, Completeness, Consistency, Timeliness, Uniqueness, and Validity.

    The data quality criteria that are defined in this book in chapters 3–9 are the following.

    Chapter 3, Data Availability, starts with the question as to whether data are available in general.

    Chapter 4, Data Quantity, examines whether the amount of data is sufficient for the analysis.

    Chapter 5, Data Completeness, deals with the fact that available data fields may contain missing values.

    Chapter 6, Data Correctness, checks whether the available data are correct with respect to their definition.

    Chapter 7, Predictive Modeling, discusses special requirements of predictive modeling methods.

    Chapter 8, Analytics for Data Quality, shows additional requirements arising from the interdependence of analytical methods and the data.

    Chapter 9, Process Considerations for Data Quality, finally shows the process aspect of data quality and also discusses aspects like data relevancy and possible alternatives.

    These criteria are considered to be the most important ones in the context of this book and are shown in part I.

    The Scope of This Book

    Widespread expectations of this book

    As already mentioned above, data quality and data quality for analytics are very important topics that are discussed in many circumstances, projects, analysis domains, analytical disciplines, and data warehouse communities, and across industries.

    As a consequence, the expectations for the content of this book among people from these different areas are very diverse. Depending on the way people perform their analyses and acquire, prepare, and use data, the expectations may vary. A book titled Data Quality for Analytics thus bears the risk of not meeting everyone's expectations.

    Consider the following roles, which have different perspectives on data quality and data quality for analytics and will likely have different expectations:

    • An analyst who builds analytical models for customer behavior analysis for a retail bank

    • An IT person who is in charge of maintaining the data warehouse and de-duplicating customer records in both the operational and the decision support system

    • A researcher who conducts and analyzes clinical trials

    • A statistician who works for the statistical office and creates reports based on register data

    This section attempts to correctly set expectations. Chapter 2 also goes into more details on the scope of data quality for analytics.

    Data cleaning techniques in general

    In the book Cody’s Data Cleaning Techniques Using SAS, Ron Cody shows a range of methods for profiling the data quality status and correcting data quality errors. These methods include checking categorical variables, interval variables, and date values, as well as checking for missing values. Other topics include checking for duplicates and n observations per subject and working with multiple files.
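    For illustration, the following is a minimal sketch in the spirit of such basic checks: a DATA step that lists records violating simple plausibility rules. The data set WORK.PATIENTS, the variables, and the limits are hypothetical assumptions, not examples taken from Cody's book.

        /* Sketch: basic plausibility checks on categorical and  */
        /* interval variables; violating records are written to  */
        /* a separate data set for review.                       */
        data work.rule_violations;
           set work.patients;
           length problem $40;
           if gender not in ('M', 'F') then do;
              problem = 'Invalid gender code';
              output;
           end;
           if not missing(age) and (age < 0 or age > 120) then do;
              problem = 'Age outside plausible range';
              output;
           end;
        run;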

    The intention of this book is not to compete with other books but to complement other titles by SAS Publishing by offering a different point of view. The data quality checks presented in Cody’s Data Cleaning Techniques Using SAS form an important basis for data quality control and the improvement of analysis data in general. The emphasis of this book goes beyond typical basic methods: it puts data quality into a more business-focused context and covers more closely the requirements of analytical methods.

    The detailed presentation of typical methods to profile the data quality status is not a focus of this book. The methods shown in part II are more advanced to profile specific analytical data requirements.

    The DataFlux Data Management Platform

    SAS offers the DataFlux Data Management Platform for data quality management. This market-leading solution is well suited to profile data with respect to data quality requirements, to improve data quality by de-duplicating data and enriching data, and to monitor data quality over time.

    The solution provides important features and strongly focuses on:

    • profiling the distribution of variables of different types, matching predefined patterns, and presenting summary statistics on the data.

    • methods to standardize data (for example, address data, product codes, and product names) and the controlled de-duplication of data.

    These features are important in providing a quality data basis for analysis (see also David Barkaway [5]) and clearly focus on data quality. The aim of this book, however, is also to discuss data quality from a business-specific and analytical-methods-specific point of view, in terms of necessary data histories, historic snapshots of the data, and the reliability and relevancy of data.

    Administrative records and data in statistical offices

    In statistical analysis in statistical institutions and in the social sciences, the use of administrative data sources has become an important topic over the last several years. For some parts of the population, administrative data provide more information than any survey data. Consequently, the
