You are on page 1of 10

Garbage in, garbage

out: The importance of


data quality
As business intelligence and data warehousing continue to mature and
with increasing attention being given to big data, I thought it important
to address an often overlooked or under-addressed issue; Data Quality.
Over the course of three blog posts in the coming weeks, Ill touch on
Data Quality, Master Data Management and Data Governance. What do
they mean and how are they useful to your organisation?
Lets start with Data Quality. Several years ago I was working with the
Vice-President of US Sales at a large multi-national organisation. We
designed a BI solution to provide the VP and all sales management
access to sales information relevant to their area of responsibility. A
data mart was created based on the needs, the customers database
team designed and built the ETL process to populate the data mart and
the BI solution was built on the mart.
The BI solution allowed everyone, including individual sales people, to
see their sales performance. The project started with standard sales
metrics such as actual vs. targets, year-over-year comparison, biggest
customers, most profitable customers and so forth and provided the
ability to slice, or view, the information in a number of different ways. As
the implementation was reviewed by sales reps and district managers, it
was praised for providing them valuable information.
When I met with the VP of US Sales (the project sponsor) to review the
solution all went well until we got to the biggest customer list. He
examined the list for a few moments and then asked Where is
Company ABC? (name changed to protect the innocent). He continued,
I know that if they arent the largest, they are nearly the largest and yet
they do not appear on the list. I had not heard this concern when

working with individual sales reps and district managers. What could be
wrong? I told him I would look into the issue and get back to him.
Upon examining the customer information contained within the data
mart, the following was an example of the type of records found for ABC
company.

When all the data from the various records were added together, it
showed that ABC company was, in fact, the largest customer. However,
due to data issues, that analysis could not be seen on any metric,
dashboard or report.
The BI solution worked properly but the underlying data prevented it
from delivering accurate information for some of the metrics. If the
customer learned of this oversight, it could lead to a very awkward
conversation between the vendor and the customer or worse result
in the customer changing vendors.
What kind of a problem was this and what are options for solving it?
The problem was related to Data Quality; the data was all there, but the
problem was that the data was inconsistent between regions, sales
people, and so forth. The data, being inconsistent, could not be added
together to get a true picture of customer activity. Further analysis
showed that the issue of inconsistent data was pervasive and, due to
that fact, it rendered the largest and most profitable customer metrics
meaningless.
Fixing the data in the mart would address the data quality issue for the
sales team. They discussed how, as the data travelled through the ETL
process, they could put transformation logic in place to fix the issue of
having multiple variations of a company name. However, what if the
company name was entered in a format that was outside the rules
implemented to catch and change names to ensure consistent naming?
They would constantly have to review the data for new inconsistencies
and revise the transformation logic.

What about other groups within the organisation? Marketing are they
sending the same customer multiple copies of promotions or new
product information? Accounting are they sending bills to the wrong
address? Customer Service are they unable to find the proper
customer information because they didnt use the correct company
name for the person who is calling? Fixing the data issue for the sales
group is good for the sales team but leaves the root problem untouched.
In this example, the data quality issue was generated from a single
source system. Imagine how data problems can escalate when you
have the same data in multiple systems. For example, what if the
company name in a CRM system differs from the name for the same
company in an accounting system. Add in a third, a fourth, or even more
additional applications with the same data, when there is a difference
in data between the systems, which is correct?
As you can see, data quality can be a significant problem for
companies. Data quality problems can be addressed in one of three
ways (at a high level) and each has times where it is a great choice and
there are also cost and time factors that must be evaluated in deciding
the best resolution for your organisation. The three high level methods
of addressing data quality issues are:

As an exception. For example, well fix it next week (or next


month, etc.) and then all is good. This is the best approach for
modifying data from a legacy system and bringing it into a warehouse.

As an after-the-fact solution. This includes addressing the data


issues via ETL when the data is moved into a data warehouse. This is
something I sometimes refer to as Report Data Quality Management as
it addresses the issue for any reports coming from the warehouse, but
not for anything done with data in the source system.

Addressing the problem at the root. In this instance, the source


systems are altered to ensure the data always enters the system in the
correct way, ensuring there is no inconsistency either within the
application or between different applications.
It is important to recognise that Data Quality is generally not a tool
problem, but rather it is usually a business process issue. Tools can be
used to help but addressing business process issues is often at the root
of the problem. There can be times where poor Data Quality is enabled

through an application but, most generally, it is a business process


issue.
But is inconsistent company naming the only type of Data Quality
problem? In a word, no: Not all quality problems are the same and not
all have the same impact on your business.

The Five Categories of Data Quality


Issues
In fact, there are five major categories of Data Quality issues. They are:
Data Completeness, Data Consistency, Data Acuracy, Internal Integrity
and External Integrity.

Data Completeness
Data completeness indicates that all requisite data items have been
populated. Not every data item needs to be populated in every instance.
For example, in a purchasing system, if a customer is paying by credit
card, the credit card number is required but if they are paying with cash,
that data item is not necessary. What we need to ensure is that all the
requisite data items are populated for each record.
Data Completeness has a second aspect as well: suppose that a
particular data item is deemed optional and so is entered into the
system sporatically. As an example, perhaps email was an optional field
for a customer but then there is a desire to do an ongoing email
marketing program and for that, they need to have customer email
addresses. Or perhaps there is a desire to analyse information in a new
way such as by gender, another data item that was not required and
therefore is sporadically populated.
This is a quality issue that cannot be found though a few simple rules,
as the necessity of a data item is sometimes not realised until after an
application has been in place for years; it is not until analysis of the data
is attempted that the lack of data in a certain area is discovered.
Additionally, even when the data item is determined to be necessary for

analysis, there may still be cases where null values are to be expected
as there are often exceptions to general rules.

Data Consistency
The problem described in the story above is a data consistency
problem. This problem is most frequently found in cases where an
application allows for free-form entry of text in regard to important
informational items or can also arise when merging data from disparate
systems (more on that in the External Integrity section). The absence of
data consistency makes accurate analysis difficult. While correcting a
data consistency issue can largely be resolved in an ETL process, it
means that the information in the data warehouse does not accurately
reflect the source data and so reports on the same data, one from the
source and one from the warehouse, may not match due to the
differences.

Data Accuracy
This can be a difficult item to validate and correct. What do you do if the
address provided by the customer is invalid or if the person entering the
address makes a mistake? Such information inaccuracies will exist in
the data and can be difficult to detect and correct. While data accuracy
items do not impact analysis as much as consistency or completeness,
they can still impact analysis. If an analysis is being done by city and
the address has the wrong city, the analysis will be incorrect. Depending
on the amount of data, the inaccuracy is often only found in a small
number of records and so the impact is minimised.

Internal Integrity
This problem refers to referential problems within the database; in other
words, how well does the data fit together? Perhaps a company is
deleted from the system but the contacts for that company are not
deleted; with the company gone, those contacts are now orphans as
they have no parent. Depending on the type of analysis being done, this
can lead to analysis problems. This type of issue is often caused by

either a problem within the source application or through meddling with


the data via SQL calls.

External Integrity
External Integrity is linked with data consistency. When identical data is
merged between systems, it is hoped that the data between the
systems will match. If merging data between sales and accounting, the
sales database may have customer ABC while the accounting
database may have customer ABC, Inc.. Maintaining external integrity
means ensuring data consistency across source applications.
As you can see, there are various ways in which data quality issues can
impact reporting and analysis. In the next post in this series, Ill discuss
how Master Data Management can be used to help maintain the quality
of organisational data.

QUESTIONS
1. Identify the name of the information system that being used in the
article.
2. Discuss what the system can do?
3. What could be the problem with regards data integrity that
mentioned in the article?
4. What do you understand with below categories of data quality
issues? (Explain in your own words)
a. Data Completeness
b. Data Consistency
c. Data Accuracy
d. Internal Integrity
e. External Integrity

ETL
BI
BI

ABC

ABC


ABC

BI

- -

ETL

-
-
-

CRM

- ETL

''

Acuracy

sporatically

ETL

'' SQL

'ABC'

ABC

1.
2.
3.
4.

You might also like