
Chapter 5: Data Warehouse Design Methodology

The Data Warehouse's Infrastructure


The data warehouse

gathers some raw inputs (data from transaction processing systems and other places),
operates on those inputs (to cleanse, integrate, and store the data), and then
distributes the results to users.

Therefore, a warehouse must:

Extract data from a variety of sources. Most warehouses gather data from a
number of source systems.
Integrate data into a common repository. Once the data is extracted from its
source, it can't just be thrown into database tables. To be useful, the data has to be
cleansed, and the relationships between data elements must be validated and
enforced.
Put data into a format that users can use. The data warehouse must deliver its
product in a standard, user-friendly format.
Provide users with query tools to access the warehouse. To support query needs,
the information utility must supply tools that allow users to plug into the
warehouse.

Warehouse Design Architecture


A reporting and data warehousing system is built around

Four data stores and


Three data flows

Data Stores
i. The source systems: the transaction processing systems that will provide
data to the warehouse.
ii. The warehouse, or integration layer.
iii. The data mart, or high performance query structure (HPQS).
iv. The data on the report or analysis in the end user's hand.

Data Flows
i. From the warehouse's sources into the integration layer, where the data is
cleansed and integrated.
ii. From the integration layer to the HPQS.
iii. From the HPQS to the end user via a reporting application.

Data Store 1: The Source Systems


Source systems are systems, such as accounting and distribution, that will provide data
to your warehouse. They may be enterprise resource planning (ERP) packages such as
SAP, PeopleSoft, or Oracle Applications. They may be homegrown applications such as
an OASIS system. They may even be sources outside of your organization, for example,
data you purchased from an outside vendor.
Each of these systems has a database of information that end users need access to.
Frequently, they need access to data that has been integrated from many of these systems.
Data Flow 1: From the Data Sources to the Integration Layer
Mechanisms for getting data out of its sources, called data extraction, must be
developed.
A wide variety of data storage technologies are used by both source and transaction
processing systems. The data might be stored in flat files, Oracle, DB2, Informix, IMS, or
any number of other formats.
Regardless of the storage technology, you must get the data out. Thus programmers must
be able to write code in whatever language the source database understands.
The following things must be considered when extracting from source systems.
Is This Extract Supporting the Initial Load of the Warehouse or a Periodic Refresh Load?
You frequently must build not one, but two architectures to load your warehouse.

The first performs the initial load, and


The second periodically refreshes the data.

These two architectures can be quite similar or remarkably different. For example, while
refresh loads usually look at online data, history loads frequently have to dredge up a lot
of history from offline storage.
In addition, while refresh loads must determine which source records have changed to
extract just those, history loads frequently just bring it all over.
Finally, with regard to table design, we are firm believers that all records put into the
warehouse should contain a number of timestamps.
How Will I Determine What Records to Extract?
The art of determining what records to extract from the source system is frequently called
change data capture (CDC).

The point of change data capture is to recognize which source records have changed and
how, so that just the changed records are moved to the warehouse.
Techniques used to recognize changes to source database tables are:

Timestamps
Triggers
Application Integration Software (AIS)
File Compares

Timestamps
Some source systems timestamp records whenever data is inserted or updated. In
these situations, change data capture is reduced to an exercise of searching through
tables to determine which records have changed.
In addition, some source systems don't actually allow deletes; instead, they mark records
as deleted and timestamp the delete time without actually removing them from the
database.
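As a rough sketch, a timestamp-driven extract is usually just a range query against the timestamp column. The table and column names here (orders, last_update_ts) are hypothetical:

-- Pull only the rows that changed since the previous extract run.
-- :last_extract_ts and :current_extract_ts are bind variables supplied by the extract program.
SELECT o.*
  FROM orders o
 WHERE o.last_update_ts >  :last_extract_ts
   AND o.last_update_ts <= :current_extract_ts;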
Triggers
A great technique for capturing changes in source records is to put triggers on the source
tables.
Every time a record is inserted into, updated in, or deleted from a source table, the
triggers write a corresponding message in a log file. The warehouse uses the information
in these log files to determine how to update itself.
In practice, though, it is unusual to see this method implemented, because it requires you
to put triggers on, or modify, your source systems, a step that many organizations simply
will not allow. Their (sometimes valid) concern is that the addition of triggers will
jeopardize the performance of those source systems. This performance drop is in
comparison to other warehouse loading techniques that touch the sources in batch mode
during off-peak times.
Also, flat files do not support triggers.
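A minimal sketch of the trigger approach, assuming a hypothetical source table CUSTOMER with a key column cust_num (Oracle syntax):

-- Change-log table that the warehouse load program reads later.
CREATE TABLE customer_change_log (
    change_type  VARCHAR2(1),   -- I = insert, U = update, D = delete
    change_ts    DATE,
    cust_num     VARCHAR2(20)
);

-- Record every change made to the source table.
CREATE OR REPLACE TRIGGER trg_customer_cdc
AFTER INSERT OR UPDATE OR DELETE ON customer
FOR EACH ROW
BEGIN
    IF INSERTING THEN
        INSERT INTO customer_change_log VALUES ('I', SYSDATE, :NEW.cust_num);
    ELSIF UPDATING THEN
        INSERT INTO customer_change_log VALUES ('U', SYSDATE, :NEW.cust_num);
    ELSE
        INSERT INTO customer_change_log VALUES ('D', SYSDATE, :OLD.cust_num);
    END IF;
END;
/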
Application Integration Software (AIS)
AIS tools are used to pass information between applications. Tools in this field have
names like MQSeries, Mercator, and TIBCO.
For example, say your company runs several packaged applications, such as Oracle
software. AIS can be used to link these applications so that when a transaction occurs in
one, the AIS transmits it to all the others. As a side benefit, that same message stream can
serve as a data feed to the warehouse.

File Compares
Probably the least desirable technique for identifying changes in your source data is to
compare the file as it appears today to a copy of how it appeared when you last loaded the
warehouse.
Not only is this technique difficult to implement, but it's also less accurate than some
other methods. How so? Well, this technique compares periodic snapshots. Thus, if you
load your warehouse weekly, you will only be able to see the new state of the database
every week, but not every change that occurred during the week.
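For illustration only (the snapshot table names are hypothetical), the comparison itself can be written as a pair of set differences:

-- Rows that are new or changed since the last load.
SELECT * FROM customer_snapshot_today
MINUS
SELECT * FROM customer_snapshot_lastload;

-- Rows that were deleted or changed since the last load.
SELECT * FROM customer_snapshot_lastload
MINUS
SELECT * FROM customer_snapshot_today;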
How Will I Format The Extracted Records?
Once you've extracted the records, you should store them in a way that will allow you to
recognize what each means. Carry enough reference data to tie each record back to its
source.
For example, make sure you keep, with each record, information that indicates what
source system generated the record, when the record was obtained, and the key of the
record in its source. This information is invaluable when testing your load routines, as
well as when you need to investigate details.
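One way to carry that reference data, sketched here with hypothetical names, is to make it part of the standard staging layout:

-- Every extracted record carries enough reference data to tie it back to its source.
CREATE TABLE stg_customer (
    source_system  VARCHAR2(30),    -- which system generated the record
    extract_ts     DATE,            -- when the record was obtained
    source_key     VARCHAR2(50),    -- the record's key in the source system
    cust_name      VARCHAR2(100),   -- ...followed by the business attributes
    cust_addr      VARCHAR2(200)
);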
What Will I Do With The Extracted Records?
Generally, extracted records are stored in flat files. Data loading programs will
periodically read these files and load the data into the warehouse as appropriate.
In general, we believe in building loosely coupled warehousing architectures. By loose
coupling, we mean maintaining a separation between data extraction programs and data
loading programs. This makes your warehouse more flexible and maintainable.
One nice benefit of loose coupling is that it eases the addition of new data sources. So
long as the new source systems submit their data in the standard file format, your
standard load routines should be able to read that new data.
Dirty Data
Data received from source systems can be dirty in a number of ways.

Format violations
Referential integrity violations
Cross-system matching violations and
Internal consistency violations.

Format Violations
Data types are potentially wrong. For example, you might find letters in supposedly
numeric fields, incorrectly formatted phone numbers, and other similar examples.
Referential Integrity Violations
Data is not referentially sound. The sales system, for instance, might record sales to
customers who aren't listed in the customer file.
Cross-system Matching Violations
The same data elements appear in multiple systems but cannot be easily matched to each
other. This happens when a customer appears as J Smith in the sales system and John
Smith in the accounts receivable database.
Internal Consistency Violations
The same records are repeated in a single table, typically with minor differences such as
different spellings of names and other fields. For example, the customer IBM might
appear multiple times in your customer database: once as IBM, another time as
International Business Machines, and so on.
The problem with dirty data is that it makes your warehouse unreliable. This sort of data
should be cleansed whenever the data is loaded into the warehouse.
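As an illustration, the referential integrity check described above might be written as an anti-join against the incoming files (all names are hypothetical):

-- Sales records that reference customers missing from the customer feed.
SELECT s.*
  FROM stg_sales s
 WHERE NOT EXISTS (SELECT 1
                     FROM stg_customer c
                    WHERE c.cust_num = s.cust_num);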
Data Store 2: The Integration Layer
The integration layer, or warehouse, is a normalized database that unites the feeds from
all your sources in a single place.
As in all databases, it is strongly recommended that referential integrity constraints be
enabled in your warehouse/integration layer.
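A minimal sketch of that recommendation (table and constraint names are hypothetical):

-- Let the database itself reject orphan rows in the integration layer.
ALTER TABLE sales
  ADD CONSTRAINT fk_sales_customer
  FOREIGN KEY (cust_num) REFERENCES customer (cust_num);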
Why Build an Integration Layer?
Building an integration layer

Avoids extraction repetition.


It's likely that multiple data marts will require data from the same source
systems. If you don't bring your data from those sources through a
common warehouse, then each data mart will have to access each source
itself. Multiple marts will end up writing similar programs to get data from the
same systems.
same systems.
Each mart will then also have to go through the work of cleansing and
integrating the data.

If, on the other hand, you build an integration layer, every mart has to read
only one source: the integration layer that already contains integrated,
clean data.

Ensures standard interpretation of enterprise data.


It is quite likely that multiple groups interpret the same data differently.
The builder of an integration layer is forced to consider all the possible
definitions of data and to develop the common definitions that will be
shared across the organization.
Provides a repository that is far more flexible than the denormalized structures in the
high performance query, or data mart, layer.
A data mart is composed of denormalized data structures that are optimized
for querying. Denormalized structures, while providing great query
performance, are quite inflexible and awkward to work with. It is,
therefore, difficult to perform the data integration steps needed in these
structures.

An Introduction to Database Normalization


Consider the following denormalized table that contains information about sales of
electrical service to customers over time. In this example each customer receives a bill
each month for residential electric service.
cust_num
cust_name
cust_addr
cust_phone
Substation_id
Substation_name
Month1_id
Month1_name
Month1_bill_amount
Month2_id
Month2_name
Month2_bill_amount
First Normal Form
To be in first normal form, a database table must contain no repeating groups.

In the above table, the monthly bill information is a repeating group. To put our data into
first normal form (sometimes referred to as 1NF), we must create a new table, as follows.
CUSTOMER
cust_num
cust_name
cust_addr
cust_phone
Substation_id
Substation_name

CUSTOMER_MONTH
cust_num
Month_id
Month_name
Month_bill_amount

Second Normal Form
To be in second normal form, all non-key attributes of a table must rely on the entire key
of the table.

Each record in the CUSTOMER_MONTH table is uniquely identified by the combination
of customer and month (because we send one bill to each customer every month).
Notice the Month_name attribute. Month_name is not dependent on the entire primary
key; it is dependent only on Month_id. In other words, you could determine the month
name simply by knowing the Month_id; the customer is irrelevant. Contrast this with
Month_bill_amount. To correctly find the amount billed to any customer in a particular
month, you must know both the cust_num AND the Month_id. Thus, the tables are
changed as follows:
CUSTOMER
cust_num
cust_name
cust_addr
cust_phone
Substation_id
Substation_name

CUSTOMER_MONTH
cust_num
Month_id
Month_bill_amount

MONTH
Month_id
Month_name

Third Normal Form


To be in third normal form, none of the non-key fields can depend on other non-key
fields. In other words, all non-key fields must depend solely on the table's primary key.
Consider the substation fields in the CUSTOMER table. While each customer is served
by only one substation, is the substation name actually dependent on the primary key of
the customer table? No; in fact, the Substation_name is dependent only on the
Substation_id. Thus, the tables are changed as follows:
CUSTOMER
cust_num
cust_name
cust_addr
cust_phone
Substation_id

CUSTOMER_MONTH
cust_num
Month_id
Month_bill_amount

MONTH
Month_id
Month_name

SUBSTATION
Substation_id
Substation_name
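For reference, here is one way the third normal form design above could be expressed as SQL DDL; the data types are assumptions:

CREATE TABLE substation (
    substation_id    NUMBER PRIMARY KEY,
    substation_name  VARCHAR2(50)
);

CREATE TABLE month (
    month_id    NUMBER PRIMARY KEY,
    month_name  VARCHAR2(20)
);

CREATE TABLE customer (
    cust_num       NUMBER PRIMARY KEY,
    cust_name      VARCHAR2(100),
    cust_addr      VARCHAR2(200),
    cust_phone     VARCHAR2(20),
    substation_id  NUMBER REFERENCES substation (substation_id)
);

CREATE TABLE customer_month (
    cust_num           NUMBER REFERENCES customer (cust_num),
    month_id           NUMBER REFERENCES month (month_id),
    month_bill_amount  NUMBER(10,2),
    PRIMARY KEY (cust_num, month_id)
);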

What Extra Data Must the Integration Layer Hold?


Surrogate Keys
A surrogate key has no business meaning. It is simply a sequential number generated by
the data warehouse load program.
A surrogate key is added to each record that is loaded into the warehouse.
For example, consider the following records from a source application.
Custnum  Lname         Fname   Street         City    State    Country  Planet
480BC    Themistocles  Bob     46 Maple Lane  Athens  Georgia  USA      Earth
550BC    Cyrus         Hubert  74 Tech Way    Anshan           Persia   Earth

The same records after they have been placed into a warehouse table, with the surrogate
key Custkey added:

Custkey  Custnum  Lname         Fname   Street         City    State    Country  Planet
1        480BC    Themistocles  Bob     46 Maple Lane  Athens  Georgia  USA      Earth
2        550BC    Cyrus         Hubert  74 Tech Way    Anshan           Persia   Earth

The following is the warehouse table after Themistocles moved from Athens to Argos.
Custkey  Custnum  Lname         Fname   Street          City    State    Country  Planet
1        480BC    Themistocles  Bob     46 Maple Lane   Athens  Georgia  USA      Earth
2        550BC    Cyrus         Hubert  74 Tech Way     Anshan           Persia   Earth
3        480BC    Themistocles  Bob     Ostracon Place  Argos            Greece   Earth
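Note that the move produced a new warehouse record with a new Custkey, even though the business key, Custnum 480BC, did not change; the surrogate key is what lets the warehouse keep both the old and the new version of the record. A minimal sketch of how a load program might assign Custkey, using an Oracle sequence (the staging and warehouse table names are hypothetical):

CREATE SEQUENCE custkey_seq START WITH 1;

-- Assign the next surrogate key value to each record as it is loaded.
INSERT INTO wh_customer (custkey, custnum, lname, fname, street, city,
                         state, country, planet)
SELECT custkey_seq.NEXTVAL, custnum, lname, fname, street, city,
       state, country, planet
  FROM stg_customer_extract;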

Dates, Statuses, and Other Fields


For a number of reasons (auditing support, the ability to easily identify data for extraction
to the data marts, and others), we like to see additional information on records in the
warehouse.

Additional fields we have found to be useful on warehouse records include: insert date,
last update date, status flag (for example, is this the current record or has it been
superseded by some other?), and the source system responsible for the record.
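Sketched as DDL against the hypothetical warehouse table used above, those housekeeping fields might look like this:

ALTER TABLE wh_customer ADD (
    insert_date    DATE,            -- when the row was first loaded
    last_update    DATE,            -- when the row was last changed
    current_flag   CHAR(1),         -- 'Y' = current, 'N' = superseded
    source_system  VARCHAR2(30)     -- which source system supplied the row
);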
Another Note About Dirty Data
Techniques for handling bad records include:

Ignoring them.
Rejecting bad records, but saving them in a separate file for manual review.
Loading as much of each bad record as possible and pointing out the errors for later
review.

Data Flow 2: From the Integration Layer to the High Performance Query Structure
Data warehousing is an end user concept. End users should query data only out of their
high performance query structures (HPQS), or data marts.
In this flow, data is extracted from the integration layer, or warehouse, and inserted into
the data marts. Thus, you must build another set of extract, transformation, and load
(ETL) jobs to populate the marts. Once again, you generally should try to incrementally
refresh your data marts rather than completely refreshing them with every load.
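An incremental mart refresh, sketched here with hypothetical table names, is typically just an insert of the warehouse rows added since the last mart load:

-- Move only the newly arrived warehouse rows into the mart fact table.
INSERT INTO mart_sales_fact (custkey, month_id, bill_amount)
SELECT w.custkey, w.month_id, w.month_bill_amount
  FROM wh_customer_month w
 WHERE w.insert_date > :last_mart_load_date;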
One way in which your data marts will differ from the data warehouse is in their use of
summary tables. Because they can greatly improve end-user query performance, data
marts almost always contain some summaries of their atomic-level details. Thus, your
load programs may be called upon to load not only the atomic-level data, but also these
summary tables.
Summary management is greatly aided by Oracle 8i's materialized view functionality.
Oracle 8i allows you to create summary tables called materialized views.
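A minimal materialized view sketch (object names are hypothetical); once it is defined, Oracle can maintain the summary and rewrite qualifying queries to use it:

CREATE MATERIALIZED VIEW mv_monthly_billing
  BUILD IMMEDIATE
  REFRESH ON DEMAND
  ENABLE QUERY REWRITE
AS
SELECT month_id, SUM(bill_amount) AS total_billed
  FROM mart_sales_fact
 GROUP BY month_id;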
Data Store 3: The High Performance Query Structure (HPQS)
HPQSs, or data marts, are databases and data structures set up specifically to support end-user queries.
These databases are most frequently managed by either
relational database engines (for example, Oracle 8i) or
multidimensional database engines (for example, Express or Essbase).
It's important to note that data marts and HPQSs are logical, not physical, concepts.
Frequently, an organization's data warehouse and its data marts will share the same
computer. Sometimes, they even share the same Oracle instance and schema. Still, they
have different purposes, and physically they have very different table designs.

In the end, your data mart is a set of data structures that contains data in formats that
make it easier and speedier for end users to access than it would be in traditional,
normalized database formats.
Data Flow 3: From the HPQSs to the End-User Reporting Applications
The query tools get data out of the data marts and into the hands of end users. These tools
generally issue SQL calls to relational databases, and other appropriate calls to databases
using other technologies. Data is returned to the tools, which then format the results.
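For example, a query tool might generate SQL along these lines against the hypothetical mart tables used earlier:

-- Total billing by month, summed from the mart fact table.
SELECT m.month_name, SUM(f.bill_amount) AS total_billed
  FROM mart_sales_fact f, mart_month_dim m
 WHERE m.month_id = f.month_id
 GROUP BY m.month_name;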
Data Store 4: Data in the End User's Hands
End users' reports really do constitute a data store. If your sponsors can't be made
comfortable with that concept, you may have a problem on your hands.
Alternate Warehousing Architectures
No Warehouse: Users query directly off the OLTP sources. This is possible
when the transaction processing systems are sufficiently strong and end-user
queries are sufficiently limited that there are no concerns about response
time. This approach, obviously, won't help in situations where there is a need to
integrate data across multiple systems.
Normalized Design: In this architecture, you build just the
warehouse/integration layer. All data is integrated there, but rather than querying
from denormalized or multidimensional data marts, users query directly out of the
integration layer. This architecture will provide you with the integration benefits
of our preferred architecture. What it won't give you are the usability and query
performance benefits associated with HPQSs or data marts.
Just Data Marts: This approach is best used for limited-scope, point solutions that
don't need data integrated from multiple systems. For example, if a department
has a need for data out of a single OLTP system, and this type of data has very
little application in the rest of the company, perhaps it doesn't make sense to bring
this data through the warehouse.
