
Incremental Data Warehouse Updates

With the increasing popularity of data warehouses and data marts, the ability to refresh
data in a timely fashion is more important than ever. Strategies of the past included
completely rebuilding the data warehouse periodically to ensure that information used for
reporting was current. As the data warehouse/data mart increases in complexity and the
demand for more up-to-the-minute data increases, the possibility of maintaining the data
warehouse in this fashion becomes intractable. One solution to this problem is to monitor
the operational data sources, detect changes of interest and populate the data warehouse
with the deltas. This article focuses on the changed data capture approaches that are
available today.

The majority of the data in operational systems is transient. This is a manifestation of the
needs of the business. At any one point in time, the operational system contains only the
latest version of all records in the database. This is the view that is of interest to the
operational side of the business. In some cases, an operational system may implement a
form of semi-periodic history (e.g., the last five changes are preserved for each record).
The problem with this technique is that it cannot guarantee that all changes will be
preserved. For example, a record that changes seven times between captures loses its
two oldest changes.

For operational systems where history is critical to the business operation (e.g., the
insurance industry or for database marketing programs), all changes to records are
preserved for a set period of time. Since there is no guarantee that the operational
system(s) will capture changes for us, we must employ a technique to capture the
transient and semi-periodic data. Once changes of interest are identified, they must be
stored in a persistent area to ensure that these records are available for updating the data
warehouse. The level of information stored should be rich enough to allow the
transformation application to make update decisions accurately. Typically, this level of
information would be an indication of the event or the full record before or after the
change.

Our main challenge is to identify changes within the operational system(s) that are of
interest to the data warehouse at the appropriate granularity. Once these changes have
been identified, the problem then becomes one of determining the actions to take based
on these changes. In a data warehouse environment, different actions may be taken based
on the type of change identified (insert, update or delete). For example, if we have
created a retail data warehouse and over the course of time one of our products is retired,
we would capture that change (a delete) and update the information in our warehouse
correspondingly. In this particular case, we would most likely have the product
information in a dimension table and the subsequent action would be to update the
product dimension record with a termination date.
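As a sketch of this delete-handling action, the snippet below uses an in-memory SQLite database; the table and column names are hypothetical. A source-system delete does not remove the dimension row; it closes it with a termination date:

```python
import sqlite3
from datetime import date

# Hypothetical product dimension; names are illustrative, not from the article.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE product_dim (
        product_key INTEGER PRIMARY KEY,
        product_name TEXT,
        termination_date TEXT  -- NULL while the product is active
    )
""")
conn.execute("INSERT INTO product_dim VALUES (1, 'Widget', NULL)")

def apply_change(conn, change_type, product_key, capture_date):
    """Route a captured change to the appropriate warehouse action."""
    if change_type == "delete":
        # A source delete does not remove history; it closes the dimension row.
        conn.execute(
            "UPDATE product_dim SET termination_date = ? WHERE product_key = ?",
            (capture_date, product_key),
        )
    # captured inserts and updates would be routed here as well

apply_change(conn, "delete", 1, date(2024, 1, 31).isoformat())
row = conn.execute(
    "SELECT termination_date FROM product_dim WHERE product_key = 1"
).fetchone()
print(row[0])  # 2024-01-31
```

The row survives for historical reporting; only its active period is bounded.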

Summary of Techniques

There are a number of techniques that are available for incrementally updating a data
warehouse, a data mart or another operational system. The following will discuss the
most popular techniques, realizing that there may be variations of these approaches in
practice. Changed data capture techniques fall into two broad categories: static and
incremental. In all of these approaches, it is assumed that a detected change will yield a
complete changed record to the requesting target. Based on the requirements of the
solution and the implementation of the incremental capture technique, the before and
after image of the changed record may be provided.

Static data capture is usually associated with taking a snapshot of the data at a particular
point in time. In some cases, the complete data set may be retrieved, but more likely only
a subset will be used. Static capture, timestamp capture and file comparison capture are
the static capture methods discussed in this article. Except in the case of file comparison
capture, the static capture approaches are relatively simple.

Incremental data capture is a time-dependent model for capturing changes to operational
systems. This technique is best applied in circumstances where changes in the data are
significantly smaller than the size of the data set for a specific period of time (i.e., the
time between captures). These techniques are more complex than static capture, because
they are closely tied to the DBMS or the operational software which updates the DBMS.
Three different techniques in this category are application-assisted capture, trigger-based
capture and transaction log capture. In circumstances where DBMSs are used to store
operational data, transaction log capture is the most powerful and provides the most
efficient approach to incremental capture.

Static Capture Methods

Static Capture: This technique is the simplest of the data capture techniques. The
fundamental principle here is to periodically take a snapshot of the operational system
and then load the information into the data warehouse. The load operation can occur in
two modes, either complete reload or append. Complete reload assumes that the tables
being populated in the data warehouse are cleared by dropping and recreating each table
or by deleting all the data in the tables prior to reload. This technique does not provide a
mechanism for capturing historical data unless the operational system maintains periodic
data. Append mode assumes that data exists in the tables to be loaded and places
information into those tables based on predefined rules. For example, if a record exists,
overwrite the entire record.
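The two load modes can be sketched as follows; the dictionaries stand in for warehouse tables, and all names are illustrative rather than taken from any particular product:

```python
# Sketch of the two static-capture load modes over a keyed table.
def complete_reload(warehouse, snapshot):
    """Clear the target table, then load the full snapshot."""
    warehouse.clear()
    warehouse.update(snapshot)

def append_load(warehouse, snapshot):
    """Merge the snapshot into existing data: on a key match,
    overwrite the entire record (the rule given in the text)."""
    for key, record in snapshot.items():
        warehouse[key] = record

warehouse = {1: {"name": "Widget", "price": 9}}
snapshot = {1: {"name": "Widget", "price": 10}, 2: {"name": "Gadget", "price": 5}}
append_load(warehouse, snapshot)
print(warehouse[1]["price"], len(warehouse))  # 10 2
```

Note that append mode preserves rows absent from the snapshot, while complete reload discards them.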

Timestamp (Database Scan) Capture: This method of incremental data capture is
similar, in concept, to the static data capture method. The important distinction in the
case of the timestamp approach is that all records contain information pertaining to the
time at which they were last updated. These temporal indicators (tags), known as
timestamps, provide the selection criteria for the capture of changed records. For
example, the application that is responsible for the retrieval of changed records will know
that, for the tables of interest, all records with a timestamp after the last time the program
ran and up to the current time must be captured. This application would be required to
maintain a persistent status of the point at which it "left off" so that there would be no
overlap in the incremental capture process.
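A minimal sketch of this watermark logic, with invented record and field names:

```python
from datetime import datetime, timezone

# Hypothetical rows: each carries a last_updated timestamp maintained
# by the operational system.
rows = [
    {"id": 1, "last_updated": datetime(2024, 1, 1, tzinfo=timezone.utc)},
    {"id": 2, "last_updated": datetime(2024, 1, 3, tzinfo=timezone.utc)},
    {"id": 3, "last_updated": datetime(2024, 1, 5, tzinfo=timezone.utc)},
]

def capture_changes(rows, watermark, now):
    """Select records updated after the persisted watermark, up to 'now'."""
    changed = [r for r in rows if watermark < r["last_updated"] <= now]
    # The returned 'now' becomes the new watermark; it must be persisted
    # so the next run picks up where this one left off, with no overlap.
    return changed, now

watermark = datetime(2024, 1, 2, tzinfo=timezone.utc)  # saved by the last run
now = datetime(2024, 1, 4, tzinfo=timezone.utc)
changed, watermark = capture_changes(rows, watermark, now)
print([r["id"] for r in changed])  # [2]
```

The half-open interval (exclusive of the old watermark, inclusive of the new one) is what prevents a record from being captured twice.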

One advantage to this approach is that it is independent of database type. The second
distinct advantage of this approach over the static capture approach is that the total
volume of data is smaller. If the business does not require the capture of all changes in
state for records of interest, this technique would work well.

The timestamp method suffers from the same problems as the static capture technique
because it is difficult to capture intermediate states of data unless the operational system
has been architected to do so. For example, if a record changed state five times since its
last capture and the operational system did not maintain periodic history for the record,
then the next time the information capture application ran, it would only retrieve the
current state of the record in question.

An additional problem with this technique centers around the handling of deletes.
Typically in an operational system, records that are no longer valid or needed are deleted.
In order for the timestamp capture application to detect deletions, the record must be
marked inactive until it has been captured. Then, it can be removed from the operational
system.

File Comparison Capture: This method is also known as the snapshot differential
method. This method works by keeping before and after images of files that are of
concern to the data warehouse. Records are compared to find changes, and record keys
are compared to find inserts and deletes. This technique is most appropriate for legacy
systems, where triggers typically do not exist and transaction logs are either
non-existent or in a proprietary format. Since most legacy databases have some
mechanism for dumping data into files, this technique creates periodic snapshots and then
compares the results to produce change records. Certainly, all the problems of static
capture are present here. Added complexity is introduced by the challenge of comparing
entire rows of information and by key identification and matching. This technique is
complex in nature and typically not desirable but, in some cases, may be the only
solution.
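The key-based comparison can be sketched as follows, assuming each snapshot has been loaded into a mapping from record key to record contents:

```python
def snapshot_diff(before, after):
    """Compare two keyed snapshots; emit inserted, deleted, and changed keys."""
    inserts = [k for k in after if k not in before]
    deletes = [k for k in before if k not in after]
    updates = [k for k in after if k in before and after[k] != before[k]]
    return inserts, deletes, updates

# Toy snapshots keyed by record key; values stand in for full rows.
before = {1: "red", 2: "blue", 3: "green"}
after = {2: "blue", 3: "teal", 4: "mauve"}
inserts, deletes, updates = snapshot_diff(before, after)
print(inserts, deletes, updates)  # [4] [1] [3]
```

In practice the row comparison is the expensive part, since entire records (or hashes of them) must be matched for every key present in both snapshots.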

Incremental Data Capture Methods

Application-Assisted Capture: This mechanism for data capture does not require a
separate application/process to perform the change identification and collection. Instead,
this technique requires the change detection logic to be implemented as part of the
operational application. The underlying principle here is that when changes are written to
the operational database, changes are also written to a persistent area for later retrieval.
The latency in this technique is obviously minimal, meaning that the changed records are
available for immediate use. Although this technique can be powerful when designed and
developed correctly, it suffers from a few problems. Since the application is responsible
for making changes to the database, it more than likely maintains only key information
for records that require updates. Therefore, in order to write complete records to the
change queue, the application would: 1) have to go to the database and retrieve the entire
record; 2) apply the change; and 3) write it out to the change queue.
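The three steps above can be sketched as follows; the in-memory dict and deque stand in for the operational database and the persistent change queue:

```python
from collections import deque

database = {1: {"id": 1, "balance": 100}}  # stand-in for the operational DB
change_queue = deque()                     # stand-in for the persistent change area

def update_record(key, changes):
    record = dict(database[key])   # 1) retrieve the entire record
    record.update(changes)         # 2) apply the change
    database[key] = record
    change_queue.append(record)    # 3) write the full after-image to the queue

update_record(1, {"balance": 150})
print(change_queue[0])  # {'id': 1, 'balance': 150}
```

The extra read in step 1 is the performance cost the text describes: an application that previously held only the key and the delta must now fetch the full record before it can enqueue a complete image.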

This can hurt performance and increase overall complexity. In addition, this technique
requires that the application perform all of the computation necessary to accurately
update specific records. For example, if the database is expected to add the current date to
a record, then the application either has to perform that same function or go back and
read the record after it has been written. One of the biggest drawbacks to the application-
assisted capture approach is that adding this level of functionality to legacy systems can
be difficult, especially since the originators of the code are often no longer within the
organization.

Trigger-Based Capture: Triggers are stored procedures that are invoked when certain
conditions or events occur. The granularity and flexibility of the conditions and events
that cause a trigger to fire are database specific. This technique assumes that the DBMS
supports triggers. Therefore, file-based storage cannot use trigger-based capture.

This approach is very similar to the application-assisted technique in that triggers, once
activated, can perform specialized tasks. In this case, the invocation of the trigger can be
used to save changed records of interest to a persistent storage area for retrieval. One
drawback to this approach is that it should only be used in cases where the number of
events captured is within the system's performance limits. For example, if all updates to a
table are captured, this method creates twice the workload for the database. An additional
problem with this approach is that it requires that the granularity of the information in the
target database match that of the source database. For example, if we are interested in
capturing bank transactions in the target database, and the source database only stores
account summary information, then we may not be able to derive the transaction (event)
information. One benefit to this approach is that data capture occurs at the source of the
data and, therefore, both the before and after image are available.
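A minimal illustration of trigger-based capture, using SQLite only because it is easy to run; trigger syntax and capability vary by DBMS. The trigger saves both the before and after image of each update to a persistent change table:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE account (id INTEGER PRIMARY KEY, balance INTEGER);
    CREATE TABLE account_changes (
        id INTEGER, old_balance INTEGER, new_balance INTEGER
    );
    -- Fires on every update; saves before and after images for later retrieval.
    CREATE TRIGGER capture_account_update
    AFTER UPDATE ON account
    BEGIN
        INSERT INTO account_changes VALUES (OLD.id, OLD.balance, NEW.balance);
    END;
""")
conn.execute("INSERT INTO account VALUES (1, 100)")
conn.execute("UPDATE account SET balance = 150 WHERE id = 1")
print(conn.execute("SELECT * FROM account_changes").fetchall())  # [(1, 100, 150)]
```

Every captured update costs a second write, which is the doubled workload the text warns about.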

Transaction Log Capture: Transaction log capture takes a slightly different approach to
capturing the changed information. This technique leverages the logging and recovery
capabilities of a DBMS and, therefore, sequential and indexed files cannot take advantage
of this method. Since the transaction logs are utilized by the DBMS as a place to store
transaction information for logging and recovery, it is the ideal location to capture
changed information without directly impeding the performance of the database. When
reading the transaction logs, particular care must be taken to obtain only information that
has been committed by the database.
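A toy sketch of the committed-only rule: the log record format below is invented for illustration, but buffering a transaction's changes until its COMMIT appears is the general approach:

```python
# Invented log format: (transaction id, action, change record).
log = [
    ("T1", "UPDATE", {"id": 1, "balance": 150}),
    ("T2", "UPDATE", {"id": 2, "balance": 75}),
    ("T1", "COMMIT", None),
    # T2 never commits (rolled back, or still in flight) and must be ignored.
]

def committed_changes(log):
    pending = {}   # transaction id -> buffered change records
    captured = []
    for txn, action, record in log:
        if action == "COMMIT":
            captured.extend(pending.pop(txn, []))   # release on commit only
        elif action == "ROLLBACK":
            pending.pop(txn, None)                  # discard uncommitted work
        else:
            pending.setdefault(txn, []).append(record)
    return captured

print(committed_changes(log))  # [{'id': 1, 'balance': 150}]
```

Only T1's change is released; T2's update stays buffered and never reaches the warehouse.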

This technique limits the amount of work that the DBMS must perform, but an
application must be written to monitor the log files and capture the data of interest. This
technique is popular in some of the database replication techniques that are available
today.

One drawback to this approach is that the transaction logs must remain available until
the changes of interest are captured. For example, if a
DBA decides to trim the transaction logs prior to the capture of the changed records,
information may be lost. The log capture method is probably the most efficient approach
to incremental capture because log writing is already well optimized and available on
most DBMS platforms today. In addition, log capture can be run on a
separate processor to reduce the impact on the performance of the operational system.

Identify Needs and Goals

While there are a number of techniques available for the incremental update of a data
warehouse, there is no one silver bullet. Each of the techniques discussed has pros and
cons, depending upon both the defined business objectives and the technical operating
environment (see Figure 1). A business must carefully identify its needs and goals with
respect to an incremental update strategy and select the one that is most appropriate.

FIGURE 1: Incremental Update Techniques

                 Impact in    Impact in      Complexity of    Ability to     Inherent Oper-
                 Source       Operational    Implementation   Capture        ational System
                 Databases    Applications                    Historical     Performance
                                                              Information    Impact

Static           None         None           Low              None (1)       Low (2)
Timestamp        Low          Low (3)        Low              None (4)       Low
File Comparison  None         None           High             None (4)       Low (2)
Application-
Assisted         None         High           High             High           High
Trigger          High         None           Med (5)          High           High
Log Capture      None (6)     None           Med              High           Low

For example, transaction log capture allows the detection and processing of changes to
occur without directly impacting the operational system, which is very attractive in
situations where the operational system is already over-utilized. But if the source
database does not produce a usable transaction log, then an alternative solution must
be selected.

As we proceed into the realm of terabyte data warehouses, the ability to rebuild the data
warehouse from scratch on a nightly basis will go the way of the dinosaur. The logical
and efficient approach to updating the data warehouse involves some form of incremental
update strategy. The key to success is the ability to identify the changes of interest and to
capture them in a timely manner that does not overburden the source operational systems,
while still meeting the business needs of the organization.
