Real-Time Data Warehousing:

Challenges and Solutions


by Justin Langseth,
Founder and CTO of Claraview

Today's Real-time Requirement


Traditionally, data warehouses do not contain today's data. They are usually loaded with data from
operational systems weekly or, at best, nightly, and are in any case a window on the past. The fast
pace of business today is quickly making these historical systems less relevant to the issues facing
managers and government officials in the real world. Morning sales on the
east coast will affect how stores are stocked on the west coast. Airlines and government agencies
need to be able to analyze the most current information when trying to detect suspicious groups
of passengers or potentially illegal activity. Fast-paced changes in the financial markets may
make the personalized suggestions on a stockbroker's website obsolete by the time they are
viewed.

As today's decisions in the business world become more real-time, the systems that support
those decisions need to keep up. It is only natural that Data Warehouse, Business Intelligence,
Decision Support, and OLAP systems quickly begin to incorporate real-time data.

Data warehouses and business intelligence applications are designed to answer exactly the types
of questions that users would like to pose against real-time data. They are able to analyze vast
quantities of data over time, to determine the best offer to make to a customer, or to
identify potentially fraudulent, illegal, or suspicious activity. Ad-hoc reporting is made easy using
today's advanced OLAP tools. All that needs to be done is to make these existing systems and
applications work off real-time data.

This article examines the challenges of adding real-time data to these systems, and presents
several approaches to making real-time warehousing a reality today.

Challenge #1: Enabling Real-time ETL


One of the most difficult parts of building any data warehouse is the process of extracting,
transforming, cleansing, and loading the data from the source system. Performing ETL of data in
real-time introduces additional challenges. Almost all ETL tools and systems, whether based on
off-the-shelf products or custom-coded, operate in a batch mode. They assume that the data
becomes available as some sort of extract file on a certain schedule, usually nightly, weekly, or
monthly. Then the system transforms and cleanses the data and loads it into the data warehouse.

This process typically involves downtime of the data warehouse, so no users are able to access it
while the load takes place. Since these loads are usually performed late at night, this scheduled
downtime typically does not inconvenience many users.

When loading data continuously in real-time, there can't be any system downtime. The heaviest
periods in terms of data warehouse usage may very well coincide with the peak periods of
incoming data. The requirements for continuous updates with no warehouse downtime are
generally inconsistent with traditional ETL tools and systems. Fortunately, there are new tools on
the market that specialize in real-time ETL and data loading. There are also ways of modifying
existing ETL systems to perform real-time or near real-time warehouse loading. Some of these
tools and techniques are described below.

Solution 1a: "Near Real-time" ETL

The cheapest and easiest way to solve the real-time ETL problem is to not even attempt it in the
first place. Not every problem actually requires, or can justify the cost of, true real-time data
warehousing. For these applications, simply increasing the frequency of the existing data load
may be sufficient.

A data load that currently occurs weekly can perhaps be performed daily, or even twice a day.
A daily data load could be converted to an hourly data load. To get around the issue of system
downtime, see Solution 1c below. This approach gives warehouse users access to fresher data
than they are used to having, without requiring major modifications to existing load processes,
data models, or reporting applications. While not real-time, near-real-time may be a good,
inexpensive first step.

Solution 1b: Direct trickle feed

Assuming that an application requires a true real-time data warehouse, the simplest approach is
to continuously feed the data warehouse with new data from the source system. This can be
done by either directly inserting or updating data in the warehouse fact tables, or by inserting data
into separate fact tables in a real-time partition (see Solution 2b).

How to actually transport the data between the operational systems and the data warehouse is
beyond the scope of this article, but several approaches are worth mentioning. There are a
number of new real-time data loading packages that are specifically designed for this challenge,
including those from DataMirror and MetaMatrix.

EAI vendors such as Tibco provide solutions for real-time data transport. For systems based on
the latest Java technologies, Java Messaging Service (JMS) can be used to transmit each new
data element from the source system to a lightweight listener application that in turn inserts the
new data into the warehouse tables. For data that is received over the Internet, the data can be
transmitted in XML via HTTP using the SOAP standard, and then loaded into the warehouse.
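
Whatever the transport mechanism, the warehouse-side work in a direct trickle feed amounts to
issuing individual inserts against the fact table as each record arrives. A minimal sketch, using
hypothetical table and column names:

    INSERT INTO SALES_FACT (SALE_TS, STORE_ID, PRODUCT_ID, SALES_AMT)
    VALUES (:sale_ts, :store_id, :product_id, :sales_amt);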

Once the data is near the warehouse, simply inserting the new data in real-time is not particularly
challenging. The problem with this approach, which will probably not be readily apparent during
development and initial deployment, is that it does not scale well. The same logic as to why data
warehouses exist in the first place applies: complex analytical queries do not mix well with
continuous inserts and updates. Constantly updating the same tables that are being queried by a
reporting or OLAP tool can cause the data warehouse's query performance to degrade.

Under moderate to heavy usage, either from concurrent warehouse user queries or from the
incoming data stream, most relational database management systems will begin to temporarily
block the incoming data transactions, and the data will become stale. This may also cause the
data loading application to fail outright if the warehouse becomes unresponsive. And as queries
begin to slow and the data becomes stale, warehouse users may become frustrated and stop
using the system altogether.

Two approaches to real-time data loading that help avert this scalability problem are described in
the next section. Also the issue of real-time data warehouse scalability, and four additional
approaches to building a scalable real-time warehouse system, are discussed in Section 4 of this
paper.

Solution 1c: Trickle & Flip

The "Trickle & Flip" approach helps avert the scalability issues associated with querying tables
that are being simultaneously updated. Instead of loading the data in real-time into the actual
warehouse tables, the data is continuously fed into staging tables that are in the exact same
format as the target tables. Depending on the data modeling approach being used (see section 2
below), the staging tables either contain a copy of just the data for the current day, or for smaller
fact tables can contain a complete copy of all the historical data.

Then on a periodic basis the staging table is duplicated and the copy is swapped with the fact
table, bringing the data warehouse instantly up-to-date. If the integrated real-time through views
approach (Solution 2c) is being used, this operation may simply consist of changing the view
definition to include the updated table instead of the old table. Depending on how the particular
RDBMS handles this swap, it might be advisable to temporarily pause the OLAP server while the
flip takes place so that no new queries are initiated while the swap occurs.

This approach can be used with cycle times ranging from hourly to every minute. Generally best
performance is obtained with 5-10 minute cycles, but 1-2 minute cycles (or even faster) are also
possible for smaller data sets or with sufficient database hardware. It is important to test this
approach under full load before it is brought into production to find the cycle time that works best
for the application.
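
As a rough sketch of the flip step, assuming a staging table SALES_FACT_STAGE and a live fact
table SALES_FACT (the names are hypothetical, and the exact syntax and locking behavior vary
by RDBMS):

    -- Copy the staging table, then swap the copy in as the live fact table
    CREATE TABLE SALES_FACT_NEW AS SELECT * FROM SALES_FACT_STAGE;
    ALTER TABLE SALES_FACT     RENAME TO SALES_FACT_OLD;
    ALTER TABLE SALES_FACT_NEW RENAME TO SALES_FACT;
    DROP TABLE SALES_FACT_OLD;

If the view-based approach of Solution 2c is in use, the same effect can be achieved by simply
redefining the view to reference the freshly loaded table, avoiding the table renames entirely.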

Solution 1d: External Real-time Data Cache

All of the solutions discussed so far involve the data warehouse's underlying database taking on
a lot of additional load to deal with the incoming real-time data, and making it available to
warehouse users. The best option in many cases is to store the real-time data in an external real-
time data cache (RTDC) outside of the traditional data warehouse, completely avoiding any
potential performance problems and leaving the existing warehouse largely as-is.

The RTDC can simply be another database server (or a separate instance of a large
database system) dedicated to loading, storing, and processing the real-time data. Applications
that either deal with large volumes of real-time data (hundreds or thousands of changes per
second), or those that require extremely fast query performance, might benefit from using an in-
memory database (IMDB) for the real-time data cache. Such IMDBs are provided by companies
such as Angara, Cacheflow, Kx, TimesTen, and InfoCruiser.

Regardless of the database that is used to hold the RTDC, its function remains the same. All the
real-time data is loaded into the cache as it arrives from the source system. Depending on the
approach taken and the analytical tools being used, either all queries that involve the real-time
data are directed to the RTDC, or the real-time data required to answer any particular query is
seamlessly imaged to the regular data warehouse on a temporary basis to process the query.
These two options are described in detail in Solutions 4c and 4d below.

Using an RTDC, there's no risk of introducing scalability or performance problems on the existing
data warehouse, which is particularly important when adding real-time data to an existing
production warehouse. Further, queries that access the real-time data will be extremely fast, as
they execute in their own environment separate from the existing warehouse.
This is important as users who need up-to-the-second data typically don't want to wait too long for
their queries to return. Also by using just-in-time data merging from the RTDC into the
warehouse, or reverse just-in-time merging from the warehouse into the RTDC (described in
Solutions 4d and 4e), queries can access both real-time and historical data seamlessly.

The downside of using an RTDC solution, with or without just-in-time data merging, is that it
involves an additional database that needs to be installed and maintained. There is also additional
work required to configure the applications that need to access the real-time data so that they
point to the RTDC. The upside of the RTDC is increased performance, access to up-to-the-
second data, and no scalability or performance risk on the existing warehouse. Moreover, the cost
of an RTDC solution is typically low compared to the cost of adding sufficient hardware and
memory to the existing database server to overcome the scalability and performance issues
associated with the trickle feed approaches.

Challenge #2: Modeling Real-time Fact Tables


The introduction of real-time data into an existing data warehouse, or the modeling of real-time
data for a new data warehouse, brings up some interesting data modeling issues. For instance, a
warehouse that has all of its data aggregated at various levels based on a time dimension needs
to consider the possibility that the aggregated information may be out of sync with the real-time
data. Also, some metrics such as month-to-date and week-to-date may behave strangely with a
partial day of data that continuously changes. The main modeling issue, however, revolves
around where the real-time data is stored, and how best to link it into the rest of the data
model.

Solution 2a: Modeling as Usual with Direct Fact Table Feed

When using the Direct Trickle Feed or Trickle & Flip approaches to real-time ETL discussed
above, and when the real-time data is stored along with the historical data in the same fact tables,
no special data modeling approaches are required. From the query tool's perspective, a real-time
data warehouse modeled in this manner looks no different from a non-real-time warehouse.

The main thing to consider when using this approach is caching. Many query, reporting, and
OLAP tools will cache reports and result sets, and assume that these caches need to be
refreshed only upon the nightly or weekly warehouse load. With a real-time data feed, this type of
caching needs to be either disabled, or set to the same cycle as the table flip when using the
Trickle & Flip approach. Since most query tools do not have a good mechanism for coordinating
cache expiration with frequent external events, disabling caching for the reports that run on real-
time data is usually the best approach. In the future, query tools will likely become more
real-time aware, understanding that some tables change frequently and varying their caching
approaches accordingly, but for now these settings need to be configured carefully and by hand.

Solution 2b: Separate Real-time Partition

One approach to modeling real-time data is to store the real-time data in separate warehouse fact
tables. Depending on the type of fact table, many query tools that support table partitioning will be
smart enough to automatically retrieve the real-time data from the real-time tables when required.
For tables where this will not work, alternate facts and attributes can be set up to point to the real-
time data tables, and the query tool can be configured to drill across from reports containing the
historical data to real-time data templates.

From the query-tool configuration and administration perspective, this data modeling approach is
the most complex to engineer. This approach requires either the query tool or the end user to
understand where the various types of data are located and how to access them. The success of
this approach depends significantly on how well the query tool insulates the end user from the
extra complexities required to access real-time information.

This data modeling approach is covered in detail by Ralph Kimball in his February 2002 "Real-
time Partitions" article in Intelligent Enterprise (http://www.intelligententerprise.com/020201).

Solution 2c: Integrated Real-time through Views

Another real-time data modeling approach is to store the real-time data in different tables from
historical data, but in the same table structure. Then by using database views, the historical and
real-time data tables are combined together so that they look like one logical table from the query
tool's perspective. This helps alleviate many of the problems associated with the separate
partition approach, as the query tool or end users do not need to join two tables.

This approach is similar to the direct fact table feed, except the feed is into smaller tables that can
be more easily modified or replaced when necessary by the load process. Also these tables will
generally be small enough to sit in the database's memory cache, to alleviate some of the query
contention issues involved with performing OLAP queries on changing data. The caching
concerns discussed in Solution 2a still apply, and care needs to be taken to ensure that the query
tool doesn't return old cache results to users who are requesting real-time data.
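
A minimal sketch of such a view, assuming a large historical fact table and a small real-time table
with the same structure (all names are illustrative):

    CREATE VIEW SALES_FACT_V AS
    SELECT SALE_TS, STORE_ID, PRODUCT_ID, SALES_AMT FROM SALES_FACT_HIST
    UNION ALL
    SELECT SALE_TS, STORE_ID, PRODUCT_ID, SALES_AMT FROM SALES_FACT_TODAY;

The query tool is pointed at SALES_FACT_V, so historical and real-time rows appear as one
logical fact table, while the load process only ever touches (or swaps out) SALES_FACT_TODAY.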

Solution 2d: Modeling with an External Real-time Data Cache

When using an external real-time data cache, no special data modeling is required in the data
warehouse. The external data cache database is generally modeled identically to the data
warehouse, but typically contains only the tables that are real-time.

If the external data cache is accessed separately from the data warehouse (using a separate
OLAP project, for example), some additional tables may be required in the cache, such as lookup
tables. But external data caching is most useful when the data is seamlessly integrated with
historical data for query and analysis purposes.

The advantage of this approach from a data modeling perspective is similar to using a direct fact
table feed or performing integrated real-time through views (discussed in Solutions 2a and 2c).
This approach has the additional advantage of eliminating performance problems associated with
the other approaches to integrating real-time data into a data warehouse. These approaches are
discussed in more detail in Solutions 4c, 4d, and 4e.

Challenge #3: OLAP Queries vs. Changing Data


OLAP and Query tools were designed to operate on top of unchanging, static historical data.
Since they assume that the underlying data is not changing, they don't take any precautions to
ensure that the results they produce are not negatively influenced by data changes concurrent to
query execution. In some cases, this can lead to inconsistent and confusing query results.

Relational OLAP tools are particularly sensitive to this problem because they perform all but the
simplest data analysis operations by issuing multi-pass SQL. A multi-pass SQL statement is
made up of many smaller SQL statements that sequentially operate on a set of temporary tables.
For example, consider a simple report that shows sales for a retailer at the category level along
with a percent-to-total calculation. While this report is simple, the SQL that a leading OLAP tool
generates for it is surprisingly complex: two temporary tables are created and loaded with
intermediate results, then a final pass is run, and the two temp tables are dropped. The first
temporary table is loaded with the totals for sales at the category level. The second temp table
holds only one row, the total sales for all categories. The final SELECT statement then joins the
two temporary tables together with the fact table to generate the final result set.
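
A simplified sketch of what this multi-pass SQL might look like is shown below; the table and
column names are illustrative, the timings are those used in the discussion that follows, and a
real OLAP tool's generated SQL (which would also join back through the fact and lookup tables)
would be considerably longer:

    /* 0:00 */ CREATE TABLE TEMP1 (CATEGORY_ID INTEGER, CATEGORY_SALES DECIMAL(18,2));
    /* 0:01 */ INSERT INTO TEMP1
               SELECT p.CATEGORY_ID, SUM(f.SALES_AMT)
               FROM SALES_FACT f JOIN PRODUCT_DIM p ON f.PRODUCT_ID = p.PRODUCT_ID
               GROUP BY p.CATEGORY_ID;        -- runs for four seconds
    /* 0:05 */ CREATE TABLE TEMP2 (TOTAL_SALES DECIMAL(18,2));
    /* 0:06 */ INSERT INTO TEMP2
               SELECT SUM(f.SALES_AMT) FROM SALES_FACT f;
    /* 0:08 */ SELECT t1.CATEGORY_ID, t1.CATEGORY_SALES,
                      t1.CATEGORY_SALES / t2.TOTAL_SALES AS PCT_OF_TOTAL
               FROM TEMP1 t1 CROSS JOIN TEMP2 t2;
    /* 0:10 */ DROP TABLE TEMP1;
               DROP TABLE TEMP2;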

This works fine when the data is static, but what if the underlying data changes while the first
temp table is being created? Most database systems will return the data that was current at the
point that each statement started to run. Notice the times noted before each statement. The INSERT
statement into TEMP1 ran for four seconds, and the query to load data into TEMP2 started at
second 6. This means that TEMP1 will contain data current as of 0:01, but TEMP2 will contain
data current as of 0:06. If a few large sales were registered during those 5 seconds, they will be
included in the total dollar amount contained in TEMP2, but won't be represented in the category-
level data that is in TEMP1. When the data is brought together in the final SELECT statement, the
total in TEMP2 will be larger than the sum of the categories in TEMP1. This will cause the
resulting report to show an incorrect total, and the total percentage number will be less than
100%! Also, in this example, the result that is displayed, besides being internally inconsistent, is
already 10 seconds old.

This example actually represents the best case for most real-world relational OLAP systems.
Many users experience query response times that run from tens of seconds to multiple hours. For
a well-tuned application, acceptable response times typically run from 15 seconds to 2-5 minutes.
Also typical OLAP queries can contain many pages of SQL code. While our example contained
only 6 passes of SQL, it is not uncommon for reports to consist of 10-50 passes.

This presents two problems. The first problem is that the results of a query that takes even one
minute are arguably not exactly real-time anymore. While this data latency may be acceptable to
a retail division manager, it might not be acceptable for an application that is looking for atmospheric
trends that indicate the presence of a tornado, or for an application detecting real-time credit card
or telecommunications fraud.

The second problem is that given the multiple passes of SQL required to perform almost any
relational OLAP reporting or analytical operation, any real-time warehouse is likely to suffer from
the result set internal inconsistency issue discussed above. There's nothing like the numbers not
adding up properly to make a user skeptical of a report. For more complex product affinity or
trend detection analytics, the results may be so confusing as to be completely useless.

The next three solution sections discuss ways to alleviate this problem while still retaining real-
time or near-real-time reporting and analytics.

Solution 3a: Use a Near Real-time Approach

As with the first challenge, the easiest way around this one is to avoid the whole issue in the first
place. The report consistency issue is only a problem when the data is changing quickly enough
so that the data will be different at the end of a multi-pass query execution cycle than it was at the
beginning. Using either the Near-real-time ETL approach (Solution 1a) or the Trickle & Flip
approach with a relatively long cycle time (Solution 1c) will alleviate this problem if the OLAP
server is instructed not to send new jobs to the data warehouse during the load or flip.

The downside is that the data is not truly real-time, but is "delayed," much like the quotes on some
free stock market sites on the Internet. Also, if the ETL or flipping process takes too long, it is
possible that users will
become frustrated by delayed or timed-out queries. Care must be taken to ensure that the OLAP
server can be properly paused during the load or flip.

Solution 3b: Risk Mitigation for True Real-time

Assuming a near-real-time approach is not sufficient for an application, there are some ways to
mitigate the effects of report data inconsistency. The simplest is to not allow users to perform the
most complex queries on the real-time data. An alternative is to have a less-frequently updated
snapshot of the real-time data in a separate partition that can be used for complex analytical
queries. This however adds a lot of setup, maintenance, and user education complexity.

If an application is going to need to live with some internal report inconsistency, it is important to
educate the users of the real-time information that this is a possibility. Uneducated users who
view data that doesn't properly add up are likely to assume that the system is malfunctioning and
can't be trusted.

Solution 3c: Use an External Real-time Data Cache

The only way to completely solve this problem without compromising report internal consistency,
data latency, or the user experience is to use an external real-time data cache. By keeping the
real-time data separate from the historical data, the reports will never be internally inconsistent. It
is still possible though to do seamless reporting on historical and real-time data using a just-in-
time data merge process, which is described in Solution 4d below. Also the issue of reporting
latency can be addressed using a reverse just-in-time data merge into a memory-based real time
data cache. This approach is discussed in Solution 4e.

Challenge #4: Scalability & Query Contention


Query contention and scalability is the most difficult issue facing organizations
deploying real-time data warehouse solutions. Data warehouses were separated from
transactional systems in the first place because the complex analytical queries run against
warehouses don't "play well" with lots of simultaneous inserts, updates, or deletes.

Usually the scalability of data warehouse and OLAP solutions is a direct function of the amount of
data being queried and the number of users simultaneously running queries. Given a fixed
amount of data, query response time grows roughly in proportion to the number of users on the
system: lots of concurrent usage causes reports to take longer to execute.

While this is still true in a real-time system, the additional burden of continuously loading and
updating data further strains system resources. Unfortunately, because of the contention between
data inserts and typical OLAP SELECT statements, a continuous data load adds far more burden
than one or two additional querying users would. While the details depend on the database, the
contention between complex selects and continuous inserts tends to severely limit scalability.
Surprisingly quickly, the continuous data loading process may become blocked, or what used to
be fast queries may begin to take intolerably long to return.

There are ways to get around this problem, including the near-real-time approaches described in
previous sections, but where true real-time is a hard and fast requirement, the approaches
described below help to address this problem in various ways.

Solution 4a: Simplify and Limit Real-time Reporting

Many real-time warehousing applications are relatively simple. Users that want to see up-to-the-
second data may have relatively simple reporting requirements. If reports based on real-time data
can be limited to simple and quick single-pass queries, many relational database systems will be
able to handle the contention that is introduced. Frequently, the most complex queries in a data
warehouse access data spanning long periods of time. If these queries can be based
only on the non-changing historical data, contention with the real-time load is eliminated.

Another important consideration is to examine who really needs to be able to access the real-time
information. While real-time data may be interesting to a large group of users within an
organization, the needs of many users may be adequately met with non-real-time data, or with
near-real-time solutions.

Also many users who may be interested in real-time data may be better served by an alert
notification application that sends them an email or wireless message alerting them to real-time
data conditions that meet their pre-defined thresholds. Designed properly, these types of systems
can be scaled to 100 or 1000 times more users than could possibly run their own concurrent real-
time warehouse queries (see Section 5 for more details).

Solution 4b: Apply More Database Horsepower

There is always the option of adding more hardware to deal with scalability problems. More nodes
can be added to a high-end SMP database system, or a stand-alone warehouse box can be
upgraded with faster processors and more memory. While this approach may overcome short-
term scalability problems, it is likely to be only a band-aid. Real-time query
contention often has more to do with the fundamental design of an RDBMS than with the system
resources available.

Solution 4c: Separate & Isolate in a Real-time Data Cache


Using an external real-time data cache (described above in Solution 1d) solves the scalability and
query contention problem by routing all real-time data loading and query activity to an
independent database that can be tuned for real-time access. With all the real-time activity on the
separate cache database, the data warehouse does not bear any additional load.

However this approach, on its own, is not an entirely satisfactory solution. With the real-time data
external to the warehouse, it is not possible for a single report or analysis to join or co-display
real-time and historical information. Further, if complex analytical reports are run on the real-time
cache, it is possible for the cache to begin to exhibit the same internal report inconsistency,
database contention, and scalability problems that a warehouse would exhibit.

This approach on its own, without the additional functionality described in the next two Solutions,
does work well for some stand-alone real-time reporting, analysis, and alerting applications,
particularly those with few concurrent users and those with limited or no need for historical data.

Solution 4d: Just-in-time Information Merge from External Data Cache

There is a class of applications whose needs are not met by the approaches described so far in
this paper. They tend to share two or more of the following characteristics:

• Require true real-time data (not near-real-time)
• Involve rapidly-changing data (10-1000 transactions per second)
• Need to be accessed by 10s, 100s, or 1000s of concurrent users
• Analyze real-time data in conjunction with historical information
• Involve complex, multi-pass, analytical OLAP queries

These applications require the best aspects of a traditional data warehouse, such as access to
large amounts of data, analytical depth, and massive scalability. They also require the access to
real-time data and the processing speed provided by a real-time data cache. For these applications
it is necessary to use a hybrid approach: the real-time information sits in an external data cache,
the historical information sits in the warehouse, and the two are efficiently linked together as
needed. This can be accomplished by an approach known as just-in-time information merging
(JIM).

In a JIM system, queries that require real-time data are pre-processed by a component known
as the JIM Request Analyzer (JIM-RA). The JIM-RA analyzes the query to determine exactly
what real-time data components are required. Then the JIM Data Imager (JIM-DI) component
takes a snapshot image of the required parts of the real-time data cache. The real-time data from
the snapshot is then loaded into temporary tables in the data warehouse by the JIM-DI
component. Once the real-time data is present, the JIM-RA modifies the original query to include
the temporary tables containing the snapshot data.

This approach allows existing BI and OLAP tools to seamlessly access real-time information
within the framework of existing data warehouse applications. There is no possibility of
introducing scalability problems, as the required real-time data is only brought into the warehouse
when needed, and on a temporary independent one-off basis for each query. Data modeling is
relatively simple, and can be approached as described above in Solutions 2a and 2b. Query
contention is not a problem as the data does not change while it is being queried. Also there is no
risk of report inconsistency, as the real-time information is held constant in the snapshot between
the multiple passes of SQL.
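
As a rough sketch of how a JIM-rewritten query might look, assuming a hypothetical per-query
snapshot table and illustrative warehouse tables (the steps follow the description above; the SQL
itself is only a sketch):

    -- Step 1: JIM-DI images the real-time cache rows the query needs into a
    --         per-query snapshot table in the warehouse
    CREATE TABLE JIM_RT_10542 (PRODUCT_ID INTEGER, SALES_AMT DECIMAL(18,2));
    -- (rows are bulk-loaded here from the real-time data cache)

    -- Step 2: JIM-RA rewrites the original query so the snapshot is unioned
    --         with the historical fact table
    SELECT p.CATEGORY_ID, SUM(s.SALES_AMT) AS TOTAL_SALES
    FROM (SELECT PRODUCT_ID, SALES_AMT FROM SALES_FACT
          UNION ALL
          SELECT PRODUCT_ID, SALES_AMT FROM JIM_RT_10542) s
         JOIN PRODUCT_DIM p ON p.PRODUCT_ID = s.PRODUCT_ID
    GROUP BY p.CATEGORY_ID;

    -- Step 3: the snapshot table is dropped once the query completes
    DROP TABLE JIM_RT_10542;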

Solution 4e: Reverse Just-in-time Data Merge


A variant of JIM is Reverse Just-in-time Data Merging (RJIM). Reverse JIM is useful for queries
that are mainly based on real-time data, but that contain limited historical information as well. In
Reverse JIM a similar process takes place, except the needed historical information is loaded
from the data warehouse into the external data cache on a temporary basis, and then the query is
run in the data cache. This only works when the data cache is located in an RDBMS with
full SQL support; it will not work with some IMDB systems that do not support many SQL
functions.

An intelligent RJIM system needs to process as much of the query as possible on the warehouse,
transferring only the data needed, at the highest level of summarization possible for each
query. Otherwise an RJIM system could easily overflow the external data cache with large
amounts of historical information. The best systems are able to use both JIM and RJIM, deciding
which process to use for each query based on the amount of data likely to move in
either direction and choosing the path of least resistance and likely best performance.
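
For illustration, if a query needs historical sales only at the category level alongside the real-time
detail, a sketch of the RJIM transfer (with illustrative names) might be: run the summarization on
the warehouse, then load only the resulting handful of rows into a small table in the cache.

    -- Run on the warehouse: summarize history at the level the query needs
    SELECT p.CATEGORY_ID, SUM(f.SALES_AMT) AS HIST_SALES
    FROM SALES_FACT f JOIN PRODUCT_DIM p ON f.PRODUCT_ID = p.PRODUCT_ID
    GROUP BY p.CATEGORY_ID;

    -- Run in the real-time data cache: load only those summary rows, then
    -- join them to the continuously updated real-time table
    INSERT INTO CACHE_HIST_SUMMARY (CATEGORY_ID, HIST_SALES)
    VALUES (:category_id, :hist_sales);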

While this article has given an overview of JIM and RJIM approaches, the details regarding how
to implement such systems are beyond its scope, and the technologies and tools required are
quickly developing.

Challenge #5: Real-time Alerting


Most alerting applications associated with data warehouses to date have been mainly used to
distribute email versions of reports after the nightly data warehouse load. The availability of real-
time data in a data warehouse makes alerting applications much more appealing, as users can
be alerted to real-time conditions as they occur in the warehouse, not just on a nightly basis.

The availability of real-time data makes products such as MicroStrategy's NarrowCaster and
similar products from Cognos and Business Objects very valuable. But real-time alerting using
these products brings its own set of challenges, as surprisingly these products, like many query
tools from the same vendors, were not designed to operate on or tested against real-time data
feeds.

These products operate on a schedule or event basis, so they can either trigger an alert every
few minutes or hours, or need to be triggered by an external system. These challenges will be
addressed in Solutions 5a and 5b below. There is also the issue of threshold management. When
alerts are triggered frequently (as opposed to once a night upon warehouse load), there needs to
be a mechanism in place to ensure that once an alert has been sent for a given condition in the
warehouse, it is not sent again and again on every subsequent alerting cycle.
A way to solve this problem is presented in Solution 5c below.

Solution 5a: n-Minute Cycle Schedule

Currently all data warehouse alerting technology works on a schedule basis or on an event basis.
This means to perform true real-time alerting, a process needs to continuously monitor the
incoming data and trigger the events when appropriate. This approach is discussed in Solution 5b
below.

One way to approximate real-time alerting, without the added complexity of a real-time data
stream monitoring solution, is to utilize a data warehouse alerting package on a scheduled basis,
with the schedule typically set to every 1, 5, 15, or 30 minutes. This approach works reasonably
well and provides near-real-time alerting. For a warehouse that is loaded on a near-real-time
basis, all that needs to be done is to set the alerting schedule to trigger right after the data is
refreshed.

For a warehouse that is being updated on a truly real-time basis, using an n-minute cycle
schedule will introduce a certain amount of latency, as an alert can't be triggered until the next
cycle window comes around after the threshold condition is met. If the cycle time is low enough,
and the alert is to be sent via email anyway, a 1 or 5 minute latency may be acceptable for many
applications.

It is appealing to set the cycle time as low as possible, but this can introduce certain
complications. Frequent alert cycles will introduce more load on the data warehouse, which can
impact performance for other users. Also, if a cycle begins before the previous cycle has
completed, it is possible for users to receive duplicate alerts (see Solution 5c below). Some tools
will overwrite temporary work tables if another batch is triggered before the previous one
completes, which can cause system errors or missing alerts.

Until more sophisticated alert cycle pipelining technology is added to current alerting tools, it is
important to choose a cycle window that will provide enough time for the previous batch of alerts
to be completed before the next begins.

Solution 5b: True real-time data monitoring & triggering

For true real-time data alerting, a triggering system needs to be in place, as existing data
warehouse alerting systems are not capable of monitoring real-time data streams looking for
exception conditions. General-purpose versions of such technologies are currently under
development by companies such as Apama, but in the short term, custom solutions optimized for
the task and data stream at hand are the best option.

These systems are fairly complex, and depending on the number of users and alert conditions,
and on the amount of incoming data, can require large amounts of hardware and particularly
memory. Imagine a system providing real-time stock market alerts for 100,000 users. All 100,000
portfolios and alert thresholds need to be stored in memory and compared against every
incoming tick from the various stock markets, which can arrive at hundreds or thousands per second
during market hours. This is a very difficult task, and currently there are no general-purpose
technologies that can meet these needs.

Solution 5c: Real-time Alert Threshold Management

Regardless of whether alerts are generated using near-real-time batch cycles or by a real-time
triggering system, it is critically important that the users' alert thresholds are properly managed.
Imagine a user who asks to be alerted by email when the inventory level of any item in his or her
store drops to 5% of the normal level during the course of a day. The user assumes that he or she
will be alerted once when the level drops to 5%, but with a system running on a 5-minute cycle, a
new alert will be sent every 5 minutes!

The problem occurs when static threshold definitions, which are fine for systems that load on a
nightly or weekly basis, are applied to systems that update more frequently. It is unlikely that the
user desires to be reminded every 5 minutes. It might be better to alert the user once levels reach
5%, and then maybe again at 2%, and then one final time when stock runs out completely. Or a
brokerage customer may want to know when a certain stock exceeds $20, and then be alerted
again every increment of $2 it goes higher.

This type of threshold management is necessary for real-time alert systems to be accepted by
users. Unfortunately it is not natively supported by current warehouse alerting tools. Until this
support is added, the best approach is to use the tools' post-service plug-in abilities to run custom
SQL or procedures to directly update the user's thresholds based on the current data conditions.
It is this post-execution job that makes it critical that cycles don't overlap: if the next batch
begins before the thresholds are updated, duplicate alerts are likely to be sent.
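
A sketch of what such a post-execution threshold update might look like for the inventory example
above (the table and column names are hypothetical):

    -- After an alert fires, step the user's threshold down so the same
    -- condition does not re-fire on every subsequent cycle
    UPDATE ALERT_THRESHOLD
    SET THRESHOLD_PCT = CASE
                          WHEN THRESHOLD_PCT > 2 THEN 2  -- 5% alert sent; alert next at 2%
                          ELSE 0                         -- 2% alert sent; alert last at stock-out
                        END
    WHERE USER_ID = :user_id
      AND ITEM_ID = :item_id;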

Summary: Real-time with Today's Technology


As we have seen in this article, real-time data warehousing and OLAP are possible using today's
technology, but challenges lurk seemingly every step of the way. For the determined team armed
with the right knowledge and experience, it is possible to make real-time reporting, analysis, and
alerting systems work. The challenge is to make the right tradeoffs along the way, to make sure
the systems meet the needs of the user base while ensuring that they don't collapse under their
own weight or cause existing production warehouses to malfunction.

It is likely that a lot of the challenges discussed in this paper will become less challenging over
time, as database, ETL, OLAP, reporting, and alerting tool vendors begin to add features to their
systems to make them work better with real-time data streams. In the meantime, it is important to
make sure real-time warehousing systems are well planned and designed, and thoroughly tested
under realistic data and user load conditions before they are deployed.

The benefits of data warehousing in real-time are becoming clearer every day. With the right
tools, designs, advice, approaches, and in some cases tricks, real-time data warehousing is
possible using today's technologies, and will only become easier in the future. In any case, the
time to begin planning and prototyping is now.
