
External/Unstructured Data and the Data Warehouse


External data and Data Warehouse
Enterprises are increasingly information-centric,
and recent trends reveal that most competitive
businesses require external data in their
enterprise data warehouse to strategically
position themselves in the market. This article
touches upon a few critical challenges specific to
integrating data from external systems as well as
best practices and considerations to do so
successfully.
To build a powerful data warehouse you must include as
much relevant data from internal and external sources as
possible to optimize managers' decision processes.
As an example, retailers have current and historic sales data
along with pricing information, but this provides only
partial insight into the factors that are driving sales.
Information such as weather, income tax distribution
periods, regional or local population growth, and household
demography may also play a key role in driving sales
and must be taken into consideration.
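As a minimal sketch of this idea, the following Python snippet enriches internal sales rows with external weather observations by matching on date and zip code. The field names (sale_date, obs_date, zip, units, high_temp_f), the function name, and the sample rows are all hypothetical, chosen only for illustration.

```python
from datetime import date

# Hypothetical internal sales rows (field names are illustrative assumptions).
sales = [
    {"sale_date": date(2009, 7, 1), "zip": "48161", "units": 120},
    {"sale_date": date(2009, 7, 2), "zip": "48161", "units": 95},
]

# Hypothetical external weather rows keyed by the same date and zip code.
weather = [
    {"obs_date": date(2009, 7, 1), "zip": "48161", "high_temp_f": 88},
]


def enrich_sales(sales_rows, weather_rows):
    """Left-join sales to weather on (date, zip); unmatched rows get None."""
    lookup = {(w["obs_date"], w["zip"]): w["high_temp_f"] for w in weather_rows}
    return [
        {**s, "high_temp_f": lookup.get((s["sale_date"], s["zip"]))}
        for s in sales_rows
    ]


enriched = enrich_sales(sales, weather)
```

A left join is used here so that no internal sales row is lost when the external source has no matching observation.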
The government and many other organizations capture this data and
distribute it free or for a nominal fee. Here are some examples:
Weather: Yahoo offers an RSS feed that can be called using an HTTP
request as follows:
http://weather.yahooapis.com/forecastrss?p=48161
The parameters to the request are the following:
Parameter   Description                              Examples
p           US zip code or Location ID               p=95089 or p=USCA1116
u           Units for temperature (case sensitive)   f: Fahrenheit or c: Celsius
The RSS response from this request includes the following information:
Geographic latitude/longitude
Weather Conditions (48 distinct codes)
Temperature (F,C)
Forecast (Condition, High Temperature, Low Temperature)
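The request described above could be assembled along the following lines. This is only a sketch: the endpoint and the p and u parameters come from the example above, while the function name build_forecast_url is hypothetical, and the service may no longer be available at this address.

```python
from urllib.parse import urlencode

# Endpoint taken from the example request above; it may no longer be in service.
BASE_URL = "http://weather.yahooapis.com/forecastrss"


def build_forecast_url(location, units=None):
    """Build the request URL from p (location) and the optional u (units)."""
    params = {"p": location}
    if units is not None:
        # The u parameter is case sensitive: only lowercase "f" or "c" is valid.
        if units not in ("f", "c"):
            raise ValueError("units must be 'f' or 'c'")
        params["u"] = units
    return f"{BASE_URL}?{urlencode(params)}"


url = build_forecast_url("48161")
```

Validating the units value before sending the request avoids a round trip that the case-sensitive parameter would otherwise reject or misinterpret.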
Combining demographic data with internally generated data
can go a long way toward enhancing the data warehouse. The following
are examples of where this data can be obtained:
http://www.geolytics.com/?gclid=CMeliJqrqJ0CFU1M5QodekqHkA
With limited data (address or lat/long information) you can get
60 demographic attributes for that address, including factors
such as income, average number of people per home, average
age, and education.
Likewise another good site for demographic data and data
validation is:
http://www.melissadata.com/dqt/index.htm
This site offers validation of addresses, phone numbers, and
email addresses, and performs name parsing, via Web Services calls.
This can help accelerate the ETL development process because you
do not have to develop the code or maintain large demographic
databases onsite. Additionally, this site offers demographic data
on income, media locations, reverse phone and mailing lists.
Finally, the Federal government maintains thousands of
databases with data gathered from various agencies that contain
information that can be coupled with internal data to make your
data warehouse far more powerful. For example:
http://research.stlouisfed.org/fred2/
http://www.data.gov/catalog
http://www.census.gov/
These sites contain historic economic and
demographic data the government has collected
regarding income, population, interest rates,
commodity prices, and housing sales. The
downloads are free.
The goal of data warehouse development should be
to provide the tools and data for optimal decision
making. To ensure this goal is achieved, make sure
external sources are also included in the initial and
ongoing data warehouse implementation.
External/Unstructured Data in the Data Warehouse
Several issues relate to the use and storage of
external and unstructured data in the data
warehouse:
1. Its frequency of availability
2. Its total lack of discipline
3. Its unpredictability
There are several methods to capture and store
unstructured information, such as:
1. Near-line storage
2. Creating two stores of unstructured data
Meta Data and External Data
Meta data is vital because through it external data is registered, accessed,
and controlled in the data warehouse environment. The importance of meta
data is best understood by noting what it typically encompasses:
Document ID
Date of entry into the warehouse
Description of the document
Source of the document
Date of source of the document
Classification of the document
Index words
Purge date
Physical location reference
Length of the document
Related references
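One way to picture such a metadata record is as a simple typed structure, sketched below in Python. The class name, the field names, the helper find_by_index_word, and the choice of bytes as the unit of length are assumptions made for illustration; only the list of items comes from the text above.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List


@dataclass
class ExternalDocumentMetadata:
    """One metadata record per external document (names are illustrative)."""
    document_id: str
    entry_date: date         # date of entry into the warehouse
    description: str
    source: str
    source_date: date        # date of source of the document
    classification: str
    purge_date: date
    physical_location: str   # physical location reference
    length: int              # length of the document (bytes is an assumption)
    index_words: List[str] = field(default_factory=list)
    related_references: List[str] = field(default_factory=list)


def find_by_index_word(records, word):
    """Return records whose index words include the word (case-insensitive)."""
    wanted = word.lower()
    return [r for r in records if wanted in (w.lower() for w in r.index_words)]


# Hypothetical sample record, for illustration only.
sample = ExternalDocumentMetadata(
    document_id="DOC-001",
    entry_date=date(2009, 10, 1),
    description="Regional housing sales report",
    source="Hypothetical agency",
    source_date=date(2009, 9, 15),
    classification="economic",
    purge_date=date(2012, 10, 1),
    physical_location="near-line/vol1/doc-001",
    length=20480,
    index_words=["Housing", "sales"],
)
matches = find_by_index_word([sample], "housing")
```

Searching the metadata rather than the documents themselves reflects the point above: external data is registered and accessed through its metadata.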
Storing External/Unstructured Data
External data and unstructured data can actually be
stored in the data warehouse if it is convenient and
cost-effective to do so.
Storing external data and unstructured data
requires considerable resources.
By associating external data and unstructured
data with a data warehouse, the external data and
the unstructured data become available to all parts
of the organization, such as finance, marketing,
accounting, sales, engineering and so forth.
Modeling and External/Unstructured data
What is the relationship between the data model
and external data? See below (figure 8.6).
Archiving External data
Every piece of information, external or
otherwise, has a useful lifetime.
Once past that lifetime, it is not economical
to keep the information. An essential part of
managing external data is deciding what the
useful lifetime of the data is.
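Once a useful lifetime has been decided, applying it can be mechanical. The sketch below assumes each external document carries a purge date and splits a collection into documents to keep and documents to archive; the function name, the dictionary keys, and the sample dates are hypothetical.

```python
from datetime import date


def split_by_lifetime(documents, today):
    """Partition documents into (keep, archive) by comparing purge_date to today."""
    keep = [d for d in documents if d["purge_date"] > today]
    archive = [d for d in documents if d["purge_date"] <= today]
    return keep, archive


# Hypothetical documents, for illustration only.
docs = [
    {"document_id": "A", "purge_date": date(2010, 1, 1)},
    {"document_id": "B", "purge_date": date(2015, 1, 1)},
]
keep, archive = split_by_lifetime(docs, today=date(2012, 6, 30))
```

Passing the current date in as an argument, rather than reading the clock inside the function, keeps the rule easy to test and to rerun for a past date.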
Comparing Internal data to external data
One of the most useful things to do with
external data is to compare it to internal
data over a period of time. The comparison
gives management a unique perspective.
For instance, being able to contrast
immediate and personal activities against
global activities and trends allows an
executive to gain insights that are simply not
possible elsewhere.
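A simple form of such a comparison is to set the growth of an internal measure against the growth of an external one over the same periods. The following sketch assumes two equally spaced series with nonzero values; the function names and the sample figures are hypothetical.

```python
def period_growth(series):
    """Return period-over-period fractional growth for a list of values."""
    return [(cur - prev) / prev for prev, cur in zip(series, series[1:])]


def relative_performance(internal, external):
    """Internal growth minus external growth for each period."""
    return [
        i - e
        for i, e in zip(period_growth(internal), period_growth(external))
    ]


# Hypothetical figures: internal sales versus an external market index.
internal_sales = [100, 110, 121]
market_index = [200, 210, 210]
gap = relative_performance(internal_sales, market_index)
```

A positive value in the result means the internal measure grew faster than the external one in that period, and a negative value means it lagged.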
