
Chapter 11:

Data Warehousing
Modern Database Management, 8th Edition
Jeffrey A. Hoffer, Mary B. Prescott, Fred R. McFadden
© 2007 by Prentice Hall

Objectives

- Definition of terms
- Reasons for the information gap between information needs and availability
- Reasons for the need for data warehousing
- Describe the three levels of data warehouse architectures
- List the four steps of data reconciliation
- Describe the two components of the star schema
- Estimate fact table size
- Design a data mart


Definition

Data Warehouse:
- A subject-oriented, integrated, time-variant, nonupdatable collection of data used in support of management decision-making processes
  - Subject-oriented: e.g., customers, patients, students, products
  - Integrated: consistent naming conventions, formats, encoding structures; from multiple data sources
  - Time-variant: can study trends and changes
  - Nonupdatable: read-only, periodically refreshed

Data Mart:
- A data warehouse that is limited in scope


Need for Data Warehousing

- Integrated, company-wide view of high-quality information (from disparate databases)
- Separation of operational and informational systems and data (for improved performance)


[Figure omitted; source adapted from Strange (1997).]


Data Warehouse Architectures

- Generic two-level architecture
- Independent data mart
- Dependent data mart and operational data store
- Logical data mart and real-time data warehouse
- Three-layer architecture

All involve some form of extraction, transformation, and loading (ETL)

Figure 11-2 Generic two-level data warehousing architecture (E, T, L)

- One, companywide warehouse
- Periodic extraction: data is not completely current in the warehouse

Figure 11-3 Independent data mart data warehousing architecture

- Data marts: mini-warehouses, limited in scope
- Separate ETL for each independent data mart
- Data access complexity due to multiple data marts

Figure 11-4 Dependent data mart with operational data store: a three-level architecture

- ODS provides an option for obtaining current data
- Single ETL for the enterprise data warehouse (EDW)
- Dependent data marts loaded from the EDW
- Simpler data access

Figure 11-5 Logical data mart and real-time data warehouse architecture

- ODS and data warehouse are one and the same
- Near real-time ETL for the data warehouse
- Data marts are NOT separate databases, but logical views of the data warehouse
- Easier to create new data marts

Figure 11-6 Three-layer data architecture for a data warehouse


Data Characteristics
Status vs. Event Data

Figure 11-7 Example of DBMS log entry

- Event = a database action (create/update/delete) that results from a transaction
- Status = the state of the data before and after the event (shown surrounding the event in the log entry)

Data Characteristics
Transient vs. Periodic Data

Figure 11-8 Transient operational data

- With transient data, changes to existing records are written over previous records, thus destroying the previous data content

Figure 11-9 Periodic warehouse data

- Periodic data are never physically altered or deleted once they have been added to the store
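To make the overwrite-versus-append distinction concrete, here is a small Python sketch (not from the text); the account records and dates are illustrative assumptions:

```python
# Minimal illustration of transient vs. periodic data handling.
# The account/balance record layout is an illustrative assumption.
from datetime import date

# Transient (operational) store: an update overwrites the previous record,
# destroying the old content.
transient = {"acct_100": {"balance": 500}}
transient["acct_100"]["balance"] = 350          # previous value 500 is lost

# Periodic (warehouse) store: each change is appended with a time stamp;
# rows are never physically altered or deleted once added.
periodic = [
    {"acct": "acct_100", "balance": 500, "as_of": date(2006, 3, 1)},
]
periodic.append({"acct": "acct_100", "balance": 350, "as_of": date(2006, 4, 1)})

# The full history remains available for trend analysis.
history = [row for row in periodic if row["acct"] == "acct_100"]
print(history)
```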

Other Data Warehouse Changes

- New descriptive attributes
- New business activity attributes
- New classes of descriptive attributes
- Descriptive attributes become more refined
- Descriptive data are related to one another
- New source of data


The Reconciled Data Layer

Typical operational data is:
- Transient: not historical
- Not normalized (perhaps due to denormalization for performance)
- Restricted in scope: not comprehensive
- Sometimes poor quality: inconsistencies and errors

After ETL, data should be:
- Detailed: not summarized yet
- Historical: periodic
- Normalized: 3rd normal form or higher
- Comprehensive: enterprise-wide perspective
- Timely: data should be current enough to assist decision-making
- Quality controlled: accurate with full integrity


The ETL Process

- Capture/Extract
- Scrub, or data cleansing
- Transform
- Load and Index

ETL = extract, transform, and load (a minimal sketch of the four steps follows)
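To make the four steps concrete, here is a minimal, hypothetical Python sketch of a reconciliation pipeline; the database files, table names, and columns are illustrative assumptions rather than the book's example:

```python
# A minimal, hypothetical sketch of the four reconciliation steps end to end,
# using only the Python standard library. The source "sales" table is assumed
# to exist in operational.db.
import sqlite3
from datetime import date

def capture_extract(source_db: str) -> list[tuple]:
    """Capture/Extract: take a snapshot of a chosen subset of the source data."""
    with sqlite3.connect(source_db) as conn:
        return conn.execute(
            "SELECT product_id, store_id, quantity_sold, sale_date FROM sales"
        ).fetchall()

def scrub(rows: list[tuple]) -> list[tuple]:
    """Scrub: drop rows with missing key fields (simple error detection/logging)."""
    clean = [r for r in rows if None not in r]
    print(f"rejected {len(rows) - len(clean)} rows")
    return clean

def transform(rows: list[tuple]) -> list[tuple]:
    """Transform: reformat to the warehouse layout and add a load time stamp."""
    load_date = date.today().isoformat()
    return [row + (load_date,) for row in rows]

def load_and_index(rows: list[tuple], warehouse_db: str) -> None:
    """Load and Index: write to the warehouse table and build an index."""
    with sqlite3.connect(warehouse_db) as conn:
        conn.execute("CREATE TABLE IF NOT EXISTS sales_fact "
                     "(product_id, store_id, quantity_sold, sale_date, load_date)")
        conn.executemany("INSERT INTO sales_fact VALUES (?, ?, ?, ?, ?)", rows)
        conn.execute("CREATE INDEX IF NOT EXISTS idx_fact_date ON sales_fact(sale_date)")

if __name__ == "__main__":
    load_and_index(transform(scrub(capture_extract("operational.db"))), "warehouse.db")
```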

Figure 11-10 Steps in data reconciliation

- Capture/Extract: obtaining a snapshot of a chosen subset of the source data for loading into the data warehouse
  - Static extract = capturing a snapshot of the source data at a point in time
  - Incremental extract = capturing changes that have occurred since the last static extract (contrasted in the sketch below)
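A hypothetical illustration of the two extract styles, assuming a sales table with an updated_at change column:

```python
# Hypothetical contrast between a static and an incremental extract.
# The sales table and its updated_at column are illustrative assumptions.
import sqlite3

def static_extract(conn: sqlite3.Connection) -> list[tuple]:
    """Snapshot of the chosen subset of source data at a point in time."""
    return conn.execute("SELECT * FROM sales").fetchall()

def incremental_extract(conn: sqlite3.Connection, last_extract_ts: str) -> list[tuple]:
    """Only the changes that have occurred since the last extract."""
    return conn.execute(
        "SELECT * FROM sales WHERE updated_at > ?", (last_extract_ts,)
    ).fetchall()
```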

Figure 11-10 Steps in data reconciliation (cont.)

- Scrub/Cleanse: uses pattern recognition and AI techniques to upgrade data quality
  - Fixing errors: misspellings, erroneous dates, incorrect field usage, mismatched addresses, missing data, duplicate data, inconsistencies
  - Also: decoding, reformatting, time stamping, conversion, key generation, merging, error detection/logging, locating missing data
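A hypothetical sketch of two common cleansing tasks, fixing erroneous dates by trying known patterns and flagging duplicate records; the date formats and field names are assumptions:

```python
# Hypothetical pattern-based cleansing helpers: normalizing inconsistent date
# formats and flagging duplicate customer records.
from datetime import datetime

DATE_FORMATS = ["%m/%d/%Y", "%Y-%m-%d", "%d-%b-%Y"]   # e.g. 03/15/2006, 2006-03-15, 15-Mar-2006

def clean_date(raw):
    """Try each known pattern; return an ISO date, or None for an erroneous date."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None                                        # log for error detection

def find_duplicates(records):
    """Flag records whose (name, zip) key has already been seen."""
    seen, dupes = set(), []
    for rec in records:
        key = (rec["customer_name"].lower(), rec["zip_code"])
        if key in seen:
            dupes.append(rec)                          # duplicate data detected
        else:
            seen.add(key)
    return dupes
```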

Figure 11-10 Steps in data reconciliation (cont.)

- Transform = convert data from the format of the operational system to the format of the data warehouse
  - Record-level functions (see the SQL sketch below):
    - Selection: data partitioning
    - Joining: data combining
    - Aggregation: data summarization
  - Field-level functions:
    - Single-field: from one field to one field
    - Multi-field: from many fields to one, or one field to many
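The record-level functions can be expressed directly in SQL; the sketch below runs them against an in-memory SQLite database with assumed tables and columns:

```python
# Hypothetical record-level transformations expressed in SQL; tables and
# columns are illustrative assumptions.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (store_id, product_id, sale_date, amount);
    CREATE TABLE store (store_id, region);
    INSERT INTO sales VALUES (1, 'P1', '2006-03-01', 100), (1, 'P2', '2006-03-02', 50),
                             (2, 'P1', '2006-03-01', 75);
    INSERT INTO store VALUES (1, 'East'), (2, 'West');
""")

# Selection (data partitioning): only March 2006 sales
selection = conn.execute("SELECT * FROM sales WHERE sale_date LIKE '2006-03%'").fetchall()

# Joining (data combining): attach the store's region to each sale
joined = conn.execute(
    "SELECT s.*, st.region FROM sales s JOIN store st USING (store_id)"
).fetchall()

# Aggregation (data summarization): total sales by store
totals = conn.execute("SELECT store_id, SUM(amount) FROM sales GROUP BY store_id").fetchall()
print(selection, joined, totals, sep="\n")
```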

Figure 11-10 Steps in data reconciliation (cont.)

- Load/Index = place transformed data into the warehouse and create indexes
  - Refresh mode: bulk rewriting of target data at periodic intervals
  - Update mode: only changes in source data are written to the data warehouse (sketched below)
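Update mode can be sketched as an upsert, so that only changed source rows touch the warehouse table; the table and columns below are assumptions (SQLite 3.24+ ON CONFLICT syntax):

```python
# Hypothetical sketch of update mode: only changed source rows are written to
# the warehouse, via an upsert. Table and column names are assumptions.
import sqlite3

def update_mode_load(conn: sqlite3.Connection, changed_rows: list[tuple]) -> None:
    conn.execute("CREATE TABLE IF NOT EXISTS product_dim "
                 "(product_id PRIMARY KEY, description, price)")
    conn.executemany(
        """INSERT INTO product_dim (product_id, description, price)
           VALUES (?, ?, ?)
           ON CONFLICT(product_id) DO UPDATE SET
               description = excluded.description,
               price = excluded.price""",
        changed_rows,
    )
    conn.commit()
```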

Figure 11-11 Single-field transformation

- In general, some transformation function translates data from the old form to the new form
- Algorithmic transformation uses a formula or logical expression
- Table lookup: another approach; uses a separate table keyed by the source record code
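Two hypothetical single-field transformations, one algorithmic and one table lookup; the formula and code table are illustrative assumptions:

```python
# Hypothetical single-field transformations.
def fahrenheit_to_celsius(temp_f: float) -> float:
    """Algorithmic transformation: apply a formula to the source field."""
    return (temp_f - 32) * 5 / 9

# Table lookup: a separate translation table keyed by the source record code.
STATE_LOOKUP = {"CA": "California", "NY": "New York", "TX": "Texas"}

def expand_state_code(code: str) -> str:
    return STATE_LOOKUP.get(code, "UNKNOWN")
```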

Figure 11-12 Multifield transformation

- M:1: from many source fields to one target field
- 1:M: from one source field to many target fields
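Hypothetical multifield transformations in both directions; the field layouts are assumptions:

```python
# Hypothetical multifield transformations.
def combine_address(street: str, city: str, state: str, zip_code: str) -> str:
    """M:1 - many source fields combined into one target field."""
    return f"{street}, {city}, {state} {zip_code}"

def split_full_name(full_name: str) -> tuple[str, str]:
    """1:M - one source field split into several target fields."""
    first, _, last = full_name.partition(" ")
    return first, last
```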

Derived Data

Objectives:
- Ease of use for decision support applications
- Fast response to predefined user queries
- Customized data for particular target audiences
- Ad-hoc query support
- Data mining capabilities

Characteristics:
- Detailed (mostly periodic) data
- Aggregate (for summary)
- Distributed (to departmental servers)

Most common data model = star schema (also called dimensional model)

Figure 11-13 Components of a star schema

- Fact tables contain factual or quantitative data
- Dimension tables contain descriptions about the subjects of the business
- 1:N relationship between dimension tables and fact tables
- Dimension tables are denormalized to maximize performance
- Excellent for ad-hoc queries, but bad for online transaction processing

Figure 11-14 Star schema example

- The fact table provides statistics for sales, broken down by product, period, and store dimensions (a DDL sketch of such a schema follows Figure 11-15)

Figure 11-15 Star schema with sample data

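As a concrete sketch of the two star schema components, the DDL below builds a small sales schema along the lines of Figure 11-14; the exact columns are illustrative assumptions. Each dimension key reappears as a foreign key in the fact table, which gives the 1:N relationships:

```python
# Hypothetical star schema like Figure 11-14: three dimension tables and one
# fact table whose composite key is made of the dimension keys.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE product_dim (product_key INTEGER PRIMARY KEY, description TEXT, brand TEXT);
    CREATE TABLE period_dim  (period_key  INTEGER PRIMARY KEY, year INTEGER, quarter INTEGER, month INTEGER);
    CREATE TABLE store_dim   (store_key   INTEGER PRIMARY KEY, store_name TEXT, city TEXT, region TEXT);

    -- Fact table: quantitative data, one row per product/period/store combination
    CREATE TABLE sales_fact (
        product_key  INTEGER REFERENCES product_dim,
        period_key   INTEGER REFERENCES period_dim,
        store_key    INTEGER REFERENCES store_dim,
        units_sold   INTEGER,
        dollars_sold REAL,
        PRIMARY KEY (product_key, period_key, store_key)
    );
""")

# A typical ad-hoc star query: total dollars by brand and region
query = """
    SELECT p.brand, s.region, SUM(f.dollars_sold)
    FROM sales_fact f
    JOIN product_dim p USING (product_key)
    JOIN store_dim   s USING (store_key)
    GROUP BY p.brand, s.region
"""
print(conn.execute(query).fetchall())   # empty until rows are loaded
```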

Issues Regarding Star Schema

- Dimension table keys must be surrogate (non-intelligent and non-business related), because:
  - Keys may change over time
  - Length/format consistency
- Granularity of fact table: what level of detail do you want?
  - Transactional grain: finest level
  - Aggregated grain: more summarized
  - Finer grain: better market basket analysis capability
  - Finer grain: more dimension tables, more rows in the fact table (see the sizing sketch after this list)
- Duration of the database: how much history should be kept?
  - Natural duration: 13 months or 5 quarters
  - Financial institutions may need longer duration
  - Older data is more difficult to source and cleanse
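A back-of-the-envelope sizing sketch; all figures below are assumptions for illustration, not the book's example:

```python
# Hypothetical fact-table sizing: rows = stores x active products per store x periods.
stores = 1_000
products_per_store = 4_000        # products that actually sell in a store each month
months = 13                       # "natural duration" of about 13 months

rows = stores * products_per_store * months
field_bytes = 5 * 4               # e.g. 3 dimension keys + 2 facts, ~4 bytes each
total_bytes = rows * field_bytes

print(f"{rows:,} rows, about {total_bytes / 10**9:.1f} GB")   # 52,000,000 rows, ~1.0 GB
```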


Figure 11-16 Modeling dates

- Fact tables contain time-period data
- Date dimensions are important

The User Interface
Metadata (data catalog)

- Identify subjects of the data mart
- Identify dimensions and facts
- Indicate how data is derived from the enterprise data warehouse, including derivation rules
- Indicate how data is derived from the operational data store, including derivation rules
- Identify available reports and predefined queries
- Identify data analysis techniques (e.g., drill-down)
- Identify responsible people


On-Line Analytical Processing (OLAP) Tools

- The use of a set of graphical tools that provides users with multidimensional views of their data and allows them to analyze the data using simple windowing techniques
- Relational OLAP (ROLAP): traditional relational representation
- Multidimensional OLAP (MOLAP): cube structure
- OLAP operations (a toy sketch follows):
  - Cube slicing: come up with a 2-D view of the data
  - Drill-down: going from summary to more detailed views
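A toy Python illustration of the two operations on a tiny cube held in a dictionary; the figures are made up:

```python
# Toy sketch of cube slicing and drill-down on a small sales cube keyed by
# (product, period, store).
sales_cube = {
    ("P1", "2006-Q1", "Store1"): 100,
    ("P1", "2006-Q1", "Store2"): 80,
    ("P2", "2006-Q1", "Store1"): 60,
    ("P1", "2006-Q2", "Store1"): 120,
}

# Cube slicing: fix one dimension (period = 2006-Q1) for a 2-D product x store view
slice_q1 = {(prod, store): units
            for (prod, period, store), units in sales_cube.items()
            if period == "2006-Q1"}

# Drill-down: start from a summary by product, then expand product P1 by store
summary_by_product = {}
for (prod, period, store), units in sales_cube.items():
    summary_by_product[prod] = summary_by_product.get(prod, 0) + units

detail_p1_by_store = {}
for (prod, period, store), units in sales_cube.items():
    if prod == "P1":
        detail_p1_by_store[store] = detail_p1_by_store.get(store, 0) + units

print(summary_by_product)     # {'P1': 300, 'P2': 60}
print(detail_p1_by_store)     # {'Store1': 220, 'Store2': 80}
```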

Figure 11-23 Slicing a data cube


Figure 11-24 Example of drill-down

- Starting with summary data, users can obtain details for particular cells
- Figure panels: summary report, and drill-down with color added

Data Mining and Visualization

- Knowledge discovery using a blend of statistical, AI, and computer graphics techniques
- Goals:
  - Explain observed events or conditions
  - Confirm hypotheses
  - Explore data for new or unexpected relationships
- Techniques:
  - Statistical regression
  - Decision tree induction
  - Clustering and signal processing
  - Affinity
  - Sequence association
  - Case-based reasoning
  - Rule discovery
  - Neural nets
  - Fractals
- Data visualization: representing data in graphical/multimedia formats for analysis
