You are on page 1of 12

From Enterprise Models to Dimensional Models:

A Methodology for Data Warehouse and Data Mart Design


Daniel L. Moody
Department of Information Systems,
University of Melbourne, Parkville, Australia 3052
email: d.moody@ dis.unimelb.edu.au

Mark A.R. Kortink


Simsion Bowles & Associates
1 Collins St, Melbourne, Australia 3000.

Abstract reported in the literature for such applications (Graham


et al, 1996).
This paper describes a method for developing dimensional One of the most important issues in data warehous-
models from traditional Entity Relationship models. This can
ing is how to design appropriate database structures to
be used to design data warehouses and data marts based on
support end user queries (Pokorny, 1999). Existing
enterprise data models. The first step of the method involves
approaches to data warehouse design advocate a “first
classifying entities in the data model into a number of catego-
ries. The second step involves identifying hierarchies that principles” approach, where the structure of the data
exist in the model. The final step involves collapsing these warehouse is derived directly from user query require-
hierarchies and aggregating transaction data to form dimen- ments. In this paper, we describe a method for de-
sional models. A number of design alternatives are pre- signing data warehouses and data marts based on an
sented, including a flat schema, a terraced schema, a star enterprise data model represented in Entity Relation-
schema and a snowflake schema. We also define a new type ship form. This provides a more structured approach to
of schema called a star cluster schema. This is a restricted data warehouse design, and ensures that structure of the
form of snowflake schema, which minimises the number of data warehouse reflects the underlying semantic struc-
tables while avoiding overlap between different dimensional ture of the data. It also leads to a more flexible ware-
hierarchies. Individual schemas can be collected together to house design, which is resilient to changes in analysis
form constellations or galaxies. We illustrate the method requirements over time.
using a simple example.
Data Warehouse Architecture
1. INTRODUCTION
In the early 90’s, data warehousing was proposed as
The Problem Addressed a general purpose solution to the problem of satisfying
Data warehousing is currently one of the most im- organisational management information needs. A data
portant applications of database technology in practice. warehouse is a database which provides a single con-
The total size of the data warehousing market, includ- sistent source of management information for reporting
ing software and hardware, was estimated to be US$8 and analysis across the organisation (Inmon, 1993;
billion in 1998 (Sen and Jacob, 1998). A significant Love, 1994). Data warehousing requires a major shift
proportion of IT budgets in most organisations is de- in the relationship between IT departments and users,
voted to data warehousing applications. High levels of in that it advocates a “self-service” model rather than
user satisfaction and return on investment have been the traditional report-driven model. In a data ware-
housing environment, end users access data directly
using user friendly query tools rather than relying on
The copyright of this paper belongs to the paper’s authors. Per- reports generated by IT specialists. This helps to re-
mission to copy without fee all or part of this material is granted
provided that the copies are not made or distributed for direct com- duce user reliance on IT staff to satisfy information
mercial advantage. needs.
Proceedings of the International Workshop on Design and Data warehousing is based on a supply chain meta-
Management of Data Warehouses (DMDW'2000) phor. The data “product” is obtained from data “sup-
Stockholm, Sweden, June 5-6, 2000 pliers” (operational systems or external sources) and is
(M. Jeusfeld, H. Shu, M. Staudt, G. Vossen, eds.) temporarily stored in a central data “warehouse”. The
http://sunsite.informatik.rwth-aachen.de/Publications/CEUR-WS/Vol-28/

D. Moody, M. Kortink 5- 1
data is then delivered via data “marts” to data “con- Dimensional Modelling
sumers” (end users). Figure 1 shows a generic archi-
tecture for a data warehouse (rectangles indicate data According to Kimball (1996, 1997), the data ware-
stores, while circles indicate processes). housing (OLAP1) environment is profoundly different
from the operational (OLTP2) environment and tech-
Back End Front End
niques used to design operational databases are inap-
"Data Supplier" "Procurement" "Wholesale
Level"
"Distribution" "Retail Level" "Data Consumer"
propriate for designing data warehouses. For this rea-
Operational Extract Data Load
Data Mart
son, Kimball proposed a new technique for data mod-
Databases Processes Warehouse Processes

End User/
elling specifically for designing data warehouses,
Decision
which he called dimensional modelling. The method
External was developed based on observations of practice, and
Data Sources
in particular, of data vendors who are in the business of
providing data in “user-friendly” form to their custom-
Figure 1. Data Warehouse Architecture ers. It is not based on any theory, and has never been
empirically tested, but has clearly been very successful
The architecture consists of the following compo- in practice. Dimensional modelling has been adopted as
nents: the predominant approach to designing data ware-
• Operational systems: these are systems which record houses and data marts in practice, and represents an
details of business transactions. This is where most important contribution to the discipline of data model-
of the data required for decision support is pro- ling and database design.
duced.
• External sources: data warehouses often incorporate Objectives of this Paper
data from external sources (e.g. census data, eco-
nomic data) to support analysis Kimball argues that modelling in a data warehousing
• Extract processes: these processes “stock” the data environment is radically different from modelling in an
warehouse with data on a regular basis (daily, operational (transaction processing) environment, and
weekly, monthly). Data is extracted from different that you must forget all you know about Entity Rela-
sources, consolidated and reconciled together and tionship modelling.
stored in a consistent format. This corresponds to a “Entity relation models are a disaster for querying because
procurement function. they cannot be understood by users and cannot be navigated
• Central data warehouse: this acts as the central usefully by DBMS software. Entity relation models cannot
source of decision support data across the enter- be used as the basis for enterprise data warehouses”
prise. This forms the “wholesale level” of the data In this paper, we argue that Entity Relationship
warehouse environment and is used to supply data modelling is equally applicable in data warehousing
marts. The central data warehouse is usually im- context as in an operational context and provides a use-
plemented using a traditional relational DBMS. ful basis for designing both data warehouses and data
• Load processes: these processes distribute data from marts.
the central data warehouse to the data marts. This
corresponds to a distribution function.
2. DIMENSIONAL MODELLING
• Data marts: These represent the “retail outlets” of
the data warehouse which provide data in usable CONCEPTS
form for analysis by end users. Data marts are usu- Objectives of Dimensional Modelling
ally tailored to the needs of a specific group of users
or decision making task. They may be “real” (stored There are two major differences between operational
as actual tables populated from the central data databases and data warehouses:
warehouse) or virtual (defined as views on the cen- • End user access: In a data warehousing environ-
tral data warehouse). Data marts may be imple- ment, users write queries directly against the data-
mented using traditional relational DBMS or OLAP base structure, whereas in an operational environ-
tools. ment, users generally only access the database
• End users: write queries and analyses against data through an application system “front end”. In a tra-
stored in data marts using “user friendly” query ditional application system, the structure of the da-
tools. tabase is invisible to the user.
• Read-only: data warehouses are effectively read
only databasesusers can retrieve and analyse data,

1
OLAP = On-Line Analytical Processing
2
OLTP = On-Line Transaction Processing

D. Moody, M. Kortink 5- 2
but cannot update it. Data stored in the data ware- • The fact table contains measurements (e.g. price of
house is updated via batch extract processes. products sold, quantity of products sold) which may
The problem with using traditional database design be aggregated in various ways.
techniques in a data warehousing environment is that it • The dimension tables provide the basis for aggre-
results in database structures which are too complex for gating the measurements in the fact table.
end users to understand and use. A typical operational • The fact table is linked to all the dimension tables
database consists of hundreds of tables linked by a by one-to-many relationships
complex web of relationships. Even quite simple que- • The primary key of the fact table is the concatena-
ries will require multi-table joins, which are error- tion of the primary keys of all the dimension tables.
prone and beyond the capabilities of non-technical us- A more concrete example of a star schema is shown
ers. This is not a problem in transaction processing in Figure 3. In this example, sales data may be ana-
systems because the complexity of the database struc- lysed by product, customer, retail outlet and date.
ture is hidden from the user by a layer of software.
A major reason for the complexity of operational
Product
databases is the use of normalisation. Normalisation
tends to multiply the number of tables required, as it
requires splitting out functionally dependent attributes
into separate tables. The objective of normalisation is Customer Sales Summary
Retail
Outlet
to minimise data redundancy (Codd, 1970). This
maximises update efficiency because each change can
be made in a single place, but tends to penalise re-
trieval (Kent, 1978). Redundancy is less of an issue in a Date

data warehousing environment because data is not up-


dated on-line.
The objective of dimensional modelling is to pro- Figure 3. Star Schema Example
duce database structures that are easy for end users to
understand and write queries against. A secondary Dimension tables are often highly denormalised ta-
objective is to maximise the efficiency of queries. It bles, and generally consist of embedded hierarchies.
achieves these objectives primarily by minimising the For example, Customer, which represents a single di-
number of tables and relationships between them. This mension in the star schema above, consists of three
reduces the complexity of the database and minimises independent hierarchies, when they are normalised out
the number of joins required in user queries. (Figure 4).

Star Schemas Market


Segmentation
Market Market Customer ID
Customer Name
Sector Segment Market Segment Code
Hierarchy Market Segment Name

The basic building block used in dimensional mod- Market Sector Code
Market Sector Name
Industry Code

elling is the star schema. A star schema consists of one Industry Industry Industry
Industry Name
Industry Sector Code
Industry Customer Industry Sector Name
Group Sector
large central table called the fact table, and a number of Hierarchy Industry Group Code
Industry Group Name
City code

smaller tables called dimension tables which radiate City name


State code
State name

out from the central table (Figure 2). Location


Hierarchy Country State City Country code
Country name

Dimension
Table 1
Figure 4. Embedded Hierarchies in Customer
Dimension

Dimension
Table 2
Fact Table
Dimension
Table 3
The advantage of using star schemas to represent
data is that it reduces the number of tables in the data-
base and the number of relationships between them and
therefore the number of joins required in user queries.
Dimension
Table 4 Kimball (1996) claims that use of star schemas to de-
sign data warehouses results in 80% of queries being
single table browses. Star schemas may either be im-
Figure 2. Star Schema (Generic Structure) plemented in specialist OLAP tools, or using traditional
relational DBMS.
The fact table forms the “centre” of the star, while
the dimension tables form the “points” of the star. A
star schema may have any number of dimensions.

D. Moody, M. Kortink 5- 3
Star Schema Design Approach Data Mart Design
Kimball’s design method is a “first principles” ap- Data marts represent the “retail” level of the data
proach, which is based on analysis of user query re- warehouse, where data is accessed directly by end us-
quirements. It begins by identifying the relevant ers. Data is extracted from the central data warehouse
“facts” that need to be aggregated, the dimensional into data marts to support particular analysis require-
attributes to aggregate by, and forming star schemas ments. The most important requirement at this level is
based on these. It results in a data warehouse design that data is structured in a way that is easy for users to
which is a set of discrete star schemas. However there understand and use. For this reason, dimensional mod-
are a number of practical problems with this approach: elling techniques are most appropriate at this level.
• User analysis requirements are highly unpredictable This ensures that data structures are as simple as possi-
and subject to change over time, which provides an ble in order to simplify user queries. The next section
unstable basis for design describes an approach for developing dimensional
• It can lead to incorrect designs if the designer does models from an enterprise data model.
not understand the underlying relationships in the
data 4. FROM ENTITY RELATIONSHIP
• It results in loss of information through premature MODELS TO DIMENSIONAL
aggregation, which limits the ways in which data can
MODELS
be analysed
• The approach is presented by examples rather than This section describes a method for developing di-
via an explicit design procedure mensional models from Entity Relationship models.
The approach described in this paper overcomes This can be used as a basis for designing data marts
these problems by using an enterprise data model as the based on an enterprise data model.
basis for data warehouse design. This makes use of the
relationships in the data which have been already been Example Data Model
documented, and provides a much more structured ap-
A simple example is used to illustrate the design ap-
proach to developing a data warehouse design.
proach. Figure 5 shows an operational data model for a
sales application. The highlighted attributes indicate
3. DATA WAREHOUSE DESIGN the primary keys of each entity.
APPROACH
We argue that different design principles should be Period
Sale
Posted
Sale Location
Location
Type
used for designing the central data warehouse and data Date
Mth
Sale_Id
Sale_Date
Loc_Id
Loc_Name
Loc_Type_Id
Loc_Type_Name

marts: Qtr
Yr
Posted_Date
Cust_Id
Loc_Regn_Id
Loc_Type_Id Region State
Fiscal_Yr Loc_Id
Regn_Id State_Id
Discount_Amt
Regn_Name State_Name

Central Data Warehouse Design State_Id

Product Sale Customer


Product Customer
Type Item Type
This represents the “wholesale” level of the data Prod_Type_Id
Prod_Type_Name
Prod_Id
Prod_Name
Sale_Id
Prod_Id
Cust_Id
Cust_Name
Cust_Type_Id
Cust_Type_Name

warehouse, which is used to supply data marts with Prod_Type_Id Qty


Unit_Price
Cust_Type_Id
Cust_Regn_Id

data. The most important requirement of the central Fee Sale


data warehouse is that it provides a consistent, inte- Type
Fee_Type_Id
Fee
Sale_Id

grated and flexible source of data. We argue that tra- Fee_Type_Name Fee_Type_Id
Fee

ditional data modelling techniques (Entity Relationship


models and normalisation) are most appropriate at this Figure 5. Example Data Model
level. A normalised database design ensures maximum
consistency and integrity of the data. It also provides Such a model is typical of data models that are used
the most flexible data structurenew data can be easily by operational (OLTP) systems. Such a model is well
added to the warehouse in a modular way, and the da- suited to a transaction processing environment. It con-
tabase structure will support any analysis requirements. tains no redundancy, thus maximising efficiency of
Aggregation or denormalisation at this stage will lose updates, and explicitly shows all the data and the rela-
information and restrict the kind of analyses which can tionships between them. Unfortunately most decision
be carried out. An enterprise data model, if one exists, makers would find this schema incomprehensible. Even
should be used as the basis for structuring the central quite simple queries require multi-table joins and com-
data warehouse. plex subqueries. As a result, end users will be depend-
ent on technical specialists to write queries for them.

D. Moody, M. Kortink 5- 4
Step 1. Classify Entities Period
Sale

Sale Location
Location
Type
Posted

The first step in producing a dimensional model Region State


from an Entity Relationship model is to classify the
entities into three categories: Product
Type
Product
Sale
Item
Customer
Customer
Type

Transaction Entities Fee


Type
Sale
Fee

Transaction entities record details about particular


events that occur in the businessfor example, orders, Figure 6. Entity Classifications
insurance claims, salary payments and hotel bookings.
Invariably, it is these events that decision makers want Resolving Ambiguities
to understand and analyse. The key characteristics of a
In some cases, entities may fit into multiple catego-
transaction entity are:
ries. We therefore define a precedence hierarchy for
• It describes an event that happens at a point in time
resolving such ambiguities:
• It contains measurements or quantities that may be
1. Transaction entity (highest precedence)
summarised e.g. dollar amounts, weights, volumes.
2. Classification entity
For example, an insurance claim records a particular
3. Component entity (lowest precedence)
business event and (among other things) the amount
For example, if an entity can be classified as either a
claimed. Transaction entities are the most important
classification entity or a component entity, it should be
entities in a data warehouse, and form the basis for
classified as a classification entity. In practice, some
constructing fact tables in star schemas. Not all trans-
entities will not fit into any of these categories. Such
action entities will be of interest for decision support,
entities do not fit the hierarchical structure of a dimen-
so user input will be required in identifying which
sional model, and cannot be included in star schemas.
transactions are important.
This is where real world data sometimes does not fit the
star schema “mould”.
Component Entities
A component entity is one which is directly related Step 2. Identify Hierarchies
to a transaction entity via a one-to-many relationship.
Hierarchies are an extremely important concept in
Component entities define the details or “components”
dimensional modelling, and form the primary basis for
of each business transaction. Component entities an-
deriving dimensional models from Entity Relationship
swer the “who”, “what”, “when”, “where”, “how” and
models. As discussed previously, most dimension ta-
“why” of a business event. For example, a sales trans-
bles in star schemas contain embedded hierarchies. A
action may be defined by a number of components:
hierarchy in an Entity Relationship model is any se-
• Customer: who made the purchase
quence of entities joined together by one-to-many rela-
• Product: what was sold
tionships, all aligned in the same direction. Figure 7
• Location: where it was sold shows a hierarchy extracted from the example data
• Period: when it was sold model, with State at the top and Sale Item at the bot-
An important component of any transaction is tom.
timehistorical analysis is an important part of any
data warehouse. Component entities form the basis for Sale
Sale Location Region State
constructing dimension tables in star schemas. Sale_Id
Item
Sale_Id Loc_Id Regn_Id State_Id

Prod_Id Sale_Date Loc_Name Regn_Name State_Name

Qty Posted_Date Regn_Id State_Id

Classification Entities Unit Price Cust_Id Loc_Type_Id

Loc_Id

Discount_Amt
Classification entities are entities which are related
to component entities by a chain of one-to-many rela-
tionshipsthat is, they are functionally dependent on a Figure 7 Example of Hierarchy
component entity (directly or transitively). Classifica-
tion entities represent hierarchies embedded in the data In hierarchical terminology:
model, which may be collapsed into component entities • State is the “parent” of Region
to form dimension tables in a star schema. • Region is the “child” of State
Figure 6 shows the classification of the entities in • Sale Item, Sale, Location and Region are all “de-
the example data model. In the diagram, scendants” of State
• Black entities represent Transaction entities • Sale, Location, Region and State are all “ancestors”
• Grey entities indicate Component entities of Sale Item
• White entities indicate Classification entities

D. Moody, M. Kortink 5- 5
Maximal Hierarchy Figure 9 shows Region being collapsed into Loca-
tion. We can continue doing this until we reach the
A hierarchy is called maximal if it cannot be ex-
bottom of the hierarchy, and end up with a single table
tended upwards or downwards by including another
(Sale Item).
entity. In all, there are 14 maximal hierarchies in the
example data model:
• Customer Type-Customer-Sale-Sale Fee
Sale
Sale Location Region
Item

• Customer Type-Customer-Sale-Sale Item


Sale_Id Sale_Id Loc_Id Regn_Id

Prod_Id Sale_Date Loc_Name Regn_Name

• Fee Type-Sale Fee Qty

Value
Posted_Date

Cust_Id
Regn_Id

Loc_Type_Id
State_Id

State_Name

• Location Type-Location-Sale-Sale Fee Loc_Id Regn_Name

• Location Type-Location-Sale-Sale Item


Discount State_Id

State_Name

• Period (posted)-Sale-Sale Fee collapse


• Period (posted)-Sale-Sale Item
• Period (sale)-Sale-Sale Fee
Figure 9. Region Entity “ collapsed” into Location
• Period (sale)-Sale-Sale Item
• Product Type-Product-Sale Item Operator 2: Aggregation
• State-Region-Customer-Sale-Sale Fee
• State-Region-Customer-Sale-Sale Item The aggregation operator can be applied to a trans-
• State-Region-Location-Sale-Sale Fee action entity to create a new entity containing summa-
• State-Region-Location-Sale-Sale Item rised data. A subset of attributes is chosen from the
An entity is called minimal if it is at the bottom of a source entity to aggregate (the aggregation attributes)
maximal hierarchy and maximal if it is at the top of and another subset of attributes chosen to aggregate by
one. Minimal entities can be easily identified as they (the grouping attributes). Aggregation attributes must
are entities with no one-to-many relationships (or be numerical quantities.
“leaf” entities in hierarchical terminology), while For example, we could apply the aggregation op-
maximal entities are entities with no many to one rela- erator to the Sale Item entity to create a new entity
tionships (or “root” entities). In the example data called Product Summary as in Figure 10. This aggre-
model there are gated entity shows for each product the total sales
• Two minimal entities: Sale Item and Sale Fee amount (quantity*price), the average quantity per order
and average price per item on a daily basis. The aggre-
• Six maximal entities: Period, Customer_Type, State,
gation attributes are quantity and price, while the
Location Type, Product Type and Fee Type.
grouping attributes are Product ID and Date. The key
of this entity is the combination of the attributes used to
Step 3. Produce Dimensional Models aggregate by (grouping attributes). Note that aggrega-
tion loses information: we cannot reconstruct the de-
Operators For Producing Dimensional Models tails of individual sale items from the product summary
We use two operators to produce dimensional mod- table.
els from Entity Relationship models.
Sale Product
Item Summary
Operator 1: Collapse Hierarchy Sale_Id Prod_Id

Prod_Id Date
Higher level entities can be “collapsed” into lower Date Total Sales ($)

level entities within hierarchies. Figure 8 shows the Qty Average Quantity

Unit Price Average Price


State entity being collapsed into the Region entity. The
Region entity contains its original attributes plus the
attributes of the collapsed table. This introduces redun- Figure 10. Aggregation Operator
dancy in the form of a transitive dependency, which is
a violation to third normal form (Codd, 1970). Col- Dimensional Design Options
lapsing a hierarchy is therefore a form of denormalisa-
There is a wide range of options for producing di-
tion (Tonkin, 1991).
mensional models from an Entity Relationship model.
Sale
Item
Sale Location Region State These include:
Sale_Id

Prod_Id
Sale_Id

Sale_Date
Loc_Id

Loc_Name
Regn_Id

Regn_Name
State_Id

State_Name
• Flat schema
Qty

Value
Posted_Date

Cust_Id
Regn_Id

Loc_Type_Id
State_Id

State_Name
• Terraced schema
Loc_Id
• Star schema
• Snowflake schema
Discount
collapse

Figure 8. State Entity “ collapsed” into Region • Star cluster schema

D. Moody, M. Kortink 5- 6
Each of these options represent different trade-offs counting). Another problem with flat schemas is that
between complexity and redundancy. Here we discuss they tend to result in tables with large numbers of at-
how the operators previously defined may be used to tributes, which may be unwieldy. While the number of
produce different dimensional models. tables (system complexity) is minimised, the complexity
of each table (element complexity) is greatly increased.
Option 1: Flat Schema
Option 2: Terraced Schema
A flat schema is the simplest schema possible with-
out losing information. This is formed by collapsing all A terraced schema is formed by collapsing entities
entities in the data model down into the minimal enti- down maximal hierarchies, but stopping when they
ties. This minimises the number of tables in the data- reach a transaction entity. This results in a single table
base and therefore the possibility that joins will be for each transaction entity in the data model. Figure 12
needed in user queries. In a flat schema we end up shows the terraced schema that results from the exam-
with one table for each minimal entity in the original ple data model. This schema is less likely to cause
data model. Figure 11 shows the flat schema which problems for an inexperienced user, because the sepa-
results from the example data model. ration between levels of transaction entities is explicitly
shown.
Sale Sale
Item Fee
Sale_Id Sale_Id Sale Sale
Sale
Prod_Id Fee_Type_Id Item Fee
Qty Fee Sale_Id Sale_Id Sale_Id

Value Fee_Type_Name Prod_Id Sale_Date Fee_Type_Id

Prod_Name Sale_Date Qty Sale_Mth Fee

Prod_Type_Id Sale_Mth Value Sale_Qtr Fee_Type_Name

Prod_Type_Name Sale_Qtr Prod_Name Sale_Yr

Sale_Date Sale_Yr Prod_Type_Id Sale_Fiscal_Yr

Sale_Mth Sale_Fiscal_Yr Prod_Type_Name Posted_Date

Sale_Qtr Posted_Date Posted_Mth

Sale_Yr Posted_Mth Posted_Qtr

Sale_Fiscal_Yr Posted_Qtr Posted_Yr

Posted_Date Posted_Yr Posted_Fiscal_Yr

Posted_Mth Posted_Fiscal_Yr Discount


Posted_Qtr Discount Cust_Id
Posted_Yr Cust_Id Cust_Name
Posted_Fiscal_Yr Cust_Name Cust_Type_Id
Discount Cust_Type_Id Cust_Type_Name
Cust_Id Cust_Type_Name Cust_Regn_Id
Cust_Name Cust_Regn_Id Cust_Regn_Name
Cust_Type_Id Cust_Regn_Name Cust_State_Id
Cust_Type_Name Cust_State_Id Cust_State_Name
Cust_Regn_Id Cust_State_Name Loc_Id
Cust_Regn_Name Loc_Id Loc_Name
Cust_State_Id Loc_Name Loc_Type_Id
Cust_State_Name Loc_Type_Id Loc_Type_Name
Loc_Id Loc_Type_Name Loc_Regn_Id
Loc_Name Loc_Regn_Id Loc_Regn_Name
Loc_Type_Id Loc_Regn_Name Loc_State_Id
Loc_Type_Name Loc_State_Id Loc_State_Name
Loc_Regn_Id Loc_State_Name
Loc_Regn_Name
Loc_State_Id
Loc_State_Name Figure 12. Terraced Schema

Figure 11. Flat Schema Option 3: Star Schema


A star schema can be easily derived from an Entity
Such a schema is similar to the “flat files” used by Relationship model. Each star schema is formed in the
analysts using statistical packages such as SAS and following way:
SPSS. Note that this structure does not lose any infor- • A fact table is formed for each transaction entity.
mation from the original data model. It contains redun- The key of the table is the combination of the keys
dancy, in the form of transitive and partial dependen- of its associated component entities.
cies, but does not involve any aggregation. One prob- • A dimension table is formed for each component
lem with a flat schema is that it may lead to aggregation entity, by collapsing hierarchically related classifi-
errors when there are hierarchical relationships be- cation entities into it.
tween transaction entities. When we collapse numerical • Where hierarchical relationships exist between
amounts from higher level transaction entities into an- transaction entities, the child entity inherits all di-
other they will be repeated. In the example data model, mensions (and key attributes) from the parent entity.
if a Sale consists of three Sale Items, the discount This provides the ability to “drill down” between
amount will be stored in three different rows in the Sale transaction levels.
Item table. Adding the discount amounts together then
results in double-counting (or in this case, triple

D. Moody, M. Kortink 5- 7
• Numerical attributes within transaction entities ability to “drill down” between levels of detail (e.g.
should be aggregated by key attributes (dimensions). from Sale to Sale Item). The constellation schema
The aggregation attributes and functions used de- which results from the example data model is shown in
pend on the application. Figure 15links between fact tables are shown in bold.
Figure 13 shows the star schema that results from
the Sale transaction entity. This star schema has four Period
Sale
Posted
Sale Location

dimensions, each of which contains embedded hierar- Date Sale_Date Loc_Id


Mth Posted_Date Loc_Name
chies. The aggregated fact is Discount amount. Qtr Cust_Id Loc_Type_Id
Yr Loc_Id Loc_Type_Name
Fiscal_Yr Sum_of_Discount Loc_Regn_Id
Regn_Name
Sale
Period Sale Location State_Id
State_Name
Date Sale_Date Loc_Id
Mth Posted Posted_Date Loc_Name
Sale
Item
Qtr Cust_Id Loc_Type_Id Customer
Sale_Date
Yr Loc_Id Loc_Type_Name
Posted_Date Cust_Id
Fiscal_Yr Sum(Discount_Amt) Loc_Regn_Id
Product Cust_Id Cust_Name
Regn_Name
Loc_Id Cust_Type_Id
State_Id Prod_Id
Prod_Id Cust_Type_Name
State_Name Prod_Name
Sum_of_Qty Cust_Regn_Id
Prod_Type_Id
Customer Sum_of_Value Regn_Name
Prod_Type_Name
State_Id
Cust_Id
State_Name
Cust_Name
Sale
Cust_Type_Id
Fee
Cust_Type_Name
Sale_Date
Cust_Regn_Id Posted_Date
Fee
Regn_Name
Type Cust_Id
State_Id Fee_Type_Id Loc_Id
State_Name Fee_Type_Name Fee_Type_Id
Sum_of_Fee

Figure 13. Sale Star Schema


Figure 15. Sales Constellation Schema
Figure 14 shows the star schema which results from
the Sale Item transaction entity. This star schema has Galaxy Schema
five dimensions. This includes four dimensions from More generally, a set of star schemas or constella-
its “parent” transaction entity (Sale) and one of its own tions can be combined together to form a galaxy. A
(Product). The aggregated facts are quantity and item galaxy is of a collection of star schemas with shared
cost (quantity * price). dimensions. Unlike a constellation schema, the fact
tables in a galaxy do not need to be directly related.
Period Location

Date Loc_Id

Mth Loc_Name Option 4: Snowflake Schema


Qtr Loc_Type_Id
Sale
Yr Loc_Type_Name

Fiscal_Yr
Posted
Loc_Regn_Id In a star schema, hierarchies in the original data
Sale
Item
Regn_Name
State_Id
model are collapsed or denormalised to form dimen-
Sale_Date
Posted_Date
State_Name
sion tables. Each dimension table may contain multiple
Product
Prod_Id
Cust_Id
Loc_Id
independent hierarchies. A snowflake schema is a star
Prod_Name Prod_Id
Customer schema with all hierarchies explicitly shown. A snow-
Prod_Type_Id Sum_of_Qty

Prod_Type_Name Sum_of_ItemCost
Cust_Id flake schema can be formed from a star schema by ex-
Cust_Name
Cust_Type_Id panding out (normalising) the hierarchies in each di-
Cust_Type_Name
Cust_Regn_Id mension. Alternatively, a snowflake schema can be
Regn_Name
State_Id
produced directly from an Entity Relationship model
State_Name
by the following procedure:
• A fact table is formed for each transaction entity.
Figure 14. Sale Item Star Schema The key of the table is the combination of the keys
of the associated component entities.
A separate star schema is produced for each trans- • Each component entity becomes a dimension table.
action table in the original data model. • Where hierarchical relationships exist between
transaction entities, the child entity inherits all rela-
Constellation Schema tionships to component entities (and key attributes)
from the parent entity.
Instead of a number of discrete star schemas, the ex-
• Numerical attributes within transaction entities
ample data model can be transformed into a constella-
should be aggregated by the key attributes. The at-
tion schema. A constellation schema consists of a set
tributes and functions used depend on the applica-
of star schemas with hierarchically linked fact tables.
tion.
The links between the various fact tables provide the

D. Moody, M. Kortink 5- 8
Figure 16 shows the snowflake schema which results Location
Location
Dimension
Location
from the Sale transaction entity. Type

Sale Location Sale Region State Overlap


Period Sale Location
Posted Type
Date Cust-Id Loc_Id Loc_Type_Id
Mth Sale_Date Loc_Name Loc_Type_Name

Qtr Posted_Date Loc_Regn_Id Customer


Customer Customer
Yr Loc-Id Loc_Type_Id Region State Type
Fiscal_Yr SUM (Discount)
Dimension
Regn_Id State_Id
Regn_Name State_Name
State_Id

Customer Customer Figure 17. Intersecting Hierarchies in Example


Type
Cust_Id
Cust_Name
Cust_Type_Id Data Model
Cust_Type_Name
Cust_Type_Id
Cust_Regn_Id

We define a star cluster schema as one which has


the minimal number of tables while avoiding overlap
Figure 16. Sale Snowflake Schema
between dimensions. It is a star schema which is se-
lectively “snowflaked” to separate out hierarchical
Option 5: Star Cluster Schema
segments or subdimensions which are shared between
Kimball (1996) argues that “snowflaking” is unde- different dimensions. Subdimensions effectively repre-
sirable, because it adds complexity to the schema and sent the “highest common factor” between dimensions.
requires extra joins. Clearly, expanding all hierarchies A star cluster schema may be produced from an En-
defeats the purpose of producing simple, user friendly tity Relationship model using the following procedure.
database designsin the example above, it more than Each star cluster is formed by:
doubles the number of tables in the schema. In this pa- • A fact table is formed for each transaction entity.
per, we argue that neither a “pure” star schema (fully The key of the table is the combination of the keys
collapsed hierarchies) nor a “pure” snowflake schema of the associated component entities.
(fully expanded hierarchies) results in the best solution. • Classification entities should be collapsed down
As in many design problems, the optimal solution is a their hierarchies until they reach either a fork entity
balance between two extremes. or a component entity. If a fork is reached, a sub-
The problem with fully collapsing hierarchies occurs dimension table should be formed. The subdimen-
when hierarchies overlap, leading to redundancy be- sion table will consist of the fork entity plus all its
tween dimensions when they are collapsed. This can ancestors. Collapsing should begin again after the
result in confusion for users, increased complexity in fork entity. When a component entity is reached, a
extract processes and inconsistent results from queries dimension table should be formed.
if hierarchies become inconsistent. For these reasons, • Where hierarchical relationships exist between
we require that dimensions should be orthogonal. transaction entities, the child entity should inherit all
Overlapping dimensions can be identified via dimensions (and key attributes) from the parent en-
“forks” in hierarchies. A fork occurs when an entity tity.
acts as a “parent” in two different dimensional hierar- • Numerical attributes within transaction entities
chies. This results in the entity and all of its ancestors should be aggregated by the key attributes (dimen-
being collapsed into two separate dimension tables. sions). The attributes and functions used depend on
Fork entities can be identified as classification entities the application.
with multiple one-to-many relationships. The exception Figure 18 shows the star cluster schema that results
to this rule occurs when the hierarchy converges again from the model fragment of Figure 17.
lower downDampney (1996) calls this a commuting
loop. DIMENSION TABLE

In the example data model, a fork occurs at the Re- Location

gion entity. Region is a parent of Location and Cus- Loc_Id


FACT TABLE Loc_Name
tomer, which are both components of the Sale transac- Loc_Regn_Id
SUB-DIMENSION

Sale
tion. In the star schema representation, State and Re- Loc_Type_Id
Loc_Type_Name
Region
Cust-Id
gion would be included in both the Location and Cus- Sale_Date
DIMENSION TABLE
Regn_Id
Regn_Name
tomer dimensions when the hierarchies are collapsed. Posted_Date
State_Id
Loc-Id Customer State_Name
This results in overlap between the dimensions. SUM (Discount)
Cust_Id
Cust_Name
Cust_Type_Id
Cust_Regn_Id
Cust_Type_Name

Figure 18. Star Cluster Schema

D. Moody, M. Kortink 5- 9
Figure 19 shows how entities in the original data sent a “break” in the hierarchical chain, and cannot be
model were clustered to form the star cluster schema. collapsed. There are a number of options for dealing
The overlap between hierarchies has now been re- with many-to-many relationships:
moved. (a) Ignore the intersection entity (eliminate it from
the data mart)
Location
Location (b) Convert the many-to-many relationship to a one-
Type
Loc_Id Loc_Type_Id Location to-many relationship, by defining a “primary”
Loc_Name
Loc_Regn_Id
Loc_Type_Name Dimension relationship
Loc_Type_Id
(c) Include it as a many-to-many relationship in the
Sale
Cust-Id Region State
data martsuch entities may be useful to expert
Sale_Date
Posted_Date
Regn_Id State_Id Region analysts but will not be amenable to analysis
Regn_Name State_Name Sub-dimension
Loc-Id
SUM (Discount)
State_Id using an OLAP tool.
For example, in the model below, each client may be
Customer
Customer
Type involved in a number of industries. The intersection
Cust_Id Cust_Type_Id Customer
Cust_Name Cust_Type_Name Dimension entity Client Industry breaks the hierarchical chain and
Cust_Type_Id
Cust_Regn_Id cannot be collapsed into Client.

Client
Figure 19. Revised Clustering Industry Type Industry Class Industry
Industry
Client

If required, views may be used to reconstruct a star


schema from a star cluster schema. This gives the best Figure 20. Multiple Classification
of both worlds: the simplicity of a star schema while
preserving consistency between dimensions. As with The options are (a) to exclude the industry hierar-
star schemas, star clusters may be combined together to chy, (b) convert it to a one-to-many relationship or (c)
form constellations or galaxies. include it as a many-to-many relationship.

major

Step 4. Evaluation and Refinement (b) Industry


industry
Client

In practice, dimensional modelling is an iterative


process. The clustering procedure described in Step 3 (c) Industry Client Industry Client
is useful for producing a first cut design, but this will
need to be refined to produce the final data mart de-
sign. Most of these modifications have to do with fur- Figure 21. Design Options
ther simplifying the model and dealing with non-
hierarchical patterns in the data. Handling Subtypes

Combining Fact Tables Supertype/subtype relationships can be converted to


a hierarchical structure by removing the subtypes and
Fact tables with the same primary keys (i.e. the same creating a classification entity to distinguish between
dimensions) should be combined. This reduces the subtypes. This can then be converted to a dimensional
number of star schemas and facilitates comparison be- model in a straightforward manner.
tween related facts (e.g. budget and actual figures).
Vehicle Type
Combining Dimension Tables Vehicle

Creating dimension tables for each component entity Car

often results in a large number of dimension tables. To Truck

simplify the data mart structure, related dimensions Vehicle

should be consolidated together into a single dimension


table.
Figure 22. Conversion of Subtypes to
Many to Many Relationships Hierarchical Form

Most of the complexities which arise in converting a


traditional Entity Relationship model to a dimensional
model result from many-to-many relationships or inter-
section entities. Many-to-many relationships cause
problems in dimensional modelling because they repre-

D. Moody, M. Kortink 5- 10
5. CONCLUSION Implications for Data Warehouse Design
Practice
Summary of the Method The advantages of this approach are:
We have described a method for developing data • It provides a more structured approach to develop-
warehouse and data mart designs from an enterprise ing dimensional models than working from first
data model. The method has now been applied in a principles
wide range of industries, including manufacturing, • It ensures that the data marts and the central data
health, insurance and banking. The method has warehouse reflect the underlying relationships in the
evolved considerably as a result of experiences in data
practice. The steps of the method are: • Developing data warehouse and data mart designs
1. Develop Enterprise Data Model (if one doesn’t based on a common enterprise data model simplifies
exist already) extract and load processes
2. Design Central Data Warehouse: this will be • An existing enterprise data model provides a useful
closely based on the enterprise data model, but will basis for identifying information requirements in a
be a subset of the model which is relevant for deci- bottom up mannerbased on what data exists in the
sion support purposes. A staged approach is rec- enterprise. This can be usefully combined with
ommended for implementing the central data Kimball’s (1996) top-down analysis approach.
warehouse, starting with the most important sub- • An enterprise data model provides a more stable ba-
ject areas. sis for design than user query requirements, which
3. Classify Entities: classify entities in the central are unpredictable and subject to frequent change
data warehouse model as either transaction, com- • It ensures that the central data warehouse is flexible
ponent or classification entities. enough to support the widest possible range of
4. Identify Hierarchies: identify the hierarchies which analysis requirements, by storing data at the level of
exist in the data model individual transactions. Aggregation of data at this
5. Design Data Marts: develop star cluster schemas level reduces the granularity of data in the data
for each transaction entity in the central data ware- warehouse, which limits the types of analyses which
house model. Each star cluster will consist of a are possible.
fact table and a number of dimension and subdi- • It maximises the integrity of data stored in the cen-
mension tables. This minimises the number of ta- tral data warehouse
bles while avoiding overlap between dimensions. While the approach is not entirely mechanical, it
The separate star clusters may be combined to- provides much more guidance to designers of data
gether to form constellations or galaxies. warehouses and data marts than previous approaches.
Careful analysis is still required to identify the entities
Design Options in the enterprise data model which are relevant for de-
We have identified a range of options for developing cision making and classifying them. However once this
data marts to support end user queries from an enter- has been done, the development of a dimensional
prise data model. These options represent different model can take place in a relatively straightforward
trade-offs between the number of tables (complexity) manner. Using an Entity Relationship model of the data
and redundancy of data (Figure 23). provides a much better starting point for developing
dimensional models than starting from scratch, and can
Increased help avoid many of the pitfalls faced by inexperienced
Complexity
designers.
Snowflake Schema

Star Cluster Schema


Federated Data Warehouse Architecture
Star Schema Use of this approach also supports the development
Terraced Schema of independent but consistent data warehouses based on
Flat Schema
a common enterprise data model (federated architec-
ture). The enterprise data model can be used to ensure
Increased the consistency of data definitions between the different
Redundancy
warehouses, enabling data to be compared and com-
bined between them. It also allows use of common
Figure 23. Design Trade-offs extract processes. This may be a more feasible option
than a single centralised data warehouse in large and
diverse organisations.

D. Moody, M. Kortink 5- 11
REFERENCES cepts and Terminology for the Conceptual Schema and
the Information Base, ISO Technical Report 9007.
1. BANKER, R.D., AND KAUFFMAN, R.J. (1991): Re- 17. KIMBALL, R. (1996): The Data Warehouse Toolkit,
use and Productivity in Integrated Computer Aided New York: J. Wiley & Sons.
Software Engineering: An Empirical Study, MIS Quar- 18. KIMBALL, R. (1997): A Dimensional Manifesto,
terly, DBMS Online, August, 1997.
2. BATINI, C., CERI, S. AND NAVATHE, S.B. (1994) 19. LEWIS, D. (1999): “Use and Abuse of a Meta-Data
Conceptual Database Design: An Entity Relationship Registry”, Fourth Australian Data Management Con-
Approach, Benjamin Cummings, Redwood City, Cali- ference, Melbourne, Australia, October 11-13.
fornia. 20. LOVE, B. (1994): Enterprise Information Technologies,
3. CHEN, P.P. (1976): The Entity Relationship Model: Van Nostrum Reinhold, New York.
Towards an Integrated View of Data, ACM Transactions 21. MARTIN, J. (1983): Strategic Data Planning Method-
on Database Systems, 1(1), March: 9-36. ologies, Prentice-Hall.
4. CHEN, P.P. (1983): A Preliminary Framework for En- 22. MOODY, D.L. AND SIMSION, G.C. (1995) Justifying
tity-Relationship Models, in Chen, P.P. (ed.) The En- Investment in Information Resource Management, Aus-
tity-Relationship Approach to Information Modelling tralian Journal of Information Systems, 3(1), Septem-
and Analysis, North-Holland. ber: 25-37.
5. CODD, E.F. (1970) A Relational Model of Data for 23. POKORNY, J. (1998): Data Warehouses: A Modelling
Large Shared Data Banks, Communications of the ACM, Perspective, Proceedings of the 7th International Con-
13 (6), June: 377-387. ference on Information Systems Development, Bled,
6. DAMPNEY, C.N.G. (1996): Corporate Genera or Cor- Slovenia, Plenum Press.
porate Trivia—Can We Really Base Applications on 24. SAGER, M.J. (1988): Data-centred Enterprise Model-
Corporate Data Models? Proceedings of the First Aus- ling Methodologies - A Study Of Practice And Poten-
tralian DAMA Conference, Melbourne, Australia, De- tial. Australian Computer Journal, August.
cember 1-2. 25. SEN, A. and JACOB, V.S. (1998): Industrial Strength
7. DEVLIN, B. (1997): Data Warehouse: From Architec- Data Warehousing, Communications of the ACM, Sep-
ture to Implementation, Addison-Wesley, Reading, tember.
Mass. 26. SHANKS, G.G. (1996): “Enterprise Data Architectures:
8. GARDNER, S.R. (1998): Building the Data Warehouse, A Study of Practice”, First Australian Data Manage-
Communications of the ACM, September. ment Conference, Melbourne, Australia, December 2-3.
9. GILES, J. (1999): Data Warehousing: Is It Just First 27. SIMSION, G.C. (1994): Data Modelling Essentials,
Aid?, in D.L. Moody (ed.) From Data to Information to Van Nostrand Reinhold, New York.
Knowledge: Managing the Unmanageable, Proceedings
of the Fourth Australian DAMA Conference, Mel-
bourne, Australia, October 11-13.
10. GLASSEY, K. (1998): Seducing the End User, Com-
munications of the ACM, September.
11. GOODHUE, D.L., KIRSCH, L.J., AND WYBO, M.D.
(1992): The Impact of Data Integration on the Costs and
Benefits of Information Systems, MIS Quarterly, 16(3),
September: 293-311.
12. GRAHAM, S., COBURN, D. and OLESON, C. (1996):
The Foundations of Wisdom: A Study of the Financial
Impact of Data Warehousing, International Data Corpo-
ration (IDC) Ltd. Canada.
13. HALPIN, T. (1995): Conceptual Schema and Relational
Database Design: A Fact Oriented Approach, Prentice
Hall, Sydney.
14. HITCHMAN, S. (1995) Practitioner Perceptions On
The Use Of Some Semantic Concepts In The Entity
Relationship Model, European Journal of Information
Systems, 4, 31-40.
15. INMON, W. (1996): Building the Data Warehouse (2nd
Edition), New York: J. Wiley & Sons.
16. INTERNATIONAL STANDARDS ORGANISATION
(ISO) (1987), Information Processing Systems - Con-

D. Moody, M. Kortink 5- 12

You might also like