
DATA WAREHOUSING

BY
B.M.BRINDA,
AP / IT - VCEW

OVERVIEW

Introduction to Data
Data Warehouse
OLAP Vs. OLTP
Multi Dimensional Data
Data Mining
Data Preprocessing


INTRODUCTION
Data
Data is information that has been translated into a form that is more convenient to move or process.

Database
A database is an organized collection of information which can easily be accessed, managed, and updated by a set of programs.

Data, Data everywhere, yet ...

I can't find the data I need
- data is scattered over the network
- many versions, subtle differences
I can't get the data I need
- need an expert to get the data
I can't understand the data I found
- available data is poorly documented
I can't use the data I found
- results are unexpected
- data needs to be transformed from one form to another

What is a Data Warehouse?


A single, complete and consistent store of data obtained from a variety of different sources, made available to end users in a way they can understand and use in a business context.

What is Data Warehousing?


A process of transforming data into information and making it available to users in a timely enough manner to make a difference. (Data -> Information)

DATA WAREHOUSE
A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management's decision-making process. - W. H. Inmon

Subject-Oriented

Data that gives information about a particular subject instead of about a company's ongoing operations.
Data is categorized and stored by business subject rather than by application.

Application Oriented (e.g., Loans, ATM, Credit Card, Trust, Savings) -> Subject Oriented (e.g., Customer, Product, Vendor, Activity)

Integrated
Data on a given subject is defined and stored once.
Data is gathered into the data warehouse from a variety of sources and merged into a coherent whole.

(OLTP applications such as Savings, Current Accounts, and Loans feed Customer data into the Data Warehouse.)

Time-Variant
Data is stored as a series of snapshots, each representing a period of time.
All data in the data warehouse is identified with a particular time period.

(Example: monthly snapshots of the data, such as Jan-97, Feb-97, Mar-97.)

Nonvolatile

Data in the data warehouse is not updated or deleted.
Data is stable in a data warehouse. More data is added, but data is never removed. This enables management to gain a consistent picture of the business.

(Operational systems: Insert, Update, Delete, Read. Warehouse: Load and Read only.)

Changing Data
The operational database performs a first-time load into the warehouse database, followed by periodic refreshes.

OLTP
OLTP - Online Transaction Processing, or Operational Database Systems
Performs online transaction and query processing
Covers most of the day-to-day operations
Characterized by a large number of short online transactions (INSERT, UPDATE, DELETE)
Examples: purchasing, inventory, manufacturing, banking, payroll, registration, etc.

OLAP
OLAP - Online Analytical Processing, or Data Warehouse
Serves users or knowledge workers in the role of decision making and data analysis
Organizes and presents data in various formats in order to satisfy various user requests
Characterized by a relatively low volume of transactions
OLAP allows users to analyze information from multiple database systems at one time
OLAP data is stored in multidimensional databases

Data Warehouse vs. Operational DBMS

Distinct features (OLTP vs. OLAP):
User and system orientation: customer vs. market
Data contents: current, detailed vs. historical, consolidated
Database design: ER + application vs. star + subject
View: current, local vs. evolutionary, integrated
Access patterns: update vs. read-only but complex queries

OLTP vs. OLAP

                    OLTP                             OLAP
users               clerk, IT professional           knowledge worker
function            day-to-day operations            decision support
DB design           application-oriented             subject-oriented
data                current, up-to-date, detailed,   historical, summarized,
                    flat relational, isolated        multidimensional, integrated,
                                                     consolidated
usage               repetitive                       ad-hoc
access              read/write, index/hash on        lots of scans
                    primary key
unit of work        short, simple transaction        complex query
# records accessed  tens                             millions
# users             thousands                        hundreds
DB size             100 MB to GB                     100 GB to TB
metric              transaction throughput           query throughput, response time

Why Separate Data Warehouse?

High performance for both systems
- DBMS tuned for OLTP: access methods, indexing, concurrency control, recovery
- Warehouse tuned for OLAP: complex OLAP queries, multidimensional view, consolidation
Different functions and different data:
- missing data: decision support requires historical data which operational DBs do not typically maintain
- data consolidation: decision support requires consolidation (aggregation, summarization) of data from heterogeneous sources
- data quality: different sources typically use inconsistent data representations, codes and formats which have to be reconciled

MULTIDIMENSIONAL DATA MODEL
A data warehouse is based on a multidimensional data model which views data in the form of a data cube.
Data Cube - allows data to be modeled and viewed in multiple dimensions.
Supports viewing/modeling of a variable (or a set of variables) of interest.
Measures are used to report the values of the particular variable with respect to a given set of dimensions.

DATA CUBE
A data cube is defined by dimensions and facts.
Dimensions - entities about which an organization wants to keep records.
Dimension Table - each dimension may be associated with a table, e.g., Item, Branch, Location.
Facts - numerical measures.
Fact Table - contains the names of the facts (measures), as well as keys to each of the related dimension tables, e.g., Units_sold, Amount_Budgeted.
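As a small illustration (not part of the original slides; the table and column names such as sales_fact and item_dim are invented), the pandas sketch below shows a fact table holding foreign keys to dimension tables plus numerical measures, and a join/aggregation over it:

import pandas as pd

# Hypothetical dimension tables (one per dimension: item, branch).
item_dim = pd.DataFrame({"item_key": [1, 2], "item_name": ["TV", "Phone"]})
branch_dim = pd.DataFrame({"branch_key": [10, 20], "branch_name": ["B1", "B2"]})

# Fact table: keys to each dimension plus numerical measures.
sales_fact = pd.DataFrame({
    "item_key":     [1, 1, 2, 2],
    "branch_key":   [10, 20, 10, 20],
    "units_sold":   [5, 3, 7, 2],
    "dollars_sold": [500.0, 300.0, 2100.0, 600.0],
})

# Join facts to a dimension and aggregate the measures per dimension value.
report = (sales_fact.merge(item_dim, on="item_key")
                    .groupby("item_name")[["units_sold", "dollars_sold"]]
                    .sum())
print(report)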

Cube: A Lattice of Cuboids


In data warehousing literature, an n-D base cube is called a base cuboid. The topmost 0-D cuboid, which holds the highest level of summarization, is called the apex cuboid. The lattice of cuboids forms a data cube.

3D Data Cube
(Figure: a 3-D data cube with the measure Dollars_Sold.)

Modeling of Data Warehouse
A data warehouse requires a concise, subject-oriented schema that facilitates on-line data analysis.
Star Schema
Snowflake Schema
Fact Constellation Schema

STAR SCHEMA
Contains a large central table (fact table) holding the bulk of the data, and a set of smaller attendant tables (dimension tables), one for each dimension.

SNOWFLAKE SCHEMA
A variant of the star schema model, where some dimension tables are normalized, thereby further splitting the data into additional tables.

FACT CONSTELLATION
The schema can be viewed as a collection of stars, and hence it is called a galaxy schema or a fact constellation.
Used for sophisticated applications.

CONCEPT HIERARCHY
Defines a sequence of mappings from a set of
low-level concepts to higher-level, more general
concepts.


OLAP OPERATIONS
Roll Up (Drill Up) - reduction of a dimension
Drill Down (Roll Down) - adds a new dimension
Slice and Dice
Pivot (Rotate)

ROLL UP
Performs aggregation on a data cube, either by climbing up a concept hierarchy for a dimension or by dimension reduction.

DRILL DOWN
Drill-down is the reverse of roll-up. It navigates from less detailed data to more detailed data.
Drill-down can be realized by either stepping down a concept hierarchy for a dimension or introducing additional dimensions.

SLICE & DICE


The slice operation selects one particular dimension from a given cube and provides a new sub-cube.
The dice operation selects two or more dimensions from a given cube and provides a new sub-cube.

PIVOT
The pivot operation is also known as rotation. It rotates the data axes in view, in order to provide an alternative presentation of the data.
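A minimal pandas sketch of these operations (not from the original slides; the dimension values and column names are invented) is given below: roll-up collapses a level of the time hierarchy, slice and dice filter dimensions, and pivot rotates the axes of the view.

import pandas as pd

# Toy "cube" in flat form: dimensions (year, quarter, city) and one measure.
cube = pd.DataFrame({
    "year":         [1997, 1997, 1997, 1997, 1998, 1998],
    "quarter":      ["Q1", "Q1", "Q2", "Q2", "Q1", "Q1"],
    "city":         ["Chennai", "Delhi", "Chennai", "Delhi", "Chennai", "Delhi"],
    "dollars_sold": [100, 150, 120, 160, 130, 170],
})

# Roll-up: climb the time hierarchy quarter -> year (aggregate away quarter).
rollup = cube.groupby(["year", "city"])["dollars_sold"].sum()

# Slice: fix one dimension to a single value, giving a sub-cube.
slice_1997 = cube[cube["year"] == 1997]

# Dice: select on two or more dimensions at once.
dice = cube[(cube["year"] == 1997) & (cube["city"] == "Chennai")]

# Pivot (rotate): swap the axes used for rows and columns of the view.
pivoted = cube.pivot_table(index="city", columns="year",
                           values="dollars_sold", aggfunc="sum")
print(rollup, slice_1997, dice, pivoted, sep="\n\n")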

Design of Data Warehouse

Four views regarding the design of a data warehouse:
Top-down view
- allows selection of the relevant information necessary for the data warehouse
Data source view
- exposes the information being captured, stored, and managed by operational systems
Data warehouse view
- consists of fact tables and dimension tables
Business query view
- sees the perspectives of data in the warehouse from the view of the end-user

Data Warehouse Design Process

Top-down, bottom-up approaches or a combination of both
- Top-down: starts with overall design and planning (mature)
- Bottom-up: starts with experiments and prototypes (rapid)
Data Warehouse Design Steps
- Choose a business process to model, e.g., orders, invoices, etc.
- Choose the grain (atomic level of data) of the business process
- Choose the dimensions that will apply to each fact table record
- Choose the measures that will populate each fact table record

3 TIER DATA WAREHOUSE ARCHITECTURE


Three Data Warehouse Models

Enterprise warehouse
- collects all of the information about subjects spanning the entire organization
Data mart
- a subset of corporate-wide data that is of value to a specific group of users; its scope is confined to specific, selected groups, such as a marketing data mart
- independent vs. dependent (built directly from the warehouse) data mart
Virtual warehouse
- a set of views over operational databases
- only some of the possible summary views may be materialized

DATA MINING
Extracting or mining knowledge from large amounts of data.
Also called knowledge extraction, knowledge discovery from data (KDD), or data/pattern analysis.

DATA MINING STEPS


KDD STEPS
1. Data cleaning (to remove noise and inconsistent data)
2. Data integration (where multiple data sources may be combined)
3. Data selection (where data relevant to the analysis task are
retrieved from the database)
4. Data transformation (where data are transformed or
consolidated into forms appropriate for mining by performing
summary or aggregation operations, for instance)
5. Data mining (an essential process where intelligent methods are
applied in order to extract data patterns)
6. Pattern evaluation (to identify the truly interesting patterns representing knowledge based on some interestingness measures)
7. Knowledge presentation (where visualization and knowledge representation techniques are used to present the mined knowledge to the user)

Why Data Preprocessing?

Data in the real world is dirty
incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
- e.g., occupation = ""
noisy: containing errors or outliers
- e.g., Salary = "-10"
inconsistent: containing discrepancies in codes or names
- e.g., Age = "42", Birthday = "03/07/1997"
- e.g., was rating "1, 2, 3", now rating "A, B, C"
- e.g., discrepancy between duplicate records

Why Is Data Dirty?

Incomplete data may come from
- "not applicable" data values when collected
- different considerations between the time when the data was collected and when it is analyzed
- human/hardware/software problems
Noisy data (incorrect values) may come from
- faulty data collection instruments
- human or computer error at data entry
- errors in data transmission
Inconsistent data may come from
- different data sources
- functional dependency violation (e.g., modify some linked data)
Duplicate records also need data cleaning

Why Is Data Preprocessing Important?

No quality data, no quality mining results!
Quality decisions must be based on quality data
- e.g., duplicate or missing data may cause incorrect or even misleading statistics
A data warehouse needs consistent integration of quality data
Data extraction, cleaning, and transformation comprise the majority of the work of building a data warehouse

Multi-Dimensional Measure of Data Quality

A well-accepted multidimensional view:
Accuracy
Completeness
Consistency
Timeliness
Believability
Value added
Interpretability
Accessibility

Data Preprocessing - Tasks

Data cleaning
- fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies
Data integration
- integration of multiple databases, data cubes, or files
Data transformation
- normalization and aggregation
Data reduction
- obtains a reduced representation in volume but produces the same or similar analytical results
Data discretization
- part of data reduction but with particular importance, especially for numerical data

Forms of Data Preprocessing


Data Cleaning
Importance
- "Data cleaning is one of the three biggest problems in data warehousing" - Ralph Kimball
- "Data cleaning is the number one problem in data warehousing" - DCI survey
Data cleaning tasks
- fill in missing values
- identify outliers and smooth out noisy data
- correct inconsistent data
- resolve redundancy caused by data integration

Data Cleaning: How to Handle Missing Data?

Ignore the tuple: usually done when the class label is missing (assuming the task is classification); not effective when the percentage of missing values per attribute varies considerably
Fill in the missing value manually
Fill it in automatically with
- a global constant, e.g., "unknown" (a new class?!)
- the attribute mean
- the attribute mean for all samples belonging to the same class: smarter
- the most probable value: inference-based, such as a Bayesian formula or regression
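A minimal pandas sketch of the automatic fill-in options (not from the original slides; the toy data and column names are invented):

import pandas as pd

# Toy data with missing income values and a class label.
df = pd.DataFrame({
    "income": [30000, None, 45000, 52000, None],
    "class":  ["low", "low", "high", "high", "high"],
})

# Fill with a global constant.
const_filled = df["income"].fillna(-1)

# Fill with the overall attribute mean.
mean_filled = df["income"].fillna(df["income"].mean())

# Fill with the attribute mean of samples in the same class (smarter).
class_mean_filled = df.groupby("class")["income"].transform(
    lambda s: s.fillna(s.mean()))
print(const_filled, mean_filled, class_mean_filled, sep="\n\n")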

Data Cleaning: How to Handle Noisy Data?

Binning
- first sort data and partition it into (equal-frequency) bins
- then smooth by bin means, smooth by bin medians, smooth by bin boundaries, etc.
Regression
- smooth by fitting the data to regression functions
Clustering
- detect and remove outliers
Combined computer and human inspection
- detect suspicious values and check by human (e.g., deal with possible outliers)

Simple Discretization Methods: Binning

Equal-width (distance) partitioning
- divides the range into N intervals of equal size: uniform grid
- if A and B are the lowest and highest values of the attribute, the width of the intervals will be W = (B - A)/N
- the most straightforward, but outliers may dominate the presentation
- skewed data is not handled well
Equal-depth (frequency) partitioning
- divides the range into N intervals, each containing approximately the same number of samples
- good data scaling
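A short pandas sketch of the two partitioning styles (not from the original slides; it reuses the price values from the next slide):

import pandas as pd

prices = pd.Series([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])

# Equal-width partitioning: N = 3 intervals of width W = (34 - 4) / 3 = 10.
equal_width = pd.cut(prices, bins=3)

# Equal-depth (frequency) partitioning: 3 bins with roughly equal counts.
equal_depth = pd.qcut(prices, q=3)

print(equal_width.value_counts().sort_index())
print(equal_depth.value_counts().sort_index())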

Data Cleaning: Binning Methods
Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
* Partition into equal-frequency (equi-depth) bins:
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34
* Smoothing by bin means:
- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries:
- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34
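The NumPy sketch below (an illustration, not part of the original slides) reproduces the smoothing results above for the same three equal-frequency bins:

import numpy as np

prices = np.array([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])
bins = prices.reshape(3, 4)            # three equal-frequency bins of 4 values

# Smoothing by bin means: replace every value with its bin's (rounded) mean.
means = np.rint(bins.mean(axis=1)).astype(int)
by_means = np.repeat(means, 4).reshape(3, 4)

# Smoothing by bin boundaries: snap each value to the nearer bin boundary.
low, high = bins[:, :1], bins[:, -1:]
by_bounds = np.where(bins - low <= high - bins, low, high)

print(by_means)    # [[ 9  9  9  9] [23 23 23 23] [29 29 29 29]]
print(by_bounds)   # [[ 4  4  4 15] [21 21 25 25] [26 26 26 34]]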

REGRESSION
Data can be smoothed by fitting the data to a
function, such as with regression.
Linear regression involves finding the best line
to fit two attributes (or variables), so that one
attribute can be used to predict the other.
Multiple linear regression is an extension of
linear regression, where more than two
attributes are involved and the data are fit to a
multidimensional surface.

Regression
(Figure: data points smoothed by fitting a line such as y = x + 1.)
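A minimal sketch of regression-based smoothing (not from the original slides; the data points are invented and roughly follow y = x + 1):

import numpy as np

# Hypothetical noisy data roughly following y = x + 1.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.2, 1.8, 3.3, 3.9, 5.4, 5.8])

# Linear regression: find the best-fit line, then replace y by the fitted values.
slope, intercept = np.polyfit(x, y, deg=1)
y_smoothed = slope * x + intercept

print(f"fit: y = {slope:.2f}x + {intercept:.2f}")
print(y_smoothed)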

CLUSTERING
Outliers may be detected by clustering, where similar values are organized into groups, or clusters. Intuitively, values that fall outside of the set of clusters may be considered outliers.
Outliers - data objects with characteristics that are considerably different from most of the other data objects in the data set.
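A small sketch of clustering-based outlier detection (not from the original slides; the toy values and the "tiny cluster" heuristic are illustrative assumptions, using scikit-learn's KMeans):

import numpy as np
from sklearn.cluster import KMeans

# 1-D toy values: two tight groups plus one value far from both.
values = np.array([[10.], [11.], [12.], [50.], [51.], [52.], [200.]])

# Cluster the values; a value that does not fit any dense group tends to end
# up isolated in a very small cluster of its own.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(values)
labels = kmeans.labels_
sizes = np.bincount(labels)

# Treat members of tiny clusters (here: singletons) as outlier candidates.
outliers = values[np.isin(labels, np.where(sizes <= 1)[0])]
print(outliers)   # expected to contain 200.0 for this toy data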

Data Integration
Data integration: combines data from multiple sources into a coherent store
Schema integration: e.g., A.cust-id and B.cust-# refer to the same attribute
- integrate metadata from different sources
Entity identification problem:
- identify real-world entities from multiple data sources, e.g., Bill Clinton = William Clinton
Detecting and resolving data value conflicts
- for the same real-world entity, attribute values from different sources differ
- possible reasons: different representations, different scales, e.g., metric vs. British units

Handling Redundancy in Data Integration
Redundant data occur often when integrating multiple databases
- Object identification: the same attribute or object may have different names in different databases
- Derivable data: one attribute may be a derived attribute in another table, e.g., annual revenue
Redundant attributes may be detected by correlation analysis
Careful integration of the data from multiple sources may help reduce/avoid redundancies and inconsistencies and improve mining speed and quality

Data Integration: Correlation Analysis (Numerical Data)
Correlation coefficient (also called Pearson's product-moment coefficient):

  r_A,B = sum((a_i - mean(A)) * (b_i - mean(B))) / ((n - 1) * sigma_A * sigma_B)
        = (sum(a_i * b_i) - n * mean(A) * mean(B)) / ((n - 1) * sigma_A * sigma_B)

where n is the number of tuples, mean(A) and mean(B) are the respective means of A and B, sigma_A and sigma_B are the respective standard deviations of A and B, and sum(a_i * b_i) is the sum of the AB cross-product.

If r_A,B > 0, A and B are positively correlated (A's values increase as B's do); the higher the value, the stronger the correlation.
r_A,B = 0: independent; r_A,B < 0: negatively correlated.
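A short NumPy sketch of this formula (not from the original slides; the two toy attributes are invented), checked against NumPy's built-in correlation:

import numpy as np

# Toy attributes A and B; values are illustrative only.
A = np.array([2.0, 4.0, 6.0, 8.0, 10.0])
B = np.array([1.0, 3.0, 7.0, 9.0, 12.0])

n = len(A)
# Pearson's coefficient using sample (n-1) standard deviations, as in the slide.
r = ((A - A.mean()) * (B - B.mean())).sum() / (
    (n - 1) * A.std(ddof=1) * B.std(ddof=1))

print(r)                        # close to +1: strongly positively correlated
print(np.corrcoef(A, B)[0, 1])  # same value via NumPy's built-in routine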

Data Integration: Correlation Analysis (Categorical Data)
Chi-square test:

  chi^2 = sum((Observed - Expected)^2 / Expected)

The larger the chi^2 value, the more likely the variables are related.
The cells that contribute the most to the chi^2 value are those whose actual count is very different from the expected count.
Correlation does not imply causality
- the number of hospitals and the number of car thefts in a city are correlated
- both are causally linked to a third variable: population

Chi-Square Calculation: An Example

                          Play chess   Not play chess   Sum (row)
Like science fiction      250 (90)     200 (360)        450
Not like science fiction  50 (210)     1000 (840)       1050
Sum (col.)                300          1200             1500

Chi-square calculation (numbers in parentheses are expected counts calculated based on the data distribution in the two categories):

  chi^2 = (250 - 90)^2/90 + (50 - 210)^2/210 + (200 - 360)^2/360 + (1000 - 840)^2/840 = 507.93

It shows that like_science_fiction and play_chess are correlated in the group.
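A short NumPy sketch (not from the original slides) that recomputes the expected counts and the chi-square statistic for the contingency table above:

import numpy as np

# Observed counts from the slide (rows: likes sci-fi or not;
# columns: plays chess or not).
observed = np.array([[250, 200],
                     [50, 1000]])

row_sums = observed.sum(axis=1, keepdims=True)   # 450, 1050
col_sums = observed.sum(axis=0, keepdims=True)   # 300, 1200
total = observed.sum()                           # 1500

expected = row_sums * col_sums / total           # [[90, 360], [210, 840]]
chi2 = ((observed - expected) ** 2 / expected).sum()
print(chi2)   # roughly 507.9, matching the slide's value of 507.93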

Data Transformation
Smoothing: remove noise from data
Aggregation: summarization, data cube construction
Generalization: concept hierarchy climbing
Normalization: scaled to fall within a small, specified range
- min-max normalization
- z-score normalization
- normalization by decimal scaling
Attribute/feature construction
- new attributes constructed from the given ones

Data Transformation: Normalization

Min-max normalization: maps to [new_min_A, new_max_A]

  v' = ((v - min_A) / (max_A - min_A)) * (new_max_A - new_min_A) + new_min_A

Ex. Let income range from $12,000 to $98,000, normalized to [0.0, 1.0]. Then $73,600 is mapped to ((73,600 - 12,000) / (98,000 - 12,000)) * (1.0 - 0) + 0 = 0.716

Z-score normalization (mu: mean, sigma: standard deviation):

  v' = (v - mu_A) / sigma_A

Ex. Let mu = 54,000 and sigma = 16,000. Then (73,600 - 54,000) / 16,000 = 1.225

Normalization by decimal scaling:

  v' = v / 10^j

where j is the smallest integer such that max(|v'|) < 1
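A short NumPy sketch of the three normalizations (not from the original slides; it reuses the slide's income example and its mean/standard deviation):

import numpy as np

income = np.array([12_000.0, 54_000.0, 73_600.0, 98_000.0])

# Min-max normalization to [0.0, 1.0]: 73,600 -> 0.716.
new_min, new_max = 0.0, 1.0
minmax = (income - income.min()) / (income.max() - income.min()) \
         * (new_max - new_min) + new_min

# Z-score normalization with the slide's mean and standard deviation:
# 73,600 -> (73,600 - 54,000) / 16,000 = 1.225.
mu, sigma = 54_000.0, 16_000.0
zscore = (income - mu) / sigma

# Decimal scaling: divide by 10^j, with the smallest j so that max(|v'|) < 1.
j = int(np.ceil(np.log10(np.abs(income).max() + 1)))
decimal_scaled = income / 10 ** j

print(minmax)          # includes 0.716 for 73,600
print(zscore)          # includes 1.225 for 73,600
print(decimal_scaled)  # all values now below 1 in absolute value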

Data Reduction Strategies

Why data reduction?
- a database/data warehouse may store terabytes of data
- complex data analysis/mining may take a very long time to run on the complete data set
Data reduction
- obtain a reduced representation of the data set that is much smaller in volume but yet produces the same (or almost the same) analytical results
Data reduction strategies
- Aggregation
- Sampling
- Dimensionality reduction
- Feature subset selection
- Feature creation
- Discretization and binarization
- Attribute transformation

Data Reduction: Aggregation
Combining two or more attributes (or objects) into a single attribute (or object)
Purpose
- Data reduction: reduce the number of attributes or objects
- Change of scale: cities aggregated into regions, states, countries, etc.
- More stable data: aggregated data tends to have less variability

Data Reduction: Sampling

Sampling is the main technique employed for data selection.
It is often used for both the preliminary investigation of the data and the final data analysis.
Statisticians sample because obtaining the entire set of data of interest is too expensive or time consuming.
Sampling is used in data mining because processing the entire set of data of interest is too expensive or time consuming.

Data Reduction: Types of Sampling
Simple random sampling
- there is an equal probability of selecting any particular item
Sampling without replacement
- as each item is selected, it is removed from the population
Sampling with replacement
- objects are not removed from the population as they are selected for the sample
- in sampling with replacement, the same object can be picked more than once
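A minimal NumPy sketch of the two sampling styles (not from the original slides; the population and sample size are invented):

import numpy as np

rng = np.random.default_rng(seed=42)
population = np.arange(1, 101)        # 100 items

# Simple random sampling WITHOUT replacement: each item appears at most once.
without = rng.choice(population, size=10, replace=False)

# Simple random sampling WITH replacement: the same item can be picked twice.
with_repl = rng.choice(population, size=10, replace=True)

print(sorted(without))
print(sorted(with_repl))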

Data Reduction: Dimensionality Reduction
Purpose:
- avoid the curse of dimensionality
- reduce the amount of time and memory required by data mining algorithms
- allow data to be more easily visualized
- may help to eliminate irrelevant features or reduce noise
Techniques
- Principal Component Analysis (PCA)
- Singular Value Decomposition (SVD)
- others: supervised and non-linear techniques

Dimensionality Reduction: PCA
The goal is to find a projection that captures the largest amount of variation in the data.
Find the eigenvectors of the covariance matrix; the eigenvectors define the new space.
(Figure: 2-D data (x1, x2) projected onto its principal direction e.)
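A minimal NumPy sketch of PCA via the covariance-matrix eigenvectors (not from the original slides; the 2-D toy data is generated for illustration):

import numpy as np

rng = np.random.default_rng(seed=0)
# Toy 2-D data (x1, x2) with most of its variation along one direction.
x1 = rng.normal(0.0, 5.0, size=200)
x2 = 0.6 * x1 + rng.normal(0.0, 1.0, size=200)
X = np.column_stack([x1, x2])

# PCA: centre the data, form the covariance matrix, find its eigenvectors.
X_centred = X - X.mean(axis=0)
cov = np.cov(X_centred, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)        # eigenvalues in ascending order

# The eigenvector with the largest eigenvalue (the principal component) defines
# the direction capturing the largest amount of variation.
principal = eigvecs[:, np.argmax(eigvals)]
projected = X_centred @ principal             # 1-D reduced representation
print(principal, projected[:5])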

Data Reduction: Feature Subset Selection
Another way to reduce dimensionality of data
Redundant features
duplicate much or all of the information contained in
one or more other attributes
Example: purchase price of a product and the amount
of sales tax paid

Irrelevant features
contain no information that is useful for the data
mining task at hand
Example: students' ID is often irrelevant to the task of
predicting students' GPA

Data Reduction: Feature Subset Selection

Techniques:
Brute-force approach:
- try all possible feature subsets as input to the data mining algorithm
Filter approaches:
- features are selected before the data mining algorithm is run
Wrapper approaches:
- use the data mining algorithm as a black box to find the best subset of attributes

Data Reduction: Feature Creation
Create new attributes that can capture the important information in a data set much more efficiently than the original attributes
Three general methodologies:
- Feature extraction: domain-specific
- Mapping data to a new space
- Feature construction: combining features

Data Reduction: Mapping Data to a New Space
- Fourier transform
- Wavelet transform
(Figure: two sine waves, two sine waves plus noise, and their frequency-domain representations.)

Data Reduction: Discretization Using Class Labels
Entropy-based approach
(Figures: discretization into 3 categories for both x and y, and into 5 categories for both x and y.)

Data Reduction: Discretization Without Using Class Labels
(Figures: the original data discretized by equal frequency, equal interval width, and K-means.)

Data Reduction: Attribute Transformation
A function that maps the entire set of values of a given attribute to a new set of replacement values, such that each old value can be identified with one of the new values
- simple functions: x^k, log(x), e^x, |x|
- standardization and normalization
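A short NumPy sketch of these transformations (not from the original slides; the toy attribute values are invented):

import numpy as np

x = np.array([1.0, 4.0, 9.0, 16.0, 25.0])

# Simple attribute transformations from the slide: x^k, log(x), e^x, |x|.
powered = x ** 2
logged = np.log(x)
exponentiated = np.exp(x)
absolute = np.abs(x - x.mean())   # |x| shown on centred values, where sign matters

# Standardization (z-score) as one common normalizing transformation.
standardized = (x - x.mean()) / x.std()
print(powered, logged, standardized, sep="\n")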
