Data Warehousing Concepts

WHY DATA WAREHOUSE?

Report:

Purpose:
Collection of Data
Analysis: Comparative Study of Data, Historical Data
Multi Dimensional Analysis of Data, Reporting

Final:
Improve Decision Making

OLTP
ONLINE TRANSACTION PROCESSING

CITY BANK
Account, Loans, Mutual, Insure
Capture Information:
-- Customer
-- Saving Account
-- Insurance

[Figure: Front End Applications (Java, .NET) send INSERT/UPDATE/DELETE transactions (T1, T2, T3, T4) to the Database, transaction by transaction; Select statements (R1, R2, R3) read from it for multi-dimensional analysis of data and reporting.]

ER: Entity Relationship
OLTP: Online Transactional Processing, also called Transactional Systems or Operational Systems

CITY BANK Salary Accounts: IBM/Accenture/Dell

[Figure: OLTP tables of the bank, each receiving Insert/Delete/Update:
Office: IBM/Accenture/Dell
Customer: Employee of each org
Accounts: Saving Account, Insurance Acc, Loans Acc
Balance: Account ID, Branch, Date (sample values: 5,00,000; 1000; 10,00,000; 8000)
Branch: Branch/ATM]

Report: Give me all the offices and the branches in which they are doing more transactions (analysis for opening a new branch).

Select office, branch from office, customer, accounts, balance, branch
where office = customer
and customer = account
and account = balance
and balance = branch

Longer time to scan the data, based on join conditions.
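
The report query above can be sketched as runnable SQL. This is only an illustration using Python's built-in sqlite3 module: the table layouts, column names and sample rows are assumptions, not the bank's actual schema. It shows why the report needs a join across every OLTP table.

```python
import sqlite3

# Hypothetical OLTP schema; table and column names are assumptions for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE office   (office_id INTEGER PRIMARY KEY, office_name TEXT);
CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, office_id INTEGER);
CREATE TABLE accounts (account_id INTEGER PRIMARY KEY, customer_id INTEGER);
CREATE TABLE branch   (branch_id INTEGER PRIMARY KEY, branch_name TEXT);
CREATE TABLE balance  (txn_id INTEGER PRIMARY KEY, account_id INTEGER,
                       branch_id INTEGER, amount REAL);
INSERT INTO office   VALUES (1, 'IBM'), (2, 'Dell');
INSERT INTO customer VALUES (10, 1), (11, 1), (12, 2);
INSERT INTO accounts VALUES (100, 10), (101, 11), (102, 12);
INSERT INTO branch   VALUES (1, 'Hyderabad'), (2, 'Chennai');
INSERT INTO balance  VALUES (1, 100, 1, 1000), (2, 101, 1, 8000),
                            (3, 102, 2, 500), (4, 100, 2, 250);
""")

# Four joins are needed before a single office/branch pair can be counted.
rows = conn.execute("""
SELECT o.office_name, b.branch_name, COUNT(*) AS txn_count
FROM office o
JOIN customer c ON c.office_id   = o.office_id
JOIN accounts a ON a.customer_id = c.customer_id
JOIN balance  t ON t.account_id  = a.account_id
JOIN branch   b ON b.branch_id   = t.branch_id
GROUP BY o.office_name, b.branch_name
ORDER BY txn_count DESC, o.office_name, b.branch_name
""").fetchall()

for office_name, branch_name, txn_count in rows:
    print(office_name, branch_name, txn_count)
```

On a normalized OLTP database every such report pays for all of these joins, which is the scan cost the slide refers to.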

What is a Data Warehouse?

A data warehouse is a subject-oriented, integrated, nonvolatile, time-variant collection of data in support of management's decisions.
- W. H. Inmon

Integrated - Characteristics of a Data Warehouse

Sales Hyderabad (OLTP DB: SQL Server): Sale ID Integer, Product Char
Sales Chennai (OLTP DB: Oracle Server): Sale ID Numeric, Product Varchar2
Informatica Staging: SaleID Decimal, Product String
Data Warehouse
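
As a rough sketch of what "integrated" means here, the snippet below converts rows from two differently typed sources into one common staging format (Decimal for the sale ID, String for the product). The row values and the helper name to_staging are assumptions made for illustration; a tool such as Informatica would do this through its own mappings.

```python
from decimal import Decimal

# Hypothetical source rows; the values are made up for illustration.
hyderabad_rows = [{"sale_id": 101, "product": "Pen  "}]     # SQL Server: Integer, Char (padded)
chennai_rows = [{"sale_id": 202.0, "product": "Pencil"}]    # Oracle: Numeric, Varchar2

def to_staging(row):
    """Convert one source row to the common staging format (Decimal, String)."""
    return {
        "sale_id": Decimal(str(row["sale_id"])),
        "product": str(row["product"]).rstrip(),  # drop Char padding
    }

staging = [to_staging(r) for r in hyderabad_rows + chennai_rows]
print(staging)
```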

Non-volatile - Characteristics of a Data Warehouse

Operational: insert, change, delete, replace
Data Warehouse: load, read only access

Subject-Oriented - Characteristics of a Data Warehouse

DW is a subject-oriented database which supports the business needs of individual departments in the enterprise.
Example: SALES, HR, ACCOUNTS, LOANS etc.

Time Variant - Characteristics of a Data Warehouse

Operational:
Current value data
Time horizon: 60-90 days

Data Warehouse:
Snapshot data
Time horizon: 5-10 years
Data warehouse stores historical data

A data warehouse is a database which is specifically designed for analyzing the business, not for business transactional processing.
- Ralph Kimball

OLTP Vs Data Warehouse

Data Warehousing Architecture

[Figure: SOURCE SYSTEMS, ETL, STAGING AREA, DATA WAREHOUSE, ETL, DATA MARTS, OLAP SERVER(S), OLAP REPORTS]

DATA ACQUISITION

It is a process of extracting the relevant business information, transforming the data into a required business format and loading it into the Data Warehouse.
It is defined with the following processes:
Data Extraction
Data Transformation
Data Loading

What is ETL?
ETL stands for Extract, Transform & Load.

The process of updating the data warehouse.

ETL is the automated and auditable data acquisition process from a source system. It involves one or more sub-processes of data extraction, data transportation, data transformation, data consolidation, data integration, data loading and data cleaning.

Need for ETL

The process of ETL is required so that data from different heterogeneous sources can be combined and brought into one common source.

The advantage of having the process of ETL is that, as data from different sources can be brought together, highly complex and user friendly reports can be generated for decision making.

Data is stored in different formats in different types of databases.
Some data sources might be archives while others may be active operational systems.
Data extraction and cleansing are time-consuming and difficult.
Aggregation of data.

What ETL is Not?

ETL never creates new data.
E.g., if a list of a hundred employees is being loaded, one more employee cannot be added to the list to make it a hundred and one. Or, if the last name of a customer is absent, an arbitrary last name cannot be substituted.

Data warehouses are not OLTP systems.

Calculations done in the source system should not be duplicated in the data warehouse: if the process in the source system changes in the future, the two will fall out of sync.

Features of ETL Tools

Support data extraction, cleansing, aggregation, reorganization, transformation, and load operations
Generate and maintain centralized metadata
Filter data, convert codes, calculate derived values, map source data fields to target data fields
Automatic generation of ETL programs
Closely integrated with RDBMS
High speed loading of target data warehouses using engine-driven ETL tools

Advantages of using ETL Tools

GUI-based design of jobs eases development and maintenance
Generation of directly executable code
Engine-driven technology is fast, efficient and multithreaded
In-memory data streaming for high-speed data processing
Products are easy to learn and require less training
Automatic generation and maintenance of open, extensible metadata
Support for multiple data formats and platforms
Large number of vendor-supplied data transformation objects

Meta Data
Data about data
Needed by both information technology personnel and users
IT personnel need to know data sources and targets; database, table and column names; refresh schedules; data usage measures; etc.
Users need to know entity/attribute definitions; reports/query tools available; report distribution information; help desk contact information, etc.

The ETL Process

[Figure: Source Systems, Staging Area, Presentation System; steps: Extract, Transform, Load]
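
The three steps can be sketched as three small functions. This is a minimal illustration, not how any particular ETL tool works: the CSV layout, the sales table and the function names are all assumptions, and an in-memory SQLite database stands in for the presentation system.

```python
import csv
import io
import sqlite3

# Hypothetical flat-file source; the columns and values are assumptions.
SOURCE_CSV = "sale_id,product,qty,price\n1,pen,2,10.5\n2,pencil,3,4.0\n"

def extract(text):
    """Extract: read raw rows from the source (here, an in-memory flat file)."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: convert types, fix casing and derive the sale amount."""
    return [
        (int(r["sale_id"]), r["product"].title(), int(r["qty"]) * float(r["price"]))
        for r in rows
    ]

def load(conn, rows):
    """Load: insert the transformed rows into the target table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sales (sale_id INTEGER, product TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
    conn.commit()

warehouse = sqlite3.connect(":memory:")
load(warehouse, transform(extract(SOURCE_CSV)))
loaded = warehouse.execute("SELECT * FROM sales ORDER BY sale_id").fetchall()
print(loaded)
```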

DATA ACQUISITION -- DATA EXTRACTION

Data Extraction:
It is a process of reading the data from various types of sources such as relational sources, ERP sources, mainframe sources, XML files and flat files.

Relational: Oracle, SQL Server
ERP: SAP, PeopleSoft
Mainframe: COBOL Files, DB2
File: Flat Files (Text Files), XML Files

DATA ACQUISITION -- DATA TRANSFORMATION

Data Transformation:
It is a process of cleaning the data and transforming the data into a required business format.
The following data transformation activities take place in the staging area:
Data Merging
Data Cleansing
Data Scrubbing
Data Aggregation

Data Merging:
It is a process of combining the data from multiple inputs and loading it into a single output. There are two types of data merging activities:
1. Join
2. Union
Data Cleansing:
It is a process of removing unwanted data from staging
OR
It is a process of correcting inconsistencies and inaccuracies.
Example: InitCap() and Round() functions
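
A small sketch of the two merging activities and of cleansing, in plain Python. The rows and the helper names (join, union, cleanse) are assumptions for illustration; str.title() and round() stand in for the InitCap() and Round() functions mentioned above.

```python
# Hypothetical staging rows; names and values are made up for illustration.
customers = [{"cust_id": 1, "name": "asha RAO"}, {"cust_id": 2, "name": "ravi kumar"}]
orders = [{"cust_id": 1, "amount": 199.987}, {"cust_id": 2, "amount": 50.5}]
north = [{"cust_id": 1, "amount": 10.0}]
south = [{"cust_id": 2, "amount": 20.0}]

def join(left, right, key):
    """Join: combine the columns of rows that share a key value.

    Assumes the key is unique in the right-hand input.
    """
    index = {row[key]: row for row in right}
    return [{**l, **index[l[key]]} for l in left if l[key] in index]

def union(*inputs):
    """Union: stack the rows of inputs that have the same columns."""
    return [row for rows in inputs for row in rows]

def cleanse(row):
    """Cleansing: fix inconsistent casing and excess precision."""
    return {**row, "name": row["name"].title(), "amount": round(row["amount"], 2)}

joined = [cleanse(r) for r in join(customers, orders, "cust_id")]
stacked = union(north, south)
print(joined)
print(stacked)
```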

Data Scrubbing:
It is a process of deriving new data definitions using existing data.
Example: Concat(First Name, Last Name), Sale Amount = QTY * Price
Data Aggregation:
It is a process of calculating the summaries for a group of records using aggregate functions.
Example: Average, Max, Min etc.
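
The two activities can be illustrated together: first derive new columns (scrubbing), then summarise them per group (aggregation). The column names, the grouping by region and the sample values are assumptions.

```python
from collections import defaultdict

# Hypothetical sales rows; column names and values are assumptions.
rows = [
    {"first": "Asha", "last": "Rao", "region": "South", "qty": 2, "price": 50.0},
    {"first": "Ravi", "last": "Kumar", "region": "South", "qty": 1, "price": 300.0},
    {"first": "Meena", "last": "Shah", "region": "North", "qty": 4, "price": 25.0},
]

# Scrubbing: derive new columns from existing ones.
scrubbed = [
    {**r, "full_name": f"{r['first']} {r['last']}", "sale_amount": r["qty"] * r["price"]}
    for r in rows
]

# Aggregation: summarise sale_amount for each region.
groups = defaultdict(list)
for r in scrubbed:
    groups[r["region"]].append(r["sale_amount"])

summary = {
    region: {"avg": sum(v) / len(v), "max": max(v), "min": min(v)}
    for region, v in groups.items()
}
print(summary)
```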

DATA ACQUISITION -- DATA LOADING

Data Loading:
It is a process of inserting the data into a target system. There are 2 types of data loads.

1. Initial or Full Load
It is a process of loading all the required data in the very first load.
2. Incremental or Delta Load
It is a process of loading only the new records after the initial load.
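
The difference between the two load types can be sketched with in-memory lists standing in for the source and target tables. The id column used to detect new records and the function names are assumptions for illustration.

```python
# Hypothetical source and target; a real job would read and write database tables.
source = [{"id": 1, "name": "A"}, {"id": 2, "name": "B"}]
target = []

def full_load(source_rows, target_rows):
    """Initial load: copy every source row into the (emptied) target."""
    target_rows.clear()
    target_rows.extend(dict(r) for r in source_rows)

def incremental_load(source_rows, target_rows):
    """Delta load: append only rows whose id is not yet in the target."""
    seen = {r["id"] for r in target_rows}
    new_rows = [dict(r) for r in source_rows if r["id"] not in seen]
    target_rows.extend(new_rows)
    return len(new_rows)

full_load(source, target)
source.append({"id": 3, "name": "C"})   # a new record arrives in the source
added = incremental_load(source, target)
print(added, target)
```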

Data Marts

A data mart is a simple form of a data warehouse that is focused on a single subject (or functional area), such as Sales, Finance, or Marketing. Data marts are often built and controlled by a single department within an organization.
There are 2 types of data marts (DM):
1. Dependent DM
2. Independent DM

Top Down Approach or Dependent Data Marts (W.H.Inmon)

According to W.H.Inmon, first we need to design an Enterprise Data Warehouse, then design smaller subject-oriented, department-specific databases known as Data Marts.

Data Warehousing Strategies

Top Down Approach

[Figure: Data Sources (operational systems), Staging Area, Data Warehouse, Data Marts (Sales, Purchase, Inventory)]

Bottom-Up Approach or Independent Data Marts (Ralph Kimball)

According to Ralph Kimball, first we need to design department-specific databases known as Data Marts, then integrate all data marts into an Enterprise Data Warehouse.

Bottom Up Approach

[Figure: Data Sources (operational systems), Staging Area, Data Marts (Sales, Purchase, Inventory), Data Warehouse]

Data Warehouse Dimensional Modeling

Dimension Table

Dimension tables contain textual information that represents the attributes of the business
Contain relatively static data
Dimension tables are joined to a fact table through a foreign key reference

Dimension Table Examples

Retail -- store name, zip code, product name, product category, day of week
Telecommunications -- call origin, call destination
Banking -- customer name, account number, branch, account officer
Insurance -- policy type, insured party

FACT TABLE
Contain numerical metrics of the business
Can hold large volumes of data
Can grow quickly

Fact Table Examples

Retail -- number of units sold, sales amount
Telecommunications -- length of call in minutes, average number of calls
Banking -- average monthly balance
Insurance -- claims amount

Types of Schemas

1. Star Schema
2. Snowflake Schema
3. Galaxy Schema
4. Fact Constellation Schema

STAR SCHEMA

A star schema is one in which a central fact table is surrounded by denormalized dimension tables. A star schema can be simple or complex. A simple star schema consists of one fact table, whereas a complex star schema has more than one fact table.
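
A simple star schema can be sketched with Python's built-in sqlite3 module. The table names (fact_sales, dim_product, dim_date), their columns and the sample rows are assumptions chosen for illustration; the point is that the fact table holds the metrics plus one key per dimension, and a report joins the fact table to each dimension exactly once.

```python
import sqlite3

# Hypothetical star schema: one fact table and two denormalized dimensions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, product_name TEXT, category TEXT);
CREATE TABLE dim_date    (date_key INTEGER PRIMARY KEY, month TEXT, year INTEGER);
CREATE TABLE fact_sales  (product_key INTEGER REFERENCES dim_product(product_key),
                          date_key    INTEGER REFERENCES dim_date(date_key),
                          units_sold  INTEGER, sales_amount REAL);
INSERT INTO dim_product VALUES (1, 'Pen', 'Stationery'), (2, 'Lamp', 'Home');
INSERT INTO dim_date    VALUES (1, 'Nov', 2011), (2, 'Dec', 2011);
INSERT INTO fact_sales  VALUES (1, 1, 10, 100.0), (1, 2, 5, 50.0), (2, 2, 1, 40.0);
""")

# Sales amount by product category and month.
report = conn.execute("""
SELECT p.category, d.month, SUM(f.sales_amount) AS total
FROM fact_sales f
JOIN dim_product p ON p.product_key = f.product_key
JOIN dim_date d    ON d.date_key    = f.date_key
GROUP BY p.category, d.month
ORDER BY p.category, d.month
""").fetchall()

for category, month, total in report:
    print(category, month, total)
```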

SNOWFLAKE SCHEMA

A snowflake schema is an enhancement of the star schema that adds additional dimensions. Snowflake schemas are useful when there are low-cardinality attributes in the dimensions.

GALAXY SCHEMA

A galaxy schema contains many fact tables with some common dimensions (conformed dimensions). This schema is a combination of many data marts.

FACT CONSTELLATION SCHEMA

The dimensions in this schema are segregated into independent dimensions based on the levels of hierarchy. For example, if geography has five levels of hierarchy, like territory, region, country, state and city, a constellation schema would have five dimensions instead of one.

CONFORMED DIMENSIONS
A dimension table which is shared across data marts or by more than one fact table
Example:
Calendar/Date/Time Dimension
Customer Dimension
Product Dimension

SURROGATE KEYS
A surrogate key has no meaning other than stating uniqueness for each record stored in the dimension tables.
It will be used in all dimension tables.
It is just a sequence number.
Advantages of surrogate keys include:
-- Control over data
-- Reduced fact table size
Avoid using the OLTP keys as data warehouse keys.

Example OLTP keys and their surrogate keys:
Empid sat/001/hyd/7924: Empkey 1001
Locid Hye/001/hitech/204: Lockey 2001
Projid GE/comm/US/NJ/001: Projkey 3001
Dateid 20/11/2011: Datekey 4001

Emp Dim: Emp Key, Emp id, Ename
Proj Dim: Proj Key, Proj id, Proj name
Loc Dim: Loc Key, Loc id, Loc name
Date Dim: Date Key, Date id, Month, year
Fact table: Emp Key, Loc Key, Proj Key, Date Key

Reduced space in the fact table
Integer to integer comparison instead of string to string
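
One way to picture surrogate key assignment is a lookup that hands out the next sequence number the first time an OLTP key is seen. The class name SurrogateKeyMap and its start parameter are assumptions for illustration; warehouses commonly use a database sequence or an ETL tool's sequence generator instead.

```python
from itertools import count

class SurrogateKeyMap:
    """Assign a sequence number to each distinct OLTP (natural) key."""

    def __init__(self, start):
        self._next = count(start)
        self._keys = {}

    def key_for(self, natural_key):
        # Reuse the existing surrogate key, or draw the next sequence number.
        if natural_key not in self._keys:
            self._keys[natural_key] = next(self._next)
        return self._keys[natural_key]

emp_keys = SurrogateKeyMap(start=1001)
proj_keys = SurrogateKeyMap(start=3001)

# The fact row stores compact integers instead of the long string identifiers.
fact_row = {
    "emp_key": emp_keys.key_for("sat/001/hyd/7924"),
    "proj_key": proj_keys.key_for("GE/comm/US/NJ/001"),
}
print(fact_row)
```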

SLOWLY CHANGING DIMENSIONS

SCD captures the changes which take place over a period of time.

1. SCD Type 1: A Type 1 dimension keeps only the current values. It does not maintain history.
2. SCD Type 2: A Type 2 dimension maintains the full history in the target. For each update it inserts a new record in the target tables.
3. SCD Type 3: A Type 3 dimension maintains current and previous information (partial history).
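
The three types can be contrasted on a single change, a customer moving from Hyderabad to Chennai. This is a sketch under assumptions: the column names (start_date, end_date, is_current, prev_city) and the way the new surrogate key is chosen are illustrative conventions, not something the slides prescribe.

```python
from datetime import date

def scd_type1(row, new_city):
    """Type 1: overwrite the current value; no history is kept."""
    return {**row, "city": new_city}

def scd_type2(history, new_city, changed_on):
    """Type 2: close the current version and insert a new record."""
    current = history[-1]
    closed = {**current, "end_date": changed_on, "is_current": False}
    new = {**current, "cust_key": current["cust_key"] + 1, "city": new_city,
           "start_date": changed_on, "end_date": None, "is_current": True}
    return history[:-1] + [closed, new]

def scd_type3(row, new_city):
    """Type 3: keep the current and the previous value in separate columns."""
    return {**row, "prev_city": row["city"], "city": new_city}

# Hypothetical customer dimension row.
base = {"cust_key": 1, "cust_id": "C100", "city": "Hyderabad"}

t1 = scd_type1(base, "Chennai")
t2 = scd_type2(
    [{**base, "start_date": date(2011, 1, 1), "end_date": None, "is_current": True}],
    "Chennai",
    date(2011, 11, 20),
)
t3 = scd_type3({**base, "prev_city": None}, "Chennai")
print(t1, t2, t3, sep="\n")
```

Type 1 ends with one row and no trace of Hyderabad, Type 2 ends with two rows (one closed, one current), and Type 3 ends with one row that remembers only the immediately previous city.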
