
September 9, 2014 Data Warehousing and Data Mining: Chapter 3

DATA WAREHOUSING AND DATA MINING

Chapter 3: General Introduction to Data Warehouses
This material draws in part on
the lecture slides for the textbook Data Mining: Concepts and Techniques
Jiawei Han and Micheline Kamber
Department of Computer Science
University of Illinois at Urbana-Champaign
www.cs.uiuc.edu/~hanj

Contents
Data warehouse concepts
Multidimensional data model
Data warehouse architecture
Data warehouse implementation
From data warehousing to data mining
New developments in data cube technology

Data warehouse concepts
A data warehouse (DW) is defined in many ways, none of them rigorous (exact).
A decision-support database that is maintained separately from the organization's operational database.
Supports information processing by providing a solid platform of consolidated, historical data for analysis.
"A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management's decision-making process." W. H. Inmon [Inm02]
Four characteristics: subject-oriented, integrated, time-variant, and nonvolatile

[Inm02] W. H. Inmon (2002). Building the Data Warehouse (Third Edition). John Wiley & Sons, Inc.

Data warehouse: concept
A data warehouse is an informational environment that [Pon01]:
Provides an integrated and total view of the enterprise
Makes the enterprise's current and historical information readily available for decision making
Makes decision-support transactions possible without hindering operational systems
Renders the enterprise's information consistent
Presents a flexible and interactive source of strategic information
Adds a fifth characteristic: data granularity
[Pon01] Paulraj Ponniah, Data Warehousing Fundamentals, John Wiley & Sons Inc., 2001

Data warehouse architecture: overview
[Ora02] Oracle9i Data Warehousing Guide, Release 2 (9.2), March 2002, Part No. A96520-01

Data warehousing
Data warehousing: the process of constructing and using data warehouses

DW characteristic: subject-oriented
Organized around major subjects, such as customer, product, sales.
Focuses on the modeling and analysis of data for decision makers, not on daily operations or transaction processing.
Provides a simple and concise view of particular subject issues by excluding data that are not useful in the decision-support process.

DW characteristic: subject-oriented
Figure: operational applications versus data warehouse subjects

DW characteristic: integrated
A DW is constructed by integrating multiple, heterogeneous data sources
Relational databases, flat files (a database encoded in a plain format such as .txt or .ini), on-line transaction records
Data cleaning and data integration techniques are applied.
Ensure consistency in naming conventions, encoding structures, attribute measures, etc. among the different data sources
E.g., hotel price: currency, tax, whether breakfast is included
When data is moved to the warehouse, it is converted.

DW characteristic: integrated

DW: integration issues

DW: subject-oriented and integrated

DW characteristic: time-variant
The time horizon for a DW is significantly longer than that of operational database systems.
Operational database: current-value data.
DW data: provides information from a historical perspective (e.g., the past 5-10 years)
Every key structure in the DW
Contains an element of time, explicitly or implicitly
The key of operational data may or may not contain a time element.


DW characteristic: time-variant
Data warehouse: time horizon of 5-10 years; a series of snapshots of data; key structure contains a time element
Operational system: time horizon from current to 60-90 days; records are updated; key structure may or may not contain a time element

DW characteristic: nonvolatile
A physically separate store of data transformed from the operational environment.
Operational updates of data do not occur in the DW environment.
Does not require transaction processing, recovery, or concurrency control mechanisms.
Requires only two operations in data accessing:
Initial loading of data and access of data. Source data is not modified inside the DW.

DW characteristic: nonvolatile

DW vs. heterogeneous DBMS
Traditional heterogeneous database integration:
Build wrappers/mediators on top of heterogeneous databases
Query-driven approach
When a query is posed to a local site, a meta-dictionary is used to translate the query into queries appropriate for the individual local databases, and the results are integrated into a global answer set
Complex information filtering, competition for resources
DW: update-driven, high performance
Information from heterogeneous sources is integrated in advance and stored in the DW for direct query and analysis

DW vs. operational DBMS
OLTP (on-line transaction processing)
Major task of traditional relational DBMS
Day-to-day operations: purchasing, inventory, banking, manufacturing, payroll, registration, accounting, etc.
OLAP (on-line analytical processing)
Major task of a data warehouse system
Data analysis and decision making
Distinct features (OLTP vs. OLAP):
User and system orientation: customer vs. market
Data contents: current, detailed vs. historical, consolidated
Database design: ER + application vs. star + subject
View: current, local vs. evolutionary, integrated
Access patterns: read/write access vs. read-only with complex queries

OLTP vs. OLAP
                      OLTP                                          OLAP
Users                 clerk, IT professional                        knowledge worker
Function              day-to-day operations                         decision support
DB design             application-oriented                          subject-oriented
Data                  current, up-to-date, detailed,                historical, summarized, integrated,
                      flat relational, isolated                     multidimensional, consolidated
Usage                 repetitive                                    ad hoc
Access                read/write, index/hash on primary key         lots of scans
Unit of work          short, simple transaction                     complex query
# records accessed    tens                                          millions
# users               thousands                                     hundreds
DB size               100MB-GB                                      100GB-TB
Metric                transaction throughput                        query throughput, response time

A separate data warehouse
High performance for both systems
DBMS tuned for OLTP: access methods, indexing, concurrency control, recovery
Warehouse tuned for OLAP: complex OLAP queries, multidimensional view, consolidation
Different functions and different data:
Missing data: decision support requires historical data, which operational databases do not typically maintain
Data consolidation: decision support requires consolidation (aggregation, summarization) of data from heterogeneous sources
Data quality: different sources typically use inconsistent data representations, codes and formats, which have to be reconciled

A separate data warehouse

From tables and spreadsheets to data cubes
A DW is based on a multidimensional data model, which views data in the form of a data cube
A data cube, such as sales, allows data to be modeled and viewed in multiple dimensions
Dimension tables, such as item (item_name, brand, type), or time (day, week, month, quarter, year)
The fact table contains measures (such as dollars_sold) and keys to each of the related dimension tables
In data warehousing literature, an n-D base cube is called a base cuboid. The topmost 0-D cuboid, which holds the highest level of summarization, is called the apex cuboid. The lattice of cuboids forms a data cube.

Sales data cube: lattice of cuboids
0-D (apex) cuboid: all
1-D cuboids: time; item; location; supplier
2-D cuboids: time,item; time,location; time,supplier; item,location; item,supplier; location,supplier
3-D cuboids: time,item,location; time,item,supplier; time,location,supplier; item,location,supplier
4-D (base) cuboid: time, item, location, supplier
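
The lattice above can be enumerated mechanically: every subset of the dimension set is one cuboid. The following is a minimal sketch, assuming no concept hierarchies inside the dimensions; the helper name cuboid_lattice is hypothetical.

```python
from itertools import combinations

def cuboid_lattice(dimensions):
    """Group every subset of the dimensions by size: 0-D apex up to n-D base."""
    return {k: list(combinations(dimensions, k))
            for k in range(len(dimensions) + 1)}

lattice = cuboid_lattice(["time", "item", "location", "supplier"])
for k, cuboids in lattice.items():
    print(f"{k}-D cuboids ({len(cuboids)}):", cuboids)

# Total number of cuboids in the lattice (2**n without hierarchies).
total = sum(len(cuboids) for cuboids in lattice.values())
print("total cuboids:", total)
```

With four dimensions the sketch lists 1 + 4 + 6 + 4 + 1 cuboids, matching the levels of the lattice.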

Chapter 3: Data warehouse fundamentals
Data warehouse concepts
Multidimensional data model
Data warehouse architecture
Data warehouse implementation
From data warehousing to data mining
New developments in data cube technology

Conceptual modeling of data warehouses
Modeling data warehouses: dimensions and measures
Star schema: a fact table in the middle connected to a set of dimension tables
Snowflake schema: a refinement of the star schema in which some dimensional hierarchies are normalized into a set of smaller dimension tables, forming a shape similar to a snowflake
Fact constellation schema: multiple fact tables share dimension tables; viewed as a collection of stars, it is therefore also called a galaxy schema or fact constellation

Example of a star schema
Sales fact table: time_key, item_key, branch_key, location_key; measures: units_sold, dollars_sold, avg_sales
time dimension table: time_key, day, day_of_the_week, month, quarter, year
item dimension table: item_key, item_name, brand, type, supplier_type
branch dimension table: branch_key, branch_name, branch_type
location dimension table: location_key, street, city, state_or_province, country
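
To make the star schema concrete, here is a minimal sketch using Python's built-in sqlite3 module. It keeps only two of the four dimensions to stay short, and the table names time_dim, item_dim, sales_fact and all sample rows are assumptions for illustration, not part of the slides.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE time_dim (time_key INTEGER PRIMARY KEY, day INTEGER,
                       month INTEGER, quarter TEXT, year INTEGER);
CREATE TABLE item_dim (item_key INTEGER PRIMARY KEY, item_name TEXT,
                       brand TEXT, type TEXT);
-- The fact table holds the measures plus one foreign key per dimension.
CREATE TABLE sales_fact (
    time_key INTEGER REFERENCES time_dim(time_key),
    item_key INTEGER REFERENCES item_dim(item_key),
    units_sold INTEGER,
    dollars_sold REAL
);
INSERT INTO time_dim VALUES (1, 5, 1, 'Q1', 2014), (2, 9, 4, 'Q2', 2014);
INSERT INTO item_dim VALUES (10, 'TV-A', 'Acme', 'TV'), (11, 'PC-B', 'Acme', 'PC');
INSERT INTO sales_fact VALUES (1, 10, 2, 800.0), (1, 11, 1, 500.0), (2, 10, 3, 1200.0);
""")

# A typical star join: the fact table is joined to each dimension it needs,
# then the measure is aggregated by dimension attributes.
rows = conn.execute("""
    SELECT t.quarter, i.type, SUM(f.dollars_sold)
    FROM sales_fact f
    JOIN time_dim t ON f.time_key = t.time_key
    JOIN item_dim i ON f.item_key = i.item_key
    GROUP BY t.quarter, i.type
    ORDER BY t.quarter, i.type
""").fetchall()
print(rows)
```

In a snowflake variant, item_dim would hold a supplier_key pointing to a separate supplier table instead of storing supplier_type directly.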

Example of a snowflake schema
Sales fact table: time_key, item_key, branch_key, location_key; measures: units_sold, dollars_sold, avg_sales
time dimension table: time_key, day, day_of_the_week, month, quarter, year
item dimension table: item_key, item_name, brand, type, supplier_key
supplier dimension table: supplier_key, supplier_type
branch dimension table: branch_key, branch_name, branch_type
location dimension table: location_key, street, city_key
city dimension table: city_key, city, state_or_province, country

Example of a fact constellation
Sales fact table: time_key, item_key, branch_key, location_key; measures: units_sold, dollars_sold, avg_sales
Shipping fact table: time_key, item_key, shipper_key, from_location, to_location; measures: dollars_cost, units_shipped
time dimension table: time_key, day, day_of_the_week, month, quarter, year
item dimension table: item_key, item_name, brand, type, supplier_type
branch dimension table: branch_key, branch_name, branch_type
location dimension table: location_key, street, city, province_or_state, country
shipper dimension table: shipper_key, shipper_name, location_key, shipper_type

A data mining query language: DMQL
Data Mining Query Language: DMQL
Cube definition (fact table)
define cube <cube_name> [<dimension_list>]: <measure_list>
Dimension definition (dimension table)
define dimension <dimension_name> as (<attribute_or_subdimension_list>)
Special case (shared dimension tables)
The first time, as in a cube definition
define dimension <dimension_name> as <dimension_name_first_time> in cube <cube_name_first_time>


Defining a star schema in DMQL
define cube sales_star [time, item, branch, location]:
dollars_sold = sum(sales_in_dollars), avg_sales =
avg(sales_in_dollars), units_sold = count(*)
define dimension time as (time_key, day, day_of_week,
month, quarter, year)
define dimension item as (item_key, item_name, brand,
type, supplier_type)
define dimension branch as (branch_key, branch_name,
branch_type)
define dimension location as (location_key, street, city,
province_or_state, country)

Defining a snowflake schema in DMQL
define cube sales_snowflake [time, item, branch, location]:
dollars_sold = sum(sales_in_dollars), avg_sales =
avg(sales_in_dollars), units_sold = count(*)
define dimension time as (time_key, day, day_of_week, month, quarter,
year)
define dimension item as (item_key, item_name, brand, type,
supplier(supplier_key, supplier_type))
define dimension branch as (branch_key, branch_name, branch_type)
define dimension location as (location_key, street, city(city_key,
province_or_state, country))

Defining a fact constellation schema in DMQL
define cube sales [time, item, branch, location]:
dollars_sold = sum(sales_in_dollars), avg_sales =
avg(sales_in_dollars), units_sold = count(*)
define dimension time as (time_key, day, day_of_week, month, quarter, year)
define dimension item as (item_key, item_name, brand, type, supplier_type)
define dimension branch as (branch_key, branch_name, branch_type)
define dimension location as (location_key, street, city, province_or_state,
country)
define cube shipping [time, item, shipper, from_location, to_location]:
dollar_cost = sum(cost_in_dollars), unit_shipped = count(*)
define dimension time as time in cube sales
define dimension item as item in cube sales
define dimension shipper as (shipper_key, shipper_name, location as location
in cube sales, shipper_type)
define dimension from_location as location in cube sales
define dimension to_location as location in cube sales

Measures: three categories
Distributive: if the result derived by applying the function to n aggregate values is the same as that derived by applying the function to all the data without partitioning.
E.g., count(), sum(), min(), max().
Algebraic: if it can be computed by an algebraic function with M arguments (where M is a bounded integer), each of which is obtained by applying a distributive aggregate function.
E.g., avg(), min_N(), standard_deviation().
Holistic: if there is no constant bound on the storage size needed to describe a subaggregate.
E.g., median(), mode(), rank().
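
The difference between the three categories shows up as soon as the data is split into partitions. The sketch below uses made-up numbers; it illustrates that sum() can be combined from partial sums, avg() from a bounded number of distributive parts (sum and count), while median() cannot be recovered from per-partition medians.

```python
from statistics import median

# Hypothetical data split into three partitions.
partitions = [[4, 8, 15], [16, 23], [42]]
all_values = [v for part in partitions for v in part]

# Distributive: the sum of the partial sums equals the sum over all data.
partial_sums = [sum(part) for part in partitions]
assert sum(partial_sums) == sum(all_values)

# Algebraic: avg() is derived from two distributive aggregates (sum, count).
partials = [(sum(part), len(part)) for part in partitions]
avg = sum(s for s, _ in partials) / sum(n for _, n in partials)

# Holistic: combining per-partition medians does not give the true median;
# in general the whole data set is needed.
median_of_medians = median(median(part) for part in partitions)
true_median = median(all_values)

print(avg, true_median, median_of_medians)
```

This is why distributive and algebraic measures can be computed cuboid by cuboid from lower-level aggregates, whereas holistic measures are expensive to maintain in a cube.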

A concept hierarchy: dimension (location)
Levels from top to bottom, with the members shown in the figure:
all: all
region: Europe, ..., North_America
country: Germany, ..., Spain, Canada, ..., Mexico
city: Frankfurt, ..., Vancouver, ..., Toronto
office: L. Chan, ..., M. Wind

View of data warehouses and hierarchies
Specification of hierarchies
Schema hierarchy
day < {month < quarter; week} < year
Set_grouping hierarchy
{1..10} < inexpensive

Multidimensional data
Sales volume as a function of product, month, and region
Dimensions: Product, Location, Time
Hierarchical summarization paths
Product: Industry, Category, Product
Location: Region, Country, City, Office
Time: Year, Quarter, Month or Week, Day

A sample data cube
Date axis: 1Qtr, 2Qtr, 3Qtr, 4Qtr, sum
Product axis: TV, VCR, PC, sum
Country axis: U.S.A, Canada, Mexico, sum
Highlighted cell: total annual sales of TV in U.S.A.

Cuboids corresponding to the cube
0-D (apex) cuboid: all
1-D cuboids: product; date; country
2-D cuboids: product,date; product,country; date,country
3-D (base) cuboid: product, date, country

Browsing a data cube
Visualization
OLAP capabilities
Interactive manipulation

Typical OLAP operations
Roll up (drill-up): summarize data
By climbing up a hierarchy or by dimension reduction
Drill down (roll down): the reverse of roll-up
From a higher-level summary to a lower-level summary or detailed data, or by introducing a new dimension
Slice and dice:
Project and select
Pivot (rotate):
Reorient the cube for visualization, e.g., 3D to a series of 2D planes.
Other operations
Drill across: involving more than one fact table.
Drill through: through the bottom level of the cube to its back-end relational tables (using SQL)
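
The operations above can be sketched on a tiny in-memory cube. This is only an illustration under stated assumptions: the cube is a Python dict keyed by (quarter, city, product), the figures are made up, and the helper names are hypothetical rather than part of any OLAP product.

```python
from collections import defaultdict

# Hypothetical base cells: (quarter, city, product) -> units sold.
cube = {
    ("Q1", "Toronto", "TV"): 10,
    ("Q1", "Vancouver", "TV"): 7,
    ("Q1", "Frankfurt", "PC"): 4,
    ("Q2", "Toronto", "PC"): 6,
    ("Q2", "Frankfurt", "TV"): 5,
}
city_to_country = {"Toronto": "Canada", "Vancouver": "Canada",
                   "Frankfurt": "Germany"}

def roll_up_city_to_country(cells):
    """Roll up by climbing the location hierarchy from city to country."""
    out = defaultdict(int)
    for (quarter, city, product), units in cells.items():
        out[(quarter, city_to_country[city], product)] += units
    return dict(out)

def drop_dimension(cells, axis):
    """Roll up by dimension reduction: sum out the dimension at `axis`."""
    out = defaultdict(int)
    for key, units in cells.items():
        out[key[:axis] + key[axis + 1:]] += units
    return dict(out)

def slice_cube(cells, axis, value):
    """Slice: select a single value on one dimension."""
    return {k: v for k, v in cells.items() if k[axis] == value}

def dice(cells, allowed):
    """Dice: select on several dimensions; `allowed` maps axis -> value set."""
    return {k: v for k, v in cells.items()
            if all(k[axis] in values for axis, values in allowed.items())}

by_country = roll_up_city_to_country(cube)
without_city = drop_dimension(cube, 1)
q1_only = slice_cube(cube, 0, "Q1")
tv_canada = dice(by_country, {1: {"Canada"}, 2: {"TV"}})
print(by_country, without_city, q1_only, tv_canada, sep="\n")
```

Drill-down is the inverse navigation: going from by_country back to cube requires the detailed cells to be kept, which is why the base cuboid is stored.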

A star-net query model
Radial lines (dimensions) with their abstraction levels as shown in the figure:
Shipping Method: AIR-EXPRESS, TRUCK
Customer Orders: ORDER, CONTRACTS
Customer
Product: PRODUCT GROUP, PRODUCT LINE, PRODUCT ITEM
Organization: SALES PERSON, DISTRICT, DIVISION
Promotion
Location: CITY, COUNTRY, REGION
Time: DAILY, QTRLY, ANNUALY
Each circle is called a footprint

Data warehouse usage
Three kinds of data warehouse applications
Information processing
Supports querying, basic statistical analysis, and reporting using crosstabs, tables, charts and graphs
Analytical processing
Multidimensional analysis of the data in the warehouse
Supports basic OLAP operations: roll-up, drill-down, pivoting
Data mining
Knowledge discovery from hidden patterns
Supports association analysis, constructing analytical models, performing classification and prediction, and presenting the mining results using visualization tools.
Distinguish among these three kinds of tasks

Chapter 3: Data warehouse fundamentals
Data warehouse concepts
Multidimensional data model
Data warehouse architecture
Data warehouse implementation
From data warehousing to data mining
New developments in data cube technology

Design of a data warehouse: a business analysis framework
Four views regarding the design of a data warehouse
Top-down view
Allows selection of the relevant information necessary for the data warehouse
Data source view
Exposes the information being captured, stored, and managed by operational systems
Data warehouse view
Consists of fact tables and dimension tables
Business query view
Sees the perspectives of data in the warehouse from the viewpoint of the end user

Data warehouse design process
Top-down, bottom-up approaches, or a combination of both
Top-down: starts with overall design and planning (mature)
Bottom-up: starts with experiments and prototypes (rapid)
From a software engineering point of view
Waterfall: structured and systematic analysis at each step before proceeding to the next
Spiral: rapid generation of increasingly functional systems, with short and quick turnaround
Typical data warehouse design process
Choose a business process to model, e.g., orders, invoices, etc.
Choose the grain (atomic level of data) of the business process
Choose the dimensions that will apply to each fact table record
Choose the measures that will populate each fact table record

Multi-tiered architecture
Data sources: operational DBs, other sources
Extract, transform, load, refresh (with monitor & integrator and metadata)
Data storage: data warehouse, data marts
Serve
OLAP engine: OLAP server
Front-end tools: analysis, query, reports, data mining

Three-tier architecture

Three data warehouse models
Enterprise warehouse
Collects all of the information about subjects spanning the entire organization
Data mart
A subset of corporate-wide data that is of value to a specific group of users. Its scope is confined to specific, selected groups, such as a marketing data mart.
Independent vs. dependent (sourced directly from the warehouse) data marts
Virtual warehouse
A set of views over operational databases
Only some of the possible summary views may be materialized

Data warehouse development: a recommended approach
Figure labels: define a high-level corporate data model; model refinement (twice); data marts; distributed data marts; enterprise data warehouse; multi-tier data warehouse

OLAP server architectures
Relational OLAP (ROLAP)
Uses a relational or extended-relational DBMS to store and manage warehouse data, and OLAP middleware to support the missing pieces.
Includes optimization of the DBMS back end, implementation of aggregation navigation logic, and additional tools and services.
Greater scalability
Multidimensional OLAP (MOLAP)
Array-based multidimensional storage engine (sparse matrix techniques)
Fast indexing to pre-computed summarized data
Hybrid OLAP (HOLAP)
Flexibility for the user, e.g., low level: relational, high level: array
Specialized SQL servers
Specialized support for SQL queries over star/snowflake schemas

Multidimensional data model
Business managers tend to think multidimensionally. For example, they tend to describe what their company does as follows:
"We sell products in various markets, and we measure our performance over time."
Data warehouse designers listen carefully and add their own special emphasis:
"We sell PRODUCTS in various MARKETS, and we measure our performance over TIME."

Multidimensional data model (2)
Intuition: the business as a cube of data:
Each edge of the cube carries a label.
Points inside the cube are the intersections of the edges.
For the business description above
The edges are Product, Market, and Time.
It is easy to understand and imagine that the points inside the cube are the measures of business performance for a combination of Product, Market, and Time values.
Figure: the dimensions of a business (axes: Market, Time, Product)

ON-LINE ANALYTICAL PROCESSING
An OLAP (On-Line Analytical Processing) system
is a powerful data management system that supports data analysis:
slice the data along many different edges,
drill down to a more detailed level,
roll up to a more aggregated level.
The core nature of OLAP
Data is extracted from the DW or from a data mart (subject-specific data warehouse)
Data is transformed into a multidimensional model
Data is stored in a multidimensional data store.

ON-LINE ANALYTICAL PROCESSING
The main object of OLAP is the cube: a multidimensional representation of detailed and summary data.
Recall: a cube consists of a fact table, one or more dimension tables, measures, and partitions.
Cube: the cube is the main element in on-line analytical processing; it is a subset of the data in the data warehouse, organized and aggregated in multidimensional structures
Dimension: a dimension describes a category by which the numeric data in the cube is broken down for analysis.
Measures: a cube's measures are columns in the fact table. Measures identify the numeric values from the fact table that are aggregated for analysis, such as price, cost, or quantity sold.
Partitions: every cube has at least one partition that holds its data; a single partition is created automatically when the cube is defined.

OLAP requirements
A set of requirements is put forward as a benchmark that multidimensional data models and OLAP query languages must satisfy.
These requirements derive from the general design principles that proved successful with the relational model and from the specific characteristics of OLAP applications:
1. General design principles
2. Complex structure of dimensions
3. Complex structured values (measures)
4. Typical queries

GENERAL DESIGN PRINCIPLES
Implementation independence: a standard model must be purely conceptual.
It must not contain any implementation detail.
This is especially important in OLAP applications because some existing systems (called ROLAP systems) implement multidimensionality by mapping the logical multidimensional model onto a relational model.
There, each relation is regarded as the physical structure that stores the data
Separation of structure and content:
Allows the data structure (i.e., the multidimensional cubes and their dimensions) to be separated from the contents (i.e., the cell values)
Declarative query language:
As with SQL, a multidimensional query language should allow query optimization and data independence.
A logical calculus or an algebra that permits such optimizations should be considered for this purpose.
Beyond these general requirements, OLAP applications have some specific requirements that do not apply to other areas of multidimensional analysis (e.g., image analysis, GIS, PACS).

COMPLEX STRUCTURE OF DIMENSIONS
Example: a car manufacturer wants to
analyze vehicle repairs
in order to improve its products, define new warranty policies, and assess the quality of garages.
The main subject of analysis is the vehicle repair.
The characteristic dimensions of this application are the vehicle that was repaired
e.g., Mr. Simpson's car
the garage that performed the repair
e.g., Munich, Main Street 5,
and the date of the repair
e.g., 27-6.
These dimensions span a three-dimensional space.

COMPLEX STRUCTURE OF DIMENSIONS (2)
The dimensions of a multidimensional array are structured only by a linear order defined on the indices (usually integer values).
OLAP applications: such indices are not sufficient, because from the view of the OLAP end user the elements (instances) of an OLAP dimension (the dimension members) are not linearly ordered (e.g., garages).
Instead, dimensions are structured by hierarchies made up of levels. Each hierarchy level contains a set of members.
A level L1 can roll up to a level L2.
The meaning of this relationship:
Level L2 provides a classification or abstraction of L1.
Describing the hierarchy is essential for a model with a level structure.
It is essential to specify which element of the lower level corresponds to which element of the higher level.
This structure is the member structure.

COMPLEX STRUCTURE OF DIMENSIONS (3)
Hierarchies are used to structure the dimensions
Both the member structure and the level structure of a dimension take the form of a graph: members and levels are nodes; relationships are edges.
Figure: the member graph is a tree structure. Special case: the levels may be linearly ordered by the roll-up relationship.
In general, the member structure and the level structure are described by acyclic graphs.

COMPLEX STRUCTURE OF VALUES
The content of a cell of the multidimensional cube
can be built up in many different ways.
A cell may contain several measures that together form a record.
OLAP applications usually contain a large number of measures.
Measures that are not atomic
are computed from other measures (atomic or derived measures) in the cube.
They depend on the calculation formulas by which they are derived from other measures,
which describe hierarchies over the atomic measures.
Handling complex measures with respect to aggregation
still needs attention.
Whether applying an aggregation function is meaningful has to be considered for every measure

COMPLEX STRUCTURE OF VALUES (2)
The concept of measures is similar to the concept of views in relational systems
Description of derived measures
The calculation rule is part of the database schema.
Derived measures and atomic measures
are handled by the query languages.
The query language may also support ad hoc calculations specified within a query

TYPICAL QUERIES
Typical OLAP queries
select from the dimension space (mostly two dimensions).
restrict the dimensions of the result cube and determine the levels to be computed (e.g., year = 1997).
The level of each dimension is given by the other dimensions,
through a restriction or through the statement of the requested result
e.g., report for each month.
Restrictions can also be formulated using measures in predicates
e.g., the number of persons is greater than 3

TYPICAL QUERIES (2)
An OLAP environment is characterized by navigational data access.
Users formulate queries that depend on the results of previous queries
e.g., break the values for the year 1997 down by month.
The usual operations are drill-down and roll-up along the dimension structure.
Depending on the kind of analysis the user wants,
special analysis functions become necessary
e.g., for ABC analysis.
In practice, the query language can incorporate user-defined functions.

TYPICAL QUERIES (3)
Multidimensional schema of the example in ME/R notation
Example query: give the average of the total value per month for the garages in Bavaria, by garage type, during the year 1997.
This query can be read in several ways; assume the following reading:
The query restricts the year dimension (year = 1997) and the geographic region (region = Bavaria).
The query contains a roll-up from day to month using sum as the aggregation function (in fact over a single vehicle repair measure),
and the query concerns all vehicle repairs in each month
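
One possible reading of the example query can be sketched in a few lines. Everything below is an assumption made for illustration: the sample repair records, the garage types "dealer" and "independent", and the decision to average only over months that contain at least one repair.

```python
from collections import defaultdict
from datetime import date

# Hypothetical repair facts: (repair date, garage, region, garage type, cost).
repairs = [
    (date(1997, 6, 27), "A", "Bavaria", "dealer", 300.0),
    (date(1997, 6, 30), "B", "Bavaria", "dealer", 100.0),
    (date(1997, 7, 2), "A", "Bavaria", "dealer", 200.0),
    (date(1997, 7, 5), "C", "Bavaria", "independent", 150.0),
    (date(1996, 7, 5), "C", "Bavaria", "independent", 999.0),  # wrong year
    (date(1997, 7, 9), "D", "Hesse", "dealer", 500.0),         # wrong region
]

# Restriction (year = 1997, region = Bavaria) and roll-up from day to month
# using sum as the aggregation function.
monthly_total = defaultdict(float)
for day, garage, region, garage_type, cost in repairs:
    if day.year == 1997 and region == "Bavaria":
        monthly_total[(garage_type, day.month)] += cost

# Second aggregation: average of the monthly totals per garage type.
totals_by_type = defaultdict(list)
for (garage_type, month), total in monthly_total.items():
    totals_by_type[garage_type].append(total)
average_monthly_total = {garage_type: sum(totals) / len(totals)
                         for garage_type, totals in totals_by_type.items()}
print(average_monthly_total)
```

The two-step structure (sum per month, then average over months) is what makes the query ambiguous: a different reading would average individual repair costs instead.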

Approaches
Agrawal, Gupta, Sarawagi
Cabibbo, Torlone
Li, Wang
Gyssens, Lakshmanan
Lehner
Vassiliadis




APPROACH OF Agrawal, Gupta, Sarawagi
The approach of [AGS97]
presents a multidimensional data model and algebraic operations.
organizes data in one or more hypercubes.
The value of every cell is either an n-tuple or an element of the set {0, 1}.
A cell with the value 1 means that this combination of dimension values exists.
An n-tuple represents a record with n components.
Cells without content are assigned the value 0.
Dimensions have no structure or order, and their members are represented by names.

APPROACH OF Agrawal, Gupta, Sarawagi (2)
The approach of [AGS97]
A cube C
with k dimensions and n-tuples as cell values
is defined by the triple (D, E(C), N), where D is the set of the k dimension names.
Each dimension D_i has a domain dom_i. E(C) is a function mapping dom_1 x ... x dom_k to an n-tuple (the values of the cells in cube C) or to {0, 1}.
N is an n-tuple containing the names of the components of the n-tuples in the cube.

APPROACH OF Cabibbo, Torlone
A multidimensional model and a corresponding description language based on a logical calculus ([CT97])
The multidimensional data model is defined with the f-table as its basic data structure.
An f-table is a relational table containing one tuple for every cell of the data cube that holds a value.
Dimensions are defined by a graph structure (DAG) made up of dimension levels.

APPROACH OF Cabibbo, Torlone (2)
A dimension
is defined as a triple (L, ≤, R-UP).
L is the set of levels, ordered by ≤ from fine to coarse (e.g., garage ≤ region).
R-UP is a set of functions that map members of a lower level to members of a higher level (e.g., garages A, B and C belong to the region Bavaria).
Every level l in L is mapped to a set of values called the domain of l (e.g., dom(garage) = {A, B, C, ...})
n-dimensional f-table:
The central entity of the model has the following form:
f[A1:l1, ..., An:ln]:l0.
f is the name of the f-table,
li (0 ≤ i ≤ n) is the name of a dimension level,
and Aj (1 ≤ j ≤ n) is the name of an attribute.
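
A dimension with roll-up functions and a small f-table can be sketched as plain dictionaries. This is an illustrative simplification, not the formalism of [CT97]: the levels form a single chain instead of a general DAG, and the members, the f-table name cost, and the helper roll_up are hypothetical.

```python
from collections import defaultdict

# Levels ordered from fine to coarse (a chain: garage <= region <= country).
levels = ["garage", "region", "country"]

# One roll-up function per pair of adjacent levels, stored as a mapping.
r_up = {
    ("garage", "region"): {"A": "Bavaria", "B": "Bavaria",
                           "C": "Bavaria", "D": "Hesse"},
    ("region", "country"): {"Bavaria": "Germany", "Hesse": "Germany"},
}

def roll_up(member, from_level, to_level):
    """Map a member of a lower level to its ancestor at a higher level."""
    i, j = levels.index(from_level), levels.index(to_level)
    if i > j:
        raise ValueError("can only roll up from a finer to a coarser level")
    for lower, upper in zip(levels[i:j], levels[i + 1:j + 1]):
        member = r_up[(lower, upper)][member]
    return member

# A 1-dimensional f-table cost[G:garage]:numeric, one entry per filled cell.
cost = {"A": 300.0, "B": 100.0, "D": 500.0}

# Aggregating the f-table along the roll-up function garage -> region.
cost_by_region = defaultdict(float)
for garage, value in cost.items():
    cost_by_region[roll_up(garage, "garage", "region")] += value
print(dict(cost_by_region))
```

Keeping the roll-up mappings separate from the f-table reflects the separation of structure and content required earlier in the chapter.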

APPROACH OF Li, Wang
A multidimensional data model for OLAP applications:
The core of this approach is an algebraic query language, called grouping algebra.
The basic concept is a multidimensional cube made up of relations, the dimensions.
For each combination of dimension values there is a corresponding scalar datum that stands for a single attribute.
For the conceptual model to be well supported by powerful tools, it would need to be extended to offer complex values instead of only a single scalar value,
which would allow complex facts to be modeled in one cube rather than in several cubes over the same dimensions

APPROACH OF Gyssens, Lakshmanan
The approach of Gyssens and Lakshmanan [GL97]
introduces a conceptual multidimensional data model for OLAP applications.
Its main advantage is the ability to separate structure and content.
It provides algebraic operations and an equivalent calculus for the model.
A multidimensional table database is a set of tables.
It resembles a star schema, but at the conceptual level
It contains the algebraic operators selection, projection, renaming, union, intersection, difference, Cartesian product, the restructuring operators (unfold, which adds a dimension, and fold, which removes a dimension), classification, and summarization.

APPROACH OF Lehner
[LRT97] extends the multidimensional model and provides:
Two orthogonal structuring mechanisms for dimensions: classification hierarchies and features.
A query language (called CQL) for the multidimensional data model.
A formal description of the multidimensional data model with these extensions, together with a data manipulation algebra.
The dimension levels are ordered
Balanced tree-structured hierarchies can be modeled.
The dimension structure is described only informally

APPROACH OF Vassiliadis [Vas98]
Goal:
Provide a model that includes the natural OLAP operations (such as slicing and drilling) as operators.
Present a formal data model and an algebra that maps onto relational algebra as well as onto natural array data structures.
Basic definition of a dimension
Similar to the MD model [CT97].
A dimension is defined as a lattice (H, ≺). H = {DL1, ..., DLm} is a set of levels, with a domain dom(DLi) for each level DLi.
A special feature is the use of multi-valued dimensions: a dimension comprising more than one member.
This allows sophisticated dimension operators. The relation ≺ defines a partial order on the dimension levels.

APPROACH OF Vassiliadis (2)
Each dimension comprises a set of dimension paths
(with only a single element if no multiple hierarchies are defined in the dimension)
The dimension levels of different dimensions must be disjoint.
A special feature of this work: the drill-down operator
is represented explicitly in their model.
While this is not explicit as an operator in most models, a corresponding operator is also introduced in [Leh98].

Comparison

Chapter 3: Data warehouse fundamentals
Data warehouse concepts
Multidimensional data model
Data warehouse architecture
Data warehouse implementation
From data warehousing to data mining
New developments in data cube technology

EFFICIENT DATA CUBE COMPUTATION
A data cube can be viewed as a lattice of cuboids
The bottom-most cuboid is the base cuboid
The top-most cuboid (apex) contains only one cell
Number of cuboids in an n-dimensional cube where dimension i has L_i levels:
T = \prod_{i=1}^{n} (L_i + 1)
Materialization of the data cube
Materialize every cuboid (full materialization), none (no materialization), or some (partial materialization)
Selection of the cuboids to materialize
Based on size, sharing, access frequency, etc.
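
The formula is easy to check numerically. A minimal sketch follows, assuming each L_i counts the hierarchy levels of dimension i without the virtual top level all (the +1 in the formula accounts for it); the function name total_cuboids is hypothetical.

```python
from math import prod

def total_cuboids(levels_per_dimension):
    """T = product over all dimensions of (L_i + 1)."""
    return prod(levels + 1 for levels in levels_per_dimension)

# Four dimensions with a single level each: the 16 cuboids of a 4-D lattice.
flat_cube = total_cuboids([1, 1, 1, 1])

# Ten dimensions with four levels each already give millions of cuboids.
deep_cube = total_cuboids([4] * 10)
print(flat_cube, deep_cube)
```

The rapid growth of T is the reason full materialization is rarely practical and a subset of cuboids has to be selected.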

Cube computation
Cube definition and computation in DMQL
define cube sales[item, city, year]: sum(sales_in_dollars)
compute cube sales
Transform it into an SQL-like language (with a new operator cube by, introduced by Gray et al. '96)
SELECT item, city, year, SUM (amount)
FROM SALES
CUBE BY item, city, year
The following group-bys need to be computed
(city, item, year),
(city, item), (city, year), (item, year),
(city), (item), (year)
()
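
The effect of CUBE BY can be imitated by running one group-by per subset of the dimensions. The sketch below is deliberately naive: it rescans the base rows for every group-by and uses made-up sales rows; the function name cube_by is hypothetical.

```python
from collections import defaultdict
from itertools import combinations

DIMENSIONS = ("item", "city", "year")

# Hypothetical rows of the SALES table.
sales = [
    {"item": "TV", "city": "Toronto", "year": 2013, "amount": 100},
    {"item": "TV", "city": "Toronto", "year": 2014, "amount": 150},
    {"item": "PC", "city": "Vancouver", "year": 2014, "amount": 200},
]

def cube_by(rows, dimensions, measure):
    """Compute SUM(measure) for every subset of the dimensions."""
    result = {}
    for k in range(len(dimensions), -1, -1):
        for group in combinations(dimensions, k):
            totals = defaultdict(int)
            for row in rows:
                totals[tuple(row[d] for d in group)] += row[measure]
            result[group] = dict(totals)
    return result

cube = cube_by(sales, DIMENSIONS, "amount")
for group, totals in cube.items():
    print(group, totals)
```

For three dimensions this produces 2^3 = 8 group-bys, from the base cuboid (item, city, year) down to the empty group-by () that holds the grand total.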

Cube computation: ROLAP-based methods
Efficient cube computation methods
ROLAP-based cubing algorithms (Agarwal et al. '96)
Array-based cubing algorithm (Zhao et al. '97)
Bottom-up computation method (Beyer & Ramakrishnan '99)
H-cubing technique (Han, Pei, Dong & Wang: SIGMOD'01)
ROLAP-based cubing algorithms (Agarwal et al. '96)
Sorting, hashing, and grouping operations are applied to the dimension attributes in order to reorder and cluster related tuples
Grouping is performed on some subaggregates as a partial grouping step
Aggregates may be computed from previously computed aggregates rather than from the base fact table

Cube computation: ROLAP-based methods (2)
Hash/sort-based methods (Agarwal et al., VLDB'96)
Smallest-parent: compute a cuboid from the smallest previously computed cuboid
Cache-results: cache the results of a cuboid from which other cuboids are computed, to reduce disk I/O
Amortize-scans: compute as many cuboids as possible at the same time, to amortize disk reads
Share-sorts: share sorting costs across multiple cuboids when a sort-based method is used
Share-partitions: share the partitioning cost across multiple cuboids when hash-based algorithms are used
September 9, 2014 Kho d liu v khai ph d liu: Chng 3 81
Tng hp mng a cch cho tnh
ton khi
Phn cc mng thnh cc khc (cc khi con c th t vo b nh
trong).
a ch mng ri rc wocj nn: (chunk_id, offset)
Tnh tng hp theo multiway qua thm cube theo th t m cc
tiu s ln thm mi , v thu gn gi truy nhp v lu tr b nh
[Figure: example 3-D data array of A, B, C, with dimensions A (a0 to a3), B (b0 to b3), and C (c0 to c3), partitioned into 64 chunks numbered 1 to 64. Question posed on the figure: what is the best traversing order to do multi-way aggregation?]
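The (chunk_id, offset) addressing of a chunked sparse array can be sketched as below. This is a minimal sketch under stated assumptions: ids and offsets are 0-based and row-major (the last dimension varies fastest), which need not match the numbering used in the figure, and the names `chunk_address` and `store_sparse` are hypothetical.

```python
def chunk_address(cell, dim_sizes, chunk_sizes):
    """Map a cell coordinate to (chunk_id, offset), both in row-major order."""
    chunk_id = 0
    offset = 0
    for coord, size, csize in zip(cell, dim_sizes, chunk_sizes):
        chunks_along_dim = -(-size // csize)  # ceiling division
        chunk_id = chunk_id * chunks_along_dim + coord // csize
        offset = offset * csize + coord % csize
    return chunk_id, offset


def store_sparse(cells, dim_sizes, chunk_sizes):
    """Keep only non-empty cells, grouped by chunk: {chunk_id: {offset: value}}."""
    chunks = {}
    for cell, value in cells.items():
        cid, off = chunk_address(cell, dim_sizes, chunk_sizes)
        chunks.setdefault(cid, {})[off] = value
    return chunks


# An 8 x 8 x 8 array cut into 2 x 2 x 2 chunks gives 4 * 4 * 4 = 64 chunks.
sparse = store_sparse({(0, 0, 0): 5.0, (7, 7, 7): 2.0}, (8, 8, 8), (2, 2, 2))
```

Empty chunks never appear in `sparse`, which is the point of the compressed addressing scheme.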
Multi-way Array Aggregation for Cube Computation (2)
Method: the planes should be sorted and computed according to their size in ascending order.
See the details in the example 3-D data array of A, B, C
Idea: keep the smallest plane in main memory; fetch and compute only one chunk at a time for the largest plane
Limitation of the method: it performs well only for a small number of dimensions
If there are a large number of dimensions, bottom-up computation and iceberg cube computation methods can be explored
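The effect of the traversal order on memory can be estimated with a small calculation. The sketch below is an assumption-laden illustration: the dimension and chunk sizes are example values that do not come from the slides, and the cost model (hold the fastest plane whole, one row of the next plane, and one chunk of the last plane) is a simplified reading of the method for three dimensions.

```python
from itertools import permutations

# Illustrative sizes (assumed, not from the slides): each dimension has 4 chunks.
DIM_SIZES = {"A": 40, "B": 400, "C": 4000}
CHUNK_SIZES = {"A": 10, "B": 100, "C": 1000}


def min_memory(order, dim_sizes, chunk_sizes):
    """Memory units for the three 2-D planes when chunks are scanned with
    order[0] varying fastest and order[2] varying slowest."""
    fast, mid, slow = order
    full_plane = dim_sizes[fast] * dim_sizes[mid]      # fast-mid plane kept whole
    one_row = dim_sizes[fast] * chunk_sizes[slow]      # one row of the fast-slow plane
    one_chunk = chunk_sizes[mid] * chunk_sizes[slow]   # one chunk of the mid-slow plane
    return full_plane + one_row + one_chunk


best = min(permutations("ABC"), key=lambda o: min_memory(o, DIM_SIZES, CHUNK_SIZES))
```

Under these assumptions, scanning with the smallest dimension fastest (A, B, C) needs 156,000 units, while the reverse order (C, B, A) needs 1,641,000, which matches the advice to keep the smallest plane in memory.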
Indexing OLAP Data: Bitmap Index
Index on a particular column
Each value in the column has a bit vector; bit operations are fast
The length of the bit vector: # of records in the base table
The i-th bit is set if the i-th row of the base table has the value for the indexed column
Not suitable for high-cardinality domains
Base table
Cust  Region   Type
C1    Asia     Retail
C2    Europe   Dealer
C3    Asia     Dealer
C4    America  Retail
C5    Europe   Dealer
Index on Region
RecID  Asia  Europe  America
1      1     0       0
2      0     1       0
3      1     0       0
4      0     0       1
5      0     1       0
Index on Type
RecID  Retail  Dealer
1      1       0
2      0       1
3      0       1
4      1       0
5      0       1
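A bitmap index over the base table above can be sketched with Python integers used as bit vectors. This is a minimal sketch: the function name `build_bitmap_index` is hypothetical, and bit i (0-based) stands for row i of the base table.

```python
BASE = [
    {"Cust": "C1", "Region": "Asia", "Type": "Retail"},
    {"Cust": "C2", "Region": "Europe", "Type": "Dealer"},
    {"Cust": "C3", "Region": "Asia", "Type": "Dealer"},
    {"Cust": "C4", "Region": "America", "Type": "Retail"},
    {"Cust": "C5", "Region": "Europe", "Type": "Dealer"},
]


def build_bitmap_index(rows, column):
    """Return {value: bitmask}; bit i is set when row i holds that value."""
    index = {}
    for i, row in enumerate(rows):
        index[row[column]] = index.get(row[column], 0) | (1 << i)
    return index


region = build_bitmap_index(BASE, "Region")
ctype = build_bitmap_index(BASE, "Type")

# "Region = Europe AND Type = Dealer" becomes a single bitwise AND.
mask = region["Europe"] & ctype["Dealer"]
matches = [BASE[i]["Cust"] for i in range(len(BASE)) if (mask >> i) & 1]
```

The selection is answered without touching the base table until the matching row ids are known.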
Indexing OLAP Data: Join Indices
Join index: JI(R-id, S-id) where R(R-id, …) ⋈ S(S-id, …)
Traditional indices map the values to a list of record ids
A join index materializes the relational join in the JI file and speeds up relational join, a rather costly operation
In data warehouses, a join index relates the values of the dimensions of a star schema to rows in the fact table.
E.g., fact table: Sales and two dimensions, city and product
A join index on city maintains for each distinct city a list of R-IDs of the tuples recording the Sales in the city
Join indices can span multiple dimensions
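A join index between dimension values and fact rows can be sketched as a dictionary from values to row ids. The fact rows, the row ids, and the helper name `build_join_index` below are hypothetical and serve only to illustrate the idea.

```python
from collections import defaultdict

# Hypothetical Sales fact rows: (row id, city, product, dollars).
SALES = [
    ("T57", "Hanoi", "TV", 400.0),
    ("T238", "Hanoi", "PC", 900.0),
    ("T459", "Hue", "TV", 300.0),
    ("T884", "Hanoi", "TV", 500.0),
]


def build_join_index(fact_rows, key_positions):
    """Map each combination of dimension values to the ids of matching fact rows."""
    index = defaultdict(list)
    for row in fact_rows:
        key = tuple(row[p] for p in key_positions)
        index[key].append(row[0])
    return dict(index)


city_ji = build_join_index(SALES, (1,))             # join index on city
city_product_ji = build_join_index(SALES, (1, 2))   # spans two dimensions

# Answer "total sales in Hanoi" by following the join index.
by_id = {row[0]: row for row in SALES}
hanoi_total = sum(by_id[rid][3] for rid in city_ji[("Hanoi",)])
```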
Efficient Processing of OLAP Queries
Determine which operations should be performed on the available cuboids:
transform drill, roll, etc. into corresponding SQL and/or OLAP operations, e.g., dice = selection + projection
Determine to which materialized cuboid(s) the relevant operations should be applied.
Exploring indexing structures and compressed vs. dense array structures in MOLAP
Metadata Repository
Metadata is the data defining warehouse objects. It has the following kinds:
Description of the structure of the warehouse
schema, view, dimensions, hierarchies, derived data definitions, data mart locations and contents
Operational meta-data
data lineage (history of migrated data and transformation path),
currency of data (active, archived, or purged), monitoring information
(warehouse usage statistics, error reports, audit trails)
The algorithms used for summarization
The mapping from operational environment to the data warehouse
Data related to system performance
warehouse schema, view and derived data definitions
Business data
business terms and definitions, ownership of data, charging policies
Data Warehouse Back-End Tools and Utilities
Data extraction:
get data from multiple, heterogeneous, and external
sources
Data cleaning:
detect errors in the data and rectify them when possible
Data transformation:
convert data from legacy or host format to warehouse
format
Load:
sort, summarize, consolidate, compute views, check integrity, and build indices and partitions
Refresh
propagate the updates from the data sources to the
warehouse
Chapter 3: Data Warehouse Fundamentals
Data warehouse concepts
Multidimensional data model
Data warehouse architecture
Data warehouse implementation
From data warehousing to data mining
Further development of data cube technology
Discovery-Driven Exploration of Data Cubes
Hypothesis-driven
Exploration by the user, huge search space
Discovery-driven (Sarawagi et al. '98)
Effective navigation of large OLAP data cubes
Pre-compute measures indicating exceptions, guiding the user in the data analysis, at all levels of aggregation
Exception: significantly different from the value anticipated, based on a statistical model
Visual cues such as background color are used to reflect the degree of exception of each cell
Kinds of Exceptions and their Computation
Parameters
SelfExp: surprise of a cell relative to other cells at the same level of aggregation
InExp: surprise beneath the cell
PathExp: surprise beneath the cell for each drill-down path
Computation of exception indicators (model fitting and computing SelfExp, InExp, and PathExp values) can be overlapped with cube construction
Exceptions themselves can be stored, indexed, and retrieved like precomputed aggregates
Examples: Discovery-Driven Data Cubes
Complex Aggregation at Multiple Granularities: Multi-Feature Cubes
Multi-feature cubes (Ross et al. 1998): compute complex queries involving multiple dependent aggregates at multiple granularities
Ex. Grouping by all subsets of {item, region, month}, find the maximum price in 1997 for each group, and the total sales among all maximum-price tuples
select item, region, month, max(price), sum(R.sales)
from purchases
where year = 1997
cube by item, region, month: R
such that R.price = max(price)
Continuing the last example, among the max-price tuples, find the min and max shelf life, and find the fraction of the total sales due to tuples that have min shelf life within the set of all max-price tuples
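The dependent aggregate in this query (total sales among the maximum-price tuples of each group) can be sketched in Python. The purchase rows and the function name `multi_feature_cube` are hypothetical; the sketch recomputes each group-by from the filtered rows and is not meant as an efficient evaluation strategy.

```python
from itertools import combinations

# Hypothetical purchases rows.
PURCHASES = [
    {"item": "TV", "region": "East", "month": 1, "year": 1997, "price": 500, "sales": 10},
    {"item": "TV", "region": "East", "month": 1, "year": 1997, "price": 500, "sales": 5},
    {"item": "TV", "region": "West", "month": 2, "year": 1997, "price": 450, "sales": 7},
    {"item": "PC", "region": "East", "month": 1, "year": 1997, "price": 900, "sales": 3},
    {"item": "PC", "region": "West", "month": 1, "year": 1996, "price": 950, "sales": 4},
]


def multi_feature_cube(rows, dims, year):
    """For every group of every group-by, return (max price, sales over max-price tuples)."""
    rows = [r for r in rows if r["year"] == year]
    result = {}
    for k in range(len(dims) + 1):
        for gb in combinations(dims, k):
            groups = {}
            for r in rows:
                groups.setdefault(tuple(r[d] for d in gb), []).append(r)
            for key, members in groups.items():
                max_price = max(r["price"] for r in members)
                # The second aggregate depends on the first one.
                sales_at_max = sum(r["sales"] for r in members if r["price"] == max_price)
                result[(gb, key)] = (max_price, sales_at_max)
    return result


cube = multi_feature_cube(PURCHASES, ("item", "region", "month"), 1997)
```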
Cube-Gradient (Cubegrade)
Analysis of changes of sophisticated measures in multi-dimensional spaces
Query: changes of average house price in Vancouver in '00 compared against '99
Answer: Apts in West went down 20%, houses in Metrotown went up 10%
Cubegrade problem by Imielinski et al.
Changes in dimensions → changes in measures
Drill-down, roll-up, and mutation
From Cubegrade to Multi-dimensional Constrained Gradients in Data Cubes
Significantly more expressive than association rules
Capture trends in user-specified measures
Serious challenges
Many trivial cells in a cube → "significance constraint" to prune trivial cells
Enumerate pairs of cells → "probe constraint" to select a subset of cells to examine
Only interesting changes wanted → "gradient constraint" to capture significant changes
MD Constrained Gradient Mining
Significance constraint C_sig: (cnt ≥ 100)
Probe constraint C_prb: (city = Van, cust_grp = busi, prod_grp = *)
Gradient constraint C_grad(c_g, c_p): (avg_price(c_g) / avg_price(c_p) ≥ 1.3)
       Dimensions                      Measures
cid    Yr   City  Cst_grp  Prd_grp     Cnt     Avg_price
c1     00   Van   Busi     PC          300     2100
c2     *    Van   Busi     PC          2800    1800
c3     *    Tor   Busi     PC          7900    2350
c4     *    *     busi     PC          58600   2250
Base cell: c1. Aggregated cells: c2, c3, c4. Siblings: c2 and c3. Ancestor: c4.
Probe cell: a cell that satisfies C_prb
(c3, c2) satisfies C_grad: 2350/1800 ≈ 1.31 ≥ 1.3, whereas for (c4, c2) the ratio is 2250/1800 = 1.25
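The significance and gradient constraints can be checked against the four cells of the table with a short script. This is a sketch under explicit assumptions: the comparison operators, which are lost in the extracted text, are taken to be ≥, c2 is taken as the probe cell, and the function names are hypothetical.

```python
CELLS = {
    "c1": {"yr": "00", "city": "Van", "cnt": 300, "avg_price": 2100},
    "c2": {"yr": "*", "city": "Van", "cnt": 2800, "avg_price": 1800},
    "c3": {"yr": "*", "city": "Tor", "cnt": 7900, "avg_price": 2350},
    "c4": {"yr": "*", "city": "*", "cnt": 58600, "avg_price": 2250},
}


def significant(cell, min_cnt=100):
    """C_sig, assumed to be cnt >= 100."""
    return cell["cnt"] >= min_cnt


def gradient(cg, cp, min_ratio=1.3):
    """C_grad, assumed to be avg_price(cg) / avg_price(cp) >= 1.3."""
    return cg["avg_price"] / cp["avg_price"] >= min_ratio


probe = CELLS["c2"]  # assumed probe cell
ratios = {cid: round(c["avg_price"] / probe["avg_price"], 2)
          for cid, c in CELLS.items() if cid != "c2"}
passing = [cid for cid, c in CELLS.items()
           if cid != "c2" and significant(c) and gradient(c, probe)]
```

Under these assumptions only c3 forms a gradient pair with c2; the ratios for c1 and c4 fall below 1.3.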
A LiveSet-Driven Algorithm
Compute probe cells using C_sig and C_prb
The set of probe cells P is often very small
Use probe P and constraints to find gradients
Pushing selection deeply
Set-oriented processing for probe cells
Iceberg growing from low to high dimensionalities
Dynamic pruning of probe cells during growth
Incorporating efficient iceberg cubing method
From On-Line Analytical Processing to On-Line Analytical Mining (OLAM)
Why online analytical mining?
High quality of data in data warehouses
DW contains integrated, consistent, cleaned data
Available information processing structure surrounding data
warehouses
ODBC, OLEDB, Web accessing, service facilities, reporting and
OLAP tools
OLAP-based exploratory data analysis
mining with drilling, dicing, pivoting, etc.
On-line selection of data mining functions
integration and swapping of multiple mining functions,
algorithms, and tasks.
Architecture of OLAM
OLAP Mining: Integration of Data Mining and Data Warehousing
Data mining systems, DBMS, Data warehouse systems
coupling
No coupling, loose-coupling, semi-tight-coupling, tight-coupling
On-line analytical mining data
integration of mining and OLAP technologies
Interactive mining multi-level knowledge
Necessity of mining knowledge and patterns at different levels of
abstraction by drilling/rolling, pivoting, slicing/dicing, etc.
Integration of multiple mining functions
Characterized classification, first clustering and then association
An OLAM Architecture
[Figure: four-layer OLAM architecture. Layer 1, Data Repository: databases and the data warehouse, populated through data cleaning, data integration, and filtering. Layer 2, MDDB: the multidimensional database with its metadata, reached through the Database API (filtering and integration). Layer 3, OLAP/OLAM: the OLAM engine and the OLAP engine, reached through the Data Cube API. Layer 4, User Interface: the User GUI API, which accepts mining queries and returns mining results.]
Chapter 3: Data Warehouse Fundamentals
Data warehouse concepts
Multidimensional data model
Data warehouse architecture
Data warehouse implementation
From data warehousing to data mining
Further development of data cube technology
Iceberg Cube
Computing only the cuboid cells whose count or other aggregates satisfy the condition:
HAVING COUNT(*) >= minsup
Motivation
Only a small portion of cube cells may be "above the water" in a sparse cube
Only calculate "interesting" data: data above a certain threshold
Suppose 100 dimensions, only 1 base cell. How many aggregate (non-base) cells if count >= 1? What about count >= 2?
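The question can be explored by brute force on a smaller case. The sketch below is a naive enumeration (the name `iceberg_cells` is hypothetical): it generates every generalization of every row, so it is only usable for a handful of dimensions.

```python
from itertools import combinations
from collections import Counter


def iceberg_cells(rows, minsup):
    """Return {cell: count} for all cells (base and aggregate) with count >= minsup."""
    n = len(rows[0])
    counts = Counter()
    for row in rows:
        for k in range(n + 1):
            for kept in combinations(range(n), k):
                # Dimensions not kept are generalized to "*".
                cell = tuple(row[i] if i in kept else "*" for i in range(n))
                counts[cell] += 1
    return {cell: c for cell, c in counts.items() if c >= minsup}


one_base_cell = [("a", "b", "c", "d")]
aggregates_minsup_1 = len(iceberg_cells(one_base_cell, 1)) - 1  # exclude the base cell
aggregates_minsup_2 = len(iceberg_cells(one_base_cell, 2))
```

With 4 dimensions and one base cell there are 2^4 - 1 = 15 aggregate cells at count >= 1 and none at count >= 2. By the same argument, 100 dimensions give 2^100 - 1 aggregate cells at count >= 1 and none at count >= 2.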
Bottom-Up Computation (BUC)
BUC (Beyer & Ramakrishnan, SIGMOD'99)
Bottom-up vs. top-down? It depends on how you view it!
Apriori property:
Aggregate the data, then move to the next level
If minsup is not met, stop!
If minsup = 1, compute the full CUBE!
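The recursion with Apriori-style pruning can be sketched as follows. This is a simplified in-memory illustration of the idea with COUNT as the only measure, not the published BUC algorithm with its partitioning and sorting optimizations; the function signature is hypothetical.

```python
def buc(rows, n_dims, minsup, cell=None, start=0, out=None):
    """Output every cell with count >= minsup, expanding one dimension at a time."""
    if out is None:
        out, cell = {}, ["*"] * n_dims
    if len(rows) < minsup:
        return out  # prune: no descendant of this cell can reach minsup
    out[tuple(cell)] = len(rows)
    for d in range(start, n_dims):
        # Partition the current rows on dimension d.
        partitions = {}
        for row in rows:
            partitions.setdefault(row[d], []).append(row)
        for value, part in partitions.items():
            if len(part) >= minsup:
                cell[d] = value
                buc(part, n_dims, minsup, cell, d + 1, out)
        cell[d] = "*"
    return out


ROWS = [
    ("Jan", "Tor", "Edu"),
    ("Jan", "Tor", "Hhd"),
    ("Jan", "Tor", "Edu"),
    ("Feb", "Mon", "Bus"),
    ("Mar", "Van", "Edu"),
]
iceberg = buc(ROWS, 3, minsup=2)
```

With minsup = 2 the partitions for Feb, Mar, Mon, Van, Hhd, and Bus are dropped immediately, so none of their descendants is ever examined.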
Partitioning
Usually, the entire data set can't fit in main memory
Sort distinct values, partition into blocks that fit
Continue processing
Optimizations
Partitioning
External sorting, hashing, counting sort
Ordering dimensions to encourage pruning
Cardinality, skew, correlation
Collapsing duplicates
Can't do holistic aggregates anymore!
Drawbacks of BUC
Requires a significant amount of memory
On par with most other CUBE algorithms though
Does not obtain good performance with dense CUBEs
Overly skewed data or a bad choice of dimension
ordering reduces performance
Cannot compute iceberg cubes with complex measures
CREATE CUBE Sales_Iceberg AS
SELECT month, city, cust_grp,
AVG(price), COUNT(*)
FROM Sales_Infor
CUBE BY month, city, cust_grp
HAVING AVG(price) >= 800 AND
COUNT(*) >= 50
Non-Anti-Monotonic Measures
The cubing query with avg is non-anti-monotonic!
(Mar, *, *, 600, 1800) fails the HAVING clause
(Mar, *, Bus, 1300, 360) passes the clause
CREATE CUBE Sales_Iceberg AS
SELECT month, city, cust_grp,
AVG(price), COUNT(*)
FROM Sales_Infor
CUBE BY month, city, cust_grp
HAVING AVG(price) >= 800 AND
COUNT(*) >= 50
Month  City  Cust_grp  Prod     Cost  Price
Jan    Tor   Edu       Printer  500   485
Jan    Tor   Hhd       TV       800   1200
Jan    Tor   Edu       Camera   1160  1280
Feb    Mon   Bus       Laptop   1500  2500
Mar    Van   Edu       HD       540   520
…      …     …         …        …     …
Top-k Average
Let (*, Van, *) cover 1,000 records
Avg(price) is the average price of those 1,000 sales
Avg^50(price) is the average price of the top-50 sales (top-50 according to the sales price)
Top-k average is anti-monotonic
If the top 50 sales in Van. have avg(price) <= 800, then the top 50 deals in Van. during Feb. must also have avg(price) <= 800
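The contrast between the plain average and the top-k average can be checked on a toy example. The prices and helper names below are hypothetical; the check simply enumerates all sub-collections of one parent cell.

```python
from itertools import combinations


def avg(prices):
    return sum(prices) / len(prices)


def top_k_avg(prices, k):
    """Average of the k largest prices; None when fewer than k records exist."""
    if len(prices) < k:
        return None
    return sum(sorted(prices, reverse=True)[:k]) / k


def top_k_never_grows(parent, k):
    """True if no sub-collection of parent has a larger top-k average than parent."""
    bound = top_k_avg(parent, k)
    return all(
        top_k_avg(list(sub), k) <= bound
        for size in range(k, len(parent) + 1)
        for sub in combinations(parent, size)
    )


parent = [100, 200, 1200, 1400]   # hypothetical prices covered by a cell
child = [1200, 1400]              # prices covered by one of its descendants
```

Here `avg(parent)` is 725 while `avg(child)` is 1300, so a descendant can pass an `AVG(price) >= 800` test that its ancestor fails. The top-2 average of any sub-collection, in contrast, never exceeds the parent's 1300, which is what makes pruning on it safe.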
Binning for Top-k Average
Computing top-k avg is costly with large k
Binning idea
Avg^50(c) >= 800
Large value collapsing: use a sum and a count to summarize records with measure >= 800
If count >= 50, no need to check "small" records
Small value binning: a group of bins
One bin covers a range, e.g., 600~800, 400~600, etc.
Register a sum and a count for each bin
Approximate Top-k Average
Suppose for (*, Van, *), we have
Range     Sum    Count
Over 800  28000  20
600~800   10600  15
400~600   15200  30
…         …      …
The top 50 records take all 20 + 15 records of the two highest bins plus 15 records of the 400~600 bin, each bounded above by 600
Approximate avg^50() = (28000 + 10600 + 600*15)/50 = 952
The cell may pass the HAVING clause
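The arithmetic above can be reproduced with a short function. This is a sketch under assumptions: each bin is represented as (upper bound, sum, count), the open-ended top bin is given an infinite upper bound, and the name `approx_top_k_avg` is hypothetical.

```python
def approx_top_k_avg(bins, k):
    """Upper bound on the top-k average from bins ordered from the highest range down.

    Each bin is (upper_bound, bin_sum, count). A bin that is only partly needed
    contributes its upper bound for each record taken from it.
    """
    remaining = k
    total = 0.0
    for upper, bin_sum, count in bins:
        if count <= remaining:
            total += bin_sum            # the whole bin is inside the top k
            remaining -= count
        else:
            total += upper * remaining  # bound the partial bin from above
            remaining = 0
        if remaining == 0:
            break
    if remaining:
        return None                     # fewer than k records in total
    return total / k


BINS = [
    (float("inf"), 28000, 20),  # over 800
    (800, 10600, 15),           # 600~800
    (600, 15200, 30),           # 400~600
]
estimate = approx_top_k_avg(BINS, 50)
```

The estimate is 952, which is at least 800, so the cell cannot be pruned and may pass the HAVING clause.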
Quant-info for Top-k Average Binning
Accumulate quant-info for cells to compute average iceberg cubes efficiently
Three pieces: sum, count, top-k bins
Use top-k bins to estimate/prune descendants
Use sum and count to consolidate current cell
Conditions, from weakest to strongest:
Approximate avg^50(): anti-monotonic, can be computed efficiently
Real avg^50(): anti-monotonic, but computationally costly
avg(): not anti-monotonic
An Efficient Iceberg Cubing Method:
Top-k H-Cubing
One can revise Apriori or BUC to compute a top-k avg
iceberg cube. This leads to top-k-Apriori and top-k BUC.
Can we compute iceberg cube more efficiently?
Top-k H-cubing: an efficient method to compute iceberg
cubes with average measure
H-tree: a hyper-tree structure
H-cubing: computing iceberg cubes using H-tree
H-tree: A Prefix Hyper-tree
Month  City  Cust_grp  Prod     Cost  Price
Jan    Tor   Edu       Printer  500   485
Jan    Tor   Hhd       TV       800   1200
Jan    Tor   Edu       Camera   1160  1280
Feb    Mon   Bus       Laptop   1500  2500
Mar    Van   Edu       HD       540   520
…      …     …         …        …     …
H-tree (attribute order: cust_grp, month, city; each leaf holds quant-info, Q.I.)
root
  edu
    Jan
      Tor   Q.I. (Sum: 1765, Cnt: 2, bins)
    Mar
      Van   Q.I.
  hhd
    Jan
      Tor   Q.I.
  bus
    Feb
      Mon   Q.I.
Header table
Attr. Val.  Quant-Info  Side-link
Edu         Sum: 2285 …
Hhd
Bus
…
Jan
Feb
…
Tor
Van
Mon
…
Properties of H-tree
Construction cost: a single database scan
Completeness: it contains the complete information needed for computing the iceberg cube
Compactness: # of nodes ≤ n*m + 1
n: # of tuples in the table
m: # of attributes
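A single-scan construction of such a prefix tree can be sketched as follows. This is a simplified illustration, not the published H-tree: quant-info is reduced to a sum and a count (no bins), the header table is keyed by (attribute position, value), and the class and function names are hypothetical.

```python
class HNode:
    def __init__(self, value):
        self.value = value
        self.children = {}
        self.total = 0.0  # quant-info: sum of the measure
        self.count = 0    # quant-info: number of tuples


def build_h_tree(rows, n_dims):
    """One scan: insert each tuple's dimension values as a root-to-leaf path."""
    root = HNode("root")
    header = {}   # (attribute position, value) -> {"sum", "count", "nodes" (side-links)}
    n_nodes = 1
    for row in rows:
        node = root
        for pos, value in enumerate(row[:n_dims]):
            entry = header.setdefault((pos, value), {"sum": 0.0, "count": 0, "nodes": []})
            entry["sum"] += row[-1]
            entry["count"] += 1
            child = node.children.get(value)
            if child is None:
                child = HNode(value)
                node.children[value] = child
                entry["nodes"].append(child)
                n_nodes += 1
            node = child
        node.total += row[-1]   # quant-info is kept at the leaf
        node.count += 1
    return root, header, n_nodes


# (cust_grp, month, city, price), following the attribute order of the tree.
ROWS = [
    ("Edu", "Jan", "Tor", 485),
    ("Hhd", "Jan", "Tor", 1200),
    ("Edu", "Jan", "Tor", 1280),
    ("Bus", "Feb", "Mon", 2500),
    ("Mar" and "Edu", "Mar", "Van", 520),
]
root, header, n_nodes = build_h_tree(ROWS, 3)
leaf = root.children["Edu"].children["Jan"].children["Tor"]
```

For the five tuples of the example table, the tree has 12 nodes, within the bound n*m + 1 = 16; the leaf for (Edu, Jan, Tor) holds Sum 1765 and Cnt 2, and the header entry for Edu holds Sum 2285.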
Computing Cells Involving Dimension City
[Figure: the H-tree and its header table, plus a second header table H_Tor (attribute values Edu, Hhd, Bus, …, Jan, Feb, …, each with Q.I. and a side-link) built for the cells that contain Tor.]
From (*, *, Tor) to (*, Jan, Tor)
Computing Cells Involving Month But No City
[Figure: the H-tree and its header table, with quant-info (Q.I.) rolled up from the city level to the month level.]
1. Roll up quant-info
2. Compute cells involving month but no city
Top-k OK mark: if the Q.I. in a child passes the top-k avg threshold, so do its parents. No binning is needed!
Computing Cells Involving Only Cust_grp
[Figure: the H-tree and its header table, with quant-info (Q.I.) held at the cust_grp level.]
Check the header table directly
Properties of H-Cubing
Space cost
an H-tree
a stack of up to (m-1) header tables
One database scan
Main memory-based tree traversal & side-links
updates
Top-k_OK marking
Scalability w.r.t. Count Threshold (No min_avg Setting)
[Figure: chart of runtime in seconds (0 to 300) against count threshold (0.00% to 0.10%) for top-k H-Cubing and top-k BUC.]
Computing Iceberg Cubes with Other Complex Measures
Computing other complex measures
Key point: find a function which is weaker but ensures certain anti-monotonicity
Examples
Avg() ≤ v: avg_k(c) ≤ v (bottom-k avg)
Avg() ≥ v only (no count): max(price) ≥ v
Sum(profit) (profit can be negative): p_sum(c) ≥ v if p_count(c) ≥ k; otherwise, sum^k(c) ≥ v
Others: conjunctions of multiple conditions
Discussion: Other Issues
Computing iceberg cubes with more complex measures?
No general answer for holistic measures, e.g., median, mode, rank
A research theme even for complex algebraic functions, e.g., standard_dev, variance
Dynamic vs. static computation of iceberg cubes
v and k are only available at query time
Setting reasonably low parameters for most nontrivial cases
Memory hog? What if the cubing is too big to fit in memory? Projection and then cubing
Condensed Cube
W. Wang, H. Lu, J. Feng, J. X. Yu, Condensed Cube: An Effective Approach to Reducing Data Cube Size. ICDE'02.
Iceberg cube cannot solve all the problems
Suppose 100 dimensions, only 1 base cell with count = 10. How many aggregate (non-base) cells if count >= 10?
Condensed cube
Only need to store one cell (a_1, a_2, …, a_100, 10), which represents all the corresponding aggregate cells
Advantages
Fully precomputed cube without compression
Efficient computation of the minimal condensed cube