
VIETNAM MARITIME UNIVERSITY

FACULTY OF INFORMATION TECHNOLOGY

DEPARTMENT OF INFORMATION SYSTEMS

-----***-----

LECTURE NOTES
DATA MINING

COURSE TITLE: DATA MINING
COURSE CODE: 17409
LEVEL OF TRAINING: FULL-TIME UNDERGRADUATE
FOR STUDENTS MAJORING IN: INFORMATION TECHNOLOGY

HAI PHONG - 2011

TABLE OF CONTENTS

Chapter 1. Overview of the data warehouse
1.1. Strategies for processing and exploiting information
1.2. Definition of a data warehouse
1.3. Purpose of a data warehouse
1.4. Characteristics of data in a data warehouse
1.5. Distinguishing a data warehouse from operational databases
Chapter 2. Overview of data mining
2.1. What is data mining?
2.2. Classification of data mining systems
2.3. Main tasks
2.4. Integrating a data mining system with a database or data warehouse
2.5. Data mining methods
2.6. Advantages of data mining over basic methods
2.7. Choosing a method
2.8. Challenges in applying and researching data mining techniques
Chapter 3. Data preprocessing
3.1. Purpose
3.2. Data cleaning
3.3. Data integration and transformation
Chapter 4. Mining based on frequent patterns and association rules
4.1. The concept of association rules
4.2. The Apriori algorithm
4.3. The FP-Growth algorithm
4.4. Comparison and evaluation
Chapter 5. Classification and prediction
5.1. Basic concepts
5.2. Classification based on decision trees
Course title: Data Mining
Course type: 2
Department in charge of teaching: Information Systems
Faculty in charge: Information Technology
Course code: 17409
Total credits: 2

Total periods: 45
Theory: 30
Practice/Seminar: 15
Self-study: 0
Major assignment: none
Course project: none

Courses to be taken beforehand: Databases; Advanced Databases; Database Management Systems.
Prerequisite courses: none required.
Parallel courses: none required.
Course objectives:
To provide basic knowledge of large data warehouses and of data mining techniques.
Main content:
An overview of data warehouses and data mining; methods for organizing the storage of large data sets, and data mining techniques; data analysis using clustering methods; applications of data mining techniques.
Detailed content:

CHAPTER AND SECTION TITLES
Chapter 1. Overview of the data warehouse
1.1. Strategies for processing and exploiting information
1.2. Definition of a data warehouse
1.3. Purpose of a data warehouse
1.4. Characteristics of data in a data warehouse
1.5. Distinguishing a data warehouse from operational databases
Chapter 2. Overview of data mining
2.1. What is data mining?
2.2. Classification of data mining systems
2.3. Main tasks
2.4. Integrating a data mining system with a database or data warehouse
2.5. Data mining methods
2.6. Advantages of data mining over basic methods
2.7. Choosing a method
2.8. Challenges in applying and researching data mining techniques
Chapter 3. Data preprocessing
3.1. Purpose
3.2. Data cleaning
3.3. Data integration and transformation
Chapter 4. Mining based on frequent patterns and association rules
4.1. The concept of association rules
4.2. The Apriori algorithm
4.3. The FP-Growth algorithm
4.4. Comparison and evaluation
PERIOD ALLOCATION (TS total / LT theory / TH practice / BT exercises / KT tests): 6 / 4 / 2; 12

CHAPTER AND SECTION TITLES
Chapter 5. Classification and prediction
5.1. Basic concepts
5.2. Classification based on decision trees
PERIOD ALLOCATION (TS total / LT theory / TH practice / BT exercises / KT tests): 9 / 6 / 3

Student responsibilities:
Attend the theory and practice sessions, complete the assigned exercises, and sit the mid-course tests and the final examination as required by the regulations.
Course materials:
1. J. Han, M. Kamber, Data Mining: Concepts and Techniques, 2nd edition, Morgan Kaufmann, 2006.
2. P. N. Tan, M. Steinbach, V. Kumar, Introduction to Data Mining, Addison-Wesley, 2006.
3. Paulraj Ponniah, Data Warehousing Fundamentals, John Wiley.
Form and criteria of student assessment:
- Examination form: written essay or multiple choice.
- Assessment criteria: based on the student's participation in the theory and practice sessions, the results of the assigned exercises, and the results of the mid-course tests and the final examination.
Grading scale: letter grades A, B, C, D, F.
Course grade: Z = 0.3X + 0.7Y.
These lecture notes are the official, unified material of the Department of Information Systems, Faculty of Information Technology, and are used for teaching students.
Date of approval:
Head of Department

Chapter 1. Overview of the data warehouse

1.1. Strategies for processing and exploiting information

The development of information technology and its application in many areas of life, the economy and society over many years has meant that organizations have collected and stored ever-growing volumes of data. They keep this data because they believe it holds some hidden value. According to statistics, however, only a small fraction of it (about 5% to 10%) is ever analyzed. Organizations do not know what they should or could do with the rest, yet they keep collecting it at great expense for fear that something important might be overlooked and needed later. The question is how to organize and exploit such huge and diverse volumes of data.
On the user's side, the usual difficulties are:
- The necessary data cannot be found
  - Data is scattered across many systems with different interfaces and tools, so much time is lost moving from one system to another.
  - Several information sources may meet the need, but they differ from one another and it is hard to tell which one is correct.
- The necessary data cannot be retrieved
  - Expert help is often required, so work piles up.
  - Some kinds of information cannot be retrieved without extending the capabilities of the existing system.
- The data found cannot be understood
  - Data descriptions are poor and often far removed from familiar business terminology.
- The data found cannot be used
  - Results often fall short in terms of the nature of the data and the search time.
  - Data must be converted by hand into the user's working environment.
Problems on the information system's side:
- Developing different application programs is not simple.
  - The same function appears in many programs, but organizing and reusing it is very difficult because of technical limitations.
  - Converting data from different operational formats into a form that suits users is very difficult.
- Maintaining these programs runs into many problems
  - A change in one application affects other related applications.
  - Dependencies between programs are often unclear or cannot be determined.
  - Because of the complexity of the conversion work and of the whole maintenance process, the source code of the programs becomes extremely complex.
- The volume of stored data grows very fast
  - Data overlap across information environments cannot be controlled, so the volume of data grows rapidly.
- Data administration is complex
  - The lack of standard, unified data definitions leads to loss of control over the information environment.
  - The same data element exists in many different sources.
The solution to all of the problems above is to build a data warehouse (Data Warehouse) and to develop a new technical direction: knowledge discovery and data mining (KDD - Knowledge Discovery and Data Mining).
First, we recall a few basic concepts related to data, databases and data warehouses.
1.2. Definition of a data warehouse

We usually regard data as a sequence of bits, or of numbers and symbols, or as "objects" that carry some meaning when passed to a program in a given form. We use bits to measure information, and regard information as data from which redundancy has been filtered out and which has been reduced to the minimum needed to characterize the data. We can regard knowledge as integrated information, comprising facts and the relationships among them. These relationships can be understood, discovered or learned. In other words, knowledge can be seen as data at a high level of abstraction and organization.
According to John Ladley, data warehouse technology (DWT - Data Warehouse Technology) is the set of methods, techniques and tools that can be combined and support one another to provide information to users on the basis of integration from many data sources and many different environments.
A data warehouse (Data Warehouse) is a collection of integrated, subject-oriented databases designed to support decision making, in which every unit of data relates to a specific period of time.
A data warehouse is usually very large, typically gigabytes and sometimes terabytes.
A data warehouse is built to make it convenient to access many sources and many kinds of data, so that it can both use modern technologies and inherit from existing systems. Data arises from daily activities and is collected and processed to serve the specific business work of an organization. It is therefore usually called operational data, and the processing of this data is called online transaction processing (OLTP - On-Line Transaction Processing).
The flow of data in an organization (agency, enterprise, company, etc.) can be outlined as follows:

[Figure 1.1. Data flow in an organization. Components shown: legacy (existing) systems, operational data, data warehouse, local data warehouses, personal data warehouses, metadata.]

Personal data is outside the scope of the data warehouse management system. It contains information extracted from the operational data systems, the data warehouse and the local data warehouses on related subjects, by merging, summarizing or processing in some way.
1.3. Purpose of a data warehouse

The main goal of a data warehouse is to meet the following basic criteria:
- It must be able to meet every information request of its users.
- It helps the organization's staff do their work well and efficiently: making sound, fast decisions, selling more goods, achieving higher productivity and higher profit, and so on.
- It helps the organization identify, manage and run projects and business operations effectively and accurately.
- It integrates data and metadata from many different sources.
To meet these requirements, the DW must:
- Improve data quality by cleaning and refining the data along defined subject lines.
- Summarize and link data.
- Synchronize the data sources with the DW.
- Identify and unify the operational database management systems as standard tools serving the DW.
- Manage metadata.
- Provide information that is integrated, summarized or linked, and organized by subject.
The results of exploiting a data warehouse are used in decision support systems (Decision Support System - DSS) and operational information systems, or to support ad hoc queries.
The basic goal of every organization is profit, which can be described as follows:

[Figure 1.2. Relationships and viewpoints in the system. Profit is derived from revenue (sales, pricing, business proposals) and cost (fixed cost, variable cost, production cost).]

To carry out an effective business strategy, managers set the direction for trading goods. Pricing the goods and selling them generates revenue. However, obtaining goods to trade incurs costs. Revenue minus cost gives the profit of the business.
1.4. Characteristics of data in a data warehouse

A data warehouse is a collection of data with the following basic characteristics:
- Integration
- Subject orientation
- Stability (nonvolatility)
- Summarized data

1.4.1. Integration

Data in a data warehouse may be organized in many different ways, provided that it conforms to common naming conventions, units of measure, encoding schemes, physical data structures, and so on. A data warehouse is a view of information at the level of the whole business, unifying all the different views into a single view for a given subject. For example, a traditional online transaction processing (OLTP) system is built around one business area. A sales system and a marketing system may share one form of customer information. Financial matters, however, need a different view of the customer, one that includes different parts of the financial and marketing data.
Integration means that the data gathered in the data warehouse is collected from many sources and merged into one unified whole.
1.4.2. Subject orientation
Data in a data warehouse is organized by subject, so that the organization can easily find the information it needs in each of its activities. For example, an existing financial management system may organize its data by function: lending, credit management, budget management, and so on. In a financial data warehouse, by contrast, data is organized by subject around entities: customers, products, enterprises, and so on. The difference between the two approaches leads to differences in the content of the data stored in each system.
* A data warehouse does not store detailed data. It only needs to store summarized data, mainly to serve analysis and decision support.
* Databases in operational applications, on the other hand, must process detailed data that directly serves the functional processing needs of the current application domain. Operational application systems (Operational Application System - OAS) therefore need to store detailed data. The relationships among data in such systems are also different: they must be accurate, current, and so on.
* Data must be tied to time and is historical. A data warehouse holds a large volume of historical data, stored as a series of snapshots. Each record reflects the values of the data at a given point in time and represents the view of one subject over one period. This makes it possible to reconstruct history and to compare different periods fairly accurately. The time element acts as part of the key, guaranteeing the uniqueness of each item and giving the data its time characteristic. For example, a business management system needs to store the unit price of each item by date (the date is the time element). Specifically, an item with a given unit of measure may have a different unit price at each point in time (the recent fluctuation of petrol and oil prices is a typical illustration).
Data in an OAS must be accurate at the moment of access, whereas data in a DW only needs to be valid over some period, typically 5 to 10 years or longer. After a certain time, data in the operational database becomes historical data and is moved into the data warehouse. This is the valid data on the subjects that need to be stored.

Comparing an operational database with data snapshots, we see:

Operational database                     | Data snapshot
Short time horizon (30 - 60 days)        | Long time horizon (5 - 10 years)
May or may not contain a time element    | Always contains a time element
Data can be updated                      | Once captured, data cannot be updated

Table 1.1. Time characteristics of data
1.4.3. Stable data (nonvolatility)
Data in a DW is read-only: it can be inspected but cannot be changed by end users. Only two basic operations are allowed: loading data into the warehouse and accessing the data in the DW. The data is therefore nonvolatile.
Information is loaded into the DW once the data in the operational system is considered too old. Nonvolatility means that data is kept in the data warehouse for a long time. Although new data is added, the old data in the warehouse is neither deleted nor changed. This makes it possible to provide information covering a long period and enough figures for business analysis and forecasting models, and hence to reach sound decisions that fit the observed long-term trends.
1.4.4. Summarized data
Pure operational data is not stored in the DW. Summarized data is integrated over many stages according to the subjects described above.
1.5. Distinguishing a data warehouse from operational databases

Based on the characteristics of a DW, we distinguish it from traditional operational database management systems as follows:
- A data warehouse must be subject oriented. It is built according to the intentions of its end users, whereas operational database systems serve general-purpose applications.
- Ordinary database systems manage small and medium volumes of information rather than large ones. A DW must manage a large volume of information stored on many different storage and processing media. This is a distinctive feature of a DW.
- A DW can join different versions of database structures. A DW summarizes information and presents it in forms that users find easy to understand.
- A DW integrates and links information from many different sources on many kinds of storage and processing media in order to serve online processing applications.
- A DW can store information summarized by business subject, so as to produce information that effectively serves the users' analysis.
- A DW usually holds historical data linking many years of operational information, stored efficiently and easy to readjust. Data in an operational database is usually recent and current only for a short period.
- Data in the operational database is filtered and summarized before being moved to the DW environment. Much of the data is never moved. Only the data needed for management or decision support is transferred to the DW.
In general terms, a DW distributes data to many kinds of consumers (clients) and handles information in many forms, such as databases, data queries (SQL query), reports, and so on.

EXERCISES:
THEORY:
1. What is a data warehouse?
2. Give examples of systems or fields in which large data warehouses can be built.
3. Can a data table with 50,000 records be called a large data warehouse? Justify your answer.
4. Give an example of a stored data source with a tabular structure, a semi-structured format, or no structure.
5. Distinguish a data warehouse from an operational database.
PRACTICE:
1. Install the Microsoft Visual Studio 2005 suite.
2. Install and explore the Data analysis service.
3. Examine and explore the NorthWind database.

Chapter 2. Overview of data mining

2.1. Data mining

The term data mining describes the process of discovering knowledge in databases. This process extracts hidden knowledge from data to support forecasting in business, production activities, and so on. Data mining reduces the cost in time compared with earlier traditional approaches (for example, statistical methods).
Below are some descriptive definitions of data mining from various authors.
Ferruzza's definition: "Data mining is the set of methods used in the knowledge discovery process to identify previously unknown relationships and patterns within data."
Parsaye's definition: "Data mining is a decision support process in which we search for unknown and unexpected patterns of information in large databases."
Fayyad's definition: "Knowledge discovery is the nontrivial process of identifying valid, novel, potentially useful and understandable patterns in data."
2.2. Applications of data mining

Knowledge discovery and data mining draw on many disciplines and fields: statistics, artificial intelligence, databases, algorithms, parallel and high-performance computing, knowledge acquisition for expert systems, data visualization, and others. Knowledge discovery and data mining are especially close to statistics, using statistical methods to model data and to discover patterns and rules. Data warehousing and online analytical processing tools (OLAP - On-Line Analytical Processing) are also closely related to knowledge discovery and data mining.
Data mining has many practical applications, for example:

- Insurance, finance and the stock market: analyzing financial conditions and forecasting share prices on the stock market; portfolios of capital and prices, interest rates, credit card data, fraud detection, and so on.

- Statistics, data analysis and decision support. For example, see the following table:


Year  World population    Year  World population    Year  World population
      (millions)                (millions)                (millions)
1950  2555                1970  3708                1990  5275
1951  2593                1971  3785                1991  5359
1952  2635                1972  3862                1992  5443
1953  2680                1973  3938                1993  5524
1954  2728                1974  4014                1994  5604
1955  2779                1975  4087                1995  5685
1956  2832                1976  4159                1996  5764
1957  2888                1977  4231                1997  5844
1958  2945                1978  4303                1998  5923
1959  2997                1979  4378                1999  6001
1960  3039                1980  4454                2000  6078
1961  3080                1981  4530                2001  6153
1962  3136                1982  4610                2002  6228
1963  3206                1983  4690
1964  3277                1984  4769
1965  3346                1985  4850
1966  3416                1986  4932
1967  3486                1987  5017
1968  3558                1988  5102
1969  3632                1989  5188

Source: U.S. Bureau of the Census, International Data Base. Updated 10/10/2002.
Table 2.1. World population at mid-year

- Medical treatment and health care: diagnostic information stored in hospital management systems; analysis of the relationships among symptoms, diagnoses and treatments (diet, medication, and so on).

- Manufacturing and processing: processes, processing methods and fault handling.

- Text mining and Web mining: classifying documents and Web pages, summarizing documents, and so on.

- Science: astronomical observation, gene data, biological data; searching and comparing genomes and genetic information; relationships between genes and certain hereditary diseases, and so on.

- Telecommunication networks: analysis of telephone calls, and systems that monitor faults, incidents, quality of service, and so on.

2.3. Steps of the data mining process

The knowledge discovery process usually follows these steps:

[Figure 2.1. The knowledge discovery process]

Step one: Formulating, identifying and defining the problem. This means studying the application domain in order to formulate the problem and identify the tasks to be completed. This step determines whether useful knowledge can be extracted, and makes it possible to choose data mining methods suited to the purpose of the application and the nature of the data.
Step two: Collecting and preprocessing the data. Raw data is collected and processed. This preprocessing removes noise (data cleaning), handles missing data (data enrichment), and transforms and reduces the data where necessary. This step usually takes the most time in the whole knowledge discovery process, because data drawn from many different, heterogeneous sources can cause confusion. After this step the data is consistent, complete, reduced and discretized.
Step three: Mining the data and extracting knowledge. This means extracting the patterns and/or models hidden in the data. This stage is very important and involves deciding the function, task and purpose of the mining, and which mining method to use. Data mining problems are usually either descriptive problems, which state the most general properties of the data, or predictive problems, which include drawing inferences from the existing data. The mining method is chosen to suit the problem that has been identified.
Step four: Using the discovered knowledge. This means understanding the knowledge found, in particular interpreting the descriptions and predictions. The steps above may be repeated several times, and the results may be averaged over all runs. The results of the knowledge discovery process can be applied in various fields. Since the results may be predictions or descriptions, they can be fed into decision support systems to automate this process.
In summary: KDD is a process of extracting knowledge from a data warehouse, in which data mining is the most important stage.

2.4. Main tasks in data mining

Data mining is the process of discovering patterns of information. In this process a mining algorithm searches for interesting patterns of a specified form, such as rules, classifications, regressions, decision trees, and so on.

2.4.1. Classification

Classification determines a function that maps a data sample to one of a set of previously known classes. The goal of a classification algorithm is to find a relationship between the predictor attributes and the class attribute. The classification process can then use this relationship to predict the class of new items. The discovered knowledge is expressed as rules of the form: "If the predictor attributes of an item satisfy the conditions in the antecedent, then the item belongs to the class named in the conclusion."
Example: an item representing information about an employee has the predictor attributes full name, age, gender, education level, and so on, and the class attribute is the employee's leadership level.
2.4.2. Regression

Regression learns a function that maps a data sample to a real-valued prediction variable. The task of regression is similar to classification. The main difference is that the attribute to be predicted is continuous rather than discrete. Numeric values are usually predicted with classical statistical methods such as linear regression, although model-based methods such as decision trees are also used.
Regression has many applications, for example: predicting the amount of biomass present in a forest from microwave measurements taken by remote sensors; estimating the probability that a patient will die from an examination of the symptoms; forecasting user demand for a product, and so on.
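As an illustration of regression by linear least squares, the following minimal sketch fits a straight line to six points taken from Table 2.1 (one per decade). The function names `fit_line` and `predict` are illustrative only, and a straight line is just one possible model for this data.

```python
# Mid-year world population (millions) from Table 2.1, one point per decade.
years = [1950, 1960, 1970, 1980, 1990, 2000]
population = [2555, 3039, 3708, 4454, 5275, 6078]

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx                 # slope: growth in millions per year
    a = mean_y - b * mean_x       # intercept
    return a, b

intercept, slope = fit_line(years, population)

def predict(year):
    """Predicted population (millions) for a given year under the fitted line."""
    return intercept + slope * year

print(round(slope, 1), round(predict(1975)))
```

The fitted line always passes through the point of means, so `predict(1975)` equals the average of the six population values.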
2.4.3. Clustering

Clustering is a general descriptive task that finds the sets, groups or categories that describe the data. The groups may be disjoint, hierarchical or overlapping, meaning that a data item may belong to more than one group. Data mining applications that involve clustering include discovering sets of customers with similar behaviour in a marketing database, and identifying spectra from infrared measurements. Closely related to clustering is the task of density estimation: estimating the joint multivariate probability density function of the variables/fields in the database.
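To make the idea of grouping concrete, here is a minimal sketch using k-means on one-dimensional data. Note that k-means is not named in the text above. It is chosen here only as a simple example, and it produces disjoint groups (not hierarchical or overlapping ones). The spending values and starting centroids are hypothetical.

```python
def k_means_1d(values, centroids, iterations=10):
    """Lloyd's k-means on one-dimensional data; returns (centroids, clusters)."""
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for v in values:
            # Assign each value to the index of its nearest centroid.
            nearest = min(range(len(centroids)), key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        # Move each centroid to the mean of its cluster (keep it if the cluster is empty).
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters

spend = [12, 15, 14, 80, 85, 90]          # hypothetical customer spending values
centroids, clusters = k_means_1d(spend, centroids=[0.0, 100.0])
print(clusters)
```

With these inputs the low spenders and the high spenders end up in separate groups after the first pass, and later passes leave the assignment unchanged.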
2.4.4. Summarization

Summarization covers methods for finding a compact description of a subset of the data [1, 2, 5]. Summarization techniques are often applied in exploratory data analysis and automated report generation. The main task is to produce characteristic descriptions of a class. Such a description is a kind of summary of the properties shared by all or most of the items of a class. Characteristic descriptions are expressed as rules of the form: "If an item belongs to the class named in the antecedent, then the item has all the properties stated in the conclusion." Note that rules of this form differ from classification rules: a characteristic rule for a class is produced only from items that already belong to that class.
2.4.5. Dependency modeling

Dependency modeling searches for a model that describes the dependencies among variables or attributes at two levels. The structural level of the model (usually a graph) states which variables depend locally on which others. The quantitative level of the model describes the strength of the dependencies. These dependencies are often expressed as "if - then" rules (if the antecedent is true then the conclusion is true). In principle both the antecedent and the conclusion may be logical combinations of attribute values. In practice the antecedent is usually a group of attribute values and the conclusion is a single attribute. Furthermore, a system may discover classification rules, in which all rules must have the same user-specified attribute in the conclusion.
Dependency relationships can also be represented as a Bayesian belief network. This is a directed acyclic graph in which the nodes represent attributes and the edges carry the weights of the dependencies between those nodes.
2.4.6. Change and deviation detection

This task focuses on discovering the most significant changes in the data relative to previously measured or normative values, by detecting significant deviations between the content of an actual data subset and the expected content. The two commonly used deviation models are deviation over time and deviation between groups. Deviation over time is a significant change in the data over time. Deviation between groups is the difference between the data in two subsets, including the case where one subset is contained in the other, that is, determining whether the data in a subgroup of objects differs significantly from the whole set of objects. In this way, data errors and departures from normal values are detected.
Because these tasks require very different amounts and kinds of information, they usually influence the design and choice of data mining method. For example, the decision tree method (presented below) produces a description that discriminates between the samples of different classes, but it does not give the properties and characteristics of a class.

2.5. Data mining methods

Data mining is a field in which people constantly look for ways to achieve their goals in using information. The data mining process is a process of pattern discovery, in which a data mining method searches for interesting patterns of a specified form. Some methods that can be listed here are: using query tools, building decision trees, distance-based methods (K-nearest neighbors), mean values, association rule discovery, and so on. Over many years of research these methods have been adapted and integrated into hybrid systems for statistical data mining. With the very large data volumes in a data warehouse, however, these methods also face challenges of efficiency and scale.
2.5.1. Components of a data mining algorithm

A data mining algorithm has three main components: model representation, model evaluation and the search method.
Model representation: the model is represented in some language L used to describe the patterns that can be discovered. If the model description is expressive enough, machine learning will produce patterns that model the data accurately. If the model is too large, however, the predictive power of the learner is limited. The search then becomes more complex and the model harder to understand, or it may be impossible to produce patterns that give an accurate model of the data. For example, a decision tree representation that splits each node on a single data field divides the input space by hyperplanes parallel to the attribute axes. Such a decision tree method cannot discover a relationship of the form "X = Y" from the data, however large the training set. It is therefore important that the data analyst fully understands the representational assumptions. It is also important that the algorithm designer states which representational assumptions are made by which algorithm. The greater the representational power of the model, the greater the danger of overfitting and of reduced ability to predict unseen data. In addition, the search becomes more complex and the model harder to interpret.
The initial model is determined by combining the output (dependent) variable with the independent variables on which it depends. Next, the parameters that the problem needs to focus on must be found. The model search produces a model that fits the parameters determined from the data (in some cases the model and the parameters are changed in turn to fit the data). In some cases the data set is divided into a training set and a test set. The training set is used to fit the model's parameters to the data. The model is then evaluated by feeding it the test data, and the parameters are adjusted if necessary. The model chosen may be a statistical method such as SAS, a machine learning algorithm (for example decision trees and other supervised learning methods), neural networks, case-based reasoning, or classification techniques.

Model evaluation: this means assessing and estimating detailed, standard models during the knowledge discovery process: whether the estimate predicts accurately and whether it is logically sound. The estimate should be assessed by cross validation, with the evaluation criteria including predictive accuracy, novelty, usefulness and understandability of the fitted model. Both standard logical and statistical methods can be used for model evaluation.
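The splitting step of cross validation can be sketched as follows. This is a minimal illustration under simple assumptions: the function name `k_fold_indices` is hypothetical, the folds are formed by taking every k-th index, and fitting and scoring a model on each split is left out.

```python
def k_fold_indices(n_samples, k):
    """Split indices 0..n_samples-1 into k folds; yield (train, test) index lists."""
    folds = [list(range(i, n_samples, k)) for i in range(k)]
    for i in range(k):
        test = folds[i]
        # Training indices are all indices that are not in the held-out fold.
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, test

splits = list(k_fold_indices(6, 3))
for train, test in splits:
    print("train:", train, "test:", test)
```

Each sample is used for testing exactly once and for training in the other k - 1 rounds, and the evaluation score is then averaged over the k rounds.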
Search method: this has two components, parameter search and model search. In parameter search, the algorithm looks for the parameters that optimize the model evaluation criteria, given the observed data and a fixed model description. For some fairly simple problems no search is needed, because the optimal parameter estimates can be obtained in simpler ways. For more general models no such shortcuts exist, and a "greedy" algorithm is usually applied iteratively, for example gradient descent in the backpropagation algorithm for neural networks. Model search works as a loop around the parameter search: the model description is changed, producing a family of models. For each model description, the parameter search is applied to assess the quality of the model. Model search methods usually use heuristic search techniques, because the size of the space of possible models often rules out exhaustive search and simple closed-form solutions are not easy to obtain.
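Iterative parameter search can be sketched with gradient descent on a very small model. This is an illustrative example only: it fits a straight line y = w * x + b rather than a neural network, and the learning rate, step count and data are assumptions chosen so that the iteration converges.

```python
def gradient_descent(xs, ys, learning_rate=0.05, steps=2000):
    """Fit y = w * x + b by minimizing mean squared error with gradient descent."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Gradients of the mean squared error with respect to w and b.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        # Step against the gradient to reduce the error.
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b

xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]          # generated from y = 2x + 1
w, b = gradient_descent(xs, ys)
print(round(w, 3), round(b, 3))
```

A linear model like this also has a closed-form least squares solution, which is the kind of "simpler way" that makes search unnecessary for simple problems. The iterative version is shown only to illustrate the search loop.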
2.5.2. Deduction and induction

A database is a store of information, but more important information can also be inferred from it. There are two main techniques for doing this: deduction and induction.
Deduction: this extracts information that is a logical consequence of the information in the database. An example is the join operator applied to relational tables: the first table holds information about employees and departments, and the second holds information about departments and department heads. The relationship between employees and department heads can then be deduced. Deduction relies on exact facts to derive new knowledge from existing information. The patterns extracted with this method are usually deduction rules.
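The join example can be sketched as follows. The table contents and names are hypothetical, and the sketch assumes each department has exactly one head.

```python
# Hypothetical tables: (employee, department) and (department, department head).
employee_department = [("An", "Sales"), ("Binh", "IT"), ("Chi", "Sales")]
department_head = [("Sales", "Dung"), ("IT", "Giang")]

def join(left, right):
    """Join the two tables on the shared department column."""
    head_of = dict(right)   # department -> head
    return [(employee, head_of[dept]) for employee, dept in left if dept in head_of]

employee_head = join(employee_department, department_head)
print(employee_head)
```

The derived (employee, head) pairs are not stored in either table. They follow logically from the two tables together, which is what distinguishes deduction from induction.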
Induction: this infers information that is generated from the database. That is, it searches, builds patterns and produces knowledge by itself, rather than starting from previously known knowledge. The information obtained is high-level information or knowledge describing the objects in the database. The method involves searching for patterns in the database. In data mining, induction is used in decision trees and rule generation.
2.5.3. The K-nearest neighbors method

Representing the records of a data set as points in a multidimensional space is very useful for data analysis. With this representation a notion of neighbourhood can be defined, in which records that lie close together in the space are considered neighbours of one another. This idea is used in science and engineering under the name K-nearest neighbors, where K is the number of neighbours used. The method is very effective yet simple. The idea of the K-nearest neighbors learning algorithm is "do as your nearest neighbours do".
Example: to predict the behaviour of a given individual, its K closest neighbours are examined, and the average of their behaviour gives a prediction of that individual's behaviour.
K-nearest neighbors is a simple search method, but it has some drawbacks that limit its range of application. First, the computational complexity of the algorithm is quadratic in the number of records in the data set.
The main problem concerns the attributes of the records. A record with many independent attributes corresponds to a point in a search space of high dimension. In high-dimensional spaces, almost any two points are at nearly the same distance from each other. The K-nearest neighbors technique then gives no useful additional information, because all pairs of points are neighbours. Finally, the K-nearest neighbors method offers no theory for understanding the structure of the data. This limitation can be overcome with the decision tree technique.
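The prediction rule described above (average the behaviour of the K nearest records) can be sketched as follows. The records and the choice of Euclidean distance are assumptions made for illustration.

```python
import math

def knn_predict(records, query, k):
    """Average the target values of the k records closest to `query`."""
    by_distance = sorted(records, key=lambda rec: math.dist(rec[0], query))
    nearest = by_distance[:k]
    return sum(target for _, target in nearest) / k

# Hypothetical records: (point in attribute space, observed value).
records = [
    ((1.0, 1.0), 10.0),
    ((1.0, 2.0), 12.0),
    ((2.0, 1.0), 14.0),
    ((8.0, 9.0), 50.0),
]
prediction = knn_predict(records, query=(1.5, 1.5), k=3)
print(prediction)
```

Each prediction scans every record, so predicting for all n records of a data set costs on the order of n * n distance computations. This is the quadratic complexity mentioned above.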
2.5.4. Decision trees and rules

With classification based on decision trees, the result of model building is a decision tree. This tree is used to classify data objects whose class is unknown, or to assess the accuracy of the model. The two stages of the classification process correspond to building the decision tree and using it.
Building a decision tree starts from a single node representing all the data samples. The samples are then partitioned recursively according to the chosen attributes. If the samples all belong to the same class, the node becomes a leaf. Otherwise an attribute measure is used to choose the next attribute as the basis for splitting the samples into classes. For each value of the chosen attribute a branch is created, and the samples are distributed among the branches. This process is repeated until the decision tree is complete, with every node expanded into leaves and labelled.
The recursion stops when one of the following conditions is met:
- All samples at the node belong to the same class.
- No attribute is left to choose from.
- The branch contains no samples.
Most decision tree induction algorithms share the limitation of using a lot of memory. The memory used is proportional to the size of the training data sample. A decision tree program that supports external memory pays for it in execution speed. Pruning the decision tree therefore becomes important. Unstable leaf nodes in the decision tree are pruned.
Pre-pruning means stopping the growth of the decision tree when splitting the data is no longer meaningful.
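The recursive construction can be sketched as follows. This is a minimal illustration, not a full algorithm: it assumes entropy as the attribute measure (the text does not name one), categorical attributes, and hypothetical sample data. Branches are created only for attribute values that actually occur, so the "empty branch" stopping condition never arises here.

```python
from collections import Counter
import math

def entropy(rows):
    """Entropy of the class labels in rows, where each row is (features, label)."""
    counts = Counter(label for _, label in rows)
    return -sum(c / len(rows) * math.log2(c / len(rows)) for c in counts.values())

def partition(rows, attr):
    """Group rows by their value of attribute attr."""
    groups = {}
    for features, label in rows:
        groups.setdefault(features[attr], []).append((features, label))
    return groups

def build_tree(rows, attributes):
    labels = [label for _, label in rows]
    # Stop: all samples share one class, or no attribute is left to split on.
    if len(set(labels)) == 1 or not attributes:
        return Counter(labels).most_common(1)[0][0]
    # Choose the attribute whose split leaves the lowest weighted entropy.
    def split_entropy(attr):
        return sum(len(g) / len(rows) * entropy(g) for g in partition(rows, attr).values())
    best = min(attributes, key=split_entropy)
    remaining = [a for a in attributes if a != best]
    return (best, {value: build_tree(subset, remaining)
                   for value, subset in partition(rows, best).items()})

def classify(tree, features):
    """Follow branches until a leaf (a class label) is reached."""
    while isinstance(tree, tuple):
        attr, children = tree
        tree = children[features[attr]]
    return tree

rows = [  # hypothetical training samples
    ({"outlook": "sunny", "windy": "no"}, "play"),
    ({"outlook": "sunny", "windy": "yes"}, "play"),
    ({"outlook": "rainy", "windy": "no"}, "play"),
    ({"outlook": "rainy", "windy": "yes"}, "stay"),
]
tree = build_tree(rows, ["outlook", "windy"])
print(classify(tree, {"outlook": "rainy", "windy": "yes"}))
```

Internal nodes are stored as `(attribute, branches)` tuples and leaves as plain class labels, which keeps the classification loop short.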
2.5.5. Association rule discovery

This method aims to discover association rules among the data elements in a database. The output pattern of the mining algorithm is the set of association rules found. A simple example of an association rule is the following: an association between two elements A and B means that the appearance of A in a record implies the appearance of B in the same record: A => B.
Let R = {A1, ..., Ap} be a schema of attributes with domain {0,1}, and let r be a relation over R. An association rule over r is written X => B with X ⊆ R and B ∈ R\X. Intuitively, the rule means: if a record of table r has value 1 in every attribute of X, then the value of attribute B in the same record is also 1. For example, take a database of items sold in a supermarket, where the rows correspond to sales days and the columns to items. A value of 1 in cell (20/10, bread), indicating that bread was sold that day, then also implies a value of 1 in cell (20/10, butter).
For W ⊆ R, let s(W, r) be the frequency of W in r, computed as the fraction of rows in r that have value 1 in every column of W. The frequency of the rule X => B in r is defined as s(X ∪ {B}, r), also called the support of the rule, and the confidence of the rule is s(X ∪ {B}, r)/s(X, r). Here X may consist of several attributes and B is not fixed in advance. This avoids generating unwanted rules before the search begins. It also shows that the size of the search space grows exponentially with the number of input attributes, so care is needed when designing the data for association rule search.
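The definitions of support and confidence can be checked on a small 0/1 relation. The rows below are hypothetical, and `support` and `confidence` are illustrative helper names. The sketch assumes the antecedent X occurs in at least one row, since otherwise the confidence is undefined.

```python
def support(rows, attributes):
    """s(W, r): fraction of rows with value 1 in every attribute of W."""
    hits = sum(1 for row in rows if all(row[a] == 1 for a in attributes))
    return hits / len(rows)

def confidence(rows, antecedent, consequent):
    """s(X ∪ {B}, r) / s(X, r) for the rule X => B."""
    return support(rows, set(antecedent) | {consequent}) / support(rows, antecedent)

# Hypothetical relation r: one row per sales day, one 0/1 column per item.
r = [
    {"bread": 1, "butter": 1, "milk": 0},
    {"bread": 1, "butter": 1, "milk": 1},
    {"bread": 1, "butter": 0, "milk": 1},
    {"bread": 0, "butter": 0, "milk": 1},
]
rule_support = support(r, {"bread", "butter"})        # support of bread => butter
rule_confidence = confidence(r, {"bread"}, "butter")  # confidence of bread => butter
print(rule_support, round(rule_confidence, 3))
```

Here bread and butter appear together on 2 of 4 days, and bread appears on 3 of 4 days, so the rule bread => butter has support 2/4 and confidence 2/3.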
Nhim v ca vic pht hin cc lut kt hp l phi tm tt c cc lut X=>B sao cho tn s
ca lut khng nh hn ngng cho trc v tin cy ca lut khng nh hn ngng cho
trc. T mt c s d liu ta c th tm c hng nghn v thm ch hng trm nghn cc lut kt
hp.
Ta gi mt tp con X R l thng xuyn trong r nu tha mn iu kin s(X, r). Nu
bit tt c cc tp thng xuyn trong r th vic tm kim cc lut rt d dng. V vy, gii thut tm
kim cc lut kt hp trc tin i tm tt c cc tp thng xuyn ny, sau to dng dn cc lut
kt hp bng cch ghp dn cc tp thuc tnh da trn mc thng xuyn.
Cc lut kt hp c th l mt cch hnh thc ha n gin. Chng rt thch hp cho vic to
ra cc kt qu c d liu dng nh phn. Gii hn c bn ca phng php ny l ch cc quan h
cn phi tha theo ngha khng c tp thng xuyn no cha nhiu hn 15 thuc tnh. Gii thut
tm kim cc lut kt hp to ra s lut t nht phi bng vi s cc tp ph bin v nu nh mt tp

22
ph bin c kch thc K th phi c t nht l 2 K tp ph bin. Thng tin v cc tp ph bin c
s dng c lng tin cy ca cc tp lut kt hp.
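As a rough illustration of the definitions of s(W, r), support, and confidence, the sketch below evaluates one rule over a small 0/1 table. The table contents and the helper names `frequency` and `rule_stats` are assumptions made for this example only.

```python
def frequency(rows, columns):
    """s(W, r): fraction of rows with value 1 in every column of W."""
    hits = sum(1 for row in rows if all(row[c] == 1 for c in columns))
    return hits / len(rows)


def rule_stats(rows, x, b):
    """Support and confidence of the rule X => B."""
    support = frequency(rows, x + [b])          # s(X ∪ {B}, r)
    confidence = support / frequency(rows, x)   # s(X ∪ {B}, r) / s(X, r)
    return support, confidence


# Hypothetical sales table: one row per day, 1 means the item was sold.
r = [
    {"bread": 1, "butter": 1, "milk": 0},
    {"bread": 1, "butter": 1, "milk": 1},
    {"bread": 1, "butter": 0, "milk": 1},
    {"bread": 0, "butter": 0, "milk": 1},
]
support, confidence = rule_stats(r, ["bread"], "butter")
```

Here bread and butter appear together on two of four days and bread appears on three, so the rule bread => butter has support 2/4 and confidence 2/3.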
2.6. Advantages of data mining over basic methods

As analyzed above, data mining methods are not new in themselves and rest entirely on known basic methods. How, then, does data mining differ from those methods, and why does it have a clear advantage over them? The following analysis answers these questions.
2.6.1. Machine learning

Although efforts have been made to adapt machine learning methods to data mining, differences in design and in the characteristics of databases make plain machine learning ill-suited to this purpose, even though most data mining methods to date are still built on machine learning foundations. The following analysis shows why.
In database management, a database is a logically integrated collection of data stored in one or more files and organized so that storing, modifying, and retrieving relevant information is efficient. In a relational database, for example, data is organized into files or tables of fixed-length records. Each record is an ordered list of values, and each value is placed in a field. Information about field names and field values is kept in a separate file called the data dictionary. A database management system manages the procedures for retrieving, storing, and processing data in those databases.
In machine learning, the term database mainly refers to a set of samples (instances or examples) stored in a file. Samples are usually fixed-length feature vectors. Information about the feature names and their value ranges is sometimes also stored, as in a data dictionary. A learning algorithm takes the dataset and its accompanying information as input and outputs the result of learning (for example, a concept).
Comparing ordinary databases with databases in machine learning in this way shows that machine learning can be applied to databases: learning is performed on the set of records of the database rather than on a set of samples.
However, knowledge discovery in databases aggravates problems that are already typical in machine learning and goes beyond what machine learning can handle. In practice, databases are usually dynamic, incomplete, noisy, and much larger than typical machine learning datasets. These factors make most machine learning algorithms ineffective in most cases. Data mining therefore has to invest considerable effort in overcoming these difficulties and complexities of databases.
2.6.2. Expert systems

Expert systems try to capture the knowledge relevant to a particular problem. Knowledge acquisition techniques help elicit that knowledge from human experts. Each such method is a way of inferring rules from the examples and solutions that the expert provides for the problem. This approach differs from data mining in that the expert's examples are usually of much higher quality than the data in a database, and usually cover only the important cases. Moreover, the experts confirm the validity and usefulness of the discovered patterns. As with database management tools, these methods require human participation in knowledge discovery.
2.6.3. Scientific discovery

Data mining differs greatly from scientific discovery in that mining a database is less purposeful and less controlled. Scientific data comes from experiments designed to eliminate the effect of some parameters in order to emphasize the variation of one or a few target parameters. Typical commercial databases, by contrast, record a surplus of information about their projects to serve organizational purposes. This redundancy (which may also be called confusion) can be visible or hidden in the relationships within the data. Furthermore, scientists can rerun experiments and may find that the initial design was inadequate, whereas database managers can hardly afford the luxury of redesigning data fields and collecting the data again.
2.6.4. Statistical methods

An obvious question is how data mining differs from statistics. For many years people have used statistical methods very effectively to achieve their goals.
Although statistical methods provide a solid theoretical foundation for data analysis problems, a purely statistical approach is not enough. First, standard statistical methods are not suited to the structured data types found in many databases. Second, statistics is entirely data driven; it does not use available domain knowledge. Third, the results of statistical analysis can be voluminous and hard to interpret. Finally, statistical methods need guidance from the user to decide how and where to analyze the data.
The basic difference between data mining and statistics is that data mining is a tool used by end users rather than by statisticians. Data mining automates the statistical process effectively, which lightens the end user's work and yields a tool that is easier to use. Thanks to data mining, prediction and testing that used to be laborious can be moved onto the computer and be computed, predicted, and tested automatically.
2.7. Choosing a method

Automated data mining algorithms are still at an early stage of development. No standard has yet been established for deciding which method to use and in which cases it is effective.
Most data mining techniques are new to the business domain. There are also many techniques, and each one is used for many different problems. So right after the question "what is data mining?" comes the question "which technique should be used?". The answer is of course not simple. Every method has its strengths and weaknesses, but most of the weaknesses can be overcome. What, then, must be done to apply a technique simply and easily, without being exposed to its inherent complexity?
Comparing techniques requires a large set of rules and good experimental methods. These rules are often not applied when the newest techniques are evaluated, so requirements to improve accuracy cannot always be met.
Many companies have released products that combine several data mining techniques in the hope that more techniques will do better. In practice, using many techniques only adds complications and makes it hard to compare methods and products. According to many assessments, once the techniques are understood and their similarities studied, many techniques that at first seem different turn out to be essentially the same. This assessment is only indicative, however, because data mining is still a new field with a lot of potential that has not yet been fully exploited.
2.8. Challenges in the application and research of data mining techniques

This section lists some difficulties in researching and applying data mining techniques. This does not mean that these problems cannot be solved; it only points out that mining data is not simple, and that these issues must be considered and addressed. Some of the difficulties are listed below.
2.8.1. Database issues

The main input of a knowledge discovery system is the raw data in the database, and this is where the problems of data mining arise: real-world data is usually dynamic, incomplete, large, and noisy. In other cases it is not known whether the database contains the information needed for mining, or how to deal with the surplus of irrelevant information.
Large data: Databases with hundreds of fields and tables, millions of records, and sizes of several gigabytes are now commonplace, and databases of terabyte size are beginning to appear. Current approaches include setting a threshold for the database, sampling, approximation methods, and parallel processing (Agrawal et al., Holsheimer et al.).
High dimensionality: Not only the number of records but also the number of fields in a database can be large, which makes the problem larger. A high-dimensional dataset enlarges the search space for model induction. It also increases the chance that a data mining algorithm finds spurious patterns. The remedy is to reduce the effective dimensionality of the problem and to use prior knowledge to identify irrelevant variables.
Dynamic data: A basic characteristic of most databases is that their content changes continually. Data can change over time, and mining is affected by the moment at which the data is observed. In a database of patient status, for example, some values are constant, some change continuously over time (such as weight and height), and others change with the situation so that only the most recently observed value is sufficient (such as pulse rate). Rapidly changing data can therefore invalidate previously mined patterns. In addition, the variables in an application's database can be changed, deleted, or added over time. This problem is addressed with incremental solutions that update the patterns and treat the changes as an opportunity for mining, using them to search for the patterns that changed.
Irrelevant fields: Another important characteristic is the irrelevance of data, meaning that data items become irrelevant to the current focus of mining. A related aspect is the applicability of an attribute to a subset of the database. For example, the field "Nostro account number" does not apply to agents.
Missing values: The presence or absence of values of relevant attributes can affect data mining. In an interactive system, the absence of important data may lead to a request for its value or a test to determine it. Alternatively, the absence of data can be treated as a condition: the missing attribute can be treated as an intermediate value, the unknown value.
Missing fields: An incomplete view of the database can make valid data look erroneous. The view of the database must expose all attributes that the data mining algorithm can use to solve the problem. Suppose we have attributes meant to distinguish the situations of interest. If they fail to do so, the data is considered to contain errors. For a system that learns to diagnose malaria from a patient database, records of patients with the same symptoms but different diagnoses are attributed to errors in the data. This problem is also common in business databases. Important attributes may be missing if the data was not prepared with data mining in mind.
Noise and uncertainty: For relevant attributes, the severity of an error depends on the data type of the allowed values. The values of different attributes can be real numbers, integers, or strings, or can belong to a set of nominal values. Nominal values may be partially or totally ordered, and may even have a semantic structure.
Another factor of uncertainty is the precision that the data is required to have, in other words the noise in the measurements. Based on prior measurements and analysis, a statistical model describing the randomness is built and used to define the expected values and the tolerance of the data. Statistical models are often applied in an ad hoc way to determine subjectively which attributes yield the statistics and to assess the acceptability of attribute values (or combinations of values). With numeric data in particular, the accuracy of the data can be a factor in mining. When measuring body temperature, for example, a deviation of 0.1 degree is usually allowed, but analyzing trends that are sensitive to body temperature requires higher precision. For a discovery system to relate to such a trend in a diagnosis, it needs a measure of the noise in the input data.
Complex relationships between fields: Attributes or values with a hierarchical structure, relationships between attributes, and more sophisticated means of expressing knowledge about the content of the database require algorithms that can use this information effectively. Initially, data mining techniques were developed only for records with simple attribute values. Today, however, techniques are being developed to extract relationships between these variables.
2.8.2. Other issues

Overfitting: When an algorithm searches for the best parameters of a model using a finite dataset, it may overfit the data (that is, it searches further than necessary, so the model fits only the data it was built on and cannot cope with unseen data), and the model then performs very poorly on test data. Remedies include cross-validation, regularization, and other statistical measures.
Assessing statistical significance: A problem related to overfitting occurs when a system searches over many models. For example, if a system tests N models at the 0.001 significance level, then on purely random data an average of N/1000 models will be accepted as significant. To handle this, the test statistic can be adjusted as a function of the search, for example with the Bonferroni adjustment for independent tests.
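The arithmetic behind the N/1000 remark and the Bonferroni adjustment can be sketched as follows. This is only an illustration: the function names and the number of models are hypothetical, and the adjustment shown is the textbook form (divide the significance level by the number of tests).

```python
def expected_false_positives(n_models, alpha):
    """Average number of models accepted by chance on purely random data."""
    return n_models * alpha


def bonferroni_alpha(n_models, alpha):
    """Per-test level that keeps the family-wise error rate at most alpha."""
    return alpha / n_models


n = 5000                                      # hypothetical number of models searched
by_chance = expected_false_positives(n, 0.001)
per_test = bonferroni_alpha(n, 0.001)
```

With 5000 models tested at the 0.001 level, about five are expected to pass by chance alone; the adjusted per-test level is correspondingly much stricter.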

Understandability of patterns: In many applications it is important that what is discovered be as understandable to humans as possible. Solutions therefore include graphical representations, rule structures with directed graphs (Gaines), natural language representations (Matheus et al.), and other techniques for representing knowledge and data.
Interaction with the user and prior knowledge: Many data mining tools and methods are not truly interactive and do not easily incorporate prior knowledge. Using domain knowledge is very important in data mining. Several measures address this problem, such as using deductive databases to discover knowledge that is then used to guide the mining search, or using prior probability distributions over the data as a way of encoding existing knowledge.

Exercises:
1. What is data mining?
2. What are the main tasks of the data mining process?
3. Describe the basic differences between data mining techniques and methods such as machine learning and statistics.
4. What are the steps of the data mining process?
5. Give examples of data mining applications in practice.

Chapter 3: Data preprocessing

3.1. Purpose
Data mining techniques all operate on large databases and data sources. These are the result of continuously recording information that reflects human activities, natural processes, and so on. Naturally, the stored data is raw and not ready for discovering the information hidden in it. It therefore has to be cleaned and transformed into suitable forms before any analysis is carried out.
To extract useful information or to apply mining methods such as classification and prediction, the raw source data has to go through several transformation steps. These steps can be performed in many ways depending on the need, for example: reducing the size, selecting the data that really matters, limiting the scope of real-time data, or changing and adjusting the data to best fit the requirements. One should not expect too much from using computers to find useful knowledge without human help, nor expect that a data source transformed for one problem will suit another mining problem.
For example, an electronics company asks for an analysis of sales data at its branches. The analyst has to examine the company's sales database and warehouses carefully in order to identify and select the attributes or dimensions to include in the analysis, such as item category, item, price, and selling branch. It is unavoidable, however, that day-to-day transactions contain errors made by sales staff while recording them. The errors are diverse, ranging from information that was not recorded to information recorded contrary to the usual rules and standards. The analysis is therefore hard to carry out if the source data is left in its original state: incomplete (missing attribute values, or certain attributes containing only aggregate data), noisy (containing errors, or values that deviate from what is expected), and inconsistent (for example, discrepancies in the branch codes used for classification).
The problems in this example are entirely real. The data may simply not have been considered important at collection time, or related data may not have been recorded because of a misunderstanding or an equipment malfunction. There are also cases where recorded data was deleted after some review, or where the history of changes to transactions was not kept and only the aggregate information at the time of interest was retained. This creates the need for data cleaning: filling in missing values, smoothing noisy data, and removing meaningless values and inconsistent data.
Data preparation for data mining typically consists of:
- Data cleaning;
- Data integration;
- Data transformation;
- Data reduction.

3.2. Data cleaning

3.2.1. Missing values
Consider a data warehouse for sales and customer management. It may contain one or more values that are hard to collect, such as customer income. How can such information be obtained? Consider the following methods.
- Ignore the tuple: This is usually done when the class label is missing. The method is not always effective unless the tuple contains several attributes that are not really important.
- Fill in the missing value manually: This is time-consuming and may not be feasible for a large source dataset with many missing values.
- Use a conventional constant to fill in the missing value: Replace all missing attribute values with the same agreed constant, such as a label like "Unknown". This can, however, mislead the data mining program in some cases and lead to invalid conclusions.
- Use the attribute mean to fill in the missing value: For example, if the average income per person in an area is 800,000, this value can be used to replace the missing income of customers in that area.
- Use the values of tuples in the same class to replace the missing value: For example, if customer A belongs to the same credit risk group as another customer B whose average income is known, that value can be used to fill in the average income of customer A.
- Use the most probable value to fill in the missing value: This can be determined by regression, inference tools based on Bayesian theory, or decision trees.
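Two of the strategies above, filling with the attribute mean and filling with the mean of the same class, can be sketched in a few lines. This is an illustrative sketch under simple assumptions: missing values are represented by `None`, the records and field names are hypothetical, and every group is assumed to have at least one known value.

```python
from statistics import mean


def fill_with_mean(values):
    """Replace None by the mean of the known values of the attribute."""
    known = [v for v in values if v is not None]
    m = mean(known)
    return [m if v is None else v for v in values]


def fill_with_group_mean(records, group_key, field):
    """Replace a missing field by the mean of that field within the same group."""
    by_group = {}
    for rec in records:
        if rec[field] is not None:
            by_group.setdefault(rec[group_key], []).append(rec[field])
    # Assumes each group has at least one known value (KeyError otherwise).
    return [
        {**rec, field: mean(by_group[rec[group_key]])} if rec[field] is None else rec
        for rec in records
    ]


incomes = fill_with_mean([700_000, None, 900_000])

customers = [
    {"name": "A", "risk": "low", "income": None},
    {"name": "B", "risk": "low", "income": 1_000_000},
    {"name": "C", "risk": "high", "income": 400_000},
]
filled = fill_with_group_mean(customers, "risk", "income")
```

The group-based variant usually gives a less distorted value than the global mean, at the cost of requiring a meaningful grouping attribute.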
3.2.2. Noisy data
Noise is a random error or variance in a measured variable, or the result of uncontrolled recording mistakes. Given an attribute such as price, how can it be smoothed to remove the noise? Consider the following smoothing techniques.

Sorted item prices: 4, 8, 15, 21, 21, 24, 25, 28, 34
Partition into bins
  Bin 1: 4, 8, 15
  Bin 2: 21, 21, 24
  Bin 3: 25, 28, 34
Smoothing by bin means
  Bin 1: 9, 9, 9
  Bin 2: 22, 22, 22
  Bin 3: 29, 29, 29
Smoothing by bin boundaries
  Bin 1: 4, 4, 15
  Bin 2: 21, 21, 24
  Bin 3: 25, 25, 34
Table 3.1. Example of smoothing by binning

a. Binning: A data value is smoothed by consulting the values around it. For example, the prices are first sorted and then partitioned into bins of equal size 3 (each bin contains 3 values).
- In smoothing by bin means, each value in a bin is replaced by the mean of the values in that bin.
- In smoothing by bin boundaries, the smallest and largest values of the bin are taken as the bin boundaries. Each remaining value in the bin is replaced by whichever of the two boundaries is closer to it.
For example, bin 1 holds the values 4, 8, 15, whose mean is 9. Smoothing by bin means therefore replaces every value in the bin with 9. With smoothing by bin boundaries, the value 8 is closer to 4 than to 15, so it is replaced by 4.
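The binning example can be reproduced with a short sketch. The function names are hypothetical, bins are formed by equal depth (three values each) as in the example, and ties in boundary smoothing are sent to the lower boundary, which is an assumption the notes do not state.

```python
def equal_depth_bins(values, size):
    """Sort the values and cut them into consecutive bins of `size` items."""
    ordered = sorted(values)
    return [ordered[i:i + size] for i in range(0, len(ordered), size)]


def smooth_by_means(bins):
    """Replace every value in a bin by the bin mean."""
    return [[sum(b) / len(b)] * len(b) for b in bins]


def smooth_by_boundaries(bins):
    """Replace every value by the closer of the bin's min and max."""
    smoothed = []
    for b in bins:
        lo, hi = b[0], b[-1]
        # Ties go to the lower boundary (assumption).
        smoothed.append([lo if v - lo <= hi - v else hi for v in b])
    return smoothed


prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
bins = equal_depth_bins(prices, 3)
by_means = smooth_by_means(bins)
by_boundaries = smooth_by_boundaries(bins)
```

Running this on the nine prices gives the bin means 9, 22, 29 and the boundary-smoothed bins 4, 4, 15; 21, 21, 24; 25, 25, 34.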
b. Regression: The most common method is linear regression, which finds the best-fitting relationship between two attributes (or variables) so that one attribute can be used to predict the other. Multiple linear regression is an extension of this method in which more than two attributes are considered and the data is fitted to a multidimensional surface.

Figure 3.1. Clustering customer data by address information

c. Clustering: Similar values are organized into groups, or "clusters". Values that fall outside these groups are treated as candidates for smoothing.
3.3. Data integration and transformation
3.3.1. Data integration
In many analysis problems we have to accept that the data sources used for the analysis are not uniform. Before they can be analyzed, the data must be integrated and combined into one unified data store. In terms of form, the sources can be stored in many different ways: common databases, flat files, data cubes, and so on. The question is how to integrate them while keeping the information equivalent across sources.
For example, how can the data analyst or the computer be sure that the customer id attribute in a database A and the cust number in a flat file are the same attribute?
Integration always needs information that describes the properties of each attribute (metadata), such as name, meaning, data type, domain, and the rules for handling null and zero values. The metadata is used to help transform the data, so this step is also related to data cleaning.
Data redundancy: This is also an important issue. For example, an annual revenue attribute may be redundant if it can be derived from other attributes or sets of attributes.
Some redundancy can be detected through correlation analysis. Given two attributes, correlation analysis measures how strongly one attribute depends on the other, based on the data available in the source. For numeric attributes, the correlation between two attributes A and B can be evaluated by computing the correlation coefficient:

$$ r_{A,B} = \frac{\sum_{i=1}^{N} (a_i - \bar{A})(b_i - \bar{B})}{N \sigma_A \sigma_B} = \frac{\sum_{i=1}^{N} a_i b_i - N \bar{A} \bar{B}}{N \sigma_A \sigma_B} $$

where:
- $N$ is the number of tuples;
- $a_i$ and $b_i$ are the values of attributes A and B in tuple $i$;
- $\bar{A}$ and $\bar{B}$ are the mean values of A and B;
- $\sigma_A$ and $\sigma_B$ are the standard deviations of A and B;
- $\sum a_i b_i$ is the sum of the AB cross-product (for each tuple, the value of attribute A is multiplied by the value of attribute B in that tuple).
- Note that $-1 \le r_{A,B} \le +1$.

If $r_{A,B}$ is greater than 0, then A and B are positively correlated: when the value of A increases, the value of B also increases. The higher the value, the stronger the relationship. As a consequence, if $r_{A,B}$ is high, one of the two attributes A or B can be removed as redundant.
If $r_{A,B}$ equals 0, then A and B are uncorrelated: there is no linear relationship between them.
If $r_{A,B}$ is less than 0, then A and B are negatively correlated: when one attribute increases, the value of the other decreases.
Note that a correlation between A and B does not imply a causal relationship, that is, that a change in A or B is caused by the other attribute. Consider, for example, the correlation between the number of hospitals and the number of car accidents in a locality. The two attributes have no direct causal relationship; both are causally linked to a third attribute, the population.
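As a rough check of the correlation coefficient for numeric attributes, the sketch below computes it in the usual Pearson form: the sum of the products a_i·b_i minus N times the product of the means, divided by N times the product of the standard deviations. The function name and the sample data are hypothetical, and the standard deviations are population ones (divided by N) so that they match this form.

```python
from math import sqrt


def correlation(a, b):
    """Pearson correlation coefficient of two equally long numeric lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    # Population standard deviations (divide by N).
    sd_a = sqrt(sum((x - mean_a) ** 2 for x in a) / n)
    sd_b = sqrt(sum((y - mean_b) ** 2 for y in b) / n)
    cross = sum(x * y for x, y in zip(a, b))   # sum of the AB cross-product
    return (cross - n * mean_a * mean_b) / (n * sd_a * sd_b)


r_pos = correlation([1, 2, 3, 4], [2, 4, 6, 8])   # B grows with A
r_neg = correlation([1, 2, 3], [3, 2, 1])         # B falls as A grows
```

A value close to +1 or -1, as in these two toy cases, suggests that one of the two attributes carries little extra information and is a candidate for removal.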
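For discrete data, a correlation between two attributes A and B can be discovered with the $\chi^2$ test. Suppose A has c distinct values, denoted a1, a2, ..., ac, and B has r distinct values, denoted b1, b2, ..., br. A table describing the relationship between A and B can be built as follows:
- the c values of A form the columns;
- the r values of B form the rows;
- (Ai, Bj) denotes the cases in which attribute A takes the value ai and B takes the value bj.
The $\chi^2$ value is computed as

$$ \chi^2 = \sum_{i=1}^{c} \sum_{j=1}^{r} \frac{(o_{ij} - e_{ij})^2}{e_{ij}} $$

where:
- $o_{ij}$ is the observed frequency of the cases (Ai, Bj);
- $e_{ij}$ is the expected frequency of the cases (Ai, Bj), computed as

$$ e_{ij} = \frac{count(A = a_i) \times count(B = b_j)}{N} $$

with $N$ the total number of tuples, $count(A = a_i)$ the number of tuples with value ai for attribute A, and $count(B = b_j)$ the number of tuples with value bj for attribute B.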

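Example: correlation analysis of attributes using the $\chi^2$ method

Suppose a group of 1500 people was surveyed. The gender of each person was recorded, and each person was then asked whether their preferred type of reading was fiction or non-fiction. There are thus two attributes, gender and reading preference. The number of occurrences of each case is given in the following table, with the expected frequencies in parentheses.

              Male        Female        Total
Fiction       250 (90)    200 (360)     450
Non-fiction   50 (210)    1000 (840)    1050
Total         300         1200          1500

From this we compute

$$ \chi^2 = \frac{(250-90)^2}{90} + \frac{(50-210)^2}{210} + \frac{(200-360)^2}{360} + \frac{(1000-840)^2}{840} = 284.44 + 121.90 + 71.11 + 30.48 = 507.93 $$

Note that in each row the expected frequencies (in parentheses) add up to the total observed frequency of that row, and in each column the expected frequencies add up to the total observed frequency of that column.
For this table the number of degrees of freedom is (r-1)(c-1) = (2-1)(2-1) = 1. With 1 degree of freedom, the $\chi^2$ value needed to reject the independence hypothesis at the 0.001 level is 10.828. The computed value 507.93 is far above it, so the hypothesis that reading preference is independent of gender is rejected: the two attributes are strongly correlated in the surveyed group.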
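The χ² value of the survey example can be recomputed with a few lines of code. This is a sketch with a hypothetical function name; the expected frequency of each cell is taken as row total times column total divided by N, which reproduces the figures 90, 210, 360, and 840 shown in parentheses in the table.

```python
def chi_square(observed):
    """Pearson chi-square statistic of a contingency table (list of rows)."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            e = row_totals[i] * col_totals[j] / n   # expected frequency
            chi2 += (o - e) ** 2 / e
    return chi2


# Rows: fiction, non-fiction; columns: male, female (observed counts).
observed = [[250, 200], [50, 1000]]
chi2 = chi_square(observed)
degrees_of_freedom = (len(observed) - 1) * (len(observed[0]) - 1)
reject_independence = chi2 > 10.828   # critical value at 0.001 for 1 degree of freedom
```

The statistic comes out at roughly 507.9, far above the critical value 10.828 quoted for one degree of freedom.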
3.3.2. Data transformation
In this step the data is transformed into forms suitable for data mining. Common methods include:
- Smoothing: removes noise from the data, for example by binning, regression, or clustering.
- Aggregation: summary or aggregation operations are applied to the data. For example, daily sales can be aggregated into monthly and yearly totals. This step is typically used to build a data cube for analysis.
- Generalization: low-level or raw data is replaced by higher-level concepts through a concept hierarchy. For example, a categorical attribute such as street can be generalized to a higher level such as city or country. Similarly, numeric values such as age can be mapped to higher-level concepts such as young, middle-aged, and senior.
- Normalization: the values of an attribute are scaled into a small range, such as -1.0 to 1.0 or 0.0 to 1.0.
- Attribute construction: new attributes are added to the data source to help the mining process.
This section focuses on normalization.
An attribute is normalized by scaling its values proportionally into a specified range, such as 0.0 to 1.0. Normalization is useful for classification algorithms based on neural networks and for distance-based algorithms used in nearest-neighbor classification and clustering. We consider three methods: min-max, z-score, and decimal scaling.
a. Min-max
This method performs a linear transformation on the original data. Suppose $min_A$ and $max_A$ are the minimum and maximum values of attribute A. Min-max normalization maps a value $v$ of A to $v'$ in the range $[new\_min_A, new\_max_A]$ by computing

$$ v' = \frac{v - min_A}{max_A - min_A} (new\_max_A - new\_min_A) + new\_min_A $$

Example: Suppose the smallest and largest values of the attribute average income are 500,000 and 4,500,000. We want to map the value 2,500,000 into the range [0.0, 1.0] using min-max normalization. The new value is

$$ v' = \frac{2{,}500{,}000 - 500{,}000}{4{,}500{,}000 - 500{,}000} (1.0 - 0.0) + 0.0 = 0.5 $$

b. z-score
With this method, the values of an attribute A are normalized using the standard deviation and the mean of A. A value $v$ of A is mapped to $v'$ as follows:

$$ v' = \frac{v - \bar{A}}{\sigma_A} $$

Continuing the example above: suppose the standard deviation and the mean of average income are 1,000,000 and 500,000 respectively. With z-score normalization the value 2,500,000 is mapped to

$$ v' = \frac{2{,}500{,}000 - 500{,}000}{1{,}000{,}000} = 2.0 $$

c. Decimal scaling

This method moves the decimal point of the values of attribute A. The number of decimal places moved depends on the largest absolute value that A can take. A value $v$ is mapped to $v'$ by computing

$$ v' = \frac{v}{10^j} $$

where $j$ is the smallest integer such that $\max(|v'|) < 1$.

Example: Suppose the recorded values of attribute A range from -986 to 917. The largest absolute value in the domain is 986. To normalize by decimal scaling, we divide each value by 1,000 (j = 3). The value -986 then becomes -0.986 and 917 becomes 0.917.
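The three normalization methods can be sketched as small functions. The names are hypothetical; the formulas used are the standard ones (min-max: shift and rescale linearly, z-score: subtract the mean and divide by the standard deviation, decimal scaling: divide by the smallest power of ten that brings every absolute value below 1). The z-score call assumes a mean of 500,000 and a standard deviation of 1,000,000.

```python
def min_max(v, min_a, max_a, new_min=0.0, new_max=1.0):
    """Linearly map v from [min_a, max_a] to [new_min, new_max]."""
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min


def z_score(v, mean_a, std_a):
    """Number of standard deviations between v and the mean."""
    return (v - mean_a) / std_a


def decimal_scaling(values):
    """Divide by 10**j, with j the smallest integer giving max(|v'|) < 1."""
    j = 0
    while max(abs(v) for v in values) / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values], j


income_mm = min_max(2_500_000, 500_000, 4_500_000)
income_z = z_score(2_500_000, 500_000, 1_000_000)
scaled, j = decimal_scaling([-986, 917])
```

Min-max keeps the shape of the distribution but is sensitive to outliers at the extremes, whereas z-score is useful when the true minimum and maximum are unknown.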
3.3.3. Data reduction
Data mining is always carried out on huge and complex data stores. Mining techniques applied to them consume a lot of time and computing resources, so the data needs to be reduced before the mining techniques are applied. Some data reduction strategies are:
- Data cube aggregation: aggregation operations are applied to the data to form data cubes.
- Attribute subset selection: irrelevant, weak, or redundant attributes or dimensions are removed.
- Dimensionality reduction: encoding mechanisms are used to reduce the size of the data.
- Numerosity reduction: the data is replaced by smaller alternative data that represents the same content.
- Discretization and concept hierarchies: attribute values are replaced by ranges or higher-level concepts. Discretization is a form of numerosity reduction that is very useful for generating concept hierarchies automatically. This approach allows data to be mined at several levels of abstraction.
a. Data cube aggregation
Consider the sales data of an organization, reported per quarter for the years 2008 to 2010. The mining task, however, is more interested in annual sales than in sales per quarter. The data should therefore be aggregated into a report of total sales per year rather than per quarter.
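Rolling quarterly figures up to yearly totals, as described above, amounts to a group-by-and-sum. The sketch below uses made-up amounts, since the actual figures appear only in the figures of the notes; the function name `yearly_totals` is hypothetical.

```python
from collections import defaultdict


def yearly_totals(quarterly_sales):
    """Sum (year, quarter, amount) rows into one total per year."""
    totals = defaultdict(int)
    for year, _quarter, amount in quarterly_sales:
        totals[year] += amount
    return dict(totals)


# Hypothetical quarterly sales, not taken from the notes.
sales = [
    (2008, 1, 100), (2008, 2, 150), (2008, 3, 120), (2008, 4, 130),
    (2009, 1, 110), (2009, 2, 160), (2009, 3, 140), (2009, 4, 190),
]
per_year = yearly_totals(sales)
```

The aggregated table is four times smaller than the quarterly one while still answering every question asked at the yearly level.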

Figure 3.2. Sales data

Figure 3.3. Aggregated data

A concept hierarchy can exist for each attribute and allows the data to be analyzed at several levels of abstraction. For example, a branch hierarchy allows branches to be grouped by region based on their address. A data cube gives fast access to precomputed, aggregated data, so it is well suited to mining processes.
The data cube built at the lowest level of abstraction is usually called the base cuboid. The base cuboid corresponds to some set of entities, such as salespeople or customers, and provides much useful information for analysis. The data cube at the highest level of abstraction is called the apex cuboid; for the data in Figure 3.3 it represents the sales of all three years, all item types, and all branches. Data cubes built at different levels of abstraction are called cuboids, so a data cube is also referred to as a lattice of cuboids.

b. Attribute subset selection
The data source to be analyzed may contain hundreds of attributes, many of which are irrelevant to the analysis or redundant. Suppose, for example, that the task is only to classify customers by whether or not they want to buy a new music disc. The customer's phone number is then unnecessary compared with attributes such as age and musical taste. Even so, choosing which attributes matter is difficult and time-consuming, especially when the characteristics of the data are unclear. Leaving out relevant attributes or keeping useless ones can cause confusion and distort the results of data mining algorithms.
This method reduces the size of the data by removing attributes that are not useful or are redundant (that is, by removing dimensions). The main goal is to find the smallest attribute set such that the result of mining with it is as close as possible to the result obtained with all attributes.
How can a good attribute subset be found from the original attribute set? Note that N attributes have 2^N possible subsets. Generating and examining all of them is costly in effort and resources, especially as N and the number of data classes grow. Other methods are therefore needed. One of them is greedy search, which traverses the attribute space and makes the best available choice at each step.

Forward selection            Backward elimination           Decision tree
Initial attribute set:       Initial attribute set:         Initial attribute set:
{A1, A2, A3, A4, A5, A6}     {A1, A2, A3, A4, A5, A6}       {A1, A2, A3, A4, A5, A6}
Initial reduced set: {}
=> {A1}                      => {A1, A3, A4, A5, A6}
=> {A1, A4}                  => {A1, A4, A5, A6}
=> Result {A1, A4, A6}       => Result {A1, A4, A6}         => Result {A1, A4, A6}
Table 3.2. Example of reduction techniques

Good (or bad) attributes are identified through statistical tests, which assume that the attribute under consideration is independent of the others, or through attribute evaluation based on information measures such as those used to build decision trees for classification.
Common selection techniques are:
1. Forward selection: Start from an empty attribute set. At each step, the best of the remaining attributes is added to the set. Repeat until no further attribute can be added.
2. Backward elimination: Start from the full attribute set. At each step, remove the worst remaining attribute.
3. A combination of forward selection and backward elimination: at each step, the best attribute is added to the set and, at the same time, the worst attribute is removed from the set under consideration.
4. Decision tree: A tree is built from the original data source. All attributes that do not appear in the tree are considered not useful. The set of attributes that appear in the tree is the reduced attribute set.
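Forward selection, the first technique in the list above, can be sketched as a greedy loop. The scoring function is supplied by the caller; the toy score below (useful attributes add value, every attribute has a small cost) is purely an assumption chosen so that the outcome matches the {A1, A4, A6} result of the example table, and is not a measure used in the notes.

```python
def forward_selection(attributes, score):
    """Greedy stepwise forward selection.

    score(subset) is a caller-supplied quality measure (higher is better).
    """
    selected = []
    best = score(selected)
    remaining = list(attributes)
    while remaining:
        # Pick the attribute whose addition gives the highest score.
        trial = max(remaining, key=lambda a: score(selected + [a]))
        trial_score = score(selected + [trial])
        if trial_score <= best:
            break                      # no attribute improves the set any more
        selected.append(trial)
        remaining.remove(trial)
        best = trial_score
    return selected


# Hypothetical usefulness of each attribute; unlisted attributes are useless.
useful = {"A1": 3, "A4": 2, "A6": 1}


def toy_score(subset):
    return sum(useful.get(a, 0) for a in subset) - 0.1 * len(subset)


reduced = forward_selection(["A1", "A2", "A3", "A4", "A5", "A6"], toy_score)
```

Backward elimination follows the same pattern in reverse: start from the full set and drop the attribute whose removal hurts the score least.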

Exercises:
1. If an attribute in a Student-Grade data source takes the values A, B, C, D, F, what is the expected data type of that attribute during preprocessing?
2. Given the one-dimensional array X = {-5.0, 23.0, 17.6, 7.23, 1.11}, normalize the array using
a. Decimal scaling: in the range [-1, 1].
b. Min-max: in the range [0, 1].
c. Min-max: in the range [-1, 1].
d. Standard deviation normalization.
e. Compare the results of the normalizations above and comment on the advantages and disadvantages of each method.
3. Smooth the data using the rounding technique for the following set:
Y = {1.17, 2.59, 3.38, 4.23, 2.67, 1.73, 2.53, 3.28, 3.44}
Then present the resulting set with the precisions:
a. 0.1
b. 1.
4. Given the following set of samples with missing values (a dash marks a missing value):
X1 = {0, 1, 1, 2}
X2 = {2, 1, -, 1}
X3 = {1, -, -, 0}
X4 = {-, 2, 1, -}
If the domain of all attributes is [0, 1, 2], determine the missing values, knowing that each can be any one of the values in the domain.
5. Explain what is gained and what is lost when the dimensionality of a large data warehouse is reduced during data preprocessing.

Chapter 4: Association rules
4.1. The concept of association rules
Let I = {i1, i2, …, in} be a set of items; each element of I is called an item. Items are sometimes also called attributes, and I the attribute set. Each subset of I is called an itemset, and the number of elements in an itemset is called its length or size.
Let D = {t1, t2, …, tm} be a transaction database, where each ti is a transaction and a subset of I. The number of transactions (the cardinality of D, written |D| or card(D)) is usually very large.
Let X and Y be two itemsets (two subsets of I). An association rule is written X → Y, where X and Y are disjoint. It expresses how the itemset Y depends on the itemset X, in the sense of how the appearance of X entails the appearance of Y in the transactions. An itemset X is said to appear in a transaction t if X is a subset of t. The support of an itemset X, written supp(X), is defined as the fraction of transactions in D that contain X:
supp(X) = N(X)/|D|
where N(X) is the number of transactions in D that contain X.
The value of an association rule X → Y is expressed through two measures, the support supp(X → Y) and the confidence conf(X → Y).
The support supp(X → Y) is the fraction of transactions in D that contain X ∪ Y:
supp(X → Y) = P(X ∪ Y) = N(X ∪ Y)/|D|
where N(X ∪ Y) is the number of transactions that contain X ∪ Y.
The confidence conf(X → Y) is the ratio of the transactions that contain X ∪ Y to the transactions that contain X:
conf(X → Y) = P(Y|X) = N(X ∪ Y)/N(X) = supp(X → Y)/supp(X)
It follows from the definitions that 0 ≤ supp(X → Y) ≤ 1 and 0 ≤ conf(X → Y) ≤ 1. In probabilistic terms, the support is the probability that the itemset X ∪ Y appears, and the confidence is the conditional probability that Y appears given that X appears.
An association rule X → Y is considered knowledge (a valuable pattern) if both supp(X → Y) ≥ minsup and conf(X → Y) ≥ minconf hold, where minsup and minconf are two given thresholds.
An itemset X whose support reaches the threshold minsup is called a frequent itemset.
4.2. Thut ton Apriori
Thut ton Apriori l mt thut ton in hnh p dng trong khai ph lut
kt hp. Thut ton da trn nguyn l Apriori tp con bt k ca mt tp ph
bin cng l mt tp ph bin. Mc ch ca thut ton Apriori l tm ra c

42
tt c cc tp ph bin c th c trong c s d liu giao dch D. Thut ton
hot ng theo nguyn tc quy hoch ng, ngha l t cc tp F i = { ci | ci l
tp ph bin, |ci| = 1} gm mi tp mc ph bin c di i (1 i k), i
tm tp Fk+1 gm mi tp mc ph bin c di k+1. Cc mc i 1, i2,, in trong
thut ton c sp xp theo mt th t c nh.
Thut ton Apriori:
Input:

C s d liu giao dch D = {t1, t2,, tm}.


Ngng ti thiu minsup > 0.

Output:

Tp hp tt c cc tp ph bin.

mincount = minsup * |D|;


F1 = { cc tp ph bin c di 1};
for(k=1; Fk != ; k++)
{
Ck+1 = Apriori_gen(Fk);
for each t in D
{
Ct = { c Ck+1 | c t};
for c Ct
c.count++;
}
Fk+1 = {c Ck+1 | c.count > mincount}
}
Fk ;
return
k

The subprocedure Apriori_gen generates the itemsets of length k+1 from the itemsets of length k in Fk. It joins itemsets that share a common prefix and then applies the Apriori principle to discard the itemsets that cannot be frequent:
Join step: generate the itemsets Lk+1 that are candidates for frequent itemsets of length k+1 by combining two frequent itemsets Pk and Qk of length k whose first k-1 items coincide:
Lk+1 = Pk + Qk = {i1, i2, …, ik-1, ik, ik'}
with Pk = {i1, i2, …, ik-1, ik} and Qk = {i1, i2, …, ik-1, ik'}, where i1 ≤ i2 ≤ … ≤ ik-1 ≤ ik ≤ ik'.
Prune step: keep only the candidates Lk+1 that satisfy the Apriori principle, that is, every subset of length k of the candidate is a frequent itemset (for all X ⊆ Lk+1 with |X| = k, X ∈ Fk).
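The join and prune steps can be sketched as follows. This is an illustrative sketch, not a reference implementation: itemsets are assumed to be stored as sorted tuples, and the list F2 is a hypothetical input chosen for the example.

```python
from itertools import combinations


def apriori_gen(frequent_k):
    """Build (k+1)-candidates from frequent k-itemsets given as sorted tuples."""
    frequent = set(frequent_k)
    ordered = sorted(frequent)
    candidates = []
    for i, p in enumerate(ordered):
        for q in ordered[i + 1:]:
            if p[:-1] != q[:-1]:
                break  # sorted order: no later q shares p's (k-1)-prefix
            candidate = p + (q[-1],)  # join step
            # prune step: every k-subset of the candidate must be frequent
            if all(s in frequent for s in combinations(candidate, len(p))):
                candidates.append(candidate)
    return candidates


# Hypothetical set of frequent 2-itemsets.
F2 = [("I1", "I2"), ("I1", "I3"), ("I1", "I5"),
      ("I2", "I3"), ("I2", "I4"), ("I2", "I5")]
C3 = apriori_gen(F2)
print(C3)  # [('I1', 'I2', 'I3'), ('I1', 'I2', 'I5')]
```

The candidate ('I1', 'I3', 'I5'), for instance, is produced by the join but dropped by the prune step because ('I3', 'I5') is not in F2.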
In every step k, the Apriori algorithm has to scan the transaction database D. At start-up the algorithm scans D to obtain F1 (items whose support is below minsup are discarded).
The result of the algorithm is the set of all frequent itemsets of length 1 to k:
F = F1 ∪ F2 ∪ … ∪ Fk
To generate the association rules, for each frequent itemset I ∈ F we enumerate the non-empty proper subsets of I. Each such subset s yields an association rule s → (I − s) if its confidence satisfies:
conf(s → (I − s)) = supp(I)/supp(s) ≥ minconf, where minconf is the given confidence threshold.
Session      Pages accessed
Session 1    /shopping/comestic.htm, /shopping/fashion.htm, /cars.htm
Session 2    /shopping/fashion.htm, /news.htm
Session 3    /shopping/fashion.htm, /sport.htm
Session 4    /shopping/comestic.htm, /shopping/fashion.htm, /news.htm
Session 5    /shopping/comestic.htm, /sport.htm
Session 6    /shopping/fashion.htm, /sport.htm
Session 7    /shopping/comestic.htm, /sport.htm
Session 8    /shopping/comestic.htm, /shopping/fashion.htm, /sport.htm, /cars.htm
Session 9    /shopping/comestic.htm, /shopping/fashion.htm, /sport.htm
Table 3.1: Access sessions of one user
Suppose that, after preprocessing the data collected from a web log, we have identified the user's access sessions as in Table 3.1. Here each session can be regarded as a transaction and each page accessed as an item. Applying the Apriori algorithm helps determine which pages tend to be accessed together. The patterns obtained provide useful knowledge for fields such as electronic marketing, or for reorganizing a website so that it is as convenient as possible for its users.
For brevity, we denote the pages as follows:

/shopping/comestic.htm    I1
/shopping/fashion.htm     I2
/sport.htm                I3
/news.htm                 I4
/cars.htm                 I5
We obtain a transaction database D consisting of 9 transactions with the following itemsets:

Transaction   Itemset
T01           I1, I2, I5
T02           I2, I4
T03           I2, I3
T04           I1, I2, I4
T05           I1, I3
T06           I2, I3
T07           I1, I3
T08           I1, I2, I3, I5
T09           I1, I2, I3
We apply the Apriori algorithm to this transaction database with the thresholds minsup = 2/9 ≈ 22% and minconf = 70%.
Step 1: Scan the transaction database D to determine the support of the itemsets of length 1. Itemsets whose support is below 2/9 are discarded. In this case no itemset is discarded; all of them are frequent.
Itemset   Count   Support
{I1}      6       6/9
{I2}      7       7/9
{I3}      6       6/9
{I4}      2       2/9
{I5}      2       2/9

After discarding the itemsets whose support is below minsup = 2/9, the frequent itemsets of length 1 are:

Frequent itemset   Count   Support
{I1}               6       6/9
{I2}               7       7/9
{I3}               6       6/9
{I4}               2       2/9
{I5}               2       2/9

Step 2: Generate the itemsets of length 2 by joining the itemsets of length 1, scan the transaction database D to determine the support of each candidate, and discard the itemsets whose support is below 2/9 to obtain the frequent itemsets.

Itemset     Count   Support
{I1, I2}    4       4/9
{I1, I3}    4       4/9
{I1, I4}    1       1/9
{I1, I5}    2       2/9
{I2, I3}    4       4/9
{I2, I4}    2       2/9
{I2, I5}    2       2/9
{I3, I4}    0       0
{I3, I5}    1       1/9
{I4, I5}    0       0

After discarding the itemsets whose support is below minsup = 2/9, the frequent itemsets of length 2 are:

Frequent itemset   Count   Support
{I1, I2}           4       4/9
{I1, I3}           4       4/9
{I1, I5}           2       2/9
{I2, I3}           4       4/9
{I2, I4}           2       2/9
{I2, I5}           2       2/9

In step 2 the Apriori principle is not yet needed for pruning, because the subsets of an itemset of length 2 are itemsets of length 1 and, as seen in step 1, all itemsets of length 1 are frequent.
Step 3: Join the itemsets of length 2 to obtain the itemsets of length 3. In this step the Apriori principle must be used to discard the itemsets that have a subset which is not frequent.
The join produces the following itemsets:
{I1, I2, I3}, {I1, I2, I5}, {I1, I3, I5}, {I2, I3, I4}, {I2, I3, I5}, {I2, I4, I5}
The itemsets {I1, I3, I5}, {I2, I3, I4}, {I2, I3, I5} and {I2, I4, I5} are discarded because each has a subset that is not frequent (for example {I3, I5}, {I3, I4}, {I3, I5} and {I4, I5} respectively). The following itemsets remain:
Itemset         Count   Support
{I1, I2, I3}    2       2/9
{I1, I2, I5}    2       2/9

Both reach minsup = 2/9, so the frequent itemsets of length 3 are:

Frequent itemset   Count   Support
{I1, I2, I3}       2       2/9
{I1, I2, I5}       2       2/9

Step 4: Joining the two itemsets {I1, I2, I3} and {I1, I2, I5} yields the itemset {I1, I2, I3, I5} of length 4. This itemset is discarded because its subset {I2, I3, I5} is not frequent. The algorithm terminates.
The frequent itemsets obtained are:
F = {{I1}, {I2}, {I3}, {I4}, {I5}, {I1, I2}, {I1, I3}, {I1, I5}, {I2, I3}, {I2, I4}, {I2, I5},
{I1, I2, I3}, {I1, I2, I5}}
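The four steps above can be reproduced with a short program. The following is a minimal sketch of the level-wise search, assuming itemsets are stored as sorted tuples and the threshold is given as an absolute count (2 of 9 transactions); the function name and data layout are choices made for this illustration.

```python
from itertools import combinations

D = [{"I1", "I2", "I5"}, {"I2", "I4"}, {"I2", "I3"},
     {"I1", "I2", "I4"}, {"I1", "I3"}, {"I2", "I3"},
     {"I1", "I3"}, {"I1", "I2", "I3", "I5"}, {"I1", "I2", "I3"}]


def apriori(transactions, min_count):
    """Return {itemset (sorted tuple): support count} for all frequent itemsets."""
    counts = {}
    for t in transactions:
        for item in t:
            counts[(item,)] = counts.get((item,), 0) + 1
    level = {c: n for c, n in counts.items() if n >= min_count}
    frequent = dict(level)
    while level:
        prev = sorted(level)
        k = len(prev[0])
        # join itemsets that share a (k-1)-prefix
        candidates = [p + (q[-1],)
                      for i, p in enumerate(prev) for q in prev[i + 1:]
                      if p[:-1] == q[:-1]]
        # prune candidates that have an infrequent k-subset
        candidates = [c for c in candidates
                      if all(s in level for s in combinations(c, k))]
        # one scan of the database per level
        counts = {c: sum(1 for t in transactions if set(c) <= t)
                  for c in candidates}
        level = {c: n for c, n in counts.items() if n >= min_count}
        frequent.update(level)
    return frequent


F = apriori(D, min_count=2)
print(len(F))                  # 13 frequent itemsets
print(F[("I1", "I2", "I5")])   # 2
```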
To generate the association rules, each frequent itemset is split into two disjoint subsets and the confidence of the corresponding rule is computed. Rules whose confidence reaches the threshold minconf = 70% are kept. For example, consider the frequent itemset {I1, I2, I5}. It gives the following rules:
R1: I1, I2 → I5
conf(R1) = supp({I1, I2, I5})/supp({I1, I2}) = 2/4 = 50% (R1 is discarded)
R2: I1, I5 → I2
conf(R2) = supp({I1, I2, I5})/supp({I1, I5}) = 2/2 = 100%
R3: I2, I5 → I1
conf(R3) = supp({I1, I2, I5})/supp({I2, I5}) = 2/2 = 100%
R4: I1 → I2, I5
conf(R4) = supp({I1, I2, I5})/supp({I1}) = 2/6 = 33% (R4 is discarded)
R5: I2 → I1, I5
conf(R5) = supp({I1, I2, I5})/supp({I2}) = 2/7 = 29% (R5 is discarded)
R6: I5 → I1, I2
conf(R6) = supp({I1, I2, I5})/supp({I5}) = 2/2 = 100%
From rule R2 we can conclude that a user who is interested in the pages comestic.htm and cars.htm is very likely to be interested in the page fashion.htm as well. This can serve as a hint for an advertising plan. Similarly, from rule R6 we can conclude that a user who is interested in cars is also interested in fashion and cosmetics. It therefore makes sense to place advertising banners and links to fashion.htm and comestic.htm directly on the page cars.htm for the users' convenience.
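The confidence calculations for the itemset {I1, I2, I5} can be checked mechanically. The sketch below assumes the support counts found in the example are available in a dictionary keyed by sorted tuples; the helper name rules_from is hypothetical.

```python
from itertools import combinations

# Support counts taken from the worked example (|D| = 9).
count = {
    ("I1",): 6, ("I2",): 7, ("I5",): 2,
    ("I1", "I2"): 4, ("I1", "I5"): 2, ("I2", "I5"): 2,
    ("I1", "I2", "I5"): 2,
}


def rules_from(itemset, count, min_conf):
    """Return (antecedent, consequent, confidence) for each rule s -> itemset - s."""
    kept = []
    for r in range(1, len(itemset)):
        for s in combinations(itemset, r):
            conf = count[itemset] / count[s]
            if conf >= min_conf:
                consequent = tuple(i for i in itemset if i not in s)
                kept.append((s, consequent, conf))
    return kept


rules = rules_from(("I1", "I2", "I5"), count, min_conf=0.7)
for lhs, rhs, conf in rules:
    print(lhs, "->", rhs, f"{conf:.0%}")
```

Only three of the six candidate rules survive the 70% threshold: I5 → I1, I2; I1, I5 → I2; and I2, I5 → I1.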
4.3. The FP-Growth algorithm applied to Web usage mining
The weakness of the Apriori algorithm is that it generates too many candidate itemsets. Suppose there are initially 10^4 frequent itemsets of length 1; the join step then produces on the order of 10^7 itemsets of length 2 (precisely 10^4(10^4 − 1)/2, about 5 × 10^7). Clearly, a frequent itemset of length k requires at least 2^k − 1 candidate itemsets to be generated before it. Another weakness is that Apriori has to scan the data set many times, which is costly when the itemsets become long: if an itemset of length k is generated, the data set has to be scanned k+1 times.
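The two counts mentioned above are easy to verify numerically. In this small check the pattern length k = 20 is an arbitrary value chosen for illustration.

```python
import math

n_items = 10**4
pairs = math.comb(n_items, 2)   # number of candidate 2-itemsets
print(pairs)                    # 49995000, i.e. 10^4 * (10^4 - 1) / 2

k = 20                          # hypothetical length of a frequent itemset
subsets = 2**k - 1              # its non-empty subsets, all of them frequent
print(subsets)                  # 1048575
```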

The FP-Growth algorithm for mining association rules is built on the following basic principles:
1. Compress the data set into a tree structure, which reduces the cost of working with the whole data set during mining. Infrequent items are removed early without affecting the mining result.
2. Apply the divide-and-conquer method thoroughly. The mining process is divided into smaller stages: building the FP-tree and mining the frequent itemsets from the FP-tree that has been built.
3. Avoid generating candidate itemsets. At each step the algorithm examines only a part of the data set.
The FP-tree is a tree-shaped data structure organized as follows:
1. The root node is labelled null.
2. Every other node holds the fields item-name, count and node-link, where:
- Item-name: the name of the item the node represents.
- Count: the number of transactions containing the pattern formed by the items on the path from the root to this node.
- Node-link: points to the next node in the tree that carries the same item-name (or to null if there is no such node).
3. The Header table has as many rows as there are items. Each row holds three attributes: item-name, item-count and node-link, where:
- Item-name: the name of the item.
- Item-count: the sum of the count fields of all nodes containing that item.
- Node-link: points to the most recently created node in the tree that contains that item.
The FP-tree can be built from the transaction database D by the following procedure:
Input:
  - Transaction database D.
  - Threshold min_sup.
Output:
  - The FP-tree.

Procedure FP_TreeConstruction
{
1. Scan D a first time to obtain the set F of frequent items and their support counts. Sort the items in F in descending order of support count to obtain the list L.
2. Create the root node R and label it null.
   Create the Header table with |F| rows and set all node-links to null.
3. for each transaction T ∈ D
   {
       // second scan of D
       Select the frequent items of T and put them into P;
       Sort the items in P according to the order of L;
       Call Insert_Tree(P, R);
   }
}
The subprocedure Insert_Tree is defined as follows:
Procedure Insert_Tree(P, R)
{
    Let P = [p | P − p], where p is the first element and P − p is the rest of the list;
    if R has a child N such that N.item-name = p then
        N.count++;
    else
    {
        Create a new node N;
        N.count = 1;
        N.item-name = p;
        N.parent = R;
        // Thread the node-link for this item; H is the Header table
        N.node-link = H[p].head;
        H[p].head = N;
    }
    // Increase the count of p in the Header table by 1
    H[p].count++;
    if (P − p) != null then Call Insert_Tree(P − p, N);
}
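The construction procedure can be sketched in Python as follows. This is an illustrative sketch under a few assumptions: the class and function names are invented for the example, ties in support count are broken by item name, and the nine-transaction database of Section 4.2 is used as input.

```python
from collections import Counter


class Node:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count = 0
        self.children = {}   # item -> child Node
        self.link = None     # next node that carries the same item


def build_fp_tree(transactions, min_count):
    """Two scans: count the items, then insert each ordered transaction."""
    freq = Counter(item for t in transactions for item in set(t))
    freq = {i: c for i, c in freq.items() if c >= min_count}
    # list L: descending support count, ties broken by item name
    order = sorted(freq, key=lambda i: (-freq[i], i))
    rank = {item: r for r, item in enumerate(order)}
    root = Node(None, None)
    header = {item: None for item in order}  # item -> most recent node
    for t in transactions:
        node = root
        for item in sorted((i for i in set(t) if i in rank), key=rank.get):
            child = node.children.get(item)
            if child is None:
                child = Node(item, node)
                node.children[item] = child
                child.link, header[item] = header[item], child
            child.count += 1
            node = child
    return root, header, order


def chain_total(header, item):
    """Sum the counts along the node-link chain of `item`."""
    total, node = 0, header[item]
    while node is not None:
        total += node.count
        node = node.link
    return total


D = [{"I1", "I2", "I5"}, {"I2", "I4"}, {"I2", "I3"},
     {"I1", "I2", "I4"}, {"I1", "I3"}, {"I2", "I3"},
     {"I1", "I3"}, {"I1", "I2", "I3", "I5"}, {"I1", "I2", "I3"}]
root, header, order = build_fp_tree(D, min_count=2)
print(order)                                             # ['I2', 'I1', 'I3', 'I4', 'I5']
print({i: n.count for i, n in root.children.items()})    # {'I2': 7, 'I1': 2}
```

Walking a node-link chain and summing the counts recovers the support count of an item, which is what the Item-count column of the Header table stores.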
To discover the frequent patterns from the FP-tree, we use the procedure FP-Growth:
Input:
  - The FP-tree of the transaction database D, the threshold min_sup, α = null.
Output:
  - The complete set of frequent patterns F.

Procedure FP-Growth(Tree, α)
{
    F = ∅;
    if Tree contains a single path P then
    {
        for each combination β of the nodes in P do
        {
            Generate the pattern p = β ∪ α;
            support_count(p) = minimum support count of the nodes in β;
            F = F ∪ p;
        }
    }
    else
    for each ai in the header of Tree
    {
        Generate the pattern β = ai ∪ α;
        support_count(β) = ai.support_count;
        F = F ∪ β;
        Build the conditional pattern base of β;
        Build the conditional FP-tree Tree β of β;
        if (Tree β != ∅) then F = F ∪ FP_Growth(Tree β, β);
    }
    return F;
}
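The recursive structure of the procedure can be illustrated with a simplified sketch. Note the simplification: instead of compressing each conditional pattern base into a conditional FP-tree, the code below keeps it as a plain list of (path, count) pairs, so it shows the divide-and-conquer pattern growth but not the memory savings of the tree. The function name and data layout are assumptions for this illustration.

```python
from collections import Counter


def pattern_growth(base, min_count, suffix=()):
    """base: list of (items, count) pairs; returns {frozenset pattern: support count}."""
    freq = Counter()
    for items, n in base:
        for item in set(items):
            freq[item] += n
    freq = {i: c for i, c in freq.items() if c >= min_count}
    order = sorted(freq, key=lambda i: (-freq[i], i))
    rank = {item: r for r, item in enumerate(order)}
    # keep frequent items only, each path sorted by the local order L
    paths = [(sorted((i for i in set(items) if i in rank), key=rank.get), n)
             for items, n in base]
    found = {}
    for item in reversed(order):          # least frequent item first
        pattern = (item,) + suffix
        found[frozenset(pattern)] = freq[item]
        # conditional pattern base: the prefix of every path containing `item`
        cond = [(p[:p.index(item)], n) for p, n in paths if item in p]
        cond = [(p, n) for p, n in cond if p]
        found.update(pattern_growth(cond, min_count, pattern))
    return found


D = [{"I1", "I2", "I5"}, {"I2", "I4"}, {"I2", "I3"},
     {"I1", "I2", "I4"}, {"I1", "I3"}, {"I2", "I3"},
     {"I1", "I3"}, {"I1", "I2", "I3", "I5"}, {"I1", "I2", "I3"}]
patterns = pattern_growth([(t, 1) for t in D], min_count=2)
print(len(patterns))                          # 13
print(patterns[frozenset({"I2", "I3"})])      # 4
```

On the nine-transaction database of Section 4.2 this yields the same 13 frequent itemsets as Apriori, without ever generating a candidate that is absent from the data.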
We apply the FP-Growth algorithm to the transaction database D considered in Section 4.2, with the minimum support count 2 (that is, min_sup = 2/9):

Transaction   Itemset
T01           I1, I2, I5
T02           I2, I4
T03           I2, I3
T04           I1, I2, I4
T05           I1, I3
T06           I2, I3
T07           I1, I3
T08           I1, I2, I3, I5
T09           I1, I2, I3
First the FP-tree is built step by step. The transactions are examined one after another and the corresponding items are added to the tree.
First scan: find the itemsets of length 1 and sort them into a list in descending order of frequency. Discarding the itemsets whose support is below the threshold min_sup gives the list:
L = {{I2:7}, {I1:6}, {I3:6}, {I4:2}, {I5:2}}
Second scan: build the FP-tree step by step. The items of each transaction are processed in the order of L.

After all transactions have been read, the complete FP-tree has been built together with the corresponding Header table.

To generate the frequent patterns, the FP-tree is traversed:


- Consider ai = I3: β = I3 ∪ α = I3:6 (support_count(β) = 6)
  Conditional pattern base (I3 is the suffix):
  {{I1, I2: 2}, {I2: 2}, {I1: 2}}
  Build the conditional FP-tree (Tree β) from the pattern set
  {{I1, I2: 2}, {I2: 2}, {I1: 2}}
  with min_sup = 2, which gives the local list L = {I1: 4, I2: 4}.

  Next, within this conditional tree, consider ai = I2: β = I2 ∪ α = {I3:6, I2:4}
  support_count(β) = support_count(I2) = 4
  Conditional pattern base: {{I1: 2}}
  The resulting tree consists of a single path.
  The frequent patterns are: {I3, I2, I1: 2}, {I2, I3: 4}

  Then consider ai = I1: β = I1 ∪ α = {I3:6, I1:4}
  support_count(β) = support_count(I1) = 4
  Conditional pattern base: {}
  Resulting tree: null
  Frequent pattern: β, that is {I3, I1: 4}

Proceeding in the same way for the other nodes ai in the Header of the tree, we finally obtain the following frequent itemsets:
Item   Conditional pattern base            Conditional FP-tree      Frequent patterns generated
I5     {{I2, I1:1}, {I2, I1, I3:1}}        <I2:2, I1:2>             {I2, I5:2}, {I1, I5:2}, {I2, I1, I5:2}
I4     {{I2, I1:1}, {I2:1}}                <I2:2>                   {I2, I4:2}
I3     {{I2, I1:2}, {I2:2}, {I1:2}}        <I2:4, I1:2>, <I1:2>     {I2, I3:4}, {I1, I3:4}, {I2, I1, I3:2}
I1     {{I2:4}}                            <I2:4>                   {I2, I1:4}

4.4. Comparison and evaluation
The biggest difference between the two algorithms is that Apriori has to generate a large number of candidate itemsets, whereas FP-Growth tries to avoid this. Apriori performs poorly when the itemsets are long and the support threshold is low, because the number of frequent patterns is then large. The set of candidates can grow to a size that is hard to accept.

Figure 4.1: Comparison of execution time for different support thresholds

FP-Growth avoids the explosion of candidate sets by compressing the data into a tree structure, tightly controlling the generation of candidates and applying the divide-and-conquer strategy effectively. Its biggest drawback is that building the FP-tree is fairly complex and costly in both time and memory. Once the FP-tree has been built, however, mining the frequent patterns becomes very efficient. Figures 4.1 and 4.2 compare the execution time of the two algorithms for different support thresholds and different numbers of transactions.

Figure 4.2: Comparison of execution time for different numbers of transactions

4.5. Conclusion of Chapter 4

Association rules are the most typical kind of pattern in the analysis of Web access patterns. This chapter gave a brief introduction to association rules and to the two classical algorithms used to mine them, Apriori and FP-Growth. Although FP-Growth is much more efficient than Apriori, implementing it in practice is fairly complex, and the memory cost of storing the entire FP-tree has to be taken into account.

Exercises:
THEORY:
1. What values are typically used as the support and confidence parameters of the Apriori algorithm?
2. Why is the process of discovering association rules relatively simple compared with generating the large number of itemsets in a transaction database?
3. Given the following transaction database:
X: TID Items
T01 A, B, C, D
T02 A, C, D, F
T03 C, D, E, G, A
T04 A, D, F, B
T05 B, C, G
T06 D, F, G
T07 A, B, G
T08 C, D, F, G
a. Using the thresholds support = 25% and confidence = 60%, find:
1. All frequent itemsets in the database X.
2. The strong association rules.
5. Given the following transaction database:
Y: TID Items
T01 A1, B1, C2
T02 A2, C1, D1
T03 B2, C2, E2
T04 B1, C1, E1
T05 A3, C3, E2
T06 C1, D2, E2
a. Using the thresholds support s = 30% and confidence c = 60%, find:
1. All frequent itemsets in Y.
2. If the items are structured such that A = {A1, A2, A3}, B = {B1, B2}, C = {C1, C2, C3}, D = {D1, D2} and E = {E1, E2}, find the itemsets defined at the concept level.
3. Find the strong association rules for the itemsets of the previous question.
PRACTICE:
1. Use the Apriori algorithm to find the frequent itemsets in the Northwind database.

Chapter 5: Classification and prediction
5.1. Basic concepts
Data warehouses contain a great deal of useful information that can support decisions about the management and direction of an organization. Classification and prediction are two forms of data analysis used to extract models that describe important data classes or to predict data that will arise in the future. Such analysis helps us understand large data warehouses better. For example, we can build a classification model to decide whether a bank loan is safe or risky, or build a prediction model to estimate the spending of potential customers from information about their income. Many classification and prediction methods have been studied in machine learning, pattern recognition and statistics. Most of these algorithms are memory-bound and assume that the data set is small. Recent data mining research has developed classification and prediction methods that are better suited to large data sources.
5.1.1. Classification
Classification builds a model, called a classifier, that assigns class labels to data: for example the labels "Safe" or "Risky" for loan applications, or "Yes" or "No" for marketing data. The labels used for classification are discrete values whose ordering carries no meaning.
Data classification consists of two steps. In the first step a classifier is built from the data. This is the learning step, in which a classification algorithm builds the classifier by analysing, or learning from, a prepared training set made up of many data tuples. A tuple X is represented by an n-dimensional vector X = (x1, x2, …, xn) holding the values of the n attributes {A1, A2, …, An} of the data source. Each tuple is assumed to belong to one predefined class, identified by its class label.

Figure 5.1. The learning step

Figure 5.2. The classification step

The first step of classification can be viewed as learning a mapping or function y = f(X) that predicts the class label y of a tuple X. In other words, we need to learn (build) a mapping or function that separates the data classes.
In the second step, the model obtained is used for classification. To keep the evaluation objective, the model should be applied to a test set rather than to the original training set. The accuracy of a classifier on a test set is the percentage of test tuples that are labelled correctly, determined by comparing the label predicted for each test tuple with its known label. If the accuracy of the model is acceptable, it can be used on tuples whose class label is unknown.
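The accuracy measure can be made concrete with a small sketch. Everything in it is hypothetical: the threshold classifier, the attribute name income and the four held-out test tuples are invented for the illustration.

```python
def accuracy(model, test_set):
    """Share of test tuples whose predicted label equals the known label."""
    correct = sum(1 for x, label in test_set if model(x) == label)
    return correct / len(test_set)


# Hypothetical classifier: a fixed income threshold decides the label.
def classify(x):
    return "safe" if x["income"] >= 10 else "risky"


# Hypothetical held-out test tuples: (attributes, known label).
test_set = [({"income": 15}, "safe"), ({"income": 4}, "risky"),
            ({"income": 12}, "risky"), ({"income": 8}, "risky")]
print(accuracy(classify, test_set))  # 3 of 4 correct -> 0.75
```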
5.1.2. Prediction

Prediction is a two-step process, much like classification. For prediction, however, we drop the notion of a class label, because the values to be predicted are continuous (ordered) rather than categorical. For example, instead of classifying a loan as safe or risky, we predict the amount that can be lent so that the loan is still safe.
Prediction can also be viewed as a function y = f(X), where X is the input and the output y is a continuous or ordered value. Prediction and classification differ in some respects in the methods used to build the model. As in classification, the training set used to build the prediction model is not used to evaluate its accuracy. The accuracy of a prediction model is evaluated from the difference between the predicted value and the actual known value of y for each test tuple X.
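The text does not name a specific error measure. As an assumption for illustration, the sketch below uses two common ones, the mean absolute error and the root mean squared error, on made-up loan amounts.

```python
import math


def mean_absolute_error(actual, predicted):
    """Average absolute difference between known and predicted values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def root_mean_squared_error(actual, predicted):
    """Square root of the average squared difference."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


actual = [100.0, 150.0, 200.0]      # hypothetical known values of the test tuples
predicted = [110.0, 140.0, 230.0]   # hypothetical model outputs
mae = mean_absolute_error(actual, predicted)
rmse = root_mean_squared_error(actual, predicted)
print(round(mae, 2), round(rmse, 2))
```

The squared variant penalizes the single large miss (30) more heavily than the two small ones, which is why it comes out larger than the mean absolute error here.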
5.2. Classification using decision trees
5.2.1. Decision trees
In the late 1970s and early 1980s, J. Ross Quinlan developed an algorithm for generating decision trees. It is a greedy approach in which the decision tree is built top-down, recursively, in a divide-and-conquer manner. Most decision tree algorithms follow the top-down approach presented below, which starts from a set of training tuples and their class labels. The training set is recursively partitioned into smaller subsets as the tree is built.
Generate_decision_tree: algorithm that generates a decision tree from the training tuples of the data source D
Input:
- The data source D, containing the training tuples and their class labels
- Attribute_list: the list of attributes
- Attribute_selection_method: a procedure that determines the splitting criterion which best partitions the tuples into classes. The criterion consists of a splitting attribute splitting_attribute and, possibly, a split point split_point or a splitting subset splitting_subset.
Output: a decision tree
Algorithm:
1. Create a node N
2. If the tuples in D all have the same class label C then
3.     Return N as a leaf node labelled with class C
4. If attribute_list is empty then
5.     Return N as a leaf node labelled with the majority class in D
6. Call Attribute_selection_method(D, attribute_list) to find the best splitting criterion splitting_criterion
7. Label node N with that criterion
8. If splitting_attribute is discrete-valued and multiway splits are allowed then
9.     Attribute_list = attribute_list − splitting_attribute // remove the splitting attribute
10. Foreach outcome j of splitting_criterion
    // partition the tuples and grow a subtree for each partition
11.     Let Dj be the set of tuples in D that satisfy outcome j
12.     If Dj is empty then
13.         Attach a leaf labelled with the majority class in D to node N
14.     Else attach the node returned by Generate_decision_tree(Dj, attribute_list) to node N
    Endfor
15. Return N
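A compact sketch of the recursive procedure is given below. It makes several simplifying assumptions: all attributes are discrete, the selection method is fixed to "smallest expected information after the split", and branches are created only for attribute values that actually occur in the data, so the empty-partition case never arises. The loan data is hypothetical.

```python
import math
from collections import Counter


def info(rows):
    """Entropy of the class labels; each row is (attribute dict, label)."""
    counts = Counter(label for _, label in rows)
    return -sum(c / len(rows) * math.log2(c / len(rows)) for c in counts.values())


def partition(rows, attr):
    """Group the rows by their value of `attr`."""
    parts = {}
    for row in rows:
        parts.setdefault(row[0][attr], []).append(row)
    return parts


def generate_decision_tree(rows, attributes):
    labels = Counter(label for _, label in rows)
    majority = labels.most_common(1)[0][0]
    if len(labels) == 1 or not attributes:
        return majority  # leaf: pure node, or no attribute left to split on

    def info_after(attr):
        # expected information still needed after splitting on attr
        return sum(len(p) / len(rows) * info(p)
                   for p in partition(rows, attr).values())

    best = min(attributes, key=info_after)
    remaining = [a for a in attributes if a != best]
    return {best: {value: generate_decision_tree(part, remaining)
                   for value, part in partition(rows, best).items()}}


# Hypothetical training tuples for the loan example.
rows = [({"income": "high", "debt": "low"}, "safe"),
        ({"income": "high", "debt": "high"}, "safe"),
        ({"income": "high", "debt": "high"}, "safe"),
        ({"income": "low", "debt": "low"}, "safe"),
        ({"income": "low", "debt": "high"}, "risky"),
        ({"income": "low", "debt": "high"}, "risky")]
tree = generate_decision_tree(rows, ["income", "debt"])
print(tree)
```

The tree is returned as nested dictionaries: an inner node maps an attribute name to its branches, and a leaf is simply a class label.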
5.2.2. Attribute selection
Attribute selection means choosing the splitting criterion that best separates the data source D into distinct classes. If we split D into smaller partitions according to the outcomes of the splitting criterion, each partition should ideally be pure (that is, all tuples in a partition belong to the same class). The criterion determines how the tuples at a given node are split. The tree node created for partition D is labelled with the splitting criterion, and its branches are formed from the outcomes of the split.
Let D be a data partition containing labelled training tuples. The class label has m distinct values defining m classes Ci (i = 1, …, m). Let Ci,D be the set of tuples of class Ci in D.
The information needed to classify a tuple in D is given by
$$\mathrm{Info}(D) = -\sum_{i=1}^{m} p_i \log_2 p_i$$
where pi is the probability that a tuple in D belongs to class Ci, estimated by |Ci,D| / |D|.
Now suppose we partition the tuples in D on an attribute A that has v distinct values {a1, …, av}. Attribute A can be used to split D into v partitions or subsets {D1, D2, …, Dv}, where Dj contains the tuples in D that have outcome aj. The partitions correspond to the branches of node N.
The information still needed to reach an exact classification after this partitioning is given by
$$\mathrm{Info}_A(D) = \sum_{j=1}^{v} \frac{|D_j|}{|D|}\,\mathrm{Info}(D_j)$$
where |Dj| / |D| is the weight of the j-th partition. InfoA(D) is the information needed to classify a tuple of D based on the partitioning by A. The smaller this value, the purer the corresponding partitions.
The information gain is given by
$$\mathrm{Gain}(A) = \mathrm{Info}(D) - \mathrm{Info}_A(D)$$
Gain(A) tells us how much would be gained by branching on A. The attribute A with the largest information gain is chosen as the splitting attribute of node N.
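The information gain of an attribute can be computed in a few lines. The sketch below assumes the standard entropy-based definitions, Info(D) as the entropy of the class labels and Info_A(D) as the weighted entropy of the partitions; the four training tuples are hypothetical.

```python
import math
from collections import Counter


def info(labels):
    """Info(D) = -sum p_i log2 p_i over the class labels in D."""
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())


def gain(rows, attr):
    """Gain(A) = Info(D) - Info_A(D); each row is (attribute dict, label)."""
    parts = {}
    for x, label in rows:
        parts.setdefault(x[attr], []).append(label)
    info_a = sum(len(p) / len(rows) * info(p) for p in parts.values())
    return info([label for _, label in rows]) - info_a


# Hypothetical training tuples: income separates the classes, debt does not.
rows = [({"income": "high", "debt": "low"}, "safe"),
        ({"income": "high", "debt": "high"}, "safe"),
        ({"income": "low", "debt": "low"}, "risky"),
        ({"income": "low", "debt": "high"}, "risky")]
print(gain(rows, "income"))  # 1.0: both partitions are pure
print(gain(rows, "debt"))    # 0.0: both partitions are evenly mixed
```

Here income would be selected as the splitting attribute, since it has the larger gain.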

SAMPLE EXAM PAPERS

Vietnam Maritime University
Faculty of Information Technology
DEPARTMENT OF INFORMATION SYSTEMS
-----***-----
FINAL EXAMINATION
Course: DATA MINING
Academic year: x
Exam no.:
Approved by:
Time: 60 minutes

Question 1: (2 points)
Present the concept of data mining.
Question 2: (4 points)
The following summary table shows aggregated sales data of a supermarket, where hot-dogs denotes the number of transactions that contain hot-dogs, ¬hot-dogs denotes the number of transactions that do not contain hot-dogs, and likewise for hamburgers.

               hot-dogs   ¬hot-dogs   Total
hamburgers     2,000      500         2,500
¬hamburgers    1,000      1,500       2,500
Total          3,000      2,000       5,000

a. Suppose the association rule "…" is mined. Let min_sup = 25% and min_conf = 50%. Is this rule a strong association rule? Explain.
b. Based on the given data, is buying hot-dogs independent of buying hamburgers? If not, what kind of correlation exists between the two items?
Question 3: (2 points)
Explain the significance of data preprocessing in data mining.
Question 4: (2 points)
Given the following data set of ages for analysis, sorted in ascending order: {13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30, 33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70}
a. Use smoothing by bin boundaries with a bin width of 5. Illustrate the steps.
b. Use min-max normalization to transform the age value 35 onto the range [0.0, 1.0].
----------------------------***END***----------------------------
Note: Do not alter or erase the exam paper; return it after the exam.

Vietnam Maritime University
Faculty of Information Technology
DEPARTMENT OF INFORMATION SYSTEMS
-----***-----
FINAL EXAMINATION
Course: DATA MINING
Academic year: x
Exam no.:
Approved by:
Time: 60 minutes

Question 1: (2 points)
Present the Apriori algorithm.
Question 2: (4 points)
Given a database with 5 transactions, let min_sup = 60% and min_conf = 80%.

TID     Items
T100    {M, O, N, K, E, Y}
T200    {D, O, N, K, E, Y}
T300    {M, A, K, E}
T400    {M, U, C, K, Y}
T500    {C, O, O, K, I, E}

a. Find all frequent itemsets using the Apriori algorithm.
b. List all strong association rules (with support s and confidence c) that match the following metarule, where X is a variable representing customers and item_i are variables representing items (e.g. A, B, …).

Question 3: (2 points)
Present the differences between a data warehouse and a conventional database.
Question 4: (2 points)
Given the following data set of ages for analysis, sorted in ascending order: {13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30, 33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70}
a. Use smoothing by bin medians with a bin width of 3. Illustrate the steps.
b. Use decimal-scaling normalization to transform the age value 35.
----------------------------***END***----------------------------
Note: Do not alter or erase the exam paper; return it after the exam.

Vietnam Maritime University
Faculty of Information Technology
DEPARTMENT OF INFORMATION SYSTEMS
-----***-----
FINAL EXAMINATION
Course: DATA MINING
Academic year: x
Exam no.:
Approved by:
Time: 60 minutes

Question 1: (2 points)
Give an example of a data source stored in tabular (structured), semi-structured, or unstructured form.
Question 2: (4 points)
Given a database with 5 transactions, let min_sup = 60% and min_conf = 80%.

TID     Items
T100    {M, O, N, K, E, Y}
T200    {D, O, N, K, E, Y}
T300    {M, A, K, E}
T400    {M, U, C, K, Y}
T500    {C, O, O, K, I, E}

a. Find all frequent itemsets using the Apriori algorithm.
b. List all strong association rules (with support s and confidence c) that match the following metarule, where X is a variable representing customers and item_i are variables representing items (e.g. A, B, …).

Question 3: (2 points)
What are the steps of the data mining process?

Question 4: (2 points)
Smooth the following data set using rounding:
Y = {1.17, 2.59, 3.38, 4.23, 2.67, 1.73, 2.53, 3.28, 3.44}
Then represent the resulting set with the following precisions:
a. 0.1
b. 1.

----------------------------***END***----------------------------
Note: Do not alter or erase the exam paper; return it after the exam.

Vietnam Maritime University
Faculty of Information Technology
DEPARTMENT OF INFORMATION SYSTEMS
-----***-----
FINAL EXAMINATION
Course: DATA MINING
Academic year: x
Exam no.:
Approved by:
Time: 60 minutes

Question 1: (2 points)
What are the main tasks of the data mining process?
Question 2: (4 points)
The following summary table shows aggregated sales data of a supermarket, where hot-dogs denotes the number of transactions that contain hot-dogs, ¬hot-dogs denotes the number of transactions that do not contain hot-dogs, and likewise for hamburgers.

               hot-dogs   ¬hot-dogs   Total
hamburgers     2,000      500         2,500
¬hamburgers    1,000      1,500       2,500
Total          3,000      2,000       5,000

a. Suppose the association rule "…" is mined. Let min_sup = 30% and min_conf = 70%. Is this rule a strong association rule? Explain.
b. Based on the given data, is buying hot-dogs independent of buying hamburgers? If not, what is the relationship between the two items?
Question 3: (2 points)
Present the differences between classification and clustering.
Question 4: (2 points)
Given the following set of samples with missing values (a missing value is shown as _):
X1 = {0, 1, 1, 2}
X2 = {2, 1, _, 1}
X3 = {1, _, _, 0}
X4 = {_, 2, 1, _}
If the domain of every attribute is [0, 1, 2], determine the missing values, given that each of them can only be one of the values of the domain. Explain what is gained and what is lost when the dimensionality of a large data warehouse is reduced.

----------------------------***END***----------------------------
Note: Do not alter or erase the exam paper; return it after the exam.
Vietnam Maritime University
Faculty of Information Technology
DEPARTMENT OF INFORMATION SYSTEMS
-----***-----
FINAL EXAMINATION
Course: DATA MINING
Academic year: x
Exam no.:
Approved by:
Time: 60 minutes

Question 1: (2 points)
What are the basic elements of data mining techniques?
Question 2: (4 points)
Given a database with 5 transactions, let min_sup = 60% and min_conf = 80%.

TID     Items
T100    {M, O, N, K, E, Y}
T200    {D, O, N, K, E, Y}
T300    {M, A, K, E}
T400    {M, U, C, K, Y}
T500    {C, O, O, K, I, E}

a. Find all frequent itemsets using the Apriori algorithm.
b. List all strong association rules (with support s and confidence c) that match the following metarule, where X is a variable representing customers and item_i are variables representing items (e.g. A, B, …).

Question 3: (2 points)
Present the concept of prediction; give an example and analyse it.
Question 4: (2 points)
Suppose the items are structured such that A = {A1, A2, A3}, B = {B1, B2}, C = {C1, C2, C3}, D = {D1, D2} and E = {E1, E2}.
a. Find the itemsets defined at the concept level.
b. Find the strong association rules for the itemsets of the previous question.
----------------------------***END***----------------------------
Note: Do not alter or erase the exam paper; return it after the exam.
