-----***-----
LECTURE NOTES
DATA MINING
COURSE TITLE: DATA MINING
COURSE CODE: 17409
LEVEL OF TRAINING: FULL-TIME UNDERGRADUATE
FOR STUDENTS MAJORING IN: INFORMATION TECHNOLOGY
HAI PHONG - 2011
CONTENTS
Chapter 1. Overview of data warehouses
1.1. Strategies for processing and exploiting information
1.2. Definition of a data warehouse
1.3. Purpose of a data warehouse
1.4. Characteristics of data in a data warehouse
1.5. Distinguishing data warehouses from operational databases
Chapter 2. Overview of data mining
2.1. What is data mining?
2.2. Classification of data mining systems
2.3. Main tasks
2.4. Integrating a data mining system with a database or data warehouse
2.5. Data mining methods
2.6. Advantages of data mining over basic methods
2.7. Choosing a method
2.8. Challenges in applying and researching data mining techniques
Chapter 3. Data preprocessing
3.1. Purpose
3.2. Data cleaning
3.3. Data integration and transformation
Chapter 4. Mining based on frequent patterns and association rules
4.1. The concept of association rules
Course title: Data Mining
Course type: 2
Department in charge of teaching: Information Systems
Faculty in charge: Information Technology
Course code: 17409
Total credits: 2
Total periods: 45
Theory: 30
Practice/Seminar: 15
Self-study: 0
Major assignment: none
Course project: none
Courses to be taken beforehand: Databases; Advanced Databases; Database Management Systems
Prerequisite courses: none required.
Parallel courses: none required.
Course objectives:
Provide basic knowledge of large data warehouses and of data mining techniques.
Main content:
Overview of data warehouses and data mining; methods for organizing the storage of large data, and data mining techniques; data analysis using clustering methods; applications of data mining techniques.
Detailed content:
CHAPTER AND SECTION TITLES
Chapter 1. Overview of data warehouses
1.1. Strategies for processing and exploiting information
1.2. Definition of a data warehouse
1.3. Purpose of a data warehouse
1.4. Characteristics of data in a data warehouse
1.5. Distinguishing data warehouses from operational databases
Chapter 2. Overview of data mining
2.1. What is data mining?
2.2. Classification of data mining systems
2.3. Main tasks
2.4. Integrating a data mining system with a database or data warehouse
2.5. Data mining methods
2.6. Advantages of data mining over basic methods
2.7. Choosing a method
2.8. Challenges in applying and researching data mining techniques
Chapter 3. Data preprocessing
3.1. Purpose
3.2. Data cleaning
3.3. Data integration and transformation
Chapter 4. Mining based on frequent patterns and association rules
4.1. The concept of association rules
4.2. The Apriori algorithm
4.3. The FP-Growth algorithm
4.4. Comparison and evaluation
Chapter 5. Classification and prediction
5.1. Basic concepts
5.2. Classification based on decision trees
Usually, the interdependencies between programs are unclear or cannot be determined.
The complexity of conversion work, and of the maintenance process as a whole, makes the source code of these programs extremely complicated.
[...] is collected and processed to serve the specific business operations of an organization. It is therefore usually called operational data, and the processing of this data is called online transaction processing (OLTP - On-Line Transaction Processing).
The flow of data in an organization (agency, enterprise, company, etc.) can be outlined as follows:
[Figure 1.1. Data flow in an organization. Labels in the figure: legacy (existing) systems; operational data; data warehouse; local data warehouses; personal data warehouses; metadata.]
Personal data lies outside the scope managed by the data warehouse management system. It contains information extracted from operational data systems, from the data warehouse, and from local data warehouses on related subjects, by means of merging, aggregation, or some other form of processing.
1.3. Purpose of a data warehouse
The main goal of a data warehouse is to meet the following basic criteria:
- Help the organization's staff do their work well and efficiently, for example by making sound decisions quickly, selling more goods, raising productivity, and earning higher profits.
- Improve data quality by cleaning and refining data along particular subject areas.
- Consolidate and link data.
- Provide information that is integrated, summarized, or linked, and organized by subject.
The results of exploiting a data warehouse are used in decision support systems (Decision Support System - DSS), in operational information systems, or to support ad hoc queries.
The basic goal of every organization is profit, which can be described as follows:
[Figure: profit diagram. Labels in the figure: profit; revenue; sales; pricing; cost; fixed cost; variable cost.]
- Integration
- Subject orientation
- Nonvolatility
- Summarized data
[...] a sales system and a marketing system may share one form of customer information. Financial matters, however, need a different view of the customer. That view combines different portions of financial and marketing data.
Integration means that the data gathered in the data warehouse is collected from many sources and merged into a unified whole.
1.4.2. Subject orientation
Data in a data warehouse is organized by subject, so that the organization can easily identify the information it needs for each of its activities. For example, an older financial management system may organize its data around functions: lending, credit management, budget management, and so on. A financial data warehouse, in contrast, organizes data by subject around entities: customers, products, enterprises, and so on. The difference between these two approaches leads to a difference in the content of the data stored in the system.
* A data warehouse does not store detailed data. It only needs to store summarized data that mainly serves analysis in support of decision making.
* Databases in operational applications, on the other hand, need to process detailed data that directly serves the processing requirements of the current application domain. Operational application systems (Operational Application System - OAS) therefore need to store detailed data. The relationships among the data in such a system are also different, and the data must be accurate, up to date, and so on.
* Data must be tied to time and is historical in nature. A data warehouse contains a large volume of historical data. The data is stored as a series of snapshots. Each record reflects the values of the data at a particular moment and presents the view of one subject over a period. This makes it possible to reconstruct history and to compare different periods fairly accurately. The time element acts as part of the key: it guarantees the uniqueness of each item and gives the data its temporal characteristic. For example, a business management system needs to store the unit price of each item by day (this is the time element). Concretely, each item, for a given unit of measure and at a given moment, may have a different unit price (the recent fluctuation in petroleum prices is a typical illustration).
Data in an OAS must be accurate at the moment of access, whereas data in a DW only needs to be valid over some period, typically 5 to 10 years or longer. Data in an operational database usually becomes historical after a certain time and is then moved into the data warehouse. It is this data, organized by the subjects to be stored, that the warehouse keeps.
Comparing an operational database with data snapshots, we see:

Operational database                      Data snapshot
Short time horizon (30-60 days)           Long time horizon (5-10 years)
May or may not have a time element        Always has a time element
Data can be updated                       Once captured, data cannot be updated

Table 1.1. Time characteristics of data
1.4.3. Nonvolatile data (nonvolatility)
Data in a DW is read-only: it can be inspected but cannot be changed by end users. Only two basic operations are allowed, loading data into the warehouse and accessing the data held in the DW. The data is therefore nonvolatile.
Information is loaded into the DW once the data in the operational system is considered too old. Nonvolatility means that data is stored in the warehouse for the long term. Although new data is added, the old data in the warehouse is neither deleted nor changed. This makes it possible to provide information covering a long period and to supply enough data for business models of analysis and forecasting, from which sound decisions can be made that are consistent with the natural laws of evolution.
1.4.4. Summarized data
Purely operational data is not stored in the DW. Summarized data is integrated over many different periods according to the subjects described above.
1.5. Distinguishing data warehouses from operational databases
[...] traditional:
Conventional database systems do not have to manage large volumes of information; they manage small and medium volumes. A DW must manage a large volume of information stored on many different storage and processing media. This is a distinctive feature of a DW.
A DW integrates and links information from many different sources, held on many kinds of storage and processing media, in order to serve online operational processing applications.
Generally speaking, a DW distributes data to many recipients (customers) and handles information in many forms, such as databases, data queries (SQL queries), reports, and so on.
EXERCISES:
THEORY:
1. What is a data warehouse?
2. Give examples of systems or domains in which large data warehouses could be built.
3. Can a table with 50,000 records be called a large data warehouse? Justify your answer.
4. Give an example of a data source stored in tabular form, in semi-structured form, or in unstructured form.
5. Distinguish a data warehouse from an operational database.
PRACTICE:
1. Install the Microsoft Visual Studio 2005 suite.
2. Install and explore the Data analysis service.
3. Examine and explore the NorthWind database.
Data mining
The term data mining describes the process of discovering knowledge in databases. The process extracts hidden knowledge from data to support forecasting in business, in production activities, and elsewhere. Data mining reduces cost and time compared with earlier traditional methods (for example, statistical methods).
Below are some descriptive definitions of data mining given by various authors.
Ferruzza's definition: "Data mining is the set of methods used in the knowledge discovery process to identify previously unknown relationships and patterns within data."
Parsaye's definition: "Data mining is a decision support process in which we search for unknown and unexpected patterns of information in large databases."
Fayyad's definition: "Knowledge discovery is the nontrivial process of identifying valid, novel, useful, potentially valuable, and understandable patterns in data."
2.2.
[...] artificial intelligence, databases, algorithms, parallel and high-speed computing, knowledge acquisition for expert systems, data visualization, and so on. In particular, knowledge discovery and data mining are very close to statistics, using statistical methods to model data and to detect patterns and rules. Data warehousing and online analytical tools (OLAP - On-Line Analytical Processing) are also closely related to knowledge discovery and data mining.
Data mining has many practical applications, for example:
Insurance, finance, and the stock market: analyzing financial conditions and forecasting the prices of stocks on the stock market; capital and price portfolios, interest rates, credit card data, fraud detection, and so on.
Year  World population    Year  World population    Year  World population
      (millions)                (millions)                (millions)
1950  2555                1970  3708                1990  5275
1951  2593                1971  3785                1991  5359
1952  2635                1972  3862                1992  5443
1953  2680                1973  3938                1993  5524
1954  2728                1974  4014                1994  5604
1955  2779                1975  4087                1995  5685
1956  2832                1976  4159                1996  5764
1957  2888                1977  4231                1997  5844
1958  2945                1978  4303                1998  5923
1959  2997                1979  4378                1999  6001
1960  3039                1980  4454                2000  6078
1961  3080                1981  4530                2001  6153
1962  3136                1982  4610                2002  6228
1963  3206                1983  4690
1964  3277                1984  4769
1965  3346                1985  4850
1966  3416                1986  4932
1967  3486                1987  5017
1968  3558                1988  5102
1969  3632                1989  5188

Source: U.S. Bureau of the Census, International Data Base. Updated 10/10/2002.
Table 2.1. World population at mid-year
Science: astronomical observation, gene data, biological data, searching and comparing genomes and genetic information, the links between genes and certain hereditary diseases, and so on.
Telecommunication networks: analysis of telephone calls and of systems that monitor faults, incidents, quality of service, and so on.
2.3.
[...] can be brought into use in different fields. Because the results may be predictions or descriptions, they can be fed into decision support systems in order to automate this process.
In summary: KDD is a process of extracting knowledge from a data warehouse, in which data mining is the most important stage.
2.4.
Regression
[...] infrared measurements, and so on. Closely related to clustering is the task of estimating, from the data, the multivariate probability density function of the variables/fields in the database.
2.4.4.
Summarization
This task concerns methods for finding a description of a subset of the data [1, 2, 5]. Summarization techniques are commonly applied in exploratory data analysis and in automated reporting. The main task is to produce characteristic descriptions of a class. Such a description is a kind of summary that captures the properties shared by all or most items of a class. Characteristic descriptions take the form of rules such as: "If an item belongs to the class named in the premise, then the item has all the properties stated in the conclusion." Note that rules of this form differ from classification rules. A characteristic rule for a class is produced only from items that already belong to that class.
2.4.5.
This task is the search for a model that describes the dependencies between variables or attributes at two levels. The structural level of the model (often in the form of a graph) specifies which variables depend locally on which others. The quantitative level of the model specifies the strength of the dependencies. These dependencies are usually expressed as "if-then" rules (if the premise is true, the conclusion is true). In principle, both the premise and the conclusion can be logical combinations of attribute values. In practice, the premise is usually a group of attribute values and the conclusion is a single attribute. In addition, a system can discover classification rules, in which every rule must have the same user-specified attribute in its conclusion.
Dependencies can also be represented as a Bayesian belief network. This is a directed acyclic graph in which the nodes represent attributes and the weights represent the dependency links between those nodes.
2.4.6.
2.5.
[...] Data mining is a pattern discovery process, in which a data mining method searches for patterns of interest in a specified form. Some methods that can be mentioned here are: using query tools, building decision trees, distance-based methods (K-nearest neighbors), averages, discovering association rules, and so on. These methods can be adapted and integrated into hybrid systems for data mining, drawing on many years of statistical research. With the very large volumes of data held in a data warehouse, however, these methods also face challenges of efficiency and scale.
2.5.1.
A data mining algorithm has three main components: model representation, model evaluation, and search method.
Model representation: The model is represented in some language L used to describe the patterns that can be discovered. If the model description is clear, machine learning will produce patterns that model the data accurately. If the representation is too limited, however, the predictive ability of machine learning is restricted, and no patterns may exist that yield an accurate model of the data. For example, a decision tree representation that splits nodes on a single field divides the input space by hyperplanes parallel to the attribute axes. Such a decision tree method cannot discover data of the form "X = Y", however large the training set. It is therefore important that the data analyst fully understands the representational assumptions. It is equally important that the algorithm designer states which representational assumptions are made by which algorithm. The greater the representational power of the model, the greater the danger of overfitting and of reduced ability to predict unseen data. In addition, the search becomes more complex and the model harder to interpret.
The initial model is determined by relating the output (dependent) variable to the independent variables on which it depends. Next, the parameters on which the problem should focus must be found. Model search yields a model that fits the parameters determined from the data (in some other cases the model and the parameters are changed to fit the data). In some cases the data set is divided into a training set and a test set. The training set is used to fit the model parameters to the data. The model is then evaluated by feeding the test data into it, and the parameters are adjusted if necessary. The chosen model can be a statistical method such as SASS, a machine learning algorithm (for example, decision trees and other supervised learning methods), neural networks, case-based reasoning, or classification techniques.
Model evaluation: This is the assessment and estimation of detailed models against criteria of the knowledge discovery process: is the prediction accurate, and is the logical basis satisfied? The estimate should be cross-validated, with characteristics including predictive accuracy, novelty, usefulness, and understandability of the resulting models. Both logical and standard statistical methods can be used for model evaluation.
Search method: This consists of two components, parameter search and model search. In parameter search, the algorithm must find the parameters that optimize the model evaluation criteria, given the observed data and a fixed model description. No search is needed for some fairly simple problems, where optimal parameter estimates can be obtained in simpler ways. For more general models such solutions are not available, and "greedy" iterative algorithms are commonly used, for example gradient descent in the backpropagation algorithm for neural networks. Model search runs as a loop over the parameter search method: the model description is varied to create a family of models. For each model description, the parameter search method is applied to assess the quality of the model. Model search methods usually use heuristic search techniques, because the size of the space of models often rules out exhaustive search and closed-form solutions are not easily obtained.
2.5.2.
A database is a store of information, but more important information can also be inferred from that store. There are two main techniques for doing so: deduction and induction.
Deduction: This method derives information that is a logical consequence of the information in the database. An example is the join operator applied to relational tables: the first table holds information about employees and departments, and the second holds information about departments and department heads. From these, the relationship between employees and department heads can be inferred. Deduction relies on exact facts to derive new knowledge from existing information. The patterns extracted with this method are usually deduction rules.
Induction: The inductive method infers information that is generated from the database. That is, it searches, creates patterns, and generates knowledge by itself rather than starting from knowledge known in advance. The information this method yields is high-level information or knowledge describing the objects in the database. The method involves searching for patterns in the database. In data mining, induction is used in decision trees and in rule generation.
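The join example given for the deductive method can be sketched in a few lines of Python. This is only an illustration: the table contents and the function name `join_employee_manager` are hypothetical and do not come from the lecture notes.

```python
# Hypothetical sample tables (names and values are illustrative only).
employee_dept = [("An", "Sales"), ("Binh", "IT"), ("Chi", "Sales")]
dept_manager = [("Sales", "Dung"), ("IT", "Hoa")]


def join_employee_manager(employee_dept, dept_manager):
    """Natural join on the department column, projected to (employee, manager)."""
    manager_of = dict(dept_manager)  # department -> department head
    return [
        (employee, manager_of[dept])
        for employee, dept in employee_dept
        if dept in manager_of  # employees of unknown departments are dropped
    ]


pairs = join_employee_manager(employee_dept, dept_manager)
```

Every pair in the result is a logical consequence of the two input tables, which is what distinguishes deduction from induction: no new pattern is guessed, only derived.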
2.5.3.
[...] records that lie close together in the space are considered to belong to each other's neighborhood. In science and engineering this concept is known as K-nearest neighbors, where K is the number of neighbors used. The method is very effective and yet simple. The idea of the K-nearest-neighbor learning algorithm is "do as your nearest neighbors did".
Example: to predict the behavior of a given individual, its K nearest neighbors are considered, and the average of their behavior gives the prediction for that individual.
K-nearest neighbors is a simple search method. It has some drawbacks, however, that limit its range of application. The first is that its computational complexity is quadratic in the number of records in the data set.
The main problem concerns the attributes of a record. A record with many independent attributes corresponds to a point in a high-dimensional search space. In high-dimensional spaces, almost any two points are at nearly the same distance from each other. The K-nearest-neighbor technique then gives no useful information, because all pairs of points are neighbors. Finally, the K-nearest-neighbor method provides no theory for understanding the structure of the data. This limitation can be overcome by the decision tree technique.
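The K-nearest-neighbor prediction described above (average the behavior of the K closest records) can be sketched as follows. Euclidean distance and the function name `knn_predict` are assumptions made for illustration; the notes do not fix a distance measure.

```python
import math


def knn_predict(train, query, k):
    """Predict a numeric target as the mean target of the k nearest records.

    train: list of (features, target) pairs, where features are equal-length
    tuples of numbers. Euclidean distance is an assumption of this sketch.
    """
    if not 1 <= k <= len(train):
        raise ValueError("k must be between 1 and the number of training records")
    # Sort all records by distance to the query point, then keep the k closest.
    by_distance = sorted(train, key=lambda rec: math.dist(rec[0], query))
    nearest = by_distance[:k]
    return sum(target for _, target in nearest) / k


train = [((0, 0), 1.0), ((1, 0), 3.0), ((10, 10), 100.0)]
prediction = knn_predict(train, (0.4, 0), k=2)
```

Each query scans every training record, so predicting for all n records of a data set costs on the order of n squared distance computations, which matches the quadratic complexity noted above.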
2.5.4.
[...] pruning the decision tree therefore becomes important. Unstable leaf nodes in the decision tree are pruned away.
Pre-pruning means stopping the growth of the decision tree when splitting the data is no longer meaningful.
2.5.5.
This method aims to discover association rules between the data items in a database. The output of the data mining algorithm is the set of association rules found. A simple example of an association rule: an association between two items A and B means that the presence of A in a record implies the presence of B in the same record: A => B.
Let $R = \{A_1, \dots, A_p\}$ be a schema of attributes with domain $\{0, 1\}$, and let $r$ be a relation over $R$. An association rule over $r$ is written $X \Rightarrow B$ with $X \subseteq R$ and $B \in R \setminus X$. Intuitively, the rule means: if a record of table $r$ has value 1 for every attribute in $X$, then attribute $B$ also has value 1 in the same record. For example, in a database of the items sold in a supermarket, where rows correspond to sales days and columns to items, a value of 1 in cell (20/10, bread) indicates that bread was sold on that day, and the rule says it is accompanied by a value of 1 in cell (20/10, butter).
Given $W \subseteq R$, let $s(W, r)$ be the frequency of $W$ in $r$, computed as the fraction of rows of $r$ that have value 1 in every column of $W$. The frequency of the rule $X \Rightarrow B$ in $r$ is defined as $s(X \cup \{B\}, r)$, also called the support of the rule, and the confidence of the rule is $s(X \cup \{B\}, r) / s(X, r)$. Here $X$ may contain several attributes and $B$ is not fixed. This avoids generating unwanted rules before the search begins. It also shows that the size of the search space grows exponentially with the number of input attributes. Care is therefore needed when designing the data for association rule search.
The task of association rule discovery is to find all rules $X \Rightarrow B$ whose frequency is not below a given threshold and whose confidence is not below a given threshold. From a single database, thousands or even hundreds of thousands of association rules can be found.
A subset $X \subseteq R$ is called frequent in $r$ if $s(X, r)$ is not below the frequency threshold. If all frequent sets in $r$ are known, finding the rules is easy. Association rule algorithms therefore first find all frequent sets and then build the association rules step by step, combining attribute sets according to their frequency.
Association rules are a simple formalism. They are well suited to producing results from binary data. The basic limitation of the method is that the relation must be sparse, in the sense that no frequent set contains more than about 15 attributes. An association rule algorithm produces at least as many rules as there are frequent sets, and if one frequent set has size $K$ then there must be at least $2^K$ frequent sets. Information about the frequent sets is used to estimate the confidence of the association rules.
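The definitions of frequency (support) and confidence above can be sketched directly in Python. The toy relation `ROWS`, the item names, and the brute-force enumeration in `rules` are illustrative assumptions; practical algorithms such as Apriori prune the search instead of enumerating every candidate.

```python
from itertools import combinations

# Hypothetical toy relation r: one row per sales day, one column per item (1 = sold).
ROWS = [
    {"bread": 1, "butter": 1, "milk": 0},
    {"bread": 1, "butter": 1, "milk": 1},
    {"bread": 1, "butter": 0, "milk": 1},
    {"bread": 0, "butter": 0, "milk": 1},
]


def support(itemset, rows):
    """s(W, r): fraction of rows with value 1 in every column of W."""
    if not rows:
        return 0.0
    hits = sum(1 for row in rows if all(row[a] == 1 for a in itemset))
    return hits / len(rows)


def confidence(x, b, rows):
    """s(X u {B}, r) / s(X, r); defined as 0.0 when X never occurs."""
    s_x = support(x, rows)
    return support(set(x) | {b}, rows) / s_x if s_x > 0 else 0.0


def rules(rows, min_support, min_confidence):
    """Brute-force list of rules X => B meeting both thresholds."""
    attrs = sorted(rows[0])
    found = []
    for size in range(1, len(attrs)):
        for x in combinations(attrs, size):
            for b in attrs:
                if b in x:
                    continue
                s = support(set(x) | {b}, rows)
                c = confidence(x, b, rows)
                if s >= min_support and c >= min_confidence:
                    found.append((x, b, s, c))
    return found


strong_rules = rules(ROWS, min_support=0.5, min_confidence=1.0)
```

On this toy data the only rule with support at least 0.5 and confidence 1.0 is butter => bread. The nested loops over all attribute subsets make the exponential growth of the search space, mentioned above, directly visible.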
2.6.
Machine Learning
Expert systems try to capture the knowledge relevant to a particular problem. Knowledge acquisition techniques help elicit that knowledge from human experts. Each such method is a way of inferring rules from the examples and solutions that experts give for the problem. This approach differs from data mining in that the experts' examples are usually of much higher quality than the data in a database, and they usually cover only the important cases. In addition, the experts confirm the validity and usefulness of the patterns discovered. As with database management tools, these methods require human participation in knowledge discovery.
2.6.3.
Automated data mining algorithms are still at an early stage of development. No criterion has yet been established for deciding which method to use and in which cases it is effective.
Most data mining techniques are new to the business world. There are also many techniques, and each is used for many different problems. So right after the question "What is data mining?" comes the question "Which technique should be used?". The answer is of course not simple. Every method has its strengths and weaknesses, but most of the weaknesses can be overcome. The question is then how to apply a technique in a way that is simple and easy to use, so that its inherent complexity is not felt.
Comparing techniques requires a large set of rules and sound experimental methods. These rules are often not applied when the newest techniques are evaluated. As a result, claimed improvements in accuracy cannot always be realized.
Many companies have released products that combine several data mining techniques, in the hope that more techniques will be better. In practice, more techniques often only add complications and make it harder to compare the methods and the products. Many assessments suggest that once the techniques are understood and their similarities studied, many techniques that look different at first turn out to be essentially the same. These assessments are only indicative, however, because data mining is still a new technology with much potential that has not yet been fully exploited.
2.8.
Challenges in applying and researching data mining techniques
Here we present some difficulties in researching and applying data mining techniques. This does not mean that the problems are intractable. It only emphasizes that mining data is not simple, and that these issues have to be examined and addressed. Some of the difficulties are listed below:
2.8.1.
Database issues
[...] for the database, sampling, approximation methods, and parallel processing (Agrawal et al, Holsheimer et al).
High dimensionality: not only is the number of records large, but the number of fields in the database is also large, so the size of the problem grows. A high-dimensional data set enlarges the search space for model induction. It also increases the chance that a data mining algorithm finds spurious patterns. The remedy is to reduce the effective dimensionality of the problem and to use prior knowledge to identify irrelevant variables.
Dynamic data: A basic characteristic of most databases is that their content changes continuously. Data can change over time, and data mining is affected by the moment at which the data is observed. For example, in a database of patient conditions, some values are constant, others change continuously over time (such as weight and height), and others change with the situation so that only the most recently observed value is sufficient (such as pulse rate). Rapidly changing data can therefore make previously discovered patterns lose their value. In addition, the variables in an application's database may be modified, deleted, or added over time. This problem is addressed by incremental methods for updating patterns, and by treating changes as an opportunity for discovery, using them to search for patterns that have changed.
Irrelevant fields: Another important characteristic is the irrelevance of data, meaning that data items become irrelevant to the current focus of mining. A related aspect is the applicability of an attribute to a subset of the database. For example, the Nostro account number field does not apply to individual agents.
Missing values: The presence or absence of values for relevant data attributes can affect data mining. In an interactive system, the absence of important data can lead to a request for its value or to a test to determine it. Alternatively, the absence of data can be treated as a condition, and a missing attribute can be treated as an intermediate value, namely an unknown value.
Missing fields: An incomplete view of the database can make valid data appear erroneous. The view of the database must expose all the attributes that a data mining algorithm could use to solve the problem. Suppose we have attributes meant to distinguish the situations of interest. If they fail to do so, the data is considered to contain errors. For a system that learns to diagnose malaria from a patient database, records of patients with identical symptoms but different diagnoses are attributed to errors in the data. This problem is also common in business databases. Important attributes may be missing if the data was not prepared with data mining in mind.
Noise and uncertainty: For relevant attributes, the severity of errors depends on the data type of the permitted values. The values of different attributes can be real numbers, integers, or strings, and can belong to a set of nominal values. These nominal values may be partially or totally ordered and may even have semantic structure.
Another aspect of uncertainty is the inherent precision that the data is required to have, in other words the noise in the measurements. Based on prior analysis, a statistical model describing the randomness is built and used to define the expected values and the tolerances of the data. Statistical models are often applied in an ad hoc way, subjectively selecting attributes in order to obtain statistics and to judge the acceptability of attribute values (or combinations of values). For numeric data in particular, the precision of the data can be a factor in mining. For example, when body temperature is measured, a deviation of 0.1 degree is usually tolerated. An analysis of trends in body temperature, however, requires higher precision. For a mining system to relate such trends to a diagnosis, a known level of noise in the input data is needed.
Complex relationships between fields: Hierarchically structured attributes or values, relationships between attributes, and more complex means of expressing knowledge about the content of the database all require algorithms that can use this information effectively. Initially, data mining techniques were developed only for records with simple attribute values. Today, however, techniques are being developed to extract the relationships between these variables.
2.8.2.
Some other issues
Expressiveness of patterns: In many applications it is important that what is discovered is as understandable to people as possible. Solutions therefore usually include graphical representations, rule structures built on directed graphs (Gaines), natural language representations (Matheus et al.), and other techniques for representing knowledge and data.
Interaction with users and prior knowledge: Many data mining tools and methods are not truly interactive and do not easily incorporate prior knowledge. Using domain knowledge is very important in data mining. Several approaches have been proposed to address this, such as using deductive databases to discover knowledge that is then used to guide the data mining search, or using prior distributions and probabilities of the data as a way of encoding available knowledge.
Exercises:
1. What is data mining?
2. What are the main tasks of the data mining process?
3. Present the basic differences between data mining and methods such as machine learning and statistics.
4. What are the steps of the data mining process?
5. Give an example of a practical application of data mining.
3.1. Purpose
All data mining techniques operate on databases and large data sources. These result from the continuous recording of information reflecting human activity, natural processes, and so on. Naturally, the stored data is entirely in raw form and is not ready for discovering the information hidden in it. It therefore needs to be cleaned and transformed into suitable forms before any analysis is carried out.
To extract useful information, or to apply mining methods such as classification and prediction, the original raw data must pass through several transformation stages. These stages can be carried out in many ways depending on need, for example: reducing the size of the data, selecting the data that really matters, limiting the range of real-time data, or changing and adjusting the data to best fit the stated requirements. Of course, one should not expect too much from applying a computer to find useful knowledge without human assistance, nor expect that a data source transformed for one problem will suit a different mining problem.
For example, an electronics company asks for an analysis of sales data across its branches. The analyst then has to examine carefully the sales database of the whole company, as well as its warehouses, to identify and select the attributes or dimensions to include in the analysis, such as item category, item, price, and selling branch. It is unavoidable, however, that daily transactions contain some errors introduced when sales staff record them. The errors vary widely, from information not being recorded at all to information being recorded in a way that deviates from normal rules and standards. The analysis will therefore be hard to carry out if the original data is left incomplete (missing attribute values, or certain attributes containing only aggregated data), noisy (containing errors, or values whose range differs from what is expected), and inconsistent (for example, discrepancies in the branch codes used for classification).
The issues in this example are entirely real. Data may not have been considered important at collection time, or related data may not have been recorded because of a misunderstanding or an equipment malfunction. In addition, recorded data may have been deleted after some earlier review, and the history of changes to transactions may have been discarded so that only the aggregate information at the time of review is kept. This gives rise to the need for data cleaning: finding (filling in) missing values, smoothing noisy data, and removing meaningless values and contradictory data.
Data preparation for data mining usually consists of:
- Data cleaning;
- Data integration;
- Data transformation;
- Data reduction.
- Use the attribute mean to fill in the missing value: for example, if the average per-capita income of a region is known to be 800,000, this value can be used to replace the missing income value of a customer in that region.
- Use the values of tuples in the same category to replace the missing value: for example, if customer A belongs to the same credit-risk group as another customer B whose average income is known, that value can be used to fill in the average income of customer A.
- Use the most probable value to fill in the missing values: this value can be determined by regression, by inference tools based on Bayesian theory, or by decision trees.
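The first two strategies above (fill a missing value with the mean of the known values in the same group) can be sketched as follows. The record layout, the field names `region` and `income`, and the function name `fill_with_group_mean` are hypothetical and chosen only for illustration.

```python
def fill_with_group_mean(records, group_key, value_key):
    """Return copies of the records in which a missing value (None) is
    replaced by the mean of the known values in the same group."""
    sums, counts = {}, {}
    for rec in records:
        value = rec[value_key]
        if value is not None:
            group = rec[group_key]
            sums[group] = sums.get(group, 0) + value
            counts[group] = counts.get(group, 0) + 1
    filled = []
    for rec in records:
        rec = dict(rec)  # copy, so the input records are left untouched
        group = rec[group_key]
        if rec[value_key] is None and group in counts:
            rec[value_key] = sums[group] / counts[group]
        filled.append(rec)
    return filled


customers = [
    {"region": "A", "income": 700_000},
    {"region": "A", "income": 900_000},
    {"region": "A", "income": None},
    {"region": "B", "income": None},  # no known value in region B
]
cleaned = fill_with_group_mean(customers, "region", "income")
```

In this sketch a group with no known value keeps its missing entry, since there is nothing to average; a real pipeline would have to choose a fallback, such as the overall mean.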
3.2.2. Noisy data
Noise is a random error or variance in a measured variable, or the result of uncontrolled recording mistakes. Given an attribute such as price, how can the attribute be smoothed to remove the noise? Consider the following smoothing techniques:
Sorted array of item prices: 4, 8, 15, 21, 21, 24, 25, 28, 34

Partition into bins
    Bin 1: 4, 8, 15
    Bin 2: 21, 21, 24
    Bin 3: 25, 28, 34
Smoothing by bin means
    Bin 1: 9, 9, 9
    Bin 2: 22, 22, 22
    Bin 3: 29, 29, 29
Smoothing by bin boundaries
    Bin 1: 4, 4, 15
    Bin 2: 21, 21, 24
    Bin 3: 25, 25, 34

Table 3.1. Example of smoothing by binning
a. Binning: A data value is smoothed using the values around it. For example, the price values are first sorted and then partitioned into bins of equal size 3 (each bin contains 3 values).
- Smoothing by bin means: each value in a bin is replaced by the mean of the values in that bin.
- Smoothing by bin boundaries: the minimum and maximum values of the bin are identified and used as the bin boundaries. Each remaining value in the bin is replaced by whichever of the two boundary values is closer to it.
For example, bin 1 contains the values 4, 8, 15, whose mean is 9. With smoothing by bin means, the original values are all replaced by 9. With smoothing by bin boundaries, the value 8 is closer to 4 than to 15, so it is replaced by 4.
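The two binning variants can be reproduced with a short Python sketch using the price values from the example. The function names are illustrative, and sending ties to the lower boundary is an assumption, since the notes do not say how ties are resolved.

```python
def equal_frequency_bins(values, size):
    """Sort the values and split them into consecutive bins of `size` items."""
    ordered = sorted(values)
    return [ordered[i:i + size] for i in range(0, len(ordered), size)]


def smooth_by_means(bins):
    """Replace every value in a bin by the mean of that bin."""
    return [[sum(b) / len(b)] * len(b) for b in bins]


def smooth_by_boundaries(bins):
    """Replace every value by the nearer of the bin minimum and maximum
    (ties go to the minimum; this tie rule is an assumption)."""
    smoothed = []
    for b in bins:
        low, high = min(b), max(b)
        smoothed.append([low if v - low <= high - v else high for v in b])
    return smoothed


prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
bins = equal_frequency_bins(prices, 3)
by_means = smooth_by_means(bins)
by_boundaries = smooth_by_boundaries(bins)
```

With these inputs, `by_means` holds 9, 22, and 29 for the three bins, and `by_boundaries` holds 4, 4, 15 / 21, 21, 24 / 25, 25, 34, in agreement with the example.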
b. Regression: The common method is linear regression, which finds the best relationship between two attributes (or variables) so that one attribute can be used to predict the other. Multiple linear regression is an extension of this method in which more than two attributes are considered and the data is fitted to a multidimensional surface.
For example, how can the data analyst or the computer be sure that the customer id attribute in a database A and the cust number in a flat file are the same attribute?
Integration always needs information describing the properties of each attribute (metadata), such as name, meaning, data type, domain, and the rules for handling null or zero values. The metadata is used to help transform the data. This step is therefore also related to data cleaning.
Data redundancy: This is also an important issue. For example, an annual revenue attribute may be redundant if it can be derived from another attribute or set of attributes.
Some redundancy can be detected by correlation analysis. Given two attributes, correlation analysis can show how strongly one attribute depends on the other, based on the data available in the source. For numeric attributes, the correlation between two attributes A and B can be evaluated by computing the correlation as follows:
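$$r_{A,B} = \frac{\sum_{i=1}^{N}(a_i - \bar{A})(b_i - \bar{B})}{N\,\sigma_A\,\sigma_B}$$
where:
- $N$ is the number of tuples
- $a_i$ and $b_i$ are the values of attributes A and B in tuple $i$
- $\bar{A}$, $\bar{B}$ are the mean values and $\sigma_A$, $\sigma_B$ the standard deviations of A and B
If $r_{A,B} > 0$, A and B are positively correlated: when the value of A increases, the value of B increases as well. The higher the value, the stronger the relationship. Consequently, if the value is high, one of the two attributes may be removed as redundant.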
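If $r_{A,B} = 0$, A and B are uncorrelated. If $r_{A,B} < 0$, A and B are negatively correlated: when the value of one attribute increases, the value of the other decreases.
For categorical attributes, the relationship between A and B can be tested with the $\chi^2$ (chi-square) statistic:
$$\chi^2 = \sum_{i}\sum_{j}\frac{(o_{ij} - e_{ij})^2}{e_{ij}}$$
where:
- $o_{ij}$ is the observed frequency of the joint event $(A = a_i, B = b_j)$ and $e_{ij}$ is its expected frequency,
$$e_{ij} = \frac{count(A = a_i) \times count(B = b_j)}{N}$$
with $N$ the total number of tuples.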
Female    200 (360)    1000 (840)    1200
Total     450          1050          1500
Note that the expected frequencies are written in parentheses. In every row the expected frequencies sum to the observed total of that row, and in every column they sum to the observed total of that column.
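The expected frequencies in parentheses can be reproduced from the row and column totals alone. The sketch below assumes the table is a 2 x 2 contingency table used for a chi-square test of independence. The first observed row, [250, 50], is not printed in the surviving table; it is inferred here from the column totals (450 - 200 and 1050 - 1000) and should be read as an assumption. Function names are illustrative.

```python
def expected_counts(observed):
    """Expected frequency of each cell: row total * column total / N."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    return [[rt * ct / n for ct in col_totals] for rt in row_totals]

def chi_square(observed):
    """Sum of (observed - expected)^2 / expected over all cells."""
    expected = expected_counts(observed)
    return sum((o - e) ** 2 / e
               for obs_row, exp_row in zip(observed, expected)
               for o, e in zip(obs_row, exp_row))

# First row inferred from the totals (assumption); second row is the one in the table.
observed = [[250, 50],
            [200, 1000]]
expected = expected_counts(observed)
chi2 = chi_square(observed)
print(expected)
print(round(chi2, 2))
```

A large statistic relative to the chi-square critical value for the table's degrees of freedom indicates that the two attributes are not independent.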
cn
- Aggregation: summary or aggregation operations are applied to the data. For example, daily sales figures can be aggregated to compute monthly and yearly totals. This step is typically used to build a data cube for analysis.
- Generalization: low-level or raw data are replaced by higher-level concepts through a concept hierarchy. For example, a categorical attribute such as street can be generalized to a higher level such as city or country. Similarly, numeric values such as age can be mapped to higher-level concepts such as young, middle-aged, and senior.
- Normalization: the attribute data are scaled to fall within a small range, such as -1.0 to 1.0 or 0.0 to 1.0.
- Attribute construction: new attributes are constructed and added to the data source to help the mining process.
This section focuses on normalization.
An attribute is normalized by scaling its values proportionally into a specified range such as 0.0 to 1.0. Normalization is particularly useful for classification algorithms based on neural networks and for distance-based algorithms such as nearest-neighbor classification and clustering. We consider three methods: min-max, z-score, and decimal scaling.
a. Min-max
Min-max normalization performs a linear transformation on the original data. Suppose that $min_A$ and $max_A$ are the minimum and maximum values of attribute A. Min-max normalization maps a value $v$ of A to $v'$ in the range $[new\_min_A, new\_max_A]$ by computing
$$v' = \frac{v - min_A}{max_A - min_A}\,(new\_max_A - new\_min_A) + new\_min_A$$
Example: Suppose the minimum and maximum values of the attribute average income are 500,000 and 4,500,000. We want to map the value 2,500,000 to the range [0.0, 1.0] using min-max normalization. The new value is
$$\frac{2{,}500{,}000 - 500{,}000}{4{,}500{,}000 - 500{,}000}\,(1.0 - 0.0) + 0.0 = 0.5$$
b. z-score
With this method, the values of an attribute A are normalized based on the mean and standard deviation of A. A value $v$ of A is mapped to $v'$ as follows:
$$v' = \frac{v - \bar{A}}{\sigma_A}$$
Continuing the example above: suppose the standard deviation and the mean of average income are 1,000,000 and 500,000. With z-score normalization, the value 2,500,000 is mapped to $(2{,}500{,}000 - \bar{A})/\sigma_A$.
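The three normalization methods named in this section can be sketched as follows. Function names and the decimal-scaling sample values are illustrative assumptions; only the income figures come from the min-max example in the text.

```python
def min_max(v, min_a, max_a, new_min=0.0, new_max=1.0):
    """Linear map of v from [min_a, max_a] to [new_min, new_max]."""
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min

def z_score(v, mean_a, std_a):
    """Distance of v from the mean, measured in standard deviations."""
    return (v - mean_a) / std_a

def decimal_scaling(values):
    """Divide by 10**j, with j the smallest integer such that max(|v'|) < 1."""
    max_abs = max(abs(v) for v in values)
    j = 0
    while max_abs / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values]

# Min-max example from the text: income 2,500,000 in [500,000, 4,500,000].
income_scaled = min_max(2_500_000, 500_000, 4_500_000)
print(income_scaled)

# Illustrative values only (not from the text): the largest magnitude is 986, so j = 3.
print(decimal_scaling([-986, 917]))
```

Min-max preserves the relative spacing of the values but is sensitive to outliers, since a single extreme value stretches the range; z-score is usually preferred when the true minimum and maximum are unknown.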
sales reports by year rather than by quarter. The data should therefore be aggregated into a report of total annual sales rather than quarterly sales.
b. Attribute subset selection
The data source to be analyzed may contain hundreds of attributes, many of which may be irrelevant to the analysis or redundant. For example, if the task is to classify customers by whether or not they are likely to buy a new music CD, the customer's telephone number is irrelevant compared with attributes such as age or music taste. Even so, choosing which attributes matter is difficult and time-consuming, especially when the characteristics of the data are not well understood. Leaving out relevant attributes or keeping irrelevant ones can confuse the mining algorithms and distort their results.
This method reduces the size of the data by removing irrelevant or redundant attributes (or dimensions). The main goal is to find the smallest set of attributes for which the mining results are as close as possible to the results obtained with all attributes.
How, then, can a good subset of the original attributes be found? Recall that N attributes have $2^N$ possible subsets. Generating and examining all of them is very expensive in effort and resources, especially as N and the number of data classes grow. Other methods are therefore needed. One of them is greedy search, which traverses the attribute space and always makes the choice that looks best at the time.
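[Figure: greedy methods for attribute subset selection. Panels: forward selection, backward elimination, decision tree induction, each starting from the initial attribute set. Forward selection, initial reduced set: {} => {A1}]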
1. Forward selection: Start from an empty set of attributes. The best of the remaining attributes is determined and added to the set. Repeat this step until no further attribute can be added.
2. Backward elimination: Start from the full set of attributes. At each step, remove the worst attribute remaining in the set.
3. Combination of forward selection and backward elimination: at each step, select the best attribute and add it to the set, and at the same time remove the worst attribute from the set under consideration.
4. Decision tree induction: A decision tree is built from the original data. All attributes that do not appear in the tree are considered irrelevant. The attributes that appear in the tree form the reduced attribute subset.
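Method 1 (forward selection) can be sketched as a greedy loop. The text does not say how "best" is measured, so the `score` function below is a hypothetical, caller-supplied evaluation (for instance the accuracy of a model trained on the subset), and the toy scoring data is invented purely for illustration.

```python
def forward_selection(attributes, score):
    """Greedy forward selection.

    `score` is a hypothetical function mapping a set of attributes to a
    number, where higher means better.
    """
    selected = set()
    best_score = score(selected)
    while True:
        candidates = [a for a in attributes if a not in selected]
        if not candidates:
            break
        # Try each remaining attribute and keep the best single addition.
        best_attr = max(candidates, key=lambda a: score(selected | {a}))
        new_score = score(selected | {best_attr})
        if new_score <= best_score:
            break  # no remaining attribute improves the subset
        selected.add(best_attr)
        best_score = new_score
    return selected

# Toy score (assumption): usefulness per attribute minus a cost of 0.5 per attribute kept.
usefulness = {"age": 3.0, "music_taste": 2.0, "phone": 0.0}

def toy_score(subset):
    return sum(usefulness[a] for a in subset) - 0.5 * len(subset)

chosen = forward_selection(["age", "phone", "music_taste"], toy_score)
print(sorted(chosen))
```

Backward elimination follows the same pattern in reverse: start from the full set and repeatedly drop the attribute whose removal gives the best score.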
Exercises:
1. If an attribute in the Student-Grade data source takes the values A, B, C, D, F, what data type should the attribute have during preprocessing?
2. Given the one-dimensional array X = {5.0, 23.0, 17.6, 7.23, 1.11}, normalize it using
a. Decimal scaling: to the range [-1, 1].
b. Min-max: to the range [0, 1].
c. Min-max: to the range [-1, 1].
d. The standard deviation (z-score) method.
e. Compare the results of the normalizations above and comment on the advantages and disadvantages of each method.
3. Smooth the following data set using rounding:
Y = {1.17, 2.59, 3.38, 4.23, 2.67, 1.73, 2.53, 3.28, 3.44}
Then present the resulting set at the precisions:
a. 0.1
b. 1.
4. Given the sample set with missing values
X1 = {0, 1, 1, 2}
X2 = {2, 1, , 1}
X3 = {1, , , 0}
X4 = {, 2, 1, }
what is gained and what is lost when the dimensionality of a large data warehouse is reduced during data preprocessing?
Chapter 4: Association rules
4.1. The concept of association rules
Let I = {i1, i2, ..., in} be a set of items; each element of I is called an item. Items are sometimes called attributes, and I the set of attributes. Every subset of I is called an itemset, and the number of elements in an itemset is called its length or size.
Let D = {t1, t2, ..., tm} be a transaction database, where each ti is a transaction and is a subset of I. The number of transactions (the cardinality of D, written |D| or card(D)) is usually very large.
Let X and Y be two itemsets (two subsets of I). An association rule is written X → Y, where X and Y are disjoint. It expresses the dependence of itemset Y on itemset X, in the sense of how the occurrence of X in a transaction entails the occurrence of Y. An itemset X is said to occur in a transaction t if X is a subset of t. The support of an itemset X, written supp(X), is the fraction of transactions in D that contain X:
supp(X) = N(X)/|D|
where N(X) is the number of transactions in the transaction database D that contain X.
The value of an association rule X → Y is expressed by two measures, the support supp(X → Y) and the confidence conf(X → Y).
The support supp(X → Y) is the fraction of transactions in D that contain X ∪ Y:
supp(X → Y) = P(X ∪ Y) = N(X ∪ Y)/|D|
where N(X ∪ Y) is the number of transactions that contain X ∪ Y.
The confidence conf(X → Y) is the ratio of the number of transactions that contain X ∪ Y to the number of transactions that contain X:
conf(X → Y) = P(Y|X) = N(X ∪ Y)/N(X) = supp(X → Y)/supp(X)
where N(X) is the number of transactions that contain X.
From the definitions, 0 ≤ supp(X → Y) ≤ 1 and 0 ≤ conf(X → Y) ≤ 1. In probabilistic terms, the support is the probability that the itemset X ∪ Y occurs, and the confidence is the conditional probability that Y occurs given that X occurs.
An association rule X → Y is regarded as knowledge (a valuable pattern) if both supp(X → Y) ≥ minsup and conf(X → Y) ≥ minconf hold, where minsup and minconf are two given thresholds.
An itemset X whose support reaches the threshold minsup is called a frequent itemset.
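The two measures translate directly into code. The sketch below uses the nine transactions T01 to T09 from the worked FP-Growth example later in this chapter; the function names are illustrative.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(x, y, transactions):
    """conf(X -> Y) = supp(X u Y) / supp(X)."""
    return support(set(x) | set(y), transactions) / support(x, transactions)

transactions = [
    {"I1", "I2", "I5"}, {"I2", "I4"}, {"I2", "I3"},
    {"I1", "I2", "I4"}, {"I1", "I3"}, {"I2", "I3"},
    {"I1", "I3"}, {"I1", "I2", "I3", "I5"}, {"I1", "I2", "I3"},
]

supp_12 = support({"I1", "I2"}, transactions)              # 4 of 9 transactions
conf_15_2 = confidence({"I1", "I5"}, {"I2"}, transactions)
print(supp_12, conf_15_2)
```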
4.2. The Apriori algorithm
Apriori is a typical algorithm for mining association rules. It is based on the Apriori principle: every subset of a frequent itemset is also a frequent itemset. The goal of the algorithm is to find all frequent itemsets in the transaction database D. The algorithm works in the manner of dynamic programming: from the sets Fi = { ci | ci is a frequent itemset, |ci| = i } of all frequent itemsets of length i (1 ≤ i ≤ k), it derives the set Fk+1 of all frequent itemsets of length k+1. The items i1, i2, ..., in are kept in a fixed order throughout the algorithm.
Apriori algorithm:
Input: the transaction database D and the support threshold minsup.
Output: the set of all frequent itemsets.
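Since the body of the algorithm did not survive in this copy, the following is a minimal level-wise sketch of the general Apriori scheme (join, prune, count), not necessarily the exact listing of the original notes. It takes a minimum support count rather than a ratio, and it is run on the nine transactions used in the worked example of this chapter.

```python
from itertools import combinations

def apriori(transactions, min_count):
    """Return {frozenset: support_count} for every frequent itemset."""
    transactions = [frozenset(t) for t in transactions]

    def count(candidates):
        # One scan of the data per level; keep the candidates that are frequent.
        counts = {c: sum(c <= t for t in transactions) for c in candidates}
        return {c: n for c, n in counts.items() if n >= min_count}

    items = {frozenset([i]) for t in transactions for i in t}
    level = count(items)                      # F1
    frequent = dict(level)
    k = 1
    while level:
        k += 1
        prev = set(level)
        # Join step: unite pairs of frequent (k-1)-itemsets into k-itemsets.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune step (Apriori principle): every (k-1)-subset must be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in prev for s in combinations(c, k - 1))}
        level = count(candidates)             # Fk
        frequent.update(level)
    return frequent

transactions = [
    ["I1", "I2", "I5"], ["I2", "I4"], ["I2", "I3"],
    ["I1", "I2", "I4"], ["I1", "I3"], ["I2", "I3"],
    ["I1", "I3"], ["I1", "I2", "I3", "I5"], ["I1", "I2", "I3"],
]
frequent = apriori(transactions, min_count=2)
print(len(frequent))
```

The pairwise join used here is simpler than the ordered prefix join of the classical formulation; it produces the same candidates after pruning, at the cost of some redundant unions.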
Session      Pages accessed
Session 1    /shopping/comestic.htm, /shopping/fashion.htm, /cars.htm
Session 2    /shopping/fashion.htm, /news.htm
Session 3    /shopping/fashion.htm, /sport.htm
Session 4    /shopping/comestic.htm, /shopping/fashion.htm, /news.htm
Session 5    /shopping/comestic.htm, /sport.htm
Session 6    /shopping/fashion.htm, /sport.htm
Session 7    /shopping/comestic.htm, /sport.htm
Session 8    /shopping/comestic.htm, /shopping/fashion.htm, /sport.htm, /cars.htm
Session 9    /shopping/comestic.htm, /shopping/fashion.htm, /sport.htm
Table 3.1: Access sessions of a user
Suppose that, after preprocessing the data collected from the web log, we have identified the user's access sessions as in Table 3.1. Here each session can be regarded as a transaction and each page accessed as an item. Applying the Apriori algorithm helps determine which pages tend to be accessed together. The patterns obtained provide useful knowledge for areas such as e-marketing or reorganizing the website to make it as convenient as possible for users.
For brevity, we denote the pages accessed as follows:
/shopping/comestic.htm    I1
/shopping/fashion.htm     I2
/sport.htm                I3
/news.htm                 I4
/cars.htm                 I5
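Step 1: Count the occurrences of each itemset of length 1 and remove the itemsets whose support is below minsup = 2/9.

Itemset    Occurrences    Support
{I1}       6              6/9
{I2}       7              7/9
{I3}       6              6/9
{I4}       2              2/9
{I5}       2              2/9

Frequent itemsets of length 1:

Itemset    Occurrences    Support
{I1}       6              6/9
{I2}       7              7/9
{I3}       6              6/9
{I4}       2              2/9
{I5}       2              2/9

Step 2: Generate the itemsets of length 2 by joining the itemsets of length 1, scan each itemset, and remove the itemsets whose support is below 2/9 to obtain the frequent itemsets.

Itemset     Occurrences    Support
{I1, I2}    4              4/9
{I1, I3}    4              4/9
{I1, I4}    1              1/9
{I1, I5}    2              2/9
{I2, I3}    4              4/9
{I2, I4}    2              2/9
{I2, I5}    2              2/9
{I3, I4}    0              0
{I3, I5}    1              1/9
{I4, I5}    0              0

Frequent itemsets of length 2:

Itemset     Occurrences    Support
{I1, I2}    4              4/9
{I1, I3}    4              4/9
{I1, I5}    2              2/9
{I2, I3}    4              4/9
{I2, I4}    2              2/9
{I2, I5}    2              2/9

Step 3: Generate the itemsets of length 3 and remove the itemsets whose support is below minsup = 2/9.

Itemset         Occurrences    Support
{I1, I2, I3}    2              2/9
{I1, I2, I5}    2              2/9

Frequent itemsets of length 3:

Itemset         Occurrences    Support
{I1, I2, I3}    2              2/9
{I1, I2, I5}    2              2/9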
F = {{I1}, {I2}, {I3}, {I4}, {I5}, {I1, I2}, {I1, I3}, {I1, I5}, {I2, I3}, {I2, I4}, {I2, I5},
{I1, I2, I3}, {I1, I2, I5}}
To generate the association rules, each frequent itemset is split into two disjoint sets and the confidence of the corresponding rule is computed. Rules whose confidence reaches the threshold minconf = 70% are kept. For example, consider the frequent itemset {I1, I2, I5}. It yields the following rules:
R1: I1, I2 → I5
conf(R1) = supp({I1, I2, I5})/supp({I1, I2}) = 2/4 = 50% (R1 is discarded)
R2: I1, I5 → I2
conf(R2) = supp({I1, I2, I5})/supp({I1, I5}) = 2/2 = 100%
R3: I2, I5 → I1
conf(R3) = supp({I1, I2, I5})/supp({I2, I5}) = 2/2 = 100%
R4: I1 → I2, I5
conf(R4) = supp({I1, I2, I5})/supp({I1}) = 2/6 = 33% (R4 is discarded)
R5: I2 → I1, I5
conf(R5) = supp({I1, I2, I5})/supp({I2}) = 2/7 = 29% (R5 is discarded)
R6: I5 → I1, I2
conf(R6) = supp({I1, I2, I5})/supp({I5}) = 2/2 = 100%
From rule R2 we can conclude that a user who is interested in the pages comestic.htm and cars.htm is very likely also interested in the page fashion.htm. This may suggest an advertising plan. Similarly, from rule R6 we can conclude that a user who is interested in cars is also interested in fashion and cosmetics. Advertising banners and links to the pages fashion.htm and comestic.htm should therefore be placed directly on the page cars.htm for the user's convenience.
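The rule-generation step for a single frequent itemset can be sketched as below. The support counts are copied from the tables of the worked example; the function name is illustrative.

```python
from itertools import combinations

def rules_from_itemset(itemset, support_count, min_conf):
    """Return (antecedent, consequent, confidence) for every rule meeting min_conf."""
    itemset = frozenset(itemset)
    rules = []
    for r in range(1, len(itemset)):
        for lhs in combinations(sorted(itemset), r):
            lhs = frozenset(lhs)
            # conf(lhs -> rest) = count(itemset) / count(lhs)
            conf = support_count[itemset] / support_count[lhs]
            if conf >= min_conf:
                rules.append((lhs, itemset - lhs, conf))
    return rules

# Support counts taken from the worked example above.
support_count = {
    frozenset({"I1"}): 6, frozenset({"I2"}): 7, frozenset({"I5"}): 2,
    frozenset({"I1", "I2"}): 4, frozenset({"I1", "I5"}): 2,
    frozenset({"I2", "I5"}): 2, frozenset({"I1", "I2", "I5"}): 2,
}
rules = rules_from_itemset({"I1", "I2", "I5"}, support_count, min_conf=0.7)
for lhs, rhs, conf in rules:
    print(sorted(lhs), "->", sorted(rhs), f"{conf:.0%}")
```

Only R2, R3 and R6 survive the 70% threshold, matching the hand calculation.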
4.3. The FP-Growth algorithm applied to Web usage mining
A drawback of the Apriori algorithm is that it generates too many candidate itemsets. If there are initially 10^4 frequent itemsets of length 1, the join step produces on the order of 10^7 itemsets of length 2 (exactly 10^4(10^4 - 1)/2, about 5 x 10^7). Clearly, an itemset of length k requires at least 2^k - 1 candidate itemsets beforehand. Another drawback is that Apriori must scan the data set many times, which becomes costly as the size of the itemsets grows. If an itemset of length k is generated, the data set must be scanned k+1 times.
The FP-Growth algorithm for association rule mining is built on the following basic principles:
1. Compress the data set into a tree structure to reduce the cost of working with the whole data set during mining. Infrequent items are discarded early, without affecting the mining result.
2. Apply the divide-and-conquer method thoroughly. The mining process is divided into smaller stages: building the FP-tree and mining the frequent itemsets from the FP-tree that has been built.
3. Avoid generating candidate itemsets. Each time, the algorithm examines only part of the data set.
The FP-tree is a tree data structure organized as follows:
1. The root node is labeled null.
2. Every other node holds the fields item-name, count, and node-link, where:
- Item-name: the name of the item that the node represents.
- Count: the number of transactions containing the pattern formed by the items on the path from the root to this node.
- Node-link: points to the next node in the tree with the same item-name (or to null if there is none).
3. The header table has one row per item. Each row holds three fields: item-name, item-count, and node-link, where:
- Item-name: the name of the item.
- Item-count: the sum of the count fields of all nodes holding that item.
- Node-link: points to the most recently created node holding that item in the tree.
The FP-tree can be built from the transaction database D with the following procedure:
Input: the transaction database D and the threshold min_sup.
Output: the FP-tree.
Procedure FP_TreeConstruction
{
1. Scan D for the first time to collect the set F of frequent items and their support counts. Sort the items of F in descending order of support count to obtain the list L.
2. Create the root node R and label it null.
Create the header table with |F| rows and set every node-link to null.
3. for each transaction T ∈ D    // second scan of D
{
Select the frequent items of T into P;
Sort the items of P in the order of L;
Call Insert_Tree(P, R);
}
}
The subprocedure Insert_Tree is defined as follows:
Procedure Insert_Tree(P, R)
{
Let P = [p | P − p], where p is the first element and P − p is the rest of the list;
if R has a child N such that N.item-name = p then
N.count++;
else
{
Create a new node N;
N.count = 1;
N.item-name = p;
N.parent = R;
// Link N into the node-link chain of item p; H is the header table
N.node-link = H[p].head;
H[p].head = N;
}
// Increase the count of p in the header table by 1
H[p].count++;
if (P − p) != null then Call Insert_Tree(P − p, N);
}
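The two procedures can be sketched together in Python. This is an illustrative sketch rather than a transcription: the recursion of Insert_Tree is replaced by a loop, min_sup is given as a count, and ties in the order L are broken by item name, which is an assumption (the text only asks for descending support count). The data are the nine transactions of the worked example in this chapter.

```python
from collections import Counter

class Node:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count = 0
        self.children = {}      # item -> child Node
        self.node_link = None   # next node in the tree holding the same item

def build_fp_tree(transactions, min_count):
    # First scan: count the items and keep the frequent ones.
    counts = Counter(i for t in transactions for i in set(t))
    frequent = {i: c for i, c in counts.items() if c >= min_count}
    # Order L: descending support count, ties broken by name (assumption).
    ordered = sorted(frequent, key=lambda i: (-frequent[i], i))
    rank = {item: r for r, item in enumerate(ordered)}
    root = Node(None, None)
    header = {item: None for item in ordered}   # item -> head of node-link chain
    # Second scan: insert the frequent items of each transaction in order L.
    for t in transactions:
        node = root
        for item in sorted((i for i in set(t) if i in rank), key=rank.get):
            child = node.children.get(item)
            if child is None:
                child = Node(item, node)
                node.children[item] = child
                # The new node becomes the head of the chain for this item.
                child.node_link, header[item] = header[item], child
            child.count += 1
            node = child
    return root, header, frequent

transactions = [
    ["I1", "I2", "I5"], ["I2", "I4"], ["I2", "I3"],
    ["I1", "I2", "I4"], ["I1", "I3"], ["I2", "I3"],
    ["I1", "I3"], ["I1", "I2", "I3", "I5"], ["I1", "I2", "I3"],
]
root, header, frequent = build_fp_tree(transactions, min_count=2)
print({item: child.count for item, child in root.children.items()})
```

Following `header[item]` and then each `node_link` visits every node that holds the item, which is what the mining phase relies on.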
To discover the frequent patterns from the FP-tree, we use the procedure FP-Growth:
Input: the FP-tree Tree, min_sup, α = null.
Output: the complete set F of frequent patterns.
Procedure FP-Growth(Tree, α)
{
F = ∅;
if Tree contains a single path P then
{
for each combination β of the nodes in P do
{
Generate the pattern p = β ∪ α;
support_count(p) = minimum support count of the nodes in β;
F = F ∪ p;
}
}
else
for each ai in the header of Tree
{
Generate the pattern β = ai ∪ α;
support_count(β) = ai.support_count;
F = F ∪ β;
Build the conditional pattern base of β;
Build the conditional FP-tree Tree_β of β;
if (Tree_β != ∅) then Call FP-Growth(Tree_β, β);
}
}
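The recursion can be illustrated without materialising the trees. The sketch below is a simplification, not the FP-Growth procedure itself: it keeps each conditional pattern base as a plain list of (items, count) pairs and recurses on it directly, which follows the same divide-and-conquer idea but skips the compact tree and the single-path shortcut. Names are illustrative, and min_sup is given as a count.

```python
from collections import Counter

def pattern_growth(pattern_base, min_count, suffix=frozenset()):
    """Mine frequent itemsets from a list of (items, count) pairs.

    At the top level `pattern_base` is the transaction database with count 1
    per transaction; in recursive calls it is the conditional pattern base
    of `suffix`.
    """
    counts = Counter()
    for items, n in pattern_base:
        for i in items:
            counts[i] += n
    # Frequent items in descending support order, ties broken by name.
    order = sorted((i for i in counts if counts[i] >= min_count),
                   key=lambda i: (-counts[i], i))
    rank = {item: r for r, item in enumerate(order)}
    found = {}
    for item in reversed(order):             # least frequent item first
        pattern = suffix | {item}
        found[pattern] = counts[item]
        # Conditional pattern base of `pattern`: for every entry containing
        # `item`, keep the frequent items ranked before it.
        conditional = []
        for items, n in pattern_base:
            if item in items:
                prefix = [i for i in items if i in rank and rank[i] < rank[item]]
                if prefix:
                    conditional.append((prefix, n))
        found.update(pattern_growth(conditional, min_count, pattern))
    return found

transactions = [
    ["I1", "I2", "I5"], ["I2", "I4"], ["I2", "I3"],
    ["I1", "I2", "I4"], ["I1", "I3"], ["I2", "I3"],
    ["I1", "I3"], ["I1", "I2", "I3", "I5"], ["I1", "I2", "I3"],
]
patterns = pattern_growth([(t, 1) for t in transactions], min_count=2)
print(len(patterns))
```

On these nine transactions the sketch returns the same 13 frequent itemsets that Apriori finds, without ever generating a candidate that does not occur in the data.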
Apply the FP-Growth algorithm to the transaction database D considered in section 3.3, with the threshold minimum support count = 2 (that is, min_sup = 2/9):

Transaction    Itemset
T01            I1, I2, I5
T02            I2, I4
T03            I2, I3
T04            I1, I2, I4
T05            I1, I3
T06            I2, I3
T07            I1, I3
T08            I1, I2, I3, I5
T09            I1, I2, I3
First the FP-tree is built step by step. The transactions are examined one after another and their items are added to the tree.
First scan: find the itemsets of length 1 and sort them into a list in descending order of frequency. Removing the itemsets whose support is below the threshold min_sup gives the list:
L = {{I2:7}, {I1:6}, {I3:6}, {I4:2}, {I5:2}}
Second scan: build the FP-tree step by step. The items of each transaction are processed in the order of L.
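Min_sup = 2
L = {I1: 4, I2: 4}
support_count(β) = support_count(I2) = 4
Conditional pattern base: {{I1:2}}
The resulting tree has a single path.
support_count(β) = support_count(I1) = 4
Conditional pattern base: {}
Resulting tree: null

Item    Conditional pattern base            Conditional FP-tree       Frequent patterns generated
I5      {{I2, I1:1}, {I2, I1, I3:1}}        <I2:2, I1:2>              {I2, I5:2}, {I1, I5:2}, {I2, I1, I5:2}
I4      {{I2, I1:1}, {I2:1}}                <I2:2>                    {I2, I4:2}
I3      {{I2, I1:2}, {I2:2}, {I1:2}}        <I1:4, I2:2>, <I2:2>      {I2, I3:4}, {I1, I3:4}, {I2, I1, I3:2}
I1      {{I2:4}}                            <I2:4>                    {I2, I1:4}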
The biggest difference between the two algorithms is that Apriori must generate a large number of candidate itemsets, whereas FP-Growth avoids doing so. Apriori performs poorly when the itemsets are large and the support threshold is low, which leads to a large number of frequent patterns. The candidate set then grows to an unacceptable size.
Figure 4.2: Comparison of execution time for different numbers of transactions
Exercises:
THEORY:
1. What values are commonly used as the support and confidence parameters in the Apriori algorithm?
2. Why is discovering association rules fairly simple compared with generating the large number of itemsets in a transaction database?
3. Given the following transaction database:
X: TID Items
T01 A, B, C, D
T02 A, C, D, F
T03 C, D, E, G, A
T04 A, D, F, B
T05 B, C, G
T06 D, F, G
T07 A, B, G
T08 C, D, F, G
a. Using the thresholds support = 25% and confidence = 60%, find:
1. All frequent itemsets in database X.
2. The strong association rules.
5. Given the following transaction database:
Y: TID Items
T01 A1, B1, C2
T02 A2, C1, D1
T03 B2, C2, E2
T04 B1, C1, E1
T05 A3, C3, E2
T06 C1, D2, E2
a. Using the thresholds support s = 30% and confidence c = 60%, find:
1. All frequent itemsets in Y.
2. If the items are organized so that A = {A1, A2, A3}, B = {B1, B2}, C = {C1, C2, C3}, D = {D1, D2} and E = {E1, E2}, find the itemsets defined at the concept level.
3. Find the strong association rules for the itemsets of the previous question.
PRACTICE:
1. Use the Apriori algorithm to find the frequent itemsets in the Northwind database.
Chapter 5: Classification and prediction
5.1. Basic concepts
Data warehouses hold a great deal of useful information that can support decisions about the management and direction of an organization. Classification and prediction are two forms of data analysis used to extract models that describe important data classes or to predict data arising in the future. Such analysis helps us understand large data stores better. For example, we can build a classification model to decide whether a bank loan is safe or risky, or a prediction model to estimate the spending of potential customers from information about their income. Many classification and prediction methods have been studied in machine learning, pattern recognition, and statistics. Most of these algorithms are memory-resident and assume a small data size. Recent data mining techniques have been developed to build classification and prediction methods better suited to large data sources.
5.1.1. Classification
Classification builds a model, or classifier, that assigns class labels to data: for example, the labels Safe or Risky for loan applications, or Yes or No for marketing data. The labels are discrete values whose ordering has no meaning.
Data classification consists of two steps. In the first step, a classifier is built to describe the data source. This is the learning step, in which a classification algorithm builds the classifier by analyzing, or learning from, a prepared training set made up of many data tuples. A tuple X is represented by an n-dimensional vector X = (x1, x2, ..., xn) holding the values of the n attributes {A1, A2, ..., An} of the data source. Each tuple is assumed to belong to a predefined class identified by its label.
The first step of classification can be viewed as learning a mapping or function y = f(X) that predicts the label y of a tuple X. In other words, for each data class we need to learn (build) a corresponding mapping or function.
In the second step, the model obtained is used for classification. To keep the estimate objective, the model should be applied to a test set rather than to the original training set. The accuracy of a classification model on a test set is the percentage of test tuples that are labeled correctly, determined by comparing the label predicted for each test tuple with its known label. If the accuracy of the model is acceptable, the model can be used for tuples whose class label is unknown.
5.1.2. Prediction
- Attribute_selection_method: a procedure that determines the splitting criterion that best partitions the data tuples into classes. The criterion consists of a splitting attribute splitting_attribute and, where applicable, a split point split_point or a splitting subset splitting_subset.
Output: a decision tree
Algorithm:
1. Create a node N
2. If all tuples in D have the same class label C then
3. Return N as a leaf node labeled with class C
4. If the attribute list attribute_list is empty then
5. Return N as a leaf node labeled with the majority class in D
6. Call Attribute_selection_method(D, attribute_list) to find the best splitting criterion splitting_criterion
7. Label N with splitting_criterion and remove splitting_attribute from attribute_list
8. Foreach outcome j of splitting_criterion
// Partition the tuples and grow a subtree for each partition
9. Let Dj be the set of tuples in D that satisfy outcome j
10. If Dj is empty then
11. Attach to N a leaf labeled with the majority class in D
12. Else attach to N the node returned by Generate_decision_tree(Dj, attribute_list)
13. Endfor
14. Return N
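5.2.2. Attribute selection
Attribute selection means choosing the splitting criterion that best separates the data partition D into distinct classes. If D is split into smaller partitions according to the outcomes of the splitting criterion, each partition should ideally be pure (that is, all tuples in a partition belong to the same class). The criterion determines how the tuples at a given node are split. The tree node created for partition D is labeled with the splitting criterion, and its branches are grown according to the outcomes of the split.
Let D be a data partition of class-labeled training tuples. The labels take m distinct values defining m classes Ci (i = 1, ..., m). Let Ci,D be the set of tuples of class Ci in D.
The information needed to classify a tuple in D is given by
$$Info(D) = -\sum_{i=1}^{m} p_i \log_2(p_i)$$
where $p_i = |C_{i,D}|/|D|$ is the probability that a tuple in D belongs to class Ci.
Suppose the tuples in D are partitioned on an attribute A with v distinct values {a1, ..., av} into partitions {D1, ..., Dv}, where Dj contains the tuples in D that have outcome aj. The partitions correspond to the branches of node N.
The information still needed to arrive at an exact classification after this split is given by
$$Info_A(D) = \sum_{j=1}^{v} \frac{|D_j|}{|D|} \times Info(D_j)$$
The term $|D_j|/|D|$ is the weight of the j-th partition. Info_A(D) is the information needed to classify a tuple of D based on the partitioning by A. The smaller this value, the purer the partitions.
The information gain is given by
$$Gain(A) = Info(D) - Info_A(D)$$
Gain(A) tells us how much would be gained by branching on A. The attribute A with the highest information gain is chosen as the splitting attribute of node N.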
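The three quantities can be computed with a short Python sketch. The four training tuples below are invented for illustration (the labels echo the Safe/Risky loan example of section 5.1.1), and the function names are not from the original notes.

```python
from collections import Counter
from math import log2

def info(labels):
    """Info(D): entropy of the class distribution of `labels`."""
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def info_after_split(rows, labels, attr):
    """Info_A(D): weighted entropy of the partitions induced by attribute `attr`."""
    partitions = {}
    for row, label in zip(rows, labels):
        partitions.setdefault(row[attr], []).append(label)
    n = len(labels)
    return sum(len(p) / n * info(p) for p in partitions.values())

def gain(rows, labels, attr):
    """Gain(A) = Info(D) - Info_A(D)."""
    return info(labels) - info_after_split(rows, labels, attr)

# Hypothetical training tuples, for illustration only.
rows = [
    {"income": "high", "student": "yes"},
    {"income": "high", "student": "no"},
    {"income": "low", "student": "yes"},
    {"income": "low", "student": "no"},
]
labels = ["Safe", "Safe", "Risky", "Risky"]

best = max(["income", "student"], key=lambda a: gain(rows, labels, a))
print(best, gain(rows, labels, "income"), gain(rows, labels, "student"))
```

Here income separates the two classes perfectly while student tells nothing about the class, so income would be chosen as the splitting attribute.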
SAMPLE EXAM PAPERS
Vietnam Maritime University
Faculty of Information Technology
DEPARTMENT OF INFORMATION SYSTEMS
-----***-----
FINAL EXAMINATION
Course: DATA MINING
Academic year: x
Exam paper no.:
Approved by:
2,000    500      2,500
1,000    1,500    2,500
3,000    2,000    5,000
a. Suppose the association rule ... with min_sup = 25% and min_conf = 50%. Is this rule a strong association rule? Explain.
b. Based on the data given, is buying hot-dogs independent of buying hamburgers? If not, what kind of correlation exists between the two items?
Question 3: (2 points)
Explain the significance of data preprocessing in data mining.
Question 4: (2 points)
Given the following age data for analysis, sorted in ascending order: {13,
15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30, 33, 33, 35, 35, 35, 35, 36, 40, 45, 46,
52, 70}
a. Apply smoothing by bin boundaries with a bin width of 5. Illustrate the steps.
b. Apply min-max normalization to transform the age value 35 to the range [0.0, 1.0].
----------------------------***END***----------------------------
Note: - Do not alter or erase the exam paper; hand it back after the exam
Vietnam Maritime University
Faculty of Information Technology
DEPARTMENT OF INFORMATION SYSTEMS
-----***-----
FINAL EXAMINATION
Course: DATA MINING
Academic year: x
Exam paper no.:
Approved by:
TID     Items
T100    {M, O, N, K, E, Y}
T200    {D, O, N, K, E, Y}
T300    {M, A, K, E}
T400    {M, U, C, K, Y}
T500    {C, O, O, K, I, E}
a. Find all frequent itemsets using the Apriori algorithm.
b. List all strong association rules (with support s and confidence c) that match the following metarule, where X is a variable representing customers and item_i are variables representing items (for example A, B, ...)
Question 3: (2 points)
Explain the differences between a data warehouse and an ordinary database.
Question 4: (2 points)
Given the following age data for analysis, sorted in ascending order: {13,
15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30, 33, 33, 35, 35, 35, 35, 36, 40, 45, 46,
52, 70}
a. Apply smoothing by bin means with a bin width of 3. Illustrate the steps.
b. Apply decimal-scaling normalization to transform the age value 35.
----------------------------***END***----------------------------
Note: - Do not alter or erase the exam paper; hand it back after the exam
Vietnam Maritime University
Faculty of Information Technology
DEPARTMENT OF INFORMATION SYSTEMS
-----***-----
FINAL EXAMINATION
Course: DATA MINING
Academic year: x
Exam paper no.:
Approved by:
TID     Items
T100    {M, O, N, K, E, Y}
T200    {D, O, N, K, E, Y}
T300    {M, A, K, E}
T400    {M, U, C, K, Y}
T500    {C, O, O, K, I, E}
a. Find all frequent itemsets using the Apriori algorithm.
b. List all strong association rules (with support s and confidence c) that match the following metarule, where X is a variable representing customers and item_i are variables representing items (for example A, B, ...)
Question 3: (2 points)
What are the steps of the data mining process?
Question 4: (2 points)
Smooth the following data set using rounding:
Y = {1.17, 2.59, 3.38, 4.23, 2.67, 1.73, 2.53, 3.28, 3.44}
Then present the resulting set at the precisions:
a. 0.1
b. 1.
Vietnam Maritime University
Faculty of Information Technology
DEPARTMENT OF INFORMATION SYSTEMS
-----***-----
FINAL EXAMINATION
Course: DATA MINING
Academic year: x
Exam paper no.:
Approved by:
2,000    500      2,500
1,000    1,500    2,500
3,000    2,000    5,000
a. Suppose the association rule ... with min_sup = 30% and min_conf = 70%. Is this rule a strong association rule? Explain.
b. Based on the data given, is buying hot-dogs independent of buying hamburgers? If not, what is the relationship between the two items?
Question 3: (2 points)
Explain the differences between classification and clustering.
Question 4: (2 points)
Given the sample set with missing values
X1 = {0, 1, 1, 2}
X2 = {2, 1, , 1}
X3 = {1, , , 0}
X4 = {, 2, 1, }
----------------------------***END***----------------------------
Note: - Do not alter or erase the exam paper; hand it back after the exam
Vietnam Maritime University
Faculty of Information Technology
DEPARTMENT OF INFORMATION SYSTEMS
-----***-----
FINAL EXAMINATION
Course: DATA MINING
Academic year: x
Exam paper no.:
Approved by:
TID     Items
T100    {M, O, N, K, E, Y}
T200    {D, O, N, K, E, Y}
T300    {M, A, K, E}
T400    {M, U, C, K, Y}
T500    {C, O, O, K, I, E}
a. Find all frequent itemsets using the Apriori algorithm.
b. List all strong association rules (with support s and confidence c) that match the following metarule, where X is a variable representing customers and item_i are variables representing items (for example A, B, ...)
Question 3: (2 points)
Explain the concept of prediction, give an example, and analyze it.
Question 4: (2 points)
If the items are organized so that A = {A1, A2, A3}, B = {B1, B2}, C = {C1, C2, C3},
D = {D1, D2} and E = {E1, E2}