
CHAPTER 1.

GENERAL INTRODUCTION TO DATA MINING
Contents
1. The need for data mining (DM)
2. The concepts of DM and knowledge discovery in databases (KDD)
3. DM versus traditional database processing
4. Some typical applications of DM
5. Data types in DM
6. Typical DM tasks
7. The interdisciplinary nature of DM

1. The need for data mining
The data explosion
Technological causes
Social causes
Manifestations
Data-driven economic sectors
The knowledge economy
Knowledge discovery from data

Data explosion: Moore's law
Origin
Gordon E. Moore (1965). Cramming more components onto integrated circuits, Electronics, 38 (8), April 19, 1965. An observation and a forecast.
The "2x rule"
The number of transistors integrated on a chip doubles roughly every two years.
The cost of manufacturing a semiconductor circuit with the same functionality halves every two years.
18-month version: a shortened cycle.
Steering the semiconductor industry
The basic model for the semiconductor industry.
Paul S. Otellini, President and CEO of Intel Corporation: Moore's law still provides the basic capability for our growth, and it remains well in force at Intel. Moore's law is not only about semiconductor circuits; it is also about the creative use of them.
Daniel Grupp, Director of Advanced Technology Development, Acorn Technologies, Inc. (http://acorntech.com/): the whole cycle of design, development, manufacturing, distribution, and sales is considered sustainable when it follows Moore's law. If the industry beats Moore's law, the market cannot absorb all the new products and engineers lose their jobs. If it falls behind Moore's law, there is nothing to buy, and the burden falls on the shoulders of the product distribution chain.
Driving data processing, storage, and transmission technology
Semiconductor technology is the foundation of the electronics industry.
Moore's law and the computer hardware industry: Intel processors over the past 40 years (next slide).
An explosion in computing power and data storage capacity.
An impact on the development of database technology (organizing and managing data) and network technology (transmitting data).

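The doubling rule above can be written as N(t) = N0 * 2^(t / T), where T is the doubling period. The following is a minimal sketch under that reading; the function name `growth_factor` and the 10-year horizon are illustrative assumptions, not taken from the slides.

```python
def growth_factor(years: float, doubling_period_years: float) -> float:
    """Multiplicative growth after `years` when a quantity doubles every period."""
    return 2.0 ** (years / doubling_period_years)


# Compare the two-year rule with the 18-month version over an assumed 10 years.
two_year = growth_factor(10, 2.0)
eighteen_month = growth_factor(10, 1.5)

# Under the cost-halving reading, the cost factor is the reciprocal.
cost_factor_two_year = 1.0 / two_year

print(f"2-year doubling:   x{two_year:.0f} after 10 years")
print(f"18-month doubling: x{eighteen_month:.0f} after 10 years")
```

Shortening the cycle from 24 to 18 months roughly triples the growth accumulated over a decade, which is why the choice of period matters when the rule is used for planning.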
Moore's law & the electronics industry
"Another decade is probably straightforward... There is certainly no end to creativity."
Gordon Moore, Intel Chairman Emeritus of the Board, speaking of extending Moore's Law at the International Solid-State Circuits Conference (ISSCC), February 2003.

Moore's law: Intel processors
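Values and names of typical unit multiples and submultiples
The system of submultiples and multiples of measurement units
Digitization capability
Diverse digitizing devices
Every field: management, commerce, science, and so on
A typical example: SDSS
Sloan Digital Sky Survey
http://www.sdss.org/
Created a 3-dimensional map containing more than 930,000 galaxies and more than 120,000 quasars
The first telescope
In operation since 2000
In its first few weeks it collected as much astronomical data as had been gathered in all previous history. After 10 years: 140 TB
The successor telescope
Large Synoptic Survey Telescope
Starts operating in 2016. After 5 days it will have 140 TB
Data collection and storage devices
Evolution of database technology [HK0106]
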
Data explosion: database technology
Database technology: some large databases
Top 10 largest databases
http://top-10-list.org/2010/02/16/top-10-largest-databases-list/
Library of Congress: 125 million items; Central Intelligence Agency (CIA): 100 records per month (population statistics, maps); Amazon: 250 million books, 55 million users, 40 TB; ChoicePoint: 75 times the Earth-Moon distance; Sprint: 70,000 telecommunications records; Google: 90 million searches/day; AT&T: 310 TB; World Data Centre for Climate
The US national energy research scientific computing center
National Energy Research Scientific Computing Center (NERSC)
March 2010: about 460 TB
http://www.nersc.gov/news/annual_reports/annrep0809/annrep0809.pdf
YouTube
After two years: hundreds of millions of videos
The YouTube database doubles in size every 5 months

Data explosion: network technology
Total IP traffic on the network
Source: Cisco white paper, 2010
2010: 20,396 PB/month; 2009-2014: average annual growth of 34%
Web
13.5 billion indexed web pages (23 January 2011)
Source: http://www.worldwidewebsize.com/

Data explosion: new data creators
A widening range of data creators
The share of data newly created by users keeps growing
Online user systems, social networks, and so on
The social network Facebook holds up to 40 billion photos
2010: 900 EB created by users (out of 1,260 EB in total). Source: IDC Digital Universe Study, sponsored by EMC, May 2010

Data explosion: cost and manifestation
Source: IDC Digital Universe Study, sponsored by EMC, May 2010
Creating data keeps getting cheaper
The cost of creating new data trends steadily downward
From 0.5 US cents per GB in 2009 down to 0.02 US cents per GB in 2020
Total volume is growing
The growth curve keeps getting steeper
Reaching 35 ZB in 2020

Data explosion versus growth of the IT workforce
Information volume grows 67-fold; the number of data objects grows 67-fold
The IT workforce grows 1.4-fold
Source: IDC Digital Universe Study, sponsored by EMC, May 2010.

The need to get a grip on data
Jim Gray, Microsoft researcher, 1998 Turing Award winner:
We are deluged by data: scientific data, medical data, demographic data, financial data, and marketing data. People have no time to look at this data. Human attention has become the precious resource. So, we must find ways to automatically analyze the data, to automatically classify it, to automatically summarize it, to automatically discover and characterize trends in it, and to automatically flag anomalies.
This is one of the most active and exciting areas of the database research community. Researchers in areas including statistics, visualization, artificial intelligence, and machine learning are contributing to this field. The breadth of the field makes it difficult to grasp the extraordinary progress over the last few decades [HK0106].
Kenneth Cukier:
Information has gone from scarce to superabundant. That brings huge new benefits and makes it possible to do many things that previously could not be done: spotting business trends, preventing diseases, combating crime, and so on.
Managed well, such data can be used to unlock new sources of economic value, provide fresh insights into science, and bring benefits to governance. http://www.economist.com/node/15557443?story_id=15557443

The need to acquire knowledge from data
The knowledge economy
Knowledge is the basic resource
The use of knowledge is the key driver of economic growth
Figure: in 2003, the contribution of knowledge to the growth of Korea's GDP per capita was twice the contribution of labor and capital. TFP: Total Factor Productivity (The World Bank. Korea as a Knowledge Economy, 2006)

The knowledge economy
The service economy
Human society is shifting from a goods economy to a service economy. Service labor overtook agricultural labor (2006).
Every economy is a service economy.
The unit of exchange in the economy and in society is the service
Services: data & information → knowledge → new value
Science: data & information → knowledge
Engineering: knowledge → services
Management: affects the whole service delivery process
Jim Spohrer (2006). A Next Frontier in Education, Employment, Innovation, and Economic Growth, IBM Corporation, 2006

The service economy: from data to value
The data management and analytics industry
"We are drowning in data but starving for knowledge"
Worth more than US$100 billion in 2010
Growing 10% a year, almost twice as fast as the software business in general
In recent years large corporations have spent about US$15 billion buying data analytics companies
Compiled by Kenneth Cukier
The data science workforce
CIOs and data analysts play an increasingly important role
Data analyst: programmer + statistician + data "craftsman". The US has a standard defining the role
See the essay "Tản mạn về cơ hội trong ngành Thống kê (và KHMT)" (Musings on opportunities in Statistics (and Computer Science)) by Nguyễn Xuân Long, 3 July 2009.
http://www.procul.org/blog/2009/07/03/t%e1%ba%a3n-m%e1%ba%a1n-v%e1%bb%81-c%c6%a1-h%e1%bb%99i-trong-nganh-th%e1%bb%91ng-ke-va-khmt/

Data-driven economic sectors
September 9, 2014 19
The concept of KDD
Knowledge discovery from databases
Extracting interesting patterns or knowledge (non-trivial, implicit, previously unknown, and potentially useful) from a large collection of data
KDD and data mining: confused names? According to the two authors, data mining is one step in the KDD process

The KDD process [FPS96]
[FPS96] Usama M. Fayyad, Gregory Piatetsky-Shapiro, Padhraic Smyth (1996). From Data Mining to Knowledge Discovery: An Overview, Advances in Knowledge Discovery and Data Mining 1996: 1-34

Steps in the KDD process
Learning the application domain
Relevant prior knowledge and the goals of the application
Creating a target data set: data selection
Data preparation and preprocessing (can take up to 60% of the effort!)
Data reduction and transformation
Finding useful features, reducing dimensions/variables, finding invariant representations
Choosing the data mining function
Summarization, classification, regression, association, clustering
Choosing the data mining algorithm(s)
The data mining step: searching for interesting patterns
Pattern evaluation and knowledge presentation
Visualization, transformation, removal of redundant patterns, etc.
Using the discovered knowledge

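The KDD steps listed above can be sketched as a chain of small functions. This is only an illustration under stated assumptions: the toy records, the debt-to-income feature, the threshold rule, and every function name are hypothetical and do not come from the slides or from [FPS96].

```python
# Hypothetical raw records: (customer_id, income, debt, repaid_loan)
RAW = [
    (1, 52.0, 10.0, True),
    (2, 18.0, 25.0, False),
    (3, None, 5.0, True),   # missing income: removed during cleaning
    (4, 75.0, 30.0, True),
    (5, 22.0, 40.0, False),
    (6, 40.0, 12.0, True),
]


def select_target(rows):
    """Create the target data set: keep only the fields the task needs."""
    return [(income, debt, repaid) for _, income, debt, repaid in rows]


def clean(rows):
    """Preprocessing: drop records with missing values."""
    return [r for r in rows if None not in r]


def transform(rows):
    """Reduction/transformation: derive one feature, the debt-to-income ratio."""
    return [(debt / income, repaid) for income, debt, repaid in rows]


def mine(rows):
    """Mining step: search thresholds t for the rule 'ratio < t => repaid'."""
    best_t, best_acc = None, -1.0
    for t in sorted({ratio for ratio, _ in rows}):
        hits = sum((ratio < t) == repaid for ratio, repaid in rows)
        acc = hits / len(rows)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t, best_acc


def evaluate(pattern, min_accuracy=0.8):
    """Pattern evaluation: keep the pattern only if it is accurate enough."""
    _, acc = pattern
    return acc >= min_accuracy


pattern = mine(transform(clean(select_target(RAW))))
print("threshold, accuracy:", pattern, "accepted:", evaluate(pattern))
```

In a real project each stage is far larger, and the process loops back to earlier steps when evaluation rejects the patterns found.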
Related concepts
Alternative names
knowledge extraction,
information discovery,
information harvesting,
data archaeology / data dredging,
data/pattern analysis/processing,
business intelligence (BI),
and others

A distinction: is everything data mining? The following are not:
Deductive query processing.
Expert systems or small machine learning/statistical programs

The iterative KDD process model [CCG98]
An improved model of the KDD process
Business orientation: identify 1-3 questions or goals that support the KDD objective
Actionable results: identify a set of actionable results based on the evaluated models
Iteration in the style of the software development life cycle
[CCG98] Kenneth Collier, Bernard Carey, Ellen Grusy, Curt Marjaniemi, Donald Sautter (1998). A Perspective on Data Mining, Technical Report, Northern Arizona University.

The CRISP-DM model (2000)
The industry-standard reference process for data mining
Phases of the CRISP-DM (Cross-Industry Standard Process for Data Mining) process model. "Business understanding": understanding the problem and assessing it
Deployment happens only after the results are checked against the "business understanding"
CRISP-DM 2.0 SIG Workshop, London, 18 January 2007
Source: http://www.crisp-dm.org/Process/index.htm (13 February 2011)

The knowledge development cycle through data mining
Wang, H. and S. Wang (2008). A knowledge management approach to data mining process for business intelligence, Industrial Management & Data Systems, 108(5): 622-634. [Oha09]
The integrated DM-BI model [WW08]

Data and patterns
Data (a data set)
A finite set F of cases (facts).
In KDD, F must contain a very large number of cases.
Pattern
In KDD: a language L for describing subsets of the facts (data) in the fact set F.
Pattern: an expression E in the language L ⇔ the subset F_E of the facts in F that E describes. E is called a pattern if it is simpler than enumerating all the facts in F_E.
For example, the expression "INCOME < $t" (a model containing the single variable INCOME)

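The relationship between an expression E and its fact subset F_E can be made concrete with a toy fact set. In this sketch the data values, the threshold 35, and the helper names `make_pattern` and `facts_described` are all illustrative assumptions.

```python
# Hypothetical fact set F: (income, debt) for each loan applicant.
F = [(15, 40), (22, 35), (30, 10), (48, 20), (60, 5), (75, 30)]


def make_pattern(t):
    """Expression E in a tiny language L: 'INCOME < t'."""
    return lambda fact: fact[0] < t


def facts_described(E, facts):
    """F_E: the subset of F that satisfies the expression E."""
    return [f for f in facts if E(f)]


E = make_pattern(35)
F_E = facts_described(E, F)
print(F_E)
```

The expression needs one variable and one constant, whatever the size of F_E, which is the sense in which a pattern is simpler than enumerating the facts it covers.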
Validity
A discovered pattern must be valid on new data with some degree of certainty.
"Validity": a measure of validity (certainty) is a function C that maps an expression in the pattern language L to a (partially or totally) ordered measurement space M_C.
For example, if the boundary that defines the pattern "INCOME < $t" is shifted to the right (INCOME takes larger values), the certainty decreases, because more good-loan cases are pulled into the no-loan region.
A pattern of the form a*INCOME + b*DEBT < 0 is more valid.

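The effect of shifting the boundary t can be checked numerically. The sketch below assumes one possible certainty measure, the share of covered cases that actually defaulted; both that choice of C and the labelled toy data are assumptions made for illustration.

```python
# Hypothetical labelled facts: (income, defaulted)
F = [(12, True), (18, True), (25, True), (31, False),
     (38, True), (45, False), (52, False), (67, False)]


def certainty(t, facts):
    """C(E, F) for E = 'INCOME < t': share of covered facts that defaulted."""
    covered = [defaulted for income, defaulted in facts if income < t]
    return sum(covered) / len(covered) if covered else 0.0


# Shift the boundary to the right and watch the certainty fall.
for t in (30, 40, 60):
    print(f"t = {t}: certainty = {certainty(t, F):.2f}")
```

With this data the certainty falls from 1.0 at t = 30 to 0.8 at t = 40 and lower still at t = 60, because applicants who repaid are swept into the "no loan" region.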
Novelty and potential usefulness
Novelty: the pattern must be new within some frame of reference, at least to the system under consideration.
Novelty can be measured with respect to:
changes in the data: comparing current values with past or expected values,
or knowledge: how the new knowledge relates to existing knowledge.
In general, this can be measured by a function N(E,F), which is either a measure of novelty or a measure of unexpectedness.
Potential usefulness: the pattern should be able to lead to useful actions, as measured by a utility function.
The function U maps expressions in L to a (partially or totally) ordered measurement space M_U: u = U(E,F).
For example, in the loan data set, this function could be the expected increase in the bank's profit (in monetary units) associated with the decision rule shown in Figure 1.3.

Understandability, interestingness, and knowledge
Understandability: the pattern must be understandable.
In KDD: patterns that people understand more easily than the underlying data.
Hard to measure precisely: "understandable" is approximated by "simple".
Several simplicity measures exist:
They range from syntactic (the size of the pattern in bits) to semantic (how easily people can comprehend it in some setting).
Assume that understandability is measured by a function S that maps an expression E in L to a (partially or totally) ordered measurement space M_S: s = S(E,F).
Interestingness: an overall measure of a pattern that combines validity, novelty, usefulness, and simplicity.
Either use an interestingness function i = I(E, F, C, N, U, S) that maps expressions in L to a measurement space M_I,
or define interestingness directly through an ordering of the discovered patterns.
Knowledge: a pattern E ∈ L is called knowledge if, for some class of users, there is a threshold i ∈ M_I such that the interestingness I(E,F,C,N,U,S) > i.

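The threshold definition of knowledge can be illustrated with a toy interestingness function. The slides do not fix the form of I; the weighted sum, the weights, the scores, and the threshold below are all assumptions chosen only to show the mechanics.

```python
def interestingness(c, n, u, s, weights=(0.4, 0.2, 0.2, 0.2)):
    """One possible I: a weighted sum of validity, novelty, usefulness,
    and simplicity scores, each assumed to lie in [0, 1]."""
    return sum(w * x for w, x in zip(weights, (c, n, u, s)))


def is_knowledge(score, threshold):
    """A pattern counts as knowledge when its interestingness exceeds
    the threshold chosen by a given class of users."""
    return score > threshold


# Hypothetical scores (c, n, u, s) for two discovered patterns.
strong = interestingness(0.9, 0.5, 0.7, 0.8)
weak = interestingness(0.55, 0.1, 0.2, 0.9)

USER_THRESHOLD = 0.6  # assumed; a different user class could choose another
print(is_knowledge(strong, USER_THRESHOLD), is_knowledge(weak, USER_THRESHOLD))
```

Because the threshold belongs to a class of users, the same pattern can be knowledge for one audience and not for another.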
Typical architecture of a data mining system
Data mining and database management
Questions for a database management system (DBMS)
Show the amount for Mr. Smith on January 5? → an individual record, answered by on-line transaction processing (OLTP).
How many foreign investors bought stock X last month? → a statistical summary, answered by a statistical decision support system (DSS).
Show every stock in the database whose price has risen? → multidimensional data, answered by on-line analytical processing (OLAP).
All of these require a "complete" assumption about complex domain knowledge!

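The three kinds of DBMS question above can be sketched with Python's standard `sqlite3` module. The table `trades`, its columns, and its rows are hypothetical, and the third query is only an OLAP-style roll-up by stock and month, not the exact question on the slide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE trades ("
    "investor TEXT, is_foreign INTEGER, stock TEXT, day TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO trades VALUES (?, ?, ?, ?, ?)",
    [
        ("Smith", 0, "X", "2011-01-05", 1200.0),
        ("Lee", 1, "X", "2011-01-10", 500.0),
        ("Kato", 1, "X", "2011-01-12", 800.0),
        ("Lee", 1, "X", "2011-01-20", 300.0),
        ("Kato", 1, "Y", "2011-02-03", 700.0),
    ],
)

# OLTP-style question: retrieve one individual record.
smith = conn.execute(
    "SELECT amount FROM trades WHERE investor = ? AND day = ?",
    ("Smith", "2011-01-05"),
).fetchall()

# DSS-style question: a statistical summary over many records.
foreign_buyers = conn.execute(
    "SELECT COUNT(DISTINCT investor) FROM trades "
    "WHERE is_foreign = 1 AND stock = 'X' AND day LIKE '2011-01-%'"
).fetchone()[0]

# OLAP-style question: totals along two dimensions (stock, month).
cube = conn.execute(
    "SELECT stock, substr(day, 1, 7) AS month, SUM(amount) "
    "FROM trades GROUP BY stock, month ORDER BY stock, month"
).fetchall()

print(smith, foreign_buyers, cube)
```

In each query the person asking already knows which table, columns, and conditions are relevant, which is the "complete" domain knowledge assumption named above.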
The concept of data mining: questions for a DMS
Questions for a data mining system (DMS)
What characterizes the stocks whose prices are rising?
What characterizes the US$ - DMark exchange rate?
What can be expected of stock X next week?
How many union members will be unable to repay their debts next month?
What characterizes the people who buy product Y?

The "complete" knowledge assumption is no longer essential; knowledge has to be added to the system → improve (upgrade) the knowledge domain!

Database systems and data mining systems
DATA MINING AND BUSINESS INTELLIGENCE
Increasing potential to support business decisions (layers, top to bottom):
Making decisions
Data presentation: visualization techniques
Data mining: information discovery
Data exploration: statistical analysis, querying and reporting
Data warehouses / data marts: OLAP, MDA
Data sources: papers, files, information providers, database systems, OLTP
Roles (top to bottom): end user, business analyst, data analyst, database administrator (DBA)

Basic applications of data mining
Data analysis and decision support
Market analysis and management
Target marketing, customer relationship management (CRM), buying-habit (market basket) analysis, cross-selling, market segmentation
Risk analysis and management
Forecasting, customer retention, improved underwriting, quality control, competitive analysis
Fraud detection and detection of unusual patterns (outliers)
Other applications
Text mining (newsgroups, email, documents) and Web mining
Stream data mining
DNA and biological data analysis

Market analysis and management
Where does the data come from?
Credit card transactions, loyalty cards, discount coupons, customer complaints, plus (public) lifestyle studies
Target marketing
Find clusters of model customers who share the same characteristics: interests, income level, spending habits, and so on
Determine customer purchasing patterns over time
Cross-market analysis
Associations/correlations between product sales, and prediction based on those associations
Customer profiling
What types of customers buy what products (clustering and classification)
Customer requirement analysis
Identify the best products for different customers
Predict which factors will attract new customers
Provision of summary information
Multidimensional summary reports
Statistical summary information (central tendency and variation of the data)

Corporate analysis & risk management
Financial planning and asset evaluation
Cash flow analysis and forecasting
Contingent claim analysis to evaluate assets
Cross-sectional and time-series analysis (financial ratios, trend analysis, and so on)
Resource planning
Summarize and compare resources and spending
Competition
Monitor competitors and market directions
Group customers into classes and set prices per class
Set pricing strategy in a highly competitive market

Fraud detection and mining rare patterns
Approaches: clustering & building fraud models, outlier analysis
Applications: health care, retail, credit card services, telecommunications.
Auto insurance: rings of collisions
Money laundering: suspicious monetary transactions
Medical insurance
Professional patients, rings of doctors, and rings of references
Unnecessary or correlated screening tests
Telecommunications: fraudulent calls
Call model: destination of the call, duration, time of day or week. Analyze patterns that deviate from an expected norm
Retail industry
Analysts estimate that 38% of retail shrink is due to dishonest employees
Anti-terrorism

Other applications
Sports
IBM Advanced Scout analyzed NBA game statistics (shots blocked, assists, and fouls) to give a competitive advantage to the New York Knicks and Miami Heat
Astronomy
JPL and the Palomar Observatory discovered 22 quasars with the help of data mining
Internet Web Surf-Aid
IBM Surf-Aid applies data mining algorithms to Web access logs for market-related pages to discover customer preferences and behavior, analyze the effectiveness of Web marketing, improve Web site organization, and so on

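Data mining: classification scheme (functions)
General functions
Descriptive data mining: summarization, clustering, association rules, and so on
Predictive data mining: classification, regression, and so on
Typical tasks
Concept description
Association
Classification
Clustering
Regression
Dependency modeling
Change and deviation detection
Pattern-directed analysis and other tasks
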
Data mining: classification scheme (functions)
Concept description: characterization and discrimination
Find the characteristics and properties of a concept
Generalize, summarize, and discover constrained or contrasting characteristics, for example dry regions compared with wet ones
A typical description task: summarization (finding a compact description)
Mean, variance
Text summarization
Association
Association among data variables (correlation and causality)
Diaper → Beer [0.5%, 75%] (support, confidence)
Association rule: X → Y
For example, in Web data mining
Discovering semantic relations
Relating web page content to user interests

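The two numbers attached to a rule X → Y, support and confidence, can be computed directly from a transaction list. The eight baskets below are invented for illustration and chosen so that the confidence matches the 75% in the Diaper → Beer example; the support in this toy data is much higher than the 0.5% on the slide.

```python
# Hypothetical market baskets.
transactions = [
    {"diaper", "beer", "milk"},
    {"diaper", "beer"},
    {"diaper", "beer", "bread"},
    {"diaper", "milk"},
    {"beer", "chips"},
    {"milk", "bread"},
    {"bread"},
    {"milk", "chips"},
]


def support(itemset, baskets):
    """Fraction of baskets that contain every item in `itemset`."""
    return sum(itemset <= basket for basket in baskets) / len(baskets)


def confidence(lhs, rhs, baskets):
    """Estimate of P(rhs | lhs): support(lhs and rhs) / support(lhs)."""
    return support(lhs | rhs, baskets) / support(lhs, baskets)


rule_support = support({"diaper", "beer"}, transactions)
rule_confidence = confidence({"diaper"}, {"beer"}, transactions)
print(f"Diaper -> Beer [{rule_support:.1%}, {rule_confidence:.0%}]")
```

Support measures how often the rule applies at all, while confidence measures how reliable it is when it does apply; a useful rule usually has to clear a minimum on both.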
Data mining tasks: data mining functions
Classification and prediction
Build models (functions) that describe and distinguish classes or concepts, for use in future prediction
For example, classify countries by climate, or classify cars by fuel consumption
Presentation: decision trees, classification rules, neural networks
Predict unknown or missing numeric values

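Data mining: classification scheme (functions)
Classification
Build or describe a predictive model or function that describes or detects classes or concepts for later prediction
Learn a function that maps data into one of several known classes
Clustering
Group data into "clusters" (new classes) to discover the distribution patterns of the data in the application domain.
Similarity
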
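The contrast between classification (classes known in advance) and clustering (classes discovered from similarity) can be shown on one-dimensional data. This is a minimal sketch: the fuel-consumption values, the class names, the nearest-centroid classifier, and the two-cluster procedure are all assumptions chosen for brevity, and `two_means` expects at least two distinct values.

```python
from statistics import mean

# Hypothetical fuel consumption of six cars, in litres per 100 km.
values = [4.5, 5.0, 5.5, 9.0, 10.0, 11.0]
labels = ["low", "low", "low", "high", "high", "high"]


def fit_centroids(xs, ys):
    """Classification: learn one centroid per known class label."""
    groups = {}
    for x, y in zip(xs, ys):
        groups.setdefault(y, []).append(x)
    return {label: mean(members) for label, members in groups.items()}


def classify(x, centroids):
    """Map a new value to the known class with the nearest centroid."""
    return min(centroids, key=lambda label: abs(x - centroids[label]))


def two_means(xs, rounds=10):
    """Clustering: split unlabelled values into two groups by similarity."""
    lo, hi = min(xs), max(xs)  # initial cluster centres
    for _ in range(rounds):
        a = [x for x in xs if abs(x - lo) <= abs(x - hi)]
        b = [x for x in xs if abs(x - lo) > abs(x - hi)]
        lo, hi = mean(a), mean(b)
    return sorted(a), sorted(b)


centroids = fit_centroids(values, labels)
print(classify(6.2, centroids))   # uses the given class labels
print(two_means(values))          # never sees the labels
```

Both procedures rely on the same similarity notion (distance between values); the difference is whether the class labels are supplied or have to be discovered.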
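Data mining functions (2)
Cluster analysis
Class labels unknown: group the data into new classes, for example cluster houses to find distribution patterns
Maximize similarity within a cluster & minimize similarity between clusters
Outlier analysis
Outlier: a data object that does not follow the general behavior of the data as a whole. It can be detected, for example, using the sample mean and sample variance
Noise or exception? Not necessarily! Useful for fraud detection and rare-event analysis
Change and deviation detection
Mostly concerned with significant changes relative to previously measured or normative values; provides knowledge about change and deviation
Change and deviation detection vs. preprocessing
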
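Outlier detection with the sample mean and sample variance, as mentioned above, can be sketched in a few lines. The call durations and the cut-off of two standard deviations are assumptions for illustration; real data often needs a more robust rule.

```python
from statistics import mean, stdev

# Hypothetical call durations in minutes; one call is unusually long.
durations = [3, 4, 5, 4, 3, 5, 4, 60]


def outliers(values, k=2.0):
    """Values lying more than k sample standard deviations from the mean."""
    m, s = mean(values), stdev(values)
    return [v for v in values if abs(v - m) > k * s]


print(outliers(durations))
```

The flagged value is not discarded as noise here; in fraud detection it is exactly the object of interest.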
Data mining: classification scheme (functions)
Regression
Learn a function that maps data to the real value of one variable in terms of several other variables
Typical in statistical analysis and forecasting
Predict the value of one or more dependent variables from the values of a set of independent variables.
Dependency modeling
Build a dependency model: find a model that describes significant dependencies among variables
Structural level:
in graph form
which variables depend locally on which other variables
Quantitative level: the strength of the dependencies, measured on a numeric scale

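For a single independent variable, the regression function can be fitted by ordinary least squares. The sketch below assumes a straight-line model y = slope * x + intercept and uses invented data points; neither the model form nor the data comes from the slides.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical observations of an independent and a dependent variable.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

slope, intercept = fit_line(xs, ys)
prediction = slope * 6 + intercept  # predict the dependent variable at x = 6
print(f"y = {slope:.2f} * x + {intercept:.2f}; y(6) = {prediction:.2f}")
```

The fitted coefficients also give the quantitative level of a dependency model: the slope states how strongly the dependent variable responds to the independent one.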
Data mining: classification scheme (functions)
Trend and evolution analysis
Trend and deviation: regression analysis
Sequential pattern mining, periodicity analysis
Similarity-based analysis
Other pattern-directed or statistical analyses

Data mining: classification scheme (2)
Classification by view
The kind of data to be mined
The kind of knowledge to be discovered
The kind of techniques used
The kind of application domain

A multidimensional view of data mining
Data to be mined
Relational, data warehouse, transactional, stream, object-oriented/object-relational, active, spatial, time-series, text, multimedia, heterogeneous, legacy, WWW
Knowledge to be mined
Characterization, discrimination, association, classification, clustering, trend/deviation, outlier analysis, and so on
Multiple/integrated functions and mining at multiple levels
Techniques used
Database-oriented, data warehouse (OLAP), machine learning, statistics, visualization, and so on
Applications adapted
Retail, telecommunications, banking, fraud analysis, biological data mining, stock market analysis, text mining, Web mining, and so on

Data mining: kinds of data
Relational databases
Data warehouses
Transactional databases
Advanced databases and information repositories
Object-relational databases
Spatial and temporal data
Time-series data
Stream data
Multimedia data
Heterogeneous and legacy data
Text databases & the WWW

Kinds of data analyzed/mined, 8/2009
http://www.kdnuggets.com/polls/2010/data-types-analyzed.html
http://www.kdnuggets.com/polls/2010/data-miner-salary.html
http://www.kdnuggets.com/polls/2009/largest-database-data-mined.htm

Are all mined patterns interesting?
Data mining can generate thousands of patterns: not all of them are interesting
Suggested approach: user-directed, query-based, goal-directed mining
Interestingness measures
A pattern is interesting if it is easily understood, valid on new or test data with some degree of certainty, potentially useful, novel, or confirms a hypothesis the user seeks to validate.
Objective and subjective interestingness measures
Objective: based on the statistics and structure of patterns, for example support, confidence, and so on
Subjective: based on the user's beliefs about the data, for example unexpectedness, novelty, actionability, and so on

Can we find all and only the interesting patterns?
Finding all the interesting patterns: the completeness problem
Can a data mining system find every interesting pattern?
Heuristic search vs. exhaustive search
Association vs. classification vs. clustering
Finding only the interesting patterns: an optimization problem
Can a data mining system find exactly the interesting patterns?
Approaches
First generate all the patterns, then filter out the uninteresting ones.
Generate only the interesting patterns: mining query optimization
Data Warehousing and Data Mining: Chapter 1

Data mining: a confluence of multiple disciplines
Figure: data mining at the center, surrounded by database systems, statistics, machine learning, visualization, algorithms, and other disciplines

Mathematical statistics and data mining
Data mining and statistics have much in common:
in particular exploratory data analysis (EDA) as well as prediction [Fied97, HD03].
KDD systems often embed particular statistical procedures for modeling data and handling noise within an overall knowledge discovery framework.
Statistics-based data mining methods receive special attention.

Mathematical statistics and data mining
Distinguishing statistical problems from data mining problems
Statistical hypothesis testing: a hypothesis and a set of observed data are given. The task is to check whether the observed data are consistent with the statistical hypothesis, that is, whether the hypothesis holds over all of the observed data.
Learning in data mining: no model is given in advance. The resulting model must fit the entire data population, so the model parameters must not depend on how the training set was chosen. Learning therefore requires training and test sets that are "representative" of all data in the application domain and independent of each other. In some cases both sets (or the test set) are published in a standard form.
Terminology: data mining speaks of output/target variables, data mining algorithms, attributes/features, records, and so on; statistical data processing speaks of dependent variables, statistical procedures, explanatory variables, observations, and so on.
See also Nguyễn Xuân Long

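The requirement that training and test data be independent of each other is usually met by splitting one data set at random. The helper below is a minimal sketch; the function name `split`, the 25% test fraction, and the fixed seed are assumptions, and random splitting alone does not guarantee that either part is representative of the application domain.

```python
import random


def split(records, test_fraction=0.25, seed=0):
    """Shuffle a copy of the records and cut it into disjoint train/test parts."""
    rng = random.Random(seed)      # fixed seed so the split is repeatable
    shuffled = list(records)
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


records = list(range(20))          # stand-in for 20 data records
train, test = split(records)
print(len(train), "training records,", len(test), "test records")
```

A model is then fitted on `train` only, and its quality is reported on `test`, so that the reported figure does not depend on records the model has already seen.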
Reference sources on data mining
Data mining and KDD (SIGKDD: CDROM)
Conferences: ACM-SIGKDD, IEEE-ICDM, SIAM-DM, PKDD, PAKDD, etc.
Journal: Data Mining and Knowledge Discovery, KDD Explorations
Database systems (SIGMOD: CD ROM)
Conferences: ACM-SIGMOD, ACM-PODS, VLDB, IEEE-ICDE, EDBT, ICDT, DASFAA
Journals: ACM-TODS, IEEE-TKDE, JIIS, J. ACM, etc.
AI & Machine Learning
Conferences: Machine learning (ML), AAAI, IJCAI, COLT (Learning Theory), etc.
Journals: Machine Learning, Artificial Intelligence, etc.
Statistics
Conferences: Joint Stat. Meeting, etc.
Journals: Annals of statistics, etc.
Visualization
Conference proceedings: CHI, ACM-SIGGraph, etc.
Journals: IEEE Trans. visualization and computer graphics, etc.
Other references
http://www.kdnuggets.com/
List of references
Future Directions in Computer Science

A brief history of the data mining community
1989 IJCAI Workshop on Knowledge Discovery in Databases (Piatetsky-
Shapiro)
Knowledge Discovery in Databases (G. Piatetsky-Shapiro and W. Frawley, 1991)
1991-1994 Workshops on Knowledge Discovery in Databases
Advances in Knowledge Discovery and Data Mining (U. Fayyad, G. Piatetsky-Shapiro, P. Smyth,
and R. Uthurusamy, 1996)
1995-1998 International Conferences on Knowledge Discovery in Databases
and Data Mining (KDD95-98)
Journal of Data Mining and Knowledge Discovery (1997)
1998 ACM SIGKDD, SIGKDD1999-2001 conferences, and SIGKDD
Explorations
More conferences on data mining
PAKDD (1997), PKDD (1997), SIAM-Data Mining (2001), (IEEE) ICDM (2001), etc.
KDD.ORG
SIGKDD: http://www.sigkdd.org/index.php
Provides the premier forum for the advancement and adoption of the science of knowledge discovery and data mining
It encourages:
basic research in KDD: conferences, journals, and so on
adoption of market standards for terminology, evaluation, and methodology
interdisciplinary education among researchers, practitioners, and users

KDD 2011
Data mining: the top 20 keywords
Data mining topics are hot right now!

KDD website; data mining & climate change
Causes of climate change:
Nearly 50% of KDnuggets readers believe that current climate change is largely due to human activity, while a significant number are skeptical.
Climate is very complex, and scientists do not claim that human activity is the only cause of climate change.
Consensus with the Intergovernmental Panel on Climate Change: human activity is one of the main causes.
Opinion mining / sentiment mining

Current issues in data mining
Mining methodology
Mining different kinds of knowledge from diverse data such as biological, stream, and web data
Performance: efficiency, effectiveness, and scalability
Pattern evaluation: the interestingness problem
Incorporating domain knowledge: ontologies
Handling noisy and incomplete data
Parallel, distributed, and incremental mining methods
Integrating discovered knowledge with existing knowledge: knowledge fusion
User interaction
Data mining query languages and ad hoc mining
Presentation and visualization of data mining results
Interactive mining of knowledge at multiple levels of abstraction
Applications and social impacts
Domain-specific data mining and invisible data mining
Protecting data security, integrity, and privacy

Some initial requirements
A preliminary list of requirements for a successful data mining project
There must be an expectation of a significant benefit from the data mining results
Either directly, by picking easily gathered "low-hanging fruit" (such as a customer acquisition model for marketing and sales),
or indirectly, by creating high leverage on a vital process with strong ripple effects (reducing bad debts from 10% to 9.8% of a large amount).
There must be a project team with the required skills: data selection, data integration, modeling analysis, and report preparation and presentation. Analysts and business people must work well together
Capture and maintain the accumulated streams of information (for example, model results from a series of marketing campaigns)
Learning takes many cycles and has to "race against practice" (the initial customer acquisition model is not yet optimal).
A compilation of lessons from successful and failed data mining projects
[NEM09] Robert Nisbet, John Elder, and Gary Miner (2009). Handbook of Statistical Analysis and Data Mining Applications, Elsevier, 2009.