DATA MINING

Contents
  1. The need for data mining (DM)
  2. The concepts of DM and knowledge discovery in databases
  3. DM versus traditional database processing
  4. Typical applications of DM
  5. Data types in DM
  6. Typical DM tasks
  7. The interdisciplinary nature of DM

1. The need for data mining
  - The data explosion: technological causes, social causes, manifestations
  - The data-driven economy: the knowledge economy, knowledge discovery from data

Data explosion: Moore's Law
  Origin
  - Gordon E. Moore (1965). Cramming more components onto integrated circuits, Electronics, 38 (8), April 19, 1965.
  - An observation and a forecast.
  The "2x" rule of thumb
  - The number of transistors integrated on a chip doubles roughly every two years.
  - The cost of manufacturing a circuit with the same functionality halves every two years.
  - The 18-month version: a shortened cycle.
  Driving the semiconductor industry
  - It is the basic model for the integrated-circuit industry.
  - Moore's Law still provides the basic capability for our growth, and it still holds well at Intel. Moore's Law is not only about semiconductors; it is also about the creative use of semiconductors. (Paul S. Otellini, President and CEO, Intel Corporation)
  - The whole cycle of design, development, manufacturing, distribution and sales is considered sustainable when it follows Moore's Law. If Moore's Law is beaten, the market cannot absorb all the new products and engineers lose their jobs. If the industry falls behind Moore's Law, there is nothing to buy, and the burden falls on the shoulders of the product distribution chain. (Daniel Grupp, Director of Advanced Technology Development, Acorn Technologies, Inc., http://acorntech.com/)
  Driving data processing, storage and transmission technology
  - Semiconductor technology is the foundation of the electronics industry.
  - Moore's Law and computer hardware: Intel processors over the past 40 years (next slide).
  - An explosion in computing power and data storage capacity.
  - An impact on the development of database technology (organizing and managing data) and network technology (transmitting data).

Moore's Law and the electronics industry
  "Another decade is probably straightforward... There is certainly no end to creativity." Gordon Moore, Intel Chairman Emeritus of the Board, speaking of extending Moore's Law at the International Solid-State Circuits Conference (ISSCC), February 2003.
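The "2x" rule of thumb above is plain compound doubling. The following is a minimal sketch, not part of the original slides; the helper name `growth_factor` and the chosen time spans are illustrative assumptions. It compares the two-year and the 18-month versions of the rule.

```python
def growth_factor(years: float, doubling_period_years: float) -> float:
    """Multiplicative growth after `years` if a quantity doubles every period."""
    return 2.0 ** (years / doubling_period_years)


if __name__ == "__main__":
    # Illustrative horizons only; 40 years matches the Intel span mentioned above.
    for years in (2, 10, 40):
        two_year = growth_factor(years, 2.0)
        eighteen_month = growth_factor(years, 1.5)
        print(f"{years:>2} years: x{two_year:,.0f} (2-year doubling), "
              f"x{eighteen_month:,.0f} (18-month doubling)")
```

The gap between the two versions widens quickly, which is why the length of the doubling period matters when the rule is used for planning.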
Moore's Law: Intel processors

Values and names of typical multiples and submultiples
  The system of multiples and submultiples of measurement units.

Data collection and storage devices
  Digitization capacity
  - Digitization devices are diverse.
  - Every field: management, commerce, science.
  A typical example: SDSS, the Sloan Digital Sky Survey (http://www.sdss.org/)
  - It builds a 3-D map containing more than 930,000 galaxies and more than 120,000 quasars.
  - The first telescope has been working since 2000. In its first few weeks it collected as much astronomical data as had been collected in all previous history. After 10 years: 140 TB.
  - The next telescope, the Large Synoptic Survey Telescope, is to begin operating in 2016. It will collect 140 TB every 5 days.

Evolution of database technology [HK0106]

Data explosion: database technology
  Some large databases
  - Top 10 largest databases: http://top-10-list.org/2010/02/16/top-10-largest-databases-list/
  - Library of Congress: 125 million items; Central Intelligence Agency (CIA): 100 records (population statistics, monthly maps); Amazon: 250 million books, 55 million users, 40 TB; ChoicePoint: 75 times the Earth-Moon distance; Sprint: 70,000 telecommunication records; Google: 90 million searches per day; AT&T: 310 TB; World Data Centre for Climate.
  - National Energy Research Scientific Computing Center (NERSC), March 2010: about 460 TB. http://www.nersc.gov/news/annual_reports/annrep0809/annrep0809.pdf
  - YouTube: hundreds of millions of videos after two years; the size of the YouTube database doubles every 5 months.

Data explosion: network technology
  Total IP traffic (source: CISCO white paper, 2010)
  - 2010: 20,396 PB per month; 2009-2014: average annual growth of 34%.
  Web
  - 13.5 billion indexed web pages (23 January 2011). Source: http://www.worldwidewebsize.com/

Data explosion: new data creators
  More and more agents create data
  - The share of data created by users keeps growing.
  - Online user systems and social networks; Facebook holds up to 40 billion photos.
  - 2010: 900 EB created by users (out of 1,260 EB in total). Source: IDC Digital Universe Study, sponsored by EMC, May 2010.

Data explosion: cost and manifestation
  Source: IDC Digital Universe Study, sponsored by EMC, May 2010.
  Creating data is getting cheaper
  - The cost of creating new data falls from 0.5 US cents per GB in 2009 to 0.02 US cents per GB in 2020.
  Total volume grows
  - The growth curve gets steeper and steeper, reaching 35 ZB in 2020.
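The growth figures quoted above can be related by compound-growth arithmetic. The sketch below is illustrative only: it assumes 1 ZB = 1,000 EB, takes the 1,260 EB (2010) and 35 ZB (2020) totals from the IDC figures above, and uses hypothetical helper names.

```python
def implied_annual_growth(start: float, end: float, years: int) -> float:
    """Constant yearly growth rate that turns `start` into `end` over `years`."""
    return (end / start) ** (1.0 / years) - 1.0


def project(start: float, annual_rate: float, years: int) -> float:
    """Value after `years` of compound growth at `annual_rate`."""
    return start * (1.0 + annual_rate) ** years


if __name__ == "__main__":
    # Total digital data: 1,260 EB in 2010 and 35 ZB (35,000 EB) in 2020.
    rate = implied_annual_growth(1260.0, 35000.0, 10)
    print(f"Implied yearly growth of total data volume: {rate:.1%}")

    # IP traffic: 34% average yearly growth over the five years 2009-2014.
    print(f"Traffic multiplier over 5 years at 34%: x{project(1.0, 0.34, 5):.2f}")
```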
Data explosion versus growth of the IT workforce
  - The volume of information grows 67 times; the number of data objects grows 67 times.
  - The IT workforce grows 1.4 times.
  Source: IDC Digital Universe Study, sponsored by EMC, May 2010.

The need to grasp data
  Jim Gray, Microsoft expert, 1998 Turing Award laureate:
  We are deluged by data: scientific data, medical data, demographic data, financial data, and marketing data. People have no time to look at this data. Human attention has become the precious resource. So we must find ways to automatically analyze the data, automatically classify it, automatically summarize it, automatically discover and characterize trends in it, and automatically flag anomalies. This is one of the most active and exciting areas of the database research community. Researchers in areas including statistics, visualization, artificial intelligence, and machine learning are contributing to this field. The breadth of the field makes it difficult to grasp the extraordinary progress of the last few decades [HK0106].
  Kenneth Cukier:
  Information has gone from scarce to superabundant. That brings huge new benefits and makes it possible to do many things that previously could not be done: spot business trends, prevent diseases, combat crime. Managed well, such data can be used to unlock new sources of economic value, provide fresh insights into science and create benefits for governance. http://www.economist.com/node/15557443?story_id=15557443

The need to acquire knowledge from data
  The knowledge economy
  - Knowledge is the basic resource.
  - The use of knowledge is the key driver of economic growth.
  Figure: in 2003, the contribution of knowledge to the growth of GDP per capita in Korea was twice the contribution of labor and capital. TFP: Total Factor Productivity (The World Bank. Korea as a Knowledge Economy, 2006).
The knowledge economy
  The service economy
  - Human society is shifting from a goods economy to a service economy. Service labor overtook agricultural labor (2006).
  - Every economy is a service economy.
  - The unit of exchange in the economy and in society is the service.
  Services: data and information -> knowledge -> new value
  - Science: data and information -> knowledge
  - Engineering: knowledge -> services
  - Management: acts on the whole service delivery process
  Jim Spohrer (2006). A Next Frontier in Education, Employment, Innovation, and Economic Growth, IBM Corporation, 2006.

The service economy: from data to value
  The data management and analytics industry
  - "We are drowning in data but starving for knowledge."
  - Worth more than US$100 billion in 2010.
  - Growing 10% a year, nearly twice as fast as the software business as a whole.
  - In recent years large corporations have spent about US$15 billion buying data analytics companies.
  - Summarized by Kenneth Cukier.
  The data science workforce
  - CIOs and data analysts play an increasingly important role.
  - A data analyst is a programmer + a statistician + a data "artisan".
  - The US has a standard that defines the role.
  - See the essay "Tản mạn về cơ hội trong ngành Thống kê (và KHMT)" (Musings on opportunities in Statistics (and Computer Science)) by Nguyễn Xuân Long, 3 July 2009. http://www.procul.org/blog/2009/07/03/t%e1%ba%a3n-m%e1%ba%a1n-v%e1%bb%81-c%c6%a1-h%e1%bb%99i-trong-nganh-th%e1%bb%91ng-ke-va-khmt/

The data-driven economy

September 9, 2014

The concept of KDD
  Knowledge discovery from databases
  - Extracting interesting patterns or knowledge (non-trivial, implicit, previously unknown and potentially useful) from a large collection of data.
  KDD and data mining: are the names confused?
  - According to two authors, data mining is one step in the KDD process.

The KDD process [FPS96]
  [FPS96] Usama M. Fayyad, Gregory Piatetsky-Shapiro, Padhraic Smyth (1996). From Data Mining to Knowledge Discovery: An Overview, Advances in Knowledge Discovery and Data Mining 1996: 1-34.

Steps of the KDD process
  - Learn the application domain: relevant prior knowledge and the goals of the application.
  - Create a target data set: data selection.
  - Data preparation and preprocessing (may take up to 60% of the effort!).
  - Data reduction and transformation: find useful features, reduce dimensions/variables, find invariant representations.
  - Choose the data mining function: summarization, classification, regression, association, clustering.
  - Choose the data mining algorithm(s).
  - The data mining step: search for interesting patterns.
  - Pattern evaluation and knowledge presentation: visualization, transformation, removal of redundant patterns, etc.
  - Use the discovered knowledge.

Related concepts
  Alternative names
  - knowledge extraction, information discovery, information harvesting, data archaeology/dredging, data/pattern analysis/processing
  - business intelligence (BI)
  Distinction: is everything data mining?
  - (Deductive) query processing.
  - Expert systems, or small machine learning/statistical programs.

The iterative KDD process model [CCG98]
  An improved model of the KDD process
  - Business-oriented: identify 1-3 questions or goals that support the KDD objective.
  - Actionable results: identify the set of actionable results based on the evaluated models.
  - Iterative, in the style of the software development life cycle.
  [CCG98] Kenneth Collier, Bernard Carey, Ellen Grusy, Curt Marjaniemi, Donald Sautter (1998). A Perspective on Data Mining, Technical Report, Northern Arizona University.
The CRISP-DM model (2000)
  The industry-standard reference process for data mining
  - The phases of the CRISP-DM process model (Cross-Industry Standard Process for Data Mining).
  - "Business understanding": understand the problem and assess it.
  - Deployment takes place only after the results have been checked against the "business understanding".
  CRISP-DM 2.0 SIG WORKSHOP, LONDON, 18/01/2007
  Source: http://www.crisp-dm.org/Process/index.htm (13/02/2011)
The integrated DM-BI model [WW08]
  The knowledge development cycle through data mining.
  Wang, H. and S. Wang (2008). A knowledge management approach to data mining process for business intelligence, Industrial Management & Data Systems, 2008. 108(5): 622-634. [Oha09]

Data and patterns
  Data (a data set)
  - A set F of finitely many cases (facts).
  - In KDD, F must contain a very large number of cases.
  Pattern
  - KDD uses a language L to describe subsets of the facts (data) in the fact set F.
  - A pattern is an expression E in the language L; it corresponds to the subset F_E of facts in F that it describes.
  - E is called a pattern if it is simpler than the enumeration of all the facts in F_E.
  - Example: the expression "INCOME < $t" (a model with the single variable INCOME).
Validity
  - A discovered pattern must be valid on new data with some degree of certainty.
  - A measure of validity (certainty) is a function C that maps expressions in the pattern language L to a partially or totally ordered measurement space M_C.
  - Example: if the boundary that defines the pattern "INCOME < $t" is moved to the right (INCOME is allowed larger values), the certainty drops, because more good loans are now included in the no-loan region.
  - A pattern of the form a*INCOME + b*DEBT < 0 is more valid.
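The definitions of a pattern E, its covered subset F_E and a certainty measure C can be made concrete with a small sketch. Everything specific here is an assumption for illustration: the `Loan` record, the toy data, and the choice of C as the fraction of covered loans that actually defaulted. The slides do not fix a particular certainty function.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Loan:
    income: float      # hypothetical units
    debt: float
    defaulted: bool    # True if the loan turned out bad


# A pattern E is modelled as a predicate over single facts.
Pattern = Callable[[Loan], bool]


def income_below(t: float) -> Pattern:
    """The pattern "INCOME < t"."""
    return lambda loan: loan.income < t


def covered(pattern: Pattern, facts: List[Loan]) -> List[Loan]:
    """F_E: the facts in F that the pattern describes."""
    return [f for f in facts if pattern(f)]


def certainty(pattern: Pattern, facts: List[Loan]) -> float:
    """Assumed C(E, F): share of covered loans that really defaulted."""
    subset = covered(pattern, facts)
    if not subset:
        return 0.0
    return sum(f.defaulted for f in subset) / len(subset)


# Toy fact set F (invented values).
FACTS = [
    Loan(20, 15, True), Loan(25, 5, True), Loan(30, 20, True),
    Loan(40, 10, False), Loan(45, 30, True),
    Loan(60, 5, False), Loan(80, 20, False),
]

if __name__ == "__main__":
    for t in (35, 65):
        print(t, round(certainty(income_below(t), FACTS), 3))
```

On this toy data, moving the threshold from 35 to 65 pulls good loans into the no-loan region and the certainty falls, which is the effect described in the example above.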
Novelty and potential usefulness
  Novelty
  - A pattern must be new within some frame of reference, at least for the system under consideration.
  - Novelty can be measured with respect to changes in the data (comparing current values with past or expected values) or with respect to knowledge (how the new knowledge relates to the old).
  - In general it can be measured by a function N(E,F), which is either a measure of novelty or a measure of unexpectedness.
  Potential usefulness
  - A pattern should be able to lead to useful actions, as measured by a utility function.
  - The function U maps expressions in L to a partially or totally ordered measurement space M_U: u = U(E,F).
  - Example: in the loan data set, this function could be the expected increase in the bank's profit (in monetary units) associated with the decision rule shown in Figure 1.3.
Understandability, interestingness and knowledge
  Understandability
  - A pattern must be understandable. In KDD, a pattern is something people understand more easily than the underlying data.
  - It is hard to measure precisely: "understandable" roughly means easy to understand.
  - Several simplicity measures exist, ranging from syntactic (the size of the pattern in bits) to semantic (how easily a person can comprehend it in some setting).
  - Assume that understandability is measured by a function S that maps an expression E in L to a partially or totally ordered measurement space M_S: s = S(E,F).
  Interestingness
  - An overall measure of a pattern that combines validity, novelty, usefulness and simplicity.
  - Either use an interestingness function i = I(E, F, C, N, U, S) that maps expressions in L to a measurement space M_I,
  - or define interestingness directly through an ordering of the discovered patterns.
  Knowledge
  - A pattern E in L is called knowledge if, for some class of users, a threshold i in M_I can be specified such that I(E,F,C,N,U,S) > i.

Typical architecture of a data mining system

Data mining and database management
  Questions for a database management system (DBMS)
  - Show the amount of money Mr Smith had on 5 January? -> a single record, obtained by on-line transaction processing (OLTP).
  - How many foreign investors bought stock X last month? -> a statistical record, obtained by a statistical decision support system (DSS).
  - Show every stock in the database whose price has increased? -> multidimensional data, obtained by on-line analytical processing (OLAP).
  These require a complete assumption about complex domain knowledge!

The concept of data mining: DMS questions
  Questions for a data mining system (DMS)
  - What characterizes the stocks whose price has increased?
  - What characterizes the US$ - DMark exchange rate?
  - What can be expected of stock X next week?
  - How many trade union members will be unable to repay their loans next month?
  - What characterizes the people who buy product Y?
  The complete-knowledge assumption is no longer essential; knowledge has to be added to the system -> improve (upgrade) the domain knowledge!

Database systems and data mining systems

DATA MINING AND BUSINESS INTELLIGENCE
  Increasing potential to support business decisions (from bottom to top):
  - Data sources: paper, files, information providers, database systems, OLTP
  - Data warehouses / data marts: OLAP, MDA (DBA)
  - Data exploration: statistical analysis, querying and reporting (data analyst)
  - Data mining: information discovery (data analyst)
  - Data presentation: visualization techniques (business analyst)
  - Making decisions (end user)

Basic applications of data mining
  Data analysis and decision support
  - Market analysis and management: target marketing, customer relationship management (CRM), market basket analysis, cross-selling, market segmentation.
  - Risk analysis and management: forecasting, customer retention, improved underwriting, quality control, competitive analysis.
  - Fraud detection and detection of unusual patterns (outliers).
  Other applications
  - Text mining (newsgroups, email, documents) and Web mining.
  - Stream data mining.
  - DNA and biological data analysis.

Market analysis and management
  Where does the data come from?
  - Credit card transactions, loyalty cards, discount coupons, customer complaints, plus (public) lifestyle studies.
  Target marketing
  - Find clusters of "model" customers who share the same characteristics: interests, income level, spending habits...
  - Determine customer purchasing patterns over time.
  Cross-market analysis
  - Associations/correlations between product sales, and prediction based on such associations.
  Customer profiling
  - What types of customers buy what products (clustering and classification).
  Customer requirement analysis
  - Identify the best products for (different) customers.
  - Predict what factors will attract new customers.
  Provision of summary information
  - Multidimensional summary reports.
  - Statistical summary information (central tendency and variation of the data).

Corporate analysis and risk management
  Finance planning and asset evaluation
  - Cash flow analysis and prediction.
  - Contingent claim analysis to evaluate assets.
  - Cross-sectional and time series analysis (financial ratios, trend analysis).
  Resource planning
  - Summarize and compare resources and spending.
  Competition
  - Monitor competitors and market directions.
  - Group customers into classes and apply class-based pricing.
  - Set a pricing strategy in a highly competitive market.

Fraud detection and mining rare patterns
  Approaches: clustering and building fraud models, outlier analysis.
  Applications: health care, retail, credit card services, telecommunications.
  - Auto insurance: rings of collisions.
  - Money laundering: suspicious monetary transactions.
  - Medical insurance: professional patients, rings of doctors and rings of referrals; unnecessary or correlated screening tests.
  - Telecommunications: phone call fraud. Phone call model: destination of the call, duration, time of day or week.
    Analyze patterns that deviate from an expected norm.
  - Retail industry: analysts estimate that 38% of retail shrink is due to dishonest employees.
  - Anti-terrorism.

Other applications
  Sports
  - IBM Advanced Scout analyzed NBA game statistics (shots blocked, assists and fouls) to give the New York Knicks and Miami Heat a competitive advantage.
  Astronomy
  - JPL and the Palomar Observatory discovered 22 quasars with the help of data mining.
  Internet Web Surf-Aid
  - IBM Surf-Aid applies data mining algorithms to Web access logs for market-related pages to discover customer preferences and behavior, analyze the effectiveness of Web marketing and improve Web site organization.

Data mining: classification schemes (functions)
  General functions
  - Descriptive data mining: summarization, clustering, association rules.
  - Predictive data mining: classification, regression.
  Typical tasks
  - Concept description
  - Association
  - Classification
  - Clustering
  - Regression
  - Dependency modeling
  - Change and deviation detection
  - Pattern-directed analysis and other tasks

Data mining: classification schemes (functions)
  Concept description: characterization and discrimination
  - Find the characteristics and properties of a concept.
  - Generalize, summarize, and find constraining and contrasting characteristics, e.g. dry versus wet regions.
  - Typical descriptive task: summarization (finding a compact description), e.g. mean and variance, text summarization.
  Association
  - Associations between data variables (correlation and causality).
  - Association rule X -> Y, e.g. Diaper -> Beer [0.5%, 75%].
  - Example in Web data mining: discovering semantic relations, and relations between web page content and user interests.
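The bracketed pair in a rule such as Diaper -> Beer [0.5%, 75%] is conventionally read as [support, confidence]: the share of all transactions containing both items, and the share of transactions containing the left-hand side that also contain the right-hand side. A minimal sketch of the two measures follows; the function names and the toy transactions are invented for illustration.

```python
from typing import FrozenSet, List

Transaction = FrozenSet[str]


def count_containing(items: FrozenSet[str], transactions: List[Transaction]) -> int:
    """Number of transactions that contain every item in `items`."""
    return sum(1 for t in transactions if items <= t)


def support(lhs: FrozenSet[str], rhs: FrozenSet[str],
            transactions: List[Transaction]) -> float:
    """Share of all transactions containing both sides of the rule."""
    return count_containing(lhs | rhs, transactions) / len(transactions)


def confidence(lhs: FrozenSet[str], rhs: FrozenSet[str],
               transactions: List[Transaction]) -> float:
    """Share of transactions with `lhs` that also contain `rhs`."""
    n_lhs = count_containing(lhs, transactions)
    if n_lhs == 0:
        return 0.0
    return count_containing(lhs | rhs, transactions) / n_lhs


# Toy market baskets (invented data).
BASKETS: List[Transaction] = [
    frozenset({"diaper", "beer", "milk"}),
    frozenset({"diaper", "beer"}),
    frozenset({"diaper", "bread"}),
    frozenset({"milk", "bread"}),
    frozenset({"diaper", "beer", "bread"}),
]

if __name__ == "__main__":
    lhs, rhs = frozenset({"diaper"}), frozenset({"beer"})
    print(f"support    = {support(lhs, rhs, BASKETS):.0%}")
    print(f"confidence = {confidence(lhs, rhs, BASKETS):.0%}")
```

Counting transactions directly, rather than dividing two support values, keeps the confidence free of floating-point rounding on small examples.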
Data mining tasks: functions
  Classification and prediction
  - Build models (functions) that describe and distinguish classes or concepts, for future prediction.
  - Examples: classify countries by climate, or classify cars by fuel consumption.
  - Presentation: decision trees, classification rules, neural networks.
  - Predict unknown or missing numerical values.

Data mining: classification schemes (functions)
  Classification
  - Build or describe a model or predictive function that describes or identifies classes/concepts for later prediction.
  - Learn a function that maps a data item into one of several known classes.
  Clustering
  - Group data into "clusters" (new classes) in order to discover how the data of the application domain is distributed.
  - Based on similarity.
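Data mining functions (2)
  Cluster analysis
  - Class labels are unknown: group the data into new classes, e.g. cluster houses to find distribution patterns.
  - Maximize intra-cluster similarity and minimize inter-cluster similarity.
  Outlier analysis
  - Outlier: a data object that does not comply with the general behavior of the data as a whole.
  - Example: use the sample mean and the sample variance.
  - Noise or exception? No! Outliers are useful in fraud detection and rare-event analysis.
  Change and deviation detection
  - Focuses on the most significant changes in the data relative to previously measured or normative values; provides knowledge about change and deviation.
  - Change and deviation detection is not the same as preprocessing.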
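The outlier example above, which uses the sample mean and sample variance, can be sketched as a z-score rule. The threshold of 2 standard deviations, the function name and the toy readings are assumptions made for illustration; the slides do not prescribe a specific cutoff.

```python
from statistics import mean, stdev
from typing import List, Sequence


def zscore_outliers(values: Sequence[float], threshold: float = 2.0) -> List[float]:
    """Values lying more than `threshold` sample standard deviations from the mean."""
    m = mean(values)
    s = stdev(values)          # square root of the sample variance
    if s == 0:
        return []              # all values identical: nothing stands out
    return [v for v in values if abs(v - m) / s > threshold]


if __name__ == "__main__":
    # Invented readings with one obviously unusual value.
    readings = [10, 11, 9, 10, 12, 10, 11, 9, 10, 50]
    print(zscore_outliers(readings))
```

A caveat on the design: the outlier itself inflates both the mean and the variance, so this simple rule can miss outliers in very small samples; robust statistics such as the median are a common alternative.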
Data mining: classification schemes (functions)
  Regression
  - Learn a function that maps data to the real value of one variable given several other variables.
  - Typical in statistical analysis and forecasting.
  - Predict the value of one or more dependent variables from the values of a set of independent variables.
  Dependency modeling
  - Find a model that describes the significant dependencies between variables.
  - Structural level: a graph shows which variables partially depend on which others.
  - Quantitative level: the strength of the dependencies is measured on a numerical scale.

Data mining: classification schemes (functions)
  Trend and evolution analysis
  - Trend and deviation: regression analysis.
  - Sequential pattern mining, periodicity analysis.
  - Similarity-based analysis.
  Other pattern-directed or statistical analyses

Data mining: classification schemes (2)
  Classification by view
  - The kind of data to be mined
  - The kind of knowledge to be discovered
  - The kind of techniques used
  - The kind of application domain

A multidimensional view of data mining
  Data to be mined
  - Relational, data warehouse, transactional, stream, object-oriented/object-relational, active, spatial, time series, text, multimedia, heterogeneous, legacy, WWW.
  Knowledge to be mined
  - Characterization, discrimination, association, classification, clustering, trend/deviation, outlier analysis, ...
  - Multiple/integrated functions and mining at multiple levels.
  Techniques used
  - Database-oriented, data warehouse (OLAP), machine learning, statistics, visualization, ...
  Applications adapted
  - Retail, telecommunications, banking, fraud analysis, biological data mining, stock market analysis, text mining, Web mining, ...

Data mining: kinds of data
  - Relational databases
  - Data warehouses
  - Transactional databases
  - Advanced databases and information repositories
    - Object-relational databases
    - Spatial and temporal data
    - Time series data
    - Stream data
    - Multimedia data
    - Heterogeneous and legacy data
    - Text databases and the WWW

Kinds of data analyzed/mined (8/2009)
  http://www.kdnuggets.com/polls/2010/data-types-analyzed.html
  http://www.kdnuggets.com/polls/2010/data-miner-salary.html
  http://www.kdnuggets.com/polls/2009/largest-database-data-mined.htm

Are all mined patterns interesting?
  Data mining may generate thousands of patterns, and not all of them are interesting.
  - Suggested approach: user-directed, query-based, goal-directed mining.
  Interestingness measures
  - A pattern is interesting if it is easily understood, valid on new or test data with some degree of certainty, potentially useful, novel, or confirms a hypothesis the user seeks to validate.
  Objective versus subjective interestingness measures
  - Objective: based on the statistics and structure of the pattern, e.g. support, confidence, ...
  - Subjective: based on the user's beliefs about the data, e.g. unexpectedness, novelty, actionability, ...

Can we find all and only the interesting patterns?
  Finding all the interesting patterns: the completeness problem
  - Can a data mining system find all the interesting patterns?
  - Heuristic search versus exhaustive search.
  - Association versus classification versus clustering.
  Finding only the interesting patterns: an optimization problem
  - Can a data mining system find exactly the interesting patterns?
  - Approaches: first generate all the patterns and then filter out the uninteresting ones; or generate only the interesting patterns (mining query optimization).

Data Warehousing and Data Mining: Chapter 1

Data mining: a confluence of multiple disciplines
  Data Mining: Database Systems, Statistics, Machine Learning, Visualization, Algorithm, Other Disciplines

Mathematical statistics and data mining
  Data mining and statistics have much in common
  - In particular exploratory data analysis (EDA) and forecasting [Fied97, HD03].
  - KDD systems are usually tied to particular statistical procedures for modeling data and handling noise within an overall knowledge discovery framework.
  - Statistics-based data mining methods receive special attention.

Mathematical statistics and data mining
  Distinguishing a statistical problem from a data mining problem
  - Statistical hypothesis testing: a hypothesis and a set of observed data are given. The task is to check whether the observed data is consistent with the hypothesis, that is, whether the hypothesis holds on all the observed data.
  - Learning in data mining: the model is not given in advance. The resulting model must fit the whole data population, so the model parameters must not depend on how the training set happens to be chosen.
  - Learning in data mining requires the training set and the test set to be "representative" of all the data in the application domain and independent of each other.
  - In some cases the two data sets (or the test set) are published as a standard benchmark.
  Terminology
  - Data mining: output/target variable, data mining algorithm, attribute/feature, record, ...
  - Statistical data processing: dependent variable, statistical procedure, explanatory variable, observation, ...
  - See also Nguyễn Xuân Long.
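The requirement that training and test data be representative of the domain and independent of each other is commonly approximated with a random hold-out split. The sketch below is one simple way to do it under stated assumptions: the function name, the 30% test share and the fixed seed are illustrative choices, and "independent" is reduced here to "no record appears in both sets".

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def holdout_split(records: Sequence[T], test_fraction: float = 0.3,
                  seed: int = 0) -> Tuple[List[T], List[T]]:
    """Randomly split `records` into disjoint (train, test) lists."""
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be strictly between 0 and 1")
    shuffled = list(records)
    # A seeded generator makes the split reproducible.
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


if __name__ == "__main__":
    train, test = holdout_split(range(10), test_fraction=0.3, seed=42)
    print("train:", sorted(train))
    print("test: ", sorted(test))
```

Repeating the split with different seeds and comparing the fitted parameters is a quick check of whether a model depends on the particular choice of training set.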
Reference sources on data mining
  Data mining and KDD (SIGKDD: CD-ROM)
  - Conferences: ACM-SIGKDD, IEEE-ICDM, SIAM-DM, PKDD, PAKDD, etc.
  - Journals: Data Mining and Knowledge Discovery, KDD Explorations
  Database systems (SIGMOD: CD-ROM)
  - Conferences: ACM-SIGMOD, ACM-PODS, VLDB, IEEE-ICDE, EDBT, ICDT, DASFAA
  - Journals: ACM-TODS, IEEE-TKDE, JIIS, J. ACM, etc.
  AI and machine learning
  - Conferences: Machine Learning (ML), AAAI, IJCAI, COLT (Learning Theory), etc.
  - Journals: Machine Learning, Artificial Intelligence, etc.
  Statistics
  - Conferences: Joint Stat. Meeting, etc.
  - Journals: Annals of Statistics, etc.
  Visualization
  - Conference proceedings: CHI, ACM-SIGGraph, etc.
  - Journals: IEEE Trans. Visualization and Computer Graphics, etc.

Other references
  - http://www.kdnuggets.com/
  - The list of references
  - Future Directions in Computer Science

A brief history of the data mining community
  - 1989: IJCAI Workshop on Knowledge Discovery in Databases (Piatetsky-Shapiro)
    - Knowledge Discovery in Databases (G. Piatetsky-Shapiro and W. Frawley, 1991)
  - 1991-1994: Workshops on Knowledge Discovery in Databases
    - Advances in Knowledge Discovery and Data Mining (U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, 1996)
  - 1995-1998: International Conferences on Knowledge Discovery in Databases and Data Mining (KDD'95-98)
    - Journal of Data Mining and Knowledge Discovery (1997)
  - 1998: ACM SIGKDD, SIGKDD 1999-2001 conferences, and SIGKDD Explorations
  - More conferences on data mining: PAKDD (1997), PKDD (1997), SIAM-Data Mining (2001), (IEEE) ICDM (2001), etc.

KDD.ORG
  SIGKDD: http://www.sigkdd.org/index.php
  - Provides the premier forum for the advancement and adoption of the "science" of knowledge discovery and data mining.
  - Encourages basic research in KDD (conferences, journals, ...), the adoption of market standards for terminology, evaluation and methodology, and interdisciplinary education among researchers, practitioners and users.
KDD 2011
Data mining: the top 20 keywords
Data mining topics are in the news!

The KDD web site; data mining and climate change
  Causes of climate change
  - Nearly 50% of KDnuggets readers believe that current climate change is largely due to human activity; a significant number are skeptical.
  - The climate is very complex, and scientists do not claim that human activity is the only cause of climate change.
  - This agrees with the Intergovernmental Panel on Climate Change: human activity is one of the main causes.
  Opinion Mining / Sentiment Mining

Current issues in data mining
  Mining methodology
  - Mining different kinds of knowledge from diverse data such as biological data, streams and the Web.
  - Performance: efficiency, effectiveness and scalability.
  - Pattern evaluation: the interestingness problem.
  - Incorporating domain knowledge: ontologies.
  - Handling noisy and incomplete data.
  - Parallel, distributed and incremental mining methods.
  - Integrating discovered knowledge with existing knowledge: knowledge fusion.
  User interaction
  - Data mining query languages and ad hoc mining.
  - Presentation and visualization of data mining results.
  - Interactive mining of knowledge at multiple levels of abstraction.
  Applications and social impact
  - Domain-specific data mining and invisible data mining.
  - Protection of data security, integrity and privacy.

Some initial requirements
  Preliminary requirements for a successful data mining project
  - There must be an expectation of a significant benefit from the data mining results:
    - either directly, by picking easily gathered "low-hanging fruit" (such as a customer acquisition model for marketing and sales),
    - or indirectly, by creating high leverage on a vital process with strong ripple effects (reducing bad debt from 10% to 9.8% amounts to a large sum of money).
  - There must be a project team with the required skills: data selection, data integration, modeling analysis, and report writing and presentation. Analysts and business people must work well together.
  - Capture and maintain the flow of accumulated information (e.g. the models resulting from a series of marketing campaigns).
  - Learning takes many cycles and has to "race against practice" (the initial customer acquisition model is not yet optimal).
  A synthesis of lessons from successful and failed data mining projects
  [NEM09] Robert Nisbet, John Elder, and Gary Miner (2009). Handbook of Statistical Analysis and Data Mining, Elsevier, 2009.