
A Study of the Data Mining Tools in SQL Server 2000


CONTENTS
1. Introduction
2. Microsoft's Data Mining Algorithms
3. Building Data Mining Models with Analysis Services 2000
   3.1. Data sources for a data mining model
   3.2. Creating a data mining model
   3.3. Training a data mining model
   3.4. Browsing the content of a data mining model
   3.5. Using a data mining model for prediction
4. Conclusion
Appendix A: Experimental Results
   A.1. Results for the decision tree algorithm
      A.1.1. Training results without links between tables
      A.1.2. Training results with links between tables
   A.2. Results for the Clustering algorithm
      A.2.1. Training results without links between tables
      A.2.2. Training results with links between tables
Appendix B: Glossary
Appendix C: Demo Program
References



1. Introduction

In today's e-commerce environment, data mining attracts growing interest. Using automatic or semi-automatic means, data mining explores and analyzes large volumes of data to extract meaningful patterns and rules. This information helps businesses, for example, understand their customers better and so devise more suitable strategies for improving marketing, sales, and customer support. Over many years of operation, businesses have accumulated very large databases from applications such as Enterprise Resource Planning (ERP) and Customer Relationship Management (CRM), or from other operational systems. It is widely believed that untapped value lies hidden in this data. Data mining techniques can help extract such patterns.

Microsoft recently introduced the OLE DB for Data Mining application programming interface (API) together with a number of leading data mining providers. The API defines a data mining query language based on SQL syntax. Data mining models are treated as a special kind of relational table, and prediction operations are treated as a special kind of join. Microsoft SQL Server 2000 Analysis Services provides a Microsoft data mining provider based on the OLE DB for Data Mining standard. This provider includes two data mining algorithms: Microsoft Decision Trees and Microsoft Clustering.

2. Microsoft's Data Mining Algorithms

The two data mining algorithms in SQL Server 2000, Microsoft Decision Trees (MDT) and Microsoft Clustering, are the result of many years of research at Microsoft Research. Both are summarized below.

The Microsoft Decision Trees algorithm

The decision tree is probably the most popular technique for predictive modeling. The following table is a training data set used to predict credit risk:
Customer ID  Debt level  Income level  Employment type  Credit risk
1            High        High          Self-employed    Bad
2            High        High          Salaried         Bad
3            High        Low           Salaried         Bad
4            Low         Low           Salaried         Good
5            Low         Low           Self-employed    Bad
6            Low         High          Self-employed    Good
7            Low         High          Salaried         Good

A decision tree built from this data set is shown below.

[Figure: decision tree for the credit risk data]
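To make the first split concrete, the sketch below ranks the three attributes of the training table by information gain. It is only an illustration: it uses the entropy criterion associated with C4.5, not the Bayesian score that Microsoft Decision Trees uses by default, and the function names are invented for this example.

```python
from collections import Counter
from math import log2

# Training rows copied from the credit risk table:
# (debt level, income level, employment type, credit risk)
ROWS = [
    ("High", "High", "Self-employed", "Bad"),
    ("High", "High", "Salaried", "Bad"),
    ("High", "Low", "Salaried", "Bad"),
    ("Low", "Low", "Salaried", "Good"),
    ("Low", "Low", "Self-employed", "Bad"),
    ("Low", "High", "Self-employed", "Good"),
    ("Low", "High", "Salaried", "Good"),
]
ATTRIBUTES = {"Debt level": 0, "Income level": 1, "Employment type": 2}
TARGET = 3  # index of the credit risk column


def entropy(rows):
    """Shannon entropy (in bits) of the credit risk labels in rows."""
    counts = Counter(row[TARGET] for row in rows)
    total = len(rows)
    return -sum((n / total) * log2(n / total) for n in counts.values())


def information_gain(rows, column):
    """Entropy reduction obtained by splitting rows on one column."""
    groups = {}
    for row in rows:
        groups.setdefault(row[column], []).append(row)
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy(rows) - remainder


def best_split(rows):
    """Name of the attribute with the highest information gain."""
    return max(ATTRIBUTES, key=lambda name: information_gain(rows, ATTRIBUTES[name]))


print(best_split(ROWS))
```

On this table the gain for Debt level is about 0.52, against about 0.02 for each of the other two attributes, so Debt level is chosen for the first split.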

In this example, the Decision Tree algorithm determines that the most important attribute is Debt level, so the first split is made on debt level. The node with Debt = High is a leaf node (all three cases are bad credit risks). The node with Debt = Low is still mixed (three good credit risks and one bad credit risk). Employment type is the next most important attribute. Likewise, the node with Employment = Salaried is a leaf node.

This is only a small example based on synthetic data, but it shows how a decision tree can use the relevant attributes to predict credit risk. As the scope of the problem grows, deriving the rules by hand becomes difficult. The algorithm can run over hundreds of attributes and millions of records to produce a decision tree that describes the rules for predicting credit risk.

Many different algorithms, with different splitting methods, are used to build decision trees. Microsoft Decision Trees is a probabilistic classification tree. It closely resembles C4.5, but by default it uses a Bayesian score rather than entropy as the splitting criterion.

The Microsoft Clustering algorithm

Clustering means finding groups (clusters) in a data set, that is, subsets whose records are similar to one another. It differs from predictive modeling in that the data set has no target attribute. The clustering algorithm determines this new hidden attribute by examining the data set. There are many ways to cluster data. Popular algorithms include K-Means, hierarchical agglomerative methods, and mixture modeling, which uses the Expectation-Maximization (EM) algorithm to fit probabilistic mixture models to the data set. Records in the data set may belong to different clusters depending on how the boundaries are set.

Consider an employee database in which each employee has three attributes: age, salary, and vested amount. A user wants a table of the average age of employees whose vested amount falls in the ranges 100K-200K, 200K-400K, and 400K-1000K and whose salary falls in the ranges 50K-100K, 100K-200K, and 200K-300K. This is three-dimensional data. Records of n-dimensional data can be viewed as points in n-dimensional space. For example, records of the form (age, salary) can be viewed as points in a two-dimensional space with an age dimension and a salary dimension. Figures 3a and 3b illustrate two representations of this example.

Finding clusters in a high-dimensional space (four dimensions or more) is very hard for a human. Simply plotting the data as points does not help much. Clustering algorithms, however, find such clusters in a data set automatically. Each cluster is represented by its own distribution.

The Microsoft Clustering algorithm is based on the Expectation-Maximization (EM) algorithm, which iterates between two steps. In the first step, called the E or Expectation step, the cluster membership of each case is computed. In the second step, called the M or Maximization step, the parameters of the models are re-estimated using these cluster memberships.

EM is similar to K-Means, whose main steps are:
1. Assign the initial means.
2. Assign the cases to each mean using some distance measure.
3. Compute new means from the members of each cluster.
4. Set new boundaries based on the new means.
5. Repeat the cycle until convergence.

EM differs from K-Means in several respects. The main difference is that EM does not draw a hard boundary between clusters: a case is assigned to each cluster with a certain probability. The following illustrates a few iterations of the EM algorithm on a one-dimensional data set, assuming the data in each cluster follows a Gaussian distribution. The mean of each cluster shifts after each iteration.

[Figure: iterations of the EM algorithm on one-dimensional data]
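The two EM steps can be sketched for the one-dimensional Gaussian case as follows. This is a textbook-style illustration, not the Microsoft Clustering implementation: the sample data, the starting variance of 1.0, the equal starting weights, and all names are assumptions made for the example.

```python
from math import exp, pi, sqrt


def gaussian_pdf(x, mean, variance):
    """Density of a one-dimensional Gaussian at x."""
    return exp(-((x - mean) ** 2) / (2.0 * variance)) / sqrt(2.0 * pi * variance)


def em_1d(data, initial_means, iterations=25):
    """Fit a one-dimensional Gaussian mixture with plain EM."""
    k = len(initial_means)
    means = list(initial_means)
    variances = [1.0] * k        # assumed starting variance
    weights = [1.0 / k] * k      # assumed equal starting weights
    memberships = []
    for _ in range(iterations):
        # E step: probability that each case belongs to each cluster.
        memberships = []
        for x in data:
            scores = [weights[j] * gaussian_pdf(x, means[j], variances[j])
                      for j in range(k)]
            total = sum(scores)
            memberships.append([s / total for s in scores])
        # M step: re-estimate the parameters from the soft memberships.
        for j in range(k):
            mass = sum(m[j] for m in memberships)
            means[j] = sum(m[j] * x for m, x in zip(memberships, data)) / mass
            spread = sum(m[j] * (x - means[j]) ** 2
                         for m, x in zip(memberships, data)) / mass
            variances[j] = max(spread, 1e-6)  # keep the variance positive
            weights[j] = mass / len(data)
    return means, variances, weights, memberships


DATA = [0.8, 1.0, 1.1, 1.2, 4.8, 5.0, 5.1, 5.2]  # made-up sample
means, variances, weights, memberships = em_1d(DATA, [0.0, 6.0])
print(means)
```

Unlike K-Means, every case keeps a membership probability for every cluster (each row of `memberships` sums to 1), and the means move toward the centers of the two groups as the iterations proceed.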

Most clustering algorithms have to load all the data points into memory, which can cause serious memory problems when processing a large data set. To address this, the Microsoft Clustering algorithm uses a framework that selectively stores the important portions of the database and summarizes the rest. The basic idea is to load the data into memory buffers in chunks and, based on the updated data mining model, to group cases that lie close together under a Gaussian distribution, so that those cases are compressed. The Microsoft Clustering algorithm therefore needs only a single pass over the raw data.

3. Building Data Mining Models with Analysis Services 2000

3.1. Data sources for a data mining model

Consider the question: which customers are most likely to leave the bank, based on customer information and their transactions with the bank? To answer it, the following relational tables are used:
- Customer table: information about the bank's customers, including age, income, educational level, house value, loan, and so on.
- Purchases table: the customers' transaction information, including checking accounts, money market savings, and so on.

The relational model for the two tables is as follows:

[Figure: relational schema of the Customer and Purchases tables]

3.2. Creating a data mining model

When you create a data mining model (DMM), you must define its structure and attributes. To define a new DMM in the Microsoft OLE DB for Data Mining API, use the CREATE MINING MODEL statement. Like CREATE TABLE, this statement defines only the structure and attributes of the model; the model holds no data at all. In the same way, the statement defines the key, the columns, the algorithm to use, and the parameters used later when training the DMM. The syntax is:

    CREATE MINING MODEL <model name> (<column definitions>)
    USING <service>[(<service parameters>)]

Because DMM columns require specific information, however, some extensions are added to standard SQL syntax. The following example applies to the table structure described above:

    CREATE MINING MODEL [Model_MDT_Churn_Prediction] (
        [Customer Id] LONG KEY,
        [Income] DOUBLE CONTINUOUS,
        [Other Income] DOUBLE CONTINUOUS,
        [Loan] DOUBLE CONTINUOUS,
        [Age] DOUBLE CONTINUOUS,
        [Region Name] TEXT DISCRETE,
        [Home Years] DOUBLE CONTINUOUS,
        [House Value] DOUBLE CONTINUOUS,
        [Education Level] TEXT DISCRETE,
        [Home Type] TEXT DISCRETE,
        [Churn Yes No] TEXT DISCRETE PREDICT
    ) USING Microsoft_Decision_Trees
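When a client application needs several models that differ only in their columns or algorithm, the statement text can be assembled from a column list. The helper below is a hypothetical sketch in Python: it only builds the statement string in the layout used in this report, does not connect to Analysis Services, and does not escape bracket characters inside names.

```python
def create_mining_model(model_name, columns, service):
    """Build a CREATE MINING MODEL statement as text.

    columns is a list of (name, data_type, modifiers) tuples, for example
    ("Income", "DOUBLE", "CONTINUOUS"). The function name and argument
    layout are invented for this illustration.
    """
    definitions = [f"    [{name}] {data_type} {modifiers}"
                   for name, data_type, modifiers in columns]
    body = ",\n".join(definitions)
    return f"CREATE MINING MODEL [{model_name}] (\n{body}\n) USING {service}"


statement = create_mining_model(
    "Model_MDT_Churn_Prediction",
    [
        ("Customer Id", "LONG", "KEY"),
        ("Income", "DOUBLE", "CONTINUOUS"),
        ("Region Name", "TEXT", "DISCRETE"),
        ("Churn Yes No", "TEXT", "DISCRETE PREDICT"),
    ],
    "Microsoft_Decision_Trees",
)
print(statement)
```

Only four of the eleven columns are listed here to keep the example short; the remaining columns follow the same pattern.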


The keywords LONG, DOUBLE, and TEXT define the data type of a column. There are, however, some extensions to standard SQL. The KEY keyword designates the key column (or columns). CONTINUOUS and DISCRETE are the two possible content types for a column. The PREDICT keyword designates the column to be predicted.

Note: you can also create a data mining model from Analysis Manager, in which case the CREATE MINING MODEL statement is generated automatically.

3.3. Training a data mining model

After a data mining model is created, the next step is to train it. Training means running the model over training data using a particular algorithm. This is the most time-consuming step. The algorithm may iterate several times over the training data set to find the patterns hidden in it. The OLE DB for Data Mining API hides the complexity of training by providing the INSERT statement as the training command. Although a huge amount of data is fed into the model in this phase, the model does not store any of that data; instead it stores the patterns found in it. Once the model has been trained, client applications can browse the model content and run queries against new data sets. The syntax of the INSERT statement is:

    INSERT [INTO] <model name> [<mapped model columns>]
    <source data query>


Example: training the Model_MDT_Churn_Prediction model created above.

    INSERT INTO [Model_MDT_Churn_Prediction] (
        SKIP, [Income], [Other Income], [Loan], [Age], [Region Name],
        [Home Years], [House Value], [Education Level], [Home Type],
        [Churn Yes No])
    OPENROWSET('SQLOLEDB', '<connection string>',
        'SELECT DISTINCT [CustomerID], [Income], [OtherIncome], [Loan],
            [Age], [RegionName], [HomeYears], [HouseValue],
            [EducationLevel], [HomeType], [Churn_Yes_No]
         FROM Customers')

3.4. Browsing the content of a data mining model

Once the model has been trained, you can browse its content from Analysis Manager using the tree browser. The browser displays the content graphically and lets you navigate through its different parts. The content of a DMM is the set of rules, formulas, classifications, distributions, nodes, or any other information derived from a particular data set by a data mining technique. The type of content varies from model to model, depending on the data mining technique used to create the DMM: the content of a decision tree DMM differs from that of a clustering DMM. Browsing the model content can yield important insight into the data. In many cases it lets data analysts understand the patterns and rules and anticipate the characteristics of new data.


The following is the pattern found by the Decision Trees algorithm on the training data set:

[Figure: pattern found by the Decision Trees algorithm]

We can also browse all the possible cases of a model. Consider a DMM with the columns Gender, Age, and HairColor. After the model is trained, the Gender column has the states (values) Male, Female, and Missing (unknown). For the HairColor column, the DMM sees and remembers the values Black, Gray, and Missing. For the Age column, although the DMM sees all of its continuous values, it does not remember each distinct value; it remembers only the minimum, mean, and maximum. Suppose the model is built to predict the HairColor column from a data set of 100 people. The content of the DMM might then look as follows.


The query

    SELECT *, PredictProbability(HairColor) FROM HairColorPredictDMM

returns the following result:

Gender  Age   HairColor  P(HairColor)
Male    2     Black      .667
Male    2     Gray       .267
Male    2     NULL       .067
Male    91    Black      .300
Male    91    Gray       .625
Male    91    NULL       .075
Male    45    Black      .667
Male    45    Gray       .267
Male    45    NULL       .067
Male    NULL  Black      .600
Male    NULL  Gray       .350
Male    NULL  NULL       .05
Female  2     Black      .933
Female  2     Gray       .067
Female  2     NULL       .000
Female  91    Black      .300
Female  91    Gray       .625
Female  91    NULL       .075
Female  45    Black      .933
Female  45    Gray       .067
Female  45    NULL       .000
Female  NULL  Black      .600
Female  NULL  Gray       .350
Female  NULL  NULL       .05
NULL    2     Black      .800
NULL    2     Gray       .167
NULL    2     NULL       .033
NULL    91    Black      .300
NULL    91    Gray       .625
NULL    91    NULL       .075
NULL    45    Black      .800
NULL    45    Gray       .167
NULL    45    NULL       .033
NULL    NULL  Black      .600
NULL    NULL  Gray       .350
NULL    NULL  NULL       .05

The query

    SELECT Age, PredictProbability(HairColor) FROM HairColorPredictDMM
    WHERE Gender = 'Male' AND HairColor = 'Black'

returns the following result:

Gender  Age   HairColor  P(HairColor)
Male    2     Black      .667
Male    91    Black      .300
Male    45    Black      .667
Male    NULL  Black      .600
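One way to picture the set of possible cases is as the cross product of the states stored for each column. The sketch below, written in Python rather than in the query language of the API, enumerates that cross product for the hair color example. The state lists are copied from the result tables in this section; the function name is invented, and the sketch says nothing about how Analysis Services stores the content internally.

```python
from itertools import product

# States remembered by the model for each column (None stands for Missing/NULL).
STATES = {
    "Gender": ["Male", "Female", None],
    "Age": [2, 91, 45, None],          # minimum, maximum, mean, missing
    "HairColor": ["Black", "Gray", None],
}


def possible_cases(states):
    """Every combination of one state per column, as a list of dicts."""
    names = list(states)
    return [dict(zip(names, combo)) for combo in product(*states.values())]


cases = possible_cases(STATES)
male_black = [c for c in cases
              if c["Gender"] == "Male" and c["HairColor"] == "Black"]
print(len(cases), len(male_black))
```

The 3 x 4 x 3 states give the 36 rows of the first result table, and the filter on Gender and HairColor leaves the four rows of the second.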

3.5. Using a data mining model for prediction

After training, the model can be used to make predictions on new data sets. In the OLE DB for Data Mining API, predictions are made with the SELECT statement, which joins a data mining model with a new input table. This special join is called a PREDICTION JOIN. The general syntax of the SELECT statement is:

    SELECT [FLATTENED] <SELECT-expressions>
    FROM <model name>
    PREDICTION JOIN <source data query>
    ON <join condition>
    [WHERE <WHERE-expression>]

The <source data query> clause specifies the new data set whose attributes are to be predicted by combining it with the knowledge in the DMM.

PREDICTION JOIN: the actual cases from <source data query> are combined with the set of possible cases from the model <model name> through the PREDICTION JOIN operation. Combining the cases in the source data with all the possible cases through a PREDICTION JOIN differs semantically from a join in a standard relational database, for the following simple reasons:

- The cases in a DMM do not represent every possible value of an attribute (column) of type CONTINUOUS, yet a PREDICTION JOIN must match an exact continuous value from a source case against the distribution values in the DMM. With the set of all possible cases shown above, the following statement returns no records, because the Age column of the possible cases in the DMM holds only the Minimum, Mean, Maximum, and Missing values, that is (2, 45, 91, Missing):

      SELECT * FROM GenderPredictDMM WHERE Gender = 'Male' AND Age = 30

  A PREDICTION JOIN that uses the decision tree described for this model, however, finds the following distribution over HairColor for a 30-year-old male: Black = .667, Gray = .267, Missing = .067.

- The cases of a DMM represent all the possible values of a predicted column, whereas whoever makes the prediction usually expects the single best value. Consider the following query:

      SELECT * FROM GenderPredictDMM WHERE Gender = 'Male' AND Age = 45

  The result is:

      Gender  Age  HairColor
      Male    45   Black
      Male    45   Gray
      Male    45   NULL

- A PREDICTION JOIN may need some constraints and assumptions when it encounters missing values in a source case. A PREDICTION JOIN between a simple model and a case in which age is 30 and gender is unknown gives Black as the HairColor result, with a probability of 80%.
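The contrast between an exact-match lookup and a prediction can be mimicked with a toy predictor. In the Python sketch below the three distributions are copied from the result table in section 3.4, but the tree shape is only one reading of that table, and the age threshold of 65 is a hypothetical value chosen somewhere between the stored ages 45 and 91. None of this reflects how the provider actually evaluates a PREDICTION JOIN.

```python
# HairColor distributions copied from the result table (None stands for Missing).
DIST_AGE_HIGH = {"Black": 0.300, "Gray": 0.625, None: 0.075}
DIST_AGE_MISSING = {"Black": 0.600, "Gray": 0.350, None: 0.05}
DIST_BY_GENDER = {
    "Male": {"Black": 0.667, "Gray": 0.267, None: 0.067},
    "Female": {"Black": 0.933, "Gray": 0.067, None: 0.000},
    None: {"Black": 0.800, "Gray": 0.167, None: 0.033},
}
STORED_AGES = [2, 45, 91, None]  # minimum, mean, maximum, missing
AGE_THRESHOLD = 65               # hypothetical split point, not from the report


def exact_match(age):
    """Relational-style lookup: only the stored Age states can match."""
    return [stored for stored in STORED_AGES if stored == age]


def predict_hair_color(gender, age):
    """Return a HairColor distribution for any case, stored or not."""
    if age is None:
        return DIST_AGE_MISSING
    if age > AGE_THRESHOLD:
        return DIST_AGE_HIGH
    return DIST_BY_GENDER.get(gender, DIST_BY_GENDER[None])


def best_state(distribution):
    """The single most probable state of a distribution."""
    return max(distribution, key=distribution.get)


print(exact_match(30))
print(predict_hair_color("Male", 30))
print(best_state(predict_hair_color(None, 30)))
```

The exact-match lookup finds nothing for age 30, while the predictor returns the distribution quoted above for a 30-year-old male and picks Black, with probability 0.8, when the gender is unknown.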


In general, a PREDICTION JOIN takes a case from the input data set and uses the condition in the ON clause to find the matching set of cases in the DMM.

The <SELECT-expressions> clause is a comma-separated list of expressions. An expression can be a simple column reference or can contain prediction functions. Columns may be referenced from the DMM or from the source data query.

ON and the join condition: every row in the DMM's set of possible cases is unique, so it can be joined to the rows of actual cases in the source query through the <join condition> of the ON keyword. The join condition matches columns in the DMM with columns in the source query. It contains an = expression for each pair of joined columns, and the expressions are combined with the AND keyword when more than one column is joined.

The WHERE clause restricts the cases returned by the prediction query.

Example: predict the customers who are most likely to leave the bank (probability above 80%), based on customer information:

    SELECT FLATTENED
        [T1].[CustomerID], [T1].[Income], [T1].[OtherIncome], [T1].[Loan],
        [T1].[Age], [T1].[RegionName], [T1].[HomeYears], [T1].[HouseValue],
        [T1].[EducationLevel], [T1].[HomeType], [T1].[Churn_Yes_No]
    FROM [Model_MDT_Churn_Prediction] AS [M1]
    PREDICTION JOIN
        OPENROWSET('SQLOLEDB', ';data source=D:\customer.mdb',
            'SELECT DISTINCT [CustomerID], [Income], [OtherIncome], [Loan],
                [Age], [RegionName], [HomeYears], [HouseValue],
                [EducationLevel], [HomeType], [Churn_Yes_No]
             FROM Customers') AS [T1]
    ON [M1].[Customer Id] = [T1].[CustomerID]
    WHERE PredictProbability([M1].[Churn Yes No]) > 0.8


4. Conclusion

Data mining is rapidly becoming a widely used analysis technique. This report describes the two data mining algorithms in SQL Server 2000 Analysis Services: Microsoft Decision Trees (MDT) and Microsoft Clustering. It also shows how to build data mining models that help solve business problems. Appendix A presents experimental results for training data mining models with both algorithms under various parameter settings. These results show that both algorithms run quickly and can be applied to large data sets. For example, the Microsoft Decision Trees algorithm takes about 100 minutes to train a data mining model with 10 million cases and 25 attributes. With SQL Server 2000 Analysis Services, data mining is no longer the preserve of statisticians. Users do not need to know the intricacies of data mining algorithms. Any database developer can create and train data mining models and embed these advanced capabilities in their applications.


Appendix A: Experimental Results

This appendix presents the experiments the group ran with the decision tree and clustering algorithms. Understanding how each factor affects execution time helps developers choose the most suitable model and minimize running time. The following factors affect the execution time of the algorithms:
- The number of cases.
- The number of attributes.
- The number of states (values).
- The number of states of the nested-table attribute.
- The sparseness of the table.
- The number of clusters in the clustering algorithm.

In each of the following experiments one parameter is varied while the others are held fixed. The running time then shows how the varied factor affects the execution of the algorithm.

A.1. Results for the decision tree algorithm

A.1.1. Training results without links between tables

Usually, once the data has been prepared, it resides in a single table, and predictions are based on that table.

Effect of the number of input attributes

Parameter               Value
Training cases          1,000,000
Predictable attributes  1
Input attributes        Varying: 10, 20, 50, 100, 200
Number of states        25


Remarks:
- Execution time grows linearly with the number of attributes.
- Execution is fairly fast: 130 minutes for 1 million cases with 200 attributes.

Effect of data size (number of cases)

Parameter               Value
Training cases          Varying: 10,000 to 10 million
Predictable attributes  1
Input attributes        20
Number of states        25


Remarks:
- Execution time grows linearly with the number of cases.
- Execution is fairly fast: 20 seconds for 10,000 cases and 100 minutes for 10 million cases.

Effect of the number of states of the input attributes

Parameter               Value
Training cases          1 million
Predictable attributes  1
Input attributes        20
Number of states        Varying: 2, 5, 10, 25, 50

Remarks:
- Execution time grows linearly while the number of states is below 10.
- As the number of states increases, the algorithm has more difficulty identifying useful data when building the tree. The tree then becomes shallower, and training time falls.

Effect of the number of predictable attributes

Parameter               Value
Training cases          1 million
Predictable attributes  Varying: 1, 2, 4, 16, 32
Input attributes        40
Number of states        25


Remark: execution time grows somewhat faster than linearly with the number of predictable attributes. The explanation given is that when more than one attribute has to be predicted, the trees can be built in parallel.

A.1.2. Training results with links between tables

Nested tables are a new concept introduced in OLE DB for Data Mining. This is a powerful feature that makes it possible to answer many complex prediction questions, for example: list other products that may appeal to a customer, based on the products the customer has already bought. Without nested tables, analyzing the data for such a question is very difficult.

Effect of the number of states of the nested-table attribute

Parameter                             Value
Case table
  Training cases                      200,000
  Predictable attributes              1
  Input attributes                    5
  Number of states                    25
Nested table
  Input attributes                    5
  Number of states (banking product)  Varying: 100 to 1000
  Products purchased per customer     50


Remarks:
- Execution takes longer than without a nested table.
- When the number of products exceeds 255, training time starts to fall. The reason is that the algorithm then applies feature selection to keep the most important information and models the remaining products with a marginal model.
- When the nested table has more key values while the customers' purchase level stays the same, the key values of the nested table are more sparsely distributed. There are therefore fewer relevant patterns for each key, the tree becomes smaller, and training time falls.

Effect of the number of products purchased per customer

Parameter                             Value
Case table
  Training cases                      200,000
  Predictable attributes              1
  Input attributes                    5
  Number of states                    25
Nested table
  Input attributes                    5
  Number of states (banking product)  1000
  Products purchased per customer     10, 50


Remark: execution time grows linearly.

Effect of the number of cases in the case table

Parameter                             Value
Case table
  Training cases                      Varying: 10,000 to 200,000
  Predictable attributes              1
  Input attributes                    5
  Number of states                    25
Nested table
  Input attributes                    5
  Number of states (banking product)  20
  Products purchased per customer     25

Remark: execution time grows linearly.

A.2. Results for the Clustering algorithm

A.2.1. Training results without links between tables

Effect of the number of clusters

Parameter               Value
Training cases          1,000,000
Predictable attributes  1
Input attributes        20
Number of states        20
Identifiable clusters   5, 10, 20

Remark: execution time is nearly linear.

The same experiment was carried out for the effect of the number of input attributes.

Parameter               Value
Training cases          1,000,000
Predictable attributes  1
Input attributes        20
Number of states        20
Identifiable clusters   10

Remarks:
- Execution time grows linearly. With one million cases, training takes about 230 minutes with 50 input attributes.
- Continuous variables take more training time than discrete variables, because the neighborhood computations for continuous variables are more complex than for discrete ones.


Effect of data size (number of cases)

Parameter               Value
Training cases          10,000, 25,000, 50,000, 75,000, 100,000, 1 million
Predictable attributes  1
Input attributes        20
Number of states        50
Identifiable clusters   10

Remarks:
- Execution time grows linearly.
- Training takes 100 minutes for one million cases and 910 minutes for 10 million cases. The Microsoft Clustering algorithm is about 8 times slower than the MDT algorithm in this case.

The same experiment was carried out for the effect of the number of states of the input attributes.

A.2.2. Training results with links between tables

Effect of the number of states of the nested-table attribute

Parameter                             Value
Case table
  Training cases                      200,000
  Predictable attributes              1
  Input attributes                    5
  Number of states                    20
Nested table
  Input attributes                    5
  Number of states (banking product)  Varying: 100 to 1000
  Products purchased per customer     25


Remarks:
- Training time falls as the number of states increases. There are two reasons. First, the attribute selection algorithm keeps the number of attributes from exceeding 255. Second, as the number of attributes falls, the data becomes sparsely distributed. As a result there are not enough patterns for the algorithm to identify the clusters, so it uses fewer iterations.
- The number of input attributes shrinks progressively because of feature selection. Some attributes are grouped together because of the sparse distribution of the data.

The same experiment was carried out for:
- The effect of the number of products purchased per customer.
- The effect of the number of cases in the case table.


Appendix B: Glossary

This section briefly introduces some data mining terms. They are defined in the Microsoft OLE DB for Data Mining specification.

Data Mining Model: A data mining model is similar to a relational table. It contains key columns, input columns, and predictable columns. A model is associated with a data mining algorithm. After the training phase, the data mining model stores the patterns that the algorithm discovered in the training data set. A data mining model can be thought of as an actual table containing a row for every possible combination of the distinct values of each column in the model. Once trained, the model can be used for prediction.

Columns: A column in a data mining model is similar to a column in a relational table; it is also called a variable or an attribute. There are three kinds of column in a data mining model: input, predictable, and both input and predictable. The data mining model uses the set of input attributes of a case to predict the output attributes. In this report, "column" and "attribute" are used interchangeably.

States: Each attribute has a set of possible values, called the states of the attribute.

Cases: Data mining is concerned with analyzing cases. A case is the basic entity of information. A case can be simple: when analyzing customers' loan risk, for example, each customer's information is a case. A case can also be more complex: for example, a data mining model may predict the list of products a customer will buy, based on customer information and the customer's transaction information. Such a model combines customer information with the list of products the customer bought. This kind of case is called a nested case. In this report, the term sample size refers to the number of cases.

Case Tables and Nested Tables: A case table is the table containing the case information for the non-nested part of the data. A nested table is the table containing the information for the nested part of the data.
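The relationship between a case table and a nested table can be pictured as a simple data structure: one record per customer, carrying a list of that customer's purchased products. The Python sketch below is illustrative only; the sample rows, the class, and the function are invented, with column names borrowed from the Customer and Purchases tables of section 3.1.

```python
from dataclasses import dataclass, field


@dataclass
class Case:
    """One nested case: a customer row plus its nested product rows."""
    customer_id: int
    age: int
    income: float
    products: list = field(default_factory=list)  # rows from the nested table


def build_nested_cases(customers, purchases):
    """Attach each purchase row to the case of the customer who made it."""
    cases = {cid: Case(cid, age, income) for cid, age, income in customers}
    for cid, product_name in purchases:
        if cid in cases:  # ignore purchases without a matching customer
            cases[cid].products.append(product_name)
    return list(cases.values())


# Made-up sample rows: (customer id, age, income) and (customer id, product).
CUSTOMERS = [(1, 34, 42000.0), (2, 51, 67000.0)]
PURCHASES = [
    (1, "checking account"),
    (1, "money market savings"),
    (2, "checking account"),
]

nested_cases = build_nested_cases(CUSTOMERS, PURCHASES)
print(nested_cases)
```

Each `Case` corresponds to one row of the case table, and its `products` list plays the role of the nested table rows linked to that case.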


Appendix C: Demo Program

- The demo program can be downloaded from: http://download.microsoft.com/download/biztalkserver/book/1.0/nt5xp/enus/sql2kdatamining.msi
  The program is 32 MB in size and was released in September 2002.
- The accompanying floppy disk contains the Word file of this report.


References

[1] Performance Study of Microsoft Data Mining Algorithms. Sanjay Soni (Unisys), Zhaohui Tang (Microsoft), Jim Yang (Microsoft).
[2] Cac he co so tri thuc (Knowledge-Based Systems). Hoang Kiem, Do Van Nhon, Do Phuc. 2002.
[3] Knowledge-Based Systems for Engineers and Scientists. Adrian A. Hopgood. 1993.
[4] OLE DB for Data Mining Specification, Version 1.0. Microsoft Corporation. July 2000.
