TOP-DOWN VS. BOTTOM-UP PARTITIONING
(building a dendrogram)
THE K-MEANS ALGORITHM
Introduction
  Hard variant: based on the centroid of each cluster
  Soft variant: based on a representative element of each cluster
  Objects: d = (d1, d2, ..., dn)
Hard k-means
Initialization: arbitrarily choose centroid vectors for the clusters
/* While the improvement condition still holds:
   For each object d:
     Find the cluster c whose centroid is nearest to d
     Assign d to cluster c
   For each cluster c:
     Recompute the centroid from the objects belonging to it */
Improvement condition:
  No (or few) objects move from one cluster to another,
  or the centroids change only slightly
Running time considerations
The soft k-means algorithm concerns the choice of cluster representative vectors
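The loop above can be sketched as follows. This is a minimal illustration, not the course's reference implementation; the function name `kmeans` and the squared-Euclidean distance are my choices.

```python
import random

def kmeans(points, k, max_iter=100):
    """Hard k-means: assign each object to the nearest centroid,
    then recompute centroids, until assignments stop changing."""
    centroids = random.sample(points, k)          # arbitrary initial centroids
    assignment = [None] * len(points)
    for _ in range(max_iter):
        changed = False
        for i, p in enumerate(points):
            # find the cluster whose centroid is nearest to p
            c = min(range(k),
                    key=lambda j: sum((a - b) ** 2
                                      for a, b in zip(p, centroids[j])))
            if assignment[i] != c:
                assignment[i], changed = c, True
        if not changed:                           # stopping condition: no reassignment
            break
        for j in range(k):                        # recompute each centroid
            members = [p for i, p in enumerate(points) if assignment[i] == j]
            if members:
                centroids[j] = tuple(sum(col) / len(members)
                                     for col in zip(*members))
    return assignment, centroids
```

On well-separated data, e.g. `[(0, 0), (0, 1), (10, 10), (10, 11)]` with k = 2, the two pairs end up in different clusters regardless of the random initialization.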
September 9, 2014
Data warehousing and data mining: Chapter 2
38
Regression
[Figure: data points (x, y) fitted by the line y = x + 1; a known value X1 predicts the response Y1]
Chapter 2: Data preprocessing
Understanding and preparing data
The role of data preprocessing
Data cleaning
Data integration and transformation
Data reduction
Discretization and concept hierarchy generation
Data integration
Data integration: combining data from multiple sources into a single coherent store
Schema integration
  Integrate metadata from different sources
  Entity identification problem: identify the same real-world entities across multiple data sources, e.g., A.cust-id and B.cust-#
Detecting and resolving data value conflicts
  For the same real-world entity, attribute values from different sources may differ
  Possible reasons: different representations, different scales, e.g., metric vs. British units
Handling redundancy in data integration
Redundant data occur often when integrating multiple sources
  The same attribute may have different names in different databases
  One attribute may be derivable from attributes in another database, e.g., annual revenue
Redundant data can be detected by correlation analysis
Careful integration of data from multiple sources helps reduce/avoid redundancy and inconsistency, and improves mining speed and quality
Data transformation
Smoothing: remove noise from data
Aggregation: summarization, data cube construction
Generalization: concept hierarchy climbing
Normalization: scale values into a small, specified range
  Min-max normalization
  Z-score normalization
  Normalization by decimal scaling
Attribute/feature construction
  New attributes constructed from the given ones
Data transformation: normalization
Min-max normalization:
  v' = (v - min_A) / (max_A - min_A) * (new_max_A - new_min_A) + new_min_A
Z-score normalization:
  v' = (v - mean_A) / stand_dev_A
Normalization by decimal scaling:
  v' = v / 10^j, where j is the smallest integer such that Max(|v'|) < 1
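The three normalization formulas can be sketched directly. The function names are illustrative; the sample numbers in the usage note below are assumptions chosen for round results, not data from the course.

```python
def min_max(v, min_a, max_a, new_min=0.0, new_max=1.0):
    """Min-max: map v from [min_a, max_a] into [new_min, new_max]."""
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min

def z_score(v, mean_a, stand_dev_a):
    """Z-score: how many standard deviations v lies from the mean."""
    return (v - mean_a) / stand_dev_a

def decimal_scaling(values):
    """Divide by the smallest power of 10 that makes every |v'| < 1."""
    j = 0
    while max(abs(v) for v in values) / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values]
```

For example, an income of 73,600 with observed range [12,000, 98,000] maps to about 0.716 under min-max scaling into [0, 1]; with mean 54,000 and standard deviation 16,000 its z-score is 1.225.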
Chapter 2: Data preprocessing
Understanding and preparing data
The role of data preprocessing
Data cleaning
Data integration and transformation
Data reduction
Discretization and concept hierarchy generation
Data reduction strategies
A data warehouse may store terabytes of data
  Complex data analysis/mining may take a very long time to run on the complete data set
Data reduction
  Obtain a reduced representation of the data set that is much smaller in volume yet produces the same (or almost the same) analytical results
Data reduction strategies
  Data cube aggregation
  Dimensionality reduction: remove unimportant attributes
  Data compression
  Numerosity reduction: fit data into models
  Discretization and concept hierarchy generation
Data cube aggregation
The lowest level of a data cube
  The aggregated data for an individual entity of interest
  E.g., a customer in a phone-call data warehouse
Multiple levels of aggregation in data cubes
  Further reduce the size of the data
Reference the appropriate level
  Use the smallest representation that can solve the task
Queries on aggregated information should be answered using the data cube when possible
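A minimal sketch of the lowest aggregation level: rolling detailed per-call records up to one total per customer. The helper name `aggregate` and the record layout are assumptions for illustration.

```python
from collections import defaultdict

def aggregate(rows, group_key, value_key):
    """Roll a detailed fact table up to a coarser level, e.g. per-call
    phone records summed into one total per customer."""
    totals = defaultdict(float)
    for row in rows:
        totals[row[group_key]] += row[value_key]
    return dict(totals)
```

The reduced table keeps one row per customer instead of one row per call.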
Dimensionality reduction
Feature selection (i.e., attribute subset selection):
  Select a minimum set of features such that the probability distribution of the classes given those feature values is as close as possible to the distribution given all features
  Reduces the number of patterns in the data, making the data easier to understand
Heuristic methods (because the number of possible subsets is exponential):
  Step-wise forward selection
  Combined forward selection and backward elimination
  Decision-tree induction
Example of decision tree induction
Initial attribute set: {A1, A2, A3, A4, A5, A6}
[Figure: a decision tree testing A4 at the root, then A1 and A6, with leaves labeled Class 1 / Class 2]
=> Reduced attribute set: {A1, A4, A6}
Decision tree classification
  A flow-chart-like tree structure
  Each internal node is a test function on an attribute
  Branches correspond to the outcomes of the test at the internal node
  Leaves are labels, i.e., classes
Building a decision tree:
  Tree construction: top-down approach
  Tree pruning: bottom-up approach; identify and remove noisy branches to improve accuracy when classifying new objects
Using a decision tree: classify objects that have not yet been labeled
Data compression
String compression
  There are extensive theories and well-tuned algorithms
  Typically lossless
  But only limited manipulation is possible without expansion
Audio/video compression
  Typically lossy compression, with progressive refinement
  Sometimes small fragments of the signal can be reconstructed without reconstructing the whole
Time sequences (other than audio)
  Typically short and vary slowly with time
Data compression
[Figure: lossless compression maps the original data to compressed data and back exactly; lossy compression recovers only an approximation of the original data]
Wavelet transformation
Discrete wavelet transform (DWT): linear signal processing, multiresolution analysis
Compressed approximation: store only a small fraction of the strongest wavelet coefficients
Similar to the discrete Fourier transform (DFT), but better lossy compression, localized in space
Method:
  The length L must be an integer power of 2 (pad with 0s when necessary)
  Each transform has 2 functions: smoothing and difference
  Apply them to pairs of data, producing two data sets of length L/2
  Apply the two functions recursively until the desired length is reached
Examples: Haar-2, Daubechies-4
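The recursive smoothing/difference scheme can be sketched with the Haar wavelet, using pairwise averages as the smoothing function and pairwise half-differences as the difference function (one common normalization; other conventions scale by 1/sqrt(2)).

```python
def haar_dwt(signal):
    """One full Haar DWT: repeatedly apply the smoothing (pairwise average)
    and difference (pairwise half-difference) functions, halving the data
    each pass until length 1. Input length must be a power of 2."""
    coeffs = []
    data = list(signal)
    while len(data) > 1:
        smooth = [(a + b) / 2 for a, b in zip(data[::2], data[1::2])]
        detail = [(a - b) / 2 for a, b in zip(data[::2], data[1::2])]
        coeffs = detail + coeffs     # finer details stay to the right
        data = smooth
    return data + coeffs             # overall average followed by details
```

Discarding the smallest detail coefficients yields the compressed approximation.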
DWT for image compression
[Figure: an image passed repeatedly through paired low-pass and high-pass filters]
Principal Component Analysis (PCA)
Given N data vectors of k dimensions, find c (<= k) orthogonal vectors that best represent the data
  The original data set is reduced to N data vectors of c dimensions: the c principal components (reduced dimensionality)
Each data vector is a linear combination of the principal-component vectors
Applies to numeric data only
Used when the number of dimensions is large
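A compact PCA sketch via the eigenvectors of the covariance matrix (assuming NumPy is available; the function name `pca` is mine):

```python
import numpy as np

def pca(X, c):
    """Project N k-dimensional vectors onto the c best orthogonal
    directions, i.e. the top-c eigenvectors of the covariance matrix."""
    Xc = X - X.mean(axis=0)                  # center the data
    cov = np.cov(Xc, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigenvalues in ascending order
    top = eigvecs[:, np.argsort(eigvals)[::-1][:c]]
    return Xc @ top                          # N x c reduced representation
```

For perfectly correlated 2-D points along y = x, a single component captures all the variance.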
Principal Component Analysis (PCA)
[Figure: axes X1, X2 with the principal directions Y1, Y2]
Numerosity reduction
Parametric methods
  Assume the data fit some model, estimate the model parameters, store only the parameters, and discard the data (except possible outliers)
  Log-linear models: obtain a value at a point in m-dimensional space as the product over appropriate marginal subspaces
Non-parametric methods
  Do not assume models
  Major families: histograms, clustering, sampling
Regression and log-linear models
Linear regression: data are modeled to fit a straight line
  Often uses the least-squares method to fit the line
Multiple regression: allows a response variable Y to be modeled as a linear function of a multidimensional feature vector
Log-linear model: approximates discrete multidimensional probability distributions
Regression analysis and log-linear models
Linear regression: Y = α + β X
  The two parameters, α and β, specify the line and are estimated from the data at hand,
  using the least-squares criterion on the known values of Y1, Y2, ..., X1, X2, ....
Multiple regression: Y = b0 + b1 X1 + b2 X2
  Many nonlinear functions can be transformed into the above.
Log-linear models:
  The multi-way table of joint probabilities is approximated by a product of lower-order tables.
  Probability: p(a, b, c, d) = αab βac χad δbcd
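The least-squares estimates for α and β have a closed form, which can be sketched directly (the helper name `fit_line` is mine):

```python
def fit_line(xs, ys):
    """Estimate alpha and beta of Y = alpha + beta * X by least squares:
    beta = cov(X, Y) / var(X), alpha = mean(Y) - beta * mean(X)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    beta = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
            / sum((x - mx) ** 2 for x in xs))
    alpha = my - beta * mx
    return alpha, beta
```

Fitting the points (0, 1), (1, 2), (2, 3), (3, 4) recovers exactly the line y = x + 1 used in the earlier regression figure.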
Histograms
A popular data reduction technique
Divide the data into buckets and store the average (or sum) for each bucket
Can be constructed optimally in one dimension using dynamic programming
Related to quantization problems
[Figure: bar chart of bucket counts (0 to 40) over values 10,000 to 90,000]
Clustering
Partition the data set into clusters; then only the cluster representations need to be stored
Can be very effective if the data are clustered, but not if the data are smeared
Hierarchical clustering is possible and can be stored in multi-dimensional index tree structures
There are many choices of clustering definitions and clustering algorithms
Sampling
Allows a mining algorithm to run in complexity that is potentially sub-linear in the size of the data
Choose a representative subset of the data
  Simple random sampling may perform very poorly in the presence of skew
Develop adaptive sampling methods
  Stratified sampling:
    Approximate the percentage of each class (or subpopulation of interest) in the overall database
    Used in conjunction with skewed data
Note: sampling may not reduce database I/Os (data is read a page at a time).
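Stratified sampling as described above can be sketched like this: sample each class at the same fraction so skewed class proportions are preserved. The function name and record layout are assumptions.

```python
import random

def stratified_sample(records, key, fraction, seed=42):
    """Sample every stratum (class) at the same fraction, preserving
    the skewed class proportions of the original data."""
    rng = random.Random(seed)
    strata = {}
    for r in records:
        strata.setdefault(key(r), []).append(r)
    sample = []
    for members in strata.values():
        k = max(1, round(len(members) * fraction))  # keep at least 1 per stratum
        sample.extend(rng.sample(members, k))
    return sample
```

A 90/10 class split sampled at 10% yields 9 records of the majority class and 1 of the minority class.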
Sampling (figures)
[Figure 1: the raw data set under simple random sampling]
[Figure 2: a cluster/stratified sample drawn from the raw data]
Hierarchical reduction
Use a multi-resolution structure with different degrees of reduction
Hierarchical clustering is often performed, but it tends to define partitions of data sets rather than clusters
Parametric methods are usually not amenable to hierarchical representation
Hierarchical aggregation
  An index tree hierarchically divides a data set into partitions by the value range of some attributes
  Each partition can be considered a bucket
  Thus an index tree with aggregates stored at each node is a hierarchical histogram
Chapter 2: Data preprocessing
Understanding and preparing data
The role of data preprocessing
Data cleaning
Data integration and transformation
Data reduction
Discretization and concept hierarchy generation
Discretization
Three types of attributes:
  Nominal: values from an unordered set
  Ordinal: values from an ordered set
  Continuous: real numbers
Discretization:
  Divide the range of a continuous attribute into intervals
  Some classification algorithms only accept categorical attributes
  Reduce data size by discretization
  Prepare for further analysis
Discretization and concept hierarchies
Discretization
  Reduce the number of values for a given continuous attribute by dividing the range of the attribute into intervals; interval labels can then replace actual data values
Concept hierarchies
  Reduce the data by collecting and replacing low-level concepts (such as numeric values for the attribute age) with higher-level concepts (such as young, middle-aged, or senior)
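Replacing numeric age values with interval labels can be sketched as follows; the boundary values 18/35/60 are an assumption for illustration, not part of the course material.

```python
def discretize(values, boundaries, labels):
    """Replace each continuous value with the label of its interval.
    boundaries are the upper cut points; labels has one extra entry
    for the final open-ended interval."""
    def label(v):
        for b, lab in zip(boundaries, labels):
            if v < b:
                return lab
        return labels[-1]
    return [label(v) for v in values]
```

Applying it to a few ages replaces raw numbers with concept labels such as "young" or "senior".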
Discretization and concept hierarchy generation for numeric data
Binning (see earlier sections)
Histogram analysis (see earlier sections)
Clustering analysis (see earlier sections)
Entropy-based discretization
Segmentation by natural partitioning
Entropy-based discretization
Given a set of samples S, if S is partitioned into two intervals S1 and S2 using boundary T, the entropy after partitioning is
  E(S, T) = (|S1| / |S|) Ent(S1) + (|S2| / |S|) Ent(S2)
The boundary that minimizes the entropy function over all possible boundaries is selected as a binary discretization.
The process is recursively applied to the partitions obtained until some stopping criterion is met, e.g.,
  Ent(S) - E(T, S) > δ
Experiments show that it may reduce data size and improve classification accuracy.
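The boundary search for E(S, T) can be sketched as follows; the helper names are mine, and midpoints between consecutive sorted values serve as candidate boundaries T.

```python
from math import log2

def ent(labels):
    """Class entropy of a list of labels."""
    n = len(labels)
    return -sum((labels.count(c) / n) * log2(labels.count(c) / n)
                for c in set(labels))

def best_split(values, labels):
    """Try each candidate boundary T; return the (T, E(S,T)) minimizing
    E(S, T) = |S1|/|S| * Ent(S1) + |S2|/|S| * Ent(S2)."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best = None
    for i in range(1, n):
        t = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [l for v, l in pairs[:i]]
        right = [l for v, l in pairs[i:]]
        e = len(left) / n * ent(left) + len(right) / n * ent(right)
        if best is None or e < best[1]:
            best = (t, e)
    return best
```

On data where the classes separate cleanly, the chosen boundary makes both intervals pure, so the post-split entropy is zero.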
Segmentation by natural partitioning
The simple 3-4-5 rule can be used to segment numeric data into relatively uniform, natural intervals.
Based on the number of distinct values at the most significant digit:
  If the range covers 3, 6, 7, or 9 distinct values, partition it into 3 intervals
  If it covers 2, 4, or 8 distinct values, partition it into 4
  If it covers 1, 5, or 10 distinct values, partition it into 5
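The core choice of the rule can be sketched as below. This is a simplified reading (equal-width splits of an already-rounded range given the most-significant-digit unit `msd`); the full rule also trims the range to the 5th/95th percentiles and recurses, as in the example that follows.

```python
def partition_345(low, high, msd):
    """Split [low, high] into 3, 4, or 5 equal segments depending on how
    many distinct most-significant-digit values the range covers."""
    n_distinct = round((high - low) / msd)
    if n_distinct in (3, 6, 7, 9):
        n = 3
    elif n_distinct in (2, 4, 8):
        n = 4
    else:                       # 1, 5, or 10 distinct values
        n = 5
    width = (high - low) / n
    return [(low + i * width, low + (i + 1) * width) for i in range(n)]
```

For the rounded range [-$1,000, $2,000] with msd = 1,000, the range spans 3 distinct values, so the rule yields 3 intervals of width $1,000.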
Example of the 3-4-5 rule
Step 1: profit values range from Min = -$351 to Max = $4,700; the 5th percentile is Low = -$159 and the 95th percentile is High = $1,838
Step 2: msd = 1,000, giving the rounded range Low' = -$1,000, High' = $2,000
Step 3: this range covers 3 distinct msd values, so partition it into 3 intervals:
  (-$1,000, 0], (0, $1,000], ($1,000, $2,000]
Step 4: adjust for the actual Min and Max, then recursively apply the rule in each interval:
  since Min = -$351, the first interval shrinks to (-$400, 0]; since Max = $4,700 exceeds $2,000, a new interval ($2,000, $5,000] is added
  (-$400, 0] splits into 4: (-$400, -$300], (-$300, -$200], (-$200, -$100], (-$100, 0]
  (0, $1,000] splits into 5: (0, $200], ($200, $400], ($400, $600], ($600, $800], ($800, $1,000]
  ($1,000, $2,000] splits into 5: ($1,000, $1,200], ($1,200, $1,400], ($1,400, $1,600], ($1,600, $1,800], ($1,800, $2,000]
  ($2,000, $5,000] splits into 3: ($2,000, $3,000], ($3,000, $4,000], ($4,000, $5,000]
Concept hierarchy generation for categorical data
Specification of a partial ordering of attribute values at the schema level, by users or experts
  e.g., street < city < state < country
Specification of a portion of the hierarchy by explicit data grouping
  e.g., {Urbana, Champaign, Chicago} < Illinois
Specification of a set of attributes
  Automatically generate a partial ordering by analyzing the number of distinct values
  e.g., street < city < state < country
Specification of only a partial set of attributes
  e.g., only street < city, with nothing else specified
Automatic concept hierarchy generation
Some concept hierarchies can be automatically generated based on the analysis of the number of distinct values per attribute in the given data set
  The attribute with the most distinct values is placed at the lowest level of the hierarchy
  Note the exceptions: weekday, month, quarter, year
  country: 15 distinct values
  province_or_state: 65 distinct values
  city: 3,567 distinct values
  street: 674,339 distinct values
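The distinct-value heuristic above reduces to a single sort: fewer distinct values means a higher (more general) level. A minimal sketch, with the helper name `order_hierarchy` chosen for illustration:

```python
def order_hierarchy(distinct_counts):
    """Order attributes from the most general level (fewest distinct
    values) down to the most specific (most distinct values)."""
    return [attr for attr, _ in
            sorted(distinct_counts.items(), key=lambda kv: kv[1])]

levels = order_hierarchy({
    "country": 15, "province_or_state": 65,
    "city": 3567, "street": 674339,
})
# top of the hierarchy first: country, then province_or_state, city, street
```

The exceptions noted above (weekday has 7 distinct values but is more specific than month) show why this heuristic cannot be applied blindly.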