Professional Documents
Culture Documents
Data Preprocessing
Attribute-value data:
Data types
numeric, categorical (see the
hierarchy for its relationship)
static, dynamic (temporal)
Other kinds of data
distributed data
text, Web, meta data
images, audio/video
Chapter 2: Data
Preprocessing
Data cleaning
Data reduction
Discretization
Summary
e.g., occupation=
e.g., Salary=-10
Multi-Dimensional Measure of
Data Quality
Accuracy
Completeness
Consistency
Timeliness
Believability
Value added
Interpretability
Accessibility
Data cleaning
Data integration
Data reduction
Data transformation
Chapter 2: Data
Preprocessing
Data cleaning
Data reduction
Discretization
Summary
Data Cleaning
Importance
Missing Data
equipment malfunction
inconsistent with other recorded data and thus deleted
data not entered due to misunderstanding
certain data may not be considered important at the time of
entry
not register history or changes of the data
10
11
Noisy Data
12
Binning method:
Clustering
13
Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28,
29, 34
Partition into (equi-depth) bins:
Bin 1: 4, 8, 9, 15
Bin 2: 21, 21, 24, 25
Bin 3: 26, 28, 29, 34
Bin 1: 9, 9, 9, 9
Bin 2: 23, 23, 23, 23
Bin 3: 29, 29, 29, 29
Bin 1: 4, 4, 4, 15
Bin 2: 21, 21, 25, 25
Bin 3: 26, 26, 26, 34
14
Outlier Removal
Removal methods
Clustering
Curve-fitting
Hypothesis-testing with a given model
15
Chapter 2: Data
Preprocessing
Data cleaning
Data reduction
Discretization
Summary
16
Data Integration
Data integration:
Schema integration
for the same real world entity, attribute values from different
sources are different, e.g., different scales, metric vs. British
units
17
Data
Transformation
Attribute/feature construction
Aggregation: summarization
18
Data Transformation:
Normalization
min-max normalization
v minA
v'
(new _ maxA new _ minA) new _ minA
maxA minA
z-score normalization
v meanA
v'
stand _ devA
normalization by decimal scaling
v
v' j Where j is the smallest integer such that Max(| v' |)<1
10
19
Chapter 2: Data
Preprocessing
Data cleaning
Data reduction
Discretization
Summary
20
21
Dimensionality Reduction
22
Histograms
A popular data
reduction technique
Divide data into buckets
and store average
(sum) for each bucket
23
Clustering
24
Sampling
Stratified sampling:
Approximate the percentage of each class (or
subpopulation of interest) in the overall database
Used in conjunction with skewed data
25
Sampling
Raw Data
Cluster/Stratified Sample
26
Chapter 2: Data
Preprocessing
Data cleaning
Data reduction
Discretization
Summary
27
Discretization
Discretization:
Some techniques:
28
Discretization
Concept hierarchies
29
Binning
Bin 1: 0, 4, 12
Bin 2: 16, 16, 18
Bin 3: 24, 26, 28
Entropy-based (1)
31
Entropy-based (2)
32
Summary
Discretization
33