Binning
Example: customer ages
Equi-width binning (bins of equal interval width):
0-10 10-20 20-30 30-40 40-50 50-60 60-70 70-80
Equi-depth binning (roughly the same number of values per bin):
0-22 22-31 32-38 38-44 44-48 48-55 55-62 62-80
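As a sketch (not from the slides), the two binning schemes can be computed as follows; the function names and the sample ages are illustrative:

```python
def equi_width_bins(values, num_bins):
    """Split the value range into num_bins intervals of equal width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / num_bins
    return [lo + i * width for i in range(num_bins + 1)]

def equi_depth_bins(values, num_bins):
    """Choose bin edges so each bin holds roughly the same number of values."""
    ordered = sorted(values)
    n = len(ordered)
    edges = [ordered[0]]
    for i in range(1, num_bins):
        edges.append(ordered[i * n // num_bins])
    edges.append(ordered[-1])
    return edges

ages = [3, 7, 12, 19, 24, 25, 33, 41, 45, 52, 58, 66, 71, 78]
print(equi_width_bins(ages, 4))   # equal-width edges over [3, 78]
print(equi_depth_bins(ages, 4))   # edges chosen by rank, not by distance
```

Equi-width edges depend only on the min and max; equi-depth edges follow the distribution of the data, which is why the age bins above have unequal widths.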
Data Cleaning: Noisy Data
4) Regression:
Data can be smoothed by fitting the data to a function.
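A minimal illustration of regression-based smoothing, assuming a simple least-squares line fit; the helper name and the sample points are made up:

```python
def smooth_by_linear_regression(xs, ys):
    """Fit y = a*x + b by least squares and replace each y with its fitted value."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # slope = covariance(x, y) / variance(x)
    a = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
    b = mean_y - a * mean_x
    return [a * x + b for x in xs]

xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]   # noisy points near y = 2x
print(smooth_by_linear_regression(xs, ys))
```

Each noisy value is replaced by the value the fitted line predicts, which removes the random scatter while keeping the trend.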
Data Cleaning: Inconsistent Data
Example
A particular bank's database had about 5% of its customers born on 11/11/11, which is usually the default value for the birthday attribute.
Data inconsistencies may be corrected manually using external references. For example, errors made at data entry may be corrected by performing a paper trace.
Inconsistent Data: Unified Date Format
We want to transform all dates to the same format internally.
Some systems accept dates in many formats, e.g. "Sep 24, 2003", 9/24/03, 24.09.03, etc.; dates are transformed internally to a standard value.
Frequently, just the year (YYYY) is sufficient. For more detail, we may need the month, the day, the hour, etc.
Representing a date as YYYYMM or YYYYMMDD can be OK.
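One way to sketch this normalization in Python, assuming an illustrative list of accepted input formats (KNOWN_FORMATS is not from the slides):

```python
from datetime import datetime

# Formats we assume may appear in the raw data (illustrative list).
KNOWN_FORMATS = ["%b %d, %Y", "%m/%d/%y", "%d.%m.%y", "%Y-%m-%d"]

def to_yyyymmdd(raw):
    """Parse a date string in any known format and return it as YYYYMMDD."""
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(raw, fmt).strftime("%Y%m%d")
        except ValueError:
            continue  # try the next format
    raise ValueError(f"unrecognized date format: {raw!r}")

for raw in ["Sep 24, 2003", "9/24/03", "24.09.03"]:
    print(to_yyyymmdd(raw))   # all three map to the same standard value
```

Note that ambiguous formats (9/24/03 vs. 24.09.03) are disambiguated only by their separators here; a real system would need to know which convention each source uses.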
Data Integration
Combines data from multiple sources into a coherent store.
Increasingly, data mining projects require data from more than one data source, such as multiple databases, data warehouses, flat files, and historical data.
Data Warehouse:
A structure that links information from two or more databases.
A data warehouse brings data from different data sources into a central repository.
It performs some data integration, clean-up, and summarization, and distributes the information to data marts.
Data marts house subsets of data from the central repository that have been selected and prepared for specific end users (they are often called departmental data warehouses).
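A toy sketch of integrating records from two hypothetical sources by a shared key; the source names (crm, billing) and the fields are invented for illustration:

```python
def integrate(sources, key="id"):
    """Merge record lists from several sources into one record per key.
    Non-null fields from later sources overwrite earlier values."""
    merged = {}
    for records in sources:
        for rec in records:
            merged.setdefault(rec[key], {}).update(
                {k: v for k, v in rec.items() if v is not None})
    return list(merged.values())

crm = [{"id": 1, "name": "Ada", "email": None}]
billing = [{"id": 1, "email": "ada@example.com"}, {"id": 2, "name": "Bob"}]
print(integrate([crm, billing]))
```

Real integration also has to resolve schema conflicts (different attribute names or units for the same thing), which this sketch ignores.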
Data Transformation
Data transformation:
Transform the data into a form appropriate for the given data mining method.
Data is transformed or consolidated into forms appropriate for mining.
Methods include:
Smoothing
Aggregation
Generalization
Normalization (min-max)
Normalization:
The attributes are scaled so as to fall within a small specified range, such as -1.0 to 1.0.
Min-max normalization: performs a linear transformation on the original data:
v' = ((v - minA) / (maxA - minA)) * (new_maxA - new_minA) + new_minA
Example: suppose the minimum and maximum values of an attribute are 12,000 and 98,000, and we map it to the range [0.0, 1.0]. Then the value 73,600 is transformed to:
((73,600 - 12,000) / (98,000 - 12,000)) * (1.0 - 0) + 0 = 0.716
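The formula and the worked example above can be checked with a small helper (the function name is illustrative):

```python
def min_max_normalize(v, min_a, max_a, new_min=0.0, new_max=1.0):
    """Linearly map v from [min_a, max_a] onto [new_min, new_max]."""
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min

# The slide's example: min = 12,000, max = 98,000, target range [0.0, 1.0].
print(round(min_max_normalize(73600, 12000, 98000), 3))  # -> 0.716
```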
Z-score normalization:
v' = (v - meanA) / stand_devA
This method is used when the actual minimum and maximum of an attribute are unknown, or when outliers dominate the min-max normalization.
Example: if the mean is 54,000 and the standard deviation is 16,000, then the value 73,600 is transformed to:
(73,600 - 54,000) / 16,000 = 1.225
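The z-score example can likewise be checked with a one-line helper (the name is illustrative):

```python
def z_score_normalize(v, mean_a, std_a):
    """Shift by the attribute's mean and scale by its standard deviation."""
    return (v - mean_a) / std_a

# The slide's example: mean = 54,000, standard deviation = 16,000.
print(z_score_normalize(73600, 54000, 16000))  # -> 1.225
```

Unlike min-max, the result is not confined to a fixed range, so one extreme outlier cannot squeeze all the other values into a narrow band.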
Data Reduction
Data reduction:
Obtains a reduced representation of the data set that is much smaller in volume, yet produces the same (or almost the same) analytical results.
Data Reduction
Feature Selection:
We want to choose features
(attributes) that are relevant to our
data-mining application in order to
achieve maximum performance with
the minimum measurement and
processing effort.
Data Reduction: Feature Selection
1) Redundant features
duplicate much or all of the information contained in
one or more other attributes
E.g., purchase price of a product and the amount of
sales tax paid
2) Irrelevant features
contain no information that is useful for the data
mining task at hand
E.g., students' ID is often irrelevant to the task of
predicting students' GPA
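A rough sketch of detecting a redundant feature via correlation, in the spirit of the price/sales-tax example above; the threshold and column names are illustrative assumptions, not a standard algorithm from the slides:

```python
def pearson(xs, ys):
    """Pearson correlation between two numeric columns."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

def drop_redundant(columns, threshold=0.99):
    """Keep a column only if it is not almost perfectly correlated
    with a column that has already been kept."""
    kept = {}
    for name, values in columns.items():
        if all(abs(pearson(values, v)) < threshold for v in kept.values()):
            kept[name] = values
    return list(kept)

data = {
    "price":     [100, 200, 300, 400],
    "sales_tax": [5.0, 10.0, 15.0, 20.0],   # price * 5% -- redundant
    "age":       [25, 32, 19, 41],
}
print(drop_redundant(data))  # sales_tax is dropped, age is kept
```

Irrelevant features such as a student ID cannot be caught this way; they are usually excluded by domain knowledge or by measuring relevance against the target attribute.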
Data Reduction: Feature Selection