
Data Cleaning: Missing values

4) Use the attribute mean to fill in missing values. For example, if the mean of the income attribute is 28000, use this value to replace the missing income values.
5) Use the attribute mean for all samples belonging to the same class as the given record. For example, when classifying customers according to credit risk, replace the missing value with the mean income of customers in the same credit-risk category as the given record.
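A minimal pandas sketch of strategies 4 and 5; the column names income and credit_risk are invented for illustration:

    import pandas as pd

    df = pd.DataFrame({
        "income": [28000.0, None, 30000.0, None, 26000.0],
        "credit_risk": ["low", "low", "high", "high", "low"],
    })

    # Strategy 4: fill missing income with the overall mean income.
    df["income_overall"] = df["income"].fillna(df["income"].mean())

    # Strategy 5: fill missing income with the mean income of the
    # record's own credit-risk class.
    df["income_by_class"] = df.groupby("credit_risk")["income"].transform(
        lambda s: s.fillna(s.mean()))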
Data Cleaning: Missing values

6) Use an advanced method, such as Bayesian formalism or a decision tree, to predict the missing value from the other attribute values.
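A hedged sketch of strategy 6 with scikit-learn (one possible choice of model): train a decision tree on the complete records, then predict income for the incomplete ones. The column names are made up:

    import pandas as pd
    from sklearn.tree import DecisionTreeRegressor

    df = pd.DataFrame({
        "age":    [25, 40, 35, 50, 28],
        "income": [20000.0, 45000.0, None, 52000.0, None],
    })

    known = df[df["income"].notna()]   # complete records form the training set
    missing = df["income"].isna()

    # Fit the tree on the other attributes, then predict the missing incomes.
    tree = DecisionTreeRegressor().fit(known[["age"]], known["income"])
    df.loc[missing, "income"] = tree.predict(df.loc[missing, ["age"]])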
Data Cleaning: Noisy Data

 Noise is a random error in a measured variable.
 Sources of noisy data:
- Data entry problems
- Faulty data collection instruments
- Data transmission errors
- Technology limitations
Data Cleaning: Noisy Data

 How to handle noisy data:
1) Binning method:
- First sort the data and partition it into (equi-depth) bins.
- Then smooth by bin means, by bin medians, or by bin boundaries, etc.
Data Cleaning: Noisy Data
 Example:
Sorted data for price (in dollars): 4, 8, 9, 15,
21, 21, 24, 25, 26, 28, 29, 34
Partition into (equi-depth) bins:
-Bin 1: 4, 8, 9, 15
-Bin 2: 21, 21, 24, 25
-Bin 3: 26, 28, 29, 34
* Smoothing by bin means:
-Bin 1: 9, 9, 9, 9
-Bin 2: 23, 23, 23, 23
-Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries:
-Bin 1: 4, 4, 4, 15
-Bin 2: 21, 21, 25, 25
-Bin 3: 26, 26, 26, 34
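A short Python sketch that reproduces the example above (a bin depth of 4 is taken from the example):

    prices = sorted([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])
    bins = [prices[i:i + 4] for i in range(0, len(prices), 4)]  # equi-depth bins

    # Smooth by bin means: every value becomes its bin's (rounded) mean.
    by_means = [[round(sum(b) / len(b))] * len(b) for b in bins]

    # Smooth by bin boundaries: every value snaps to the nearer of the
    # bin's minimum and maximum.
    by_bounds = [[min(b) if v - min(b) <= max(b) - v else max(b) for v in b]
                 for b in bins]

    print(by_means)   # [[9, 9, 9, 9], [23, 23, 23, 23], [29, 29, 29, 29]]
    print(by_bounds)  # [[4, 4, 4, 15], [21, 21, 25, 25], [26, 26, 26, 34]]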
Data Cleaning: Noisy Data

[Figure: histogram of customer ages (number of values per age).
Equi-width binning: 0-10, 10-20, 20-30, 30-40, 40-50, 50-60, 60-70, 70-80.
Equi-depth binning: 0-22, 22-31, 32-38, 38-44, 44-48, 48-55, 55-62, 62-80.]
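In pandas, the two schemes in the figure correspond roughly to cut (equi-width) and qcut (equi-depth); the ages below are invented:

    import pandas as pd

    ages = pd.Series([5, 13, 22, 25, 31, 35, 40, 44, 47, 52, 58, 63, 75])

    equi_width = pd.cut(ages, bins=8)   # 8 intervals of equal width
    equi_depth = pd.qcut(ages, q=8)     # 8 intervals with roughly equal counts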
Data Cleaning: Noisy Data

2) Clustering: Outliers may be detected by clustering, where similar values are organized into groups (clusters); values that fall outside the set of clusters may be considered outliers.
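A minimal sketch of clustering-based outlier detection with scikit-learn's KMeans; the toy data, the number of clusters, and the "tiny cluster" rule are all assumptions:

    import numpy as np
    from sklearn.cluster import KMeans

    values = np.array([[4], [8], [9], [15], [21], [24], [25], [26], [28], [95]])

    km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(values)
    labels, counts = np.unique(km.labels_, return_counts=True)

    # Values landing in very small clusters fall "outside" the main
    # groups and can be treated as outlier candidates.
    small = labels[counts <= 1]
    print(values[np.isin(km.labels_, small)].ravel())  # [95] on this toy data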
Data Cleaning: Noisy Data

3) Combined computer and human inspection: Outliers may be identified automatically by detecting suspicious values, which are then checked by a human.
Data Cleaning: Noisy Data

4) Regression:
Data can be smoothed by fitting it to a function, such as a regression line.
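A sketch of regression-based smoothing: fit a straight line by least squares with numpy and replace each observation by its fitted value (the synthetic data is an assumption):

    import numpy as np

    x = np.arange(10)
    y = 2 * x + np.random.normal(scale=1.5, size=10)   # noisy linear data

    slope, intercept = np.polyfit(x, y, deg=1)  # least-squares line fit
    y_smooth = slope * x + intercept            # smoothed values on the line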
Data Cleaning: Inconsistent Data

Data that is inconsistent with our models should be dealt with. Common sense can also be used to detect such inconsistencies.
 The same name occurring in different forms within an application
 Different names that appear to refer to the same entity (Dennis vs. Denis)
 Inappropriate values (males being pregnant, or a negative age)
Data Cleaning: Inconsistent Data

 Example
 A particular bank's database had about 5% of its customers born on 11/11/11, which is usually the default value for the birthday attribute.
 Data inconsistencies may be corrected manually using external references. For example, errors made at data entry may be corrected by performing a paper trace.
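A sketch of simple automated consistency checks like those above; the column names and the default-date rule are assumptions:

    import pandas as pd

    customers = pd.DataFrame({
        "birthday": ["1980-03-02", "1911-11-11", "1911-11-11"],
        "age": [45, -3, 30],
    })

    # Flag the suspicious default birthday and impossible ages for
    # manual review against external references.
    suspect = customers[(customers["birthday"] == "1911-11-11") |
                        (customers["age"] < 0)]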
Inconsistent Data : Unified
Date Format
We want to transform all dates to the same format internally.
 Some systems accept dates in many formats
 e.g. “Sep 24, 2003”, 9/24/03, 24.09.03, etc.
 Dates are transformed internally to a standard value.
 Frequently, just the year (YYYY) is sufficient.
 For more detail, we may need the month, the day, the hour, etc.
 Representing dates as YYYYMM or YYYYMMDD can be OK.
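A sketch that normalizes the example formats to an internal YYYYMMDD string with Python's datetime; the list of accepted formats is an assumption:

    from datetime import datetime

    FORMATS = ["%b %d, %Y", "%m/%d/%y", "%d.%m.%y"]

    def to_yyyymmdd(text):
        for fmt in FORMATS:
            try:
                return datetime.strptime(text, fmt).strftime("%Y%m%d")
            except ValueError:
                continue
        raise ValueError(f"unrecognized date: {text}")

    print(to_yyyymmdd("Sep 24, 2003"))  # 20030924
    print(to_yyyymmdd("9/24/03"))       # 20030924
    print(to_yyyymmdd("24.09.03"))      # 20030924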
Data Integration

Data integration combines data from multiple sources into a coherent store.
Increasingly, data mining projects require data from more than one source, such as multiple databases, data warehouses, flat files, and historical data.
Data Integration

 Data is stored in many systems across the enterprise and outside it.
 The sources of data fall into two categories: internal and external.
 Internal sources are generated through enterprise activities, such as databases, historical data, Web sites, and warehouses.
 External sources include credit bureaus, phone companies, and demographic information.
Data Integration

 Data can be in many formats:
 Data in a DBMS
 Data in a flat file
 Data in a data warehouse
Data Integration

 Data Warehouse:
A structure that links information from two or more databases.
A data warehouse brings data from different data sources into a central repository.
It performs some data integration, clean-up, and summarization, and distributes the information to data marts.
Data marts house subsets of data from the central repository that have been selected and prepared for specific end users (they are often called departmental data warehouses).
Data Transformation

Data transformation
Data is transformed or consolidated into forms appropriate for the given data mining method.
Methods include:
 Smoothing
 Aggregation
 Generalization
 Normalization (min-max)
Data Transformation

 Smoothing: removes noise from the data. Such techniques include binning, clustering, and regression (as described under noisy data).
 Aggregation, Generalization: will be discussed later.
Data Transformation

 Normalization:
The attributes are scaled so as to fall within a small specified range, such as -1.0 to 1.0.
Min-max normalization performs a linear transformation on the original data:

v  minA
v'  (new _ maxA  new _ minA)  new _ minA
maxA  minA
Data Transformation

 Example: Suppose the minimum and maximum values for the attribute income are $12000 and $98000, and we would like to map income to the range 0.0 to 1.0. By min-max normalization, a value of 73600 for income is transformed to

(73600 - 12000) / (98000 - 12000) * (1.0 - 0) + 0 = 0.716
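The same computation as a small Python check:

    def min_max(v, min_a, max_a, new_min=0.0, new_max=1.0):
        return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min

    print(round(min_max(73600, 12000, 98000), 3))  # 0.716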
Data Transformation

 z-score normalization: The values of an attribute A are normalized based on the mean and standard deviation of A.

v  meanA
v' 
stand _ devA
 This method is used when the actual minimum and maximum of the attribute are unknown, or when outliers dominate the min-max normalization.
Data Transformation

 Example: Suppose the mean and standard deviation of the values for attribute income are 54000 and 16000. With z-score normalization, a value of 73600 for income is transformed to

(73600 - 54000) / 16000 = 1.225
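And the corresponding one-liner:

    def z_score(v, mean_a, std_a):
        return (v - mean_a) / std_a

    print(z_score(73600, 54000, 16000))  # 1.225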
Data Transformation

 Normalization by decimal scaling normalizes by moving the decimal point of the values of attribute A. The number of decimal places moved depends on the maximum absolute value of A. A value v of A is normalized to v' by computing
v' = v / 10^j, where j is the smallest integer such that max(|v'|) < 1
Data Transformation

 Example: Suppose that the recorded values of A range from -986 to 917. The maximum absolute value of A is 986. To normalize, we therefore divide each value by 1000 (j = 3), so that -986 normalizes to -0.986.
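A sketch of decimal scaling; computing j via log10 is one possible implementation:

    import math

    def decimal_scale(values):
        # Smallest j such that max(|v|) / 10**j < 1.
        j = math.floor(math.log10(max(abs(v) for v in values))) + 1
        return [v / 10 ** j for v in values], j

    scaled, j = decimal_scale([-986, 917])
    print(j, scaled)  # 3 [-0.986, 0.917]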
Data Reduction
 Warehouses may store terabytes of data: complex data analysis/mining may take a very long time to run on the complete data set.

 Data reduction
Obtains a reduced representation of the data set that is much smaller in volume, yet produces the same (or almost the same) analytical results.
Data Reduction

 The choice of data representation, and the selection, reduction, or transformation of features, is probably the most important issue determining the quality of a data-mining solution.
 Therefore, the three basic operations in a data-reduction process are deleting a column (feature selection), deleting a row (sampling), and reducing the number of values in a column (discretization).
Data Reduction

 Feature Selection:
We want to choose features
(attributes) that are relevant to our
data-mining application in order to
achieve maximum performance with
the minimum measurement and
processing effort.
Data Reduction: Feature Selection

1) Redundant features
 duplicate much or all of the information contained in
one or more other attributes
 E.g., purchase price of a product and the amount of
sales tax paid
2) Irrelevant features
 contain no information that is useful for the data
mining task at hand
 E.g., students' ID is often irrelevant to the task of
predicting students' GPA
Data Reduction: Feature Selection

 First: Remove fields with little or no variability.
 Examine the number of distinct field values.
 Rule of thumb: remove a field where almost all values are the same (e.g. null), except possibly in minp% or less of all records.
 minp could be 0.5%, or more generally less than 5% of the number of targets of the smallest class.
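A pandas sketch of the rule of thumb; the 0.5% default for minp follows the slide, and everything else is an assumption:

    import pandas as pd

    df = pd.DataFrame({"flag": ["y"] * 995 + ["n"] * 5,
                       "income": range(1000)})

    def drop_low_variability(df, minp=0.005):
        # Drop fields where a single value (nulls included) covers at
        # least (1 - minp) of all records.
        drop = [c for c in df.columns
                if df[c].value_counts(dropna=False, normalize=True).iloc[0]
                >= 1 - minp]
        return df.drop(columns=drop)

    print(drop_low_variability(df).columns.tolist())  # ['income']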
