
Data Cleaning: Missing values

4) Use the attribute mean to fill in missing values. For example, if the mean of the income attribute is 28000, use this value to replace the missing income values.
5) Use the attribute mean for all samples belonging to the same class as the given record. For example, when classifying customers according to credit risk, replace the missing value with the mean income of customers in the same credit-risk category as the given record.
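A minimal pandas sketch of strategies 4 and 5; the column names income and credit_risk are invented for illustration:

    import pandas as pd

    df = pd.DataFrame({
        "income": [28000.0, None, 30000.0, None, 26000.0],
        "credit_risk": ["low", "low", "high", "high", "low"],
    })

    # Strategy 4: fill missing income with the overall mean income.
    df["income_overall"] = df["income"].fillna(df["income"].mean())

    # Strategy 5: fill missing income with the mean income of the
    # record's own credit-risk class.
    df["income_by_class"] = df.groupby("credit_risk")["income"].transform(
        lambda s: s.fillna(s.mean()))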
Data Cleaning: Missing values

6) Use an advanced method, such as Bayesian formalism or a decision tree, to predict the missing value from the other attribute values.
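A hedged sketch of strategy 6 with scikit-learn (one possible choice of model): train a decision tree on the complete records, then predict income for the incomplete ones. The column names are made up:

    import pandas as pd
    from sklearn.tree import DecisionTreeRegressor

    df = pd.DataFrame({
        "age":    [25, 40, 35, 50, 28],
        "income": [20000.0, 45000.0, None, 52000.0, None],
    })

    known = df[df["income"].notna()]   # complete records form the training set
    missing = df["income"].isna()

    # Fit the tree on the other attributes, then predict the missing incomes.
    tree = DecisionTreeRegressor().fit(known[["age"]], known["income"])
    df.loc[missing, "income"] = tree.predict(df.loc[missing, ["age"]])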
Data Cleaning: Noisy Data

 Noise is a random error in a measured variable.
 Sources of noisy data:
- Data entry problems
- Faulty data collection instruments
- Data transmission errors
- Technology limitations
Data Cleaning: Noisy Data

 How to handle noisy data:
1) Binning method:
- First sort the data and partition it into (equi-depth) bins.
- Then smooth by bin means, by bin medians, or by bin boundaries, etc.
Data Cleaning: Noisy Data
 Example:
Sorted data for price (in dollars): 4, 8, 9, 15,
21, 21, 24, 25, 26, 28, 29, 34
Partition into (equi-depth) bins:
-Bin 1: 4, 8, 9, 15
-Bin 2: 21, 21, 24, 25
-Bin 3: 26, 28, 29, 34
* Smoothing by bin means:
-Bin 1: 9, 9, 9, 9
-Bin 2: 23, 23, 23, 23
-Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries:
-Bin 1: 4, 4, 4, 15
-Bin 2: 21, 21, 25, 25
-Bin 3: 26, 26, 26, 34
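A short Python sketch that reproduces the example above (a bin depth of 4 is taken from the example):

    prices = sorted([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])
    bins = [prices[i:i + 4] for i in range(0, len(prices), 4)]  # equi-depth bins

    # Smooth by bin means: every value becomes its bin's (rounded) mean.
    by_means = [[round(sum(b) / len(b))] * len(b) for b in bins]

    # Smooth by bin boundaries: every value snaps to the nearer of the
    # bin's minimum and maximum.
    by_bounds = [[min(b) if v - min(b) <= max(b) - v else max(b) for v in b]
                 for b in bins]

    print(by_means)   # [[9, 9, 9, 9], [23, 23, 23, 23], [29, 29, 29, 29]]
    print(by_bounds)  # [[4, 4, 4, 15], [21, 21, 25, 25], [26, 26, 26, 34]]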
Data Cleaning: Noisy Data

[Figure: histogram of customer ages (number of values per age).
Equi-width binning: 0-10, 10-20, 20-30, 30-40, 40-50, 50-60, 60-70, 70-80.
Equi-depth binning: 0-22, 22-31, 32-38, 38-44, 44-48, 48-55, 55-62, 62-80.]
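In pandas, the two schemes in the figure correspond roughly to cut (equi-width) and qcut (equi-depth); the ages below are invented:

    import pandas as pd

    ages = pd.Series([5, 13, 22, 25, 31, 35, 40, 44, 47, 52, 58, 63, 75])

    equi_width = pd.cut(ages, bins=8)   # 8 intervals of equal width
    equi_depth = pd.qcut(ages, q=8)     # 8 intervals with roughly equal counts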
Data Cleaning: Noisy Data

2) Clustering: Outliers may be detected by clustering, where similar values are organized into groups (clusters); values that fall outside the set of clusters may be considered outliers.
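A minimal sketch of clustering-based outlier detection with scikit-learn's KMeans; the toy data, the number of clusters, and the "tiny cluster" rule are all assumptions:

    import numpy as np
    from sklearn.cluster import KMeans

    values = np.array([[4], [8], [9], [15], [21], [24], [25], [26], [28], [95]])

    km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(values)
    labels, counts = np.unique(km.labels_, return_counts=True)

    # Values landing in very small clusters fall "outside" the main
    # groups and can be treated as outlier candidates.
    small = labels[counts <= 1]
    print(values[np.isin(km.labels_, small)].ravel())  # [95] on this toy data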
Data Cleaning: Noisy Data

3) Combined computer and human inspection: Outliers may be identified automatically by detecting suspicious values, which are then checked by a human.
Data Cleaning: Noisy Data

4) Regression:
Data can be smoothed by fitting it to a function, such as a regression line.
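A sketch of regression-based smoothing: fit a straight line by least squares with numpy and replace each observation by its fitted value (the synthetic data is an assumption):

    import numpy as np

    x = np.arange(10)
    y = 2 * x + np.random.normal(scale=1.5, size=10)   # noisy linear data

    slope, intercept = np.polyfit(x, y, deg=1)  # least-squares line fit
    y_smooth = slope * x + intercept            # smoothed values on the line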
Data Cleaning: Inconsistent Data

Data that is inconsistent with our models should be dealt with. Common sense can also be used to detect such inconsistencies.
 The same name occurring in different forms within an application
 Different names that appear to refer to the same entity (Dennis vs. Denis)
 Inappropriate values (males being pregnant, or a negative age)
Data Cleaning: Inconsistent Data

 Example
 A particular bank's database had about 5% of its customers born on 11/11/11, which is usually the default value for the birthday attribute.
 Data inconsistencies may be corrected manually using external references. For example, errors made at data entry may be corrected by performing a paper trace.
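A sketch of simple automated consistency checks like those above; the column names and the default-date rule are assumptions:

    import pandas as pd

    customers = pd.DataFrame({
        "birthday": ["1980-03-02", "1911-11-11", "1911-11-11"],
        "age": [45, -3, 30],
    })

    # Flag the suspicious default birthday and impossible ages for
    # manual review against external references.
    suspect = customers[(customers["birthday"] == "1911-11-11") |
                        (customers["age"] < 0)]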
Inconsistent Data : Unified
Date Format
We want to transform all dates to the same format internally.
 Some systems accept dates in many formats
 e.g. “Sep 24, 2003”, 9/24/03, 24.09.03, etc.
 Dates are transformed internally to a standard value.
 Frequently, just the year (YYYY) is sufficient.
 For more detail, we may need the month, the day, the hour, etc.
 Representing dates as YYYYMM or YYYYMMDD can be OK.
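A sketch that normalizes the example formats to an internal YYYYMMDD string with Python's datetime; the list of accepted formats is an assumption:

    from datetime import datetime

    FORMATS = ["%b %d, %Y", "%m/%d/%y", "%d.%m.%y"]

    def to_yyyymmdd(text):
        for fmt in FORMATS:
            try:
                return datetime.strptime(text, fmt).strftime("%Y%m%d")
            except ValueError:
                continue
        raise ValueError(f"unrecognized date: {text}")

    print(to_yyyymmdd("Sep 24, 2003"))  # 20030924
    print(to_yyyymmdd("9/24/03"))       # 20030924
    print(to_yyyymmdd("24.09.03"))      # 20030924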
Data Integration

Data integration combines data from multiple sources into a coherent store.
Increasingly, data mining projects require data from more than one source, such as multiple databases, data warehouses, flat files, and historical data.
Data Integration

 Data is stored in many systems across the enterprise and outside it.
 The sources of data fall into two categories: internal and external.
 Internal sources are generated through enterprise activities, such as databases, historical data, Web sites, and warehouses.
 External sources include credit bureaus, phone companies, and demographic information.
Data Integration

 Data can be in many formats:
 Data in a DBMS
 Data in a flat file
 Data in a data warehouse
Data Integration

 Data Warehouse:
A structure that links information from two or more databases.
A data warehouse brings data from different data sources into a central repository.
It performs some data integration, clean-up, and summarization, and distributes the information to data marts.
Data marts house subsets of data from the central repository that have been selected and prepared for specific end users (they are often called departmental data warehouses).
Data Transformation

Data transformation
Data is transformed or consolidated into forms appropriate for the given data mining method.
Methods include:
 Smoothing
 Aggregation
 Generalization
 Normalization (min-max)
Data Transformation

 Smoothing: removes noise from the data. Such techniques include binning, clustering, and regression (as described under noisy data).
 Aggregation, Generalization: will be discussed later.
Data Transformation

 Normalization:
The attributes are scaled so as to fall within a small specified range, such as -1.0 to 1.0.
Min-max normalization performs a linear transformation on the original data:

v  minA
v'  (new _ maxA  new _ minA)  new _ minA
maxA  minA
Data Transformation

 Example: Suppose the minimum and maximum values for the attribute income are $12000 and $98000, and we would like to map income to the range 0.0 to 1.0. By min-max normalization, a value of 73600 for income is transformed to

(73600 - 12000) / (98000 - 12000) * (1.0 - 0) + 0 = 0.716
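The same computation as a small Python check:

    def min_max(v, min_a, max_a, new_min=0.0, new_max=1.0):
        return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min

    print(round(min_max(73600, 12000, 98000), 3))  # 0.716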
Data Transformation

 z-score normalization: The values of an attribute A are normalized based on the mean and standard deviation of A.

v  meanA
v' 
stand _ devA
 This method is used when the actual minimum and maximum of the attribute are unknown, or when outliers dominate the min-max normalization.
Data Transformation

 Example: Suppose the mean and standard deviation of the values for attribute income are 54000 and 16000. With z-score normalization, a value of 73600 for income is transformed to

(73600 - 54000) / 16000 = 1.225
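And the corresponding one-liner:

    def z_score(v, mean_a, std_a):
        return (v - mean_a) / std_a

    print(z_score(73600, 54000, 16000))  # 1.225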
Data Transformation

 Normalization by decimal scaling normalizes by moving the decimal point of the values of attribute A. The number of decimal places moved depends on the maximum absolute value of A. A value v of A is normalized to v' by computing
v' = v / 10^j, where j is the smallest integer such that max(|v'|) < 1
Data Transformation

 Example: Suppose that the recorded values of A range from -986 to 917. The maximum absolute value of A is 986. To normalize, we therefore divide each value by 1000 (j = 3), so that -986 normalizes to -0.986.
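A sketch of decimal scaling; computing j via log10 is one possible implementation:

    import math

    def decimal_scale(values):
        # Smallest j such that max(|v|) / 10**j < 1.
        j = math.floor(math.log10(max(abs(v) for v in values))) + 1
        return [v / 10 ** j for v in values], j

    scaled, j = decimal_scale([-986, 917])
    print(j, scaled)  # 3 [-0.986, 0.917]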
Data Reduction
 Warehouses may store terabytes of data: complex data analysis/mining may take a very long time to run on the complete data set.

 Data reduction
Obtains a reduced representation of the data set that is much smaller in volume, yet produces the same (or almost the same) analytical results.
Data Reduction

 The choice of data representation, and the selection, reduction, or transformation of features, is probably the most important issue determining the quality of a data-mining solution.
 Therefore, the three basic operations in a data-reduction process are deleting a column (feature selection), deleting a row (sampling), and reducing the number of values in a column (discretization).
Data Reduction

 Feature Selection:
We want to choose features
(attributes) that are relevant to our
data-mining application in order to
achieve maximum performance with
the minimum measurement and
processing effort.
Data Reduction: Feature Selection

1) Redundant features
 duplicate much or all of the information contained in
one or more other attributes
 E.g., purchase price of a product and the amount of
sales tax paid
2) Irrelevant features
 contain no information that is useful for the data
mining task at hand
 E.g., students' ID is often irrelevant to the task of
predicting students' GPA
Data Reduction: Feature Selection

 First: Remove fields with little or no variability.
 Examine the number of distinct field values.
 Rule of thumb: remove a field where almost all values are the same (e.g. null), except possibly in minp% or less of all records.
 minp could be 0.5%, or more generally less than 5% of the number of targets of the smallest class.
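A pandas sketch of the rule of thumb; the 0.5% default for minp follows the slide, and everything else is an assumption:

    import pandas as pd

    df = pd.DataFrame({"flag": ["y"] * 995 + ["n"] * 5,
                       "income": range(1000)})

    def drop_low_variability(df, minp=0.005):
        # Drop fields where a single value (nulls included) covers at
        # least (1 - minp) of all records.
        drop = [c for c in df.columns
                if df[c].value_counts(dropna=False, normalize=True).iloc[0]
                >= 1 - minp]
        return df.drop(columns=drop)

    print(drop_low_variability(df).columns.tolist())  # ['income']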
