
Data Preprocessing

Week 2

Topics

Data Types
Data Repositories
Data Preprocessing
Present homework assignment #1

Team Homework Assignment #2


Read pp. 227-240, p. 250, and pp. 259-263 of the textbook.
Do Examples 5.3, 5.4, 5.8, 5.9, and Exercise 5.5.
Write an R program to verify your answer for Exercise 5.5.
Refer to pp. 453-458 of the lab book.
Explore frequent pattern mining tools and play with them for Exercise 5.5.
Prepare to present the results of the homework assignment.
Due date: beginning of the lecture on Friday, February 11th.

Team Homework Assignment #3

Prepare a one-page description of your group project topic.
Prepare a presentation using slides.
Due date: beginning of the lecture on Friday, February 11th.

Figure 1.4 Data mining as a step in the process of knowledge discovery

Why Is Data Preprocessing Important?

Welcome to the Real World!
No quality data, no quality mining results!
Preprocessing is one of the most critical steps in a data mining process.

Major Tasks in Data Preprocessing

Figure 2.1 Forms of data preprocessing

Why Is Data Preprocessing Beneficial to Data Mining?

Less data
  data mining methods can learn faster
Higher accuracy
  data mining methods can generalize better
Simple results
  they are easier to understand
Fewer attributes
  for the next round of data collection, savings can be made by removing redundant and irrelevant features

Data Cleaning

Remarks on Data Cleaning

"Data cleaning is one of the biggest problems in data warehousing" (Ralph Kimball)
"Data cleaning is the number one problem in data warehousing" (DCI survey)

Why Is Data Dirty?

Incomplete, noisy, and inconsistent data are commonplace properties of large real-world databases. (p. 48)
There are many possible reasons for noisy data. (p. 48)

Types of Dirty Data / Cleaning Methods

Missing values
  Fill in missing values
Noisy data (incorrect values)
  Identify outliers and smooth out noisy data

Methods for Missing Values (1)

Ignore the tuple
Fill in the missing value manually
Use a global constant to fill in the missing value

Methods for Missing Values (2)

Use the attribute mean to fill in the missing value
Use the attribute mean for all samples belonging to the same class as the given tuple
Use the most probable value to fill in the missing value
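As a rough illustration, a minimal R sketch of the first two of these strategies on a hypothetical data frame (the column names income and class are made up for the example); the "most probable value" strategy would need a trained model and is only noted in a comment:

    # Hypothetical data: income has missing values; class is a category label
    df <- data.frame(income = c(30, 45, NA, 60, NA, 55),
                     class  = c("A", "A", "A", "B", "B", "B"))

    # Use the attribute mean to fill in the missing value
    df$income_mean <- ifelse(is.na(df$income),
                             mean(df$income, na.rm = TRUE), df$income)

    # Use the attribute mean of all samples in the same class as the tuple
    class_means <- ave(df$income, df$class,
                       FUN = function(x) mean(x, na.rm = TRUE))
    df$income_class_mean <- ifelse(is.na(df$income), class_means, df$income)

    # The "most probable value" would come from a model (e.g., regression)
    # trained on the tuples whose income is known; omitted here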

Methods for Noisy Data

Binning
Regression
Clustering

Binning
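A minimal R sketch of equal-depth binning with smoothing by bin means; the nine values are borrowed from the textbook's sorted price example, so treat them as illustrative:

    # Sorted data partitioned into equal-depth (equal-frequency) bins
    price <- sort(c(4, 8, 15, 21, 21, 24, 25, 28, 34))
    bins  <- split(price, rep(1:3, each = 3))   # 3 bins of 3 values each

    # Smoothing by bin means: replace each value by its bin's mean
    unlist(lapply(bins, function(b) rep(mean(b), length(b))))
    # bin 1: 9 9 9   bin 2: 22 22 22   bin 3: 29 29 29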

Regression

Clustering

Figure 2.12 A 2-D plot of customer data with respect to customer locations in a city, showing three data clusters. Each cluster centroid is marked with a +, representing the average point in space for that cluster. Outliers may be detected as values that fall outside of the sets of clusters.

Data Integration


Data Integration

Schema integration and object matching
  Entity identification problem
Redundant data (between attributes) often occurs when integrating multiple databases
  Redundant attributes may be detected by correlation analysis and the chi-square method

Schema Integration and Object Matching

custom_id and cust_number
  Schema conflict
H and S, and 1 and 2, for pay_type in one database
  Value conflict
Solutions
  metadata (data about data)

Detecting Redundancy (1)

If an attribute can be derived from another attribute or a set of attributes, it may be redundant

Detecting Redundancy (2)

Some redundancies can be detected by correlation analysis
  Correlation coefficient for numeric data
  Chi-square test for categorical data
These can also be used for data reduction

Chi-square Test

For categorical (discrete) data, a correlation relationship between two attributes, A and B, can be discovered by a χ² test
Given the degrees of freedom, the value of χ² is used to decide correlation based on a significance level

Chi-square Test for Categorical Data

\chi^2 = \sum \frac{(\text{Observed} - \text{Expected})^2}{\text{Expected}}
       = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(o_{ij} - e_{ij})^2}{e_{ij}}

e_{ij} = \frac{count(A = a_i) \times count(B = b_j)}{N}

(p. 68)

The larger the χ² value, the more likely the variables are related.

Chi-square Test

              fiction    non_fiction    Total
    male      250        50             300
    female    200        1000           1200
    Total     450        1050           1500

Table 2.2 A 2 x 2 contingency table for the data of Example 2.1.

Are gender and preferred_reading correlated?
The χ² statistic tests the hypothesis that gender and preferred_reading are independent. The test is based on a significance level, with (r - 1) x (c - 1) degrees of freedom.
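A quick R check of Example 2.1, sketched under the assumption that we want the uncorrected statistic from the formula above (R's chisq.test applies Yates' continuity correction to 2 x 2 tables unless correct = FALSE is passed):

    # Observed counts from Table 2.2
    observed <- matrix(c(250, 50,
                         200, 1000),
                       nrow = 2, byrow = TRUE,
                       dimnames = list(gender  = c("male", "female"),
                                       reading = c("fiction", "non_fiction")))

    # Expected counts: e_ij = (row total x column total) / N
    expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

    sum((observed - expected)^2 / expected)   # chi-square statistic: 507.93
    chisq.test(observed, correct = FALSE)     # same statistic, plus p-value

    qchisq(0.999, df = 1)   # critical value 10.83 at the 0.001 level with
                            # 1 degree of freedom; 507.93 far exceeds it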

Table of Percentage Points of the χ² Distribution

Correlation Coefficient

r_{A,B} = \frac{\sum_{i=1}^{N} (a_i - \bar{A})(b_i - \bar{B})}{N \sigma_A \sigma_B}
        = \frac{\sum_{i=1}^{N} a_i b_i - N \bar{A} \bar{B}}{N \sigma_A \sigma_B}

-1 \le r_{A,B} \le +1

(p. 68)
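A small R sketch of the formula on synthetic data. Note the slide's formula uses the population standard deviation (divide by N), while R's sd divides by N - 1, so sigma is computed directly here; the N - 1 factors cancel inside cor, which is why the two results agree:

    set.seed(1)
    a <- rnorm(100)
    b <- 0.8 * a + rnorm(100, sd = 0.5)   # correlated with a by construction
    N <- length(a)

    # Population standard deviation, as in the formula above
    sigma <- function(x) sqrt(mean((x - mean(x))^2))

    (sum(a * b) - N * mean(a) * mean(b)) / (N * sigma(a) * sigma(b))
    cor(a, b)   # built-in Pearson correlation gives the same value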

http://upload.wikimedia.org/wikipedia/commons/0/02/Correlation_examples.png


Data Transformation


Data Transformation/Consolidation

Smoothing
Aggregation
Generalization
Normalization
Attribute construction

Smoothing

Remove noise from the data
Binning, regression, and clustering

Data Normalization

Min-max normalization:

v' = \frac{v - min_A}{max_A - min_A} (new\_max_A - new\_min_A) + new\_min_A

z-score normalization:

v' = \frac{v - \bar{A}}{\sigma_A}

Normalization by decimal scaling:

v' = \frac{v}{10^j}

where j is the smallest integer such that \max(|v'|) < 1

Data Normalization

Suppose that the minimum and maximum values for the attribute income are $12,000 and $98,000, respectively. We would like to map income to the range [0.0, 1.0]. Do min-max normalization, z-score normalization, and decimal scaling for the attribute income.
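A sketch of all three normalizations for this exercise. The sample value ($73,600) and the mean and standard deviation used for the z-score ($54,000 and $16,000) are not on the slide and are assumed from the textbook's version of the example:

    v <- 73600   # a sample income value (assumed)

    # Min-max normalization to [0.0, 1.0]
    min_max <- function(v, min_a, max_a, new_min = 0, new_max = 1)
      (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min
    min_max(v, 12000, 98000)   # 0.716

    # z-score normalization (mean and sd assumed)
    (v - 54000) / 16000        # 1.225

    # Decimal scaling: j is the smallest integer with max(|v'|) < 1
    j <- ceiling(log10(max(abs(c(12000, 98000)))))   # j = 5
    v / 10^j                   # 0.736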

Attribute Construction

New attributes are constructed from the given attributes and added in order to help improve accuracy and understanding of structure in high-dimensional data
Example: add the attribute area based on the attributes height and width
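The slide's example as a one-liner in R, on a hypothetical data frame:

    rects <- data.frame(height = c(2, 3, 5), width = c(4, 4, 2))
    rects$area <- rects$height * rects$width   # constructed attribute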

Data Reduction


Data Reduction

Data reduction techniques can be applied to obtain a reduced representation of the data set that is much smaller in volume, yet closely maintains the integrity of the original data

Data Reduction

(Data Cube) Aggregation
Attribute (Subset) Selection
Dimensionality Reduction
Numerosity Reduction
Data Discretization
Concept Hierarchy Generation

The Curse of Dimensionality (1)

Size
  The size of a data set yielding the same density of data points in an n-dimensional space increases exponentially with dimensions
Radius
  A larger radius is needed to enclose a fraction of the data points in a high-dimensional space

The Curse of Dimensionality (2)

Distance
  Almost every point is closer to an edge than to another sample point in a high-dimensional space
Outlier
  Almost every point is an outlier in a high-dimensional space

Data Cube Aggregation

Summarize (aggregate) data based on dimensions
The resulting data set is smaller in volume, without loss of information necessary for the analysis task
Concept hierarchies may exist for each attribute, allowing the analysis of data at multiple levels of abstraction

Data Aggregation

Figure 2.13 Sales data for a given branch of AllElectronics for the years 2002 to 2004. On the left, the sales are shown per quarter. On the right, the data are aggregated to provide the annual sales.

Data Cube

Provides fast access to precomputed, summarized data, thereby benefiting online analytical processing as well as data mining

Data Cube - Example

Figure 2.14 A data cube for sales at AllElectronics

Attribute Subset Selection (1)

Attribute selection can help in the phases of the data mining (knowledge discovery) process
By attribute selection,
  we can improve data mining performance (speed of learning, predictive accuracy, or simplicity of rules)
  we can visualize the data for the selected model
  we can reduce dimensionality and remove noise

Attribute Subset Selection (2)

Attribute (feature) selection is a search problem
Search directions
  (Sequential) forward selection
  (Sequential) backward selection (elimination)
  Bidirectional selection
  Decision tree algorithm (induction)

Attribute Subset Selection (3)

Attribute (feature) selection is a search problem
Search strategies
  Exhaustive search
  Heuristic search
Selection criteria
  Statistical significance
  Information gain
  etc.

Attribute Subset Selection (4)

Figure 2.15 Greedy (heuristic) methods for attribute subset selection
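A minimal R sketch of stepwise forward selection. The selection criterion here (adjusted R-squared of a linear model) is an assumption for illustration; the greedy methods of Figure 2.15 work with any criterion, such as statistical significance or information gain:

    # Greedy forward selection: repeatedly add the attribute that most
    # improves the criterion, stopping when no attribute helps
    forward_select <- function(data, target) {
      remaining <- setdiff(names(data), target)
      selected  <- character(0)
      best_adj  <- -Inf
      repeat {
        scores <- sapply(remaining, function(att) {
          f <- reformulate(c(selected, att), response = target)
          summary(lm(f, data = data))$adj.r.squared
        })
        if (length(scores) == 0 || max(scores) <= best_adj) break
        best      <- names(which.max(scores))
        selected  <- c(selected, best)
        remaining <- setdiff(remaining, best)
        best_adj  <- max(scores)
      }
      selected
    }

    forward_select(mtcars, "mpg")   # e.g., picks wt first on this toy data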

Data Discretization

Reduce the number of values for a given continuous attribute by dividing the range of the attribute into intervals
Interval labels can then be used to replace actual data values
Split (top-down) vs. merge (bottom-up)
Discretization can be performed recursively on an attribute
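Replacing actual values by interval labels is a one-call operation in R; the ages and cut points below are made up:

    age <- c(13, 15, 16, 19, 21, 25, 30, 33, 40, 45, 52, 70)
    cut(age, breaks = c(0, 18, 35, 60, Inf),
        labels = c("youth", "young_adult", "middle_aged", "senior"))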

Why Is Discretization Used?

Reduce data size
Transform quantitative data into qualitative data

Interval Merge by χ² Analysis

Merging-based (bottom-up)
Merge: find the best neighboring intervals and merge them to form larger intervals recursively
ChiMerge [Kerber, AAAI 1992; see also Liu et al., DMKD 2002]

Initially, each distinct value of a numerical attribute A is considered to be one interval
χ² tests are performed for every pair of adjacent intervals
Adjacent intervals with the least χ² values are merged together, since low χ² values for a pair indicate similar class distributions
This merge process proceeds recursively until a predefined stopping criterion is met
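A compact R sketch of the ChiMerge loop. It simplifies the original: a fixed χ² threshold is assumed in place of Kerber's significance-level schedule, and no cap on the number of intervals is enforced:

    chimerge <- function(x, y, threshold = 4.6) {   # threshold assumed; pick
      breaks <- sort(unique(x))   # each distinct value starts one interval
      repeat {                                      # it from a chi-square table
        ints <- factor(findInterval(x, breaks), levels = seq_along(breaks))
        tabs <- table(ints, y)
        if (nrow(tabs) < 2) break
        # chi-square statistic for every pair of adjacent intervals
        chis <- sapply(1:(nrow(tabs) - 1), function(i) {
          o <- tabs[i:(i + 1), , drop = FALSE]
          e <- outer(rowSums(o), colSums(o)) / sum(o)
          sum((o - e)^2 / e, na.rm = TRUE)   # na.rm guards 0/0 cells
        })
        if (min(chis) >= threshold) break    # all neighbors differ enough
        breaks <- breaks[-(which.min(chis) + 1)]   # merge most similar pair
      }
      breaks   # remaining interval boundaries
    }

    # e.g., chimerge(iris$Petal.Length, iris$Species)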

Entropy-Based Discretization

The goal of this algorithm is to find the split with the maximum information gain
The boundary that minimizes the entropy over all possible boundaries is selected
The process is recursively applied to the partitions obtained until some stopping criterion is met
Such a boundary may reduce data size and improve classification accuracy

What Is Entropy?

Entropy is a measure of the uncertainty associated with a random variable
As uncertainty and/or randomness increases for a result set, so does the entropy
Values range from 0 to 1 (for a two-class problem) to represent the entropy of information

Entropy Example

Entropy Example (cont'd)

Calculating Entropy

For m classes:

Entropy(S) = -\sum_{i=1}^{m} p_i \log_2 p_i

For 2 classes:

Entropy(S) = -p_1 \log_2 p_1 - p_2 \log_2 p_2

Calculated based on the class distribution of the samples in set S:
  p_i is the probability of class i in S
  m is the number of classes (class values)

Calculating Entropy From a Split

The entropies of subsets S_1 and S_2 are calculated
The calculations are weighted by their probability of being in set S and summed
In the formula below:
  S is the set
  T is the value used to split S into S_1 and S_2

E(S, T) = \frac{|S_1|}{|S|} Entropy(S_1) + \frac{|S_2|}{|S|} Entropy(S_2)

Calculating Information Gain

Information gain = the difference in entropy between the original set (S) and the weighted split (S_1 + S_2)

Gain(S, T) = Entropy(S) - E(S, T)

Gain(S, 56) = 0.991076 - 0.766289 = 0.224788

compare to

Gain(S, 46) = 0.091091
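The three formulas above, sketched in R on made-up data. The split points 56 and 46 and the gains on the slide come from the lecture's own worked example, which is not reproduced here, so the numbers below will differ:

    entropy <- function(labels) {
      p <- table(labels) / length(labels)   # class distribution in the set
      -sum(p * log2(p))
    }

    # E(S, T): entropies of S1 and S2 weighted by their share of S
    split_entropy <- function(x, labels, t) {
      left <- x <= t
      mean(left) * entropy(labels[left]) + mean(!left) * entropy(labels[!left])
    }

    info_gain <- function(x, labels, t)
      entropy(labels) - split_entropy(x, labels, t)

    x <- c(40, 44, 48, 52, 56, 60, 64, 68)                      # made up
    y <- c("no", "no", "no", "yes", "no", "yes", "yes", "yes")  # made up
    info_gain(x, y, 56)   # candidate boundary T = 56
    info_gain(x, y, 46)   # candidate boundary T = 46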

Numeric Concept Hierarchy

A concept hierarchy for a given numerical attribute defines a discretization of the attribute
Recursively reduce the data by collecting and replacing low-level concepts with higher-level concepts

A Concept Hierarchy for the Attribute Price

Figure 2.22 A concept hierarchy for the attribute price.

Segmentation by Natural Partitioning

A simple 3-4-5 rule can be used to segment numeric data into relatively uniform, natural intervals (see the sketch after this list):
  If an interval covers 3, 6, 7, or 9 distinct values at the most significant digit, partition the range into 3 equi-width intervals
  If it covers 2, 4, or 8 distinct values at the most significant digit, partition the range into 4 intervals
  If it covers 1, 5, or 10 distinct values at the most significant digit, partition the range into 5 intervals
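A simplified R sketch of one pass of the rule, under two assumptions: low and high are already rounded at the most significant digit, and the outlier handling of Figure 2.23 (which first works on the 5th-95th percentile range) is omitted:

    partition_345 <- function(low, high) {
      msd   <- 10^floor(log10(high - low))   # most significant digit position
      n_msd <- round((high - low) / msd)     # distinct values at that digit
      k <- if (n_msd %in% c(3, 6, 7, 9)) 3        # 3 equi-width intervals
           else if (n_msd %in% c(2, 4, 8)) 4      # 4 intervals
           else 5                                 # 1, 5, or 10: 5 intervals
      seq(low, high, length.out = k + 1)     # interval boundaries
    }

    partition_345(-1000000, 2000000)   # 3 msd values -> -1M, 0, 1M, 2M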

Figure 2.23 Automatic generation of a concept hierarchy for profit based on the 3-4-5 rule.

Concept Hierarchy Generation for Categorical Data

Specification of a partial ordering of attributes explicitly at the schema level by users or experts
Specification of a portion of a hierarchy by explicit data grouping
Specification of a set of attributes, but not of their partial ordering

Automatic Concept Hierarchy Generation

country              15 distinct values
province or state    365 distinct values
city                 3,567 distinct values
street               674,339 distinct values

Based on the number of distinct values per attribute (p. 95)
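This heuristic is easy to sketch in R: count distinct values per attribute and sort ascending; the attribute with the fewest distinct values goes to the top of the hierarchy (the addresses data frame is hypothetical):

    rank_hierarchy <- function(df)
      sort(sapply(df, function(col) length(unique(col))))

    # e.g., rank_hierarchy(addresses) would return counts such as
    #   country 15, province_or_state 365, city 3567, street 674339
    # matching the ordering on this slide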

Data preprocessing
  Data cleaning
    Missing values
      Use the most probable value to fill in the missing value (and five other methods)
    Noisy data
      Binning; regression; clustering
  Data integration
    Entity ID problem
      Metadata
    Redundancy
      Correlation analysis (correlation coefficient, chi-square test)
  Data transformation
    Smoothing
      Data cleaning
    Aggregation
      Data reduction
    Generalization
      Data reduction
    Normalization
      Min-max; z-score; decimal scaling
    Attribute construction
      Data reduction
  Data reduction
    Data cube aggregation
      Data cubes store multidimensional aggregated information
    Attribute subset selection
      Stepwise forward selection; stepwise backward selection; combination; decision tree induction
    Dimensionality reduction
      Discrete wavelet transforms (DWT); principal components analysis (PCA)
    Numerosity reduction
      Regression and log-linear models; histograms; clustering; sampling
    Data discretization
      Binning; histogram analysis; entropy-based discretization; interval merging by chi-square analysis; cluster analysis; intuitive partitioning
    Concept hierarchy
      Concept hierarchy generation
