Datamining
Slide source: Tan, Steinbach, Kumar, Introduction to Datamining, Pearson Int’l Edition
Why Mine Data? Scientific Viewpoint
• Data collected and stored at
enormous speeds (GB/hour)
– remote sensors on a satellite
– telescopes scanning the skies
– microarrays generating gene
expression data
– scientific simulations
generating terabytes of data
• Traditional techniques infeasible for raw data
• Data mining may help scientists
– in classifying and segmenting data
– in hypothesis formation
Large Scale Data: Sky Survey Cataloging
– Goal: To predict class (star or galaxy) of sky objects, especially visually faint ones, based on the telescopic survey images (from Palomar Observatory).
– 3000 images with 23,040 x 23,040 pixels per image.
– Approach:
• Segment the image.
• Measure image attributes (features) - 40 of them per object.
• Model the class based on these features.
• Success Story: Could find 16 new high red-shift quasars, some of the farthest objects that are difficult to find!
http://www.ncbi.nlm.nih.gov/Genbank/genbankstats.html
Microarray
http://cmgm.stanford.edu/pbrown/array.html
Definition of Datamining
• Description
find human-interpretable patterns that describe the data
Datamining Tasks
• Prediction
classification
regression
deviation detection
• Description
clustering
association rule discovery
sequential pattern discovery
Datamining Techniques
• Prediction
classification: Decision Tree, Rule-based, Bayesian, Artificial Neural Network, Support Vector Machine
regression
deviation detection
• Description
clustering
association rule discovery
sequential pattern discovery
Datamining Techniques
• Prediction
classification
regression: Linear Regression, Regression Tree, Artificial Neural Network, Support Vector Machine
deviation detection
• Description
clustering
association rule discovery
sequential pattern discovery
Datamining Techniques
• Prediction
classification
regression
deviation detection
• Description
clustering: K-means Clustering, Self Organizing Map, Agglomerative Hierarchical Clustering, DBSCAN
association rule discovery
sequential pattern discovery
Nearest Neighbor Classifier
Rule: find the most similar pattern in the training set, then assign the class of that pattern to the test pattern.
[Figure: test pattern X among training patterns of Class A and Class B. The pattern nearest to X belongs to Class A, so the class of X is A.]
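The rule above can be sketched in a few lines of Python. This is a minimal illustration, not the slides' implementation: it assumes Euclidean distance as the similarity measure, and the training points and the names `train` and `nearest_neighbor_class` are hypothetical.

```python
import math

def nearest_neighbor_class(train, x):
    """Return the class label of the training pattern closest to x.

    train: list of (feature_vector, label) pairs.
    Similarity is measured by Euclidean distance (an assumption).
    """
    _, label = min(train, key=lambda item: math.dist(item[0], x))
    return label

# Hypothetical training set with two classes, A and B.
train = [
    ((1.0, 1.0), "A"),
    ((1.5, 2.0), "A"),
    ((5.0, 5.0), "B"),
    ((6.0, 4.5), "B"),
]

x = (2.0, 1.5)  # test pattern X
predicted = nearest_neighbor_class(train, x)
print(predicted)  # the nearest training pattern is (1.5, 2.0), class A
```

Every prediction scans the whole training set, so the cost of one classification grows linearly with the number of training patterns.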
Association Rules & Basket Analysis
Association Rules
(Rakesh Agrawal@IBM Almaden Research Center)
Marketing Strategy:
• Sell A, B and C as one set
• Place A, B and C in one corner
• Etc.
A, B ⇒ C
Association Rules
(Rakesh Agrawal@IBM Almaden Research Center)
I: the set of all items
X ⊆ I, Y ⊆ I, X ∩ Y = ∅
Rule: X ⇒ Y
Confidence & Supports
X ⇒ Y (X: antecedent, Y: consequent)

TID   Items
001   Beer, coca cola, diapers
002   Beer, diapers
003   Beer, flour
004   Butter, egg, flour

Rule                Support   Confidence
beer ⇒ diapers      50%       67%
beer ⇒ coca cola    25%       33%
butter ⇒ flour      25%       100%
beer ⇒ flour        25%       33%
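The figures for the four transactions above can be reproduced with a short Python sketch. It assumes the standard definitions: support is the fraction of all transactions that contain both X and Y, and confidence is the fraction of transactions containing X that also contain Y. The function names are illustrative.

```python
# The four transactions (TID 001 to 004), items in lower case.
transactions = [
    {"beer", "coca cola", "diapers"},
    {"beer", "diapers"},
    {"beer", "flour"},
    {"butter", "egg", "flour"},
]

def count_containing(itemset):
    """Number of transactions that contain every item in itemset."""
    return sum(itemset <= t for t in transactions)

def support(antecedent, consequent):
    """Fraction of all transactions containing antecedent and consequent."""
    return count_containing(antecedent | consequent) / len(transactions)

def confidence(antecedent, consequent):
    """Fraction of transactions with the antecedent that also hold the consequent."""
    return count_containing(antecedent | consequent) / count_containing(antecedent)

rules = [
    ({"beer"}, {"diapers"}),
    ({"beer"}, {"coca cola"}),
    ({"butter"}, {"flour"}),
    ({"beer"}, {"flour"}),
]
for x, y in rules:
    print(sorted(x), "=>", sorted(y),
          f"support={support(x, y):.0%}", f"confidence={confidence(x, y):.0%}")
```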
Confidence & Supports
• Items: m → the number of association rules:
$$\sum_{k=2}^{m} \binom{m}{k} \left(2^{k} - 2\right)$$
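The count can be checked by brute force for small m. In the sketch below, each itemset of k items is split into a non-empty antecedent and a non-empty consequent, which can be done in 2^k − 2 ways. The closed form 3^m − 2^(m+1) + 1 is not on the slide; it follows from the binomial theorem and is included only as a cross-check. Function names are illustrative.

```python
from itertools import combinations
from math import comb

def rule_count_formula(m):
    """Sum over itemset sizes k of C(m, k) * (2^k - 2)."""
    return sum(comb(m, k) * (2**k - 2) for k in range(2, m + 1))

def rule_count_enumerated(m):
    """Count rules X => Y by listing every itemset and every split of it."""
    count = 0
    for k in range(2, m + 1):
        for itemset in combinations(range(m), k):
            # Any non-empty proper subset can be the antecedent;
            # the remaining items form the consequent.
            for r in range(1, k):
                for _antecedent in combinations(itemset, r):
                    count += 1
    return count

def rule_count_closed_form(m):
    """Closed form derived from the binomial theorem (cross-check only)."""
    return 3**m - 2**(m + 1) + 1

for m in range(2, 7):
    print(m, rule_count_formula(m), rule_count_enumerated(m), rule_count_closed_form(m))
```

The count grows roughly as 3^m, which is why association rule mining relies on support and confidence thresholds to prune candidates instead of enumerating every rule.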
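[Figure: artificial neuron. The input signals x1, x2, x3, ..., xn are multiplied by the weights w1, w2, w3, ..., wn, summed, and passed through f to give the output y.]
$$y = f\left(\sum_{i=1}^{n} x_i \times w_i\right)$$
w = synapses
f = Activation Function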
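The neuron's output can be computed directly from this formula. The sketch below assumes a sigmoid activation function, which the slide does not specify, and uses hypothetical input and weight values.

```python
import math

def sigmoid(s):
    """Logistic activation function (an assumed choice for f)."""
    return 1.0 / (1.0 + math.exp(-s))

def neuron_output(x, w, f=sigmoid):
    """Compute y = f(sum_i x_i * w_i) for inputs x and synapse weights w."""
    if len(x) != len(w):
        raise ValueError("x and w must have the same length")
    weighted_sum = sum(xi * wi for xi, wi in zip(x, w))
    return f(weighted_sum)

# Hypothetical input signals and weights.
x = [1.0, 0.5, -1.0]
w = [0.2, 0.4, 0.4]
y = neuron_output(x, w)
print(y)  # the weighted sum is 0.2 + 0.2 - 0.4 = 0, and sigmoid(0) = 0.5
```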
Two aspects:
- architecture: how the neurons are connected
- training algorithm: algorithm to adjust the synapses so that the ANN performs the desired input-output mapping
Artificial Neural Network
two aspects:
- architecture : multilayer perceptron
- training algorithm: backpropagation algorithm
(popularized by Rumelhart, Hinton and Williams, 1986)
Optimal Hyperplane by SVM
[Figure: hyperplane separating Class -1 from Class +1, with the margin between the two classes marked.]
Non Linear Classification in SVM
[Figure: the mapping Φ takes each input X to Φ(X), where the two classes can be separated by a hyperplane.]
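A toy example of the idea, under loud assumptions: the one-dimensional data, the mapping Φ(x) = (x, x²), and the hand-picked hyperplane below are all hypothetical and chosen for illustration. No SVM training is performed here. The sketch only shows that classes which no single threshold on x can separate become linearly separable after the mapping.

```python
def phi(x):
    """Assumed feature map from 1-D input space to 2-D feature space."""
    return (x, x * x)

# Hypothetical data: class +1 lies on both sides of class -1,
# so no single threshold on x separates the two classes.
points = [(-2.0, 1), (-1.5, 1), (-0.5, -1), (0.0, -1), (0.5, -1), (1.5, 1), (2.0, 1)]

def linear_decision(z, w=(0.0, 1.0), b=-1.0):
    """Classify by the side of the hyperplane w . z + b = 0 (hand-picked, not trained)."""
    score = w[0] * z[0] + w[1] * z[1] + b
    return 1 if score > 0 else -1

predictions = [linear_decision(phi(x)) for x, _ in points]
labels = [label for _, label in points]
print(predictions == labels)  # the hyperplane x^2 = 1 separates the mapped points
```

In practice an SVM would find the separating hyperplane by maximizing the margin, and the mapping is usually applied implicitly through a kernel function.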
From [Fayyad et al.], Advances in Knowledge Discovery and Data Mining, 1996
Classifying Galaxies
Courtesy: http://aps.umn.edu
Data Size:
• 72 million stars, 20 million galaxies
• Object Catalog: 9 GB
• Image Database: 150 GB
Further Readings
• Datamining books, among others:
• Pang-Ning Tan, Michael Steinbach, Vipin Kumar: Introduction to Datamining, Addison Wesley, 2006
• Budi Santosa: Datamining: Teknik Pemanfaatan Data untuk Keperluan Bisnis (Datamining: Techniques for Using Data for Business Purposes), Graha Ilmu, 2007
• Tutorial on datamining by Dr. Iko Pramudiono (NTT, Japan): http://datamining.japati.net/
• Datamining, Knowledge Discovery and Bioinformatics, Shinichi Morishita (winner of KDD 2001): http://asnugroho.wordpress.com/2006/02/05/datamining-knowledge-discovery-bioinformatics-terjemahan-artikel-prof-shinichi-morishita/ (password: gomibako)
• A.S. Nugroho: Datamining dalam Bioinformatika: menggali informasi terpendam dalam lautan data biologi (Datamining in Bioinformatics: digging up information buried in a sea of biological data), SDA Asia Magazine, No. 13, pp. 64-66, March 2006, http://asnugroho.wordpress.com/2006/02/06/peran-datamining-dalam-bioinformatika/