Introduction to Datamining
Anto Satriyo Nugroho, Dr.Eng
Center for Information & Communication Technology,
Agency for the Assessment & Application of Technology (PTIK-BPPT)
Email: anto.satriyo@bppt.go.id, asnugroho@ieee.org
URL: http://asnugroho.net
Agenda
•  What is Datamining?
•  Datamining Techniques
•  Examples of Datamining Applications
•  Tutorial on Using the Datamining Software “WEKA”
•  Further Readings
Why Mine Data? Commercial Viewpoint
•  Lots of data is being collected and warehoused
–  Web data, e-commerce
–  Purchases at department/grocery stores
–  Bank/credit card transactions
•  Computers have become cheaper and more powerful
•  Competitive pressure is strong
–  Provide better, customized services for an edge (e.g. in Customer Relationship Management)
Slide source: Tan, Steinbach, Kumar, Introduction to Datamining, Pearson Int’l Edition
Why Mine Data? Scientific Viewpoint
•  Data is collected and stored at enormous speeds (GB/hour)
–  remote sensors on a satellite
–  telescopes scanning the skies
–  microarrays generating gene expression data
–  scientific simulations generating terabytes of data
•  Traditional techniques are infeasible for raw data
•  Data mining may help scientists
–  in classifying and segmenting data
–  in hypothesis formation
Large Scale Data: Sky Survey Cataloging
–  Goal: To predict the class (star or galaxy) of sky objects, especially visually faint ones, based on the telescopic survey images (from Palomar Observatory).
–  3000 images with 23,040 x 23,040 pixels per image.
–  Approach:
•  Segment the image.
•  Measure image attributes (features), 40 of them per object.
•  Model the class based on these features.
•  Success Story: Could find 16 new high red-shift quasars, some of the farthest objects that are difficult to find!
http://www.ncbi.nlm.nih.gov/Genbank/genbankstats.html
Microarray
•  Measuring the expression of genes
•  Possible to obtain the expression of thousands of genes
•  Disease classification
http://cmgm.stanford.edu/pbrown/array.html
Definition of Datamining
•  Definition: the automatic (or semiautomatic) process of discovering meaningful patterns in data
•  Extracting
–  implicit
–  previously unknown
–  potentially useful
information from data

The Datamining Process
Datamining Tasks
•  Prediction: use some variables to predict unknown or future values of other variables
•  Description: find human-interpretable patterns that describe the data
Datamining Tasks
•  Prediction
–  classification
–  regression
–  deviation detection
•  Description
–  clustering
–  association rule discovery
–  sequential pattern discovery
Datamining Techniques
•  Prediction
–  classification: Decision Tree, Rule-based, Bayesian, Artificial Neural Network, Support Vector Machine
–  regression: Linear Regression, Regression Tree, Artificial Neural Network, Support Vector Machine
–  deviation detection
•  Description
–  clustering: K-means Clustering, Self Organizing Map, Agglomerative Hierarchical Clustering, DBSCAN
Nearest Neighbor Classifier
Rule: find the most similar pattern in the training set, then assign the class of that pattern to the test data.
[Figure: test pattern X plotted among patterns of Class A and Class B; the nearest pattern belongs to Class A, so the class of X is A.]
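The rule above can be sketched in a few lines of Python. This is a minimal illustration, not the method of any particular tool: the Euclidean distance, the function name `nearest_neighbor_classify`, and the four training patterns are all assumptions made for the example.

```python
import math


def nearest_neighbor_classify(training_set, x):
    """Return the class of the training pattern closest to x.

    training_set is a list of (features, label) pairs. Similarity is
    measured with Euclidean distance (an assumption of this sketch).
    """
    _, nearest_label = min(
        training_set, key=lambda pair: math.dist(pair[0], x)
    )
    return nearest_label


# Hypothetical training patterns for Class A and Class B.
training_set = [
    ((1.0, 1.0), "A"),
    ((1.5, 2.0), "A"),
    ((5.0, 5.0), "B"),
    ((6.0, 4.5), "B"),
]

x = (2.0, 1.5)  # test pattern X
predicted = nearest_neighbor_classify(training_set, x)
print(predicted)  # the nearest pattern is (1.5, 2.0), so this prints A
```

Every test pattern is compared with every training pattern, so the cost of one prediction grows linearly with the size of the training set.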
Association Rules & Basket Analysis
Association Rules
(Rakesh Agrawal@IBM Almaden Research Center)
Cash register data:
“Customers who bought A and B have a high probability of buying the expensive product C”
A, B ⇒ C
Marketing Strategy:
•  Sell A, B and C as one set
•  Place A, B and C in one corner
•  Etc.
Association Rules
(Rakesh Agrawal@IBM Almaden Research Center)
$I = \{i_1, i_2, \ldots, i_m\}$: items (products)
$D$: database of transactions
Rule $X \Rightarrow Y$, where $X \subseteq I$, $Y \subseteq I$, and $X \cap Y = \emptyset$
Confidence & Support
$X \Rightarrow Y$ ($X$: antecedent, $Y$: consequent)
Support s%: the ratio of the number of transactions containing $X \cup Y$ to the total number of transactions
Confidence c%: the ratio of the number of transactions containing $X \cup Y$ to the number of transactions containing $X$
TID   Items
001   Beer, coca cola, diapers
002   Beer, diapers
003   Beer, flour
004   Butter, egg, flour

Association Rule      Support   Confidence
beer ⇒ diapers        50%       67%
beer ⇒ coca cola      25%       33%
butter ⇒ flour        25%       100%
beer ⇒ flour          25%       33%
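The support and confidence values in the table can be recomputed from the four transactions. The following is a minimal sketch: the function names `support` and `confidence` and the dictionary layout are assumptions for illustration, and only the transaction data comes from the slide.

```python
# The four cash register transactions from the slide, keyed by TID.
transactions = {
    "001": {"beer", "coca cola", "diapers"},
    "002": {"beer", "diapers"},
    "003": {"beer", "flour"},
    "004": {"butter", "egg", "flour"},
}


def support(antecedent, consequent, transactions):
    """Fraction of all transactions that contain every item of X and Y."""
    both = set(antecedent) | set(consequent)
    hits = sum(1 for items in transactions.values() if both <= items)
    return hits / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Among transactions containing X, the fraction that also contain Y."""
    x = set(antecedent)
    both = x | set(consequent)
    n_x = sum(1 for items in transactions.values() if x <= items)
    n_both = sum(1 for items in transactions.values() if both <= items)
    return n_both / n_x if n_x else 0.0


rules = [
    ({"beer"}, {"diapers"}),
    ({"beer"}, {"coca cola"}),
    ({"butter"}, {"flour"}),
    ({"beer"}, {"flour"}),
]
for x, y in rules:
    s = support(x, y, transactions)
    c = confidence(x, y, transactions)
    print(f"{sorted(x)} => {sorted(y)}: support {s:.0%}, confidence {c:.0%}")
```

For beer ⇒ diapers, two of the four transactions contain both items (support 50%), and two of the three transactions containing beer also contain diapers (confidence 67%).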
•  Items: m → the number of association rules is $\sum_{k=2}^{m} \binom{m}{k} (2^k - 2)$
•  m: 10 → about 57,000 rules; m: 100 → about $5.15 \times 10^{47}$ rules
•  A large number of rules are generated, but only a few of them are really useful
•  Useful rules:
–  high scores for both support & confidence
–  a low support score means the rule is applicable to only a few cases
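The growth in the number of possible rules can be checked numerically. This sketch evaluates the sum over k = 2..m of C(m, k)(2^k − 2): choose k items, then split them into a non-empty antecedent and a non-empty consequent. The function name `count_rules` is an assumption for illustration.

```python
from math import comb


def count_rules(m):
    """Number of possible association rules over m items.

    For each itemset of size k >= 2 there are 2**k - 2 ways to split it
    into a non-empty antecedent and a non-empty consequent.
    """
    return sum(comb(m, k) * (2**k - 2) for k in range(2, m + 1))


print(count_rules(10))            # 57002
print(f"{count_rules(100):.2e}")  # 5.15e+47
```

The sum has the closed form 3^m − 2^(m+1) + 1, which explains the explosive growth and why exhaustive enumeration is impractical even for modest m.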
Artificial Neural Networks
A mathematical model of information processing in the human brain
[Figure: a neuron with input signals $x_1, x_2, \ldots, x_n$, weights $w_1, w_2, \ldots, w_n$, activation function $f$, and output $y$]
$$y = f\left(\sum_{i=1}^{n} x_i w_i\right)$$
w = synapses (weights)
f = activation function
McCulloch-Pitts model (1943)
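The neuron equation y = f(Σ x_i × w_i) translates directly into code. In this minimal sketch the step activation, its threshold of 1.5, and the weights (1, 1) are hypothetical choices that make the neuron behave like a logical AND; none of them come from the slides.

```python
def step(u, threshold):
    """Hypothetical step activation: output 1 when u reaches the threshold."""
    return 1 if u >= threshold else 0


def neuron_output(inputs, weights, activation):
    """Compute y = f(sum_i x_i * w_i) for a single neuron."""
    if len(inputs) != len(weights):
        raise ValueError("inputs and weights must have the same length")
    weighted_sum = sum(x * w for x, w in zip(inputs, weights))
    return activation(weighted_sum)


# Assumed weights and threshold: the neuron fires only when both inputs are 1.
weights = (1.0, 1.0)


def and_activation(u):
    return step(u, threshold=1.5)


for inputs in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(inputs, neuron_output(inputs, weights, and_activation))
```

Only the input (1, 1) gives a weighted sum of 2.0, which exceeds the threshold, so only that row prints 1.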

Artificial Neural Network
Two aspects:
- architecture: how the neurons are connected
- training algorithm: the algorithm that adjusts the synapses so that the ANN performs the desired input-output mapping
Example:
- architecture: multilayer perceptron
- training algorithm: backpropagation algorithm (popularized by Rumelhart, Hinton and Williams, 1986)
[Figure: multilayer perceptron with Input Layer, Hidden Layer and Output Layer connected by weights w; information flows from input to output]
Decrease of error during the training phase of neural networks = “knowledge” acquisition
Support Vector Machines
•  Invented by Vapnik (1992)
•  SVM satisfies three conditions for an ideal pattern recognition method
–  Robustness
–  Theoretical analysis
–  Feasibility
•  In principle, SVM works as a binary classifier
•  Structural Risk Minimization
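Binary Classification
[Figure: several possible discrimination boundaries between Class -1 and Class +1]
Optimal Hyperplane by SVM
[Figure: the hyperplane separating Class -1 and Class +1 with the maximum margin]
Non Linear Classification in SVM
[Figure: the mapping Φ takes a pattern X in the input space to Φ(X) in a high-dimensional feature space, where a hyperplane separates the classes]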
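The idea of mapping patterns with Φ so that a hyperplane can separate them is illustrated below. This sketch does not train an SVM and does not maximize any margin: the one-dimensional data, the mapping Φ(x) = (x, x²), and the hyperplane parameters `w` and `b` are all hand-picked assumptions that only show why the feature space helps.

```python
# Hypothetical 1-D patterns: no single threshold on x separates the classes,
# because Class +1 lies on both sides of Class -1.
patterns = [(-2.0, +1), (-0.5, -1), (0.5, -1), (2.0, +1)]


def phi(x):
    """Assumed feature map from the 1-D input space to a 2-D feature space."""
    return (x, x * x)


# Hand-picked hyperplane in feature space: w . phi(x) + b = x**2 - 1 = 0.
w = (0.0, 1.0)
b = -1.0


def classify(x):
    """Return +1 or -1 depending on the side of the hyperplane phi(x) falls on."""
    features = phi(x)
    score = sum(wi * fi for wi, fi in zip(w, features)) + b
    return 1 if score >= 0 else -1


for x, label in patterns:
    print(x, phi(x), "true:", label, "predicted:", classify(x))
```

In the feature space the second coordinate x² is 4.0 for Class +1 and 0.25 for Class -1, so the line x² = 1 separates the two classes even though no threshold on x alone can.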

Application of Datamining
•  Fog forecasting
•  Bioinformatics
•  Sky survey Cataloging (Fayyad et al.)
•  Spatio-Temporal Analysis of Disease Spreading using
Webmining
•  Foreign Exchange Rate Prediction
•  Network Intrusion Detection
•  Etc.
Sky Survey Cataloging
–  Goal: To predict the class (star or galaxy) of sky objects, especially visually faint ones, based on the telescopic survey images (from Palomar Observatory).
–  3000 images with 23,040 x 23,040 pixels per image.
–  Approach:
•  Segment the image.
•  Measure image attributes (features), 40 of them per object.
•  Model the class based on these features.
•  Success Story: Could find 16 new high red-shift quasars, some of the farthest objects that are difficult to find!
From [Fayyad, et.al.] Advances in Knowledge Discovery and Data Mining, 1996
Classifying Galaxies
Courtesy: http://aps.umn.edu
Class:
•  Stages of Formation (Early, Intermediate, Late)
Attributes:
•  Image features
•  Characteristics of light waves received, etc.
Data Size:
•  72 million stars, 20 million galaxies
•  Object Catalog: 9 GB
•  Image Database: 150 GB
Further Readings
•  Datamining books, among others:
•  Pang Ning Tan, Michael Steinbach, Vipin Kumar: Introduction to Datamining, Addison Wesley, 2006
•  Budi Santosa: Datamining: Teknik Pemanfaatan Data untuk Keperluan Bisnis, Graha Ilmu, 2007
•  Tutorial on datamining by Dr. Iko Pramudiono (NTT, Japan) http://datamining.japati.net/
•  Shinichi Morishita (winner of KDD 2001): Datamining, Knowledge Discovery and Bioinformatics, http://asnugroho.wordpress.com/2006/02/05/datamining-knowledge-discovery-bioinformatics-terjemahan-artikel-prof-shinichi-morishita/ (password: gomibako)
•  AS. Nugroho: Datamining dalam Bioinformatika: menggali informasi terpendam dalam lautan data biologi, SDA Asia Magazine, No.13, pp. 64-66, March 2006 http://asnugroho.wordpress.com/2006/02/06/peran-datamining-dalam-bioinformatika/
