Analytics
By
Dr. Atanu Rakshit
Email: atanu.rakshit@iimrohtak.ac.in
atanu.raks@gmail.com
Business Analytics
Text Book:
Business Intelligence: A Managerial Approach by Efraim Turban, Ramesh Sharda, Dursun Delen and David King, 2/e, Pearson, 2012
Reference Material:
Business Analytics for Managers by Gert H. N. Laursen and Jesper Thorlund, Wiley, 2010
Business Analytics
Reference Material:
Decision Support and Business Intelligence
Systems by Efraim Turban, Ramesh Sharda and
Dursun Delen, 9/e, Pearson, 2012
Business Intelligence Strategy: A Practical Guide for Achieving BI Excellence by John Boyer, Bill Frank, Brian Green and Tracy Harris, MC Press, 2010
Business Analytics
Sessions Plan
Introduction to Business Analytics
Data Warehousing
Data Mining for Business Intelligence
Business Analytics Model
Business Analytics at the Analytical Level
Business Analytics at the Strategic Level
Business Analytics at the Functional Level
Business Performance Management
Big Data Analytics
Project Presentation
Business Analytics
Introduction to Data Mining
Learning Objectives
Define data mining as an enabling technology for business intelligence
Understand the objectives and benefits of business analytics and data mining
Recognize the wide range of applications of data mining
Learn the standardized data mining processes
CRISP-DM
SEMMA
KDD
Learning Objectives
Understand the steps involved in data preprocessing for data mining
Learn different methods and algorithms of data mining
Build awareness of the existing data mining software tools
Commercial versus free/open source
Opening Vignette
Data Mining Goes to Hollywood!
Decision situation
Problem
Proposed solution
Results
Answer & discuss the case questions
Opening Vignette:
Data Mining Goes to Hollywood!
A typical classification problem: predict the box-office gross class of a movie.
Dependent variable: box-office gross, binned into nine classes (range in $ millions):
Class 1 (Flop): < 1
Class 2: > 1, < 10
Class 3: > 10, < 20
Class 4: > 20, < 40
Class 5: > 40, < 65
Class 6: > 65, < 100
Class 7: > 100, < 150
Class 8: > 150, < 200
Class 9 (Blockbuster): > 200
Independent variables include: MPAA Rating (G, PG, PG-13, R, NR), Competition, Star value, Genre (10 possible values), Special effects, Sequel (Yes, No), and Number of screens (a positive integer).
Opening Vignette:
Data Mining Goes to Hollywood!
[Figure: the DM process map in IBM SPSS Modeler, showing the model development process and the model assessment process]
Opening Vignette:
Data Mining Goes to Hollywood!
Prediction Models

                     | Individual Models        | Ensemble Models
Performance Measure  | SVM    | ANN    | C&RT   | Random Forest | Boosted Tree | Fusion (Average)
Count (Bingo)        | 192    | 182    | 140    | 189           | 187          | 194
Count (1-Away)       | 104    | 120    | 126    | 121           | 104          | 120
Accuracy (% Bingo)   | 55.49% | 52.60% | 40.46% | 54.62%        | 54.05%       | 56.07%
Accuracy (% 1-Away)  | 85.55% | 87.28% | 76.88% | 89.60%        | 84.10%       | 90.75%
Standard deviation   | 0.93   | 0.87   | 1.05   | 0.76          | 0.84         | 0.63
[Figure: DATA MINING at the intersection of several disciplines: Artificial Intelligence, Statistics, Pattern Recognition, Mathematical Modeling, Machine Learning, and Databases]
- DM with different data types?
Categorical
  Nominal
  Ordinal
Numerical
  Interval
  Ratio
Similarity Measures
Hierarchical Clustering
IR Systems
Imprecise Queries
Textual Data
Web Search Engines
Bayes Theorem
Regression Analysis
EM Algorithm
K-Means Clustering
Time Series Analysis
Neural Networks
Decision Tree Algorithms
Types of patterns
Association
Prediction
Cluster (segmentation)
Sequential (or time series) relationships
Data Mining Task  | Learning Method | Popular Algorithms
Classification    | Supervised      |
Regression        | Supervised      |
Prediction        | Supervised      |
Association       | Unsupervised    |
Link analysis     | Unsupervised    |
Sequence analysis | Unsupervised    |
Clustering        | Unsupervised    | K-means, ANN/SOM
Outlier analysis  | Unsupervised    |
Visualization
Another data mining task?
Types of DM
Hypothesis-driven data mining
Discovery-driven data mining
Hypothesis vs Discovery
Traditional analysis is verification-driven
Requires a hypothesis of the desired information (target)
Requires correct interpretation of the proposed query
Insurance
Forecast claim costs for better business planning
Determine optimal rate plans
Optimize marketing to specific customers
Identify and prevent fraudulent claim activities
An Example: Regression Analysis
Regression analysis is a tool which involves building a predictive model to relate a predictor variable, X, to a response variable, Y, through a relationship of the form Y = aX + b. For example, one might build a model which would allow us to predict a person's annual credit-card spending given their annual income. Clearly the model would not be perfect, but since spending typically increases with income, the model might well be adequate as a rough characterization.
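A minimal sketch of fitting Y = aX + b by ordinary least squares. The income/spending figures below are hypothetical, purely for illustration:

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a*x + b (single predictor)."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    # slope a = cov(x, y) / var(x); intercept b = mean(y) - a * mean(x)
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    b = my - a * mx
    return a, b

# Hypothetical annual income (k$) vs. annual credit-card spending (k$)
income = [30, 45, 60, 75, 90]
spending = [2.1, 3.0, 4.2, 4.8, 6.1]
a, b = fit_line(income, spending)
```

The fitted line is then used for prediction as Y = a * X + b for a new income X.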
Descriptive Modeling
The goal of a descriptive model is to describe all of the data (or the process generating the data)
Examples of such descriptions include models for the overall probability distribution of the data (density estimation), partitioning of the p-dimensional space into groups (cluster analysis, segmentation) and models describing the relationship between variables (dependency modeling)
Predictive Modeling
The aim is to build a model that will permit the value of one variable to be predicted from the known values of other variables
In classification, the variable being predicted is categorical, while in regression the variable is quantitative
Basics
Given a set of transactions {T}, each containing a subset of items from an item set {i1, i2, ..., im}, discover association relationships or correlations among sets of items.
We want to find groups of items that tend to occur together.
Association rules are often written as X => Y, meaning that whenever X appears, Y also tends to appear. X and Y may be single items or sets of items (with the same item not appearing in both).
Market-Basket Model
Large Sets
Items A = {A1, A2, ..., Am}
Example
Items A = {milk, coke, pepsi, beer, juice}
Baskets
B1 = {m, c, b}    B2 = {m, p, j}
B3 = {m, b}       B4 = {c, j}
B5 = {m, p, b}    B6 = {m, c, b, j}
B7 = {c, b, j}    B8 = {b, c}
Association Rules
If-then rules about basket contents
{A1, A2, ..., Ak} => Aj
If a basket has X = {A1, ..., Ak}, then it is likely to have Aj
Confidence of the rule = sup(X + Aj) / sup(X)
Example
B1 = {m, c, b}    B2 = {m, p, j}
B3 = {m, b}       B4 = {c, j}
B5 = {m, p, b}    B6 = {m, c, b, j}
B7 = {c, b, j}    B8 = {b, c}
Association Rule
{m, b} => c
Support = 2 (baskets B1 and B6 contain {m, b, c})
Confidence = 2/4 = 50% ({m, b} appears in the 4 baskets B1, B3, B5, B6)
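These counts can be checked mechanically; a small sketch over the eight baskets:

```python
baskets = [
    {"m", "c", "b"}, {"m", "p", "j"}, {"m", "b"}, {"c", "j"},
    {"m", "p", "b"}, {"m", "c", "b", "j"}, {"c", "b", "j"}, {"b", "c"},
]

def support_count(itemset):
    """Number of baskets containing every item of the itemset."""
    return sum(1 for b in baskets if itemset <= b)

# Rule {m, b} => c
support = support_count({"m", "b", "c"})          # baskets with m, b and c together
confidence = support / support_count({"m", "b"})  # fraction of {m, b}-baskets that also have c
```

This reproduces Support = 2 and Confidence = 0.5 for the rule above.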
Applications
Marketing
Cross Marketing
Attached Mailing
Catalogue Design
Cross Sell
Up Sell
Store Layout
Promotion (Segmentation)
Applications
Medicine
A patient submits their symptoms and undergoes multiple tests; the doctor diagnoses the problem and prescribes medicines
Medical conditions => recommend a minimal set of tests
Computer-Aided Detection
Applications
Sports
Successful movement patterns
Preferred skill set based on the total environmental condition
Agriculture
Remotely sensed imagery data of a field, to associate attributes of a location with the crop yield at that location
Predicting pest attacks
Weather conditions with the crop yield
Usage of various fertilizers and pesticides with crop yield
Scenario 2
items = sentences
baskets = documents containing sentences
frequent sentence-groups = possible plagiarism
Scenario 3
baskets = web pages
items = incoming links
pages with similar in-links => mirrors, or same topic
Terminology
We assume that we have a set of transactions,
each transaction being a list of items (e.g. books)
Suppose X and Y appear together in only 1% of
the transactions but whenever X appears there
is 80% chance that Y also appears
The 1% presence of X and Y together is called
the support (or prevalence) of the rule and 80%
is called the confidence (or predictability) of the
rule
These are measures of interestingness of the
rule
Terminology
The support for X => Y is the probability of both X and Y appearing together, that is P(X U Y)
The confidence of X => Y is the conditional probability of Y appearing given that X exists, that is:
P(Y | X) = P(X U Y) / P(X)
Confidence denotes the strength of the association. Support indicates the frequency of the pattern. A minimum support is necessary if an association is going to be of some business value.
TID | Items
1   | Bread, Milk
2   | Bread, Diaper, Beer, Eggs
3   | Milk, Diaper, Beer, Coke
4   | Bread, Milk, Diaper, Beer
5   | Bread, Milk, Diaper, Coke
k-itemset
An itemset that contains k items
Support count (σ)
Frequency of occurrence of an itemset
E.g. σ({Milk, Bread, Diaper}) = 2
Support (s)
Fraction of transactions that contain an itemset
E.g. s({Milk, Bread, Diaper}) = 2/5
Frequent Itemset
An itemset whose support is greater than or equal to a minsup threshold
Association Rule
An implication expression of the form X => Y, where X and Y are itemsets
Example:
{Milk, Diaper} => {Beer}
Example: {Milk, Diaper} => {Beer}
Support: s = σ({Milk, Diaper, Beer}) / |T| = 2/5 = 0.4
Confidence: c = σ({Milk, Diaper, Beer}) / σ({Milk, Diaper}) = 2/3 = 0.67
Item | MS(I)  | Sup(I)
A    | 0.10%  | 0.25%
B    | 0.20%  | 0.26%
C    | 0.30%  | 0.29%
D    | 0.50%  | 0.05%
E    | 3%     | 4.20%
[Figure: itemset lattice over items A-E (pairs AB ... DE, triples ABC ... CDE), shown for the item-specific minimum supports MS(I) above]
Example of Rules:
{Milk, Diaper} => {Beer} (s=0.4, c=0.67)
{Milk, Beer} => {Diaper} (s=0.4, c=1.0)
{Diaper, Beer} => {Milk} (s=0.4, c=0.67)
{Beer} => {Milk, Diaper} (s=0.4, c=0.67)
{Diaper} => {Milk, Beer} (s=0.4, c=0.5)
{Milk} => {Diaper, Beer} (s=0.4, c=0.5)
Observations:
All the above rules are binary partitions of the same itemset: {Milk, Diaper, Beer}
Rules originating from the same itemset have identical support but can have different confidence
Thus, we may decouple the support and confidence requirements
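This rule list can be reproduced from the five transactions by enumerating every binary partition of the frequent itemset; a sketch (values rounded to 2 decimals):

```python
from itertools import combinations

transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def sigma(itemset):
    """Support count: number of transactions containing the itemset."""
    return sum(1 for t in transactions if itemset <= t)

def rules_from(itemset):
    """All rules X => Y where X and Y binary-partition the itemset."""
    items, out = frozenset(itemset), []
    for r in range(1, len(items)):
        for lhs in combinations(sorted(items), r):
            lhs = frozenset(lhs)
            s = sigma(items) / len(transactions)  # support of the whole itemset
            c = sigma(items) / sigma(lhs)         # confidence of lhs => rest
            out.append((set(lhs), set(items - lhs), round(s, 2), round(c, 2)))
    return out

rules = rules_from({"Milk", "Diaper", "Beer"})
```

All six rules share s = 0.4, while confidence varies with the antecedent, which is exactly the observation above.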
2. Rule Generation
Generate high-confidence rules from each frequent itemset, where each rule is a binary partitioning of a frequent itemset
[Figure: lattice of candidate itemsets over items A-E, from pairs (AB ... DE) through triples (ABC ... CDE) and quadruples (ABCD ... BCDE) up to ABCDE]
TID | Items
1   | Bread, Milk
2   | Bread, Diaper, Beer, Eggs
3   | Milk, Diaper, Beer, Coke
4   | Bread, Milk, Diaper, Beer
5   | Bread, Milk, Diaper, Coke
List of Candidates
Computational Complexity
Given d unique items:
Total number of itemsets = 2^d
Total number of possible association rules:
R = sum over k = 1 .. d-1 of [ C(d, k) * sum over j = 1 .. d-k of C(d-k, j) ] = 3^d - 2^(d+1) + 1
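The closed form can be cross-checked against the double sum for small d; a quick sketch:

```python
from math import comb

def rule_count_closed(d):
    """Closed form: R = 3^d - 2^(d+1) + 1."""
    return 3 ** d - 2 ** (d + 1) + 1

def rule_count_sum(d):
    """Direct sum: choose k LHS items, then j of the remaining d-k items for the RHS."""
    return sum(comb(d, k) * sum(comb(d - k, j) for j in range(1, d - k + 1))
               for k in range(1, d))

counts = [(d, rule_count_closed(d)) for d in (2, 4, 6)]
```

For d = 6 this gives 3^6 - 2^7 + 1 = 602 possible rules, which shows why exhaustive rule enumeration quickly becomes infeasible.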
For all X, Y: (X ⊆ Y) implies s(X) >= s(Y)
Support of an itemset never exceeds the support of its subsets
This is known as the anti-monotone property of support
Rule Generation
Given a frequent itemset L, find all non-empty subsets f ⊂ L such that f => L - f satisfies the minimum confidence requirement
If {A,B,C,D} is a frequent itemset, candidate rules:
ABC => D, ABD => C, ACD => B, BCD => A,
A => BCD, B => ACD, C => ABD, D => ABC,
AB => CD, AC => BD, AD => BC, BC => AD, BD => AC, CD => AB
Rule Generation
How to efficiently generate rules from frequent itemsets?
In general, confidence does not have an anti-monotone property
c(ABC => D) can be larger or smaller than c(AB => D)
But confidence of rules generated from the same itemset has an anti-monotone property
e.g., L = {A,B,C,D}:
c(ABC => D) >= c(AB => CD) >= c(A => BCD)
Confidence is anti-monotone w.r.t. the number of items on the RHS of the rule
[Figure: lattice of rules generated from {A,B,C,D}, from ABCD => {} down to single-antecedent rules (ABC => D ... A => BCD); when a rule such as CD => AB has low confidence, all rules below it in the lattice (e.g. D => ABC, C => ABD) are pruned]
Items (1-itemsets)
Itemset | Count
Bread   | 4
Coke    | 2
Milk    | 4
Beer    | 3
Diaper  | 4
Eggs    | 1
Minimum Support = 3
If every subset is considered: 6C1 + 6C2 + 6C3 = 41
With support-based pruning: 6 + 6 + 1 = 13
Pairs (2-itemsets)
(No need to generate candidates involving Coke or Eggs)
Itemset         | Count
{Bread, Milk}   | 3
{Bread, Beer}   | 2
{Bread, Diaper} | 3
{Milk, Beer}    | 2
{Milk, Diaper}  | 3
{Beer, Diaper}  | 3
Triplets (3-itemsets)
Itemset               | Count
{Bread, Milk, Diaper} | 2
Apriori Algorithm
Method:
Let k = 1
Generate frequent itemsets of length 1
Repeat until no new frequent itemsets are identified
Generate length (k+1) candidate itemsets from length-k frequent itemsets
Prune candidate itemsets containing subsets of length k that are infrequent
Count the support of each candidate by scanning the DB
Eliminate candidates that are infrequent, leaving only those that are frequent
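A compact sketch of these steps on the five-transaction table, with minsup = 3; it generates, prunes and counts candidates level by level:

```python
from itertools import combinations

transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]
MINSUP = 3

def sigma(itemset):
    return sum(1 for t in transactions if itemset <= t)

def apriori():
    items = sorted(set().union(*transactions))
    # frequent 1-itemsets
    level = [frozenset([i]) for i in items if sigma({i}) >= MINSUP]
    frequent = list(level)
    k = 1
    while level:
        # candidate generation: union pairs of frequent k-itemsets into (k+1)-itemsets
        candidates = {a | b for a in level for b in level if len(a | b) == k + 1}
        # prune candidates with an infrequent k-subset, then count support in the DB
        level = [c for c in candidates
                 if all(frozenset(s) in frequent for s in combinations(c, k))
                 and sigma(c) >= MINSUP]
        frequent += level
        k += 1
    return frequent

freq = apriori()
```

On this data the algorithm returns four frequent 1-itemsets (Bread, Milk, Diaper, Beer) and four frequent pairs; the only surviving triplet candidate, {Bread, Milk, Diaper}, has support 2 and is eliminated.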
Candidate counting:
[Figure: the five transactions (TID 1-5, items Bread, Milk, Diaper, Beer, Eggs, Coke) processed against a hash-tree structure whose buckets are used to count candidate itemsets efficiently]
[Figure: support distribution of a retail data set]
[Table: one-item, two-item and three-item itemsets with their supports, computed from a small set of SKU transactions (e.g. transaction 1 = SKUs 1, 2, 3, 4; others include 2, 3, 4; 2, 3; 1, 2, 4; 2, 4)]
Classification task
Input: a training set of tuples, each labeled with one class label
Output: a model (classifier) that assigns a class label to each tuple based on the other attributes
The model can be used to predict the class of new tuples, for which the class label is missing or unknown
What is Classification
Data classification is a two-step process
First step: a model is built describing a predetermined set of data classes or concepts
Second step: the model is used for classification
[Figure: preprocessed data is split 2/3 : 1/3 into training data (model development, producing the classifier) and testing data (model assessment/scoring, producing the prediction accuracy)]
Accuracy of models:
The known class of test samples is matched against the class predicted by the model
Accuracy rate = % of test set samples correctly classified by the model
Training step
training data => classification algorithm => classifier (model)

Age | Car Type | Risk
20  | Combi    | High
18  | Sports   | High
40  | Sports   | High
50  | Family   | Low
35  | Minivan  | Low
30  | Combi    | High
32  | Family   | Low
40  | Combi    | Low

Classifier (model):
if age < 31 or Car Type = Sports
then Risk = High
Test step
test data => classifier (model)

Age | Car Type | Risk (actual) | Risk (predicted)
27  | Sports   | High          | High
34  | Family   | Low           | Low
66  | Family   | High          | Low
44  | Sports   | High          | High
Classification (prediction)
new data => classifier (model)

Age | Car Type | Risk (predicted)
27  | Sports   | High
34  | Minivan  | Low
55  | Family   | Low
34  | Sports   | High
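The learned rule can be applied mechanically to the new data; a minimal sketch:

```python
def classify(age, car_type):
    """Rule from the training step: young drivers or sports cars are high risk."""
    return "High" if age < 31 or car_type == "Sports" else "Low"

new_data = [(27, "Sports"), (34, "Minivan"), (55, "Family"), (34, "Sports")]
predictions = [classify(age, car) for age, car in new_data]
```

This yields High, Low, Low, High for the four new records, matching the prediction table.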
Comparing Classification Methods
Predictive accuracy: the ability of the model to correctly predict the class label of new or previously unseen data
Speed: the computation costs involved in generating and using the model
Robustness: the ability of the model to make correct predictions given noisy data or data with missing values
Comparing Classification Methods
Scalability: the ability to construct the model efficiently given a large amount of data
Interpretability: the level of understanding and insight that is provided by the model
Simplicity:
decision tree size
rule compactness
Problem formulation
Given records in the database with class labels, find a model for each class.

Age | Car Type | Risk
20  | Combi    | High
18  | Sports   | High
40  | Sports   | High
50  | Family   | Low
35  | Minivan  | Low
30  | Combi    | High
32  | Family   | Low
40  | Combi    | Low

[Decision tree: Age < 31? yes => High; no => Car Type is sports? yes => High; no => Low]
Classification techniques
Classification by Decision Tree Induction
A decision tree is a tree structure, where
each internal node denotes a test on an attribute,
each branch represents the outcome of the test,
leaf nodes represent classes or class distributions
[Decision tree: Age < 31? yes => High; no => Car Type is sports? yes => High; no => Low]
Many variants:
from machine learning (ID3, C4.5)
from statistics: CART (Classification And Regression Trees)
from pattern recognition: CHAID (Chi-squared Automatic Interaction Detection)
Tree Building
In the growth phase the tree is built by recursively partitioning the data until each partition is either "pure" (contains members of the same class) or sufficiently small.
The form of the split used to partition the data depends on the type of the attribute used in the split:
for a continuous attribute A, splits are of the form value(A) < x, where x is a value in the domain of A
for a categorical attribute A, splits are of the form value(A) ∈ X, where X ⊂ domain(A)
Split Criteria
Gini index (CART, SPRINT)
select the attribute that minimizes the impurity of a split
Gini index
Given a sample training set where each record represents a car-insurance applicant, we want to build a model of what makes an applicant a high or low insurance risk.
Training set => classifier (model)

RID | Age | Car Type | Risk
0   | 23  | family   | high
1   | 17  | sport    | high
2   | 43  | sport    | high
3   | 68  | family   | low
4   | 32  | truck    | low
5   | 20  | family   | high
Gini index
SPRINT algorithm:
Partition(Data S) {
  if (all points in S are of the same class) then
    return;
  for each attribute A do
    evaluate splits on attribute A;
  Use the best split found to partition S into S1 and S2;
  Partition(S1);
  Partition(S2);
}
Initial call: Partition(Training Data)
Gini index
Definition:
gini(S) = 1 - Σj pj²
where:
S is a data set containing examples from n classes
pj is the relative frequency of class j in S
Gini index
If dataset S is split into S1 and S2, then the splitting index is defined as follows:
giniSPLIT(S) = (n1/n) * gini(S1) + (n2/n) * gini(S2)
where n1 and n2 are the numbers of examples in S1 and S2, and n = n1 + n2
Example
Training set
RID | Age | Car Type | Risk
0   | 23  | family   | high
1   | 17  | sport    | high
2   | 43  | sport    | high
3   | 68  | family   | low
4   | 32  | truck    | low
5   | 20  | family   | high
Example
Attribute list for Age (sorted by Age):
Age | RID | Risk
17  | 1   | high
20  | 5   | high
23  | 0   | high
32  | 4   | low
43  | 2   | high
68  | 3   | low

Attribute list entries in RID order:
RID | Risk
0   | high
1   | high
2   | high
3   | low
4   | low
5   | high
Example
Possible values of a split point for the Age attribute are:
Age <= 17, Age <= 20, Age <= 23, Age <= 32, Age <= 43, Age <= 68

Tuple count:
     | Age <= 17 | Age > 17
High | 1         | 3
Low  | 0         | 2

G(Age <= 17) = 1 - (1² + 0²) = 0
G(Age > 17) = 1 - ((3/5)² + (2/5)²) = 1 - 13/25 = 12/25
GSPLIT = (1/6) * 0 + (5/6) * (12/25) = 2/5
Example
Tuple count:
     | Age <= 20 | Age > 20
High | 2         | 2
Low  | 0         | 2

G(Age <= 20) = 1 - (1² + 0²) = 0
G(Age > 20) = 1 - ((1/2)² + (1/2)²) = 1/2
GSPLIT = (2/6) * 0 + (4/6) * (1/2) = 1/3
Tuple count:
     | Age <= 23 | Age > 23
High | 3         | 1
Low  | 0         | 2

G(Age <= 23) = 1 - (1² + 0²) = 0
G(Age > 23) = 1 - ((1/3)² + (2/3)²) = 1 - 1/9 - 4/9 = 4/9
GSPLIT = (3/6) * 0 + (3/6) * (4/9) = 2/9
Example
Tuple count:
     | Age <= 32 | Age > 32
High | 3         | 1
Low  | 1         | 1
Example
Decision tree after the first split of the example set (the best split is Age <= 23, with the split point taken midway between 23 and 32):
Age <= 27.5 => Risk = High
Age > 27.5  => Risk = Low
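The split search above can be sketched in a few lines; it evaluates GSPLIT for each candidate Age threshold on the six training records:

```python
def gini(labels):
    """gini(S) = 1 - sum of squared class relative frequencies."""
    n = len(labels)
    return 1 - sum((labels.count(c) / n) ** 2 for c in set(labels)) if n else 0

ages = [23, 17, 43, 68, 32, 20]
risks = ["high", "high", "high", "low", "low", "high"]

def gini_split(threshold):
    """Weighted Gini of the partition Age <= threshold vs Age > threshold."""
    left = [r for a, r in zip(ages, risks) if a <= threshold]
    right = [r for a, r in zip(ages, risks) if a > threshold]
    n = len(risks)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

# best split point among the candidate ages (the maximum cannot split)
best = min(sorted(set(ages))[:-1], key=gini_split)
```

This reproduces the values computed above: GSPLIT is 2/5 at 17, 1/3 at 20 and 2/9 at 23, so Age <= 23 wins.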
Example
Attribute lists are divided at the split point.
Attribute lists for Age <= 27.5:
Age | RID | Risk
17  | 1   | high
20  | 5   | high
23  | 0   | high

Car Type | RID | Risk
family   | 0   | high
sport    | 1   | high
family   | 5   | high

Attribute lists for Age > 27.5:
Age | RID | Risk
32  | 4   | low
43  | 2   | high
68  | 3   | low

Car Type | RID | Risk
sport    | 2   | high
family   | 3   | low
truck    | 4   | low
Example
Evaluating splits for categorical attributes
We have to evaluate the splitting index for each of the 2^N combinations, where N is the cardinality of the categorical attribute.
Tuple count (for the partition Age > 27.5):
                    | High | Low
Car type = {sport}  | 1    | 0
Car type = {family} | 0    | 1
Car type = {truck}  | 0    | 1
Example
G(Car type ∈ {sport, family}) = 1 - (1/2)² - (1/2)² = 1/2
G(Car type ∈ {sport, truck}) = 1/2
G(Car type ∈ {family, truck}) = 1 - 0² - 1² = 0
GSPLIT(Car type ∈ {sport}) = (1/3) * 0 + (2/3) * 0 = 0
GSPLIT(Car type ∈ {family}) = (1/3) * 0 + (2/3) * (1/2) = 1/3
GSPLIT(Car type ∈ {truck}) = (1/3) * 0 + (2/3) * (1/2) = 1/3
GSPLIT(Car type ∈ {sport, family}) = (2/3) * (1/2) + (1/3) * 0 = 1/3
GSPLIT(Car type ∈ {sport, truck}) = (2/3) * (1/2) + (1/3) * 0 = 1/3
GSPLIT(Car type ∈ {family, truck}) = (2/3) * 0 + (1/3) * 0 = 0
Example
The lowest value of GSPLIT is for Car type ∈ {sport}, thus this is our split point. Decision tree after the second split of the example set:
Age <= 27.5 => Risk = High
Age > 27.5:
  Car type ∈ {sport} => Risk = High
  otherwise => Risk = Low
Information Gain
The information gain measure is used to select the test attribute at each node in the tree
The attribute with the highest information gain (or greatest entropy reduction) is chosen as the test attribute for the current node
This attribute minimizes the information needed to classify the samples in the resulting partitions
Information Gain
Let S be a set consisting of s data samples. Suppose the class label attribute has m distinct values defining m classes, Ci (for i = 1, ..., m)
Let si be the number of samples of S in class Ci
The expected information needed to classify a given sample is given by
I(s1, s2, ..., sm) = - Σi (si/s) log2(si/s)
Information Gain
Let attribute A have v distinct values, {a1, a2, ..., av}. Attribute A can be used to partition S into {S1, S2, ..., Sv}, where Sj contains those samples in S that have value aj of A
If A were selected as the test attribute, then these subsets would correspond to the branches grown from the node containing the set S
Information Gain
Let sij be the number of samples of class Ci in a subset Sj. The entropy, or expected information based on the partitioning into subsets by A, is given by:
E(A) = Σj [(s1j + s2j + ... + smj)/s] * I(s1j, s2j, ..., smj)
The smaller the entropy value, the greater the purity of the subset partitions.
Information Gain
The term (s1j + s2j + ... + smj)/s acts as the weight of the jth subset: the number of samples in the subset (i.e. having value aj of A) divided by the total number of samples in S. Note that for a given subset Sj,
I(s1j, s2j, ..., smj) = - Σi (sij/|Sj|) log2(sij/|Sj|)
Information Gain
The encoding information that would be gained by branching on A is
Gain(A) = I(s1, s2, ..., sm) - E(A)
Gain(A) is the expected reduction in entropy caused by knowing the value of attribute A
Example
RID
1
2
3
4
5
6
7
8
9
10
11
12
13
14
Age
<=30
<=30
31..40
>40
>40
>40
31..40
<=30
<=30
>40
<=30
31..40
31..40
>40
Income
high
high
high
medium
low
low
low
medium
low
medium
medium
medium
high
medium
student
no
no
no
no
yes
yes
yes
no
yes
yes
yes
no
yes
no
credit_rating
fair
excellent
fair
fair
fair
excellent
excellent
fair
fair
fair
excellent
excellent
fair
excellent
buys_computer
no
no
yes
yes
yes
no
yes
no
yes
yes
yes
yes
yes
no
Example
Let us consider the training set above. The class label attribute, buys_computer, has two distinct values. Let class C1 correspond to yes: s1 = 9; let class C2 correspond to no: s2 = 5.
I(s1, s2) = I(9, 5) = -9/14 log2(9/14) - 5/14 log2(5/14) = 0.94
Example
Next, we need to compute the entropy of each attribute. Let us start with the attribute age
for age = <=30:
s11 = 2, s21 = 3, I(s11, s21) = 0.971
for age = 31..40:
s12 = 4, s22 = 0, I(s12, s22) = 0
for age = >40:
s13 = 3, s23 = 2, I(s13, s23) = 0.971
Example
The entropy of age is
E(age) = 5/14 * I(s11, s21) + 4/14 * I(s12, s22) + 5/14 * I(s13, s23) = 0.694
The gain in information from such a partitioning would be:
Gain(age) = I(s1, s2) - E(age) = 0.246
Example
We can compute
Gain(income) = 0.029,
Gain(student) = 0.151, and
Gain(credit_rating) = 0.048
Since age has the highest information gain among the attributes, it is selected as the test attribute. A node is created and labeled with age, and branches are grown for each of the attribute's values.
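These gains can be reproduced with a short sketch over the 14 records (entropy in bits; exact values differ from the slide's in the third decimal because the slide rounds intermediate results):

```python
from math import log2

# (age, income, student, credit_rating, buys_computer) for the 14 records
data = [
    ("<=30", "high", "no", "fair", "no"),          ("<=30", "high", "no", "excellent", "no"),
    ("31..40", "high", "no", "fair", "yes"),       (">40", "medium", "no", "fair", "yes"),
    (">40", "low", "yes", "fair", "yes"),          (">40", "low", "yes", "excellent", "no"),
    ("31..40", "low", "yes", "excellent", "yes"),  ("<=30", "medium", "no", "fair", "no"),
    ("<=30", "low", "yes", "fair", "yes"),         (">40", "medium", "yes", "fair", "yes"),
    ("<=30", "medium", "yes", "excellent", "yes"), ("31..40", "medium", "no", "excellent", "yes"),
    ("31..40", "high", "yes", "fair", "yes"),      (">40", "medium", "no", "excellent", "no"),
]
ATTRS = {"age": 0, "income": 1, "student": 2, "credit_rating": 3}

def entropy(rows):
    """I(s1, ..., sm) = -sum pi log2 pi over the class frequencies."""
    n = len(rows)
    counts = {}
    for r in rows:
        counts[r[-1]] = counts.get(r[-1], 0) + 1
    return -sum(c / n * log2(c / n) for c in counts.values())

def gain(attr):
    """Gain(A) = I(S) - E(A), where E(A) is the weighted entropy of the partition by A."""
    i = ATTRS[attr]
    e = sum(len(part) / len(data) * entropy(part)
            for v in {r[i] for r in data}
            for part in [[r for r in data if r[i] == v]])
    return entropy(data) - e

gains = {a: round(gain(a), 3) for a in ATTRS}
```

As on the slide, age gives by far the largest gain and is selected as the root test attribute.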
Example
[Tree after the first split:
age = <=30 => buys_computers: yes, no (impure)
age = 31..40 => buys_computers: yes
age = >40 => buys_computers: yes, no (impure)]
Example
[Final tree:
age = <=30 => student? (no => buys_computers: no; yes => buys_computers: yes)
age = 31..40 => buys_computers: yes
age = >40 => credit_rating? (excellent => buys_computers: no; fair => buys_computers: yes)]
[Figure: candidate splits over a node containing class A: 40, class B: 30, class C: 20, class D: 10.
Split "if age < 40": yes => class A 40; no => class B 30, class C 20, class D 10.
Split "if age < 65": yes => class A 40, class D 10; no => class B 30, class C 20]
Tree pruning
When a decision tree is built, many of the branches will reflect anomalies in the training data due to noise or outliers.
Tree pruning methods typically use statistical measures to remove the least reliable branches, generally resulting in faster classification and an improvement in the ability of the tree to correctly classify independent test data
Tree pruning
Prepruning approach (stopping): a tree is pruned by halting its construction early (i.e. by deciding not to further split or partition the subset of training samples). Upon halting, the node becomes a leaf. The leaf holds the most frequent class among the subset samples
Postpruning approach (pruning): removes branches from a fully grown tree. A tree node is pruned by removing its branches. The lowest unpruned node becomes a leaf and is labeled with the most frequent class among its former branches
Extracting Classification Rules from Decision Trees
The knowledge represented in decision trees can be extracted and represented in the form of classification IF-THEN rules.
One rule is created for each path from the root to a leaf node
Each attribute-value pair along a given path forms a conjunction in the rule antecedent; the leaf node holds the class prediction, forming the rule consequent
Extracting Classification Rules from Decision Trees
The decision tree of the example above can be converted to classification rules:
IF age=<=30 AND student=no THEN buys_computers=no
IF age=<=30 AND student=yes THEN buys_computers=yes
IF age=31..40 THEN buys_computers=yes
IF age=>40 AND credit_rating=excellent THEN buys_computers=no
IF age=>40 AND credit_rating=fair THEN buys_computers=yes
Decision Trees
DT algorithms mainly differ on
Splitting criteria
Which variable to split first?
What values to use to split?
How many splits to form for each node?
Stopping criteria
When to stop building the tree
Decision Trees
Alternative splitting criteria
Gini index determines the purity of a specific class as a result of a decision to branch along a particular attribute/value
Used in CART
Balls of the same colour are clustered into a group.
Thus, we see that clustering means grouping of data, or dividing a large data set into smaller data sets that share some similarity.
Data matrix (flat file of attributes/coordinates), n objects by p attributes:

  [ x11 ... x1f ... x1p ]
  [ ...      ...    ... ]
  [ xi1 ... xif ... xip ]
  [ ...      ...    ... ]
  [ xn1 ... xnf ... xnp ]

Dissimilarity matrix (one mode), or distance matrix:

  [ 0                        ]
  [ d(2,1)  0                ]
  [ d(3,1)  d(3,2)  0        ]
  [ :       :       :        ]
  [ d(n,1)  d(n,2)  ...  0   ]
Interval-Scaled Attributes
Binary Attributes
Nominal Attributes
Ordinal Attributes
Ratio-Scaled Attributes
Attributes of Mixed Type
Manhattan distance: the Minkowski distance with q = 1
Clustering Methods
Partitioning methods
Hierarchical methods
Density-based methods
Grid-based methods
Model-based methods
Clustering Methods
Partitioning methods
Divide the objects into a set of partitions based on some criteria
Improve the partitions by shifting objects between them for higher intraclass similarity, interclass dissimilarity and other such criteria
Two popular heuristic methods
k-means algorithm
k-medoids algorithm
Clustering Methods
Hierarchical methods
Build up or break down groups of objects in a recursive manner
Two main approaches
Agglomerative approach
Divisive approach
Clustering Methods
Density-based methods
Grow a given cluster until the density decreases below a certain threshold
Grid-based methods
Form a grid structure by quantizing the object space into a finite number of grid cells
Model-based methods
Hypothesize a model and find the best fit of the data to the chosen model
Algorithm:
Select an initial partition with K clusters: random, first K, K separated points
Collaborative Filtering
Given a database of user preferences, predict the preference of a new user
Example: predict what new movies you will like based on
your past preferences
others with similar past preferences
their preferences for the new movies
Example: predict what books/CDs a person may want to buy
(and suggest it, or give discounts to tempt the customer)
K-Means Example
ID | Name  | Height | Weight
x1 | Ram   | 64     | 60
x2 | Shyam | 60     | 61
x3 | Gita  | 59     | 70
x4 | Mohan | 68     | 71
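A minimal k-means sketch on these four points (k = 2, Euclidean distance; taking the first two points as initial centroids is an assumption, since the slide does not fix an initialization):

```python
def kmeans(points, k, iters=10):
    """Plain k-means: assign each point to its nearest centroid, then recompute centroids."""
    centroids = points[:k]  # assumed initialization: the first k points
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            # index of the centroid with the smallest squared Euclidean distance
            j = min(range(k),
                    key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            clusters[j].append(p)
        # recompute centroids as cluster means (keep the old one if a cluster empties)
        centroids = [tuple(sum(xs) / len(cl) for xs in zip(*cl)) if cl else centroids[i]
                     for i, cl in enumerate(clusters)]
    return clusters

pts = [(64, 60), (60, 61), (59, 70), (68, 71)]  # Ram, Shyam, Gita, Mohan
clusters = kmeans(pts, 2)
```

With this initialization the algorithm converges to the clusters {Ram, Mohan} and {Shyam, Gita}; a different initialization could give a different partition.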
Biological versus Artificial Neural Networks

Biological  | Artificial
Neuron      | Node (or PE)
Dendrites   | Input
Axon        | Output
Synapse     | Weight
Slow        | Fast
Many (10^9) | Few (10^2)

[Figure: an artificial neuron (processing element, PE): inputs x1, x2, ..., xn with weights w1, w2, ..., wn, summation S = Σ XiWi (i = 1..n), transfer function f(S), outputs Y1, Y2, ..., Yn]
Elements/Concepts of ANN
Processing element (PE)
Information processing
Network structure
Feedforward vs. recurrent vs. multi-layer
Learning parameters
Supervised/unsupervised, backpropagation, learning rate, momentum
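The summation-and-transfer step of a single PE can be sketched directly; the sigmoid is an assumed choice of transfer function f, since the slides leave it unspecified:

```python
from math import exp

def neuron(inputs, weights):
    """One processing element: S = sum(Xi * Wi), output = f(S)."""
    s = sum(x * w for x, w in zip(inputs, weights))
    return 1 / (1 + exp(-s))  # sigmoid transfer function (an assumed choice)

y = neuron([1.0, 0.5, -1.0], [0.4, 0.8, 0.2])
```

A network is built by feeding such outputs forward as inputs to the next layer; learning then amounts to adjusting the weights (e.g. by backpropagation).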
Data Mining Software
[Chart: popularity of data mining tools, commercial versus free/open source, including KXEN, RapidMiner, MATLAB, KNIME, Microsoft SQL Server, Zementis, Oracle DM, Statsoft Statistica, Miner3D and Thinkanalytics, among others]
KDD Process
KDD Issues
Multimedia Data
Missing Data
Irrelevant Data
Noisy Data
Changing Data
Integration
Application
Preprocessing:
Remove identifying URLs
Remove error logs
Transformation:
Sessionize (sort and group)
Data Mining:
Identify and count patterns
Construct data structure
Interpretation/Evaluation:
Identify and display frequently accessed sequences.
[Figure: the CRISP-DM process over the data sources: 1 Business Understanding, 2 Data Understanding, 3 Data Preparation, 4 Model Building, 5 Testing and Evaluation, 6 Deployment; the early steps, through data preparation, account for ~85% of total project time]
CRISP-DM phases and generic tasks:
Business Understanding: Determine Business Objectives; Assess Situation; Determine Data Mining Goals; Produce Project Plan
Data Understanding: Collect Initial Data; Describe Data; Explore Data; Verify Data Quality
Data Preparation: Select Data; Clean Data; Construct Data; Integrate Data; Format Data
Modeling: Select Modeling Technique; Generate Test Design; Build Model; Assess Model
Evaluation: Evaluate Results; Review Process; Determine Next Steps
Deployment: Plan Deployment; Plan Monitoring & Maintenance; Produce Final Report; Review Project
Business understanding
Tasks and outputs:
Determine Business Objectives: Background; Business Objectives; Business Success Criteria
Assess Situation: Inventory of Resources; Requirements, Assumptions & Constraints; Risks and Contingencies; Terminology; Costs and Benefits
Determine Data Mining Goals: Data Mining Goals; Data Mining Success Criteria
Produce Project Plan: Project Plan; Initial Assessment of Tools and Techniques
Business understanding
1. Determine business objectives
Thoroughly understand, from a business perspective, what the client really wants to accomplish. Often the client has many competing objectives and constraints that must be properly balanced. The analyst's goal is to uncover important factors, at the beginning, that can influence the outcome of the project.
Business understanding
2. Assess situation
Detailed fact-finding about all of the resources, constraints, assumptions and other factors that should be considered in determining the data analysis goal and project plan.
Business understanding
3. Determine data mining goals
A business goal states objectives in business terminology. A data mining goal states project objectives in technical terms.
4. Produce project plan
Describe the intended plan for achieving the data mining goals and thereby achieving the business goals.
Data understanding
Tasks and outputs:
Collect Initial Data: Initial Data Collection Report
Describe Data: Data Description Report
Explore Data: Data Exploration Report
Verify Data Quality: Data Quality Report
Data understanding
1. Collect initial data
Acquire within the project the data (or access to the data) listed in the project resources. This initial collection includes data loading if necessary for data understanding.
2. Describe data
Examine the gross or surface properties of the acquired data and report on the results.
Data understanding
3. Explore data
This task tackles the data mining questions, which can be addressed using querying, visualization and reporting. These analyses may address the data mining goals directly; they may also contribute to or refine the data description and quality reports and feed into the transformation and other data preparation needed for further analysis.
4. Verify data quality
Examine the quality of the data.
Data preparation
Tasks and outputs:
Select Data: Rationale for Inclusion/Exclusion
Clean Data: Data Cleaning Report
Construct Data: Derived Attributes; Generated Records
Integrate Data: Merged Data
Format Data: Reformatted Data
Outputs of the phase: Data Set; Data Set Description
Data preparation
1. Select data
Decide on the data to be used for analysis. Criteria include relevance to the data mining goals, quality and technical constraints.
Data preparation
2. Clean data
Raise the data quality to the level required by the selected analysis techniques. This may involve selection of clean subsets of the data, the insertion of suitable defaults or more ambitious techniques such as the estimation of missing data by modeling.
Data preparation
3. Construct data
This task includes constructive data preparation operations such as the production of derived attributes, entire new records or transformed values for existing attributes.
4. Integrate data
Information is combined from multiple tables or records to create new records or values.
Data preparation
5. Format data
Formatting transformations refer primarily to syntactic modifications made to the data that do not change its meaning, but might be required by the modeling tool.
Modeling
Tasks and outputs:
Select Modeling Technique: Modeling Technique; Modeling Assumptions
Generate Test Design: Test Design
Build Model: Parameter Settings; Models; Model Description
Assess Model: Model Assessment; Revised Parameter Settings
Modeling
1. Select modeling technique
Select the actual modeling technique that is to be used.
2. Generate test design
Before we actually build a model, we need to generate a procedure or mechanism to test the model's quality and validity.
Modeling
3. Build model
Run the modeling tool on the prepared dataset to create one or more models.
4. Assess model
The data mining engineer tries to rank the models. He assesses the models according to the evaluation criteria. As far as possible he also takes into account business objectives and business success criteria. He also compares all results according to the evaluation criteria.
Evaluation
Tasks and outputs:
Evaluate Results: Assessment of Data Mining Results; Approved Models
Review Process: Review of Process
Determine Next Steps: List of Possible Actions; Decision
Evaluation
1. Evaluate results
This step assesses the degree to which the model meets the business objectives and seeks to determine if there is some business reason why this model is deficient. Another option of evaluation is to test the model(s) on test applications in the real application, if time and budget constraints permit.
Evaluation
2. Review process
Do a more thorough review of the data mining engagement in order to determine if there is any important factor or task that has somehow been overlooked. This review also covers quality assurance issues.
Evaluation
3. Determine next steps
According to the assessment results and the process review, the project decides how to proceed at this stage. This task includes analyses of remaining resources and budget, which influence the decisions.
Deployment
Tasks and outputs:
Plan Deployment: Deployment Plan
Plan Monitoring & Maintenance: Monitoring & Maintenance Plan
Produce Final Report: Final Report; Final Presentation
Review Project: Experience Documentation
Deployment
1. Plan deployment
This task takes the evaluation results and concludes a strategy for deployment. If a general procedure has been identified to create the relevant model(s), this procedure is documented here for later deployment.
Deployment
2. Plan monitoring and maintenance
In order to monitor the deployment of the data mining result(s), the project needs a detailed plan of the monitoring process. This plan takes into account the specific type of deployment.
Deployment
3. Produce final report
At the end of the project, the project leader and his team write up a final report. Depending on the deployment plan, this report may be only a summary of the project and its experiences or it may be a final and comprehensive presentation of the data mining result(s).
Deployment
4. Review project
Assess what went right and what went wrong, what was done well and what needs to be improved.
SEMMA
Sample (generate a representative sample of the data) => Explore => Modify => Model => Assess
Data preprocessing: Real-world Data => Well-formed Data
Data Consolidation
Collect data
Select data
Integrate data
Data Cleaning
Data Transformation
Normalize data
Discretize/aggregate data
Construct new attributes
Data Reduction
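One of the transformation steps, normalizing data, can be sketched as min-max scaling into [0, 1]; min-max is an assumed choice here, z-score scaling being an equally common alternative:

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Min-max normalization: map values linearly onto [new_min, new_max]."""
    lo, hi = min(values), max(values)
    if lo == hi:  # constant attribute: map everything to new_min
        return [new_min for _ in values]
    scale = (new_max - new_min) / (hi - lo)
    return [new_min + (v - lo) * scale for v in values]

heights = [64, 60, 59, 68]  # the Height attribute from the k-means example
normalized = min_max_normalize(heights)
```

Normalizing attributes to a common range keeps one attribute (e.g. Weight) from dominating distance computations in later mining steps such as clustering.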
Q & A