
Business Analytics

By
Dr. Atanu Rakshit
Email: atanu.rakshit@iimrohtak.ac.in
atanu.raks@gmail.com

Business Analytics
Text Book:
Business Intelligence: A Managerial Approach by Efraim Turban, Ramesh Sharda, Dursun Delen and David King, 2/e, Pearson, 2012

Reference Material:
Business Analytics for Managers by Gert H. N. Laursen and Jesper Thorlund, Wiley, 2010

Business Analytics
Reference Material:
Decision Support and Business Intelligence Systems by Efraim Turban, Ramesh Sharda and Dursun Delen, 9/e, Pearson, 2012
Business Intelligence Strategy: A Practical Guide for Achieving BI Excellence by John Boyer, Bill Frank, Brian Green and Tracy Harris, MC Press, 2010

Business Analytics
Sessions Plan
Introduction to Business Analytics
Data Warehousing
Data Mining for Business Intelligence
Business Analytics Model
Business Analytics at the Analytical Level
Business Analytics at the Strategic Level
Business Analytics at the Functional Level
Business Performance Management
Big Data Analytics
Project Presentation

Business Analytics

Introduction to Data Mining

Learning Objectives
Define data mining as an enabling technology for business intelligence
Understand the objectives and benefits of business analytics and data mining
Recognize the wide range of applications of data mining
Learn the standardized data mining processes
CRISP-DM
SEMMA
KDD

Learning Objectives
Understand the steps involved in data preprocessing for data mining
Learn different methods and algorithms of data mining
Build awareness of the existing data mining software tools
Commercial versus free/open source

Understand the pitfalls and myths of data mining

Opening Vignette
Data Mining Goes to Hollywood!
Decision situation
Problem
Proposed solution
Results
Answer & discuss the case questions

Opening Vignette:
Data Mining Goes to Hollywood!

A typical classification problem: predict which of nine box-office classes a movie will fall into. The dependent variable is the box-office receipt class; the independent variables describe the movie before release.

Class No.   Range (in $Millions)
1           < 1 (Flop)
2           > 1 and < 10
3           > 10 and < 20
4           > 20 and < 40
5           > 40 and < 65
6           > 65 and < 100
7           > 100 and < 150
8           > 150 and < 200
9           > 200 (Blockbuster)

Independent Variable   Number of Possible Values   Values
MPAA Rating            5                           G, PG, PG-13, R, NR
Competition            3                           High, Medium, Low
Star value             3                           High, Medium, Low
Genre                  10                          Sci-Fi, Historic Epic Drama, Modern Drama, Politically Related, Thriller, Horror, Comedy, Cartoon, Action, Documentary
Special effects        3                           High, Medium, Low
Sequel                 2                           Yes, No
Number of screens      -                           Positive integer

Opening Vignette:
Data Mining Goes to Hollywood!

[Figure: the DM process map in IBM SPSS Modeler, showing the model development process and the model assessment process]

Opening Vignette:
Data Mining Goes to Hollywood!

Prediction results of the individual and ensemble models*:

                        Individual Models          Ensemble Models
Performance Measure     SVM     ANN     C&RT     Random Forest   Boosted Tree   Fusion (Average)
Count (Bingo)           192     182     140      189             187            194
Count (1-Away)          104     120     126      121             104            120
Accuracy (% Bingo)      55.49%  52.60%  40.46%   54.62%          54.05%         56.07%
Accuracy (% 1-Away)     85.55%  87.28%  76.88%   89.60%          84.10%         90.75%
Standard deviation      0.93    0.87    1.05     0.76            0.84           0.63

* Training set: 1998-2005 movies; Test set: 2006 movies

Data Mining Concepts and Definitions

Why Data Mining?
More intense competition at the global scale
Recognition of the value in data sources
Availability of quality data on customers, vendors, transactions, Web, etc.
Consolidation and integration of data repositories into data warehouses
The exponential increase in data processing and storage capabilities, and the decrease in cost
Movement toward conversion of information resources into nonphysical form

Definition of Data Mining

"The nontrivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data stored in structured databases"
- Fayyad et al. (1996)

Keywords in this definition: Process, nontrivial, valid, novel, potentially useful, understandable
Data mining: a misnomer?
Other names: knowledge extraction, pattern analysis, knowledge discovery, information harvesting, pattern searching, data dredging

Data Mining at the Intersection of Many Disciplines

[Figure: data mining at the intersection of many disciplines: Statistics, Artificial Intelligence, Pattern Recognition, Mathematical Modeling, Machine Learning, Databases, and Management Science & Information Systems]

Data Mining Characteristics/Objectives

The source of data for DM is often a consolidated data warehouse (not always!).
The DM environment is usually a client-server or a Web-based information systems architecture.
Data is the most critical ingredient for DM, and it may include soft/unstructured data.
The miner is often an end user.
Striking it rich requires creative thinking.
Data mining tools' capabilities and ease of use are essential (Web, parallel processing, etc.).

Data in Data Mining

Data: a collection of facts usually obtained as the result of experiences, observations, or experiments
Data may consist of numbers, words, and images
Data: the lowest level of abstraction (from which information and knowledge are derived)

Taxonomy of data: Data divides into Categorical (Nominal, Ordinal) and Numerical (Interval, Ratio)
- DM with different data types?
- Other data types?

Data Mining: An Overview

The process of seeking relationships within a data set, that is, of seeking accurate, convenient and useful summary representations of some aspect of the data, involves a number of steps:
Determining the nature and structure of the representation to be used
Deciding how to quantify and compare how well different representations fit the data (that is, choosing a score function)
Choosing an algorithmic process to optimize the score function
Deciding what principles of data management are required to implement the algorithms efficiently

Data Mining Development

Relational Data Model
SQL
Association Rule Algorithms
Data Warehousing
Scalability Techniques
Similarity Measures
Hierarchical Clustering
IR Systems
Imprecise Queries
Textual Data
Web Search Engines
Bayes Theorem
Regression Analysis
EM Algorithm
K-Means Clustering
Time Series Analysis
Algorithm Design Techniques
Algorithm Analysis
Data Structures
Neural Networks
Decision Tree Algorithms

What Does DM Do? How Does It Work?

DM extracts patterns from data
Pattern? A mathematical (numeric and/or symbolic) relationship among data items

Types of patterns:
Association
Prediction
Cluster (segmentation)
Sequential (or time series) relationships

A Taxonomy for Data Mining Tasks

Data Mining Task    Learning Method   Popular Algorithms
Prediction          Supervised        Classification and Regression Trees, ANN, SVM, Genetic Algorithms
Classification      Supervised        Decision trees, ANN/MLP, SVM, Rough sets, Genetic Algorithms
Regression          Supervised        Linear/Nonlinear Regression, Regression trees, ANN/MLP, SVM
Association         Unsupervised      Apriori, OneR, ZeroR, Eclat
Link analysis       Unsupervised      Expectation Maximization, Apriori Algorithm, Graph-based Matching
Sequence analysis   Unsupervised      Apriori Algorithm, FP-Growth technique
Clustering          Unsupervised      K-means, ANN/SOM
Outlier analysis    Unsupervised      K-means, Expectation Maximization (EM)

Other Data Mining Tasks

These are in addition to the primary DM tasks (prediction, association, clustering):
Time-series forecasting
Part of sequence or link analysis?
Visualization
Another data mining task?

Types of DM
Hypothesis-driven data mining
Discovery-driven data mining

Hypothesis vs Discovery
Traditional analysis is via verification-driven analysis
Requires a hypothesis of the desired information (target)
Requires correct interpretation of the proposed query

Discovery-driven data mining
Finds data with common characteristics
Results are ideal solutions to discovery
Finds results without a previous hypothesis

Data Mining Applications

Customer Relationship Management
Maximize return on marketing campaigns
Improve customer retention (churn analysis)
Maximize customer value (cross- or up-selling)
Identify and treat most valued customers

Banking & Other Financial
Automate the loan application process
Detect fraudulent transactions
Maximize customer value (cross- and up-selling)
Optimize cash reserves with forecasting

Data Mining Applications (cont.)

Retailing and Logistics
Optimize inventory levels at different locations
Improve the store layout and sales promotions
Optimize logistics by predicting seasonal effects
Minimize losses due to limited shelf life

Manufacturing and Maintenance
Predict/prevent machinery failures
Identify anomalies in production systems to optimize manufacturing capacity
Discover novel patterns to improve product quality

Data Mining Applications (cont.)

Brokerage and Securities Trading
Predict changes in certain bond prices
Forecast the direction of stock fluctuations
Assess the effect of events on market movements
Identify and prevent fraudulent activities in trading

Insurance
Forecast claim costs for better business planning
Determine optimal rate plans
Optimize marketing to specific customers
Identify and prevent fraudulent claim activities

Data Mining Applications (cont.)

Other highly popular application areas for data mining:
Computer hardware and software
Science and engineering
Government and defense
Homeland security and law enforcement
Travel industry
Healthcare
Medicine
Entertainment industry
Sports
Etc.

Data Mining Process

A manifestation of best practices
A systematic way to conduct DM projects
Different groups have different versions
Most common standard processes:
CRISP-DM (Cross-Industry Standard Process for Data Mining)
SEMMA (Sample, Explore, Modify, Model, and Assess)
KDD (Knowledge Discovery in Databases)

Data Mining Process

[Figure: poll on data mining methodologies in use; Source: KDNuggets.com, August 2007]

Data Mining: An Example
Regression analysis is a tool which involves building a predictive model to relate a predictor variable, X, to a response variable, Y, through a relationship of the form Y = aX + b. For example, one might build a model which would allow us to predict a person's annual credit-card spending given their annual income. Clearly the model would not be perfect, but since spending typically increases with income, the model might well be adequate as a rough characterization.
For this example, one would have the following scenario:

Data Mining: An Example

The representation is a model in which the response variable, spending, is linearly related to the predictor variable, income
The score function most commonly used in this situation is the sum of squared discrepancies between the predicted spending from the model and the observed spending in the group of people described by the data. The smaller this sum is, the better the model fits the data

Data Mining: An Example

The optimization algorithm is quite simple in the case of linear regression: a and b can be expressed as explicit functions of the observed values of spending and income
Unless the data set is very large, few data management problems arise with regression algorithms. Simple summaries of the data (the sums, sums of squares, and sums of products of the X and Y values) are sufficient to compute estimates of a and b. This means that a single pass through the data will yield the estimates
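The single-pass idea is easy to see in code. A minimal Python sketch (our own variable names, not from the text), collecting the running sums and then solving for a and b:

def fit_line(xs, ys):
    # One pass over (income, spending) pairs collects the sums needed
    # for the least-squares estimates of a and b in Y = aX + b.
    n = sum_x = sum_y = sum_xx = sum_xy = 0.0
    for x, y in zip(xs, ys):
        n += 1
        sum_x += x
        sum_y += y
        sum_xx += x * x
        sum_xy += x * y
    # Closed-form least-squares solution expressed in the collected sums
    a = (n * sum_xy - sum_x * sum_y) / (n * sum_xx - sum_x ** 2)
    b = (sum_y - a * sum_x) / n
    return a, b

For instance, fit_line([30, 50, 70], [2.1, 3.4, 4.8]) returns the slope and intercept of the fitted spending-versus-income line.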

Data Mining Tasks

Exploratory Data Analysis (EDA)
Descriptive Modeling
Predictive Modeling
Discovering Patterns and Rules
Retrieval by Content

Exploratory Data Analysis (EDA)

The goal here is to explore the data without any clear ideas of what we are looking for
Typically, EDA techniques are interactive and visual, and there are many effective graphical display methods for relatively small, low-dimensional data sets
Some examples of EDA applications are:
Pie charts

Descriptive Modeling
The goal of a descriptive model is to describe all of the data (or the process generating the data)
Examples of such descriptions include models for the overall probability distribution of the data (density estimation), partitioning of the p-dimensional space into groups (cluster analysis, segmentation) and models describing the relationship between variables (dependency modeling)

Predictive Modeling
The aim is to build a model that will permit the value of one variable to be predicted from the known values of other variables
In classification, the variable being predicted is categorical, while in regression the variable is quantitative

Components of Data Mining Algorithms

Data mining algorithms have four basic components:
Model or Pattern Structure: Determining the underlying structure or functional form that we seek from the data
Score Function: Judging the quality of a fitted model
Optimization and Search Method: Optimizing the score function and searching over different model and pattern structures
Data Management Strategy: Handling data access efficiently during the search/optimization

Associations or Basket Analysis

A huge amount of data is stored electronically in all retail outlets due to the barcoding of all goods sold.
It is natural to try to find useful information in these mountains of data.
A conceptually simple yet interesting example is to find association rules in these large databases.

Associations or Basket Analysis

Association rules mining, or market basket analysis, searches for interesting customer habits by looking at associations.
The classical example is the one where a store was reported to have discovered that people buying nappies tend also to buy beer.
Applications in marketing, store layout, customer segmentation, medicine, finance, etc.

Basics
Given a set of transactions {T}, each containing a subset of items from an item set {i1, i2, ..., im}, we want to discover association relationships or correlations among a set of items.
We want to find groups of items that tend to occur together.
Association rules are often written as X => Y, meaning that whenever X appears, Y also tends to appear. X and Y may be single items or sets of items (the same item not appearing in both).

Market-Basket Model
Large Sets
Items A = {A1, A2, ..., Am}
e.g., products sold in a supermarket
Baskets B = {B1, B2, ..., Bn}
small subsets of items in A
e.g., items bought by a customer in one transaction
Support: sup(X) = number of baskets containing itemset X

Frequent Itemset Problem
Given a support threshold s
Frequent itemsets: sup(X) >= s
Find all frequent itemsets

Example
Items A = {milk, coke, pepsi, beer, juice}
Baskets:
B1 = {m, c, b}      B2 = {m, p, j}
B3 = {m, b}         B4 = {c, j}
B5 = {m, p, b}      B6 = {m, c, b, j}
B7 = {c, b, j}      B8 = {b, c}

Support threshold s = 3

Frequent itemsets:
{m}, {c}, {b}, {j}, {m, b}, {c, b}, {j, c}
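These supports can be checked by brute force. A small Python sketch (standard library only; the names are ours):

from itertools import combinations

baskets = [{'m','c','b'}, {'m','p','j'}, {'m','b'}, {'c','j'},
           {'m','p','b'}, {'m','c','b','j'}, {'c','b','j'}, {'b','c'}]
items = {'m', 'c', 'p', 'b', 'j'}
s = 3  # support threshold

# Enumerate every candidate itemset and count the baskets containing it
for k in range(1, len(items) + 1):
    for cand in combinations(sorted(items), k):
        sup = sum(1 for b in baskets if set(cand) <= b)
        if sup >= s:
            print(set(cand), sup)

This prints exactly the frequent itemsets listed above, together with their supports.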

Application 1 (Retail Stores)

Real market baskets
Chain stores keep TBs of customer purchase info
Value?
How typical customers navigate stores
Positioning tempting items
Suggests tie-in tricks, e.g., a hamburger sale while raising the ketchup price

High support needed, or no $$s

Association Rules
If-then rules about basket contents:
{A1, A2, ..., Ak} => Aj
if a basket has X = {A1, ..., Ak}, then it is likely to have Aj

Support (of the rule):
sup(X => Aj) = sup(X + Aj)

Confidence: the probability of Aj given A1, ..., Ak:
conf(X => Aj) = sup(X + Aj) / sup(X)

Finding Association Rules

Goal: find all association rules such that
support >= s
confidence >= c

Reduction to the Frequent Itemsets Problem:
Find all frequent itemsets X
Given X = {A1, ..., Ak}, generate all rules X - Aj => Aj
Support = sup(X)
Confidence = sup(X) / sup(X - Aj)
Observe: X - Aj is also frequent, so its support is already known

Example
B1 = {m, c, b}      B2 = {m, p, j}
B3 = {m, b}         B4 = {c, j}
B5 = {m, p, b}      B6 = {m, c, b, j}
B7 = {c, b, j}      B8 = {b, c}

Association Rule: {m, b} => c
Support = 2
Confidence = 2/4 = 50%

Applications
Marketing
Cross marketing
Attached mailing
Catalogue design
Cross-sell
Up-sell
Store layout
Promotion (segmentation)

Applications
Medicine
A patient submits their symptoms and undergoes multiple tests; the doctor diagnoses the problem and prescribes medicines.
Medical conditions => recommend minimal tests
Computer-Aided Detection

Applications
Sports
Successful movement patterns
Preferred skill set based on the total environmental condition
Agriculture
Remotely sensed imagery data of a field, to associate attributes of the location with the crop yield in that location
Predicting pest attacks
Weather conditions with the crop yield
Usage of various fertilizers and pesticides with crop yield

Application 2 (Information Retrieval)

Scenario 1
baskets = documents
items = words in documents
frequent word-groups = linked concepts

Scenario 2
items = sentences
baskets = documents containing sentences
frequent sentence-groups = possible plagiarism

Application 3 (Web Search)

Scenario 1
baskets = web pages
items = outgoing links
pages with similar references => about the same topic

Scenario 2
baskets = web pages
items = incoming links
pages with similar in-links => mirrors, or the same topic

Terminology
We assume that we have a set of transactions, each transaction being a list of items (e.g., books)
Suppose X and Y appear together in only 1% of the transactions, but whenever X appears there is an 80% chance that Y also appears
The 1% presence of X and Y together is called the support (or prevalence) of the rule, and 80% is called the confidence (or predictability) of the rule
These are measures of the interestingness of the rule

Terminology
The support for X => Y is the probability of both X and Y appearing together, that is P(X U Y)
The confidence of X => Y is the conditional probability of Y appearing given that X exists, that is:
P(Y | X) = P(X U Y) / P(X)
Confidence denotes the strength of the association. Support indicates the frequency of the pattern. A minimum support is necessary if an association is going to be of some business value.

Association Rule Mining

A very popular DM method in business
Finds interesting relationships (affinities) between variables (items or events)
Part of the machine learning family
Employs unsupervised learning
There is no output variable
Also known as market basket analysis
Often used as an example to describe DM to ordinary people, such as the famous relationship between diapers and beers!

Association Rule Mining

Input: the simple point-of-sale transaction data
Output: the most frequent affinities among items
Example: according to the transaction data, "customers who bought a laptop computer and virus protection software also bought an extended service plan 70 percent of the time"
How do you use such a pattern/knowledge?
Put the items next to each other for ease of finding
Promote the items as a package (do not put one on sale if the other(s) are on sale)
Place the items far apart from each other so that the customer has to walk the aisles to search for them, and by doing so potentially sees and buys other items

Association Rule Mining

Representative applications of association rule mining include:
In business: cross-marketing, cross-selling, store design, catalog design, e-commerce site design, optimization of online advertising, product pricing, and sales/promotion configuration
In medicine: relationships between symptoms and illnesses; diagnosis and patient characteristics and treatments (to be used in medical DSS); and genes and their functions (to be used in genomics projects)

Association Rule Mining

Are all association rules interesting and useful?

A Generic Rule: X => Y [S%, C%]
X, Y: products and/or services
X: Left-hand side (LHS)
Y: Right-hand side (RHS)
S: Support: how often X and Y go together
C: Confidence: how often Y goes together with X

Example: {Laptop Computer, Antivirus Software} => {Extended Service Plan} [30%, 70%]

Association Rule Mining

Given a set of transactions, find rules that will predict the occurrence of an item based on the occurrences of other items in the transaction.

Market-Basket transactions:
TID   Items
1     Bread, Milk
2     Bread, Diaper, Beer, Eggs
3     Milk, Diaper, Beer, Coke
4     Bread, Milk, Diaper, Beer
5     Bread, Milk, Diaper, Coke

Examples of Association Rules:
{Diaper} => {Beer}
{Milk, Bread} => {Eggs, Coke}
{Beer, Bread} => {Milk}

Implication means co-occurrence, not causality!

Definition: Frequent Itemset

Itemset
A collection of one or more items
Example: {Milk, Bread, Diaper}
k-itemset
An itemset that contains k items
Support count (σ)
Frequency of occurrence of an itemset
E.g. σ({Milk, Bread, Diaper}) = 2
Support (s)
Fraction of transactions that contain an itemset
E.g. s({Milk, Bread, Diaper}) = 2/5
Frequent Itemset
An itemset whose support is greater than or equal to a minsup threshold

TID   Items
1     Bread, Milk
2     Bread, Diaper, Beer, Eggs
3     Milk, Diaper, Beer, Coke
4     Bread, Milk, Diaper, Beer
5     Bread, Milk, Diaper, Coke

Definition: Association Rule

Association Rule
An implication expression of the form X => Y, where X and Y are itemsets
Example: {Milk, Diaper} => {Beer}

Rule Evaluation Metrics
Support (s)
Fraction of transactions that contain both X and Y
Confidence (c)
Measures how often items in Y appear in transactions that contain X

TID   Items
1     Bread, Milk
2     Bread, Diaper, Beer, Eggs
3     Milk, Diaper, Beer, Coke
4     Bread, Milk, Diaper, Beer
5     Bread, Milk, Diaper, Coke

Example: {Milk, Diaper} => {Beer}
s = σ(Milk, Diaper, Beer) / |T| = 2/5 = 0.4
c = σ(Milk, Diaper, Beer) / σ(Milk, Diaper) = 2/3 = 0.67
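Both metrics are easy to verify in code. A Python sketch over the five transactions above (the helper name is ours):

transactions = [
    {'Bread', 'Milk'},
    {'Bread', 'Diaper', 'Beer', 'Eggs'},
    {'Milk', 'Diaper', 'Beer', 'Coke'},
    {'Bread', 'Milk', 'Diaper', 'Beer'},
    {'Bread', 'Milk', 'Diaper', 'Coke'},
]

def sigma(itemset):
    # Support count: number of transactions containing the itemset
    return sum(1 for t in transactions if itemset <= t)

X, Y = {'Milk', 'Diaper'}, {'Beer'}
s = sigma(X | Y) / len(transactions)   # 2/5 = 0.4
c = sigma(X | Y) / sigma(X)            # 2/3 = 0.67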

Association Rule Mining Task

Given a set of transactions T, the goal of association rule mining is to find all rules having
support >= minsup threshold
confidence >= minconf threshold
Brute-force approach:
List all possible association rules
Compute the support and confidence for each rule
Prune rules that fail the minsup and minconf thresholds
=> Computationally prohibitive!

Effect of Support Distribution

How to set the appropriate minsup threshold?
If minsup is set too high, we could miss itemsets involving interesting rare items (e.g., expensive products)
If minsup is set too low, it is computationally expensive and the number of itemsets is very large
Using a single minimum support threshold may not be effective

Multiple Minimum Support

How to apply multiple minimum supports?
MS(i): minimum support for item i
e.g.: MS(Milk) = 5%, MS(Coke) = 3%, MS(Broccoli) = 0.1%, MS(Salmon) = 0.5%
MS({Milk, Broccoli}) = min(MS(Milk), MS(Broccoli)) = 0.1%
Challenge: support is no longer anti-monotone
Suppose Support(Milk, Coke) = 1.5% and Support(Milk, Coke, Broccoli) = 0.5%
Then {Milk, Coke} is infrequent but {Milk, Coke, Broccoli} is frequent

Multiple Minimum Support

Item   MS(I)    Sup(I)
A      0.10%    0.25%
B      0.20%    0.26%
C      0.30%    0.29%
D      0.50%    0.05%
E      3%       4.20%

[Figure: the itemset lattice over items A-E (AB, AC, ..., ABC, ..., CDE), illustrating which itemsets are candidates under item-specific minimum supports]
Multiple Minimum Support

[Figure: the same item table and lattice, with the itemsets that remain frequent under the item-specific thresholds highlighted]

Multiple Minimum Support (Liu 1999)

Order the items according to their minimum support (in ascending order)
e.g.: MS(Milk) = 5%, MS(Coke) = 3%, MS(Broccoli) = 0.1%, MS(Salmon) = 0.5%
Ordering: Broccoli, Salmon, Coke, Milk
Need to modify Apriori such that:
L1: set of frequent items
F1: set of items whose support is >= MS(1), where MS(1) is min_i(MS(i))
C2: candidate itemsets of size 2 are generated from F1 instead of L1

Mining Association Rules

TID   Items
1     Bread, Milk
2     Bread, Diaper, Beer, Eggs
3     Milk, Diaper, Beer, Coke
4     Bread, Milk, Diaper, Beer
5     Bread, Milk, Diaper, Coke

Example of Rules:
{Milk, Diaper} => {Beer} (s=0.4, c=0.67)
{Milk, Beer} => {Diaper} (s=0.4, c=1.0)
{Diaper, Beer} => {Milk} (s=0.4, c=0.67)
{Beer} => {Milk, Diaper} (s=0.4, c=0.67)
{Diaper} => {Milk, Beer} (s=0.4, c=0.5)
{Milk} => {Diaper, Beer} (s=0.4, c=0.5)

Observations:
All the above rules are binary partitions of the same itemset: {Milk, Diaper, Beer}
Rules originating from the same itemset have identical support but can have different confidence
Thus, we may decouple the support and confidence requirements

Mining Association Rules

Two-step approach:
1. Frequent Itemset Generation
Generate all itemsets whose support >= minsup
2. Rule Generation
Generate high-confidence rules from each frequent itemset, where each rule is a binary partitioning of a frequent itemset

Frequent itemset generation is still computationally expensive

Frequent Itemset Generation

[Figure: the itemset lattice for items A-E, from the null set through 1-itemsets, 2-itemsets (AB, AC, ...), up to ABCDE]

Given d items, there are 2^d possible candidate itemsets

Frequent Itemset Generation

Brute-force approach:
Each itemset in the lattice is a candidate frequent itemset
Count the support of each candidate by scanning the database
Match each transaction against every candidate

TID   Items
1     Bread, Milk
2     Bread, Diaper, Beer, Eggs
3     Milk, Diaper, Beer, Coke
4     Bread, Milk, Diaper, Beer
5     Bread, Milk, Diaper, Coke

Complexity ~ O(NMw), with N transactions, M candidates and maximum transaction width w => expensive, since M = 2^d !!!

Computational Complexity
Given d unique items:
Total number of itemsets = 2^d
Total number of possible association rules:

R = Σ_{k=1..d-1} [ C(d,k) × Σ_{j=1..d-k} C(d-k,j) ] = 3^d - 2^(d+1) + 1

If d = 6, R = 602 rules
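The closed form can be sanity-checked in a couple of lines (a Python sketch):

from math import comb

d = 6
R = sum(comb(d, k) * sum(comb(d - k, j) for j in range(1, d - k + 1))
        for k in range(1, d))
print(R, 3**d - 2**(d + 1) + 1)   # both print 602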

Frequent Itemset Generation Strategies

Reduce the number of candidates (M)
Complete search: M = 2^d
Use pruning techniques to reduce M
Reduce the number of transactions (N)
Reduce the size of N as the size of the itemset increases
Used by DHP and vertical-based mining algorithms
Reduce the number of comparisons (NM)
Use efficient data structures to store the candidates or transactions
No need to match every candidate against every transaction

Reducing the Number of Candidates

Apriori principle:
If an itemset is frequent, then all of its subsets must also be frequent
The Apriori principle holds due to the following property of the support measure:

for all X, Y: (X ⊆ Y) => s(X) >= s(Y)

The support of an itemset never exceeds the support of its subsets
This is known as the anti-monotone property of support

Rule Generation
Given a frequent itemset L, find all non-empty subsets f ⊂ L such that f => L - f satisfies the minimum confidence requirement
If {A,B,C,D} is a frequent itemset, the candidate rules are:
ABC => D,  ABD => C,  ACD => B,  BCD => A,
A => BCD,  B => ACD,  C => ABD,  D => ABC,
AB => CD,  AC => BD,  AD => BC,  BC => AD,
BD => AC,  CD => AB

If |L| = k, then there are 2^k - 2 candidate association rules (ignoring L => ∅ and ∅ => L)

Rule Generation
How to efficiently generate rules from frequent itemsets?
In general, confidence does not have an anti-monotone property
c(ABC => D) can be larger or smaller than c(AB => D)
But the confidence of rules generated from the same itemset does have an anti-monotone property
e.g., for L = {A,B,C,D}:
c(ABC => D) >= c(AB => CD) >= c(A => BCD)
Confidence is anti-monotone w.r.t. the number of items on the RHS of the rule

Rule Generation for Apriori Algorithm

[Figure: the lattice of rules generated from {A,B,C,D}, from ABCD => {} at the top down to rules such as D => ABC. If CD => AB is found to be a low-confidence rule, the rules below it in the lattice are pruned]

Rule Generation for Apriori Algorithm

A candidate rule is generated by merging two rules that share the same prefix in the rule consequent:
join(CD => AB, BD => AC) would produce the candidate rule D => ABC
Prune rule D => ABC if its subset AD => BC does not have high confidence

Illustrating Apriori Principle

Minimum support (count) = 3
If every subset is considered: C(6,1) + C(6,2) + C(6,3) = 6 + 15 + 20 = 41 candidates
With support-based pruning: 6 + 6 + 1 = 13 candidates

Items (1-itemsets):
Item     Count
Bread    4
Coke     2
Milk     4
Beer     3
Diaper   4
Eggs     1

Pairs (2-itemsets); no need to generate candidates involving Coke or Eggs:
Itemset            Count
{Bread, Milk}      3
{Bread, Beer}      2
{Bread, Diaper}    3
{Milk, Beer}       2
{Milk, Diaper}     3
{Beer, Diaper}     3

Triplets (3-itemsets):
Itemset                  Count
{Bread, Milk, Diaper}    3

Apriori Algorithm
Method (a minimal sketch in code follows):

Let k = 1
Generate frequent itemsets of length 1
Repeat until no new frequent itemsets are identified:
Generate length (k+1) candidate itemsets from length-k frequent itemsets
Prune candidate itemsets containing subsets of length k that are infrequent
Count the support of each candidate by scanning the DB
Eliminate candidates that are infrequent, leaving only those that are frequent
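The loop maps almost line for line onto code. A toy Python implementation (our own naming, not the book's; fine for small examples, not optimized):

from itertools import combinations

def apriori(transactions, minsup):
    # Returns {frozenset(itemset): support_count} for all frequent itemsets
    counts = {}
    items = {i for t in transactions for i in t}
    level = []
    for i in sorted(items):                      # k = 1: frequent single items
        sup = sum(1 for t in transactions if i in t)
        if sup >= minsup:
            counts[frozenset([i])] = sup
            level.append(frozenset([i]))
    k = 1
    while level:
        # Generate (k+1)-candidates by merging frequent k-itemsets, then
        # prune any candidate that has an infrequent k-subset
        cands = {a | b for a in level for b in level if len(a | b) == k + 1}
        cands = {c for c in cands
                 if all(frozenset(s) in counts for s in combinations(c, k))}
        level = []
        for c in cands:                          # one DB scan per level
            sup = sum(1 for t in transactions if c <= t)
            if sup >= minsup:
                counts[c] = sup
                level.append(c)
        k += 1
    return counts

Run on the five market-basket transactions with minsup=3, it returns the nine frequent itemsets found from the thirteen candidates counted in the previous slide.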

Multiple Minimum Support (Liu 1999)

Modifications to Apriori:
In traditional Apriori,
A candidate (k+1)-itemset is generated by merging two frequent itemsets of size k
The candidate is pruned if it contains any infrequent subsets of size k
The pruning step has to be modified:
Prune only if the subset contains the first item
e.g.: Candidate = {Broccoli, Coke, Milk} (ordered according to minimum support)
{Broccoli, Coke} and {Broccoli, Milk} are frequent but {Coke, Milk} is infrequent
The candidate is not pruned, because {Coke, Milk} does not contain the first item, i.e., Broccoli

Reducing the Number of Comparisons

Candidate counting:
Scan the database of transactions to determine the support of each candidate itemset
To reduce the number of comparisons, store the candidates in a hash structure
Instead of matching each transaction against every candidate, match it against the candidates contained in the hashed buckets

TID   Items
1     Bread, Milk
2     Bread, Diaper, Beer, Eggs
3     Milk, Diaper, Beer, Coke
4     Bread, Milk, Diaper, Beer
5     Bread, Milk, Diaper, Coke

[Figure: each transaction is matched against the hash structure of candidate buckets]

Effect of Support Distribution

Many real data sets have a skewed support distribution

[Figure: support distribution of a retail data set]

Association Rule Mining

Several algorithms are available for generating association rules:
Apriori
Eclat
FP-Growth
+ derivatives and hybrids of the three

The algorithms help identify the frequent itemsets, which are then converted to association rules

Association Rule Mining

Apriori Algorithm
Finds subsets that are common to at least a minimum number of the itemsets
Uses a bottom-up approach
Frequent subsets are extended one item at a time (the size of frequent subsets increases from one-item subsets to two-item subsets, then three-item subsets, and so on)
Groups of candidates at each level are tested against the data for minimum support
(see the figure)

Association Rule Mining

Apriori Algorithm
[Figure: raw transaction data (transaction numbers and SKUs) is summarized into one-item itemsets with their supports, then two-item itemsets, then three-item itemsets, keeping only the itemsets that meet the minimum support at each step]

Data Mining Methods: Classification

The most frequently used DM method
Part of the machine-learning family
Employs supervised learning
Learns from past data, classifies new data
The output variable is categorical (nominal or ordinal) in nature
Classification versus regression?
Classification versus clustering?

Classification task
Input: a training set of tuples, each labeled with one class label
Output: a model (classifier) that assigns a class label to each tuple based on the other attributes
The model can be used to predict the class of new tuples, for which the class label is missing or unknown

What is Classification?
Data classification is a two-step process:
First step: a model is built describing a predetermined set of data classes or concepts
Second step: the model is used for classification

Each tuple is assumed to belong to a predefined class, as determined by one of the attributes, called the class label attribute
Data tuples are also referred to as samples, examples, or objects

Estimation Methodologies for Classification

Simple split (or holdout or test sample estimation)
Split the data into 2 mutually exclusive sets: training (~70%) and testing (~30%)

[Figure: the preprocessed data is split into 2/3 training data and 1/3 testing data; the training data drives model development and produces the classifier, while the testing data drives model assessment (scoring) and yields the prediction accuracy]

For ANN, the data is split into three sub-sets (training [~60%], validation [~20%], testing [~20%])
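The holdout itself is a few lines of code. A Python sketch (shuffle before slicing; the names are ours):

import random

def holdout_split(records, train_frac=0.7, seed=42):
    # Split records into mutually exclusive training and testing sets
    shuffled = records[:]              # copy, so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_frac)
    return shuffled[:cut], shuffled[cut:]

train, test = holdout_split(list(range(100)))   # ~70 training, ~30 testing rows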

Train and test

The tuples (examples, samples) are divided into a training set + test set
The classification model is built in two steps:
training - build the model from the training set
test - check the accuracy of the model using the test set

Train and test

Kinds of models:
if-then rules
logical formulae
decision trees

Accuracy of models:
the known class of the test samples is matched against the class predicted by the model
accuracy rate = % of test set samples correctly classified by the model

Training step

[Figure: the training data is fed to a classification algorithm, which produces the classifier (model)]

Training data:
Age   Car Type   Risk
20    Combi      High
18    Sports     High
40    Sports     High
50    Family     Low
35    Minivan    Low
30    Combi      High
32    Family     Low
40    Combi      Low

Resulting classifier (model):
if age < 31 or Car Type = Sports then Risk = High

Test step

The classifier (model) is applied to test data whose class labels are known:

Age   Car Type   Risk (actual)   Risk (predicted)
27    Sports     High            High
34    Family     Low             Low
66    Family     High            Low
44    Sports     High            High

Classification (prediction)

The classifier (model) assigns a Risk label to new data:

Age   Car Type   Risk (predicted)
27    Sports     High
34    Minivan    Low
55    Family     Low
34    Sports     High

Classification vs. Prediction

There are two forms of data analysis that can be used to extract models describing data classes or to predict future data trends:
classification: predicts categorical labels
prediction: models continuous-valued functions

Comparing Classification Methods
Predictive accuracy: this refers to the ability of the model to correctly predict the class label of new or previously unseen data
Speed: this refers to the computation costs involved in generating and using the model
Robustness: this is the ability of the model to make correct predictions given noisy data or data with missing values

Comparing Classification Methods
Scalability: this refers to the ability to construct the model efficiently given large amounts of data
Interpretability: this refers to the level of understanding and insight that is provided by the model
Simplicity:
decision tree size
rule compactness

Domain-dependent quality indicators

Problem formulation
Given records in the database with class labels, find a model for each class.

Age   Car Type   Risk
20    Combi      High
18    Sports     High
40    Sports     High
50    Family     Low
35    Minivan    Low
30    Combi      High
32    Family     Low
40    Combi      Low

[Figure: decision tree: root test "Age < 31" (yes: Risk = High); otherwise test "Car Type is sports" (yes: Risk = High; no: Risk = Low)]

Classification techniques

Decision Tree Classification
Bayesian Classifiers
Neural Networks
Statistical Analysis
Genetic Algorithms
Rough Set Approach
k-nearest neighbor classifiers

Classification by Decision Tree Induction
A decision tree is a tree structure, where
each internal node denotes a test on an attribute,
each branch represents the outcome of the test,
leaf nodes represent classes or class distributions

[Figure: root test "Age < 31" (yes: Risk = High); otherwise test "Car Type is sports" (yes: Risk = High; no: Risk = Low)]

Decision Tree Induction

A decision tree is a class discriminator that recursively partitions the training set until each partition consists entirely or dominantly of examples from one class.
Each non-leaf node of the tree contains a split point, which is a test on one or more attributes and determines how the data is partitioned.

Decision Tree Induction

Basic algorithm: a greedy algorithm that constructs decision trees in a top-down recursive divide-and-conquer manner.

Many variants:
from machine learning (ID3, C4.5)
from statistics (CART: Classification And Regression Trees)
from pattern recognition (CHAID: Chi-squared Automatic Interaction Detection)

Main difference: the split criterion

Decision Tree Induction

The algorithm consists of two phases:
Build an initial tree from the training data such that each leaf node is pure
Prune this tree to increase its accuracy on test data

Tree Building
In the growth phase the tree is built by recursively partitioning the data until each partition is either "pure" (contains members of the same class) or sufficiently small.
The form of the split used to partition the data depends on the type of the attribute used in the split:
for a continuous attribute A, splits are of the form value(A) < x, where x is a value in the domain of A
for a categorical attribute A, splits are of the form value(A) ∈ X, where X ⊆ domain(A)

Tree Building Algorithm

Make Tree (Training Data T)
{
  Partition(T)
}

Partition(Data S)
{
  if (all points in S are in the same class) then
    return
  for each attribute A do
    evaluate splits on attribute A;
  use the best split found to partition S into S1 and S2
  Partition(S1)
  Partition(S2)
}

Tree Building Algorithm

While growing the tree, the goal at each node is to determine the split point that "best" divides the training records belonging to that leaf
To evaluate the goodness of a split, several splitting indices have been proposed

Split Criteria
Gini index (CART, SPRINT)
select the attribute that minimizes the impurity of a split

Information gain (ID3, C4.5)
use entropy to measure the impurity of a split
select the attribute that maximizes the entropy reduction

Chi-squared (χ2) contingency table statistics (CHAID)
measures the correlation between each attribute and the class label
select the attribute with maximal correlation

Gini index
Given a sample training set where each record represents a car-insurance applicant, we want to build a model of what makes an applicant a high or low insurance risk.

Training set:
RID   Age   Car Type   Risk
0     23    family     high
1     17    sport      high
2     43    sport      high
3     68    family     low
4     32    truck      low
5     20    family     high

The model built can be used to screen future insurance applicants by classifying them into the High or Low risk categories.

Gini index
SPRINT algorithm:

Partition(Data S) {
  if (all points in S are of the same class) then
    return;
  for each attribute A do
    evaluate splits on attribute A;
  use the best split found to partition S into S1 and S2
  Partition(S1);
  Partition(S2);
}

Initial call: Partition(Training Data)

Gini index
Definition:

gini(S) = 1 - Σ pj^2

where:
S is a data set containing examples from n classes
pj is the relative frequency of class j in S

E.g. two classes, Pos and Neg, and a dataset S with p Pos-elements and n Neg-elements:
ppos = p/(p+n)
pneg = n/(n+p)
gini(S) = 1 - ppos^2 - pneg^2

Gini index
If dataset S is split into S1 and S2, then the splitting index is defined as follows:

giniSPLIT(S) = (p1+n1)/(p+n) * gini(S1) + (p2+n2)/(p+n) * gini(S2)

where p1, n1 (p2, n2) denote the numbers of Pos-elements and Neg-elements in dataset S1 (S2), respectively.

In this definition the "best" split point is the one with the lowest value of the giniSPLIT index.
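Both definitions translate directly into code. A small Python sketch (the helper names are ours), reused in the worked example below:

def gini(labels):
    # gini(S) = 1 - sum of squared class frequencies
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def gini_split(left, right):
    # Weighted average of the Gini indices of the two partitions
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

For the Age <= 23 split evaluated below, gini_split(['high']*3, ['low', 'high', 'low']) returns 2/9.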

Example
Training set:
RID   Age   Car Type   Risk
0     23    family     high
1     17    sport      high
2     43    sport      high
3     68    family     low
4     32    truck      low
5     20    family     high

Example
Attribute list for Age (sorted on Age):
Age   RID   Risk
17    1     high
20    5     high
23    0     high
32    4     low
43    2     high
68    3     low

Attribute list for Car Type:
Car Type   RID   Risk
family     0     high
sport      1     high
sport      2     high
family     3     low
truck      4     low
family     5     high

Example
Possible values of a split point for the Age attribute are:
Age <= 17, Age <= 20, Age <= 23, Age <= 32, Age <= 43, Age <= 68

Tuple count:
            High   Low
Age <= 17   1      0
Age > 17    3      2

G(Age<=17) = 1 - (1^2 + 0^2) = 0
G(Age>17) = 1 - ((3/5)^2 + (2/5)^2) = 1 - 13/25 = 12/25
GSPLIT = (1/6) * 0 + (5/6) * (12/25) = 2/5

Example
Tuple count:
            High   Low
Age <= 20   2      0
Age > 20    2      2

G(Age<=20) = 1 - (1^2 + 0^2) = 0
G(Age>20) = 1 - ((1/2)^2 + (1/2)^2) = 1/2
GSPLIT = (2/6) * 0 + (4/6) * (1/2) = 1/3

Tuple count:
            High   Low
Age <= 23   3      0
Age > 23    1      2

G(Age<=23) = 1 - (1^2 + 0^2) = 0
G(Age>23) = 1 - ((1/3)^2 + (2/3)^2) = 1 - 1/9 - 4/9 = 4/9
GSPLIT = (3/6) * 0 + (3/6) * (4/9) = 2/9

Example
Tuple count:
            High   Low
Age <= 32   3      1
Age > 32    1      1

G(Age<=32) = 1 - ((3/4)^2 + (1/4)^2) = 1 - 10/16 = 6/16 = 3/8
G(Age>32) = 1 - ((1/2)^2 + (1/2)^2) = 1/2
GSPLIT = (4/6) * (3/8) + (2/6) * (1/2) = 1/4 + 1/6 = 5/12

The lowest value of GSPLIT is for Age <= 23, thus we have a split point at Age = (23+32)/2 = 27.5
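The whole threshold search can be replayed mechanically with the two helpers sketched earlier (Python; the data comes from the attribute list for Age):

ages  = [17, 20, 23, 32, 43, 68]
risks = ['high', 'high', 'high', 'low', 'high', 'low']

for t in ages[:-1]:   # candidate thresholds Age <= t
    left  = [r for a, r in zip(ages, risks) if a <= t]
    right = [r for a, r in zip(ages, risks) if a > t]
    print(t, gini_split(left, right))

Age <= 23 gives the minimum (2/9), confirming the split at (23+32)/2 = 27.5.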

Example
Decision tree after the first split of the example set:

[Figure: root split on Age: Age <= 27.5 leads to Risk = High; Age > 27.5 leads to Risk = Low]

Example
Attribute lists are divided at the split point.

Attribute lists for Age <= 27.5:
Age   RID   Risk         Car Type   RID   Risk
17    1     high         family     0     high
20    5     high         sport      1     high
23    0     high         family     5     high

Attribute lists for Age > 27.5:
Age   RID   Risk         Car Type   RID   Risk
32    4     low          sport      2     high
43    2     high         family     3     low
68    3     low          truck      4     low

Example
Evaluating splits for categorical attributes:
We have to evaluate the splitting index for each of the 2^N combinations, where N is the cardinality of the categorical attribute.

Tuple count (for the Age > 27.5 partition):
                      High   Low
Car type = {sport}    1      0
Car type = {family}   0      1
Car type = {truck}    0      1

G(Car type ∈ {sport}) = 1 - 1^2 - 0^2 = 0
G(Car type ∈ {family}) = 1 - 0^2 - 1^2 = 0
G(Car type ∈ {truck}) = 1 - 0^2 - 1^2 = 0

Example
G(Car type ∈ {sport, family}) = 1 - (1/2)^2 - (1/2)^2 = 1/2
G(Car type ∈ {sport, truck}) = 1/2
G(Car type ∈ {family, truck}) = 1 - 0^2 - 1^2 = 0

GSPLIT(Car type ∈ {sport}) = (1/3) * 0 + (2/3) * 0 = 0
GSPLIT(Car type ∈ {family}) = (1/3) * 0 + (2/3) * (1/2) = 1/3
GSPLIT(Car type ∈ {truck}) = (1/3) * 0 + (2/3) * (1/2) = 1/3
GSPLIT(Car type ∈ {sport, family}) = (2/3) * (1/2) + (1/3) * 0 = 1/3
GSPLIT(Car type ∈ {sport, truck}) = (2/3) * (1/2) + (1/3) * 0 = 1/3
GSPLIT(Car type ∈ {family, truck}) = (2/3) * 0 + (1/3) * 0 = 0

Example
The lowest value of GSPLIT is for Car type ∈ {sport} (equivalently, its complement {family, truck}), thus this is our split point. Decision tree after the second split of the example set:

[Figure: root split on Age: Age <= 27.5 leads to Risk = High; Age > 27.5 leads to a split on Car type: Car type ∈ {sport} leads to Risk = High, Car type ∈ {family, truck} leads to Risk = Low]

Information Gain
The information gain measure is used to select the test attribute at each node in the tree
The attribute with the highest information gain (or greatest entropy reduction) is chosen as the test attribute for the current node
This attribute minimizes the information needed to classify the samples in the resulting partitions

Information Gain
Let S be a set consisting of s data samples. Suppose the class label attribute has m distinct values defining m classes, Ci (for i = 1, ..., m)
Let si be the number of samples of S in class Ci
The expected information needed to classify a given sample is given by

I(s1, s2, ..., sm) = - Σ pi log2(pi)

where pi is the probability that an arbitrary sample belongs to class Ci and is estimated by si/s.

Information Gain
Let attribute A have v distinct values, {a1, a2, ..., av}. Attribute A can be used to partition S into {S1, S2, ..., Sv}, where Sj contains those samples in S that have value aj of A
If A were selected as the test attribute, then these subsets would correspond to the branches grown from the node containing the set S

Information Gain
Let sij be the number of samples of class Ci in a subset Sj. The entropy, or expected information based on the partitioning into subsets by A, is given by:

E(A) = Σj [(s1j + s2j + ... + smj)/s] * I(s1j, s2j, ..., smj)

The smaller the entropy value, the greater the purity of the subset partitions.

Information Gain
The term (s1j + s2j + ... + smj)/s acts as the weight of the jth subset: it is the number of samples in the subset (i.e. having value aj of A) divided by the total number of samples in S. Note that for a given subset Sj,

I(s1j, s2j, ..., smj) = - Σi pij log2(pij)

where pij = sij/|Sj| is the probability that a sample in Sj belongs to class Ci

Information Gain
The encoding information that would be gained by branching on A is

Gain(A) = I(s1, s2, ..., sm) - E(A)

Gain(A) is the expected reduction in entropy caused by knowing the value of attribute A
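Both quantities reduce to a few lines of code. A Python sketch (the helper names are ours), used below to verify the customer-database example:

from math import log2

def info(labels):
    # I(s1, ..., sm): expected information (entropy) of a labeled sample
    n = len(labels)
    return -sum((labels.count(c) / n) * log2(labels.count(c) / n)
                for c in set(labels))

def gain(values, labels):
    # Gain(A) = I(S) - E(A), for attribute values paired with class labels
    n = len(labels)
    e = sum((values.count(v) / n) *
            info([l for x, l in zip(values, labels) if x == v])
            for v in set(values))
    return info(labels) - e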

Example
RID   Age      Income   student   credit_rating   buys_computer
1     <=30     high     no        fair            no
2     <=30     high     no        excellent       no
3     31..40   high     no        fair            yes
4     >40      medium   no        fair            yes
5     >40      low      yes       fair            yes
6     >40      low      yes       excellent       no
7     31..40   low      yes       excellent       yes
8     <=30     medium   no        fair            no
9     <=30     low      yes       fair            yes
10    >40      medium   yes       fair            yes
11    <=30     medium   yes       excellent       yes
12    31..40   medium   no        excellent       yes
13    31..40   high     yes       fair            yes
14    >40      medium   no        excellent       no
Example
Let us consider the preceding training set of tuples taken from the customer database.

The class label attribute, buys_computer, has two distinct values (yes, no); therefore, there are two classes (m = 2).
C1 corresponds to yes: s1 = 9
C2 corresponds to no: s2 = 5
I(s1, s2) = I(9, 5) = -9/14 log2(9/14) - 5/14 log2(5/14) = 0.94

Example
Next, we need to compute the entropy of each attribute. Let us start with the attribute age:
for age = "<=30": s11 = 2, s21 = 3, I(s11, s21) = 0.971
for age = "31..40": s12 = 4, s22 = 0, I(s12, s22) = 0
for age = ">40": s13 = 3, s23 = 2, I(s13, s23) = 0.971

Example
The entropy of age is:
E(age) = 5/14 * I(s11, s21) + 4/14 * I(s12, s22) + 5/14 * I(s13, s23) = 0.694

The gain in information from such a partitioning would be:
Gain(age) = I(s1, s2) - E(age) = 0.246

Example
Similarly, we can compute:
Gain(income) = 0.029,
Gain(student) = 0.151, and
Gain(credit_rating) = 0.048
Since age has the highest information gain among the attributes, it is selected as the test attribute. A node is created and labeled with age, and branches are grown for each of the attribute's values.
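These four numbers can be reproduced with the info/gain helpers above (Python sketch; the rows are the 14-tuple table from this example):

rows = [  # (age, income, student, credit_rating, buys_computer)
    ('<=30','high','no','fair','no'),         ('<=30','high','no','excellent','no'),
    ('31..40','high','no','fair','yes'),      ('>40','medium','no','fair','yes'),
    ('>40','low','yes','fair','yes'),         ('>40','low','yes','excellent','no'),
    ('31..40','low','yes','excellent','yes'), ('<=30','medium','no','fair','no'),
    ('<=30','low','yes','fair','yes'),        ('>40','medium','yes','fair','yes'),
    ('<=30','medium','yes','excellent','yes'),('31..40','medium','no','excellent','yes'),
    ('31..40','high','yes','fair','yes'),     ('>40','medium','no','excellent','no'),
]
labels = [r[4] for r in rows]
for i, name in enumerate(['age', 'income', 'student', 'credit_rating']):
    print(name, round(gain([r[i] for r in rows], labels), 3))
# age 0.246, income 0.029, student 0.151, credit_rating 0.048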

Example
[Figure: root node "age" with three branches: "<=30" leads to a mixed leaf (buys_computers: yes, no); "31..40" leads to buys_computers: yes; ">40" leads to a mixed leaf (buys_computers: yes, no)]

Example
[Figure: the finished tree. Root "age": branch "<=30" tests "student" (no: buys_computers = no; yes: buys_computers = yes); branch "31..40" gives buys_computers = yes; branch ">40" tests "credit_rating" (excellent: buys_computers = no; fair: buys_computers = yes)]

Entropy vs. Gini index

The Gini index tends to isolate the largest class from all other classes, while entropy tends to find groups of classes that add up to 50% of the data.

Example with classes A: 40, B: 30, C: 20, D: 10:
Gini-style split ("if age < 40"): the yes branch gets class A (40); the no branch gets classes B, C, D (30, 20, 10)
Entropy-style split ("if age < 65"): the yes branch gets classes A, D (40, 10); the no branch gets classes B, C (30, 20)

Entropy vs. Gini index

Gini will tend to find the largest class, while entropy tends to find groups of classes that make up 50% of the data
Use Gini to minimize misclassification, and entropy for exploratory analysis
Gini is intended for continuous attributes, and entropy for attributes that occur in classes

Tree pruning
When a decision tree is built, many of the branches will reflect anomalies in the training data due to noise or outliers.
Tree pruning methods typically use statistical measures to remove the least reliable branches, generally resulting in faster classification and an improvement in the ability of the tree to correctly classify independent test data

Tree pruning
Prepruning approach (stopping): a tree is pruned by halting its construction early (i.e. by deciding not to further split or partition the subset of training samples). Upon halting, the node becomes a leaf. The leaf holds the most frequent class among the subset samples
Postpruning approach (pruning): removes branches from a fully grown tree. A tree node is pruned by removing its branches. The lowest unpruned node becomes a leaf and is labeled with the most frequent class among its former branches

Extracting Classification Rules from Decision Trees
The knowledge represented in decision trees can be extracted and represented in the form of classification IF-THEN rules.
One rule is created for each path from the root to a leaf node
Each attribute-value pair along a given path forms a conjunction in the rule antecedent; the leaf node holds the class prediction, forming the rule consequent

Extracting Classification Rules from Decision Trees
The decision tree of the preceding example can be converted to the following classification rules:
IF age = "<=30" AND student = no THEN buys_computers = no
IF age = "<=30" AND student = yes THEN buys_computers = yes
IF age = "31..40" THEN buys_computers = yes
IF age = ">40" AND credit_rating = excellent THEN buys_computers = no
IF age = ">40" AND credit_rating = fair THEN buys_computers = yes

Other Classification Methods

There are a number of classification methods in the literature:
Bayesian classifiers
Neural-network classifiers
K-nearest neighbor classifiers
Association-based classifiers
Rough and fuzzy sets

Decision Trees
DT algorithms mainly differ on:
Splitting criteria
Which variable to split first?
What values to use to split?
How many splits to form for each node?
Stopping criteria
When to stop building the tree
Pruning (generalization method)
Pre-pruning versus post-pruning

The most popular DT algorithms include ID3, C4.5, C5; CART; CHAID; M5

Decision Trees
Alternative splitting criteria:
Gini index determines the purity of a specific class as a result of a decision to branch along a particular attribute/value
Used in CART
Information gain uses entropy to measure the extent of uncertainty or randomness of a particular attribute/value split
Used in ID3, C4.5, C5
Chi-square statistics (used in CHAID)

Cluster Analysis for Data Mining

Used for automatic identification of natural groupings of things
Part of the machine-learning family
Employs unsupervised learning
Learns the clusters of things from past data, then assigns new instances
There is no output variable
Also known as segmentation

Cluster Analysis for Data Mining

Clustering results may be used to:
Identify natural groupings of customers
Identify rules for assigning new cases to classes for targeting/diagnostic purposes
Provide characterization, definition, and labeling of populations
Decrease the size and complexity of problems for other data mining methods
Identify outliers in a specific domain (e.g., rare-event detection)

Cluster Analysis for Data Mining

Analysis methods:
Statistical methods (including both hierarchical and nonhierarchical), such as k-means, k-modes, and so on
Neural networks (adaptive resonance theory [ART], self-organizing map [SOM])
Fuzzy logic (e.g., fuzzy c-means algorithm)
Genetic algorithms

Divisive versus agglomerative methods

Cluster Analysis for Data Mining

How many clusters?
There is no truly optimal way to calculate it
Heuristics are often used:
Look at the sparseness of clusters
Number of clusters = (n/2)^(1/2) (n: number of data points)
Use the Akaike information criterion (AIC)
Use the Bayesian information criterion (BIC)

Most cluster analysis methods involve the use of a distance measure to calculate the closeness between pairs of items
Euclidean versus Manhattan (rectilinear) distance

What is Cluster Analysis?

Cluster: a collection of data objects
(Intraclass similarity) - Objects are similar to objects in the same cluster
(Interclass dissimilarity) - Objects are dissimilar to objects in other clusters
Examples of clusters?
Cluster analysis: a statistical method to identify and group sets of similar objects into classes
Good clustering methods produce high-quality clusters with high intraclass similarity and interclass dissimilarity
Unlike classification, it is unsupervised learning

Clustering (slide from Han and Kamber)

Clustering of data is a method by which large sets of data are grouped into clusters of smaller sets of similar data.
The example demonstrates the clustering of balls of the same colour: there are a total of 10 balls of three different colours, and we are interested in clustering them into three groups by colour.

[Figure: the balls of the same colour are clustered into one group each]

Thus, we see that clustering means grouping data, or dividing a large data set into smaller data sets that share some similarity.

Usual Working Data Structures

Data matrix (two modes): a flat file of attributes/coordinates, with one row per object (n objects) and one column per attribute (p attributes):

x11  ...  x1f  ...  x1p
...  ...  ...  ...  ...
xi1  ...  xif  ...  xip
...  ...  ...  ...  ...
xn1  ...  xnf  ...  xnp

Dissimilarity matrix (one mode), or distance matrix: lower-triangular, with zeros on the diagonal:

0
d(2,1)   0
d(3,1)   d(3,2)   0
:        :        :
d(n,1)   d(n,2)   ...      0

Data Types and Distance Metrics

Interval-Scaled Attributes
Binary Attributes
Nominal Attributes
Ordinal Attributes
Ratio-Scaled Attributes
Attributes of Mixed Type

Data Types and Distance Metrics

Interval-Scaled Attributes
Using Interval-Scaled Values
Step 1: Standardize the data
To ensure they all have equal weight
To match up different scales into a uniform, single scale
Not always needed! Sometimes we require unequal weights for an attribute
Step 2: Compute dissimilarity between records
Use the Euclidean, Manhattan or Minkowski distance

Data Types and Distance Metrics

Interval-Scaled Attributes
Minkowski distance:

d(i,j) = ( |xi1 - xj1|^q + |xi2 - xj2|^q + ... + |xip - xjp|^q )^(1/q)

Euclidean distance: q = 2
Manhattan distance: q = 1

What are the shapes of the resulting clusters? Spherical in shape.
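The formula is a one-liner in code (Python sketch; the function name is ours):

def minkowski(x, y, q=2):
    # d(i,j) for two equal-length records; q=2 Euclidean, q=1 Manhattan
    return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1 / q)

minkowski([64, 60], [60, 61], q=1)   # 5.0 (Manhattan)
minkowski([64, 60], [60, 61])        # 4.12... (Euclidean)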

Clustering Methods

Partitioning methods
Hierarchical methods
Density-based methods
Grid-based methods
Model-based methods

The choice of algorithm depends on the type of data available and on the nature and purpose of the application

Clustering Methods
Partitioning methods
Divide the objects into a set of partitions based on some criteria
Improve the partitions by shifting objects between them for higher intraclass similarity, interclass dissimilarity and other such criteria
Two popular heuristic methods:
k-means algorithm
k-medoids algorithm

Clustering Methods
Hierarchical methods
Build up or break down groups of objects in a recursive manner
Two main approaches:
Agglomerative approach
Divisive approach

[Figure: agglomerative versus divisive hierarchical clustering (Wikipedia)]

Clustering Methods
Density-based methods
Grow a given cluster until the density decreases below a certain threshold
Grid-based methods
Form a grid structure by quantizing the object space into a finite number of grid cells
Model-based methods
Hypothesize a model and find the best fit of the data to the chosen model

K-Means Clustering Algorithm

The K-Means algorithm is a type of partitioning method
Groups instances based on attributes into k groups
High intra-cluster similarity; low inter-cluster similarity
Cluster similarity is measured with respect to the mean value of the objects in the cluster

How does K-means work? (A sketch in code follows.)
First, select K random instances from the data as the initial cluster centers
Second, each instance is assigned to its closest (most similar) cluster center
Third, each cluster center is updated to the mean of its constituent instances
Repeat steps two and three until there is no further change in the assignment of instances to clusters
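A compact sketch of these steps in plain Python (no libraries; the names are ours):

import random

def dist2(p, q):
    # Squared Euclidean distance between two points
    return sum((a - b) ** 2 for a, b in zip(p, q))

def k_means(points, k, iters=100):
    centers = random.sample(points, k)          # step 1: random initial centers
    for _ in range(iters):
        # Step 2: assign each point to its closest center
        assign = [min(range(k), key=lambda c: dist2(p, centers[c]))
                  for p in points]
        # Step 3: move each center to the mean of its assigned points
        new_centers = []
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            new_centers.append(tuple(sum(d) / len(members) for d in zip(*members))
                               if members else centers[c])
        if new_centers == centers:              # stop when the centers stabilize
            break
        centers = new_centers
    return centers, assign

For instance, k_means([(64, 60), (60, 61), (59, 70), (68, 71)], k=2) clusters the four height/weight records used in the K-Means example later in this section.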

Partitional methods: K-means

Criteria: minimize the sum of squared distances
between each point and the centroid of the cluster
between each pair of points in the cluster

Algorithm:
Select an initial partition with K clusters: random, first K, or K separated points
Repeat until stabilization:
Assign each point to the closest cluster center
Generate new cluster centers
Adjust clusters by merging/splitting

Collaborative Filtering
Given a database of user preferences, predict the preferences of a new user
Example: predict what new movies you will like based on
your past preferences
others with similar past preferences
their preferences for the new movies
Example: predict what books/CDs a person may want to buy
(and suggest them, or give discounts to tempt the customer)

K-Means Clustering Algorithm

[Figure: Cluster Analysis for Data Mining: the k-means clustering algorithm illustrated in three steps]

K-Means Example
ID   Name    Height   Weight
x1   Ram     64       60
x2   Shyam   60       61
x3   Gita    59       70
x4   Mohan   68       71

Artificial Neural Networks for Data Mining

Artificial neural networks (ANN or NN) are a brain metaphor for information processing
a.k.a. neural computing
Very good at capturing highly complex nonlinear functions!
Many uses: prediction (regression, classification), clustering/segmentation
Many application areas: finance, medicine, marketing, manufacturing, service operations, information systems, etc.

Biological versus Artificial Neural Networks

[Figure: a biological neuron (dendrites, synapses, axon) alongside an artificial neuron: inputs x1 ... xn enter a processing element (PE) through weights w1 ... wn, are combined by the summation S = Σ XiWi, passed through a transfer function f(S), and emitted as outputs Y1 ... Yn]

Biological      Artificial
Neuron          Node (or PE)
Dendrites       Input
Axon            Output
Synapse         Weight
Slow            Fast
Many (10^9)     Few (10^2)

Elements/Concepts of ANN
Processing element (PE)
Information processing
Network structure
Feedforward vs. recurrent vs. multi-layer
Learning parameters
Supervised/unsupervised, backpropagation, learning rate, momentum

ANN software: NN shells, integrated modules in comprehensive DM software, ...

Data Mining Software

[Figure: bar chart of data mining tools in use, "alone" versus "total (with others)"; responses include SPSS PASW Modeler (formerly Clementine), SAS / SAS Enterprise Miner, Microsoft Excel, R, RapidMiner, Weka (now Pentaho), KXEN, MATLAB, KNIME, Microsoft SQL Server, Zementis, Oracle DM, StatSoft Statistica, Salford CART/Mars, Orange, Angoss, C4.5/C5.0/See5, Bayesia, Insightful Miner/S-Plus (now TIBCO), Megaputer, Viscovery, Clario Analytics, Miner3D, Thinkanalytics, and custom code. Source: KDNuggets.com, May 2009]

Commercial:
IBM SPSS Modeler (formerly Clementine)
SAS Enterprise Miner
IBM Intelligent Miner
StatSoft Statistica Data Miner
many more

Free and/or Open Source:
RapidMiner
Weka
many more

Data Mining in MS SQL Server 2008

Data Mining Myths

Data mining ...
provides instant solutions/predictions.
is not yet viable for business applications.
requires a separate, dedicated database.
can only be done by those with advanced degrees.
is only for large firms that have lots of customer data.
is another name for good old statistics.

Common Data Mining Blunders

1. Selecting the wrong problem for data mining
2. Ignoring what your sponsor thinks data mining is and what it really can/cannot do
3. Not leaving sufficient time for data acquisition, selection and preparation
4. Looking only at aggregated results and not at individual records/predictions
5. Being sloppy about keeping track of the data mining procedure and results

Common Data Mining Mistakes

6. Ignoring suspicious (good or bad) findings and quickly moving on
7. Running mining algorithms repeatedly and blindly, without thinking about the next stage
8. Naively believing everything you are told about the data
9. Naively believing everything you are told about your own data mining analysis
10. Measuring your results differently from the way your sponsor measures them

Data Mining vs. KDD

Knowledge Discovery in Databases (KDD): the process of finding useful information and patterns in data.
Data Mining: the use of algorithms to extract the information and patterns derived by the KDD process.

KDD Process

[Figure: the KDD pipeline, modified from [FPSS96C]]

Selection: Obtain data from various sources.
Preprocessing: Cleanse data.
Transformation: Convert to a common format. Transform to a new format.
Data Mining: Obtain desired results.
Interpretation/Evaluation: Present results to the user in a meaningful manner.

KDD Issues

Multimedia Data
Missing Data
Irrelevant Data
Noisy Data
Changing Data
Integration
Application

KDD Process Example: Web Log

Selection:
Select log data (dates and locations) to use

Preprocessing:
Remove identifying URLs
Remove error logs

Transformation:
Sessionize (sort and group)

Data Mining:
Identify and count patterns
Construct data structure

Interpretation/Evaluation:
Identify and display frequently accessed sequences

Potential user applications:
Cache prediction
Personalization
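As a hedged illustration of the transformation step, here is a minimal sessionization sketch; the log record format (ip, timestamp, url) and the 30-minute inactivity threshold are assumptions for demonstration.

```python
# A minimal sessionization sketch: sort log records, then group each visitor's
# requests into sessions split on a 30-minute inactivity gap (assumed format).
from datetime import datetime, timedelta

SESSION_TIMEOUT = timedelta(minutes=30)  # assumed inactivity threshold

def sessionize(records):
    """Group (ip, timestamp, url) records into per-visitor sessions."""
    sessions, current = [], []
    last_ip, last_time = None, None
    for ip, ts, url in sorted(records, key=lambda r: (r[0], r[1])):
        # start a new session on a new visitor or after a long gap
        if ip != last_ip or (last_time and ts - last_time > SESSION_TIMEOUT):
            if current:
                sessions.append(current)
            current = []
        current.append(url)
        last_ip, last_time = ip, ts
    if current:
        sessions.append(current)
    return sessions

log = [("10.0.0.1", datetime(2012, 1, 5, 9, 0), "/home"),
       ("10.0.0.1", datetime(2012, 1, 5, 9, 2), "/products"),
       ("10.0.0.1", datetime(2012, 1, 5, 11, 0), "/home"),   # gap: new session
       ("10.0.0.2", datetime(2012, 1, 5, 9, 1), "/home")]
print(sessionize(log))  # [['/home', '/products'], ['/home'], ['/home']]
```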

Data Mining Process: CRISP-DM
(Cross-Industry Standard Process for Data Mining)

[Figure: the cyclical CRISP-DM process around the data sources:
1 Business Understanding, 2 Data Understanding, 3 Data Preparation,
4 Model Building, 5 Testing and Evaluation, 6 Deployment,
with feedback loops between the phases]

Data Mining Process: CRISP-DM

Step 1: Business Understanding
Step 2: Data Understanding
Step 3: Data Preparation (!)
Step 4: Model Building
Step 5: Testing and Evaluation
Step 6: Deployment

Understanding and preparing the data (Steps 2-3) account for ~85% of total project time.
The process is highly repetitive and experimental (DM: art versus science?)

Generic tasks within each CRISP-DM phase:

Business Understanding: Determine Business Objectives; Assess Situation; Determine Data Mining Goals; Produce Project Plan
Data Understanding: Collect Initial Data; Describe Data; Explore Data; Verify Data Quality
Data Preparation: Select Data; Clean Data; Construct Data; Integrate Data; Format Data
Modeling: Select Modeling Technique; Generate Test Design; Build Model; Assess Model
Evaluation: Evaluate Results; Review Process; Determine Next Steps
Deployment: Plan Deployment; Plan Monitoring & Maintenance; Produce Final Report; Review Project

The CRISP-DM reference model


Business understanding
focuses on understanding the project
objectives and requirements from a
business perspective, then converting this
knowledge into a data mining problem
definition and a preliminary plan designed
to achieve the objectives

180

The CRISP-DM reference model


Data understanding
starts with an initial data collection and
proceeds with activities in order to get
familiar with the data, to identify data
quality problems, to discover first insights
into the data or to detect interesting
subsets to form hypotheses for hidden
information.

The CRISP-DM reference model


Data preparation
covers all activities to construct the final
dataset from the initial raw data. Data
preparation tasks are likely to be
performed multiple times and not in any
prescribed order. Tasks include table,
record and attribute selection as well as
transformation and cleaning of data for
modeling tools.


The CRISP-DM reference model


Modeling
various modeling techniques are selected
and applied and their parameters are
calibrated to optimal values. Typically,
there are several techniques for the same
data mining problem type. Some
techniques have specific requirements on
the form of data. Therefore, stepping back
to the data preparation phase is often
necessary.

The CRISP-DM reference model


Evaluation
thoroughly evaluate the model and review
the steps executed to construct the model
to be certain it properly achieves the
business objectives. A key objective is to
determine if there is some important
business issue that has not been
sufficiently considered. At the end of this
phase, a decision on the use of the data
mining results should be reached.

The CRISP-DM reference model


Deployment
the knowledge gained will need to be
organized and presented in a way that the
customer can use it. However, depending
on the requirements, the deployment
phase can be as simple as generating a
report or as complex as implementing a
repeatable data mining process across the
enterprise.

Business understanding

[Figure: tasks and outputs of the Business Understanding phase, within the
sequence Business Understanding, Data Understanding, Data Preparation,
Modeling, Evaluation, Deployment]

Determine Business Objectives: Background; Business Objectives; Business Success Criteria
Assess Situation: Inventory of Resources; Requirements, Assumptions & Constraints; Risks and Contingencies; Terminology; Costs and Benefits
Determine Data Mining Goals: Data Mining Goals; Data Mining Success Criteria
Produce Project Plan: Project Plan; Initial Assessment of Tools and Techniques

Business understanding
1. Determine business objectives
Thoroughly understand, from a business perspective, what the client really wants to accomplish. Often the client has many competing objectives and constraints that must be properly balanced. The analyst's goal is to uncover, at the beginning, the important factors that can influence the outcome of the project.

Business understanding
2. Assess situation
Detailed fact-finding about all of the resources, constraints, assumptions and other factors that should be considered in determining the data analysis goal and project plan.

Business understanding
3. Determine data mining goals
A business goal states objectives in business terminology. A data mining goal states project objectives in technical terms.
4. Produce project plan
Describe the intended plan for achieving the data mining goals and thereby achieving the business goals.

Data understanding

[Figure: tasks and outputs of the Data Understanding phase]

Collect Initial Data: Initial Data Collection Report
Describe Data: Data Description Report
Explore Data: Data Exploration Report
Verify Data Quality: Data Quality Report

Data understanding
1. Collect initial data
Acquire within the project the data (or access to the data) listed in the project resources. This initial collection includes data loading, if necessary for data understanding.
2. Describe data
Examine the gross or surface properties of the acquired data and report on the results.

Data understanding
3. Explore data
This task tackles the data mining questions, which can be addressed using querying, visualization and reporting. These analyses may address the data mining goals directly; they may also contribute to or refine the data description and quality reports, and feed into the transformation and other data preparation needed for further analysis.
4. Verify data quality
Examine the quality of the data.
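A minimal pandas sketch of the explore and verify tasks; the file name and columns are hypothetical.

```python
# A minimal exploration and data-quality check with pandas (hypothetical file).
import pandas as pd

df = pd.read_csv("customers.csv")      # hypothetical acquired data

print(df.describe(include="all"))      # surface properties: counts, ranges
print(df.dtypes)                       # do the types match expectations?

# Inputs to a data quality report: missing values and duplicate records
print(df.isna().sum())                 # missing values per attribute
print("duplicate rows:", df.duplicated().sum())
```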

Data preparation

[Figure: tasks and outputs of the Data Preparation phase]

Select Data: Rationale for Inclusion/Exclusion
Clean Data: Data Cleaning Report
Construct Data: Derived Attributes; Generated Records
Integrate Data: Merged Data
Format Data: Reformatted Data
Phase outputs: Data Set; Data Set Description

Data preparation
1. Select data
Decide on the data to be used for analysis. Criteria include relevance to the data mining goals, quality, and technical constraints.

Data preparation
2. Clean data
Raise the data quality to the level required by the selected analysis techniques. This may involve selecting clean subsets of the data, inserting suitable defaults, or more ambitious techniques such as the estimation of missing data by modeling.
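A minimal cleaning sketch over the same hypothetical dataset: default insertion and simple imputation (model-based estimation would be a more ambitious variant).

```python
# A minimal data-cleaning sketch with pandas (columns are hypothetical).
import pandas as pd

df = pd.read_csv("customers.csv")                          # hypothetical input

df = df.drop_duplicates()                                  # clean subset
df["region"] = df["region"].fillna("Unknown")              # suitable default
df["income"] = df["income"].fillna(df["income"].median())  # impute numeric
df = df[df["age"].between(0, 120)]                         # drop implausible values
```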

Data preparation
3. Construct data
This task includes constructive data preparation operations such as the production of derived attributes, entire new records, or transformed values for existing attributes.
4. Integrate data
Information is combined from multiple tables or records to create new records or values.
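A minimal construct/integrate sketch; the tables, keys and derived attribute are hypothetical.

```python
# Constructing a derived attribute and integrating two tables with pandas.
import pandas as pd

customers = pd.read_csv("customers.csv")   # hypothetical tables
orders = pd.read_csv("orders.csv")

# Construct data: derive a new attribute from existing ones
customers["income_per_member"] = customers["income"] / customers["household_size"]

# Integrate data: aggregate orders per customer, then merge into one table
spend = orders.groupby("customer_id", as_index=False)["amount"].sum()
merged = customers.merge(spend, on="customer_id", how="left")
```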

Data preparation
5. Format data
Formatting transformations refer to primarily syntactic modifications made to the data that do not change its meaning, but might be required by the modeling tool.

Modeling

[Figure: tasks and outputs of the Modeling phase]

Select Modeling Technique: Modeling Technique; Modeling Assumptions
Generate Test Design: Test Design
Build Model: Parameter Settings; Models; Model Description
Assess Model: Model Assessment; Revised Parameter Settings

Modeling
1. Select modeling technique
Select the actual modeling technique that is to be used.
2. Generate test design
Before we actually build a model, we need to generate a procedure or mechanism to test the model's quality and validity.
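A minimal test-design sketch: hold out part of the prepared data before any model is built. The dataset, feature and target names are hypothetical.

```python
# Generate a simple test design: a stratified 70/30 train/test split.
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("movies_prepared.csv")    # hypothetical prepared dataset
X = df.drop(columns=["success_class"])     # hypothetical predictors
y = df["success_class"]                    # hypothetical target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42)
```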

Modeling
3. Build model
Run the modeling tool on the prepared dataset to create one or more models.
4. Assess model
The data mining engineer ranks the models, assessing them according to the evaluation criteria. As far as possible, he also takes into account the business objectives and business success criteria, and compares all results against the evaluation criteria.
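Continuing the split above, a minimal build-and-assess sketch; the choice of a decision tree and of accuracy as the criterion are illustrative assumptions (numeric features assumed).

```python
# Build a model on the training partition, then assess it on held-out data.
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, classification_report

model = DecisionTreeClassifier(max_depth=5, random_state=42)
model.fit(X_train, y_train)                # build model

y_pred = model.predict(X_test)             # assess model on the test design
print("accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))
```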

Evaluation

[Figure: tasks and outputs of the Evaluation phase]

Evaluate Results: Assessment of Data Mining Results; Approved Models
Review Process: Review of Process
Determine Next Steps: List of Possible Actions; Decision

Evaluation
1. Evaluate results
This step assesses the degree to which the model meets the business objectives and seeks to determine if there is some business reason why this model is deficient. Another option is to test the model(s) on test applications in the real application, if time and budget constraints permit.

Evaluation
2. Review process
Do a more thorough review of the data mining engagement in order to determine if there is any important factor or task that has somehow been overlooked. This review also covers quality assurance issues, e.g., did we correctly build the model?

Evaluation
3. Determine next steps
According to the assessment results and the process review, the project decides how to proceed at this stage. This task includes analyses of the remaining resources and budget that influence the decisions.

Deployment

[Figure: tasks and outputs of the Deployment phase]

Plan Deployment: Deployment Plan
Plan Monitoring & Maintenance: Monitoring & Maintenance Plan
Produce Final Report: Final Report; Final Presentation
Review Project: Experience Documentation

Deployment
1. Plan deployment
This task takes the evaluation results and determines a strategy for deployment. If a general procedure has been identified to create the relevant model(s), this procedure is documented here for later deployment.

Deployment
2 Plan monitoring and maintenance
In order to monitor the deployment of the
data mining result(s), the project needs a
detailed plan on the monitoring process.
This plan takes into account the specific
type of deployment.


Deployment
3 Produce final report
At the end of the project, the project leader
and his team write up a final report.
Depending on the deployment plan, this
report may be only a summary of the
project and its experiences or it may be a
final and comprehensive presentation of
the data mining result(s).

Deployment
4 Review project
Assess what went right and what went
wrong, what was done well and what
needs to be improved.


The CRISP-DM user guide

The user guide gives more detailed tips and hints for each phase and each task within a phase, and depicts how to do a data mining project. It provides the activities involved with each generic task within a phase.

Data Mining Process: SEMMA

Sample: generate a representative sample of the data
Explore: visualization and basic description of the data
Modify: select variables, transform variable representations
Model: use a variety of statistical and machine learning models
Assess: evaluate the accuracy and usefulness of the models

[Figure: the SEMMA cycle: Sample, Explore, Modify, Model, Assess]

Data Preparation: A Critical DM Task

Real-world data is turned into well-formed data through four steps (a pipeline sketch follows below):

Data Consolidation: collect data; select data; integrate data
Data Cleaning: impute missing values; reduce noise in data; eliminate inconsistencies
Data Transformation: normalize data; discretize/aggregate data; construct new attributes
Data Reduction: reduce number of variables; reduce number of cases; balance skewed data
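A minimal end-to-end sketch of the four steps; the files, columns, and thresholds are hypothetical assumptions.

```python
# Consolidation -> cleaning -> transformation -> reduction with pandas.
import pandas as pd

# Data consolidation: collect, select and integrate
a = pd.read_csv("transactions.csv")
b = pd.read_csv("customers.csv")
df = a.merge(b, on="customer_id")[["customer_id", "amount", "age", "segment"]]

# Data cleaning: impute missing values, eliminate inconsistencies
df["age"] = df["age"].fillna(df["age"].median())
df = df[df["amount"] >= 0]                   # drop inconsistent records

# Data transformation: normalize and discretize
df["amount_norm"] = (df["amount"] - df["amount"].mean()) / df["amount"].std()
df["age_band"] = pd.cut(df["age"], bins=[0, 25, 45, 65, 120])

# Data reduction: fewer variables and fewer cases
well_formed = df.drop(columns=["amount"]).sample(frac=0.5, random_state=1)
```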

Q & A
