
Business Analytics

By
Dr. Atanu Rakshit
Email: atanu.rakshit@iimrohtak.ac.in
atanu.raks@gmail.com

Business Analytics
Text Book:
Business Intelligence: A Managerial Approach by Efraim Turban, Ramesh Sharda, Dursun Delen and David King, 2/e, Pearson, 2012

Reference Material:
Business Analytics for Managers by Gert H. N. Laursen and Jesper Thorlund, Wiley, 2010

Business Analytics
Reference Material:
Decision Support and Business Intelligence Systems by Efraim Turban, Ramesh Sharda and Dursun Delen, 9/e, Pearson, 2012
Business Intelligence Strategy: A Practical Guide for Achieving BI Excellence by John Boyer, Bill Frank, Brian Green and Tracy Harris, MC Press, 2010

Business Analytics
Sessions Plan
Introduction to Business Analytics
Data Warehousing
Data Mining for Business Intelligence
Business Analytics Model
Business Analytics at the Analytical Level
Business Analytics at the Strategic Level
Business Analytics at the Functional Level
Business Performance Management
Big Data Analytics
Project Presentation

Business Analytics

Introduction to Data Mining

Learning Objectives
Define data mining as an enabling technology for business intelligence
Understand the objectives and benefits of business analytics and data mining
Recognize the wide range of applications of data mining
Learn the standardized data mining processes
CRISP-DM
SEMMA
KDD

Learning Objectives
Understand the steps involved in data preprocessing for data mining
Learn different methods and algorithms of data mining
Build awareness of the existing data mining software tools
Commercial versus free/open source

Understand the pitfalls and myths of data mining

Opening Vignette
Data Mining Goes to Hollywood!
Decision situation
Problem
Proposed solution
Results
Answer & discuss the case questions

Opening Vignette:
Data Mining Goes to Hollywood!

A typical classification problem: predict which of nine box-office classes a movie will fall into. The dependent variable is the box-office receipt class; the independent variables describe the movie before release.

Class No.   Range (in $Millions)
1           < 1 (Flop)
2           > 1 and < 10
3           > 10 and < 20
4           > 20 and < 40
5           > 40 and < 65
6           > 65 and < 100
7           > 100 and < 150
8           > 150 and < 200
9           > 200 (Blockbuster)

Independent Variable   Number of Possible Values   Values
MPAA Rating            5                           G, PG, PG-13, R, NR
Competition            3                           High, Medium, Low
Star value             3                           High, Medium, Low
Genre                  10                          Sci-Fi, Historic Epic Drama, Modern Drama, Politically Related, Thriller, Horror, Comedy, Cartoon, Action, Documentary
Special effects        3                           High, Medium, Low
Sequel                 2                           Yes, No
Number of screens      -                           Positive integer

Opening Vignette:
Data Mining Goes to Hollywood!

[Figure: the DM process map in IBM SPSS Modeler, showing the model development process and the model assessment process]

Opening Vignette:
Data Mining Goes to Hollywood!

Prediction results of the individual and ensemble models*:

                        Individual Models          Ensemble Models
Performance Measure     SVM     ANN     C&RT     Random Forest   Boosted Tree   Fusion (Average)
Count (Bingo)           192     182     140      189             187            194
Count (1-Away)          104     120     126      121             104            120
Accuracy (% Bingo)      55.49%  52.60%  40.46%   54.62%          54.05%         56.07%
Accuracy (% 1-Away)     85.55%  87.28%  76.88%   89.60%          84.10%         90.75%
Standard deviation      0.93    0.87    1.05     0.76            0.84           0.63

* Training set: 1998-2005 movies; Test set: 2006 movies

Data Mining Concepts and Definitions

Why Data Mining?
More intense competition at the global scale
Recognition of the value in data sources
Availability of quality data on customers, vendors, transactions, Web, etc.
Consolidation and integration of data repositories into data warehouses
The exponential increase in data processing and storage capabilities, and the decrease in cost
Movement toward conversion of information resources into nonphysical form

Definition of Data Mining

"The nontrivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data stored in structured databases"
- Fayyad et al. (1996)

Keywords in this definition: Process, nontrivial, valid, novel, potentially useful, understandable
Data mining: a misnomer?
Other names: knowledge extraction, pattern analysis, knowledge discovery, information harvesting, pattern searching, data dredging

Data Mining at the Intersection of Many Disciplines

[Figure: data mining at the intersection of many disciplines: Statistics, Artificial Intelligence, Pattern Recognition, Mathematical Modeling, Machine Learning, Databases, and Management Science & Information Systems]

Data Mining Characteristics/Objectives

The source of data for DM is often a consolidated data warehouse (not always!).
The DM environment is usually a client-server or a Web-based information systems architecture.
Data is the most critical ingredient for DM, and it may include soft/unstructured data.
The miner is often an end user.
Striking it rich requires creative thinking.
Data mining tools' capabilities and ease of use are essential (Web, parallel processing, etc.).

Data in Data Mining

Data: a collection of facts usually obtained as the result of experiences, observations, or experiments
Data may consist of numbers, words, and images
Data: the lowest level of abstraction (from which information and knowledge are derived)

Taxonomy of data: Data divides into Categorical (Nominal, Ordinal) and Numerical (Interval, Ratio)
- DM with different data types?
- Other data types?

Data Mining: An Overview

The process of seeking relationships within a data set, that is, of seeking accurate, convenient and useful summary representations of some aspect of the data, involves a number of steps:
Determining the nature and structure of the representation to be used
Deciding how to quantify and compare how well different representations fit the data (that is, choosing a score function)
Choosing an algorithmic process to optimize the score function
Deciding what principles of data management are required to implement the algorithms efficiently

Data Mining Development

Relational Data Model
SQL
Association Rule Algorithms
Data Warehousing
Scalability Techniques
Similarity Measures
Hierarchical Clustering
IR Systems
Imprecise Queries
Textual Data
Web Search Engines
Bayes Theorem
Regression Analysis
EM Algorithm
K-Means Clustering
Time Series Analysis
Algorithm Design Techniques
Algorithm Analysis
Data Structures
Neural Networks
Decision Tree Algorithms

What Does DM Do? How Does It Work?

DM extracts patterns from data
Pattern? A mathematical (numeric and/or symbolic) relationship among data items

Types of patterns:
Association
Prediction
Cluster (segmentation)
Sequential (or time series) relationships

A Taxonomy for Data Mining Tasks

Data Mining Task    Learning Method   Popular Algorithms
Prediction          Supervised        Classification and Regression Trees, ANN, SVM, Genetic Algorithms
Classification      Supervised        Decision trees, ANN/MLP, SVM, Rough sets, Genetic Algorithms
Regression          Supervised        Linear/Nonlinear Regression, Regression trees, ANN/MLP, SVM
Association         Unsupervised      Apriori, OneR, ZeroR, Eclat
Link analysis       Unsupervised      Expectation Maximization, Apriori Algorithm, Graph-based Matching
Sequence analysis   Unsupervised      Apriori Algorithm, FP-Growth technique
Clustering          Unsupervised      K-means, ANN/SOM
Outlier analysis    Unsupervised      K-means, Expectation Maximization (EM)

Other Data Mining Tasks

These are in addition to the primary DM tasks (prediction, association, clustering):
Time-series forecasting
Part of sequence or link analysis?
Visualization
Another data mining task?

Types of DM
Hypothesis-driven data mining
Discovery-driven data mining

Hypothesis vs Discovery
Traditional analysis is via verification-driven analysis
Requires a hypothesis of the desired information (target)
Requires correct interpretation of the proposed query

Discovery-driven data mining
Finds data with common characteristics
Results are ideal solutions to discovery
Finds results without a previous hypothesis

Data Mining Applications

Customer Relationship Management
Maximize return on marketing campaigns
Improve customer retention (churn analysis)
Maximize customer value (cross- or up-selling)
Identify and treat most valued customers

Banking & Other Financial
Automate the loan application process
Detect fraudulent transactions
Maximize customer value (cross- and up-selling)
Optimize cash reserves with forecasting

Data Mining Applications (cont.)

Retailing and Logistics
Optimize inventory levels at different locations
Improve the store layout and sales promotions
Optimize logistics by predicting seasonal effects
Minimize losses due to limited shelf life

Manufacturing and Maintenance
Predict/prevent machinery failures
Identify anomalies in production systems to optimize manufacturing capacity
Discover novel patterns to improve product quality

Data Mining Applications (cont.)

Brokerage and Securities Trading
Predict changes in certain bond prices
Forecast the direction of stock fluctuations
Assess the effect of events on market movements
Identify and prevent fraudulent activities in trading

Insurance
Forecast claim costs for better business planning
Determine optimal rate plans
Optimize marketing to specific customers
Identify and prevent fraudulent claim activities

Data Mining Applications (cont.)

Other highly popular application areas for data mining:
Computer hardware and software
Science and engineering
Government and defense
Homeland security and law enforcement
Travel industry
Healthcare
Medicine
Entertainment industry
Sports
Etc.

Data Mining Process

A manifestation of best practices
A systematic way to conduct DM projects
Different groups have different versions
Most common standard processes:
CRISP-DM (Cross-Industry Standard Process for Data Mining)
SEMMA (Sample, Explore, Modify, Model, and Assess)
KDD (Knowledge Discovery in Databases)

Data Mining Process

[Figure: poll on data mining methodologies in use; Source: KDNuggets.com, August 2007]

Data Mining: An Example
Regression analysis is a tool which involves building a predictive model to relate a predictor variable, X, to a response variable, Y, through a relationship of the form Y = aX + b. For example, one might build a model which would allow us to predict a person's annual credit-card spending given their annual income. Clearly the model would not be perfect, but since spending typically increases with income, the model might well be adequate as a rough characterization.
For this example, one would have the following scenario:

Data Mining: An Example

The representation is a model in which the response variable, spending, is linearly related to the predictor variable, income
The score function most commonly used in this situation is the sum of squared discrepancies between the predicted spending from the model and the observed spending in the group of people described by the data. The smaller this sum is, the better the model fits the data

Data Mining: An Example

The optimization algorithm is quite simple in the case of linear regression: a and b can be expressed as explicit functions of the observed values of spending and income
Unless the data set is very large, few data management problems arise with regression algorithms. Simple summaries of the data (the sums, sums of squares, and sums of products of the X and Y values) are sufficient to compute estimates of a and b. This means that a single pass through the data will yield the estimates
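The single-pass idea is easy to see in code. A minimal Python sketch (our own variable names, not from the text), collecting the running sums and then solving for a and b:

def fit_line(xs, ys):
    # One pass over (income, spending) pairs collects the sums needed
    # for the least-squares estimates of a and b in Y = aX + b.
    n = sum_x = sum_y = sum_xx = sum_xy = 0.0
    for x, y in zip(xs, ys):
        n += 1
        sum_x += x
        sum_y += y
        sum_xx += x * x
        sum_xy += x * y
    # Closed-form least-squares solution expressed in the collected sums
    a = (n * sum_xy - sum_x * sum_y) / (n * sum_xx - sum_x ** 2)
    b = (sum_y - a * sum_x) / n
    return a, b

For instance, fit_line([30, 50, 70], [2.1, 3.4, 4.8]) returns the slope and intercept of the fitted spending-versus-income line.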

Data Mining Tasks

Exploratory Data Analysis (EDA)
Descriptive Modeling
Predictive Modeling
Discovering Patterns and Rules
Retrieval by Content

Exploratory Data Analysis (EDA)

The goal here is to explore the data without any clear ideas of what we are looking for
Typically, EDA techniques are interactive and visual, and there are many effective graphical display methods for relatively small, low-dimensional data sets
Some examples of EDA applications are:
Pie charts

Descriptive Modeling
The goal of a descriptive model is to describe all of the data (or the process generating the data)
Examples of such descriptions include models for the overall probability distribution of the data (density estimation), partitioning of the p-dimensional space into groups (cluster analysis, segmentation) and models describing the relationship between variables (dependency modeling)

Predictive Modeling
The aim is to build a model that will permit the value of one variable to be predicted from the known values of other variables
In classification, the variable being predicted is categorical, while in regression the variable is quantitative

Components of Data Mining Algorithms

Data mining algorithms have four basic components:
Model or Pattern Structure: Determining the underlying structure or functional form that we seek from the data
Score Function: Judging the quality of a fitted model
Optimization and Search Method: Optimizing the score function and searching over different model and pattern structures
Data Management Strategy: Handling data access efficiently during the search/optimization

Associations or Basket Analysis

A huge amount of data is stored electronically in all retail outlets due to the barcoding of all goods sold.
It is natural to try to find useful information in these mountains of data.
A conceptually simple yet interesting example is to find association rules in these large databases.

Associations or Basket Analysis

Association rules mining, or market basket analysis, searches for interesting customer habits by looking at associations.
The classical example is the one where a store was reported to have discovered that people buying nappies tend also to buy beer.
Applications in marketing, store layout, customer segmentation, medicine, finance, etc.

Basics
Given a set of transactions {T}, each containing a subset of items from an item set {i1, i2, ..., im}, we want to discover association relationships or correlations among a set of items.
We want to find groups of items that tend to occur together.
Association rules are often written as X => Y, meaning that whenever X appears, Y also tends to appear. X and Y may be single items or sets of items (the same item not appearing in both).

Market-Basket Model
Large Sets
Items A = {A1, A2, ..., Am}
e.g., products sold in a supermarket
Baskets B = {B1, B2, ..., Bn}
small subsets of items in A
e.g., items bought by a customer in one transaction
Support: sup(X) = number of baskets containing itemset X

Frequent Itemset Problem
Given a support threshold s
Frequent itemsets: sup(X) >= s
Find all frequent itemsets

Example
Items A = {milk, coke, pepsi, beer, juice}
Baskets:
B1 = {m, c, b}      B2 = {m, p, j}
B3 = {m, b}         B4 = {c, j}
B5 = {m, p, b}      B6 = {m, c, b, j}
B7 = {c, b, j}      B8 = {b, c}

Support threshold s = 3

Frequent itemsets:
{m}, {c}, {b}, {j}, {m, b}, {c, b}, {j, c}
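These supports can be checked by brute force. A small Python sketch (standard library only; the names are ours):

from itertools import combinations

baskets = [{'m','c','b'}, {'m','p','j'}, {'m','b'}, {'c','j'},
           {'m','p','b'}, {'m','c','b','j'}, {'c','b','j'}, {'b','c'}]
items = {'m', 'c', 'p', 'b', 'j'}
s = 3  # support threshold

# Enumerate every candidate itemset and count the baskets containing it
for k in range(1, len(items) + 1):
    for cand in combinations(sorted(items), k):
        sup = sum(1 for b in baskets if set(cand) <= b)
        if sup >= s:
            print(set(cand), sup)

This prints exactly the frequent itemsets listed above, together with their supports.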

Application 1 (Retail Stores)

Real market baskets
Chain stores keep TBs of customer purchase info
Value?
How typical customers navigate stores
Positioning tempting items
Suggests tie-in tricks, e.g., a hamburger sale while raising the ketchup price

High support needed, or no $$s

Association Rules
If-then rules about basket contents:
{A1, A2, ..., Ak} => Aj
if a basket has X = {A1, ..., Ak}, then it is likely to have Aj

Support (of the rule):
sup(X => Aj) = sup(X + Aj)

Confidence: the probability of Aj given A1, ..., Ak:
conf(X => Aj) = sup(X + Aj) / sup(X)

Finding Association Rules

Goal: find all association rules such that
support >= s
confidence >= c

Reduction to the Frequent Itemsets Problem:
Find all frequent itemsets X
Given X = {A1, ..., Ak}, generate all rules X - Aj => Aj
Support = sup(X)
Confidence = sup(X) / sup(X - Aj)
Observe: X - Aj is also frequent, so its support is already known

Example
B1 = {m, c, b}      B2 = {m, p, j}
B3 = {m, b}         B4 = {c, j}
B5 = {m, p, b}      B6 = {m, c, b, j}
B7 = {c, b, j}      B8 = {b, c}

Association Rule: {m, b} => c
Support = 2
Confidence = 2/4 = 50%

Applications
Marketing
Cross marketing
Attached mailing
Catalogue design
Cross-sell
Up-sell
Store layout
Promotion (segmentation)

Applications
Medicine
A patient submits their symptoms and undergoes multiple tests; the doctor diagnoses the problem and prescribes medicines.
Medical conditions => recommend minimal tests
Computer-Aided Detection

Applications
Sports
Successful movement patterns
Preferred skill set based on the total environmental condition
Agriculture
Remotely sensed imagery data of a field, to associate attributes of the location with the crop yield in that location
Predicting pest attacks
Weather conditions with the crop yield
Usage of various fertilizers and pesticides with crop yield

Application 2 (Information Retrieval)

Scenario 1
baskets = documents
items = words in documents
frequent word-groups = linked concepts

Scenario 2
items = sentences
baskets = documents containing sentences
frequent sentence-groups = possible plagiarism

Application 3 (Web Search)

Scenario 1
baskets = web pages
items = outgoing links
pages with similar references => about the same topic

Scenario 2
baskets = web pages
items = incoming links
pages with similar in-links => mirrors, or the same topic

Terminology
We assume that we have a set of transactions, each transaction being a list of items (e.g., books)
Suppose X and Y appear together in only 1% of the transactions, but whenever X appears there is an 80% chance that Y also appears
The 1% presence of X and Y together is called the support (or prevalence) of the rule, and 80% is called the confidence (or predictability) of the rule
These are measures of the interestingness of the rule

Terminology
The support for X => Y is the probability of both X and Y appearing together, that is P(X U Y)
The confidence of X => Y is the conditional probability of Y appearing given that X exists, that is:
P(Y | X) = P(X U Y) / P(X)
Confidence denotes the strength of the association. Support indicates the frequency of the pattern. A minimum support is necessary if an association is going to be of some business value.

Association Rule Mining

A very popular DM method in business
Finds interesting relationships (affinities) between variables (items or events)
Part of the machine learning family
Employs unsupervised learning
There is no output variable
Also known as market basket analysis
Often used as an example to describe DM to ordinary people, such as the famous relationship between diapers and beers!

Association Rule Mining

Input: the simple point-of-sale transaction data
Output: the most frequent affinities among items
Example: according to the transaction data, "customers who bought a laptop computer and virus protection software also bought an extended service plan 70 percent of the time"
How do you use such a pattern/knowledge?
Put the items next to each other for ease of finding
Promote the items as a package (do not put one on sale if the other(s) are on sale)
Place the items far apart from each other so that the customer has to walk the aisles to search for them, and by doing so potentially sees and buys other items

Association Rule Mining

Representative applications of association rule mining include:
In business: cross-marketing, cross-selling, store design, catalog design, e-commerce site design, optimization of online advertising, product pricing, and sales/promotion configuration
In medicine: relationships between symptoms and illnesses; diagnosis and patient characteristics and treatments (to be used in medical DSS); and genes and their functions (to be used in genomics projects)

Association Rule Mining

Are all association rules interesting and useful?

A Generic Rule: X => Y [S%, C%]
X, Y: products and/or services
X: Left-hand side (LHS)
Y: Right-hand side (RHS)
S: Support: how often X and Y go together
C: Confidence: how often Y goes together with X

Example: {Laptop Computer, Antivirus Software} => {Extended Service Plan} [30%, 70%]

Association Rule Mining

Given a set of transactions, find rules that will predict the occurrence of an item based on the occurrences of other items in the transaction.

Market-Basket transactions:
TID   Items
1     Bread, Milk
2     Bread, Diaper, Beer, Eggs
3     Milk, Diaper, Beer, Coke
4     Bread, Milk, Diaper, Beer
5     Bread, Milk, Diaper, Coke

Examples of Association Rules:
{Diaper} => {Beer}
{Milk, Bread} => {Eggs, Coke}
{Beer, Bread} => {Milk}

Implication means co-occurrence, not causality!

Definition: Frequent Itemset

Itemset
A collection of one or more items
Example: {Milk, Bread, Diaper}
k-itemset
An itemset that contains k items
Support count (σ)
Frequency of occurrence of an itemset
E.g. σ({Milk, Bread, Diaper}) = 2
Support (s)
Fraction of transactions that contain an itemset
E.g. s({Milk, Bread, Diaper}) = 2/5
Frequent Itemset
An itemset whose support is greater than or equal to a minsup threshold

TID   Items
1     Bread, Milk
2     Bread, Diaper, Beer, Eggs
3     Milk, Diaper, Beer, Coke
4     Bread, Milk, Diaper, Beer
5     Bread, Milk, Diaper, Coke

Definition: Association Rule

Association Rule
An implication expression of the form X => Y, where X and Y are itemsets
Example: {Milk, Diaper} => {Beer}

Rule Evaluation Metrics
Support (s)
Fraction of transactions that contain both X and Y
Confidence (c)
Measures how often items in Y appear in transactions that contain X

TID   Items
1     Bread, Milk
2     Bread, Diaper, Beer, Eggs
3     Milk, Diaper, Beer, Coke
4     Bread, Milk, Diaper, Beer
5     Bread, Milk, Diaper, Coke

Example: {Milk, Diaper} => {Beer}
s = σ(Milk, Diaper, Beer) / |T| = 2/5 = 0.4
c = σ(Milk, Diaper, Beer) / σ(Milk, Diaper) = 2/3 = 0.67
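Both metrics are easy to verify in code. A Python sketch over the five transactions above (the helper name is ours):

transactions = [
    {'Bread', 'Milk'},
    {'Bread', 'Diaper', 'Beer', 'Eggs'},
    {'Milk', 'Diaper', 'Beer', 'Coke'},
    {'Bread', 'Milk', 'Diaper', 'Beer'},
    {'Bread', 'Milk', 'Diaper', 'Coke'},
]

def sigma(itemset):
    # Support count: number of transactions containing the itemset
    return sum(1 for t in transactions if itemset <= t)

X, Y = {'Milk', 'Diaper'}, {'Beer'}
s = sigma(X | Y) / len(transactions)   # 2/5 = 0.4
c = sigma(X | Y) / sigma(X)            # 2/3 = 0.67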

Association Rule Mining Task

Given a set of transactions T, the goal of association rule mining is to find all rules having
support >= minsup threshold
confidence >= minconf threshold
Brute-force approach:
List all possible association rules
Compute the support and confidence for each rule
Prune rules that fail the minsup and minconf thresholds
=> Computationally prohibitive!

Effect of Support Distribution

How to set the appropriate minsup threshold?
If minsup is set too high, we could miss itemsets involving interesting rare items (e.g., expensive products)
If minsup is set too low, it is computationally expensive and the number of itemsets is very large
Using a single minimum support threshold may not be effective

Multiple Minimum Support

How to apply multiple minimum supports?
MS(i): minimum support for item i
e.g.: MS(Milk) = 5%, MS(Coke) = 3%, MS(Broccoli) = 0.1%, MS(Salmon) = 0.5%
MS({Milk, Broccoli}) = min(MS(Milk), MS(Broccoli)) = 0.1%
Challenge: support is no longer anti-monotone
Suppose Support(Milk, Coke) = 1.5% and Support(Milk, Coke, Broccoli) = 0.5%
Then {Milk, Coke} is infrequent but {Milk, Coke, Broccoli} is frequent

Multiple Minimum Support

Item   MS(I)    Sup(I)
A      0.10%    0.25%
B      0.20%    0.26%
C      0.30%    0.29%
D      0.50%    0.05%
E      3%       4.20%

[Figure: the itemset lattice over items A-E (AB, AC, ..., ABC, ..., CDE), illustrating which itemsets are candidates under item-specific minimum supports]
Multiple Minimum Support

[Figure: the same item table and lattice, with the itemsets that remain frequent under the item-specific thresholds highlighted]

Multiple Minimum Support (Liu 1999)

Order the items according to their minimum support (in ascending order)
e.g.: MS(Milk) = 5%, MS(Coke) = 3%, MS(Broccoli) = 0.1%, MS(Salmon) = 0.5%
Ordering: Broccoli, Salmon, Coke, Milk
Need to modify Apriori such that:
L1: set of frequent items
F1: set of items whose support is >= MS(1), where MS(1) is min_i(MS(i))
C2: candidate itemsets of size 2 are generated from F1 instead of L1

Mining Association Rules

TID   Items
1     Bread, Milk
2     Bread, Diaper, Beer, Eggs
3     Milk, Diaper, Beer, Coke
4     Bread, Milk, Diaper, Beer
5     Bread, Milk, Diaper, Coke

Example of Rules:
{Milk, Diaper} => {Beer} (s=0.4, c=0.67)
{Milk, Beer} => {Diaper} (s=0.4, c=1.0)
{Diaper, Beer} => {Milk} (s=0.4, c=0.67)
{Beer} => {Milk, Diaper} (s=0.4, c=0.67)
{Diaper} => {Milk, Beer} (s=0.4, c=0.5)
{Milk} => {Diaper, Beer} (s=0.4, c=0.5)

Observations:
All the above rules are binary partitions of the same itemset: {Milk, Diaper, Beer}
Rules originating from the same itemset have identical support but can have different confidence
Thus, we may decouple the support and confidence requirements

Mining Association Rules

Two-step approach:
1. Frequent Itemset Generation
Generate all itemsets whose support >= minsup
2. Rule Generation
Generate high-confidence rules from each frequent itemset, where each rule is a binary partitioning of a frequent itemset

Frequent itemset generation is still computationally expensive

Frequent Itemset Generation

[Figure: the itemset lattice for items A-E, from the null set through 1-itemsets, 2-itemsets (AB, AC, ...), up to ABCDE]

Given d items, there are 2^d possible candidate itemsets

Frequent Itemset Generation

Brute-force approach:
Each itemset in the lattice is a candidate frequent itemset
Count the support of each candidate by scanning the database
Match each transaction against every candidate

TID   Items
1     Bread, Milk
2     Bread, Diaper, Beer, Eggs
3     Milk, Diaper, Beer, Coke
4     Bread, Milk, Diaper, Beer
5     Bread, Milk, Diaper, Coke

Complexity ~ O(NMw), with N transactions, M candidates and maximum transaction width w => expensive, since M = 2^d !!!

Computational Complexity
Given d unique items:
Total number of itemsets = 2^d
Total number of possible association rules:

R = Σ_{k=1..d-1} [ C(d,k) × Σ_{j=1..d-k} C(d-k,j) ] = 3^d - 2^(d+1) + 1

If d = 6, R = 602 rules
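The closed form can be sanity-checked in a couple of lines (a Python sketch):

from math import comb

d = 6
R = sum(comb(d, k) * sum(comb(d - k, j) for j in range(1, d - k + 1))
        for k in range(1, d))
print(R, 3**d - 2**(d + 1) + 1)   # both print 602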

Frequent Itemset Generation Strategies

Reduce the number of candidates (M)
Complete search: M = 2^d
Use pruning techniques to reduce M
Reduce the number of transactions (N)
Reduce the size of N as the size of the itemset increases
Used by DHP and vertical-based mining algorithms
Reduce the number of comparisons (NM)
Use efficient data structures to store the candidates or transactions
No need to match every candidate against every transaction

Reducing the Number of Candidates

Apriori principle:
If an itemset is frequent, then all of its subsets must also be frequent
The Apriori principle holds due to the following property of the support measure:

for all X, Y: (X ⊆ Y) => s(X) >= s(Y)

The support of an itemset never exceeds the support of its subsets
This is known as the anti-monotone property of support

Rule Generation
Given a frequent itemset L, find all non-empty subsets f ⊂ L such that f => L - f satisfies the minimum confidence requirement
If {A,B,C,D} is a frequent itemset, the candidate rules are:
ABC => D,  ABD => C,  ACD => B,  BCD => A,
A => BCD,  B => ACD,  C => ABD,  D => ABC,
AB => CD,  AC => BD,  AD => BC,  BC => AD,
BD => AC,  CD => AB

If |L| = k, then there are 2^k - 2 candidate association rules (ignoring L => ∅ and ∅ => L)

Rule Generation
How to efficiently generate rules from frequent itemsets?
In general, confidence does not have an anti-monotone property
c(ABC => D) can be larger or smaller than c(AB => D)
But the confidence of rules generated from the same itemset does have an anti-monotone property
e.g., for L = {A,B,C,D}:
c(ABC => D) >= c(AB => CD) >= c(A => BCD)
Confidence is anti-monotone w.r.t. the number of items on the RHS of the rule

Rule Generation for Apriori Algorithm

[Figure: the lattice of rules generated from {A,B,C,D}, from ABCD => {} at the top down to rules such as D => ABC. If CD => AB is found to be a low-confidence rule, the rules below it in the lattice are pruned]

Rule Generation for Apriori Algorithm

A candidate rule is generated by merging two rules that share the same prefix in the rule consequent:
join(CD => AB, BD => AC) would produce the candidate rule D => ABC
Prune rule D => ABC if its subset AD => BC does not have high confidence

Illustrating Apriori Principle

Minimum support (count) = 3
If every subset is considered: C(6,1) + C(6,2) + C(6,3) = 6 + 15 + 20 = 41 candidates
With support-based pruning: 6 + 6 + 1 = 13 candidates

Items (1-itemsets):
Item     Count
Bread    4
Coke     2
Milk     4
Beer     3
Diaper   4
Eggs     1

Pairs (2-itemsets); no need to generate candidates involving Coke or Eggs:
Itemset            Count
{Bread, Milk}      3
{Bread, Beer}      2
{Bread, Diaper}    3
{Milk, Beer}       2
{Milk, Diaper}     3
{Beer, Diaper}     3

Triplets (3-itemsets):
Itemset                  Count
{Bread, Milk, Diaper}    3

Apriori Algorithm
Method (a minimal sketch in code follows):

Let k = 1
Generate frequent itemsets of length 1
Repeat until no new frequent itemsets are identified:
Generate length (k+1) candidate itemsets from length-k frequent itemsets
Prune candidate itemsets containing subsets of length k that are infrequent
Count the support of each candidate by scanning the DB
Eliminate candidates that are infrequent, leaving only those that are frequent
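The loop maps almost line for line onto code. A toy Python implementation (our own naming, not the book's; fine for small examples, not optimized):

from itertools import combinations

def apriori(transactions, minsup):
    # Returns {frozenset(itemset): support_count} for all frequent itemsets
    counts = {}
    items = {i for t in transactions for i in t}
    level = []
    for i in sorted(items):                      # k = 1: frequent single items
        sup = sum(1 for t in transactions if i in t)
        if sup >= minsup:
            counts[frozenset([i])] = sup
            level.append(frozenset([i]))
    k = 1
    while level:
        # Generate (k+1)-candidates by merging frequent k-itemsets, then
        # prune any candidate that has an infrequent k-subset
        cands = {a | b for a in level for b in level if len(a | b) == k + 1}
        cands = {c for c in cands
                 if all(frozenset(s) in counts for s in combinations(c, k))}
        level = []
        for c in cands:                          # one DB scan per level
            sup = sum(1 for t in transactions if c <= t)
            if sup >= minsup:
                counts[c] = sup
                level.append(c)
        k += 1
    return counts

Run on the five market-basket transactions with minsup=3, it returns the nine frequent itemsets found from the thirteen candidates counted in the previous slide.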

Multiple Minimum Support (Liu 1999)

Modifications to Apriori:
In traditional Apriori,
A candidate (k+1)-itemset is generated by merging two frequent itemsets of size k
The candidate is pruned if it contains any infrequent subsets of size k
The pruning step has to be modified:
Prune only if the subset contains the first item
e.g.: Candidate = {Broccoli, Coke, Milk} (ordered according to minimum support)
{Broccoli, Coke} and {Broccoli, Milk} are frequent but {Coke, Milk} is infrequent
The candidate is not pruned, because {Coke, Milk} does not contain the first item, i.e., Broccoli

Reducing the Number of Comparisons

Candidate counting:
Scan the database of transactions to determine the support of each candidate itemset
To reduce the number of comparisons, store the candidates in a hash structure
Instead of matching each transaction against every candidate, match it against the candidates contained in the hashed buckets

TID   Items
1     Bread, Milk
2     Bread, Diaper, Beer, Eggs
3     Milk, Diaper, Beer, Coke
4     Bread, Milk, Diaper, Beer
5     Bread, Milk, Diaper, Coke

[Figure: each transaction is matched against the hash structure of candidate buckets]

Effect of Support Distribution

Many real data sets have a skewed support distribution

[Figure: support distribution of a retail data set]

Association Rule Mining

Several algorithms are available for generating association rules:
Apriori
Eclat
FP-Growth
+ derivatives and hybrids of the three

The algorithms help identify the frequent itemsets, which are then converted to association rules

Association Rule Mining

Apriori Algorithm
Finds subsets that are common to at least a minimum number of the itemsets
Uses a bottom-up approach
Frequent subsets are extended one item at a time (the size of frequent subsets increases from one-item subsets to two-item subsets, then three-item subsets, and so on)
Groups of candidates at each level are tested against the data for minimum support
(see the figure)

Association Rule Mining

Apriori Algorithm
[Figure: raw transaction data (transaction numbers and SKUs) is summarized into one-item itemsets with their supports, then two-item itemsets, then three-item itemsets, keeping only the itemsets that meet the minimum support at each step]

Data Mining Methods: Classification

The most frequently used DM method
Part of the machine-learning family
Employs supervised learning
Learns from past data, classifies new data
The output variable is categorical (nominal or ordinal) in nature
Classification versus regression?
Classification versus clustering?

Classification task
Input: a training set of tuples, each labeled with one class label
Output: a model (classifier) that assigns a class label to each tuple based on the other attributes
The model can be used to predict the class of new tuples, for which the class label is missing or unknown

What is Classification?
Data classification is a two-step process:
First step: a model is built describing a predetermined set of data classes or concepts
Second step: the model is used for classification

Each tuple is assumed to belong to a predefined class, as determined by one of the attributes, called the class label attribute
Data tuples are also referred to as samples, examples, or objects

Estimation Methodologies for Classification

Simple split (or holdout or test sample estimation)
Split the data into 2 mutually exclusive sets: training (~70%) and testing (~30%)

[Figure: the preprocessed data is split into 2/3 training data and 1/3 testing data; the training data drives model development and produces the classifier, while the testing data drives model assessment (scoring) and yields the prediction accuracy]

For ANN, the data is split into three sub-sets (training [~60%], validation [~20%], testing [~20%])
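The holdout itself is a few lines of code. A Python sketch (shuffle before slicing; the names are ours):

import random

def holdout_split(records, train_frac=0.7, seed=42):
    # Split records into mutually exclusive training and testing sets
    shuffled = records[:]              # copy, so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_frac)
    return shuffled[:cut], shuffled[cut:]

train, test = holdout_split(list(range(100)))   # ~70 training, ~30 testing rows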

Train and test

The tuples (examples, samples) are divided into a training set + test set
The classification model is built in two steps:
training - build the model from the training set
test - check the accuracy of the model using the test set

Train and test

Kinds of models:
if-then rules
logical formulae
decision trees

Accuracy of models:
the known class of the test samples is matched against the class predicted by the model
accuracy rate = % of test set samples correctly classified by the model

Training step

[Figure: the training data is fed to a classification algorithm, which produces the classifier (model)]

Training data:
Age   Car Type   Risk
20    Combi      High
18    Sports     High
40    Sports     High
50    Family     Low
35    Minivan    Low
30    Combi      High
32    Family     Low
40    Combi      Low

Resulting classifier (model):
if age < 31 or Car Type = Sports then Risk = High

Test step

The classifier (model) is applied to test data whose class labels are known:

Age   Car Type   Risk (actual)   Risk (predicted)
27    Sports     High            High
34    Family     Low             Low
66    Family     High            Low
44    Sports     High            High

Classification (prediction)

The classifier (model) assigns a Risk label to new data:

Age   Car Type   Risk (predicted)
27    Sports     High
34    Minivan    Low
55    Family     Low
34    Sports     High

Classification vs. Prediction

There are two forms of data analysis that can be used to extract models describing data classes or to predict future data trends:
classification: predicts categorical labels
prediction: models continuous-valued functions

Comparing Classification Methods
Predictive accuracy: this refers to the ability of the model to correctly predict the class label of new or previously unseen data
Speed: this refers to the computation costs involved in generating and using the model
Robustness: this is the ability of the model to make correct predictions given noisy data or data with missing values

Comparing Classification Methods
Scalability: this refers to the ability to construct the model efficiently given large amounts of data
Interpretability: this refers to the level of understanding and insight that is provided by the model
Simplicity:
decision tree size
rule compactness

Domain-dependent quality indicators

Problem formulation
Given records in the database with class labels, find a model for each class.

Age   Car Type   Risk
20    Combi      High
18    Sports     High
40    Sports     High
50    Family     Low
35    Minivan    Low
30    Combi      High
32    Family     Low
40    Combi      Low

[Figure: decision tree: root test "Age < 31" (yes: Risk = High); otherwise test "Car Type is sports" (yes: Risk = High; no: Risk = Low)]

Classification techniques

Decision Tree Classification
Bayesian Classifiers
Neural Networks
Statistical Analysis
Genetic Algorithms
Rough Set Approach
k-nearest neighbor classifiers

Classification by Decision Tree Induction
A decision tree is a tree structure, where
each internal node denotes a test on an attribute,
each branch represents the outcome of the test,
leaf nodes represent classes or class distributions

[Figure: root test "Age < 31" (yes: Risk = High); otherwise test "Car Type is sports" (yes: Risk = High; no: Risk = Low)]

Decision Tree Induction

A decision tree is a class discriminator that recursively partitions the training set until each partition consists entirely or dominantly of examples from one class.
Each non-leaf node of the tree contains a split point, which is a test on one or more attributes and determines how the data is partitioned.

Decision Tree Induction

Basic algorithm: a greedy algorithm that constructs decision trees in a top-down recursive divide-and-conquer manner.

Many variants:
from machine learning (ID3, C4.5)
from statistics (CART: Classification And Regression Trees)
from pattern recognition (CHAID: Chi-squared Automatic Interaction Detection)

Main difference: the split criterion

Decision Tree Induction

The algorithm consists of two phases:
Build an initial tree from the training data such that each leaf node is pure
Prune this tree to increase its accuracy on test data

Tree Building
In the growth phase the tree is built by recursively partitioning the data until each partition is either "pure" (contains members of the same class) or sufficiently small.
The form of the split used to partition the data depends on the type of the attribute used in the split:
for a continuous attribute A, splits are of the form value(A) < x, where x is a value in the domain of A
for a categorical attribute A, splits are of the form value(A) ∈ X, where X ⊆ domain(A)

Tree Building Algorithm

Make Tree (Training Data T)
{
  Partition(T)
}

Partition(Data S)
{
  if (all points in S are in the same class) then
    return
  for each attribute A do
    evaluate splits on attribute A;
  use the best split found to partition S into S1 and S2
  Partition(S1)
  Partition(S2)
}

Tree Building Algorithm

While growing the tree, the goal at each node is to determine the split point that "best" divides the training records belonging to that leaf
To evaluate the goodness of a split, several splitting indices have been proposed

Split Criteria
Gini index (CART, SPRINT)
select the attribute that minimizes the impurity of a split

Information gain (ID3, C4.5)
use entropy to measure the impurity of a split
select the attribute that maximizes the entropy reduction

Chi-squared (χ2) contingency table statistics (CHAID)
measures the correlation between each attribute and the class label
select the attribute with maximal correlation

Gini index
Given a sample training set where each record represents a car-insurance applicant, we want to build a model of what makes an applicant a high or low insurance risk.

Training set:
RID   Age   Car Type   Risk
0     23    family     high
1     17    sport      high
2     43    sport      high
3     68    family     low
4     32    truck      low
5     20    family     high

The model built can be used to screen future insurance applicants by classifying them into the High or Low risk categories.

Gini index
SPRINT algorithm:

Partition(Data S) {
  if (all points in S are of the same class) then
    return;
  for each attribute A do
    evaluate splits on attribute A;
  use the best split found to partition S into S1 and S2
  Partition(S1);
  Partition(S2);
}

Initial call: Partition(Training Data)

Gini index
Definition:

gini(S) = 1 - Σ pj^2

where:
S is a data set containing examples from n classes
pj is the relative frequency of class j in S

E.g. two classes, Pos and Neg, and a dataset S with p Pos-elements and n Neg-elements:
ppos = p/(p+n)
pneg = n/(n+p)
gini(S) = 1 - ppos^2 - pneg^2

Gini index
If dataset S is split into S1 and S2, then the splitting index is defined as follows:

giniSPLIT(S) = (p1+n1)/(p+n) * gini(S1) + (p2+n2)/(p+n) * gini(S2)

where p1, n1 (p2, n2) denote the numbers of Pos-elements and Neg-elements in dataset S1 (S2), respectively.

In this definition the "best" split point is the one with the lowest value of the giniSPLIT index.
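Both definitions translate directly into code. A small Python sketch (the helper names are ours), reused in the worked example below:

def gini(labels):
    # gini(S) = 1 - sum of squared class frequencies
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def gini_split(left, right):
    # Weighted average of the Gini indices of the two partitions
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

For the Age <= 23 split evaluated below, gini_split(['high']*3, ['low', 'high', 'low']) returns 2/9.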

Example
Training set:
RID   Age   Car Type   Risk
0     23    family     high
1     17    sport      high
2     43    sport      high
3     68    family     low
4     32    truck      low
5     20    family     high

Example
Attribute list for Age (sorted on Age):
Age   RID   Risk
17    1     high
20    5     high
23    0     high
32    4     low
43    2     high
68    3     low

Attribute list for Car Type:
Car Type   RID   Risk
family     0     high
sport      1     high
sport      2     high
family     3     low
truck      4     low
family     5     high

Example
Possible values of a split point for the Age attribute are:
Age <= 17, Age <= 20, Age <= 23, Age <= 32, Age <= 43, Age <= 68

Tuple count:
            High   Low
Age <= 17   1      0
Age > 17    3      2

G(Age<=17) = 1 - (1^2 + 0^2) = 0
G(Age>17) = 1 - ((3/5)^2 + (2/5)^2) = 1 - 13/25 = 12/25
GSPLIT = (1/6) * 0 + (5/6) * (12/25) = 2/5

Example
Tuple count:
            High   Low
Age <= 20   2      0
Age > 20    2      2

G(Age<=20) = 1 - (1^2 + 0^2) = 0
G(Age>20) = 1 - ((1/2)^2 + (1/2)^2) = 1/2
GSPLIT = (2/6) * 0 + (4/6) * (1/2) = 1/3

Tuple count:
            High   Low
Age <= 23   3      0
Age > 23    1      2

G(Age<=23) = 1 - (1^2 + 0^2) = 0
G(Age>23) = 1 - ((1/3)^2 + (2/3)^2) = 1 - 1/9 - 4/9 = 4/9
GSPLIT = (3/6) * 0 + (3/6) * (4/9) = 2/9

Example
Tuple count:
            High   Low
Age <= 32   3      1
Age > 32    1      1

G(Age<=32) = 1 - ((3/4)^2 + (1/4)^2) = 1 - 10/16 = 6/16 = 3/8
G(Age>32) = 1 - ((1/2)^2 + (1/2)^2) = 1/2
GSPLIT = (4/6) * (3/8) + (2/6) * (1/2) = 1/4 + 1/6 = 5/12

The lowest value of GSPLIT is for Age <= 23, thus we have a split point at Age = (23+32)/2 = 27.5
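The whole threshold search can be replayed mechanically with the two helpers sketched earlier (Python; the data comes from the attribute list for Age):

ages  = [17, 20, 23, 32, 43, 68]
risks = ['high', 'high', 'high', 'low', 'high', 'low']

for t in ages[:-1]:   # candidate thresholds Age <= t
    left  = [r for a, r in zip(ages, risks) if a <= t]
    right = [r for a, r in zip(ages, risks) if a > t]
    print(t, gini_split(left, right))

Age <= 23 gives the minimum (2/9), confirming the split at (23+32)/2 = 27.5.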

Example
Decision tree after the first split of the example set:

[Figure: root split on Age: Age <= 27.5 leads to Risk = High; Age > 27.5 leads to Risk = Low]

Example
Attribute lists are divided at the split point.

Attribute lists for Age <= 27.5:
Age   RID   Risk         Car Type   RID   Risk
17    1     high         family     0     high
20    5     high         sport      1     high
23    0     high         family     5     high

Attribute lists for Age > 27.5:
Age   RID   Risk         Car Type   RID   Risk
32    4     low          sport      2     high
43    2     high         family     3     low
68    3     low          truck      4     low

Example
Evaluating splits for categorical attributes:
We have to evaluate the splitting index for each of the 2^N combinations, where N is the cardinality of the categorical attribute.

Tuple count (for the Age > 27.5 partition):
                      High   Low
Car type = {sport}    1      0
Car type = {family}   0      1
Car type = {truck}    0      1

G(Car type ∈ {sport}) = 1 - 1^2 - 0^2 = 0
G(Car type ∈ {family}) = 1 - 0^2 - 1^2 = 0
G(Car type ∈ {truck}) = 1 - 0^2 - 1^2 = 0

Example
G(Car type ∈ {sport, family}) = 1 - (1/2)^2 - (1/2)^2 = 1/2
G(Car type ∈ {sport, truck}) = 1/2
G(Car type ∈ {family, truck}) = 1 - 0^2 - 1^2 = 0

GSPLIT(Car type ∈ {sport}) = (1/3) * 0 + (2/3) * 0 = 0
GSPLIT(Car type ∈ {family}) = (1/3) * 0 + (2/3) * (1/2) = 1/3
GSPLIT(Car type ∈ {truck}) = (1/3) * 0 + (2/3) * (1/2) = 1/3
GSPLIT(Car type ∈ {sport, family}) = (2/3) * (1/2) + (1/3) * 0 = 1/3
GSPLIT(Car type ∈ {sport, truck}) = (2/3) * (1/2) + (1/3) * 0 = 1/3
GSPLIT(Car type ∈ {family, truck}) = (2/3) * 0 + (1/3) * 0 = 0

Example
The lowest value of GSPLIT is for Car type ∈ {sport} (equivalently, its complement {family, truck}), thus this is our split point. Decision tree after the second split of the example set:

[Figure: root split on Age: Age <= 27.5 leads to Risk = High; Age > 27.5 leads to a split on Car type: Car type ∈ {sport} leads to Risk = High, Car type ∈ {family, truck} leads to Risk = Low]

Information Gain
The information gain measure is used to select the test attribute at each node in the tree
The attribute with the highest information gain (or greatest entropy reduction) is chosen as the test attribute for the current node
This attribute minimizes the information needed to classify the samples in the resulting partitions

Information Gain
Let S be a set consisting of s data samples. Suppose the class label attribute has m distinct values defining m classes, Ci (for i = 1, ..., m)
Let si be the number of samples of S in class Ci
The expected information needed to classify a given sample is given by

I(s1, s2, ..., sm) = - Σ pi log2(pi)

where pi is the probability that an arbitrary sample belongs to class Ci and is estimated by si/s.

Information Gain
Let attribute A have v distinct values, {a1, a2, ..., av}. Attribute A can be used to partition S into {S1, S2, ..., Sv}, where Sj contains those samples in S that have value aj of A
If A were selected as the test attribute, then these subsets would correspond to the branches grown from the node containing the set S

Information Gain
Let sij be the number of samples of class Ci in a subset Sj. The entropy, or expected information based on the partitioning into subsets by A, is given by:

E(A) = Σj [(s1j + s2j + ... + smj)/s] * I(s1j, s2j, ..., smj)

The smaller the entropy value, the greater the purity of the subset partitions.

Information Gain
The term (s1j + s2j + ... + smj)/s acts as the weight of the jth subset: it is the number of samples in the subset (i.e. having value aj of A) divided by the total number of samples in S. Note that for a given subset Sj,

I(s1j, s2j, ..., smj) = - Σi pij log2(pij)

where pij = sij/|Sj| is the probability that a sample in Sj belongs to class Ci

Information Gain
The encoding information that would be gained by branching on A is

Gain(A) = I(s1, s2, ..., sm) - E(A)

Gain(A) is the expected reduction in entropy caused by knowing the value of attribute A
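Both quantities reduce to a few lines of code. A Python sketch (the helper names are ours), used below to verify the customer-database example:

from math import log2

def info(labels):
    # I(s1, ..., sm): expected information (entropy) of a labeled sample
    n = len(labels)
    return -sum((labels.count(c) / n) * log2(labels.count(c) / n)
                for c in set(labels))

def gain(values, labels):
    # Gain(A) = I(S) - E(A), for attribute values paired with class labels
    n = len(labels)
    e = sum((values.count(v) / n) *
            info([l for x, l in zip(values, labels) if x == v])
            for v in set(values))
    return info(labels) - e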

Example
RID   Age      Income   student   credit_rating   buys_computer
1     <=30     high     no        fair            no
2     <=30     high     no        excellent       no
3     31..40   high     no        fair            yes
4     >40      medium   no        fair            yes
5     >40      low      yes       fair            yes
6     >40      low      yes       excellent       no
7     31..40   low      yes       excellent       yes
8     <=30     medium   no        fair            no
9     <=30     low      yes       fair            yes
10    >40      medium   yes       fair            yes
11    <=30     medium   yes       excellent       yes
12    31..40   medium   no        excellent       yes
13    31..40   high     yes       fair            yes
14    >40      medium   no        excellent       no
Example
Let us consider the preceding training set of tuples taken from the customer database.

The class label attribute, buys_computer, has two distinct values (yes, no); therefore, there are two classes (m = 2).
C1 corresponds to yes: s1 = 9
C2 corresponds to no: s2 = 5
I(s1, s2) = I(9, 5) = -9/14 log2(9/14) - 5/14 log2(5/14) = 0.94

Example
Next, we need to compute the entropy of each attribute. Let us start with the attribute age:
for age = "<=30": s11 = 2, s21 = 3, I(s11, s21) = 0.971
for age = "31..40": s12 = 4, s22 = 0, I(s12, s22) = 0
for age = ">40": s13 = 3, s23 = 2, I(s13, s23) = 0.971

Example
The entropy of age is:
E(age) = 5/14 * I(s11, s21) + 4/14 * I(s12, s22) + 5/14 * I(s13, s23) = 0.694

The gain in information from such a partitioning would be:
Gain(age) = I(s1, s2) - E(age) = 0.246

Example
Similarly, we can compute:
Gain(income) = 0.029,
Gain(student) = 0.151, and
Gain(credit_rating) = 0.048
Since age has the highest information gain among the attributes, it is selected as the test attribute. A node is created and labeled with age, and branches are grown for each of the attribute's values.
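These four numbers can be reproduced with the info/gain helpers above (Python sketch; the rows are the 14-tuple table from this example):

rows = [  # (age, income, student, credit_rating, buys_computer)
    ('<=30','high','no','fair','no'),         ('<=30','high','no','excellent','no'),
    ('31..40','high','no','fair','yes'),      ('>40','medium','no','fair','yes'),
    ('>40','low','yes','fair','yes'),         ('>40','low','yes','excellent','no'),
    ('31..40','low','yes','excellent','yes'), ('<=30','medium','no','fair','no'),
    ('<=30','low','yes','fair','yes'),        ('>40','medium','yes','fair','yes'),
    ('<=30','medium','yes','excellent','yes'),('31..40','medium','no','excellent','yes'),
    ('31..40','high','yes','fair','yes'),     ('>40','medium','no','excellent','no'),
]
labels = [r[4] for r in rows]
for i, name in enumerate(['age', 'income', 'student', 'credit_rating']):
    print(name, round(gain([r[i] for r in rows], labels), 3))
# age 0.246, income 0.029, student 0.151, credit_rating 0.048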

Example
[Figure: root node "age" with three branches: "<=30" leads to a mixed leaf (buys_computers: yes, no); "31..40" leads to buys_computers: yes; ">40" leads to a mixed leaf (buys_computers: yes, no)]

Example
[Figure: the finished tree. Root "age": branch "<=30" tests "student" (no: buys_computers = no; yes: buys_computers = yes); branch "31..40" gives buys_computers = yes; branch ">40" tests "credit_rating" (excellent: buys_computers = no; fair: buys_computers = yes)]

Entropy vs. Gini index

The Gini index tends to isolate the largest class from all other classes, while entropy tends to find groups of classes that add up to 50% of the data.

Example with classes A: 40, B: 30, C: 20, D: 10:
Gini-style split ("if age < 40"): the yes branch gets class A (40); the no branch gets classes B, C, D (30, 20, 10)
Entropy-style split ("if age < 65"): the yes branch gets classes A, D (40, 10); the no branch gets classes B, C (30, 20)

Entropy vs. Gini index

Gini will tend to find the largest class, while entropy tends to find groups of classes that make up 50% of the data
Use Gini to minimize misclassification, and entropy for exploratory analysis
Gini is intended for continuous attributes, and entropy for attributes that occur in classes

Tree pruning
When a decision tree is built, many of the branches will reflect anomalies in the training data due to noise or outliers.
Tree pruning methods typically use statistical measures to remove the least reliable branches, generally resulting in faster classification and an improvement in the ability of the tree to correctly classify independent test data

Tree pruning
Prepruning approach (stopping): a tree is pruned by halting its construction early (i.e. by deciding not to further split or partition the subset of training samples). Upon halting, the node becomes a leaf. The leaf holds the most frequent class among the subset samples
Postpruning approach (pruning): removes branches from a fully grown tree. A tree node is pruned by removing its branches. The lowest unpruned node becomes a leaf and is labeled with the most frequent class among its former branches

Extracting Classification Rules from Decision Trees
The knowledge represented in decision trees can be extracted and represented in the form of classification IF-THEN rules.
One rule is created for each path from the root to a leaf node
Each attribute-value pair along a given path forms a conjunction in the rule antecedent; the leaf node holds the class prediction, forming the rule consequent

Extracting Classification Rules from Decision Trees
The decision tree of the preceding example can be converted to the following classification rules:
IF age = "<=30" AND student = no THEN buys_computers = no
IF age = "<=30" AND student = yes THEN buys_computers = yes
IF age = "31..40" THEN buys_computers = yes
IF age = ">40" AND credit_rating = excellent THEN buys_computers = no
IF age = ">40" AND credit_rating = fair THEN buys_computers = yes

Other Classification Methods

There are a number of classification methods in the literature:
Bayesian classifiers
Neural-network classifiers
K-nearest neighbor classifiers
Association-based classifiers
Rough and fuzzy sets

Decision Trees
DT algorithms mainly differ on:
Splitting criteria
Which variable to split first?
What values to use to split?
How many splits to form for each node?
Stopping criteria
When to stop building the tree
Pruning (generalization method)
Pre-pruning versus post-pruning

The most popular DT algorithms include ID3, C4.5, C5; CART; CHAID; M5

Decision Trees
Alternative splitting criteria:
Gini index determines the purity of a specific class as a result of a decision to branch along a particular attribute/value
Used in CART
Information gain uses entropy to measure the extent of uncertainty or randomness of a particular attribute/value split
Used in ID3, C4.5, C5
Chi-square statistics (used in CHAID)

Cluster Analysis for Data Mining

Used for automatic identification of natural groupings of things
Part of the machine-learning family
Employs unsupervised learning
Learns the clusters of things from past data, then assigns new instances
There is no output variable
Also known as segmentation

Cluster Analysis for Data Mining

Clustering results may be used to:
Identify natural groupings of customers
Identify rules for assigning new cases to classes for targeting/diagnostic purposes
Provide characterization, definition, and labeling of populations
Decrease the size and complexity of problems for other data mining methods
Identify outliers in a specific domain (e.g., rare-event detection)

Cluster Analysis for Data Mining

Analysis methods:
Statistical methods (including both hierarchical and nonhierarchical), such as k-means, k-modes, and so on
Neural networks (adaptive resonance theory [ART], self-organizing map [SOM])
Fuzzy logic (e.g., fuzzy c-means algorithm)
Genetic algorithms

Divisive versus agglomerative methods

Cluster Analysis for Data Mining

How many clusters?
There is no truly optimal way to calculate it
Heuristics are often used:
Look at the sparseness of clusters
Number of clusters = (n/2)^(1/2) (n: number of data points)
Use the Akaike information criterion (AIC)
Use the Bayesian information criterion (BIC)

Most cluster analysis methods involve the use of a distance measure to calculate the closeness between pairs of items
Euclidean versus Manhattan (rectilinear) distance

What is Cluster Analysis?

Cluster: a collection of data objects
(Intraclass similarity) - Objects are similar to objects in the same cluster
(Interclass dissimilarity) - Objects are dissimilar to objects in other clusters
Examples of clusters?
Cluster analysis: a statistical method to identify and group sets of similar objects into classes
Good clustering methods produce high-quality clusters with high intraclass similarity and interclass dissimilarity
Unlike classification, it is unsupervised learning

Clustering (slide from Han and Kamber)

Clustering of data is a method by which large sets of data are grouped into clusters of smaller sets of similar data.
The example demonstrates the clustering of balls of the same colour: there are a total of 10 balls of three different colours, and we are interested in clustering them into three groups by colour.

[Figure: the balls of the same colour are clustered into one group each]

Thus, we see that clustering means grouping data, or dividing a large data set into smaller data sets that share some similarity.

Usual Working Data Structures

Data matrix (two modes): a flat file of attributes/coordinates, with one row per object (n objects) and one column per attribute (p attributes):

x11  ...  x1f  ...  x1p
...  ...  ...  ...  ...
xi1  ...  xif  ...  xip
...  ...  ...  ...  ...
xn1  ...  xnf  ...  xnp

Dissimilarity matrix (one mode), or distance matrix: lower-triangular, with zeros on the diagonal:

0
d(2,1)   0
d(3,1)   d(3,2)   0
:        :        :
d(n,1)   d(n,2)   ...      0

Data Types and Distance Metrics

Interval-Scaled Attributes
Binary Attributes
Nominal Attributes
Ordinal Attributes
Ratio-Scaled Attributes
Attributes of Mixed Type

Data Types and Distance Metrics

Interval-Scaled Attributes
Using Interval-Scaled Values
Step 1: Standardize the data
To ensure they all have equal weight
To match up different scales into a uniform, single scale
Not always needed! Sometimes we require unequal weights for an attribute
Step 2: Compute dissimilarity between records
Use the Euclidean, Manhattan or Minkowski distance

Data Types and Distance Metrics

Interval-Scaled Attributes
Minkowski distance:

d(i,j) = ( |xi1 - xj1|^q + |xi2 - xj2|^q + ... + |xip - xjp|^q )^(1/q)

Euclidean distance: q = 2
Manhattan distance: q = 1

What are the shapes of the resulting clusters? Spherical in shape.
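The formula is a one-liner in code (Python sketch; the function name is ours):

def minkowski(x, y, q=2):
    # d(i,j) for two equal-length records; q=2 Euclidean, q=1 Manhattan
    return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1 / q)

minkowski([64, 60], [60, 61], q=1)   # 5.0 (Manhattan)
minkowski([64, 60], [60, 61])        # 4.12... (Euclidean)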

Clustering Methods

Partitioning methods
Hierarchical methods
Density-based methods
Grid-based methods
Model-based methods

The choice of algorithm depends on the type of data available and on the nature and purpose of the application

Clustering Methods
Partitioning methods
Divide the objects into a set of partitions based on some criteria
Improve the partitions by shifting objects between them for higher intraclass similarity, interclass dissimilarity and other such criteria
Two popular heuristic methods:
k-means algorithm
k-medoids algorithm

Clustering Methods
Hierarchical methods
Build up or break down groups of objects in a recursive manner
Two main approaches:
Agglomerative approach
Divisive approach

[Figure: agglomerative versus divisive hierarchical clustering (Wikipedia)]

Clustering Methods
Density-based methods
Grow a given cluster until the density decreases below a certain threshold
Grid-based methods
Form a grid structure by quantizing the object space into a finite number of grid cells
Model-based methods
Hypothesize a model and find the best fit of the data to the chosen model

K-Means Clustering Algorithm

The K-Means algorithm is a type of partitioning method
Groups instances based on attributes into k groups
High intra-cluster similarity; low inter-cluster similarity
Cluster similarity is measured with respect to the mean value of the objects in the cluster

How does K-means work? (A sketch in code follows.)
First, select K random instances from the data as the initial cluster centers
Second, each instance is assigned to its closest (most similar) cluster center
Third, each cluster center is updated to the mean of its constituent instances
Repeat steps two and three until there is no further change in the assignment of instances to clusters
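A compact sketch of these steps in plain Python (no libraries; the names are ours):

import random

def dist2(p, q):
    # Squared Euclidean distance between two points
    return sum((a - b) ** 2 for a, b in zip(p, q))

def k_means(points, k, iters=100):
    centers = random.sample(points, k)          # step 1: random initial centers
    for _ in range(iters):
        # Step 2: assign each point to its closest center
        assign = [min(range(k), key=lambda c: dist2(p, centers[c]))
                  for p in points]
        # Step 3: move each center to the mean of its assigned points
        new_centers = []
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            new_centers.append(tuple(sum(d) / len(members) for d in zip(*members))
                               if members else centers[c])
        if new_centers == centers:              # stop when the centers stabilize
            break
        centers = new_centers
    return centers, assign

For instance, k_means([(64, 60), (60, 61), (59, 70), (68, 71)], k=2) clusters the four height/weight records used in the K-Means example later in this section.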

Partitional methods: K-means

Criteria: minimize the sum of squared distances
between each point and the centroid of the cluster
between each pair of points in the cluster

Algorithm:
Select an initial partition with K clusters: random, first K, or K separated points
Repeat until stabilization:
Assign each point to the closest cluster center
Generate new cluster centers
Adjust clusters by merging/splitting

Collaborative Filtering
Given a database of user preferences, predict the preferences of a new user
Example: predict what new movies you will like based on
your past preferences
others with similar past preferences
their preferences for the new movies
Example: predict what books/CDs a person may want to buy
(and suggest them, or give discounts to tempt the customer)

K-Means Clustering Algorithm

[Figure: Cluster Analysis for Data Mining: the k-means clustering algorithm illustrated in three steps]

K-Means Example
ID   Name    Height   Weight
x1   Ram     64       60
x2   Shyam   60       61
x3   Gita    59       70
x4   Mohan   68       71

Artificial Neural Networks for Data Mining

Artificial neural networks (ANN or NN) are a brain metaphor for information processing
a.k.a. neural computing
Very good at capturing highly complex nonlinear functions!
Many uses: prediction (regression, classification), clustering/segmentation
Many application areas: finance, medicine, marketing, manufacturing, service operations, information systems, etc.

Biological versus Artificial Neural Networks

[Figure: a biological neuron (dendrites, synapses, axon) alongside an artificial neuron: inputs x1 ... xn enter a processing element (PE) through weights w1 ... wn, are combined by the summation S = Σ XiWi, passed through a transfer function f(S), and emitted as outputs Y1 ... Yn]

Biological      Artificial
Neuron          Node (or PE)
Dendrites       Input
Axon            Output
Synapse         Weight
Slow            Fast
Many (10^9)     Few (10^2)

Elements/Concepts of ANN
Processing element (PE)
Information processing
Network structure
Feedforward vs. recurrent vs. multi-layer
Learning parameters
Supervised/unsupervised, backpropagation, learning rate, momentum

ANN software: NN shells, integrated modules in comprehensive DM software, ...

Data Mining Software

[Figure: bar chart of data mining tools in use, "alone" versus "total (with others)"; responses include SPSS PASW Modeler (formerly Clementine), SAS / SAS Enterprise Miner, Microsoft Excel, R, RapidMiner, Weka (now Pentaho), KXEN, MATLAB, KNIME, Microsoft SQL Server, Zementis, Oracle DM, StatSoft Statistica, Salford CART/Mars, Orange, Angoss, C4.5/C5.0/See5, Bayesia, Insightful Miner/S-Plus (now TIBCO), Megaputer, Viscovery, Clario Analytics, Miner3D, Thinkanalytics, and custom code. Source: KDNuggets.com, May 2009]

Commercial:
IBM SPSS Modeler (formerly Clementine)
SAS Enterprise Miner
IBM Intelligent Miner
StatSoft Statistica Data Miner
many more

Free and/or Open Source:
RapidMiner
Weka
many more

Data Mining in MS SQL Server 2008

Data Mining Myths

Data mining ...
provides instant solutions/predictions.
is not yet viable for business applications.
requires a separate, dedicated database.
can only be done by those with advanced degrees.
is only for large firms that have lots of customer data.
is another name for good old statistics.

Common Data Mining Blunders

1. Selecting the wrong problem for data mining
2. Ignoring what your sponsor thinks data mining is and what it really can/cannot do
3. Not leaving sufficient time for data acquisition, selection and preparation
4. Looking only at aggregated results and not at individual records/predictions
5. Being sloppy about keeping track of the data mining procedure and results

Common Data Mining Mistakes

6. Ignoring suspicious (good or bad) findings and quickly moving on
7. Running mining algorithms repeatedly and blindly, without thinking about the next stage
8. Naively believing everything you are told about the data
9. Naively believing everything you are told about your own data mining analysis
10. Measuring your results differently from the way your sponsor measures them

Data Mining vs. KDD

Knowledge Discovery in Databases (KDD): the process of finding useful information and patterns in data.
Data Mining: the use of algorithms to extract the information and patterns derived by the KDD process.

KDD Process

[Figure: the KDD pipeline, modified from [FPSS96C]]

Selection: Obtain data from various sources.
Preprocessing: Cleanse data.
Transformation: Convert to a common format. Transform to a new format.
Data Mining: Obtain desired results.
Interpretation/Evaluation: Present results to the user in a meaningful manner.

KDD Issues

Multimedia Data
Missing Data
Irrelevant Data
Noisy Data
Changing Data
Integration
Application

KDD Process Example: Web Log

Selection:
Select log data (dates and locations) to use

Preprocessing:
Remove identifying URLs
Remove error logs

Transformation:
Sessionize (sort and group)

Data Mining:
Identify and count patterns
Construct data structure

Interpretation/Evaluation:
Identify and display frequently accessed sequences

Potential user applications:
Cache prediction
Personalization
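As a hedged illustration of the transformation step, here is a minimal sessionization sketch; the log record format (ip, timestamp, url) and the 30-minute inactivity threshold are assumptions for demonstration.

```python
# A minimal sessionization sketch: sort log records, then group each visitor's
# requests into sessions split on a 30-minute inactivity gap (assumed format).
from datetime import datetime, timedelta

SESSION_TIMEOUT = timedelta(minutes=30)  # assumed inactivity threshold

def sessionize(records):
    """Group (ip, timestamp, url) records into per-visitor sessions."""
    sessions, current = [], []
    last_ip, last_time = None, None
    for ip, ts, url in sorted(records, key=lambda r: (r[0], r[1])):
        # start a new session on a new visitor or after a long gap
        if ip != last_ip or (last_time and ts - last_time > SESSION_TIMEOUT):
            if current:
                sessions.append(current)
            current = []
        current.append(url)
        last_ip, last_time = ip, ts
    if current:
        sessions.append(current)
    return sessions

log = [("10.0.0.1", datetime(2012, 1, 5, 9, 0), "/home"),
       ("10.0.0.1", datetime(2012, 1, 5, 9, 2), "/products"),
       ("10.0.0.1", datetime(2012, 1, 5, 11, 0), "/home"),   # gap: new session
       ("10.0.0.2", datetime(2012, 1, 5, 9, 1), "/home")]
print(sessionize(log))  # [['/home', '/products'], ['/home'], ['/home']]
```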

Data Mining Process: CRISP-DM
(Cross-Industry Standard Process for Data Mining)

[Figure: the cyclical CRISP-DM process around the data sources:
1 Business Understanding, 2 Data Understanding, 3 Data Preparation,
4 Model Building, 5 Testing and Evaluation, 6 Deployment,
with feedback loops between the phases]

Data Mining Process: CRISP-DM

Step 1: Business Understanding
Step 2: Data Understanding
Step 3: Data Preparation (!)
Step 4: Model Building
Step 5: Testing and Evaluation
Step 6: Deployment

Understanding and preparing the data (Steps 2-3) account for ~85% of total project time.
The process is highly repetitive and experimental (DM: art versus science?)

Generic tasks within each CRISP-DM phase:

Business Understanding: Determine Business Objectives; Assess Situation; Determine Data Mining Goals; Produce Project Plan
Data Understanding: Collect Initial Data; Describe Data; Explore Data; Verify Data Quality
Data Preparation: Select Data; Clean Data; Construct Data; Integrate Data; Format Data
Modeling: Select Modeling Technique; Generate Test Design; Build Model; Assess Model
Evaluation: Evaluate Results; Review Process; Determine Next Steps
Deployment: Plan Deployment; Plan Monitoring & Maintenance; Produce Final Report; Review Project

The CRISP-DM reference model


Business understanding
focuses on understanding the project
objectives and requirements from a
business perspective, then converting this
knowledge into a data mining problem
definition and a preliminary plan designed
to achieve the objectives

180

The CRISP-DM reference model


Data understanding
starts with an initial data collection and
proceeds with activities in order to get
familiar with the data, to identify data
quality problems, to discover first insights
into the data or to detect interesting
subsets to form hypotheses for hidden
information.

The CRISP-DM reference model


Data preparation
covers all activities to construct the final
dataset from the initial raw data. Data
preparation tasks are likely to be
performed multiple times and not in any
prescribed order. Tasks include table,
record and attribute selection as well as
transformation and cleaning of data for
modeling tools.


The CRISP-DM reference model


Modeling
various modeling techniques are selected
and applied and their parameters are
calibrated to optimal values. Typically,
there are several techniques for the same
data mining problem type. Some
techniques have specific requirements on
the form of data. Therefore, stepping back
to the data preparation phase is often
necessary.

The CRISP-DM reference model


Evaluation
thoroughly evaluate the model and review
the steps executed to construct the model
to be certain it properly achieves the
business objectives. A key objective is to
determine if there is some important
business issue that has not been
sufficiently considered. At the end of this
phase, a decision on the use of the data
mining results should be reached.

The CRISP-DM reference model


Deployment
the knowledge gained will need to be
organized and presented in a way that the
customer can use it. However, depending
on the requirements, the deployment
phase can be as simple as generating a
report or as complex as implementing a
repeatable data mining process across the
enterprise.

Business understanding

[Figure: tasks and outputs of the Business Understanding phase, within the
sequence Business Understanding, Data Understanding, Data Preparation,
Modeling, Evaluation, Deployment]

Determine Business Objectives: Background; Business Objectives; Business Success Criteria
Assess Situation: Inventory of Resources; Requirements, Assumptions & Constraints; Risks and Contingencies; Terminology; Costs and Benefits
Determine Data Mining Goals: Data Mining Goals; Data Mining Success Criteria
Produce Project Plan: Project Plan; Initial Assessment of Tools and Techniques

Business understanding
1. Determine business objectives
Thoroughly understand, from a business perspective, what the client really wants to accomplish. Often the client has many competing objectives and constraints that must be properly balanced. The analyst's goal is to uncover, at the beginning, the important factors that can influence the outcome of the project.

Business understanding
2. Assess situation
Detailed fact-finding about all of the resources, constraints, assumptions and other factors that should be considered in determining the data analysis goal and project plan.

Business understanding
3. Determine data mining goals
A business goal states objectives in business terminology. A data mining goal states project objectives in technical terms.
4. Produce project plan
Describe the intended plan for achieving the data mining goals and thereby achieving the business goals.

Data understanding

[Figure: tasks and outputs of the Data Understanding phase]

Collect Initial Data: Initial Data Collection Report
Describe Data: Data Description Report
Explore Data: Data Exploration Report
Verify Data Quality: Data Quality Report

Data understanding
1. Collect initial data
Acquire within the project the data (or access to the data) listed in the project resources. This initial collection includes data loading, if necessary for data understanding.
2. Describe data
Examine the gross or surface properties of the acquired data and report on the results.

Data understanding
3. Explore data
This task tackles the data mining questions, which can be addressed using querying, visualization and reporting. These analyses may address the data mining goals directly; they may also contribute to or refine the data description and quality reports, and feed into the transformation and other data preparation needed for further analysis.
4. Verify data quality
Examine the quality of the data.
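A minimal pandas sketch of the explore and verify tasks; the file name and columns are hypothetical.

```python
# A minimal exploration and data-quality check with pandas (hypothetical file).
import pandas as pd

df = pd.read_csv("customers.csv")      # hypothetical acquired data

print(df.describe(include="all"))      # surface properties: counts, ranges
print(df.dtypes)                       # do the types match expectations?

# Inputs to a data quality report: missing values and duplicate records
print(df.isna().sum())                 # missing values per attribute
print("duplicate rows:", df.duplicated().sum())
```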

Data preparation

[Figure: tasks and outputs of the Data Preparation phase]

Select Data: Rationale for Inclusion/Exclusion
Clean Data: Data Cleaning Report
Construct Data: Derived Attributes; Generated Records
Integrate Data: Merged Data
Format Data: Reformatted Data
Phase outputs: Data Set; Data Set Description

Data preparation
1. Select data
Decide on the data to be used for analysis. Criteria include relevance to the data mining goals, quality, and technical constraints.

Data preparation
2. Clean data
Raise the data quality to the level required by the selected analysis techniques. This may involve selecting clean subsets of the data, inserting suitable defaults, or more ambitious techniques such as the estimation of missing data by modeling.
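A minimal cleaning sketch over the same hypothetical dataset: default insertion and simple imputation (model-based estimation would be a more ambitious variant).

```python
# A minimal data-cleaning sketch with pandas (columns are hypothetical).
import pandas as pd

df = pd.read_csv("customers.csv")                          # hypothetical input

df = df.drop_duplicates()                                  # clean subset
df["region"] = df["region"].fillna("Unknown")              # suitable default
df["income"] = df["income"].fillna(df["income"].median())  # impute numeric
df = df[df["age"].between(0, 120)]                         # drop implausible values
```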

Data preparation
3. Construct data
This task includes constructive data preparation operations such as the production of derived attributes, entire new records, or transformed values for existing attributes.
4. Integrate data
Information is combined from multiple tables or records to create new records or values.
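A minimal construct/integrate sketch; the tables, keys and derived attribute are hypothetical.

```python
# Constructing a derived attribute and integrating two tables with pandas.
import pandas as pd

customers = pd.read_csv("customers.csv")   # hypothetical tables
orders = pd.read_csv("orders.csv")

# Construct data: derive a new attribute from existing ones
customers["income_per_member"] = customers["income"] / customers["household_size"]

# Integrate data: aggregate orders per customer, then merge into one table
spend = orders.groupby("customer_id", as_index=False)["amount"].sum()
merged = customers.merge(spend, on="customer_id", how="left")
```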

Data preparation
5. Format data
Formatting transformations refer to primarily syntactic modifications made to the data that do not change its meaning, but might be required by the modeling tool.

Modeling

[Figure: tasks and outputs of the Modeling phase]

Select Modeling Technique: Modeling Technique; Modeling Assumptions
Generate Test Design: Test Design
Build Model: Parameter Settings; Models; Model Description
Assess Model: Model Assessment; Revised Parameter Settings

Modeling
1. Select modeling technique
Select the actual modeling technique that is to be used.
2. Generate test design
Before we actually build a model, we need to generate a procedure or mechanism to test the model's quality and validity.
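A minimal test-design sketch: hold out part of the prepared data before any model is built. The dataset, feature and target names are hypothetical.

```python
# Generate a simple test design: a stratified 70/30 train/test split.
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("movies_prepared.csv")    # hypothetical prepared dataset
X = df.drop(columns=["success_class"])     # hypothetical predictors
y = df["success_class"]                    # hypothetical target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42)
```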

Modeling
3. Build model
Run the modeling tool on the prepared dataset to create one or more models.
4. Assess model
The data mining engineer ranks the models, assessing them according to the evaluation criteria. As far as possible, he also takes into account the business objectives and business success criteria, and compares all results against the evaluation criteria.
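Continuing the split above, a minimal build-and-assess sketch; the choice of a decision tree and of accuracy as the criterion are illustrative assumptions (numeric features assumed).

```python
# Build a model on the training partition, then assess it on held-out data.
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, classification_report

model = DecisionTreeClassifier(max_depth=5, random_state=42)
model.fit(X_train, y_train)                # build model

y_pred = model.predict(X_test)             # assess model on the test design
print("accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))
```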

Evaluation

[Figure: tasks and outputs of the Evaluation phase]

Evaluate Results: Assessment of Data Mining Results; Approved Models
Review Process: Review of Process
Determine Next Steps: List of Possible Actions; Decision

Evaluation
1. Evaluate results
This step assesses the degree to which the model meets the business objectives and seeks to determine if there is some business reason why this model is deficient. Another option is to test the model(s) on test applications in the real application, if time and budget constraints permit.

Evaluation
2. Review process
Do a more thorough review of the data mining engagement in order to determine if there is any important factor or task that has somehow been overlooked. This review also covers quality assurance issues, e.g., did we correctly build the model?

Evaluation
3. Determine next steps
According to the assessment results and the process review, the project decides how to proceed at this stage. This task includes analyses of the remaining resources and budget that influence the decisions.

Deployment

[Figure: tasks and outputs of the Deployment phase]

Plan Deployment: Deployment Plan
Plan Monitoring & Maintenance: Monitoring & Maintenance Plan
Produce Final Report: Final Report; Final Presentation
Review Project: Experience Documentation

Deployment
1. Plan deployment
This task takes the evaluation results and determines a strategy for deployment. If a general procedure has been identified to create the relevant model(s), this procedure is documented here for later deployment.

Deployment
2 Plan monitoring and maintenance
In order to monitor the deployment of the
data mining result(s), the project needs a
detailed plan on the monitoring process.
This plan takes into account the specific
type of deployment.


Deployment
3 Produce final report
At the end of the project, the project leader
and his team write up a final report.
Depending on the deployment plan, this
report may be only a summary of the
project and its experiences or it may be a
final and comprehensive presentation of
the data mining result(s).

Deployment
4 Review project
Assess what went right and what went
wrong, what was done well and what
needs to be improved.


The CRISP-DM user guide

The user guide gives more detailed tips and hints for each phase and each task within a phase, and depicts how to do a data mining project. It provides the activities involved with each generic task within a phase.

Data Mining Process: SEMMA

Sample: generate a representative sample of the data
Explore: visualization and basic description of the data
Modify: select variables, transform variable representations
Model: use a variety of statistical and machine learning models
Assess: evaluate the accuracy and usefulness of the models

[Figure: the SEMMA cycle: Sample, Explore, Modify, Model, Assess]

Data Preparation: A Critical DM Task

Real-world data is turned into well-formed data through four steps (a pipeline sketch follows below):

Data Consolidation: collect data; select data; integrate data
Data Cleaning: impute missing values; reduce noise in data; eliminate inconsistencies
Data Transformation: normalize data; discretize/aggregate data; construct new attributes
Data Reduction: reduce number of variables; reduce number of cases; balance skewed data
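A minimal end-to-end sketch of the four steps; the files, columns, and thresholds are hypothetical assumptions.

```python
# Consolidation -> cleaning -> transformation -> reduction with pandas.
import pandas as pd

# Data consolidation: collect, select and integrate
a = pd.read_csv("transactions.csv")
b = pd.read_csv("customers.csv")
df = a.merge(b, on="customer_id")[["customer_id", "amount", "age", "segment"]]

# Data cleaning: impute missing values, eliminate inconsistencies
df["age"] = df["age"].fillna(df["age"].median())
df = df[df["amount"] >= 0]                   # drop inconsistent records

# Data transformation: normalize and discretize
df["amount_norm"] = (df["amount"] - df["amount"].mean()) / df["amount"].std()
df["age_band"] = pd.cut(df["age"], bins=[0, 25, 45, 65, 120])

# Data reduction: fewer variables and fewer cases
well_formed = df.drop(columns=["amount"]).sample(frac=0.5, random_state=1)
```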

Q & A
