CHAID
(Chi-Square Automatic Interaction Detector)
As discussed in the previous chapters, analytic problems can be broadly classified into
two classes: those where the dependent variable is continuous and those where it is
nominal. In this chapter, we discuss a technique for handling a nominal dependent
variable. It falls in the class of techniques called decision trees.
4.1 DECISION TREES
A decision tree is a graphical decision support tool. As the name indicates, it uses an
inverted tree structure to represent the decision outcome. It is recursive in nature and
starts with the entire available sample, which is called the root node. The objective of the
analysis is to split the root node such that the cases within each branch are similar, while
the branches differ from each other with respect to the target variable. It can handle both
quantitative and qualitative variables (e.g. gender, sales region, occupation). The
dependent variable should be nominal; it may have multiple categories, but in most real-life
situations it is binary. The independent variables should also be categorical. Hence,
continuous variables had to be categorized using appropriate logic before processing. This
was the requirement many years back; most algorithms now have a built-in module for
categorizing continuous variables.
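As an illustration of categorizing a continuous variable, the minimal sketch below bins an average annual purchase amount into four levels. The cut points ($625, $1250, $2500) are the ones used in the campaign example of this chapter; the function name and the rule that an amount exactly on a cut point goes to the higher level are assumptions made for the sketch.

```python
from bisect import bisect_right

# Cut points (in dollars) from the campaign example in this chapter.
CUT_POINTS = [625, 1250, 2500]

def purchase_level(amount):
    """Return the purchase level (1 to 4) for an average annual purchase amount.

    Assumption: an amount exactly equal to a cut point is placed in the
    higher level (for example, $625 falls in level 2).
    """
    return bisect_right(CUT_POINTS, amount) + 1

levels = [purchase_level(a) for a in (300, 625, 1800, 4000)]
print(levels)  # [1, 2, 3, 4]
```

Built-in binning modules of CHAID tools may choose cut points differently (for example, by equal frequency), so this fixed-boundary approach is only one option.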
Campaign Management:- Let us consider a simple application of a decision tree. It refers
to a campaign executed by a large store to enroll its existing customers for an upgrade of
their service. The campaign was addressed to all existing customers, and response
information is available. The business would like to understand what drives the response
of the customers. This will help it to understand the customers better. Moreover, it will
help it to target the next campaign so that the promotion cost is lower.
The data consists of demographic information, purchase information and the response to the
campaign (yes/no). The dependent variable in this case is binary. The independent variables
should also be categorical. For example, the average annual purchase was categorized
into 4 levels (level1: < $625, level2: $625 - $1250, level3: $1250 - $2500, level4:
> $2500). A total of 5500 customers were targeted, and the entire lot is used for this
analysis. The decision tree developed is given below.
Regi Mathew
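All Customers (# = 5500, Response Rate = 12%)
    Purchase < $1250: Customers (# = 4000, Response Rate = 7%)
        Distance within 2 km: Customers (# = 500, Response = 25%)
        Distance 2 km to 10 km: Customers (# = 2000, Response = 5%)
        Distance more than 10 km: Customers (# = 1500, Response = 2%)
    Purchase >= $1250: Customers (# = 1500, Response Rate = 20%)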
The set of all customers forms the root node. It shows that the campaign was addressed to
5500 customers and that 12% responded. The next objective is to divide this group of
customers based on any of the available profile information such that the response rates of
the resulting groups are significantly different. The algorithm identified the level of
average annual purchase as the variable that best explains the response. It also identified
that the best option is to group the levels of purchase into two (it could have been more
than two). The groups were formed based on whether the purchase is less than $1250 or
not. As a result, the two groups had significantly different response rates.
The next step is to treat each of these two groups as a root node and analyze it against
each of the profile variables to identify the variable that best splits the node. For
the first node (< $1250), the distance to residence was found to be the variable that best
explained the response. For customers with purchase >= $1250, none of the variables
improved the tree further. Hence, tree growing was terminated at this node. Within the first
node, those who stay within 2 km of the store responded eagerly compared to those who stay
farther away. The algorithm also made a distinction between those who stay within 10 km and
those who stay more than 10 km away. Although the response rate in both of these groups was
low, there is a significant difference between the three groups. The segment of customers
with annual purchase less than $1250 and staying within 2 km seems to be the best, with a
25% response rate, while the segment with purchase less than $1250 and staying more than
10 km away is the worst, with just a 2% response rate. This example illustrates the
application of decision trees for data analysis. The most important feature of a tree
diagram is its visual impact.
Lecture Notes for Private Circulation only
[Figure: anatomy of a decision tree for Customer Response (Yes/No), labelling the root node, an internal node and the leaf nodes, with branches Average Purchase - Low, Average Purchase - High, Distance Low and Distance High.]
Decision trees are also strongly linked to rule induction. Each leaf can be converted
into a rule by following the path from the root node. For example, the rule for the
first leaf is: if a customer has an annual purchase amount of less than $1250 and stays
less than 2 km from the store, then there is a 25% chance that she will respond to
the campaign. This means that a decision maker can cross-check the result against his
domain knowledge and validate it.
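The conversion of leaves into rules can be sketched as a chain of if-statements, one path per leaf. The sketch below encodes the campaign tree of this chapter; the 25% and 2% rates are stated in the text, while the 20% and 5% rates and their branch assignments are read from the tree figure. The function name and the handling of customers exactly at 2 km or 10 km are assumptions.

```python
def response_rate(annual_purchase, distance_km):
    """Return the response rate of the leaf that a customer falls into.

    Each return statement corresponds to one leaf of the campaign tree,
    i.e. to one rule of the form "if ... and ... then response rate".
    Assumption: a distance of exactly 2 km or 10 km goes to the nearer group.
    """
    if annual_purchase >= 1250:
        return 0.20  # this node was not split further
    if distance_km <= 2:
        return 0.25  # purchase < $1250 and within 2 km
    if distance_km <= 10:
        return 0.05  # purchase < $1250 and between 2 km and 10 km
    return 0.02      # purchase < $1250 and more than 10 km away

print(response_rate(1000, 1.5))  # 0.25
```

Writing the tree this way makes it easy for a decision maker to read each rule and compare it with domain knowledge.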
A decision tree is extremely useful for uncovering the structure of the data, and it is
quite visual in nature. Its output is intuitive and can be used to communicate with
non-technical individuals. Hence, it is quite common to use it at the exploratory stage
of studies with a nominal dependent variable. In analytics, the most commonly used decision
tree algorithm is CHAID.
CHAID
CHAID stands for Chi-squared Automatic Interaction Detector and was originally
proposed by Kass (1980) for a nominal dependent variable. It was extended to an ordinal
dependent variable by Magidson (1993), who illustrated how this extension could be
used to take advantage of known facts such as the profitability of each category of the
dependent variable. The algorithm incorporates sequential merging and splitting based on a
test statistic. It uses the chi-square test of independence to determine the best next
split at each step; hence the name. The sequential steps followed are as below.
- Tabulate the dependent variable against the independent variable (assume k
categories). This will result in a large number of tables, as the categories of the
independent variable can be merged in many combinations. In order to reduce the computing
load, Kass suggested the following approach.
CHAID tree output:
Node 0 (root) is split on utilunsec_cat (Chi-square = 14730, df = 3, p = 0.000) into Nodes 1 to 4, one node per category of utilunsec_cat.
Node 1 is split on age_cat (Chi-square = 1156, df = 3, p = 0.000) into Nodes 5 to 8, one node per category of age_cat.
Node   Category 1 (#)   Category 0 (#)   Total (#)   Category 1 (%)   Category 0 (%)   Total (%)
0           6674            82310          88984           7.50            92.50         100.00
1           4479             9450          13929          17.90            82.10          25.00
2            755            10916          11671           6.50            93.50          13.10
3            977            37384          38361           2.50            97.50          48.10
4            463            24560          25023           6.50            93.50          28.10
5           1930             2001           3931          49.10            50.90           4.40
6            831             2773           3604          23.10            76.90           4.10
7           1446             2556           4002          36.10            63.90           4.50
8            272             2120           2392          11.40            88.60           2.70
The root node (Node 0) is split into four nodes based on utilization of unsecured credit
(utilunsec_cat). The algorithm did not merge any categories of this variable, as the
chi-square was highest when they were not merged. The chi-square analysis result for the
root node is provided below. It shows that the chi-square value is 14730 and the
corresponding probability is 0.000. This is the result of a test of independence applied to
the relation between delinquency and utilization of unsecured credit. To clarify, the table
and the corresponding test of independence are provided below.
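The decision of whether to merge categories can be illustrated with a minimal sketch. It assumes the commonly described CHAID merge heuristic: compare pairs of categories with a chi-square test, merge the least significantly different pair if it is not significant, and repeat. The sketch only compares adjacent categories (treating the predictor as ordered), uses a hypothetical threshold named alpha_merge, and omits the Bonferroni adjustment, so it is not a full implementation of the Kass procedure. The counts are those of the root-node cross-tabulation of delinquency by utilunsec_cat, with the category labels 1 to 4 assumed to follow the column order.

```python
import math

def chi2_2x2(a, b):
    """Pearson chi-square (df = 1) for two categories, each given as
    (count of outcome 0, count of outcome 1)."""
    n_a, n_b = sum(a), sum(b)
    n = n_a + n_b
    chi2 = 0.0
    for k in range(2):
        outcome_total = a[k] + b[k]
        for observed, category_total in ((a[k], n_a), (b[k], n_b)):
            expected = category_total * outcome_total / n
            chi2 += (observed - expected) ** 2 / expected
    return chi2

def p_value_df1(chi2):
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(chi2 / 2.0))

def merge_adjacent(groups, alpha_merge=0.05):
    """Merge the least significantly different adjacent pair of categories
    until every adjacent pair differs at the alpha_merge level."""
    groups = [(label, tuple(counts)) for label, counts in groups]
    while len(groups) > 1:
        pvals = [p_value_df1(chi2_2x2(groups[i][1], groups[i + 1][1]))
                 for i in range(len(groups) - 1)]
        i = max(range(len(pvals)), key=pvals.__getitem__)
        if pvals[i] <= alpha_merge:
            break  # all adjacent pairs are significantly different
        (la, ca), (lb, cb) = groups[i], groups[i + 1]
        groups[i:i + 2] = [(la + "+" + lb, (ca[0] + cb[0], ca[1] + cb[1]))]
    return groups

# (not delinquent, delinquent) counts per category of utilunsec_cat
categories = [("1", (24560, 463)), ("2", (37384, 977)),
              ("3", (10916, 755)), ("4", (9450, 4479))]
merged = merge_adjacent(categories)
print([label for label, _ in merged])
```

With these counts every adjacent pair differs significantly, so the sketch keeps all four categories separate, in line with the four-way split of the root node.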
Delinquency by utilunsec_cat
Delinquency   utilunsec_cat=1   utilunsec_cat=2   utilunsec_cat=3   utilunsec_cat=4   Total
0                  24560             37384             10916              9450        82310
1                    463               977               755              4479         6674
Total              25023             38361             11671             13929        88984
Chi-Square Test
Statistic                      Value         df   Asymp. Sig.
Pearson Chi-Square             14730.98 (a)   3   0.000
Likelihood Ratio               10605.68           0.000
Linear-by-Linear Association   10663.85           0.000
N of Valid Cases               88984
a. 0 cells (0.0%) have expected count less than 5. The minimum expected count is 875.35.
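The Pearson chi-square statistic of this test of independence can be recomputed from the cell counts. The following minimal sketch does so in plain Python; the counts are taken from the cross-tabulation of delinquency by utilunsec_cat, the category keys 1 to 4 are assumed to follow the column order, and the function name is illustrative.

```python
# Observed counts per category of utilunsec_cat: (delinquency = 0, delinquency = 1)
observed = {
    1: (24560, 463),
    2: (37384, 977),
    3: (10916, 755),
    4: (9450, 4479),
}

def pearson_chi_square(table):
    """Return (chi-square, degrees of freedom) for a table given as
    {category: tuple of counts per outcome}."""
    columns = list(table.values())
    n_rows = len(columns[0])
    row_totals = [sum(col[r] for col in columns) for r in range(n_rows)]
    grand_total = sum(row_totals)
    chi2 = 0.0
    for col in columns:
        col_total = sum(col)
        for r, obs in enumerate(col):
            # Expected count under independence of row and column
            expected = row_totals[r] * col_total / grand_total
            chi2 += (obs - expected) ** 2 / expected
    df = (n_rows - 1) * (len(columns) - 1)
    return chi2, df

chi2, df = pearson_chi_square(observed)
print(round(chi2, 2), df)
```

The result should agree closely with the reported Pearson chi-square of 14730.98 with 3 degrees of freedom. The smallest expected count, 6674 x 11671 / 88984, is about 875.35, which matches the footnote of the test output.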
[Figure: Observed proportion of delinquents (vertical axis, 0.0% to 20.0%) by utilization of unsecured credit (< 0.03, 0.03 - 0.32, 0.32 - 0.56, > 0.56).]
The chart shows that delinquency does not increase uniformly with utilization of
unsecured credit: there is not much difference in delinquency between the 2nd and 3rd
categories, whereas a uniform (linear) relationship would show a steady increase across the
levels of the variable. CHAID is appropriate for this situation as it does not assume such a
relationship. Techniques like logistic regression would underestimate the impact of the
variable on delinquency. In such a scenario, it may not make sense to use the variable in
those techniques without processing.
Another critical advantage of CHAID, as the name indicates, is that it brings out the
interactions between variables most vividly. It easily handles interactions, which can
cause difficulty for other techniques. An interaction is a situation in which the effect of
an independent variable depends on another variable. The chart below illustrates an instance
of such an interaction. It shows the impact of utilization of unsecured credit on
delinquency at two levels of age. For the age group < 41 years, delinquency decreases
uniformly with increase in utilization. However, the higher age group (52 - 63) shows a
CHAID vs CART
Under the broad umbrella of tree-structured data analysis, there is another
technique similar to CHAID called CART (Classification and Regression Trees). It is
based on the method proposed by Breiman et al (1984) and can predict a quantitative
(continuous) variable. Remember, CHAID requires the dependent variable to be
categorical in nature.
The algorithm performs binary splitting. It starts by sorting all n cases on the
predictor variable (if it is quantitative) and examines the n-1 ways to split the cases
into two. For each split, the algorithm computes the within-cluster sum of squares
about the mean of the dependent variable. Out of all the splits compared, it
chooses the best split to represent the predictor's contribution. This process is
repeated for all the other predictors. The final split is chosen based on the
predictor and the cut point which yield the smallest overall within-cluster sum of
squares.
If the predictor variable is categorical, a different approach is adopted. Since
categories cannot be sorted, all possible splits between categories have to be
considered. For dividing k categories into two groups, there are 2^(k-1) - 1 possible
splits. Each split is assessed based on the within-cluster sum of squares, as for a
quantitative predictor. A regression tree is similar to regression/ANOVA modeling, in
which the dependent variable is quantitative and model evaluation is based on the sum
of squared errors; hence the name regression trees.
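The split search for a quantitative predictor can be sketched as follows. This is a minimal illustration of the idea described above (sort the cases, try each cut between neighbouring values, keep the cut with the smallest total within-group sum of squares), not the full CART algorithm. The function names and the toy data are made up for the example, and placing the cut point midway between two neighbouring values is an assumption.

```python
def within_ss(values):
    """Sum of squared deviations of the values about their mean."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def best_binary_split(x, y):
    """Return (cut point, total within-group sum of squares) of the best
    binary split of the cases on the quantitative predictor x."""
    pairs = sorted(zip(x, y))
    best = None
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # tied predictor values cannot be separated
        left = [target for _, target in pairs[:i]]
        right = [target for _, target in pairs[i:]]
        total = within_ss(left) + within_ss(right)
        cut = (pairs[i - 1][0] + pairs[i][0]) / 2
        if best is None or total < best[1]:
            best = (cut, total)
    return best

def n_binary_partitions(k):
    """Number of ways to divide k categories into two non-empty groups."""
    return 2 ** (k - 1) - 1

# Toy data (illustrative only): the target jumps between x = 3 and x = 10.
x = [1, 2, 3, 10, 11, 12]
y = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8]
cut, total = best_binary_split(x, y)
print(cut, n_binary_partitions(4))  # 6.5 7
```

For a categorical predictor with 4 categories, the 7 candidate splits would each be scored with the same within-group sum of squares.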
The following is a summary of the comparison between CHAID and CART.
-
contrasting picture. For this age group, the delinquency rate is similar at the initial
levels of utilization, while it drops and stays the same at higher levels of utilization.
Hence, when we examine the relationship between utilization and delinquency, age is an
interaction variable. CHAID lays out all the significant interactions between variables and
helps to understand the dynamics.
CHAID could be the technique of choice when there are serious interactions between
variables. Although interactions can be handled in linear models by multiplying the
variables, the ability of such models to capture multiple interactions is limited. Even
when it is not the final model, CHAID is used as a tool to understand the important
variables and the interactions between them. This information is then used for selecting
and processing variables for modeling.
[Figure: Proportion of delinquents (vertical axis, 0 to 0.09) by utilization of unsecured credit (< 0.03, 0.03 - 0.32, 0.32 - 0.56, > 0.56), plotted separately for Age < 41 and Age 52 - 63.]
Minimum size of a parent node:- This ensures that CHAID will not try to split a
node whose size is less than this value.
Minimum size of a child node:- This ensures that no node is smaller than this size.
These values depend on the sample size and the explanatory power of the independent
variables. The usual practice is to take an output based on the default values of the tool
and then fine-tune these values to get the best output.
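A minimal sketch of how the two size limits could act as a stopping check is given below. The function name, parameter names and threshold values are hypothetical and do not correspond to the defaults of any particular tool; the node sizes are those of the campaign tree earlier in this chapter.

```python
def split_allowed(parent_size, child_sizes, min_parent, min_child):
    """Return True if a candidate split respects both size limits.

    min_parent: a node smaller than this is not split at all.
    min_child: every node produced by the split must be at least this large.
    """
    if parent_size < min_parent:
        return False
    return all(size >= min_child for size in child_sizes)

# Campaign tree: the 4000-customer node split into nodes of 500, 2000 and 1500.
print(split_allowed(4000, [500, 2000, 1500], min_parent=1000, min_child=500))  # True
print(split_allowed(4000, [500, 2000, 1500], min_parent=1000, min_child=600))  # False
```

Raising either limit gives a smaller tree with larger nodes, which is why these values are usually tuned after a first run.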
Strengths and Weaknesses of CHAID
Strengths
- It can handle qualitative and quantitative information with ease.
- The rule-like result is intuitive, and a decision maker can cross-check it with his domain knowledge.
- It vividly brings out the interaction between variables. This information is helpful even if CHAID is an intermediate analysis.
Weaknesses
- It bundles observations and attaches the same probability to everybody in the node. This is a serious weakness in decision-making situations, as there is no discrimination between observations in the node.
- It requires a large sample, as nodes and leaves require enough sample size to be significant.
- Continuous variables have to be binned into categories before the CHAID algorithm can be applied (most CHAID packages now have a built-in algorithm to take care of this).
                      Predicted Class
                      Yes      No
Actual Class   Yes    fYY      fYN
               No     fNY      fNN
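An error rate can be derived from such a table as the share of cases that fall in the off-diagonal cells. The sketch below assumes that the first subscript of each cell denotes the actual class and the second the predicted class, as in the layout of the table; the counts in the example are made up for illustration.

```python
def error_rate(f_yy, f_yn, f_ny, f_nn):
    """Proportion of misclassified cases in a 2 x 2 classification table.

    f_yn: actual Yes predicted as No; f_ny: actual No predicted as Yes.
    """
    total = f_yy + f_yn + f_ny + f_nn
    return (f_yn + f_ny) / total

# Illustrative counts only: 30 of 200 cases are misclassified.
print(error_rate(f_yy=40, f_yn=10, f_ny=20, f_nn=130))  # 0.15
```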
[Figure: Error rate (vertical axis, 0 to 0.6) against number of nodes (0 to 320) for the Training Sample and the Testing Sample.]
As the number of nodes increases, the error rate on the Training dataset keeps decreasing,
but on the Testing dataset it starts increasing after a certain level. With an increase in
the number of nodes, the complexity of the tree increases and it becomes too dependent on
the data. In other words, the leaf nodes increase so much that the tree fits the Training
data perfectly. It will even fit an error in the dataset. Obviously, the Testing dataset
will not mirror this, and its error rate may even increase. The tree becomes progressively
less generalizable, and this phenomenon is called overfitting.
As in the case of linear modeling, it is critical to divide the data into two and check for
overfitting before finalizing the tree.
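Dividing the data into two can be sketched as a random holdout split. The function name, the 30% test fraction and the fixed seed below are assumptions chosen for illustration; the tree would be grown on the training part and its error rate checked on the testing part.

```python
import random

def train_test_split(records, test_fraction=0.3, seed=42):
    """Randomly divide records into a training list and a testing list."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    shuffled = list(records)
    rng.shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

# Example with 5500 customer identifiers, as in the campaign data.
train, test = train_test_split(range(5500))
print(len(train), len(test))  # 3850 1650
```

If the error rate on the testing part is clearly higher than on the training part, the tree is likely too large and should be pruned or regrown with stricter size limits.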
4.4 APPLICATIONS
As explained above, CHAID is extremely versatile in its applications and usefulness.
Most of the applications are in the area of exploration, where the purpose is to
explore the data and identify important predictors and their interrelationships. Hence, it
forms a critical intermediate step before the final stage, which could be a mathematical
model. Dependence-based segmentation is another common application. In rare
situations, the CHAID output is used for implementation without any further follow-up
analysis. Let us review a few of the published applications of CHAID.
Moore et al (2010) conducted a study using CHAID to profile the private label apparel
consumer using demographic and behavioral predictors. It examined cross-shopping
behaviors among purchasers of private label apparel across the five top US private label
apparel retailers.
They examined various versions of the tree to describe the private label consumer.
Demographic indicators were more prominent than behavioral characteristics when
predicting private label purchase across the five retailers. In particular, household size
and income were the primary predictors of private label purchasing at retailers that are
positioned on price (Wal-Mart, Kohl's and JC Penney). Private label purchasers at
Target and Macy's indicated behavioral predictors as significant drivers of choice.
Target private label patrons indicated low levels of bargain shopping, while Macy's
patrons were frequent mall shoppers. The study concluded that private label purchasers
are more different than similar in terms of demographics and behaviors according to the
selected retail format, with several exceptions among the Wal-Mart, Kohl's and JC
Penney formats, which appear to share competitive space.
Galguera et al (2006) used CHAID to segment loyalty card holders based on
demographic data. The study found that variables like age, education, location of
residence and shopping frequency lead to a higher likelihood of possessing loyalty cards.
The segments formed could be used to better target loyalty programmes. This
cross-country approach showed the consistent nature of the predictors of loyalty card
possession.
Zuccaro (2010) assessed the structural characteristics (conceptual
utility) and predictive precision of the most popular classification and predictive
techniques employed in customer relationship management and customer scoring. The
study compared discriminant analysis, binary logistic regression, artificial neural
networks, the C5 algorithm, and regression trees employing Chi-squared Automatic
Interaction Detector (CHAID).
The study found that logistic regression provides easily interpretable parameters through
its logits. The logits can be interpreted in the same way as regression slopes. Additionally,
the logits can be converted to odds, providing a common-sense evaluation of the relative
importance of each independent variable. The technique provides robust statistical tests to
evaluate the model parameters. Both CHAID and the C5 algorithm provide visual tools
(regression tree) and semantic rules (rule set for classification) to facilitate the
interpretation of the model parameters. These are highly desirable properties when the
researcher attempts to explain the conceptual and operational foundations of the model.
Berry and Linoff (1997) provide one of the rare published instances of a CHAID output
being used for implementation. The case refers to a large business involved in printing.
The problem was the occurrence of lines on the drum of the roller used for
printing. Since the cause of this problem was not clear, the business collected a large
number of possible variables and let CHAID segment the cases based on the occurrence of
lines. Based on the result, the business put in place a number of controls, and the problem
(lines on the cylinder) started decreasing significantly.
Questions
1. An organization is troubled by increasing attrition, and the HR team was tasked with
identifying some remedies to stem it. The team collected data on all employees
(resigned and in service) who joined more than one year back
(DSCH07CHAIATTA.xls). The effort is at the exploratory stage, and the objective
at this stage is to understand the variables better and uncover any
interrelationships. The team plans to present the result as pictorially as possible.
2. A financial institution has been offering personal loans for many years. The firm is
always interested in controlling and reducing delinquency (missing two or more
consecutive payment obligations), as it has a direct bearing on profitability.
Currently the firm is evaluating the factors that influence delinquency so that it
can improve its approval process. The data (DSCH07CHAIDLQTC.xls) contains a
set of selected information that was considered at the time of application. It also
indicates whether the customer ever became delinquent. The institution would like to
understand the important influencing factors and the interactions between these factors. It
would like to present this pictorially to its senior management team. (Hint:- watch
out for variables that are used to create the delinquency variable.)
3. A well-known financial firm is launching a new product. It would like to target its
existing customers in the first round of marketing. As a first step, it chose 25000
customers at random from the existing pool and mailed them brochures. The
response to this campaign was recorded for a few months. The data is now available
along with service usage and some profile information (DSCH07CHAIRESW.xls).
The next step in the process is Exploratory Data Analysis. The firm is interested
in identifying the important variables and the interactions among them. It is also
interested in customer segments and their identification. (Hint:- This data has
missing values and outliers for some of the variables. The file also provides
typical values of each of the variables. You should treat the data before analysis.)
4. A major problem faced by telecom operators is customers switching
channels. They will be able to control this to some extent if there is prior
warning about the customers who are going to switch. Hence, the requirement is
to build a model to predict defection. In order to develop this, the business
collected usage data for a proportional sample of customers who are with the business
and customers who have defected (available in the dataset DSCH07CHAICHND.xls). The
next step is to conduct an Exploratory Data Analysis, and the objective is to
identify the important variables and their interrelationships. The business is also
interested in customer segments based on churn behavior.
Source:- https://en.wikipedia.org/wiki/Lending_Club
References