
CHAPTER 7

CHAID
(Chi-Square Automatic Interaction Detector)
As we discussed in the previous chapters, analytic problems can be broadly classified into two classes: those where the dependent variable is continuous and those where the dependent variable is nominal. In this chapter, we will discuss a technique that is useful for handling a nominal dependent variable. It broadly falls in the class of techniques called Decision Trees.
7.1 DECISION TREES
A decision tree is a graphical decision support tool. As the name indicates, it uses an inverted tree structure to represent the decision outcome. It is recursive in nature and starts with the entire available sample, which is called the root node. The objective of the analysis is to split the root node so that the resulting branches are similar within themselves but different from each other with respect to the target variable. It can handle quantitative and qualitative variables (e.g. gender, sales region, occupation) with no compromise. The dependent variable should be nominal with multiple categories, though in most real-life situations it is binary. The independent variables should also be categorical; hence, before processing, continuous variables should be categorized using appropriate logic. This was a requirement many years back; now most algorithms have a built-in module for categorizing continuous variables.
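As an illustration, the sketch below bins a continuous purchase variable into four categories with pandas before it is fed to a CHAID tool; the values and cut points are hypothetical (they simply mirror the purchase levels used in the campaign example that follows).

```python
# Hedged illustration: binning a continuous variable into categories before CHAID.
# The values and cut points are hypothetical.
import pandas as pd

annual_purchase = pd.Series([480, 900, 1600, 2750, 1100, 3200])

purchase_cat = pd.cut(
    annual_purchase,
    bins=[0, 625, 1250, 2500, float("inf")],   # cut points chosen by business logic
    labels=["level1", "level2", "level3", "level4"],
)
print(purchase_cat)
```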
Campaign Management:- Let us consider a simple application of a decision tree. It refers to a campaign executed by a large store to enroll its existing customers for an upgrade of their service. The campaign was addressed to all existing customers and the response information is available. The business would like to understand what is driving the response of the customers. This will help them understand their customers better. Moreover, it will help them target the next campaign so that the promotion cost is lower.
The data used consists of demographics, purchase information and the response to the campaign (yes/no). The dependent variable in this case is binary. The independent variables also should be categorical. For example, the average annual purchase was categorized into 4 levels (level1: < $625, level2: $625 - $1250, level3: $1250 - $2500, level4: > $2500). A total of 5500 customers were targeted and the entire lot is used for this analysis. The decision tree developed is given below.


All Customers: # = 5500, Response Rate = 12%
- Avg. Purchase < $1250: # = 4000, Response Rate = 7%
    - Distance < 2 km: # = 500, Response Rate = 25%
    - 2 km <= Distance < 10 km: # = 2000, Response Rate = 5%
    - Distance > 10 km: # = 1500, Response Rate = 2%
- Avg. Purchase >= $1250: # = 1500, Response Rate = 20%

Figure-7.1 Decision Tree for the Store Campaign

The set of all customers forms the root node. It shows that the campaign was addressed to 5500 customers and 12% responded. The next objective is to divide this group of customers, based on any of the available profile information, such that the response rates of the resulting groups are significantly different. The algorithm identified the level of average purchase as the best variable explaining the response. It also identified that the best option is to group the levels of purchase into two (it could have been more than two). The groups were formed based on whether the purchase is less than or at least $1250, and the two groups had significantly different response rates.
The next step is to treat each of these two groups as a root node and analyze it against each of the profile variables to identify the best variable for splitting it further. For the first node (< $1250), distance to residence was found to be the variable that best explained the response. For customers with purchase >= $1250, no variable improved the split further, so tree growing was terminated at this node. Those who stay within 2 km of the store responded eagerly compared to those who stayed farther away. The algorithm also made a distinction between those who stayed within 10 km and beyond 10 km; although the response rates were low, there is a significant difference between the three groups. The segment of customers with annual purchase less than $1250 and staying within 2 km is the best, with a 25% response rate, while the segment with purchase less than $1250 and staying more than 10 km away is the worst, with just a 2% response rate. This example illustrates the application of decision trees for data analysis. The most important feature of a tree diagram is its visual impact.

In addition to profiling customers, the tree is a predictive tool too. For example, it predicts a 20% chance of response if the customer's annual purchase is more than $1250, and a 25% chance of response if the customer buys less than $1250 annually and stays within 2 km. For the business, this is valuable information, as it has uncovered that the amount of purchase and the distance of residence play a role in the response. They can use this information to select customers for the next campaign instead of resorting to carpet bombing. Most likely the store will target only these two groups for campaigns next year.
Let us clarify some terminology used in decision trees.
Root node - has no incoming edge; all other nodes grow from it.
Internal node - has one incoming edge and one or more outgoing edges.
Leaf node (terminal node) - has only an incoming edge and no outgoing edges.
The rows of nodes, excluding the root node, are counted to get the number of levels. Hence, this exercise has 2 levels.
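To make the terminology concrete, here is a minimal, hypothetical sketch of such a tree structure in Python; the class, the node names and the depth function are illustrative only and not part of any CHAID tool.

```python
# Illustrative sketch: a node with no parent is the root, a node with children is
# an internal node, and a node without children is a leaf.
from dataclasses import dataclass, field

@dataclass
class Node:
    name: str
    children: list = field(default_factory=list)

def levels(node: Node) -> int:
    """Number of levels below this node (the root itself is not counted)."""
    return 0 if not node.children else 1 + max(levels(child) for child in node.children)

leaves = [Node("Distance low"), Node("Distance high")]
root = Node("All customers", [Node("Purchase low", leaves), Node("Purchase high")])
print(levels(root))   # 2 levels, as in the campaign example
```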

[Figure: schematic tree with the root node (Customer Response, Yes/No) at the top, internal nodes splitting on Average Purchase (Low / High) and then on Distance (Low / High), and the leaf nodes at the bottom.]

Figure-7.2 Decision Tree for Predicting Customer Response


A decision tree is strongly linked to rule induction too. Each leaf can be converted into a rule by tracing the path from the root node. For example, the rule related to the first leaf is: if a customer has an annual purchase amount less than $1250 and stays less than 2 km from the store, then there is a 25% chance that she will respond to the campaign. This means that a decision maker can cross-check the result against his domain knowledge and validate it.
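As a small illustration, the leaves of the campaign tree can be written directly as rules in code. This is only a hand-coded sketch of the tree in Figure 7.1 (the function name and the exact boundary handling are mine), not the output of a CHAID tool.

```python
# Hand-coded rules read off the campaign decision tree (Figure 7.1).
def campaign_response_rate(annual_purchase: float, distance_km: float) -> float:
    """Predicted probability of responding to the campaign."""
    if annual_purchase >= 1250:
        return 0.20            # purchase >= $1250
    if distance_km < 2:
        return 0.25            # purchase < $1250 and within 2 km of the store
    if distance_km < 10:
        return 0.05            # purchase < $1250, 2 km <= distance < 10 km
    return 0.02                # purchase < $1250, more than 10 km away

print(campaign_response_rate(900, 1.5))   # 0.25 -> the best segment to target
```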
Decision trees are extremely useful for uncovering the structure of the data, and they are quite visual in nature. The output of a decision tree is intuitive and can be used to communicate with non-technical individuals. Hence, it is quite common to use it at the exploratory stage of studies with a nominal dependent variable. In analytics, the most commonly used decision tree algorithm is CHAID.

CHAID
CHAID stands for Chi-squared Automatic Interaction Detector. It was originally proposed by Kass (1980) for a nominal dependent variable and was extended to an ordinal dependent variable by Magidson (1993), who illustrated how this extension could be used to take advantage of known facts like the profitability of each category of the dependent variable. The algorithm incorporates sequential merging and splitting based on a test statistic. It uses the chi-square test of independence to determine the best next split at each step, hence the name. The sequential steps are as follows.
- Tabulate the dependent variable against an independent variable (assume k categories). This would result in a large number of tables, as the categories of the independent variable can be merged in many combinations. In order to reduce the computing load, Kass suggested the following approach.


- Take all possible pairs of categories of the independent variable and tabulate each pair against the dependent variable. For a binary dependent variable, each pair results in a 2x2 table. Find the table with the least significant chi-square value and merge that pair. Continue this merging until no table is insignificant (a sketch of this merging step is given after this list).
- Repeat this process for each of the independent variables. Choose the independent variable with the most significant chi-square as the first variable to split the sample. The merged categories of this variable are used for splitting the sample. Each of the resulting sub-samples is then subjected to the same process.
- Continue this process until no chi-square test is significant.
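The following is a minimal sketch of the pairwise category-merging step for a single predictor, assuming a pandas DataFrame and scipy's chi-square test of independence. The column names, the alpha level and the simplifications (no Bonferroni adjustment, no re-splitting of merged groups) are assumptions made for illustration; this is not the exact Kass (1980) procedure.

```python
# Sketch of CHAID-style category merging for one predictor (simplified).
from itertools import combinations

import pandas as pd
from scipy.stats import chi2_contingency

def merge_categories(df: pd.DataFrame, predictor: str, target: str, alpha: float = 0.05):
    """Merge categories of `predictor` until every remaining pair of groups
    differs significantly with respect to `target` (p < alpha)."""
    groups = {cat: {cat} for cat in df[predictor].unique()}   # each category starts alone
    while len(groups) > 1:
        worst_pair, worst_p = None, alpha
        for g1, g2 in combinations(groups, 2):
            sub = df[df[predictor].isin(groups[g1] | groups[g2])]
            labels = sub[predictor].map(lambda c: g1 if c in groups[g1] else g2)
            table = pd.crosstab(labels, sub[target])          # 2 x k contingency table
            _, p, _, _ = chi2_contingency(table)
            if p > worst_p:                                   # least significant pair so far
                worst_pair, worst_p = (g1, g2), p
        if worst_pair is None:                                # every pair is significant
            break
        keep, drop = worst_pair
        groups[keep] |= groups.pop(drop)                      # merge the two groups
    return groups
```

The returned dictionary maps a representative category to the set of original categories merged with it.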

CHAID vs Cluster Analysis


It is quite natural to confuse Cluster analysis with CHAID, as both result in segments of entities at the end of the analysis. The most critical difference is that CHAID has a dependent variable (and independent variables), while Cluster analysis has no dependent variable (only variables).
Hence, CHAID uses a dependent variable and one or more independent variables. The dependent variable is nominal with multiple categories (usually binary), while the independent variables can be continuous or categorical. For example, if our objective is to segment customers based on age and income such that the segments are very different in terms of delinquency (a binary variable - delinquent or not), the technique is CHAID.
Cluster analysis does not have a dependent variable, and the clusters are formed based on all the variables. For example, if you want to segment customers based on Age and Income so that the segments are homogeneous, then the technique is Cluster analysis. In this case the segments will be homogeneous in terms of Age and Income and heterogeneous between themselves. Refer to Chulis (2002) and Tsiptsis and Chorianopoulos (2009) for a formal coverage of this.
Chulis, K. (2002), Optimal segmentation approach and application, http://www.ibm.com/developerworks/library/ba-optimal-segmentation/, accessed on 28/Oct/2014.
Tsiptsis, K. and Chorianopoulos, A. (2009), Data Mining Techniques in CRM: Inside Customer Segmentation, John Wiley & Sons, Ltd.
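A hedged sketch of this contrast is given below: KMeans clusters on Age and Income alone, while a supervised tree (a CART tree from scikit-learn standing in for CHAID, since CHAID itself is not part of scikit-learn) segments on the same variables but is guided by the delinquency target. The data, column names and parameter choices are made up for illustration.

```python
# Contrast: unsupervised clustering vs. dependence-based (target-driven) segmentation.
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "age": rng.integers(21, 70, 1000),
    "income": rng.normal(50000, 15000, 1000),
})
df["delinquent"] = (rng.random(1000) < 0.1).astype(int)   # synthetic binary target

# Cluster analysis: segments homogeneous in Age and Income; the target plays no role.
df["cluster"] = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(df[["age", "income"]])

# Dependence-based segmentation (CART as a stand-in for CHAID): segments are chosen
# so that delinquency rates differ between them.
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(df[["age", "income"]], df["delinquent"])
df["segment"] = tree.apply(df[["age", "income"]])          # leaf id for each customer

print(df.groupby("cluster")["delinquent"].mean())
print(df.groupby("segment")["delinquent"].mean())
```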


Example - Examining Delinquency:- This process can be understood well by examining a sample CHAID output. The situation concerns delinquency among the credit customers of a bank. A decision tree was built using CHAID to understand the critical variables and the interactions. Various factors like income, age, utilization of unsecured credit, number of dependent members in the family, occupation etc. were considered for the analysis. Continuous variables like Age, Income and Utilization of unsecured credit were categorized into 4 categories before the analysis. A portion of the output is provided below for illustration.
The root node shows the overall profile of the entire data: there were 6674 delinquents (7.5%) and 82310 non-delinquents (92.5%).
Node 0 (entire sample)
  Category              #        %
  Delinquent (1)      6674     7.50
  Not delinquent (0) 82310    92.50
  Total              88984   100.00

Split on utilunsec_cat: Chi-square = 14730 (df = 3, p = 0.000)
  Node  utilunsec_cat  Delinquent (1)   Not delinquent (0)   Total (% of sample)
  1     4              4479 (17.90%)     9450 (82.10%)       13929 (25.00%)
  2     2               755 (6.50%)     10916 (93.50%)       11671 (13.10%)
  3     1               977 (2.50%)     37384 (97.50%)       38361 (48.10%)
  4     3               463 (6.50%)     24560 (93.50%)       25023 (28.10%)

Node 1 further split on age_cat: Chi-square = 1156 (df = 3, p = 0.000)
  Node  age_cat  Delinquent (1)   Not delinquent (0)   Total (% of sample)
  5     1        1930 (49.10%)    2001 (50.90%)        3931 (4.40%)
  6     2         831 (23.10%)    2773 (76.90%)        3604 (4.10%)
  7     3        1446 (36.10%)    2556 (63.90%)        4002 (4.50%)
  8     4         272 (11.40%)    2120 (88.60%)        2392 (2.70%)

This root node is further split into four nodes based on the utilization of unsecured credit (utilunsec_cat). The algorithm did not merge any categories of this variable, as the chi-square was highest when no categories were merged. The chi-square result for the root-node split shows a value of 14730 with a corresponding probability of 0.000. This is the result of a test of independence applied to the relationship between delinquency and the utilization of unsecured credit. To clarify, the cross-tabulation and the corresponding test of independence are provided below.


Delinquency * Utilization of Unsecured Credit (utilunsec_cat)

                          utilunsec_cat
  Delinquency                1        2        3        4     Total
  Delinquent (1)           977      755      463     4479      6674
  Not delinquent (0)     37384    10916    24560     9450     82310
  Total                  38361    11671    25023    13929     88984

Chi-Square Tests
  Statistic                          Value    df   Asymp. Sig.
  Pearson Chi-Square             14730.98(a)   3        0.000
  Likelihood Ratio               10605.68      3        0.000
  Linear-by-Linear Association   10663.85      1        0.000
  N of Valid Cases                  88984
  (a) 0 cells (0.0%) have expected count less than 5. The minimum expected count is 875.35.
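As a quick check, the Pearson chi-square above can be reproduced from the cross-tabulated counts. The sketch below uses scipy with the counts from the table; only the library call and layout are mine.

```python
# Reproducing the test of independence between delinquency and utilunsec_cat.
from scipy.stats import chi2_contingency

observed = [
    [977, 755, 463, 4479],         # delinquent (1) for utilunsec_cat = 1, 2, 3, 4
    [37384, 10916, 24560, 9450],   # not delinquent (0) for utilunsec_cat = 1, 2, 3, 4
]
chi2, p, dof, expected = chi2_contingency(observed)
print(f"Chi-square = {chi2:.2f}, df = {dof}, p = {p:.4f}")   # ~14730.98, df = 3
print(f"Minimum expected count = {expected.min():.2f}")      # ~875.35, as in the footnote
```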

Each of the nodes 1 to 4 belongs to a category of utilization of unsecured credit. The delinquency rate is different for each of these categories, intuitively supporting why this variable is the most important in explaining delinquency. The next step in the process is to consider each of these nodes as a root node and look for the most important variable. The result shows that Age is the best variable for Node 1; accordingly, the chi-square shown there is the result of the test of independence between delinquency and Age within Node 1. For Age too, the algorithm did not merge any of the categories. Hence, there are 4 sub-nodes under Node 1.
The result demonstrates the advantages of CHAID over other methods. It does not assume a monotonic relationship between the independent and dependent variables, and it can handle non-linearity and irregularity without any problem. An example is the utilization of unsecured credit: as utilization increases, the default rate increases, but not monotonically. The figure below shows the observed relationship and the expected relationship if it were monotonic.


[Figure: proportion of delinquents (0% - 20%) by utilization of unsecured credit category (< 0.03, 0.03 - 0.32, 0.32 - 0.56, > 0.56), showing the observed rates against the expected rates if the relationship were monotonic.]

The chart shows that delinquency does not vary monotonically with the utilization of unsecured credit; there is not much difference in delinquency between the 2nd and 3rd categories, whereas a monotonic relationship would imply a steady increase across the levels of the variable. CHAID is appropriate for this situation as it does not make that assumption. Techniques like logistic regression would underestimate the impact of the variable on delinquency; in such a scenario, it may not make sense to use the variable without processing.

CHAID is an Algorithm, not a Technique


It is important to keep in mind that CHAID is an algorithm, unlike correlation, regression etc., which are statistical techniques. An algorithm is a set of steps to get an output from given inputs. Although the basic principles are similar, various tools have implemented it with minor differences, especially in the logic behind the categorization of continuous variables and the optimization of the splits. Hence, it is no wonder that CHAID results are not exactly the same across tools. This can be disconcerting to an analyst who is used to getting the same result from a statistical technique irrespective of the tool.
CHAID provides many parameters that an analyst can adjust to get the best result, such as the number of levels and the minimum or maximum number of observations in each node or leaf. The choice of these can also lead to different results even within the same tool.


Another critical advantage of CHAID, as the name indicates, is that it brings out the interactions between variables most vividly. It handles interactions easily, which can cause difficulty for other techniques. An interaction is a situation in which the behavior of an independent variable depends on another variable. The chart shown after the CHAID vs CART box illustrates an instance of such an interaction: it shows the impact of utilization of unsecured credit on delinquency at two levels of age. For the age group below 41 years, the impact on delinquency decreases uniformly with increasing utilization, whereas the higher age group (52-63) shows a contrasting picture: the delinquency rate is similar at the initial levels of utilization, then drops and stays the same at higher levels. Hence, when we examine the relationship between utilization and delinquency, Age is an interaction variable. CHAID lays out all the significant interactions between variables and helps us understand these dynamics.
CHAID vs CART
Under the broad umbrella of tree-structured data analysis, there is another technique called CART (Classification and Regression Trees), similar to CHAID and based on the method proposed by Breiman et al (1984). Here we consider its use for predicting a quantitative (continuous) variable, that is, regression trees; remember that CHAID requires the dependent variable to be categorical in nature.
The algorithm performs binary splitting. It starts by sorting all n cases on a predictor variable (if it is quantitative) and examines the n-1 ways to split the cases into two groups. For each split, the algorithm computes the within-cluster sum of squares about the mean of the dependent variable. Out of all the splits compared, it chooses the best one to represent that predictor's contribution. This process is repeated for all the other predictors, and the final split is based on the predictor and cut point that yield the smallest overall within-cluster sum of squares.
If the predictor variable is categorical, a different approach is adopted. Since categories cannot be sorted, all possible splits of the categories into two groups have to be considered; for k categories there are 2^(k-1) - 1 possible splits. Each split is assessed using the within-cluster sum of squares, as for a quantitative predictor. Regression trees are similar to regression/ANOVA modeling, in which the dependent variable is quantitative and model evaluation is based on the sum of squared errors; hence the name regression trees.
The comparison between CHAID and CART can be summarized as follows.
- CHAID requires a categorical dependent variable, while regression trees (the CART variant discussed here) require a quantitative (continuous) dependent variable.
- CHAID searches for multi-way splits, while CART performs only binary splits.
- CHAID uses the p-value from a significance test to measure the desirability of a split, while CART uses the reduction of an impurity measure.
- CHAID uses a forward stopping rule to grow a tree, while CART deliberately overfits and uses validation data to prune back (for details of the last two differences, refer to Breiman et al (1984)).

Breiman, L., Friedman, J. H., Olshen, R. A. and Stone, C. J. (1984), Classification and Regression Trees, Wadsworth, Belmont.
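The split search described in the box can be sketched in a few lines; the function below scans the n-1 candidate cut points of one quantitative predictor and picks the one that minimizes the within-cluster sum of squares. Variable names and the toy data are illustrative.

```python
# Sketch of the regression-tree split search for one quantitative predictor.
import numpy as np

def best_split(x, y):
    """Return (cut_point, within_cluster_SS) for the best binary split on x."""
    order = np.argsort(x)
    x, y = np.asarray(x, float)[order], np.asarray(y, float)[order]
    best_cut, best_ss = None, np.inf
    for i in range(1, len(x)):                      # n - 1 candidate splits
        left, right = y[:i], y[i:]
        ss = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if ss < best_ss:
            best_cut, best_ss = (x[i - 1] + x[i]) / 2, ss
    return best_cut, best_ss

x = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
y = [5.0, 6.0, 5.5, 20.0, 21.0, 19.5]
print(best_split(x, y))                             # cut point around 6.5
```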


CHAID could be the technique of choice when there are serious interactions between variables. Although interactions can be handled in linear models by multiplying variables, the ability of such models to capture multiple interactions is limited. Even when it is not the final model, CHAID is used as a tool to understand the important variables and the interactions between them; this information is then used for selecting and processing variables for modeling.
[Figure: proportion of delinquents (0 - 0.09) by utilization of unsecured credit (< 0.03, 0.03 - 0.32, 0.32 - 0.56, > 0.56) for two age groups, Age < 41 and Age 52 - 63, illustrating the interaction between age and utilization.]

7.2 CONTROLLING COMPLEXITY OR PRUNING TREES
If no control is applied, the tree can keep growing and, theoretically, there could be as many leaves as observations. Such a tree would obviously be too large to have any practical use. Hence, various options are provided to help the analyst control the complexity of the tree; the availability of these options varies by the tool being used.
Maximum p-value for a split:- This refers to the significance level of the chi-square test. If the p-value is above this threshold, the node is not split. It is usually around 0.05.
Maximum number of levels:- This refers to the number of rows in the decision tree and can be pre-specified. It is rarely more than 4.


Minimum size of a parent node:- This ensures that CHAID will not try to split a node smaller than this size.
Minimum size of a child node:- This ensures that no resulting node is smaller than this size.
These values depend on the sample size and the explanatory power of the independent variables. The usual practice is to take an output based on the tool's default values and then fine-tune these values to get the best output.
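As a hedged illustration of these controls, the snippet below sets the analogous stopping parameters on a scikit-learn CART tree (CHAID itself is not in scikit-learn); the significance-level control for a split is specific to CHAID implementations and has no direct equivalent here. The dataset is synthetic.

```python
# Analogues of the CHAID stopping controls on a CART tree (illustrative only).
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=6, random_state=0)

tree = DecisionTreeClassifier(
    max_depth=4,            # maximum number of levels
    min_samples_split=100,  # minimum size of a parent node before a split is attempted
    min_samples_leaf=50,    # minimum size of any child node
    random_state=0,
)
tree.fit(X, y)
print(tree.get_n_leaves(), "leaves in the fitted tree")
```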
Strengths and Weaknesses of CHAID
Strengths:
- It can handle qualitative and quantitative information with ease.
- The rule-like result is intuitive, and a decision maker can cross-check it with his domain knowledge.
- It vividly brings out the interactions between variables. This information is helpful even if CHAID is only an intermediate analysis.
Weaknesses:
- It bundles observations and attaches the same probability to everybody in a node. This is a serious weakness in decision-making situations, as there is no discrimination between the observations within a node.
- It requires a large sample, as the nodes and leaves need enough observations to be significant.
- Continuous variables have to be binned into categories before the CHAID algorithm can be applied (most CHAID packages now have a built-in algorithm to take care of this).

7.3 EVALUATING THE CHAID RESULT
A common method of evaluating a CHAID result is to construct the confusion matrix. This matrix is constructed by classifying all observations into the appropriate classes based on the proportions of the categories in each leaf. For classification, we adopt a simple rule: if the proportion of "Yes" in a leaf is 50% or more, then all members of that leaf are classified as "Yes". Using this rule it is possible to classify each observation as predicted "Yes" or "No" and to construct a table connecting the actual and predicted classes, as below.


                        Predicted Class
                          Yes      No
  Actual Class   Yes      fYY      fYN
                 No       fNY      fNN

Table-7.5 Confusion Matrix for a 2-class Problem


Here fYY is the number of observations that are actually "Yes" and predicted as "Yes", and fNN is the number that are actually "No" and predicted as "No"; these are correct predictions. fYN is the number of observations that are actually "Yes" but predicted as "No", an incorrect prediction (and similarly for fNY). These numbers are brought together into a single measure of accuracy: the ratio of the total correct predictions to the total number of observations.
Accuracy = (fYY + fNN) / (fYY + fNN + fYN + fNY)
Most classification algorithms provide a tree that maximizes accuracy.
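A minimal sketch of this calculation with pandas is given below; the leaf labels and observations are made up, and the >= 50% rule from the text is applied to classify each leaf.

```python
# Leaf-based classification, confusion matrix and accuracy (hypothetical data).
import pandas as pd

df = pd.DataFrame({
    "leaf":   ["A", "A", "A", "B", "B", "B", "B", "C", "C", "C"],
    "actual": ["Yes", "Yes", "No", "No", "No", "No", "Yes", "Yes", "No", "No"],
})

# Proportion of "Yes" in each leaf; classify the whole leaf as "Yes" if it is >= 50%.
leaf_yes_rate = df.groupby("leaf")["actual"].apply(lambda s: (s == "Yes").mean())
df["predicted"] = df["leaf"].map(lambda leaf: "Yes" if leaf_yes_rate[leaf] >= 0.5 else "No")

# Confusion matrix (Table 7.5) and accuracy = (fYY + fNN) / total.
cm = pd.crosstab(df["actual"], df["predicted"], rownames=["Actual"], colnames=["Predicted"])
accuracy = (df["actual"] == df["predicted"]).mean()
print(cm)
print(f"Accuracy = {accuracy:.2f}")
```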

7.4 AVOIDING OVERFITTING
As in the case of linear regression, overfitting can be an issue for CHAID too, and appropriate measures should be taken to control it. The most common approach is to divide the sample into Training and Testing datasets and compare the performance on the two. The rationale is that a good tree should classify observations correctly in the Training sample as well as in the Testing sample, which the tree has not been exposed to. The figure below shows that when the tree is small the error rates of the Training and Testing datasets are comparable but high. This is underfitting: when the number of nodes is small, the tree has not yet learned the intricacies of the data structure and its predictive ability is correspondingly low.


[Figure: error rate (0 - 0.6) against the number of nodes (0 - 320) for the Training and Testing samples.]

Figure 7.4 Training and Testing Error Rates

As the number of nodes increases, the error rate on the Training dataset keeps decreasing, but the error rate on the Testing dataset starts increasing after a certain point. As the number of nodes grows, the complexity of the tree increases and it becomes too dependent on the training data; in other words, the leaf nodes multiply until the tree fits the data perfectly, even fitting the errors in the dataset. The Testing dataset will not mirror this, and its error rate may even increase. The tree becomes progressively less generalizable, a phenomenon called overfitting.
As in the case of linear modeling, it is critical to divide the data into two parts and check for overfitting before finalizing the tree.
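The train/test comparison itself is easy to sketch. CHAID is not available in scikit-learn, so the snippet below grows CART trees of increasing size purely to show how the training error keeps falling while the testing error stops improving; the dataset is synthetic and the leaf counts are arbitrary.

```python
# Training vs. testing error as tree complexity grows (illustrative CART stand-in).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=5000, n_features=10, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

for max_leaf_nodes in (5, 20, 80, 320):
    tree = DecisionTreeClassifier(max_leaf_nodes=max_leaf_nodes, random_state=1)
    tree.fit(X_train, y_train)
    train_err = 1 - tree.score(X_train, y_train)
    test_err = 1 - tree.score(X_test, y_test)
    print(f"leaves={max_leaf_nodes:4d}  training error={train_err:.3f}  testing error={test_err:.3f}")
```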

7.5 APPLICATIONS
As explained above, CHAID is extremely versatile in its applications and usefulness. Broadly, most applications are exploratory, where the purpose is to explore the data and identify the important predictors and their interrelationships; this forms a critical intermediate step before the final stage, which could be a mathematical model. Dependence-based segmentation is another common application. In rare situations, the CHAID output is used for implementation without any further follow-up analysis. Let us review a few published applications of CHAID.
Moore and Carpenter (2010) conducted a study using CHAID to profile the private label apparel consumer using demographic and behavioral predictors. It examined cross-shopping behaviors among purchasers of private label apparel across the five top US private label apparel retailers.

They examined various versions of the tree to describe the private label consumer. Demographic indicators were more prominent than behavioral characteristics in predicting private label purchase across the five retailers. In particular, household size and income were the primary predictors of private label purchasing at the retailers positioned on price (Wal-Mart, Kohl's, and JC Penney). Private label purchasers at Target and Macy's indicated behavioral predictors as significant drivers of choice: Target private label patrons indicated low levels of bargain shopping, while Macy's patrons were frequent mall shoppers. The study concluded that private label purchasers are more different than similar in terms of demographics and behaviors depending on the retail format, with several exceptions among the Wal-Mart, Kohl's and JC Penney formats, which appear to share competitive space.
Galguera et al (2006) used CHAID to segment loyalty card holders based on demographic data. The study found that variables like age, education, location of residence and shopping frequency lead to a higher likelihood of possessing loyalty cards. The segments formed could be used as better targets for loyalty programmes. This cross-country approach showed the consistent nature of the predictors of loyalty card possession.
Zuccaro (2010) assessed the structural characteristics (conceptual utility) and predictive precision of the most popular classification and predictive techniques employed in customer relationship management and customer scoring. The study compared discriminant analysis, binary logistic regression, artificial neural networks, the C5 algorithm, and regression trees employing the Chi-squared Automatic Interaction Detector (CHAID).
The study found that logistic regression provides easily interpretable parameters through its logits, which can be interpreted in the same way as regression slopes. Additionally, the logits can be converted to odds, providing a common-sense evaluation of the relative importance of each independent variable, and the technique provides robust statistical tests to evaluate the model parameters. Both CHAID and the C5 algorithm provide visual tools (the tree) and semantic rules (a rule set for classification) that facilitate the interpretation of the model parameters. These are highly desirable properties when the researcher attempts to explain the conceptual and operational foundations of the model.
Berry and Linoff (1997) provide one of the rare published instances of a CHAID output being used directly for implementation. The case refers to a large printing business. The problem was the occurrence of lines on the drum of the roller used for printing. Since the cause of the problem was not clear, the business collected a large number of possible variables and let CHAID segment the cases based on the occurrence of lines. Based on the result, the business put in place a number of controls, and the problem (lines on the cylinder) started decreasing significantly.


Questions
1. An organization is troubled by increasing attrition, and the HR team was tasked with identifying remedies to stem it. The team collected data on all employees (resigned and in service) who joined more than one year back (DSCH07CHAIATTA.xls). The effort is at the exploratory stage, and the objective at this stage is to understand the variables better and uncover any interrelationships. The team plans to present the result as pictorially as possible.

2. A financial institution has been offering personal loans for many years. The firm is always interested in controlling and reducing delinquency (missing two or more consecutive payment obligations), as it has a direct bearing on profitability. Currently the firm is evaluating the factors that influence delinquency so that it can improve the approval process. The data (DSCH07CHAIDLQTC.xls) contains a set of selected information that was considered at the time of application, along with whether the customer ever became delinquent. The institution would like to understand the important influencing factors and the interactions among them, and to present this pictorially to its senior management team. (Hint:- watch out for variables that are used to create the delinquency variable.)

3. A well-known financial firm is launching a new product. It would like to target its existing customers in the first round of marketing. As a first step, it chose 25000 customers at random from the existing pool and mailed them brochures. The response to this campaign was recorded for a few months. The data is now available along with service usage and a few profile variables (DSCH07CHAIRESW.xls). The next step in the process is Exploratory Data Analysis. The firm is interested in identifying the important variables and the interactions among them. It is also interested in identifying customer segments. (Hint:- this data has missing values and outliers for some of the variables. The file also provides typical values of each of the variables; you should treat the data before analysis.)
4. A major problem faced by telecom operators is customers switching to other operators. They would be able to control this to some extent if there were a prior warning about the customers who are going to switch. Hence, the requirement is to build a model to predict defection. In order to develop this, the business collected usage data for a proportional sample of customers who are still with the business and customers who have defected (available in the dataset DSCH07CHAICHND.xls). The next step is to conduct an Exploratory Data Analysis, with the objective of identifying the important variables and their interrelationships. The business is also interested in customer segments based on churn behavior.


5. Lending Club is a US-based consumer finance firm that operates a "peer-to-peer" lending website for personal loans. The company assesses applicants' risk and lets investors lend directly to individuals or spread their money across a number of loans.
It enables borrowers to create loan listings on its website by supplying details about themselves and the loans that they would like to request. All loans are unsecured personal loans of between $1,000 and $35,000. On the basis of the borrower's credit score, credit history, desired loan amount and debt-to-income ratio, Lending Club determines whether the borrower is creditworthy and assigns to its approved loans a credit grade that determines the payable interest rate and fees. The standard loan period is three years. The firm stands out for us because it provides detailed data on its website (https://www.lendingclub.com/info/download-data.action). You may download the data and evaluate the company's decisions by applying CHAID.

Source:- https://en.wikipedia.org/wiki/Lending_Club


References

Kass, G. V. (1980), An Exploratory Technique for Investigating Large Quantities of Categorical Data, Applied Statistics, 29(2), pp. 119-127.
Magidson, J. (1993), The use of the new ordinal algorithm in CHAID to target profitable segments, The Journal of Database Marketing, 1, pp. 29-48.
Moore, Marguerite and Carpenter, Jason M. (2010), A decision tree approach to modeling the private label apparel consumer, Marketing Intelligence & Planning, Vol. 28 No. 1, pp. 59-69.
Galguera, Laura, Luna, David and Méndez, M. Paz (2006), Predictive segmentation in action: Using CHAID to segment loyalty card holders, International Journal of Market Research, Vol. 48 Issue 4, pp. 459-479.
Zuccaro, Cataldo (2010), Classification and prediction in customer scoring, Journal of Modelling in Management, Vol. 5 No. 1, pp. 38-53.
Berry, Michael and Linoff, Gordon (1997), Data Mining Techniques for Marketing, Sales, and Customer Support, John Wiley, New York.

