Building Attrition Models Using Logistic Regression in Analysis Studio

https://www.appricon.com/index.php/building-attrition-models-using-logi...

Thursday, June 25, 2015


In statistical analysis and data mining projects, attrition analysis (also known as churn analysis) is about finding
the relationships between customer attrition and the variables that affect it. Although attrition models have specific
requirements, the process described in this article can be used to produce any binomial logistic model. The goal of attrition
analysis is to let the manager or researcher understand which variables contribute most to attrition and how likely a given
customer is to churn.

It may look easy to name the main reasons for attrition: customer satisfaction, length of service, and so on. Using such
rules of thumb, the user can predict about 15% of all churners, whereas a statistical analysis procedure such as the one in
Analysis Studio can yield more than 60% precision.
Analysis Studio offers four logistic regression methods for finding the model that best explains the main reasons for
attrition. In this how-to paper, we discuss a simple yet powerful statistical analysis method for obtaining a good attrition
model. We also discuss how to interpret the model, so that the manager or researcher has the tools to build and analyze
the model and to deploy it as the final step of the data mining process.
A predictive statistical model based on logistic regression analyzes the weight and contribution of each variable to the
model's goal. Each variable's contribution is expressed as a percentage, so the manager or researcher can understand its
weight on the model's target variable (in this example, attrition). Being able to see the effect each variable has on the
predictive model makes logistic regression a preferred modeling method, as opposed to methods such as neural networks
that act like a black box.
As suggested at the beginning of this article, you can also use the Analysis Studio logistic regression procedure in a wide
variety of fields, such as project failure analysis, employee attrition in HR, social research, engineering, finance, and any
other research that aims to explain and predict the occurrence of a binary event (such as "0"/"1" or
"Churned"/"Not churned").
Preparing the data set
Logistic regression produces a statistical model from a data set in which the target variable is binary, i.e. has two possible
values: "1" means that the event has occurred and "0" means that it has not.
Attrition  Visits  Calls  Education  Age  Children  Customer_ID

12  12  61  102654
18  20  32  103540
14  20  35  104426
20  20  26  105312
90  12  25  106198
10  59  107084
70  10  46  107970
16  65  108856
10  57  109742
12  14  64  110628
40  72  111514
12  67  112400
12  15  33  113286
12  12  33  115058
12  59  115944
14  60  116830
86  77  117716
12  14  52  118602
98  55  119488

Example data set


The data set contains 20 customers from last year: 10 of them have churned and 10 are still with the company. The goal of
the analysts in this data mining project is to detect the customers with a high personal risk of churning (e.g. John Smith
has a 95% risk of churning) so that marketing or the call center can contact them in advance. An equally important outcome
is the ability to analyze and understand what causes attrition, and what influence each variable has on a customer's decision
to leave the company. Being able to answer these questions is what binds statistical analysis, data mining and business
decisions together, and is what many refer to as decision support.
Data set variables:
Children - Number of children the customer has.
Age - Customer's age.
Education - Customer's years of education.
Calls - Number of calls the customer made to the service center.
Visits - Number of visits the customer made to the local service center.
Attrition - Whether the customer churned ("1") or not ("0"). This is the model's target variable.
To start using the logistic regression modeling process described here, you will need Analysis Studio. After starting the
software and loading the data, click the Statistics menu -> Logistic Regression.

As discussed above, the target variable (explained variable) in this case is Attrition. We would like to know how it is
affected by changes in the values of the explanatory variables (in this case: Age, Calls, Children, Education, Visits).
Select the Attrition variable in the explained variable box. To define the model, move the explanatory variables you expect
to matter from the Explanatory Variables frame on the left side of the wizard window to the Selected Columns frame on the
right side.
In a good data mining or statistical analysis process, the analyst makes assumptions and then tries to disprove them, so
that in the end only strong, well-founded assumptions are left standing.

Now we have all model components:


1. The explained variable: the target variable "Attrition", which we would like to either predict or analyze.
2. The explanatory variables (assumptions for now): the variables that we think influence the outcome of the target
variable. These are the selected columns, in this case Age, Calls, Children, Education and Visits.
3. The selected modeling method: Enter All, the simplest modeling technique, which tries to add all variables to the
model (this might be impossible due to correlated data).
All we have to do now is click the Next button and let the software calculate the model for us. In the Analysis Studio
logistic regression modeling wizard, the process can be stopped at any point, and you may move back and forth to make
changes and refine your model before publishing it.

At this point, the logistic regression model has been calculated, and the process of reviewing, analyzing and refining it
begins. The screen should now display the ROC curve and the Area Under Curve (AUC) value. We will not discuss the ROC or
AUC methods here but, generally speaking, they measure how well the model distinguishes between the "1" and "0" events
of the target variable. An AUC close to 1 means that the model is very successful at distinguishing between the binary
events ("0" or "1") of the target variable. Values close to 0.5 usually mean that the model performs poorly and should be
discarded.
AUC value    Interpretation
0.5          No distinguishing ability
0.5-0.7      Not a very good model
0.7-0.9      Very good model
0.9-1.0      Excellent model
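
For readers who want to see what the AUC figure measures, here is a minimal Python sketch. It is not how Analysis Studio computes the value; the function name `auc` and the sample labels and scores are made up for illustration. It uses the pairwise interpretation of AUC: the fraction of (churner, non-churner) pairs in which the churner received the higher predicted probability, with ties counted as half.

```python
def auc(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # churner scored above non-churner
            elif p == n:
                wins += 0.5   # tie counts as half a win
    return wins / (len(pos) * len(neg))


# Illustrative values only (NOT taken from the article's data set).
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.7, 0.4, 0.6, 0.3, 0.2]
example_auc = auc(labels, scores)  # 8 of the 9 pairs are ranked correctly
```

A model that assigns the same score to everyone gets 0.5 under this definition, which matches the "no distinguishing ability" row of the table.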

Note that excellent, "too good to be true" models should be examined carefully to make sure that no variables "from the
future" are present. Imagine that our database contained a variable holding the number of next month's orders. Customers
who are no longer active will have 0 orders, so this variable would look like a strong predictor if entered into the model,
but a false one: its value is only known after the customer has already churned.
Most good predictive business models have an AUC of 0.7-0.8, depending on data quality and the nature of the problem.
In our example, we have an AUC of 0.83 so we should proceed to view the rest of the results for further analysis.
Interpreting the statistical parameters of the model can be a complicated task that is beyond the scope of this how-to
paper. We will view the model parameters and predictive results that help us understand the attrition phenomenon as well
as our model's performance.
Note that you may make assumptions about model quality and performance, yet verifying the model's correctness, quality
and true performance requires the skills of a professional statistician or analyst.
At this point, click the Next button and then click Finish. The model is now published and ready to be reviewed. In the main
attrition model (logistic regression) window, each variable has its own value describing its contribution to the attrition
phenomenon.
For example, Age has the value 0.9512, which means that each additional year decreases the churn risk (strictly speaking,
the odds of churning) by 4.88%: (1 - 0.9512) * 100 = 4.88. Calls has the value 1.0458, which means that each additional call
to the call center increases the risk of churning by 4.58%: (1.0458 - 1) * 100 = 4.58.
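
The arithmetic above can be reproduced with a short Python sketch. It assumes that the displayed values are odds ratios (the exponentiated logistic regression coefficients), which is the usual reading of such values but is not stated explicitly in the text; the helper name `percent_change_in_odds` is ours.

```python
import math


def percent_change_in_odds(odds_ratio):
    """Percent change in the odds of the event per one-unit increase."""
    return (odds_ratio - 1.0) * 100.0


age_or = 0.9512    # value reported for Age
calls_or = 1.0458  # value reported for Calls

age_change = percent_change_in_odds(age_or)      # about -4.88 (a decrease)
calls_change = percent_change_in_odds(calls_or)  # about +4.58 (an increase)

# Assumption: value = exp(coefficient), so coefficient = ln(value).
age_coef = math.log(age_or)
calls_coef = math.log(calls_or)

# The effect compounds multiplicatively: ten additional years multiply
# the odds by age_or ** 10, not by 1 - 10 * 0.0488.
ten_year_or = age_or ** 10
```

The last line illustrates why the per-unit percentages should not simply be added up over several units: ten extra years scale the odds by roughly 0.61, not by 0.51.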
Classifications
Click the Classifications tab to view the model results on the current data set. This tab shows how many cases were
classified as churners or non-churners, compared with how many customers actually churned or not.

1. Model performance identifying the non-churners (80% success).
2. Model performance identifying the churners (70% success).
3. Overall model performance (75% success).
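
These three figures follow directly from the classification counts. With 10 non-churners and 10 churners in the data set, 80% and 70% success correspond to 8 and 7 correct classifications respectively. A minimal Python sketch of the calculation (the variable names are ours, not Analysis Studio's):

```python
# Counts implied by the percentages and the 10/10 split of the data set.
true_negatives, false_positives = 8, 2   # actual non-churners
true_positives, false_negatives = 7, 3   # actual churners

total = true_negatives + false_positives + true_positives + false_negatives

# Share of non-churners the model identified correctly.
non_churner_success = true_negatives / (true_negatives + false_positives)
# Share of churners the model identified correctly.
churner_success = true_positives / (true_positives + false_negatives)
# Share of all customers classified correctly.
overall_success = (true_positives + true_negatives) / total
```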
After computing the logistic regression procedure, we can finally try to answer our question: What affects the attrition
phenomena and what is the weight of each explanatory variable on "Attrition"?
Analysis 6 arms you with four powerful analytics tools:
What-If scenario - Lets you analyze a specific case in order to learn from it about your customers' attrition, or analyze a
specific customer in order to understand how and why the model classified that customer the way it did. Take a look at the
image below:
1. The variables calculator, which calculates the probability or risk of attrition based on the given variable values: Age,
Calls, Children, Education and Visits.
2. The calculation result when the parameters are entered.
In the example below, the probability of attrition is 40% for a customer who is 57 years old, has called the call center 12
times, has 2 children, has 12 years of education and has visited the customer service center twice.
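
A what-if calculation of this kind can be sketched in Python as follows. This is only an illustration of the logistic formula, not the Analysis Studio calculator: the Age and Calls coefficients are derived from the values 0.9512 and 1.0458 quoted earlier (assuming they are odds ratios), while the intercept and the Children, Education and Visits coefficients are hypothetical placeholders, so the result will not reproduce the 40% shown in the example.

```python
import math


def attrition_probability(intercept, coefs, customer):
    """Logistic model: p = 1 / (1 + exp(-(b0 + sum(b_i * x_i))))."""
    z = intercept + sum(coefs[name] * customer[name] for name in coefs)
    return 1.0 / (1.0 + math.exp(-z))


coefs = {
    "Age": math.log(0.9512),    # from the article, assuming an odds ratio
    "Calls": math.log(1.0458),  # from the article, assuming an odds ratio
    "Children": 0.5,            # HYPOTHETICAL placeholder
    "Education": -0.1,          # HYPOTHETICAL placeholder
    "Visits": 0.2,              # HYPOTHETICAL placeholder
}
intercept = 0.0                 # HYPOTHETICAL placeholder

# The customer from the example above.
customer = {"Age": 57, "Calls": 12, "Children": 2, "Education": 12, "Visits": 2}
p = attrition_probability(intercept, coefs, customer)

# What-if: the same customer with one more call to the call center.
one_more_call = dict(customer, Calls=13)
p_more_calls = attrition_probability(intercept, coefs, one_more_call)
```

Because the Calls coefficient is positive, `p_more_calls` is higher than `p`, which is the kind of comparison a what-if scenario is meant to support.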

Sensitivity Table - Analyze how sensitive your customer's attrition risk is to changes in the value of one of the explanatory
variables in the attrition model. Take a look at the image below, showing a customer whose other variable values are held
constant:
1. The only variable that changes is the number of children; the rest of the variables are constant.
2. The No. of children variable starts at no children (0) and ends at six children.
3. The attrition probability increases from 14.9% with no children to 90.78% with 6 children (figure no. 3).
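
The two endpoints quoted above are enough to sketch the whole sensitivity table in Python. In a logistic model with all other variables held constant, the log-odds change by the same amount for each additional child, so we can interpolate on the log-odds scale. The intermediate rows produced this way are a reconstruction under that assumption, not values read from the Analysis Studio screen.

```python
import math


def logit(p):
    """Log-odds of a probability."""
    return math.log(p / (1.0 - p))


def inv_logit(z):
    """Probability corresponding to a log-odds value."""
    return 1.0 / (1.0 + math.exp(-z))


p_at_0, p_at_6 = 0.149, 0.9078  # endpoints quoted in the article

# Implied change in log-odds per additional child.
children_coef = (logit(p_at_6) - logit(p_at_0)) / 6
# Implied odds ratio per additional child (roughly 1.96).
children_odds_ratio = math.exp(children_coef)

# Reconstructed table: (number of children, attrition probability).
sensitivity = [
    (k, inv_logit(logit(p_at_0) + children_coef * k)) for k in range(7)
]

for k, prob in sensitivity:
    print(f"{k} children: {prob:.2%}")
```

Under this reading, each additional child roughly doubles the odds of churning for this customer.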

Deploying the model - The Deploy Model button applies the model to the current data set (Current Results) or to a different
data set (a future data set in which attrition is unknown, or a test data set that was not used to build the model but in
which attrition is known). In many data mining and statistical analysis projects, analysts test model stability by applying
the attrition model to a data set that was not used to build it. Using a test set increases confidence in the generated
model. Many statistical analysts also deploy the model on the current data set (use the Deploy Model: Current Results
option) in order to examine the model's behaviour on it (e.g. after deploying, create a new cross-tab table, put the DECILE
column in the column variables and the DID_HIT variable in the row variables, and view which customers were classified in
each decile and whether the model performed as expected).

The deployment process generates several variables on the selected data set:
1. PROBABILITY - The model outcome: a number between 0 and 1 that represents the probability that the data record has a
value of 1 (in this case, the probability of the customer's attrition).
2. DECILE - The decile in which the record was classified: a number between 1 and 10. This number is a coarser resolution
of the probability, yet it is more readable to the human eye and easier to interpret and use for further analysis.
3. DID_HIT - Shows whether the model's classification was correct. In our example, this will be 1 for customers classified
as churners who truly churned, or classified as non-churners who are still our customers. DID_HIT will be 0 for customers
who were classified as churners and are still our customers, or classified as non-churners and yet churned.
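
The three deployment variables can be approximated with a short Python sketch. Two details are assumptions on our part, since the article does not specify them: a record is classified as a churner when its probability is at least 0.5, and decile 1 holds the tenth of records with the highest probability. The function name `deploy` and the sample records are illustrative only.

```python
def deploy(records, threshold=0.5):
    """records: list of (actual, probability) pairs.

    Returns one dict per record with PROBABILITY, DECILE and DID_HIT.
    ASSUMPTIONS: churner if probability >= threshold; decile 1 = highest risk.
    """
    n = len(records)
    # Record indices ordered from highest to lowest probability.
    order = sorted(range(n), key=lambda i: records[i][1], reverse=True)
    results = [None] * n
    for rank, i in enumerate(order):
        actual, probability = records[i]
        predicted = 1 if probability >= threshold else 0
        results[i] = {
            "PROBABILITY": probability,
            "DECILE": rank * 10 // n + 1,
            "DID_HIT": 1 if predicted == actual else 0,
        }
    return results


# Illustrative records only (NOT the article's data set).
records = [
    (1, 0.95), (1, 0.85), (0, 0.75), (1, 0.65), (1, 0.55),
    (0, 0.45), (1, 0.35), (0, 0.25), (0, 0.15), (0, 0.05),
]
deployed = deploy(records)
hits = sum(row["DID_HIT"] for row in deployed)
```

Cross-tabulating DECILE against DID_HIT on such output is the check described in the deployment paragraph above.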
In many cases the model performs poorly when the probability is around 50%. This is logical, since events that have a
50% chance of occurring are very hard to predict.
Below you can see the output of deploying the model on the current dataset (Current Results).

Another way to deploy the model is simply to copy the formula into a new field in any SQL engine, which will then compute
the formula for the given variable values and produce results for future data.
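
As a rough illustration of this approach, the Python sketch below builds the logistic formula as a SQL expression and evaluates it in an in-memory SQLite database. Everything specific here is an assumption: the helper `logistic_sql`, the table name `customers`, the zero intercept, and the use of only the Age and Calls coefficients (derived from the values 0.9512 and 1.0458 quoted earlier, read as odds ratios). The actual formula exported by Analysis Studio may look different.

```python
import math
import sqlite3


def logistic_sql(intercept, coefs):
    """Build a SQL expression for 1 / (1 + exp(-(b0 + sum(b_i * column_i))))."""
    terms = " + ".join(f"({b!r} * {col})" for col, b in coefs.items())
    return f"1.0 / (1.0 + exp(-({intercept!r} + {terms})))"


# HYPOTHETICAL model: zero intercept, two of the five variables.
coefs = {"Age": math.log(0.9512), "Calls": math.log(1.0458)}
expr = logistic_sql(0.0, coefs)

conn = sqlite3.connect(":memory:")
# Register exp() explicitly, since not every SQLite build ships math functions.
conn.create_function("exp", 1, math.exp)
conn.execute("CREATE TABLE customers (Customer_ID INTEGER, Age REAL, Calls REAL)")
conn.execute("INSERT INTO customers VALUES (1, 57, 12)")

row = conn.execute(
    f"SELECT Customer_ID, {expr} AS PROBABILITY FROM customers"
).fetchone()
conn.close()
```

The same expression string could be pasted into a computed column or view in another SQL engine, provided that engine offers an `exp` function.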
