https://www.appricon.com/index.php/building-attrition-models-using-logi...
It may look easy to identify the main factors that drive attrition: customer satisfaction, length of service, and so on. Using such rules of thumb a user can predict about 15% of all churners, but a statistical analysis procedure such as the one in Analysis Studio can yield more than 60% precision.
Analysis Studio uses four logistic regression methods to find the model that best explains the main drivers of attrition. In this how-to paper we discuss a simple yet powerful statistical analysis method for obtaining a good attrition model. We also discuss model interpretation, giving the manager or researcher the tools to conduct and analyze the model and to deploy it as the final step of the data mining process.
A predictive statistical model based on logistic regression analyzes each variable's weight and contribution to the model's goal. The contribution is expressed as a percentage, so the manager or researcher can see the weight each variable carries on the target variable (in this example: attrition). Knowing each variable's weight, and being able to see the effect each variable has on the prediction, makes logistic regression a preferred modeling method (as opposed to methods such as neural networks that act like a black box).
As suggested at the beginning of this article, you can also use the Analysis Studio logistic regression procedure in a wide variety of fields, such as project failure analysis, employee attrition in HR, social research, engineering, finance, and any other research aiming to explain and predict the occurrence of a binary event (such as "0"/"1" or "Churned"/"Not churned").
Preparing the data set
Logistic regression produces a statistical model from a data set whose target variable is binary, i.e. has two possible values: "1" means that the event has occurred and "0" means that it has not.
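Analysis Studio handles this preparation in its wizard, but the same encoding step can be sketched in Python with pandas. The column names and values below are invented to resemble the article's customer table; they are assumptions, not the actual data set.

```python
import pandas as pd

# Hypothetical sample resembling the article's customer table;
# the column names and values are illustrative assumptions.
df = pd.DataFrame({
    "Customer_ID": [102654, 103540, 104426],
    "Age":         [61, 32, 35],
    "Calls":       [12, 18, 14],
    "Children":    [2, 0, 1],
    "Status":      ["Churned", "Not churned", "Churned"],
})

# Encode the target as binary: 1 = the event occurred (churn), 0 = it did not.
df["Attrition"] = (df["Status"] == "Churned").astype(int)
print(df["Attrition"].tolist())  # [1, 0, 1]
```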
[Sample data set: a table of customer records with the columns Attrition, Visits, Calls, Education, Age, Children, and Customer_ID.]
6/25/2015 1:33 PM
As discussed above, the target (explained) variable in this case is Attrition: the variable whose behavior we want to explain through changes in the explanatory variables (here: Age, Calls, Children, Education, Visits). Select the Attrition variable in the explained variable box. To define the model, move the desired explanatory variables from the Explanatory Variables frame on the left side of the wizard window to the Selected Columns frame on the right.
In a good data mining or statistical analysis process, the analyst makes assumptions and then tries to disprove them, so that in the end only strong, well-founded assumptions are left standing.
At this point, the logistic regression model is calculated and the process of reviewing, analyzing, and refining it begins. The screen should now display the ROC curve and the Area Under Curve (AUC) value. We will not discuss the ROC or AUC methods here but, generally speaking, they measure the model's success in distinguishing between the "1" and "0" events of the target variable. An AUC close to 1 means the model distinguishes between the binary events very well; values close to 0.5 usually mean the model performs poorly and should be ignored.
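To make the AUC measure concrete outside Analysis Studio, here is a minimal sketch using scikit-learn on synthetic churn data (the variable names and effect sizes are invented for illustration):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# Synthetic data standing in for the customer table: more calls raise
# churn odds, higher age lowers them (assumed effects, for illustration).
rng = np.random.default_rng(0)
n = 500
calls = rng.integers(0, 30, n)
age = rng.integers(20, 80, n)
logit = 0.15 * calls - 0.05 * age
attrition = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)

X = np.column_stack([calls, age])
model = LogisticRegression().fit(X, attrition)

# AUC near 0.5 = no discriminating ability; near 1 = excellent model.
auc = roc_auc_score(attrition, model.predict_proba(X)[:, 1])
print(round(auc, 3))
```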
AUC value   Model quality
0.5         No discriminating ability
0.5-0.7     Poor model
0.7-0.9     Good model
0.9-1.0     Excellent model
Note that models that look too good to be true should be examined carefully to make sure no variables "from the future" are present. Imagine that our database contained a variable holding the number of next month's orders. Customers that are no longer active will have 0 orders, so this variable would look like a strong predictor if entered into the model, but it is a false one. Most good predictive business models have an AUC of 0.7-0.8, depending on data quality and the nature of the problem.
In our example we have an AUC of 0.83, so we proceed to the rest of the results for further analysis. Interpreting every statistical parameter in the model is a complicated task that is beyond the scope of this how-to paper. We will view the model parameters and predictive results that help us understand the attrition phenomenon as well as our model's performance. Note that you may make assumptions about model quality and performance, yet verifying model correctness, quality, and true performance requires a qualified statistician or analyst.
At this point, click the Next button and then click Finish. The model is now published and ready to be reviewed. In the main attrition model (logistic regression) window, each variable has its own value indicating its contribution to the attrition phenomenon.
For example, Age has the value 0.9512, which means that for each additional year the odds of churning decrease by 4.88%: (1 - 0.9512) * 100 = 4.88. Calls has the value 1.0458, which means that for each additional call to the call center the odds of churning increase by 4.58%: (1.0458 - 1) * 100 = 4.58.
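These per-variable values are exponentiated logistic regression coefficients (odds ratios), and the percentage interpretation follows directly from them. A minimal check of the article's arithmetic:

```python
# Odds ratios as reported for the model in this article
odds_ratio_age = 0.9512
odds_ratio_calls = 1.0458

# Percentage change in churn odds per one-unit increase in the variable:
# an odds ratio below 1 decreases the odds, above 1 increases them.
age_effect = (1 - odds_ratio_age) * 100      # 4.88% decrease per year
calls_effect = (odds_ratio_calls - 1) * 100  # 4.58% increase per call
print(round(age_effect, 2), round(calls_effect, 2))  # 4.88 4.58
```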
Classifications
Click the Classifications tab to view the model's results on the current data set. This tab shows how many cases were classified as churners or non-churners versus how many customers actually churned.
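The comparison the Classifications tab summarizes is a confusion matrix. A small sketch with made-up labels (1 = churned):

```python
from sklearn.metrics import confusion_matrix

# Hypothetical actual vs. predicted churn labels, for illustration only
actual    = [1, 0, 1, 1, 0, 0, 1, 0, 0, 1]
predicted = [1, 0, 0, 1, 0, 1, 1, 0, 0, 1]

# tn/fp/fn/tp: correctly kept, falsely flagged, missed churners, caught churners
tn, fp, fn, tp = confusion_matrix(actual, predicted).ravel()
print(f"classified churners: {tp + fp}, actual churners: {tp + fn}")
```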
Sensitivity Table - Analyze your customers' attrition sensitivity by varying the value of one of the explanatory variables in the attrition model while holding the others constant. Take a look at the image below, which shows a customer whose other variable values are held constant:
1. The only variable that changes is the number of children; the rest of the variables are constant.
2. The No. of children variable starts with no children at all (0) and ends with six children.
3. Attrition probability rises from 14.9% with no children to 90.78% with six children (figure 3).
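The same sweep can be reproduced with any fitted logistic regression: hold the other inputs fixed and vary one variable across its range. This sketch uses synthetic data with an assumed positive effect of children on churn, so the probabilities will rise with the count, as in the article's table:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy model where more children means higher churn probability (assumed effect)
rng = np.random.default_rng(1)
children = rng.integers(0, 7, 400)
age = rng.integers(20, 80, 400)
logit = 0.6 * children - 0.03 * age - 0.5
churned = (rng.random(400) < 1 / (1 + np.exp(-logit))).astype(int)
model = LogisticRegression().fit(np.column_stack([children, age]), churned)

# Sensitivity sweep: hold Age constant at 40, vary Children from 0 to 6
sweep = np.column_stack([np.arange(7), np.full(7, 40)])
probs = model.predict_proba(sweep)[:, 1]
for n_children, p in zip(range(7), probs):
    print(n_children, round(p, 3))
```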
Deploying the model
The Deploy Model button applies the model to the current data set (Current Results) or to a different data set (a future data set in which attrition is unknown, or a test data set that was not used to build the model but in which attrition is known). In many data mining and statistical analysis projects, analysts test the attrition model on a data set that was not used to build it in order to test model stability; using such a test increases confidence in the generated model. Many statistical analysts also deploy the model on the current data set (the Deploy Model: Current Results option) in order to examine the model's behaviour there (e.g. after deploying, create a new cross-tab table, put the deciles column in the column variables and the DID_HIT variable in the row variables, and view which customers were classified in each decile and whether the model performed as expected).
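The stability check described above, scoring data the model never saw during fitting, can be sketched with a simple train/test split (the data here is synthetic and the split ratio is an assumption):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic churn data; hold out 30% the model never sees during fitting
rng = np.random.default_rng(2)
X = rng.normal(size=(600, 3))
y = (rng.random(600) < 1 / (1 + np.exp(-(X @ [1.0, -0.8, 0.5])))).astype(int)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

model = LogisticRegression().fit(X_train, y_train)
auc_train = roc_auc_score(y_train, model.predict_proba(X_train)[:, 1])
auc_test = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])

# Similar train and test AUC suggests a stable model
print(round(auc_train, 3), round(auc_test, 3))
```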
The deployment process generates several new variables on the selected data set:
1. PROBABILITY: the model outcome, a number between 0 and 1 representing the probability that the record has a value of 1 (in this case, the probability of customer attrition).
2. DECILE: the decile in which the record was classified, a number between 1 and 10. This is a coarser resolution than the probability, yet it is more readable and easier to interpret and use for further analysis.
3. DID_HIT: shows whether the model's classification was correct. In our example, this will be 1 for customers classified as churners who truly churned, or classified as non-churners who are still our customers. DID_HIT will be 0 for customers classified as churners who are still our customers, or classified as non-churners who nevertheless churned.
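As a sketch of how these three columns relate, here is one way to derive them with pandas from scored probabilities. The 0.5 classification cutoff and the decile direction (1 = lowest probability) are assumptions; Analysis Studio's exact conventions may differ.

```python
import numpy as np
import pandas as pd

# Hypothetical deployment output: scored probabilities plus actual outcome
rng = np.random.default_rng(3)
df = pd.DataFrame({
    "PROBABILITY": rng.random(100),
    "ACTUAL": rng.integers(0, 2, 100),
})

# DECILE: 1 = lowest-probability tenth ... 10 = highest (assumed direction)
df["DECILE"] = pd.qcut(df["PROBABILITY"], 10,
                       labels=list(range(1, 11))).astype(int)

# DID_HIT: 1 when classification at an assumed 0.5 cutoff matches the outcome
predicted = (df["PROBABILITY"] >= 0.5).astype(int)
df["DID_HIT"] = (predicted == df["ACTUAL"]).astype(int)
print(df[["PROBABILITY", "DECILE", "DID_HIT"]].head())
```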
In many cases the model performs poorly when the probability is around 50%. This is logical, since people or events with a 50% chance of occurring are very hard to predict.
Below you can see the output of deploying the model on the current dataset (Current Results).
Another way to deploy the model is simply to copy the scoring formula into a new field in any SQL engine, letting the SQL tool compute the formula for given variable values and return future results.
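The exported formula is just the logistic function applied to a linear combination of the variables, so it translates directly into a SQL expression. The coefficients below are invented for illustration; the real values come from the published model.

```python
import math

# Illustrative coefficients only; real values come from the published model
intercept, b_age, b_calls = 1.2, -0.05, 0.045

def churn_probability(age, calls):
    # The same expression could be pasted into a SQL field, e.g.
    # 1 / (1 + EXP(-(1.2 - 0.05 * Age + 0.045 * Calls)))
    z = intercept + b_age * age + b_calls * calls
    return 1 / (1 + math.exp(-z))

p = churn_probability(age=40, calls=18)
print(round(p, 3))
```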