Yiqing Huang 1,2, Fangzhou Zhu 1,2, Mingxuan Yuan 3, Ke Deng 4, Yanhua Li 3, Bing Ni 3, Wenyuan Dai 3, Qiang Yang 3,5, Jia Zeng 1,2,3,*
1 School of Computer Science and Technology, Soochow University, Suzhou 215006, China
2 Collaborative Innovation Center of Novel Software Technology and Industrialization
3 Huawei Noah's Ark Lab, Hong Kong
4 School of Computer Science and Information Technology, RMIT University, Australia
5 Department of Computer Science and Engineering, Hong Kong University of Science and Technology
* Corresponding Author: zeng.jia@acm.org
ABSTRACT
We show that telco big data can make churn prediction much easier from the 3V's perspectives: Volume, Variety, Velocity. Experimental results confirm that the prediction performance has been significantly improved by using a large [...] prediction system deployed in this operator (0.68 precision), as well as the current state-of-the-art research work [16, 14, 13, 32, 1]. The results are based on a 9-month dataset from around two million prepaid customers in one of the biggest operators in China. This study provides a clear illustration that bigger data can indeed be a more valuable asset for improving the generalization ability of predictive modeling such as churn prediction. Moreover, the results suggest that it is worthwhile for telco operators to gather both more data instances and more possible data features, plus a scalable computing platform to take advantage of them. In this sense, the big data platform plays a major role in next-generation telco operation. As an additional contribution, we introduce how to integrate the churn prediction model with retention campaign systems, automatically matching proper retention campaigns to potential churners. In particular, the churn prediction model and the feedback of the retention campaign form a closed loop in feature engineering.

To summarize, this paper makes two main contributions. The primary contribution is an empirical demonstration that churn prediction performance can indeed be significantly improved [...]

[Figure 1: Prepaid vs. Postpaid (panel content lost in extraction).]

[Figure 2 content: Applications Layer with internal applications (Business Process Optimization, Monetization, Precise Marketing, Experience & Retention, Network Plan Optimization) and external applications (Data Exposure); Capability Layer with Business and Network Components on the Big Data Platform (Big Data Modeling & Mining), connected through a Capability Bus; Data Layer with Spark SQL, Hive, Hadoop and Data Integration/ETL/Pretreatment, connected through a Data Bus; Data Resources: BSS and OSS.]

[Table: data sources (User Base/Behavior, Complaint, Billing, Voice/SMS CDR, CS, PS, MR) with daily increments.]
providers to manage their networks (e.g., mobile networks). It supports management functions such as network inventory, service provisioning, network configuration and fault management. Note that both BSS and OSS often work separately and have their own data and service responsibilities. With the rapid expansion, telco data storage has entered the PB era, which requires a scalable big data computing platform for monetization.

Figure 2 overviews the functional architecture of the telco big data platform, where data sources are from both BSS and OSS. Generally, the data from BSS are composed of User Base/Behavior, Complaint, Billing and Voice/Message call detail record (CDR) tables, which cover users' demographic information, package, call times, call duration, messages, recharge history, account balance, calling number, cell tower ID, and complaint records. The data volume in BSS is around 24GB per day. The data from OSS can be broadly categorized into three parts: circuit switch (CS), packet switch (PS) and measurement report (MR). CS data describe the call connection quality, e.g., call drop rate and call connection success rate. PS data are often called mobile broadband (MBB) data, including data gathered by probes using Deep Packet Inspection (DPI). PS data describe users' mobile web behaviors, which relate to web speed, connection success rate, and mobile search queries. MR data are from the radio network controller (RNC), and can be used to estimate users' approximate trajectories [17]. The data volume in OSS is around 2.2TB per day, occupying over 97% of the entire dataset. Also, we can use web crawlers to obtain some Internet data (e.g., map information and social networks).

More specifically, BSS data are from the traditional BI systems widely used in telco operators, which consist of around 140 tables. OSS data are imported by the Huawei integrated solution called SmartCare, which collects the data from probes and interprets them into x-Detail Record (xDR) tables, such as User Flow Detail Record (UFDR), Transaction Detail Record (TDR), and Statistics Detail Record (SDR). All these tables have the key value called International Mobile Subscriber Identification Number (IMSI) or International Mobile Equipment Identity (IMEI), which can be used for join operations and feature engineering. We use the multi-vendor data adaptation module to convert tables to the standard format, which are imported into the big data platform through extract-transform-load (ETL) tools for data integration. We store these raw tables in the Hadoop distributed file system (HDFS), which communicates with Hive/Spark SQL for feature engineering purposes. These compose the data layer, connected with the capability layer through the data bus in the telco big data platform. The main workload of the data layer is to collect and update all tables from BSS and OSS periodically. In the capability layer, we build two types of models, business and network components, based on both unlabeled and labeled instances of features using data mining and machine learning algorithms. Through the capability bus, these models can support the application layer, including internal (e.g., precise marketing, experience & retention and network plan optimization) and external applications (e.g., data exposure for trading). The main workloads of the application layer include customer insight (e.g., churn, complaint, recommendation of products or services) and network insight (e.g., network planning and optimization). The churn prediction system is supported by the telco big data platform, and is located at the red dashed lines in Figure 2.

Efficient management of telco big data for predictive modeling poses a great challenge to the computing platform [11]. For example, around 2.3TB of new data arrive per day from both BSS and OSS sources, and more than 97% of this volume is from OSS. The empirical results are based on a scalable churn prediction and retention system using Apache Hadoop [31] and Spark [37] distributed architectures, including three key components: 1) data gathering/integration, 2) feature engineering/classifiers, and 3) retention campaign systems. In data gathering and integration, we move data tables from different sources and integrate some tables with ETL tools. Most data are stored in regular tables (e.g., billing and demographic information) or sparse matrices (e.g., unstructured information like textual complaints, mobile search queries and trajectories) in HDFS. The feature engineering and classification components are hand coded in Spark, based on Hive/Spark SQL and some widely-used unsupervised/supervised learning algorithms, including PageRank [26], label propagation [40], topic models (latent Dirichlet allocation, or LDA) [4, 39], LIBLINEAR (L2-regularized logistic regression) [12], LIBFM (factorization machines) [30], gradient boosted decision trees (GBDT) and random forests (RF) [5]. This churn management platform has been deployed in the real-world operator's systems, and can scale up efficiently to massive data from both BSS and OSS.

[Figure 3: Churn prediction and retention systems. HDFS and Apache Spark at the bottom; feature engineering and classifiers in the middle; retention campaigns at the top.]

4. PREDICTION AND RETENTION
The structure of the churn prediction system is illustrated in Figure 3. HDFS and Apache Spark support the distributed storage and management of raw data. Feature engineering techniques are applied to select and extract relevant features for churn model training and prediction. In each business period (monthly), the churn classifier provides a list of likely churners for proper retention campaigns. In particular, the churn prediction model and the feedback of the retention campaign form a closed loop.

4.1 Feature Engineering
We make use of Hive and Spark SQL to quickly sanitize the data and extract the huge number of useful features. The raw data are stored on HDFS as Hive tables. Because Hive runs much slower than Spark SQL, we use Spark SQL to manipulate the tables. Spark SQL is a SQL engine designed from the ground up for Spark (https://spark.apache.org/sql/). It brings native support for SQL to Spark and streamlines the process of querying data stored both in RDDs (Resilient Distributed Datasets) and Hive [37]. The Hive tables are directly imported into
Figure 4: An example of basic features in the wide table (feature: description).

localbase outer call dur: duration of local-base outer call | ld call dur: duration of long-distance call
roam call dur: duration of roaming call | localbase called dur: duration of local-base called
ld called dur: duration of long-distance called | roam called dur: duration of roaming called
cm dur: duration of calling China Mobile | ct dur: duration of calling China Telecom
busy call dur: duration of calling in busy time | fest call dur: duration of calling in festival
sms p2p inner mo cnt: count of inner-net MO SMS | sms p2p other mo cnt: count of other MO SMS
sms p2p cm mo cnt: count of MO SMS with China Mobile | sms p2p ct mo cnt: count of MO SMS with China Telecom
sms info mo cnt: count of information-on-demand MO SMS | sms p2p roam int mo cnt: count of roaming international MO SMS
mms p2p inner mo cnt: count of inner-net MO MMS | mms p2p other mo cnt: count of other MO MMS
mms p2p cm mo cnt: count of MO MMS with China Mobile | mms p2p ct mo cnt: count of MO MMS with China Telecom
mms p2p roam int mo cnt: count of roaming international MO MMS | all call cnt: count of all calls
voice cnt: count of voice calls | local base call cnt: count of local-base calls
ld call cnt: count of long-distance calls | roam call cnt: count of roaming calls
caller cnt: count of caller calls | voice dur: duration of voice calls
caller dur: duration of caller calls | localbase inner call dur: duration of local-base inner-net calls
free call dur: duration of free calls | call 10010 cnt: count of calls to 10010 (service number)
call 10010 manual cnt: count of calls to manual 10010 | sms bill cnt: count of billing SMS
sms p2p mt cnt: count of MT SMS | mms cnt: count of all MMS
mms p2p mt cnt: count of MT MMS | gprs all flux: all GPRS flux
gender: gender of user | age: age of user
credit value: user credit value | innet dura: duration of innet
total charge: the total cost in a month | gprs flux: the total GPRS flux
gprs charge: the total cost of GPRS | local call minutes: minutes of local calls
toll call minutes: minutes of long-distance calls | roam call minutes: minutes of roaming calls
voice call minutes: minutes of voice calls | p2p sms mo cnt: count of MO SMS
p2p sms mo charge: cost of MO SMS | pspt type: credentials type
is shanghai: whether the user is a Shanghai resident | town id: user town id
sale id: selling area id | pagerank: user influence calculated with PageRank
product id: product id | product price: product price
product knd: kind of product | gift voice call dur: duration of gift voice calls
gift sms mo cnt: count of gift MO SMS | gift flux value: gift flux
distinct serve count: count of service SMS | serve sms count: count of service providers through SMS
innet dura: duration in net | balance: account balance
balance rate: recharge over account balance | total charge: recharge value
PAGESIZE: page size | PAGE SUCCEED FLAG: page display success flag
L4 UL THROUGHPUT: data upload speed | L4 DW THROUGHPUT: data download speed
TCP CONN STATES: TCP connection status | TCP RTT: TCP round-trip time
STREAMING FILESIZE: streaming file size | STREAMING DW PACKETS: streaming download packets
ALERTTIME: alert time | ENDTIME: alert end time
LAC: location area code | CI: cell ID
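Basic features like these are assembled by joining and aggregating Hive tables with Spark SQL, as the text describes next. The sketch below illustrates the same join and aggregation queries in a self-contained way: SQLite stands in for Spark SQL purely so the example runs anywhere, and the `local_call`/`roam_call` schemas and values are hypothetical.

```python
import sqlite3

# In-memory database standing in for Hive/Spark SQL tables (hypothetical schema).
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE local_call (imsi TEXT, day INTEGER, dur INTEGER);
CREATE TABLE roam_call  (imsi TEXT, day INTEGER, dur INTEGER);
INSERT INTO local_call VALUES ('u1', 1, 60), ('u1', 2, 30), ('u2', 1, 10);
INSERT INTO roam_call  VALUES ('u1', 1, 20), ('u2', 2, 40);
""")

# Aggregation: summarize each customer's daily call tables over a period.
# Join: combine the local-call and roaming-call features into one wide row.
wide = con.execute("""
SELECT l.imsi,
       l.local_call_dur,
       COALESCE(r.roam_call_dur, 0) AS roam_call_dur
FROM   (SELECT imsi, SUM(dur) AS local_call_dur FROM local_call GROUP BY imsi) l
LEFT JOIN (SELECT imsi, SUM(dur) AS roam_call_dur FROM roam_call GROUP BY imsi) r
       ON l.imsi = r.imsi
ORDER BY l.imsi
""").fetchall()

print(wide)  # each tuple is one customer's (partial) feature vector
```

Each output row plays the role of one tuple in the unified wide table, i.e., one customer's feature vector.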
Spark SQL for basic queries, including join queries and aggregation queries. For example, we need to join the local call table and the roam call table to combine the local call and roam call features. Also, we need to aggregate local call tables of different days to summarize a customer's call information over a time period (e.g., a month). All the intermediate results are stored as Hive tables, which can be reused by other tasks, since the feature engineering may be repeated many times. Finally, a unified wide table is generated, where each tuple in the table represents a customer's feature vector. The wide table is exported into Hive to build classifiers. An example of some basic features in the wide table, with explanations, can be found in Figure 4.

These basic features can be broadly categorized into three parts: 1) baseline features, 2) CS features, and 3) PS features. The baseline features are extracted from BSS and are used in most previous research work, e.g., account balance, call frequency, call duration, complaint frequency, data usage, recharge amount, and so on. We use these baseline features to illustrate the differences between our solution and previous work. These features compose a vector, x_m = [x_1, ..., x_i, ..., x_j, ..., x_N], for each customer m.

Besides these basic features, we also use some unsupervised, semi-supervised and supervised learning algorithms to extract the following complicated features: 1) graph-based features, 2) topic features, and 3) second-order features.

4.1.1 CS and PS Features
Both CS and PS features are from OSS. We use several key performance indicators (KPI) and key quality indicators (KQI) in OSS, as well as some statistical features, such as the average data upload/download speed and the most frequent connection locations from MR data. Due to the page limitation and the focus of this paper, we only give a brief introduction of the KPI/KQI features extracted from CS/PS data. Interested readers can check Huawei's industrial OSS handbook (HUAWEI SEQ Analyst KQI&KPI Definition) for details.

CS features represent the service quality of voice. The KPI/KQI features used in CS data include average Perceived Call Success Rate, average E2E Call Connection Delay, average Perceived Call Drop Rate and average Voice Quality. Perceived Call Success Rate indicates the call success rate perceived by a calling customer. It is calculated from the formula [Mobile Originated Call Alerting]/([Mobile Originated Call Attempt] + [Immediate Assignment Failures]/2). Each item in the formula is computed from a group of PIs (performance indicators). E2E Call Connection Delay indicates the amount of time a calling customer waits before hearing the ringback tone. It is defined as [E2E Call Connection Total Delay]/[Mobile Originated Call Alerting] = [Sum(Mobile Originated Connection Time - Mobile Originated Call Time)]/[Mobile Originated Call Alerting]. Perceived Call Drop Rate indicates the call drop rate perceived by a customer. It is obtained from [Radio Call Drop After Answer]/[Call Answer]. Voice Quality includes Uplink MOS (Mean Opinion Score), Downlink MOS, IP MOS, One-way Audio Count, Noise Count, and Echo Count. These can be used to assess the voice quality in the operator's network
and the customers’ experience of voice services. Totally, a group of customers that we have known that they are
each customer has a total of 9 CS KPI/KQI features. churners. Following the edge weights in the graph, we propa-
PS KPI/KQI features reflect the quality of data service, gate the churner probabilities from the customer seed vertex
which includes web, streaming and email. For example, we (the ones we have churner label information about) into the
use the web features such as average Page Response Success customers without churner labels. After several iterations,
Rate, average Page Response Delay, average Page Browsing the label propagation algorithm converges and the output
Success Rate, average Page Browsing Delay, and average is churner labels for the non-churner customers. Given the
Page Download Throughput. Page Response Success Rate edge weight matrix WM ×M , and a churner label probability
indicates the rate at which website access requests are suc- matrix YM ×2 , the label propagation algorithm follows the 3
cessfully responded to after the customer types a Uniform iterative steps:
Resource Locator (URL) in the address bar of a web browser.
It is calculated from [Page Response Successes]/[Page Re- 1. Y ← W Y ;
quests], where [Page Response Successes] = [First GET Suc- 2. Row-normalize Y to sum to 1;
cesses] and [Page Requests] = [First GET Requests]. Page
Response Delay indicates the amount of time a customer 3. Fix the churner labels in Y , Repeat from Step 1 until
waits before the desired web page information starts to dis- Y converges.
play in the title bar of a browser after the customer types
a URL in the address bar. It is obtained from [Total Page After label propagation, each customer (non-churner) is as-
Response Success Delay]/[Page Response Successes], where sociated with a churn label probability ym in matrix Y prop-
[Page Response Success Delay] = [First GET Response Suc- agated from the churners in the previous month. The higher
cess Delay] + [First TCP Connect Success Delay]. Similar value means that this non-churner has a higher likelihood to
KPI/KQI features are used for both streaming and email ser- be the churner in the graph. Since we have 3 undirected
vices. For each customer, we also select top 5 most frequent graphs, we obtain 3 churner label propagation features ym
locations or stay areas during a period (e.g., a month) repre- for each customer m. As a result, we have 6 graph-based fea-
sented by latitude and longitude from MR data. Therefore, tures for each customer plus 3 PageRank features, which are
each customer has 15 PS KPI/KQI features plus 10 most different and independent from label propagation features
frequent location features from PS data. because PageRank features do not involve churner labels.
611
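The three iterative steps of label propagation can be sketched in a few lines of pure Python. The 3-customer graph, its uniform edge weights, and the choice of customer 0 as the known churner are invented purely for illustration.

```python
def propagate_labels(W, Y, clamped, iters=50):
    """Label propagation: Y <- W Y, row-normalize, re-clamp known churner rows."""
    M = len(Y)
    fixed = {m: Y[m][:] for m in clamped}      # remember the known labels
    for _ in range(iters):
        # Step 1: Y <- W Y
        Y = [[sum(W[m][k] * Y[k][c] for k in range(M)) for c in range(2)]
             for m in range(M)]
        # Step 2: row-normalize Y so each row sums to 1
        for m in range(M):
            s = sum(Y[m]) or 1.0
            Y[m] = [v / s for v in Y[m]]
        # Step 3: fix the churner labels in Y, then repeat
        for m in clamped:
            Y[m] = fixed[m][:]
    return Y

# Toy undirected graph of 3 customers; customer 0 is a known churner.
W = [[0.0, 0.5, 0.5],
     [0.5, 0.0, 0.5],
     [0.5, 0.5, 0.0]]
Y0 = [[1.0, 0.0],   # known churner: [P(churn), P(non-churn)]
      [0.5, 0.5],
      [0.5, 0.5]]
Y = propagate_labels(W, Y0, clamped={0})
print(Y[1][0], Y[2][0])   # churn probabilities propagated to unlabeled customers
```

On this toy graph, the churn probability of both unlabeled customers is pulled toward the clamped churner; the converged value Y[m][0] is the kind of per-customer feature the text describes.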
4.1.3 Topic Features
[...] The output of LDA contains two matrices, {θ, φ}. In practice, the number of topics K in LDA is set to 10. The matrix θ_{K×M} provides low-dimensional topic features for all customers, compared with the original sparse high-dimensional matrix x_{W×M}, because K ≪ W. For both complaint and search texts, we generate K = 10 dimensional topic features for each customer, respectively. So, there are a total of 20 topic features for each customer.

4.1.4 Second-order Features
In the feature vector x_m = [x_1, ..., x_i, ..., x_j, ..., x_N], second-order features are defined as the product of any two feature values x_i x_j, which may help us find the hidden relationship between a specific pair of features. Suppose that there are N features; the total number of second-order features could be (N + 1)N/2, which is often a burden for building a classifier. So, we use the LIBFM [30] regression model to select the most useful second-order features. The regression model can be used for classification problems by setting the response of the regression as class labels. LIBFM optimizes the following objective function by the stochastic gradient descent algorithm,

    y = w_0 + Σ_{i=1}^{n} w_i x_i + Σ_{i=1}^{n} Σ_{j=i+1}^{n} ⟨v_i, v_j⟩ x_i x_j,    (3)

where w_0 and w_i are weight parameters, and v_i is a K-length vector for the ith feature value. The inner product ⟨v_i, v_j⟩ denotes the weight for the second-order feature x_i x_j. A larger weight means that the second-order feature is more important [30]. So, we rank the learned weights ⟨v_i, v_j⟩ after optimizing (3), and select the 20 second-order features with the largest weights for churn prediction.

4.2 Classifiers
For the experiments we report, we choose the Random Forest (RF) classifier [5] to make predictions. The major reason for this choice is that RF yields the best predictive performance among several widely-used classifiers (see comparisons in Subsection 5.8). RF is a supervised learning method that uses the bootstrap aggregating (bagging) technique on an ensemble of decision trees. Given a set of training instances x_m = [x_1, ..., x_i, ..., x_j, ..., x_N] with class labels y_m = {non-churner = 0, churner = 1}, RF fits a set of decision trees f_t, 1 ≤ t ≤ T, to random subspaces of features. The label prediction for a test instance x is the average of the predictions from all individual trees,

    y = (1/T) Σ_{t=1}^{T} f_t(x),    (4)

where y is the likelihood of being a churner. We rank this likelihood in descending order for predictive performance evaluation as well as for retention campaigns. Generally, customers with higher likelihood have lower recharge rates in our experiments.

Each decision tree randomly selects a subset of √N of the N features, and builds a splitting node by iterating over these features. The Gini improvement I(·) is used to determine the best splitting point given the ith column of features from x_{M×N},

    I(x_{1:M,i}) = G(x_{1:M,i}) − q × G(x_{1:P,i}) − (1 − q) × G(x_{P+1:M,i}),    (5)

    G(·) = 1 − Σ_{c=1}^{2} p_c²,    (6)

where P is the splitting point that partitions the 1:M instances into two parts, 1:P and P+1:M, and q is the fraction of instances going left at the splitting point P. The function G(·) is the Gini index for a group of customer instances, where p_1 is the probability of churners and p_2 is the probability of non-churners in this group. For each column of features, we visit all splitting points P and find the maximum Gini improvement I. Then, we find the maximum I among all features, and use the feature with the maximum I as the splitting point in the node of a decision tree. The other parameters of RF are as follows: 1) the number of trees is 500 and 2) the minimum number of samples in leaf nodes is 100. We fix the minimum number of samples in leaf nodes to 100 to avoid over-fitting. The split process of each decision tree stops when the number of samples in an individual node is less than 100.

After training RF, we obtain each feature's importance value by summing the Gini improvements over all nodes,

    Importance_i = Σ_{t=1}^{T} Σ_{P_i ∈ tree_t} I(x_{1:M,i}).    (7)

The feature importance ranking could help us evaluate which features are the major factors in churn prediction.

4.3 Retention Systems
The retention systems for potential churners form a closed loop with the churn prediction systems, as shown in Figure 3. Telco operators are not only concerned with identifying potential churners, but also want to carry out effective retention campaigns to retain those potential churners for further profits. Generally, if a customer accepts a retention offer, he/she will keep using the operator's service for the next 5 months to get 1/5 of the offer per month. Nevertheless, there are multiple retention offers, and operators do not know which offers match a specific group of potential churners. For example, some potential churners will not accept any offer, some want more cashback, some want more flux, and some want more free voice minutes. Before the automatic retention system was deployed, operators often matched offers with potential churners by domain knowledge, and the campaign results were not satisfactory. So, it is necessary to build an automatic retention system for matching offers with churners.

We define this task as a multi-category classification problem. A potential churner x_m is classified into multiple categories y_m = {0, 1, 2, ..., C − 1}, where y_m = 0 denotes that the potential churner will not accept any offer, and the other values denote different types of retention offers. The class labels (retention results) are accumulated after each retention campaign. We train an RF retention classifier as in Subsection 4.2 to do multi-category classification, as shown in Figure 3. The retention classifier is updated when retention campaign results become available, similar to the churn classifier. Moreover, we use the label propagation algorithm to propagate the campaign result label y_m on the three undirected graphs in Subsection 4.1.2. These 3 × C new features are added to the original churn prediction features for training [...]
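Equations (5) and (6) can be checked with a short pure-Python sketch. The toy label column below (churner = 1, non-churner = 0, already sorted by feature i) is invented for illustration.

```python
def gini(labels):
    """G(.) = 1 - p1^2 - p2^2 over a group of instances, Eq. (6)."""
    n = len(labels)
    if n == 0:
        return 0.0
    p1 = sum(labels) / n          # fraction of churners
    p2 = 1.0 - p1                 # fraction of non-churners
    return 1.0 - p1 ** 2 - p2 ** 2

def gini_improvement(labels, P):
    """I = G(1:M) - q*G(1:P) - (1-q)*G(P+1:M), Eq. (5)."""
    M = len(labels)
    q = P / M                     # fraction of instances going left
    return gini(labels) - q * gini(labels[:P]) - (1 - q) * gini(labels[P:])

# Toy column: after sorting by feature i, the churners cluster at the front,
# so the best split should separate them cleanly at P = 3.
labels = [1, 1, 1, 0, 0, 0, 0, 0]
best_P = max(range(1, len(labels)), key=lambda P: gini_improvement(labels, P))
print(best_P, round(gini_improvement(labels, best_P), 5))
```

For this column the perfect split at P = 3 leaves both sides pure (Gini 0), so the improvement equals the Gini index of the whole group, 1 − (3/8)² − (5/8)² = 0.46875. Repeating this over all columns and summing improvements per feature, as in Eq. (7), yields the feature importance ranking.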
[...] there are lots of churners, the total number of prepaid customers [...]
Table 1: Statistics of Dataset (9 Months).
Month 1 Month 2 Month 3 Month 4 Month 5 Month 6 Month 7 Month 8 Month 9
Churner 185779 173576 196984 184728 216010 201374 200492 199456 202873
No-Churner 1927748 1935496 1907548 1909698 1893469 1909472 1918349 1983917 1949832
Total 2113527 2109072 2104532 2094426 2109479 2110846 2118841 2183373 2152705
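As a quick sanity check on Table 1, the totals are consistent with the churner and non-churner counts, and the implied monthly churn rate stays roughly between 8% and 11%:

```python
churners     = [185779, 173576, 196984, 184728, 216010, 201374, 200492, 199456, 202873]
non_churners = [1927748, 1935496, 1907548, 1909698, 1893469, 1909472, 1918349, 1983917, 1949832]

for month, (c, n) in enumerate(zip(churners, non_churners), start=1):
    total = c + n
    print(f"Month {month}: {total} customers, churn rate {c / total:.2%}")
```

This persistent imbalance (roughly one churner per ten customers) is what motivates the data-imbalance handling discussed in Subsection 5.7.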
[Figure 7 panels: AUC, PR-AUC, Recall@U (U = 5, 10, 20 × 10^4) and Precision vs. number of training months (1 to 6).]
Figure 7: Adding more training data (x-axis) leads to continued improvement of predictive power (y-axis).
Table 6: Business value of churn prediction.
Group A
Month | Top 5×10^4: Total / Recharge / Rate | Top 5×10^4 to 1×10^5: Total / Recharge / Rate
8     | 7994 / 134 / 1.68%                  | 8032 / 808 / 10.06%
9     | 8024 / 84 / 1.04%                   | 8050 / 798 / 9.91%
Group B
Month | Top 5×10^4: Total / Recharge / Rate | Top 5×10^4 to 1×10^5: Total / Recharge / Rate
8     | 7972 / 1474 / 18.49%                | 8026 / 2280 / 28.41%
9     | 7988 / 2458 / 30.77%                | 8042 / 3194 / 39.72%

[...] costs, since we do not need to re-generate these big tables manually every 5 days. So, Table 3 summarizes the final predictive performance of our deployed churn prediction system (Volume = 4 months, Variety = 140 features, Velocity = 1 month).

5.5 Value
The business value is a major concern of the deployed churn prediction and retention system. Every month, the system provides a list of the top U = 1×10^5 potential churners ranked by classification likelihood (4), which covers around 40% of true churners, as shown in Table 3. We select a subset of potential churners (a small-scale test due to the limitation of retention resources at that time), some from the top 5×10^4 and others from the top 5×10^4 to 1×10^5, for the following prepaid mobile recharge offers when they enter the recharge period: 1) get 100 cashback on a recharge of 100, 2) get 50 cashback on a recharge of 100, 3) get 500MB flux on a recharge of 50, and 4) get 200 minutes of voice calls on a recharge of 50. In the A/B test, it is desirable to see a very low recharge rate for the predicted potential churners without recharge offers (Group A), and a relatively higher recharge rate for potential churners with recharge offers (Group B). We carry out retention campaigns in two consecutive months, 8 and 9. In Month 8, we do not train retention classifiers in Figure 6 to match churners with recharge offers, but send offers by short message service to churners according to domain knowledge provided by operator experts. In Month 9, we use the campaign results as labels (customers who accept different offers are defined as different class labels in Subsection 4.3) to build a retention classifier, matching proper offers with churners automatically. In this way, we can gradually improve the retention efficiency to retain more potential churners.

Table 6 shows the recharge rates of the A/B test in Months 8 and 9. In Group A, the recharge rate is very low in both Months 8 and 9. Among the top 5×10^4 predicted churners, the recharge rate is less than 2%, which means 98% of predicted churners in the recharge period will not recharge within 15 days. Among the top 5×10^4 to 1×10^5 predicted churners, the recharge rate is around 10%. All these results confirm the high accuracy of the churn prediction system. In Month 8, randomly assigning recharge offers in Group B significantly enhances the recharge rate, e.g., from 1.68% to 18.49% and from 10.06% to 28.41%, when compared with Group A. Furthermore, in Month 9, after matching recharge offers with churners in Group B, we see a further significant increase of the recharge rate, e.g., from 1.04% to 30.77% and from 9.91% to 39.72%, when compared with Group A. Indeed, the strategy of matching offers with churners in Month 9 can retain more potential churners, which leads to around 50% more profit than Month 8. This provides one of the clearest illustrations that the integration of churn prediction and retention systems is indeed more valuable. From this perspective, our prediction target changes to those potential churners with a higher likelihood of being retained by a specific retention offer.

5.6 Early Signals
Telco operators wish to predict churners as early as possible to provide proper retention strategies in a timely and precise manner. In this setting, we change the sliding window in Figure 6. We use labels in Month N + 2 and features in Month N − 1 to train the classifier, and use features in Month N and the classifier to predict the potential churners in Month N + 3. This setting can predict churners 3 months earlier. Similarly, we can change the experimental setting to predict churners from 1 to 4 months earlier. Using baseline features, we show the predictive performance of earlier features in Figure 8. Obviously, earlier features provide worse predictive performance. The accuracy significantly decreases as the time interval between the observed features and the predicted labels in Figure 6 increases. For example, the PR-AUC drops around 20% when using features from 2 months earlier instead of 1 month earlier. The results imply that prepaid customers often churn abruptly without providing enough early signals. Most churners show their abnormal behaviors only within the last month before churning. These observations are consistent with Figure 7, where adding earlier instances of features provides little additional information to improve predictive performance.

5.7 Data Imbalance
Data imbalance has long been discussed in building classifiers [6, 15]. There are four widely-used methods to handle data imbalance: 1) Not Balanced, 2) Up Sampling, 3) Down Sampling and 4) Weighted Instance. The first method directly trains the classifier using the imbalanced churner and non-churner instances. The second method randomly copies churner instances up to the same number as the non-churner instances. The third method randomly samples a subset of non-churner instances down to the same number as the churner instances. The fourth method assigns a proportional weight to each instance, where higher weights are assigned to churners and lower weights to non-churners. We use baseline features to evaluate the different methods for data imbalance. Table 7 shows the average predictive performance using the different methods (the variance is too small to be shown). Clearly, the Weighted Instance method outperforms the other methods, with around 10% improvement over the Not Balanced method in terms of PR-AUC. So, we advocate the Weighted Instance method to handle data imbalance in practice.

Table 7: Comparison of methods for data imbalance.
Methods           | AUC     | PR-AUC  | R@U     | P@U
Not Balanced      | 0.83712 | 0.49092 | 0.38969 | 0.43239
Up Sampling       | 0.84140 | 0.51882 | 0.39180 | 0.44187
Down Sampling     | 0.84791 | 0.51967 | 0.39757 | 0.45101
Weighted Instance | 0.87468 | 0.54123 | 0.41251 | 0.48133

5.8 Classifiers
We compare some widely used classifiers in data mining: RF, LIBFM, LIBLINEAR (L2-regularized logistic regression) and GBDT. Some classifiers have won KDD CUP [...]
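The Weighted Instance strategy advocated above can be sketched with inverse-class-frequency weights. This is one common weighting scheme; the paper does not spell out its exact formula, so the weights below are an assumption for illustration.

```python
def instance_weights(labels):
    """Weight each instance inversely to its class frequency, so churners
    (the minority class) get higher weights than non-churners."""
    n = len(labels)
    pos = sum(labels)              # churners (label 1)
    neg = n - pos                  # non-churners (label 0)
    w_pos = n / (2.0 * pos)        # > 1 for the minority class
    w_neg = n / (2.0 * neg)        # < 1 for the majority class
    return [w_pos if y == 1 else w_neg for y in labels]

# Roughly 9% churners, as in the dataset of Table 1.
labels = [1] * 9 + [0] * 91
w = instance_weights(labels)
print(round(w[0], 3), round(w[-1], 3))   # churner weight vs. non-churner weight
```

With these weights, the two classes contribute equally to the (weighted) training objective while the total weight still equals the number of instances, unlike up or down sampling, which duplicate or discard data.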
[Figure 8 panels: AUC, PR-AUC, Recall@U (U = 5, 10, 20 × 10^4) and Precision when predicting churners 1 to 4 months earlier.]
6. CONCLUSIONS

This paper re-explores an old but classic research topic in the telco industry, churn prediction, from the 3V's perspectives of big data. Releasing the power of telco big data can indeed enhance the performance of predictive analytics. Extensive results demonstrate that Variety plays a more important role in telco churn prediction, when compared with Volume and Velocity. This suggests that telco operators should gather both more data instances and more possible data features, together with a scalable computing platform to take advantage of them.

7. REFERENCES

[1] A. M. Almana, M. S. Aksoy, and R. Alzahrani. A survey on data mining techniques in customer churn analysis for telecom industry. Journal of Engineering Research and Applications, 4(5):165–171, 2014.
[2] W. Au, K. Chan, and X. Yao. A novel evolutionary data mining algorithm with applications to churn prediction. IEEE Trans. on Evolutionary Computation, 7(6):532–545, 2003.
[3] K. binti Oseman, S. binti Mohd Shukor, N. A. Haris, and F. bin Abu Bakar. Data mining in churn analysis model for telecommunication industry. Journal of Statistical Modeling and Analytics, 1:19–27, 2010.
[4] D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent Dirichlet allocation. J. Mach. Learn. Res., 3:993–1022, 2003.
[5] L. Breiman. Random forests. Machine Learning, 45(1):5–32, 2001.
[6] J. Burez and D. Van den Poel. Handling class imbalance in customer churn prediction. Expert Systems with Applications, 36(3):4626–4636, 2009.
[7] K. Coussement and D. Van den Poel. Churn prediction in subscription services: An application of support vector machines while comparing two parameter-selection techniques. Expert Systems with Applications, 34(1):313–327, 2008.
[8] K. Dasgupta, R. Singh, B. Viswanathan, D. Chakraborty, S. Mukherjea, A. A. Nanavati, and A. Joshi. Social ties and their relevance to churn in mobile telecom networks. In EDBT, pages 668–677, 2008.
[9] P. Datta, B. Masand, D. Mani, and B. Li. Automated cellular modeling and prediction on a large scale. Artificial Intelligence Review, 14(6):485–502, 2000.
[10] J. Davis and M. Goadrich. The relationship between Precision-Recall and ROC curves. In ICML, pages 233–240, 2006.
[11] E. J. de Fortuny, D. Martens, and F. Provost. Predictive modeling with big data: Is bigger really better? Big Data, 1(4):215–226, 2013.
[12] R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. Lin. LIBLINEAR: A library for large linear classification. Journal of Machine Learning Research, 9:1871–1874, 2008.
[13] I. Guyon, V. Lemaire, M. Boullé, G. Dror, and D. Vogel. Analysis of the KDD Cup 2009: Fast scoring on a large Orange customer database. In JMLR W & CP, 7:1–22, 2009.
[14] J. Hadden, A. Tiwari, R. Roy, and D. Ruta. Computer assisted customer churn management: State-of-the-art and future trends. Computers & Operations Research, 34(10):2902–2917, 2007.
[15] H. He and E. Garcia. Learning from imbalanced data. IEEE Trans. on Knowledge and Data Engineering, 21(9):1263–1284, 2009.
[16] S. Hung, D. Yen, and H. Wang. Applying data mining to telecom churn management. Expert Systems with Applications, 31(3):515–524, 2006.
[17] S. Jiang, G. A. Fiore, Y. Yang, J. Ferreira Jr, E. Frazzoli, and M. C. González. A review of urban computing for mobile phone traces: current methods, challenges and opportunities. In KDD Workshop on Urban Computing, pages 2–9, 2013.
[18] S. Jinbo, L. Xiu, and L. Wenhuang. The application of AdaBoost in customer churn prediction. In International Conference on Service Systems and Service Management, pages 1–6, 2007.
[19] M. Karnstedt, M. Rowe, J. Chan, H. Alani, and C. Hayes. The effect of user features on churn in social networks. In ACM Web Science Conference, pages 14–17, 2011.
[20] N. Kim, K.-H. Jung, Y. S. Kim, and J. Lee. Uniformly subsampled ensemble (USE) for churn management: Theory and implementation. Expert Systems with Applications, 39(15):11839–11845, 2012.
[21] A. Lemmens and C. Croux. Bagging and boosting classification trees to predict churn. Journal of Marketing Research, 43(2):276–286, 2006.
[22] E. Lima, C. Mues, and B. Baesens. Domain knowledge integration in data mining using decision tables: Case studies in churn prediction. Journal of the Operational Research Society, 60(8):1096–1106, 2009.
[23] N. Lu, H. Lin, J. Lu, and G. Zhang. A customer churn prediction model in telecom industry using boosting. IEEE Trans. on Industrial Informatics, 10(2):1659–1665, 2014.
[24] S. Neslin, S. Gupta, W. Kamakura, J. Lu, and C. Mason. Defection detection: Measuring and understanding the predictive accuracy of customer churn models. Journal of Marketing Research, 43(2):204–211, 2006.
[25] M. Owczarczuk. Churn models for prepaid customers in the cellular telecommunication industry using large data marts. Expert Systems with Applications, 37(6):4710–4712, 2010.
[26] L. Page, S. Brin, R. Motwani, and T. Winograd. The PageRank citation ranking: Bringing order to the web. 1999.
[27] C. Phua, H. Cao, J. B. Gomes, and M. N. Nguyen. Predicting near-future churners and win-backs in the telecommunications industry. arXiv preprint arXiv:1210.6891, 2012.
[28] Pushpa and S. Dr.G. An efficient method of building the telecom social network for churn prediction. International Journal of Data Mining & Knowledge Management Process, 2:31–39, 2012.
[29] D. Radosavljevik, P. van der Putten, and K. K. Larsen. The impact of experimental setup in prepaid churn prediction for mobile telecommunications: What to predict, for whom and does the customer experience matter? Transactions on Machine Learning and Data Mining, 3(2):80–99, 2010.
[30] S. Rendle. Scaling factorization machines to relational data. In PVLDB, volume 6, pages 337–348, 2013.
[31] K. Shvachko, H. Kuang, S. Radia, and R. Chansler. The Hadoop distributed file system. In IEEE Symposium on Mass Storage Systems and Technologies (MSST), pages 1–10, 2010.
[32] W. Verbeke, K. Dejaeger, D. Martens, J. Hur, and B. Baesens. New insights into churn prediction in the telecommunication sector: A profit driven data mining approach. European Journal of Operational Research, 218(1):211–229, 2012.
[33] W. Verbeke, D. Martens, C. Mues, and B. Baesens. Building comprehensible customer churn prediction models with advanced rule induction techniques. Expert Systems with Applications, 38:2354–2364, 2011.
[34] C.-P. Wei and I. Chiu. Turning telecommunications call details to churn prediction: a data mining approach. Expert Systems with Applications, 23(2):103–112, 2002.
[35] E. Xevelonakis. Developing retention strategies based on customer profitability in telecommunications: An empirical study. Journal of Database Marketing & Customer Strategy Management, 12:226–242, 2005.
[36] H.-F. Yu, H.-Y. Lo, H.-P. Hsieh, J.-K. Lou, T. G. McKenzie, J.-W. Chou, P.-H. Chung, C.-H. Ho, C.-F. Chang, Y.-H. Wei, et al. Feature engineering and classifier ensemble for KDD Cup 2010. In JMLR W & CP, pages 1–16, 2010.
[37] M. Zaharia, M. Chowdhury, M. J. Franklin, S. Shenker, and I. Stoica. Spark: Cluster computing with working sets. In HotCloud, 2010.
[38] J. Zeng. A topic modeling toolbox using belief propagation. J. Mach. Learn. Res., 13:2233–2236, 2012.
[39] J. Zeng, W. K. Cheung, and J. Liu. Learning topic models by belief propagation. IEEE Trans. Pattern Anal. Mach. Intell., 35(5):1121–1134, 2013.
[40] X. Zhu and Z. Ghahramani. Learning from labeled and unlabeled data with label propagation. Technical Report CMU-CALD-02-107, Carnegie Mellon University, 2002.