Telecom Churn Prediction System Based on Ensemble Learning Using Feature Grouping

Abstract: In recent years, the telecom market has become highly competitive. The cost of retaining existing telecom customers is lower than that of attracting new ones, so a telecom company must understand customer churn through customer relationship management (CRM). CRM analyzers are therefore required to predict which customers will churn. This study proposes a customer-churn prediction system that uses an ensemble-learning technique consisting of stacking models and soft voting. XGBoost, logistic regression, decision tree, and naive Bayes classifiers serve as the base learners of the ensemble.


Introduction
Owing to fierce competition among telecom companies, customer churn is inevitable. Customer churn is the act of a customer ending a subscription to a service provider and choosing the services of another company.
Companies must reduce customer churn because it weakens them. A survey showed that the annual churn rate in the telecom industry ranges from 20% to 40%, and that retaining existing customers costs 5-10 times less than obtaining new customers [1]. Predicting churn customers costs 16 times less than obtaining new customers [2], and decreasing the churn rate by 5% increases profits by 25% to 85% [3]. This shows that customer-churn prediction is important for the telecom sector. Telecom companies consider customer relationship management (CRM) an important factor in retaining existing customers and preventing customer churn.
To retain existing customers, CRM analyzers must predict which customers will churn and analyze the reasons for churn. Once at-risk customers are identified, the company must run marketing campaigns targeted at them to maximize churn-customer retention. Therefore, customer-churn prediction is an important part of CRM [4].
The accuracy of the prediction systems used by CRM analyzers is important: if customer churn is predicted inaccurately, effective retention campaigns cannot be performed. Owing to recent advancements in data science, data-mining and machine-learning technologies provide solutions to customer churn. However, existing models have several limitations. For example, logistic regression, a common churn-prediction model based on older data-mining methods, is relatively inaccurate. Furthermore, feature construction [5] is often neglected during model development. Therefore, a better churn-prediction system is necessary.
This study proposes a new customer-churn prediction system and feature construction to improve accuracy, and the contributions of this study can be summarized as follows: (1) A new prediction system based on ensemble learning with relatively high accuracy is proposed. (2) New features derived from equidistant grouping of customer behavior features are used to improve the system performance.
The rest of this paper is arranged as follows: Section 2 presents the literature review, Section 3 proposes the ensemble model and equidistant feature grouping, and Section 4 describes the prediction system and experimental results. Finally, Section 5 concludes the paper.

Literature Review
Many methods such as machine learning and data mining are used for churn prediction. The decision-tree algorithm is a reliable method for churn prediction [6]. In addition, a neural network method [7], data certainty [8], and particle swarm optimization [9] are used for churn prediction.
Moreover, an artificial neural network (ANN) and decision trees are compared for customer-churn predictions [10], and the literature review shows that the decision-tree algorithm is better than ANNs for customer churn prediction.
A.T. Jahromi [11] studied the effect of customer loyalty on customer churn in prepaid mobile phone companies. In this study, features were segmented, and multiple algorithms such as decision trees and neural networks were applied to the processed data. The results showed that a hybrid approach is better than a single algorithm. KNN-LR is a hybrid approach using logistic regression and the K-nearest neighbor (KNN) algorithm [12]. Researchers compared KNN-LR, logistic regression, and the radial basis function (RBF) network and found that KNN-LR showed the best performance. Y. Zhang [13] proposed a distributed framework for data-mining techniques to predict customer churn; the framework improves the CRM quality of service.
Ruba Obeidat proposed a hybrid genetic programming approach [14] that uses the K-means algorithm and genetic programming to predict customer churn. Sahar F. Sabbeh applied AdaBoost, a boosting algorithm [15], summarized existing machine-learning techniques, and found that AdaBoost showed the best results. Hossam Faris proposed a hybrid swarm-intelligent neural network model [16], an intelligent hybrid of particle swarm optimization and a feedforward neural network for churn prediction; the results show that the model can improve the coverage of churn customers.

Dataset Preparation
The customer churn dataset is an open-source dataset [17] that contains 21 features and 3333 observations. The feature 'Churn' indicates customer churn or non-churn based on existing conditions. Approximately 14.5% of the instances carry the 'T' (churn) label, and the remaining 85.5% carry the 'F' (non-churn) label. Table 1 shows the data features. In this experiment, 80% (2666 instances) and 20% (667 instances) of the dataset are used as the training and test datasets, respectively.
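As a minimal sketch (not the authors' code), the dataset loading and 80/20 split can be reproduced as follows; the file name churn.csv, the stratified split, and the random seed are illustrative assumptions, and the numerical encoding of categorical features is omitted.

import pandas as pd
from sklearn.model_selection import train_test_split

# Load the churn dataset: 3333 observations, 21 features (file name assumed)
df = pd.read_csv("churn.csv")

X = df.drop(columns=["Churn"])         # predictor features
y = (df["Churn"] == "T").astype(int)   # 'T' -> 1 (churn), 'F' -> 0 (non-churn)

# 80% training (2666 instances), 20% test (667 instances)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)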

Proposed Method
This study proposes a new customer-churn prediction system consisting of feature construction, a stacking model, and soft voting, as shown in Figure 1.

New Feature Construction with Equidistant Grouping
Feature engineering is important for data processing. Good feature selection and construction are essential for achieving high performance in machine-learning tasks [18]. Feature construction is the process of inferring or constructing additional features from the original features, and it discovers missing information about the relationships between features. Feature construction transforms the original representation space into a new one to help better achieve data-mining objectives: improved accuracy, easier comprehensibility, truthful clusters, revealing hidden patterns, etc. [5].
Feature grouping collects correlated features: features in the same group are more closely related to one another than to features in other groups. Therefore, it is possible to generate groups of correlated features that are resistant to sample-size variations [19].
Some features in the churn dataset span wide integer ranges, for example from 0 to 365. Customers with similar values show similar churn trends, so churn-prediction accuracy can be improved by dividing such features into equidistant groups.
When customers have similar consumption-expense behaviors, they may churn similarly. In this study, an equidistant grouping method for consumption-expense features is used to construct new features. The original observations in a feature are grouped equidistantly into new groups according to the range of the feature value, so customers with similar consumption-expense patterns fall into the same group and receive identical values of the new feature. The process of feature grouping is shown in Figure 2: the instances of the original feature are grouped equidistantly based on the number of corresponding groups, and instances in the same group are assigned the same value in the new feature.
Sturges' formula provides a method of choosing the optimal number of bins in a histogram of a normally distributed dataset [20]. As shown in Figure 3, the histograms of some features in the dataset approximate a normal distribution. This study therefore uses Sturges' formula to determine the optimal number of groups:

K = 1 + log2(n) = 1 + 3.322 log10(n) (1)

where K represents the optimal number of groups and n is the largest feature value.
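As an illustrative sketch of Equation (1) and the grouping step (assumed, not the authors' published code), the following takes n as the largest feature value, as the paper does, and assigns each instance its equal-width group index:

import math
import pandas as pd

def sturges_groups(n):
    # Equation (1): K = 1 + log2(n) = 1 + 3.322 * log10(n)
    return int(round(1 + 3.322 * math.log10(n)))

def equidistant_group(series):
    # Split the value range of the feature into K equal-width groups;
    # instances in the same group receive the same group index.
    k = sturges_groups(series.max())
    return pd.cut(series, bins=k, labels=False)

# Example: 'Total day calls' ranges from 0 to 165, so K = 8 groups
calls = pd.Series([0, 40, 83, 110, 165])
print(sturges_groups(165))        # 8
print(equidistant_group(calls))   # group index (0-7) for each instance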

Stacking Model
(1) Classifiers
XGB is a decision-tree-based ensemble machine-learning algorithm that uses a gradient-boosting framework. It accurately predicts a target class by combining simple, weak models [21].
LR is a machine-learning method that solves binary classification problems and predicts classification probabilities. LR has advantages such as simple implementation and strong explanatory power, and it is widely used in industry [22].
DT is currently a mainstream prediction and classification technique that resembles human decision making. It adopts a recursive, top-down method: at each internal node it compares attribute values, splits downward from the node depending on those values, and concludes at a DT leaf [23].
In probability and statistics, the Bayesian rule is based on prior knowledge of event probabilities. In machine learning, NBC is a probabilistic classifier based on Bayes' theorem. The classifier uses conditional-independence assumptions and chooses the most likely category as the sample's final category. The algorithm is simple, easy to implement, and less sensitive to missing data, showing small error and stable performance [24].
(2) Stacking
The stacking model is a general method of using a higher-accuracy algorithm to combine lower-accuracy algorithms to achieve greater predictive accuracy. The best results are obtained when the higher-accuracy algorithm is placed in the first level and the lower-accuracy algorithms in the second level [25]. In this study, the stacking model consists of two layers, level 1 and level 2, as shown in Figure 4. A higher-accuracy model is used in the first layer (level 1), while lower-accuracy models are used in the second layer (level 2).
XGB, an ensemble method, achieved the best accuracy and is therefore chosen as the first-level classifier. LR, DT, and NBC have basic and mutually different characteristics, so they can complement XGB on misclassified samples; they are chosen as second-level classifiers. The results of XGB are combined with the data to generate data 1 for the second-level classifiers.
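A minimal sketch of this two-level data flow is shown below, building on the split shown earlier. The hyperparameters are illustrative assumptions, and since the paper does not specify whether labels or probabilities are appended, the predicted churn probability is used here.

import numpy as np
from xgboost import XGBClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Level 1: train XGB on the (numerically encoded) training data
xgb = XGBClassifier(n_estimators=200, eval_metric="logloss")
xgb.fit(X_train, y_train)

# Append the XGB output as an extra feature to form "data 1"
# (out-of-fold predictions would avoid leaking the level-1 fit into level 2)
train1 = np.column_stack([X_train, xgb.predict_proba(X_train)[:, 1]])
test1 = np.column_stack([X_test, xgb.predict_proba(X_test)[:, 1]])

# Level 2: LR, DT, and NBC are trained on data 1
level2 = {
    "LR": LogisticRegression(max_iter=1000),
    "DT": DecisionTreeClassifier(max_depth=6),
    "NBC": GaussianNB(),
}
for name, clf in level2.items():
    clf.fit(train1, y_train)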

Soft Voting
Soft voting estimates class probabilities with different algorithms that take contrasting approaches, in order to improve prediction accuracy. It assigns larger weights to the more important classifiers, and the category with the highest summed probability is selected. The soft-voting prediction is mathematically represented as

ŷ = argmax_i ∑_{j=1}^{m} w_j p_ij (2)

where argmax outputs the class i with the maximum value, w_j represents the weight associated with classifier j's prediction, and p_ij is the probability of classifier j predicting class i. In the process of assigning weights for soft voting, high-confidence models are given more weight based on the importance and accuracy of the classifier [26]. The prediction ensemble system in this study uses soft voting to obtain the final prediction results. Soft voting usually requires different classifiers that compensate for one another's drawbacks. Based on accuracy and algorithmic differences, this study chose the LR, DT, and NBC algorithms as level-2 classifiers, and the results of the level-2 classifiers with different weights are used for soft voting.
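As a small numpy sketch of Equation (2), the following reuses the level-2 models and data 1 from the stacking sketch above; the weights (LR: 0.4, DT: 0.3, NBC: 0.3) are the ones the paper assigns in its final results.

import numpy as np

# Weights reported later in the paper: LR 0.4, DT 0.3, NBC 0.3
weights = {"LR": 0.4, "DT": 0.3, "NBC": 0.3}

# Weighted sum of class probabilities: sum_j w_j * p_ij
weighted = sum(w * level2[name].predict_proba(test1)
               for name, w in weights.items())   # shape (n_samples, n_classes)

# Equation (2): y_hat = argmax_i sum_j w_j * p_ij
y_pred = np.argmax(weighted, axis=1)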

Evaluation Measures
In this study, the proposed ensemble system for predicting customer churn is evaluated using accuracy, precision, recall and F1 score.
Equation (3) calculates the accuracy metric, defined as the ratio of the number of samples correctly classified by the classifier to the total number of samples in a given test dataset:

Accuracy = (TP + TN)/(TP + TN + FP + FN) (3)

where 'TP' is True Positive, 'TN' is True Negative, 'FP' is False Positive, and 'FN' is False Negative.
Equation (4) is the formula for precision, which identifies the proportion of predicted positives that are truly positive:

Precision = TP/(TP + FP) (4)

Recall is a measure of completeness, i.e., the true hits of the algorithm, and is calculated using Equation (5):

Recall = TP/(TP + FN) (5)

The F1 score can be interpreted as a weighted average of precision and recall, where an F1 score reaches its best value at 1 and its worst at 0; the relative contributions of precision and recall to the F1 score are equal. It is calculated using Equation (6):

F1 = 2 × (Precision × Recall)/(Precision + Recall) (6)
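All four measures (Equations (3)-(6)) are available directly in scikit-learn; a brief sketch using y_test and y_pred from the earlier sketches:

from sklearn.metrics import (accuracy_score, f1_score,
                             precision_score, recall_score)

print("Accuracy :", accuracy_score(y_test, y_pred))    # Equation (3)
print("Precision:", precision_score(y_test, y_pred))   # Equation (4)
print("Recall   :", recall_score(y_test, y_pred))      # Equation (5)
print("F1 score :", f1_score(y_test, y_pred))          # Equation (6)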

Feature Construction
As shown in Table 1, 16 features with numerical values in the dataset can be subjected to equidistant grouping. The features 'account length' and 'area code' do not describe the customer's daily behavior or consumption, and the features 'total intl charge' and 'customer service calls' cannot be effectively grouped equidistantly because their value ranges are too small. The value ranges of the remaining 12 features, however, are relatively large and can be used to mine the hidden information in the data. Equation (1) is used to determine the number of groups K for these 12 original features; the results are shown in Table 2.
For example, the feature 'Total day calls' has numerical values ranging from 0 to 165. The feature construction results using Equation (1) are shown in Table 3: the feature 'Total day calls' is divided into 8 equidistant groups, and instances in the same group are assigned the same value in the new feature. The histogram of the new feature also approximates a normal distribution, as shown in Figure 5.
Feature construction improves accuracy by expanding the feature space; if the original dataset and the new features are not combined, the feature space does not grow and accuracy cannot improve. As shown in Figure 6, 12 new features are derived from the original features through feature construction, and the new dataset is generated by combining the 21 original features with the 12 new features.
We use four machine-learning algorithms on two datasets, the original dataset and the new dataset, to compare their performance. The results in Table 4 show that the new dataset yields better accuracy for all classifiers and that the proposed feature construction method improves the stacking-model performance. The new feature grouping discovers latent factors that were missing from the churn dataset regarding the relationships between features, and it expands the feature space by creating additional features.
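The expansion in Figure 6 can be sketched as follows, reusing equidistant_group() from the grouping sketch above; the feature list here is a two-item placeholder (Table 2 lists the actual 12 features):

# Placeholder subset of the 12 grouped features named in Table 2
grouping_features = ["total day calls", "total day minutes"]

# New dataset: 21 original features plus one grouped counterpart per feature
new_df = df.copy()
for col in grouping_features:
    new_df[col + " group"] = equidistant_group(df[col])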

Stacking Model
The LR, DT, and NBC algorithms have relatively low accuracy compared to the XGB algorithm. However, stacking these three relatively low-accuracy algorithms under the high-accuracy XGB improves their performance: Table 5 shows that the accuracy of the stacked model for these classifiers on the new dataset increases by approximately 10%. Table 6 shows that the stacking-model accuracy on the new dataset increases by around 1% compared to the original dataset.

Soft Voting and Final Results
The stacked model outputs three different sets of model results, which are input into soft voting. For better results, the models with higher confidence and accuracy are given more weight in the soft-voting process [26]. Among the three models, LR has the best accuracy, as shown in Table 6; it is used as the main classifier and assigned the largest weight, 0.4. The accuracies of NBC and DT are slightly lower than that of LR, and their weights are each set to 0.3. Each weight for NBC and DT is lower than that of the main classifier LR, but the sum of their weights is greater than the weight of LR, which can compensate for the shortcomings of LR and improve the accuracy of soft voting. Soft voting outputs the label with the highest probability. The results in Table 7 show that the accuracy of the proposed stacking ensemble system is 98.09% for the new dataset and 96.12% for the original dataset. Table 8 compares the proposed ensemble system with other works; the proposed customer-churn prediction ensemble system shows the best accuracy.

Conclusions
Various machine-learning techniques have been used for customer-churn prediction in CRM. This study proposes a customer-churn prediction system based on a stacking ensemble of machine-learning models, which consists of XGB in level 1; LR, DT, and NBC in level 2; and soft voting. Feature construction is used to expand the feature space and discover latent information from implicit features. The proposed feature construction through feature grouping improves prediction accuracy compared with the original customer-churn dataset. The proposed system achieved accuracies of 96.12% and 98.09% on the original and new datasets, respectively, the best among the compared prediction systems. The proposed system can help determine important factors affecting customer purchasing behavior in the telecommunications industry.