A Data-Driven Approach to Improve Customer Churn Prediction Based on Telecom Customer Segmentation

Numerous valuable clients can be lost to competitors in the telecommunication industry, leading to profit loss. Thus, understanding the reasons for client churn is vital for telecommunication companies. This study aimed to develop a churn prediction model to predict telecom client churn through customer segmentation. Data were collected from three major Chinese telecom companies, and Fisher discriminant equations and logistic regression analysis were used to build a telecom customer churn prediction model. The results show that the churn model built with logistic regression analysis achieved the higher prediction accuracy (93.94%). This study can help telecom companies efficiently predict the likelihood of customer churn and take targeted measures to avoid it, thereby increasing their profits.


Introduction
Client churn is a significant problem for telecommunication companies as it results in decreased profit [1]. Moreover, this is particularly relevant since telecommunication companies operate in a saturated global market, meaning it is increasingly challenging to retain customers. Although such companies make considerable marketing investments to acquire new users, retaining a customer is usually less expensive than acquiring a new one [2]. For these reasons, avoiding customer churn has become a significant concern for telecommunications companies.
Customer churn refers to the loss of a customer in favor of a competitor [3], reflecting the end of the relationship. Customer churn prediction allows one to identify the reasons for the end of the relationship and assemble a strategy that will minimize the churn rate, increasing profits. Thus, anticipating a customer's intention to end a relationship is instrumental for telecommunication companies and is considered a competitive advantage.
Previous studies have attempted to understand customer churn. For instance, Bach et al. [1] suggested a clustering and classification framework for churn management. Fathian et al. [4] proposed a new combined model based on ensemble and clustering classifiers. Holtrop et al. [5] aimed to anticipate customer churn using the principles of data anonymization. Although multiple studies have aimed to explain and predict customer churn, no study has tried to predict telecom client churn through discriminant analysis and logistic regression.
Facing this identified gap in the literature, this study aims to use factor analysis to investigate the business characteristics of telecom clients and to build a discriminant model and a logistic regression model to predict telecom client churn using customer segmentation data from three major Chinese telecommunication companies. Data are collected from China Mobile, China Unicom, and China Telecom and analyzed using a data mining approach to understand the factors that influence and allow one to predict telecom customer churn. Our study extends the previous work by Zhang [6] by innovatively showing how logistic regression analysis can be applied to build a telecom customer churn prediction model. Thus, we propose the following research questions: (1) Which factors will lead to customer loss? (2) How can one predict customer loss using the approach of data mining? (3) How can one develop a model to predict customer churn? It is expected that the results of this study will help telecommunications managers to identify the customer churn profile and create strategies to retain customers.

Literature Review
Technological progress is crucial in determining who will be the market leader and who will achieve better market performance [7]. Meanwhile, technological progress has already changed the competition and the rules of the game in the telecom industry. In the past, telecom operators generally won customers through price competition. However, today's consumers pay more attention to differentiated and value-added services, which has increased switching costs while making consumers more loyal [8]. In the telecom sector, technological progress could help companies identify customers with a high risk of churn and establish a business strategy with customer retention as the core goal, which will make the companies healthier and allow for long-term operation [9]. Finally, the development of telecommunication technologies has also brought about more market competition and higher customer churn rates. The customer churn rate for telecom customers in the European market has reached 30%, while in Asia it has reached 60% [10].
Through a Bayesian belief network analysis, it was concluded that the average tariff amount will affect customer churn. The two other factors are the average call time and tariff type [11]. The tariff structure will affect customers' perceptions of value, affecting customer churn [12]. Through a multilayer perceptron (MLP) analysis of a sample of five thousand Jordanian telecom customers, it was concluded that the monthly tariff is the most significant factor affecting customer churn [13]. Tariffs for domestic calls are essential in predicting customer loss [14]. There are two types of pricing in the telecom sector: two-part tariff and pay-per-use pricing. Compared with two-part tariffs, pay-per-use pricing can reduce the customer churn rate by 10.5% [12]. Through a discriminant analysis and t-test of one thousand Indian telecom customers, it was concluded that the tariff rates for calls and customer satisfaction with the telecom service offered are the two key factors determining customer churn [13].
As competition in the telecommunications market intensifies, providing tariff price promotions and differentiated services for the key customers will be an efficient method to avoid customer churn [15]. In the Korean market, the tariff rate is one of the critical factors determining customer churn. Tariffs and customer care services are the two main factors influencing customer satisfaction and churn, as shown using discriminant and regression analyses [16]. Service quality in the telecom industry refers to Internet signal quality. Good service quality will improve customer satisfaction and loyalty, lowering the risk of customer loss [17]. Additionally, it will also help to attract new customers. Through a factor analysis and regression analysis, it was concluded that tariffs and service quality are key factors in prepaid customer churn. Hence, companies need to monitor and improve their service quality [13].
Customer retention and loss are influenced by the customers' sociodemographic characteristics and satisfaction [13]. The customers' sociodemographic data, for example regarding gender, could be used to predict whether customers will be lost or not [18]. Age and gender will influence telecom customers' preferences and behavior. People aged less than thirty years value customer service quality, value-added services, and mobile service fees. The tariff is not a key factor in determining churn for this segment. However, those aged older than thirty years pay more attention to tariff pricing, which will largely influence their retention or loss [19].
Predicting customer churn is not an easy task, since customer behaviors are heterogeneous [20]. In the past, companies have tended to investigate customer churn using traditional methods such as surveys. However, the data mining approach has been proven to be an efficient and better solution [21]. Specifically, a customer churn prediction model could be established to understand the factors that lead to customer churn and to predict customer loss. The model could be optimized through data mining to improve its prediction accuracy [18]. Moreover, customer segmentation is often combined with customer churn prediction for greater management effectiveness [22].
By comparing the accuracy of telecom customer churn prediction models constructed using different data mining methods, we can measure which data mining method is best [23]. In addition to accuracy, there are other metrics for measuring the performance of customer churn prediction models, such as the understandability and intuitiveness of the model [24]. Idris et al. [25] established a telecom customer churn prediction model with good understandability and intuitiveness using the GP-AdaBoost method.
There are two well-known data mining methods with outstanding prediction accuracy and understandability. One of them is the decision trees (DT) method and the other one is logistic regression. However, both methods have shortcomings: it is difficult for DT to deal with the linear relations of variables, and it is hard for logistic regression to handle the interaction impacts of variables. Thus, the logit leaf model (LLM) method performs better in classifying data. Compared with DT or logistic regression, LLM has shown better performance and understandability [26].
Vo et al. [27] stated that the current churn prediction methods mainly use structured rather than unstructured data to conduct analyses. Moreover, unstructured data and telephone communication voice content are innovatively used to build customer churn prediction models.
Machine learning (ML) and deep learning (DL) are suitable for customer loss prediction. An optimized synthetic minority oversampling method named the ISMOTE-OWELM model was used to improve accuracy in customer churn prediction [28].

Hypotheses and Proposed Model
Customer consumption tags distinguish and characterize customers by expense-related information, such as by monthly fee, package type, or mobile terminal price [29]. Precision marketing can be performed using telecom data to classify and identify customers. Using such information will allow telecom operators to concentrate on the target customers and convert them into potential customers. This could significantly optimize marketing expenses and avoid customer churn [29].
Expense-related data could be applied to understand the reasons for customer loss. Customers with similar consumption-expense behaviors have similar reasons for churn. Users with similar expense-related characteristics could be segmented into groups to conduct an analysis [30]. Thus, we propose the following hypotheses: Taiwan's telecommunication industry has experienced fierce competition since it removed the restriction of wireless telecom services, and customer churn management has become the operators' focus in order to retain telecom customers by satisfying their needs. One of the main challenges is to predict customer churn [31].
Using empirical analysis, different data mining methods that can be used to allocate 'propensity-to-churn' scores were evaluated from customer and operator perspectives. The results showed that call data along with neural network and DT methods could be applied for accurate customer churn prediction models. Furthermore, the customers' recent six-month transactions can be applied to predict customer churn for the coming month. The call data can also be included in the transaction data. Thus, we proposed the following hypotheses: The Data Warehouse system, which accumulates telecom data, such as for SMS, was used to increase the customer retention rate for SyriaTel. Generally, all SMS and MMS data that indicate customer behavior should be used, as it is unknown which features will be valuable in predicting churn.
The SMS and MMS data for daily, weekly, and monthly users in the past nine months were aggregated for the research to identify related variables and see how they relate to each other. Three graphs were built using three kinds of weights: (1) the standardized SMS and MMS quantities; (2) the standardized customer calling times; (3) the mean of the first two standardized weights. Two features for each graph were produced by applying the SenderRank and PageRank algorithms to the directed graphs [23].
The Indian liberalization and globalization process has influenced the telecom industry. The market leader Airtel was selected to conduct a case study through its value proposition approach by concentrating on new value-added services such as the new SMS Pack plan. Consequently, further hypotheses were also assessed. All hypotheses are listed in Table 1.
Table 1. The hypotheses of the study.
H1 The total fee receivable for the month has a positive impact on customer loss.
H2 The fixed monthly cost has a positive impact on customer loss.
H3 The local fee has a positive impact on customer loss.
H4 The roaming fee has a positive impact on customer loss.
H5 China Unicom's network fee has a positive impact on customer loss.
H6 The fee with China Mobile has a positive impact on customer loss.
H7 The fixed-line fee has a positive impact on customer loss.
H8 The total monthly caller MOU has a positive impact on customer loss.
H9 The total monthly called MOU has a positive impact on customer loss.
H10 The total local caller MOU has a positive impact on customer loss.
H11 China Unicom's SMS quantity has a positive impact on customer loss.
H12 China Mobile's SMS quantity has a positive impact on customer loss.
H13 China Telecom's SMS quantity has a positive impact on customer loss.

Data Collection
Client data were provided by three major Chinese telecommunication operators: China Mobile, China Unicom, and China Telecom. These data included the information for 4126 clients from 2007 to 2018, as well as anonymous demographic information, business information, and basic metadata information regarding the clients' fees, calls, and SMS and MMS activity.
The information from the dataset is shown in Table 2. Table 2. Dataset information.

Information Characterization
Demographic and business information

Data Analysis
For data analysis, we used SPSS. Factor analysis, Pearson correlation, chi-square, and discriminant and logistic regression analysis methods were used to predict customer churn [32].
The meanings of the independent variables from F1 to F6 are shown in Table 3.
Table 3. The meanings of independent variables (adapted from Zhang [6]).
F1 Common factor of non-monthly fixed cost
F2 Common factor of monthly fixed cost
F3 Common factor of the calls MOU
F4 Common factor of long-distance and roaming call
F5 Common factor of SMS
F6 Common factor of China Unicom's MMS

Dataset Description
The samples' sex characteristics are shown in Table 4 and Figure 1. Of the 4126 customers, 1184 were females (28.7%) and 2942 were males (71.3%).   Among the 4126 customers, the ages ranged from 9 to 107. However, the most common ages ranged from 20 to 60 years old, representing 95% of the total. Customers aged 40 years were most represented, with 165 cases (4%).

Variable Selection
Factor analysis rests on the idea that many measured variables can be reduced to fewer latent variables that share common variance [33]. Some factors are unobservable and unmeasurable, but variables can be reduced into the same group based on similar characteristics to test the relationships [34]. Expense data, such as monthly fee, package type, or mobile terminal price data, can be used to distinguish and characterize customers into different customer consumption tags [29]. Cost and expense management is critical to the operation of companies, and the factor analysis approach could be used to study the expense and cost data and to understand the relationships between the variables [35]. Telecom customer cost data, such as wireless data fees, are suitable for use in factor analysis and could be used to understand customer behavior [36]. Thus, the telecom customers' expense data were selected to conduct the following factor analysis. All expense-related factors, including the (1) total fee receivable for the month, (2) fixed monthly costs, (3) local fee, (4) roaming fee, (5) Unicom's network fee, (6) China Mobile's fee, and (7) fixed-line fee, were used to conduct the factor analysis and analyze the characteristics of the cost factors. Subsequently, Kaiser-Meyer-Olkin (KMO) and Bartlett tests were applied to identify whether these factors are suitable for factor analysis.

Research Hypothesis Testing: KMO and Bartlett Sphericity Tests
The KMO and Bartlett tests were carried out to identify whether the data could be used to conduct a factor analysis with good effect. If the KMO measure of sampling adequacy is >0.5 and the significance value (Sig.) of Bartlett's test is <0.05, the data can be used to conduct a factor analysis with good effect. The KMO and Bartlett test results for expense data are shown in Table 5. The KMO measure of sampling adequacy was 0.599 > 0.5, and the value of Sig. was 0.000 < 0.05. Therefore, it was concluded that the data were suitable for factor analysis. Factor analysis needs to extract overlapping information from variables in order to reduce them. This requires that the original variables have strong correlations with each other. If there is no overlapping information between the variables, they cannot be integrated and concentrated, and there is no need to perform the factor analysis.
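The two suitability checks described above can be sketched in a few lines. The example below is only an illustration of the standard KMO and Bartlett sphericity formulas, not the SPSS procedure used in this study; the function names and the 3 × 3 correlation matrix are hypothetical, and the p-value of the chi-square statistic is left out because it would need a statistics library.

```python
import math


def invert_with_determinant(matrix):
    """Gauss-Jordan inversion with partial pivoting; returns (inverse, determinant)."""
    n = len(matrix)
    aug = [
        [float(v) for v in row] + [1.0 if i == j else 0.0 for j in range(n)]
        for i, row in enumerate(matrix)
    ]
    det = 1.0
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("correlation matrix is singular")
        if pivot != col:
            aug[col], aug[pivot] = aug[pivot], aug[col]
            det = -det  # a row swap flips the sign of the determinant
        pivot_value = aug[col][col]
        det *= pivot_value
        aug[col] = [v / pivot_value for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug], det


def kmo_and_bartlett(corr, n_samples):
    """Return (overall KMO, Bartlett chi-square, degrees of freedom)."""
    p = len(corr)
    inv, det = invert_with_determinant(corr)
    sum_r2 = 0.0        # squared off-diagonal correlations
    sum_partial2 = 0.0  # squared off-diagonal partial correlations
    for i in range(p):
        for j in range(p):
            if i != j:
                partial = -inv[i][j] / math.sqrt(inv[i][i] * inv[j][j])
                sum_r2 += corr[i][j] ** 2
                sum_partial2 += partial ** 2
    kmo = sum_r2 / (sum_r2 + sum_partial2)
    chi_square = -(n_samples - 1 - (2 * p + 5) / 6) * math.log(det)
    dof = p * (p - 1) // 2
    return kmo, chi_square, dof


# Hypothetical correlation matrix for three variables (not the study's data).
illustrative_corr = [[1.0, 0.5, 0.5], [0.5, 1.0, 0.5], [0.5, 0.5, 1.0]]
kmo, chi_square, dof = kmo_and_bartlett(illustrative_corr, n_samples=100)
print(round(kmo, 3), round(chi_square, 2), dof)
```

A KMO value above 0.5 together with a large chi-square statistic (small p-value) corresponds to the decision rule stated in the text.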
We used the common factor variance (communality) to judge how much information the factor analysis retained (Table 6). The extracted communalities reached a maximum of 87.8% and a minimum of 57.8%, with most being greater than 60%. The effect was good, and the information loss was low for each variable. It can be concluded that the results were representative and reliable.

Total Interpretation Variance
The cumulative variance of the first two factors was 72.798%, suggesting that most of the information in the observed variables was represented (Table 7). Therefore, the common factors F1 and F2 were selected. Figure 2 shows a scree plot. The horizontal axis shows the component numbers, while the vertical axis shows the eigenvalues. The eigenvalues of the first two common factors were greater than 1, which meant they were suitable for analysis.
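The selection rule used here (keep components whose eigenvalue exceeds 1, then report their cumulative share of variance) can be illustrated as follows. This is a minimal sketch under the assumption that the eigenvalues come from a correlation matrix, so that they sum to the number of variables; the function name and the eigenvalues are hypothetical and are not the values behind Table 7.

```python
def retained_factors(eigenvalues):
    """Apply the eigenvalue-greater-than-one rule to correlation-matrix eigenvalues."""
    total = sum(eigenvalues)  # equals the number of variables for a correlation matrix
    kept = [value for value in eigenvalues if value > 1.0]
    cumulative_percent = 100.0 * sum(kept) / total
    return len(kept), cumulative_percent


# Hypothetical eigenvalues for seven variables (illustration only).
example_eigenvalues = [3.0, 2.0, 0.8, 0.5, 0.4, 0.2, 0.1]
n_kept, cumulative = retained_factors(example_eigenvalues)
print(n_kept, round(cumulative, 2))
```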

Component Matrix
A component score coefficient matrix is shown in Table 8. F2 had a more significant load for the number of fixed monthly costs. Additionally, except for the small load on the fixed monthly cost, the first factor has the same load on the other cost factors. Therefore, the first factor F1 can explain the non-monthly fixed cost factor. Therefore, we concluded that F1 (common factor of non-monthly fixed costs) and F2 (common factor of monthly fixed costs) could characterize the expense attributes. The formulas used were adapted from Zhang [6].
Customer call data, such as data for total monthly calls, long-distance calls, and roaming calls, are suitable for use in a factor analysis to investigate the main factors influencing customer preference for the service provider [37]. Factor analysis was conducted on several variables, including customer call data, to identify the main factors determining customer loyalty. It was concluded that better call quality and service will positively influence customer loyalty [38]. Thus, the telecom customers' call data were selected to conduct the following factor analysis. The following call-related factors were used to conduct the factor analysis and analyze the characteristics of the call factors: (1) total monthly traffic MOU; (2) total monthly caller MOU; (3) total monthly called MOU; (4) total local MOU; (5) total local called MOU; (6) total long-distance MOU; (7) total roaming MOU. Subsequently, KMO and Bartlett tests of sphericity were applied to identify whether these factors were suitable for factor analysis.

Research Hypotheses Testing: KMO and Bartlett Sphericity Tests
The KMO and Bartlett test results for call data are shown in Table 9. The KMO measures of sampling adequacy were 0.555 > 0.5, and the value of Sig was 0.000 < 0.05. It was concluded that the data were suitable for factor analysis.

Common Factor Variance
The common factor variance results are shown in Table 10. The common extracted factor values ranged between 47.5% and 99.6%. Most of these extraction values were greater than 80%, revealing an ideal overall effect. The results were considered scientific and representative, as each variable's loss rate was low.

Total Interpretation Variance
The cumulative variance reached 83.463% (Table 11), suggesting most of the observed variables were represented. Therefore, most of the original information was captured by factors F3 and F4. The scree plot is displayed in Figure 3. The horizontal axis shows the component numbers, while the vertical axis shows the eigenvalues. The feasibility of the first two common factors was supported, as their eigenvalues were greater than 1.

Component Matrix
The component score coefficient matrix is shown in Table 12. F4 had more significant loads for the total long-distance MOU and total roaming MOU. Therefore, long-distance and roaming calls were summarized as the second factor F4. Additionally, the total monthly called MOU, total local MOU, and total local called MOU numbers showed significant loads for the first factor F3. Therefore, the first factor F3 can explain the called MOU factor. We therefore concluded that F3 (common factor of the called MOU) and F4 (common factor of long-distance and roaming calls) characterize the call attributes. The formulas for calculation were adapted from Zhang [6].

Selection of Variables
Factor analyses are performed to explore the factors that influence telecom customer experiences using certain variables, including customer SMS and MMS data [39]. Customer SMS data, for example relating to the SMS quantity in the telecom package, are suitable for use in factor analyses, which could help telecom companies to identify the factors that impact customer satisfaction and loyalty [40]. The telecom sector has achieved impressive development in Bangladesh. Customer SMS data have been used in factor analyses, helping to understand the relationship between SMS data and customer loss [20]. Thus, the telecom customers' SMS and MMS data in the data source were selected to conduct the following factor analysis. All SMS-related factors, including (1) China Unicom's SMS quantity, (2) China Mobile's SMS quantity, (3) China Telecom's SMS quantity, (4) China Unicom's MMS quantity, and (5) CRBT, were used to conduct the factor analysis and analyze the characteristics of the SMS factors. Subsequently, KMO and Bartlett tests were applied to identify whether these factors could be used to conduct the factor analysis.

Research Hypothesis Testing: KMO and Bartlett Tests of Sphericity
The test results of KMO and Bartlett for SMS data are shown in Table 13. The KMO measures of sampling adequacy were 0.567 > 0.5, and the value of Sig was 0.000 < 0.05. It was concluded that the data were suitable for factor analysis.

Common factor variance
The results of common factor variance are shown in Table 14. The extracted common factor variances were all greater than 50%. The results were considered scientific and representative, as each variable's loss rate was low.
Table 14. Common factor variance (adapted from Zhang [6]).

Total Variance of Interpretation
The cumulative variance was 50.087% (Table 15), suggesting that about half of the information in the observed variables was represented by factors F5 and F6. The scree plot is displayed in Figure 4. The horizontal axis shows the component numbers, while the vertical axis shows the eigenvalues. The feasibility of the first two common factors was supported, as their eigenvalues were greater than 1.

Component Matrix
The component score coefficient matrix is shown in Table 16. F6 had more significant loads for China Unicom's MMS quantity and CRBT. Therefore, MMS and CRBT were summarized as the second factor F6. Moreover, the first factor F5 showed more significant loads for China Unicom's SMS quantity, China Mobile's SMS quantity, and China Telecom's SMS quantity. Therefore, the SMS quantity can be explained by the first factor F5.
According to the data, the discriminant analysis revealed an appropriate discriminant model. The model refers to the discrimination between the sample and the parent. First, historical data are established from the samples' discriminant distances. Then, each sample's data are replaced with the discriminant distance to calculate the actual distance.

Analysis of Discriminant Model
The discriminant model's eigenvalues were analyzed to identify the discriminating judgment power of the function. Then, Wilks' lambda discriminant test was applied to confirm the significance of the discriminant function, i.e., whether the discriminant function was valid or not. Afterward, Fisher's linear discriminant function was used for the telecom customer loss prediction equation, indicating the key factors (F1, F2, F3, F4, F5, and F6) that could influence the telecom customer churn. Finally, an accuracy test was conducted for the discriminant function to investigate the accuracy of the discriminant equation.
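For a single discriminant function, as in a two-group problem, the eigenvalue, the canonical correlation, and Wilks' lambda are tied together by standard identities. The short sketch below applies them to the eigenvalue reported for this model (0.030); it assumes the single-function case and is meant only as a consistency check, not as a reproduction of the SPSS output.

```python
import math

eigenvalue = 0.030  # eigenvalue of the discriminant function reported in the text

# Canonical correlation: r = sqrt(lambda / (1 + lambda))
canonical_correlation = math.sqrt(eigenvalue / (1.0 + eigenvalue))

# Wilks' lambda for one discriminant function: 1 / (1 + lambda)
wilks_lambda = 1.0 / (1.0 + eigenvalue)

print(round(canonical_correlation, 3), round(wilks_lambda, 3))
```

The computed canonical correlation agrees with the reported value of 0.171 to three decimals.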
(1) Eigenvalues of the discriminant function
The discriminant model was used in the analysis. The higher the eigenvalue of the discriminant function, the higher its discriminating power. The last column of Table 17 gives the canonical correlation coefficient. The eigenvalue of the discriminant function (0.030) and the canonical correlation (0.171) were within an acceptable range (Table 17). In Table 17, "a" means that the first canonical discriminant function was used in the analysis.
(2) Wilks' lambda test
Wilks' lambda is the ratio of the within-group sum of squares to the total sum of squares. The value is one when the group means for all observations are equal; it is close to zero when the within-group variation is small compared to the total variation. Thus, a large Wilks' lambda value indicates that the means of each group are more or less equal; a small Wilks' lambda value shows that the means of each group are different. It can be seen from Table 18 that the first discriminant function explained 97.1% of all variations. Moreover, the value of Sig. was 0.000 < 0.05, meaning that this discriminant function was significantly established.
(3) Fisher's linear discriminant function test
Y1 and Y2 represent customer churn and customer retention, respectively (Table 19).
Table 19. Classification function coefficients (adapted from Zhang [6]).

The Loss or Retain of Customers Customer Loss-Y1 Customer Retain-Y2
Factor score for F1 −1.518 −0.
The discriminant model indicates the top factors that could be used to forecast the telecom customer churn. The classification is considered to be Y1 if the result is one, revealing customer churn. If the result is zero, the classification is Y2, suggesting customer retention.
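Classification function coefficients of this kind are normally used by computing one linear score per group and assigning the customer to the group with the larger score. The sketch below illustrates that rule only. Because the coefficient table is incomplete in this text, every coefficient, constant, and factor score in the example is hypothetical, as is the function name.

```python
def classify(factor_scores, coefficients, constants):
    """Score each group with its linear classification function; pick the largest."""
    scores = {
        group: constants[group] + sum(c * x for c, x in zip(coefs, factor_scores))
        for group, coefs in coefficients.items()
    }
    return max(scores, key=scores.get), scores


# Hypothetical coefficients for F1..F6 (NOT the values of Table 19).
coefficients = {
    "Y1": [0.5, -0.2, 0.1, 0.3, 0.0, 0.4],   # customer loss
    "Y2": [-0.1, 0.2, 0.0, -0.3, 0.1, -0.2],  # customer retention
}
constants = {"Y1": -0.7, "Y2": -0.7}

# Hypothetical factor scores F1..F6 for one customer.
label, scores = classify([1.0, 0.0, 0.5, 1.0, 0.0, 0.5], coefficients, constants)
print(label, scores)
```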
(4) Accuracy test for discriminant function
One hundred random samples from the dataset were chosen to conduct the accuracy test; half of them were lost customers, and half were retained customers. The samples were imported into the telecom customer churn discrimination model, and the predicted churn results were compared with the actual outcomes to judge the prediction accuracy of the model. The results are shown in Table 20. The overall prediction accuracy rate was 75%. Among the 50 retained customers, 36 were predicted correctly, an accuracy rate of 72%. Among the 50 churned customers, 39 were predicted correctly, an accuracy rate of 78%.
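The accuracy rates follow directly from the counts given in the text, as this short check shows (the variable names are illustrative):

```python
# Counts reported for the 100-sample accuracy test of the discriminant model.
retained_correct, retained_total = 36, 50
churn_correct, churn_total = 39, 50

retained_accuracy = retained_correct / retained_total
churn_accuracy = churn_correct / churn_total
overall_accuracy = (retained_correct + churn_correct) / (retained_total + churn_total)

print(f"retained: {retained_accuracy:.0%}, churn: {churn_accuracy:.0%}, "
      f"overall: {overall_accuracy:.0%}")
```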

Logistic Regression Model of Telecom Customer Churn Prediction
It can be seen from Table 21 that a total of 19 items, such as the total fee receivable for the month, are independent variables. Moreover, filter_$, which indicates whether the customer is lost or retained, is the dependent variable for the binary logistic regression analysis used to build the customer loss prediction model. A filter_$ value of 1 indicates that the customer is lost, and a value of 0 indicates that the customer is retained. Based on these results, we can estimate whether or not a customer will stay with a telecommunications service provider based on the information in the dataset. The model formula is:
ln(p/(1 − p)) = −2.056 − 0.002 × Total fee receivable for the month − 0.308 × Fixed monthly cost − 0.077 × Local fee + 0.023 × Roaming fee + 0.041 × Unicom network fee + 0.031 × Fee with China Mobile + 0.032 × Fee with fixed-line + 0.003 × China Unicom SMS quantity + 0.004 × China Mobile SMS quantity + 0.003 × China Telecom SMS quantity + 0.009 × China Unicom MMS quantity + 0.238 × CRBT − 0.539 × Total monthly traffic MOU − 0.016 × Total monthly caller MOU − 0.057 × Total monthly called MOU + 0.559 × Total local MOU + 0.039 × Total local called MOU + 0.548 × Total long-distance MOU + 0.510 × Total roaming MOU
Here, p represents the probability that filter_$ is 1, which indicates that the customer will be lost, and 1 − p represents the probability that filter_$ is 0, which indicates that the customer will be retained.
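To show how the fitted equation turns customer data into a churn probability, the following sketch evaluates the logit with the coefficients listed above and applies the logistic function. The dictionary keys and the function name are illustrative choices, and the sample customer is hypothetical; the units of each variable are assumed to match those used when the model was fitted.

```python
import math

INTERCEPT = -2.056
COEFFICIENTS = {
    "total_fee_receivable": -0.002,
    "fixed_monthly_cost": -0.308,
    "local_fee": -0.077,
    "roaming_fee": 0.023,
    "unicom_network_fee": 0.041,
    "fee_with_china_mobile": 0.031,
    "fee_with_fixed_line": 0.032,
    "unicom_sms_quantity": 0.003,
    "mobile_sms_quantity": 0.004,
    "telecom_sms_quantity": 0.003,
    "unicom_mms_quantity": 0.009,
    "crbt": 0.238,
    "total_monthly_traffic_mou": -0.539,
    "total_monthly_caller_mou": -0.016,
    "total_monthly_called_mou": -0.057,
    "total_local_mou": 0.559,
    "total_local_called_mou": 0.039,
    "total_long_distance_mou": 0.548,
    "total_roaming_mou": 0.510,
}


def churn_probability(features):
    """Return p = P(filter_$ = 1); variables missing from `features` count as 0."""
    logit = INTERCEPT + sum(
        coef * features.get(name, 0.0) for name, coef in COEFFICIENTS.items()
    )
    return 1.0 / (1.0 + math.exp(-logit))


# Hypothetical customer: only two variables are non-zero.
sample_customer = {"fixed_monthly_cost": 2.0, "crbt": 1.0}
print(round(churn_probability(sample_customer), 4))
```

A customer would be classified as lost when the probability exceeds a chosen cut-off; 0.5 is a common default, although the cut-off used in this study is not stated.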
According to the parameter test, the regression coefficient of the total fee receivable for the month was −0.002, which was not significant (z = −0.402, p = 0.688 > 0.05). This suggests that the total fee receivable for the month has no significant effect on filter_$. Thus, hypothesis 1 was rejected: the total monthly fee receivable does not have a positive impact on customer loss.
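The reported p-values can be checked against the z statistics. The sketch below assumes a two-sided Wald z-test with a standard normal reference distribution, which is the usual test for logistic regression coefficients. The section does not state the test explicitly, so this is an assumption, and the function name is illustrative.

```python
import math


def two_sided_p_value(z):
    """Two-sided p-value P(|Z| > |z|) for a standard normal Z."""
    # For a standard normal Z, P(|Z| > |z|) = erfc(|z| / sqrt(2)).
    return math.erfc(abs(z) / math.sqrt(2.0))


p_total_fee = two_sided_p_value(-0.402)    # total fee receivable for the month
p_fixed_cost = two_sided_p_value(-11.564)  # fixed monthly cost

print(round(p_total_fee, 3))  # about 0.688, matching the reported value
print(p_fixed_cost < 0.001)   # True: effectively zero, reported as 0.000
```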
The regression coefficient of the fixed monthly cost was −0.308, which was significant (z = −11.564, p < 0.001), suggesting that the fixed monthly cost has a significant negative impact on customer churn. Moreover, the odds ratio (OR) was 0.735, meaning that when the fixed monthly cost increases by one unit, the odds of churn are multiplied by 0.735, a decrease of about 26.5%. Thus, hypothesis 2 was rejected: the monthly fixed cost does not have a positive impact on customer loss.
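The odds ratio is the exponential of the regression coefficient, so the value of 0.735 can be recovered from the coefficient −0.308. A minimal Python check follows; the function and variable names are illustrative.

```python
import math


def odds_ratio(coefficient, delta=1.0):
    """Multiplicative change in the odds p / (1 - p) when the predictor rises by `delta`."""
    return math.exp(coefficient * delta)


or_fixed_cost = odds_ratio(-0.308)          # fixed monthly cost
pct_change = (or_fixed_cost - 1.0) * 100.0  # percentage change in the odds of churn

print(round(or_fixed_cost, 3))  # 0.735
print(round(pct_change, 1))     # -26.5
```

Note that the odds ratio describes a change in the odds of churn, not in the probability itself; the change in probability depends on the customer's baseline.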
The summary analysis showed that Unicom's network fee, China Mobile's network fee, fixed-line fee, China Unicom's SMS quantity, China Mobile's SMS quantity, China Unicom's MMS quantity, CRBT, total local MOU, total long-distance MOU, and total roaming MOU have a significant positive influence on customer churn. On the other hand, the fixed monthly cost, local fee, total monthly traffic MOU, total monthly caller MOU, and total monthly called MOU have a significant negative influence on customer churn. The total fee receivable for the month, roaming fee, China Telecom's SMS quantity, and total local called MOU have no significant effect on customer churn. Therefore, H1, H2, H3, H4, H8, H9, H10, and H13 were rejected, while H5, H6, H7, H11, and H12 were confirmed. The logistic regression analysis and hypothesis tests show that expense, SMS, and call information factors influence customer churn. As shown in Table 22, the model's overall prediction accuracy is 93.94%, and the model's fit is acceptable. Thus, it is possible to estimate whether or not a customer will stay with a telecommunications service provider based on information from the data, and the logistic regression method can be used to predict customer churn with high accuracy.

Discussion
The data are mainly from three major Chinese telecom operators: China Mobile, China Unicom, and China Telecom. This study aimed to use factor analysis to investigate the business characteristics of telecom clients and to build a discriminant model and a logistic regression model to predict telecom client churn. We showed how Fisher discriminant equations and logistic regression analysis can be applied to build a telecom customer churn prediction model, and we evaluated both models on prediction accuracy. The comparison suggests that the logistic regression approach performs better when building a telecom customer churn prediction model, with an accuracy rate of 93.94%.
Today's market is becoming more competitive [41]. In such an environment, telecom companies must make critical decisions and develop effective retention methods to avoid customer churn, as retaining existing customers is much less expensive than acquiring new ones [2]. The telecom customer churn prediction model constructed using a logistic regression approach suggests that churn can be predicted when customers are unsatisfied with the offered service.
Both Fisher discriminant equations and logistic regression analysis were used to build a telecom customer churn prediction model. In our preliminary study, the logistic regression approach performed better, with an accuracy rate of 93.94% compared with 75% for Fisher's discriminant equations.

Conclusions
Telecom customer churn is a central issue for telecom companies, since it decreases profits [1]. Preventing customer churn is imperative, as the global telecom industry is becoming more saturated and companies are increasingly struggling to retain customers [41]. Currently, most companies invest heavily in marketing to attract new customers. However, keeping existing customers is cheaper than acquiring new customers [2]. Thus, preventing customer churn is becoming a more critical concern for telecommunication companies. This study builds a discriminant model and a logistic regression model to predict telecom client churn using customer segmentation data from three major Chinese telecommunication companies. The results of this study will help telecom managers predict customer behavior and loss accurately and optimize their strategies to improve customer retention rates. The findings will also help companies reduce costs and optimize their budgets. Furthermore, telecom managers will be able to use the results of this paper to improve customer targeting and increase the profits of telecom companies.
There is very little knowledge about how telecom customers' opinions regarding the services provided by their telecom company impact customer churn. We aimed to cover this research gap using a Fisher discriminant analysis and a logistic regression analysis of telecom customer churn related to diverse factors. Moreover, the discriminant function and logistic regression analysis have been shown to predict telecom customer churn [42]. In this study, through a Wilks' lambda discriminant test, we concluded that the discriminant equation is valid and can explain the reasons for churn. Furthermore, through the accuracy test, the logistic regression equation was also shown to be valid and able to explain the reasons for churn. Serrano et al. [43] highlighted that previous telecom customer churn studies have mainly applied factor analysis, cluster analysis, and other methods, while telecom customer churn studies conducted using Fisher discriminant analysis and logistic regression analysis remain scarce, even in top journals. This investigation helps to address this gap.
Based on the results of this paper, we recommend that telecom companies decrease their monthly fixed costs and local costs to increase the possibility of retaining their telecom customers. Additionally, the managers of telecom companies have already realized the value and importance of improving the service quality of the Internet, fixed-line, and CRBT products, as well as the call time for long-distance calls and the numbers of SMS and MMS in the telecom package, which have previously been shown to have a positive influence on telecom customer retention.

Research Limitations and Future Directions
The dataset includes information on 4126 clients from 2007 to 2018. However, it has been nearly four years since then, and because of the COVID-19 pandemic, the telecom market and customer consumption habits may differ significantly from before. Therefore, more current data will be gathered to further improve the model's accuracy and bring the model more in line with the current market situation. Furthermore, the model can be further improved using the repeated data testing approach.
Moreover, data were collected from three operators. Data from other operators may increase the reliability of the model. Finally, additional variables could be applied to improve its predictability.

Funding: This work was supported by the Fundação para a Ciência e Tecnologia (FCT) within the following Projects: UIDB/04466/2020 and UIDP/04466/2020.

Data Availability Statement: Not applicable; the study does not report any data.