Deep Churn Prediction Method for Telecommunication Industry

Abstract: Being able to predict the churn rate is the key to success for the telecommunication industry. It is also important for the telecommunication industry to obtain a high profit. Thus, the challenge is to predict the churn percentage of customers with higher accuracy without compromising the profit. In this study, various types of learning strategies are investigated to address this challenge and build a churn prediction model. Ensemble learning techniques (Adaboost, random forest (RF), extreme randomized tree (ERT), xgboost (XGB), gradient boosting (GBM), bagging, and stacking), traditional classification techniques (logistic regression (LR), decision tree (DT), k-nearest neighbor (kNN), and artificial neural network (ANN)), and the deep learning convolutional neural network (CNN) technique have been tested to select the best model for building a customer churn prediction model. The evaluation of the proposed models was conducted using two public datasets: one from the Southeast Asian telecom industry and one from the American telecom market. On both of the datasets, CNN and ANN returned better results than the other techniques. The accuracy obtained on the first dataset using CNN was 99% and using ANN was 98%, and on the second dataset it was 98% and 99%, respectively.


Introduction
Churn rate is the number of customers leaving a company annually. This can pose a challenge for any business organization. The prediction of customers who may want to leave the company is therefore a crucial task for any business. With recent advances in data analytics, many forms of Customer Relationship Management (CRM) systems have been embedded as data analytical methods, and these have become the focus of many studies and practices. Such analytical methods favor customer-centric approaches over product-centric approaches. As a consequence, customer-company interactions have changed in such a way that numerous new advertising opportunities have emerged. The most profitable marketing tactics, those that make the most of shareholder value, include reducing churn and retaining current clients [1].
Customers remain the most valuable entity in the telecommunications industry for a company to continue operating. A reduction in the number of customers is an unanticipated event for businesses. As a result, businesses must examine consumer profiles in order to undertake business segmentation and make more informed decisions [2]. In today's highly competitive sectors, such as the telecom sector, customer churn is among the most urgent concerns [3]. Due to the high price of enlisting new consumers, the telecom industry has turned its attention to maintaining existing customers. Compared to new consumers, keeping existing customers leads to more sales and lower marketing costs [4]. Therefore, customer churn prediction has become an essential aspect of the telecommunication sector's strategic executive and planning procedure [5].
According to one report, the telecom industry's annual churn rate stretches from 20% to 40%, and the price of preserving current customers is 5-10 times cheaper than the price of acquiring new consumers [6]. Predicting customer turnover is 16 times less expensive than acquiring new customers. The profit increases from 25% to 85% when the churn rate is reduced by 5% [1]. This demonstrates the importance of predicting client attrition in the telecom industry. CRM is critical for telecom companies in retaining existing customers and reducing customer churn. Thus, the precision of the CRM analyzers' prediction systems is critical. No advertising can be run if analyzers are inaccurate in forecasting consumer attrition [7]. Data mining and machine learning technologies, which have seen recent advancements in data science, provide answers to consumer attrition.
In order for businesses to improve their customer relationships and obtain more devoted customers, personalized marketing strategies (which can be done through social media marketing) are very necessary. To accomplish this, it is necessary to give their customer support and sales people the ability to obtain information on the company's users and provide training so that they can properly interact with all of their clients. In the age of big data [4], such a task could be easily undertaken using AI techniques without putting much strain on the customer support and sales teams. It is essential to include AI in all business activities, including marketing/social marketing, CRM, and sales, among others, in order to successfully engage customers and earn their confidence.
Considering the important role that social media and other electronic marketing platforms play in the business world today, it is very vital to comprehend how to use, adapt, and implement such platforms effectively [8,9]. Deep learning and customer behaviour analysis can significantly affect social media marketing and other marketing activities of a company by allowing for more personalized and targeted marketing activities. By analyzing customer data, businesses can gain insights into what is likely to resonate with their audience. A business can use such information to create more effective social media and marketing campaigns. This will, in turn, lead to higher customer engagement and conversion rates. Additionally, deep learning algorithms can help companies to automate and optimize their advertising efforts, saving time and resources while improving overall performance.
As discussed in the next section, the existing works on churn prediction have been undertaken on different industry customers, e.g., employee churn [10]; however, the majority of them are on telecom data since this industry deals with the most severe losses from losing customers. Maximizing profit while attempting to predict the churn rate has been a point of concern for some researchers; however, for others, obtaining a better accuracy in predicting churn has been the focal point. However, the main issue that is not well-studied is that the target should be gaining higher prediction accuracy without compromising profit, which means less complicated techniques that can be used without a large amount of investment.
To address this issue, in this paper, various types of machine learning strategies are investigated in order to build a churn prediction model. Traditional classifiers, ensemble learning, and deep learning have been investigated to build the model, which was evaluated using two public datasets. The first dataset is from the Indian and Southeast Asian telecom industry [11] and the second dataset is from the American telecom market [12]. The contributions of the proposed research model are as follows:
• Recommending a prediction model for churn with a high precision and accuracy.
• The proposed model is able to overcome the complication of the lower quantity of churn customers in the datasets in comparison to the non-churn customers.
• The proposed model has been thoroughly evaluated on two public datasets using diverse performance metrics. These include accuracy, recall, precision, F1 score, and AUC-ROC (Area Under the Receiver Operating Characteristic Curve). Also, the model was compared with different related works on churn prediction, which showed that our model outperforms all of them.
• Further, the statistical analysis using ANOVA and Wilcoxon tests showed that the proposed models are statistically significant compared to other models.
The remainder of the paper is structured as follows. Section 2 discusses the related work. The tools, datasets, and approach employed in the research are briefly described in Section 3. The research findings are described in Section 4, and Section 5 concludes the paper.

Literature Survey
Consumer preferences and expectations have shifted as a result of evolving technology and the increasingly extensive accessibility of numerous services and products, resulting in a highly competitive environment across a number of customer service sectors, including the financial sector. The authors Shirazi and Mohammadi [2] discussed the effects of this situation on the Canadian banking industry. Their main goal is to combine structured archived data with unstructured factual material such as online web sites, the amount of website traffic, and phone call records to create a predicted churn model. They also studied how different customer habits affect churning decisions. Companies' success is largely determined by their ability to analyze existing data and extract useful information. A cloud-based ETL (Extract, Transform, and Load) architecture for data analysis and the combination of diverse sources was recommended by Zdravevski et al. [13]. In the churn prediction example, they showed that they could identify the specific cause of churn and detect over 98% of churners. As a result, the support and sales teams could implement focused retention initiatives. In their study, Vo et al. [14] provided a customer churn forecasting model based on unstructured data, such as verbal comments made during phone conversations. They conducted extensive testing on substantial call center data using calls from clients. Their results showed that, utilizing understandable machine learning through behavior and customer sections, their model can effectively estimate consumer churn prospects and generate useful insights.
An effective retention strategy must not only correctly identify potential leavers but also those who are most profitable to the business and, hence, are deserving of keeping. Therefore, the best churn prediction model can both accurately pinpoint churners and take into account the company's profit. This could be achieved by embedding the (EMPC) metric, which stands for "Expected Maximum Profit amount for customer Churn," into a machine learning-based churn prediction model [15]. In this model, a profit-based decision tree was suggested which was used to evaluate real-life datasets from several communications service providers. The results showed that the profit-based decision tree model allowed for a noteworthy profit boost above conventional accuracy-driven tree-based techniques. In another work on maximizing profit alongside churn detection, Stripling et al. [16] presented ProfLogit in which a machine learning classifier was integrated with a genetic algorithm to exploit the EMPC in the training stage.
The telecommunication industry experiences customer turnover at a very high rate because of the easier portability options which are available these days and the severe market competition. Hence, being able to predict the churn rate of customers has become a vital part of the whole business process. Table 1 shows several works which have studied churn prediction using telecommunication industry data.
An ensemble is an assortment of learning tools or classifiers that work together to provide outcomes that are more precise and reliable than those of any single technique. In ensemble approaches, a strong learner is one whose classification of the labeled training set is arbitrarily well correlated with the true classification [27,29]. A weak learner is only to some extent correlated with the true classification. As a result of the ensemble learning framework, approaches such as bagging, boosting, and random forests have arisen. Bagging is one of the aggregation techniques used in ensemble learning to lower the variance of prediction models. Two further techniques are boosting, which helps turn many weak learners into one aggregated strong model, and stacking, which seeks to lessen prediction bias. Several ensemble learning methods have been implemented in this study to forecast the customer turnover rate in the telecom sector, including RF, ERT, GBM, XGB, AdaBoost, bagging, and stacking. The different types of ensemble models used in this research work are discussed below.
RF is an ensemble method in which classification and regression are performed using numerous decision trees. In the RF classifier, there is some randomness in the selection of subsets and features for the nodes of each tree. One of the criteria used to partition the data in random forests is the Gini index, which is an indicator of the impurity in a dataset's values. Splitting seeks to reduce the impurity in a decision tree model from the root node to the leaf nodes. Variables that help not just the creation of an accurate model but also the prediction are vital for the random forest technique [29,30]. Figure 1 shows the basic representation of RF.
In essence, the Extreme Randomized Tree (ERT) technique involves extensively randomizing the selection of attributes and cut-points while splitting a tree node. In the extreme case, it generates entirely random trees, whose structures are independent of the output values of the learning sample. The Extra-Trees or ERT technique produces an ensemble of unpruned decision or regression trees in accordance with the conventional top-down method. It splits nodes randomly and generates trees using the complete learning sample, which are its two main distinctions from other tree-based ensemble techniques [31].
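As a concrete illustration of the Gini index mentioned for the tree-based ensembles above, the following minimal sketch computes the Gini impurity of a set of labels and the size-weighted impurity of a candidate split. The function names and the toy churn labels are hypothetical and are not taken from the study's implementation.

```python
from collections import Counter


def gini_impurity(labels):
    """Gini impurity of a list of class labels: 1 - sum of squared class shares."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def split_gini(left, right):
    """Size-weighted Gini impurity of a binary split (lower is better)."""
    n = len(left) + len(right)
    return (len(left) / n) * gini_impurity(left) + (len(right) / n) * gini_impurity(right)


# Hypothetical toy labels: 1 = churn, 0 = non-churn.
parent = [0, 0, 0, 0, 1, 1]
pure_split = split_gini([0, 0, 0, 0], [1, 1])   # separates the classes completely
mixed_split = split_gini([0, 0, 1], [0, 0, 1])  # leaves both children as mixed as the parent
```

A tree learner that uses this criterion would prefer `pure_split` over `mixed_split`, since the former removes all impurity while the latter removes none.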
GBM is frequently applied to problems which involve classification and regression. It can be applied to a variety of real-world situations with great benefit. It is a method of numerical optimization that seeks an additive model with the lowest possible loss [32,33].
XGBoost builds on the decision tree concept and supports both classification and regression. This version of the gradient booster (GBM) is widely applied in machine learning and in its applications. It is scalable and efficient. The training is based on an "additive strategy": k additive functions are employed to predict the output of the tree ensemble model when a sample i and its descriptor vector $x_i$ are given [34,35].
Adaptive boosting is abbreviated as AdaBoost. It is a kind of binary classification algorithm that develops and fuses a number of base classifiers in order to satisfy the classification requirements of datasets. AdaBoost increases the weight of the training samples that the previous base classifier incorrectly classified, while decreasing the weight of the samples that were correctly classified, before the subsequent weak learner is trained [36][37][38].
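The reweighting step described above can be sketched for one boosting round as follows. This is a minimal illustration of the standard discrete AdaBoost update with labels in {-1, +1}; the function name and the toy data are assumptions made for the example and do not reproduce the study's configuration.

```python
import math


def adaboost_reweight(weights, y_true, y_pred):
    """One round of discrete AdaBoost sample reweighting (labels in {-1, +1}).

    Returns the base classifier's vote weight alpha and the normalized sample weights.
    """
    total = sum(weights)
    # Weighted error rate of the current base classifier.
    err = sum(w for w, y, p in zip(weights, y_true, y_pred) if y != p) / total
    if err <= 0.0 or err >= 1.0:
        raise ValueError("weighted error must lie strictly between 0 and 1")
    alpha = 0.5 * math.log((1.0 - err) / err)
    # y * p is +1 for correct predictions (weight shrinks) and -1 for mistakes (weight grows).
    updated = [w * math.exp(-alpha * y * p) for w, y, p in zip(weights, y_true, y_pred)]
    norm = sum(updated)
    return alpha, [w / norm for w in updated]


# Hypothetical example: four samples, the second one is misclassified.
alpha, new_weights = adaboost_reweight(
    [0.25, 0.25, 0.25, 0.25], [1, 1, -1, -1], [1, -1, -1, -1]
)
```

After the update, the misclassified sample carries half of the total weight, so the next weak learner is pushed to classify it correctly.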
Bagging is a technique for enhancing the precision of predictions made by other learning algorithms. It is a method for merging multiple compound models and then majority voting (in classification) or averaging (in regression) their outputs to generate a more powerful prediction model [30,39]. Stacking, or stacked generalization, is an ensemble machine learning algorithm. Using a meta-learning strategy, it learns how to aggregate estimates from two or more fundamental machine learning algorithms. On either a classification or regression job, stacking has the advantage of combining the abilities of numerous high-performing models to produce predictions that are superior to any specific model within the ensemble [27].
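The aggregation step of bagging described above can be sketched as follows: each base model is trained on a bootstrap replicate, and the outputs are combined by majority vote (classification) or averaging (regression). The helper names are hypothetical, and the base-model training itself is left out of this sketch.

```python
import random
from collections import Counter


def bootstrap_sample(data, rng):
    """Draw len(data) items with replacement: one bagging replicate."""
    return [rng.choice(data) for _ in data]


def majority_vote(predictions):
    """Most common label among the base-model predictions (classification)."""
    return Counter(predictions).most_common(1)[0][0]


def average(predictions):
    """Mean of the base-model predictions (regression)."""
    return sum(predictions) / len(predictions)


# Hypothetical usage: three base models vote on one customer (1 = churn).
rng = random.Random(0)
replicate = bootstrap_sample(list(range(10)), rng)
label = majority_vote([1, 0, 1])
score = average([0.2, 0.4, 0.6])
```

Stacking differs from this scheme in that the fixed vote or average is replaced by a second-level model that learns how to combine the base predictions.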

Artificial Neural Network (ANN)
ANN is a mathematical model inspired by the human nervous system that is heavily dependent on its functional and structural elements. In order to provide the needed scalar output, the final layer of the network employs a function, which is referred to as an activation or a transfer function [40]. An artificial neuron has been mathematically depicted below:
$$O(t) = f\left(\sum_{i} v_i(t)\, I_i(t) + c\right)$$
where $O(t)$ = output at a given time, $f$ = transfer function, $c$ = bias, $I_i(t)$ = inputs, and $v_i(t)$ = weights [29,41].
A feed-forward NN consisting of one input layer, one hidden layer, and one output layer is shown in Figure 2. In its mathematical representation, $m_i$, $q_i$ = output of the preceding layer and $p_i$, $s_i$ = weight of the current layer.
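To make the neuron and the one-hidden-layer network concrete, the sketch below assumes the usual weighted-sum form, in which each neuron applies the transfer function to the sum of its weighted inputs plus a bias. The sigmoid transfer function, the function names, and the toy weights are assumptions for illustration; the study's actual layer sizes and activations are those listed in its tables.

```python
import math


def sigmoid(z):
    """Logistic transfer function (an assumed choice for this sketch)."""
    return 1.0 / (1.0 + math.exp(-z))


def neuron(inputs, weights, bias, transfer=sigmoid):
    """Single artificial neuron: transfer(sum_i weight_i * input_i + bias)."""
    return transfer(sum(v * i for v, i in zip(weights, inputs)) + bias)


def feed_forward(inputs, hidden_layer, output_neuron):
    """One hidden layer followed by one output neuron.

    hidden_layer is a list of (weights, bias) pairs; output_neuron is one such pair.
    """
    hidden = [neuron(inputs, w, b) for w, b in hidden_layer]
    out_weights, out_bias = output_neuron
    return neuron(hidden, out_weights, out_bias)


# Hypothetical toy network with two inputs and two hidden neurons.
single = neuron([1.0, 2.0], [0.5, -0.25], 0.0)
network = feed_forward(
    [0.0, 0.0],
    [([1.0, 1.0], 0.0), ([1.0, 1.0], 0.0)],
    ([1.0, -1.0], 0.0),
)
```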
Sustainability 2023, 15, x FOR PEER REVIEW

Decision Tree (DT)
The decision tree is one of the most extensively used techniques for categorizing data or identifying the hidden pattern in a batch of data. The first node in a decision tree is called the root node, while the remaining nodes are either internal nodes or leaf nodes. The leaf nodes that make up the decision tree's final layer each have a specified class target value. Separating the nodes on each level in accordance with the splitting criteria forms the basis for a decision tree. This splitting and expanding phase continues until a stopping criterion is met [29]. In the splitting and assessment criteria, $R$ = a training set; $b_i$ = a discrete attribute; $z$ = target attribute; $u_{i,j}$ = values.
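The exact splitting criteria used in the study are not recoverable from the text above. As one common choice, assumed here purely for illustration, information gain scores a discrete attribute by how much it reduces the entropy of the target attribute over the training set. The function names and the toy rows are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(rows, attribute, target):
    """Entropy of the target minus the size-weighted entropy after splitting on attribute.

    rows is a list of dicts; attribute and target are keys of those dicts.
    """
    base = entropy([r[target] for r in rows])
    groups = {}
    for r in rows:
        groups.setdefault(r[attribute], []).append(r[target])
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return base - remainder


# Hypothetical training set: "plan" separates churners perfectly, "region" not at all.
rows = [
    {"plan": "intl", "region": "north", "churn": 1},
    {"plan": "intl", "region": "south", "churn": 1},
    {"plan": "basic", "region": "north", "churn": 0},
    {"plan": "basic", "region": "south", "churn": 0},
]
gain_plan = information_gain(rows, "plan", "churn")
gain_region = information_gain(rows, "region", "churn")
```

Under this criterion the tree would split on `plan` first, because it has the larger gain.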

k Nearest Neighbor
In order to classify a new instance, the kNN classifier compares it to the instances in the training set. It then predicts the class of the new instance using the classes of its k closest neighbors. The procedure takes for granted that, in the vector space, instances from the same class are clustered together. The value of k determines how many neighbors are used to decide the class. The predicted class is chosen by a majority vote among the closest neighbors. The distance can be calculated using the Euclidean distance metric [30,42].
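The procedure above can be sketched in a few lines. The function names, the value k = 3, and the two-feature toy data are assumptions chosen for the example.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_predict(train_X, train_y, query, k=3):
    """Predict the class of query by majority vote among its k nearest training points."""
    order = sorted(range(len(train_X)), key=lambda i: euclidean(train_X[i], query))
    nearest_labels = [train_y[i] for i in order[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


# Hypothetical toy data: class 0 clusters near the origin, class 1 near (5, 5).
train_X = [(0, 0), (0, 1), (1, 0), (5, 5), (6, 5)]
train_y = [0, 0, 0, 1, 1]
near_cluster_one = knn_predict(train_X, train_y, (5, 6), k=3)
near_origin = knn_predict(train_X, train_y, (0.2, 0.1), k=3)
```

Because distances are compared across features, features on very different scales would normally be standardized before applying kNN.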

Logistic Regression (LR)
Logistic regression is a statistical, probabilistic method of classification. It can be used to forecast a categorical variable that is influenced by one or more predictive factors (such as client characteristics). This method was utilized in our case after the original dataset had undergone considerable data preparation [43].

Convolutional Neural Network (CNN)
CNN is a method built on representation learning, in which the model learns and identifies the features necessary for detection from the several layers that process the input information. It has historically been used in image processing applications; however, more recently, it has been used to forecast customer churn in the telecom industry using one-dimensional frameworks. CNN is a neural network (NN) with a multilayer architecture that consists of multiple fully connected and convolutional layers. The hierarchy of the convolution layers serves as the network's basic building block. Further, 1D CNNs are naturally suited to processing data from consumer profiles [44].
The core architecture of the CNN method has been modified in this study to support the analysis of 1D churn data. The 1D design is also quicker than the 2D structure since it is more straightforward and has fewer parameters. In the representation of a 1D convolutional method, $L$ = convolutional layer, $f_a$ = activation or transfer function, $b_i$ = bias, $k_e$ = quantity of convolutional layers, and $C$ = quantity of input channels [45,46].
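Since the formula itself is not preserved above, the following sketch shows a generic 1D convolutional layer under common assumptions: a kernel slides over each input channel, the per-channel results are summed over the C channels, a bias is added, and a transfer function is applied. The ReLU activation, the "valid" (no padding) mode, and all names are assumptions for illustration rather than the study's exact layer.

```python
def conv1d(signal, kernel):
    """Valid-mode 1D cross-correlation of one channel with one kernel."""
    k = len(kernel)
    return [
        sum(signal[i + j] * kernel[j] for j in range(k))
        for i in range(len(signal) - k + 1)
    ]


def relu(values):
    """Rectified linear transfer function applied element-wise."""
    return [max(0.0, v) for v in values]


def conv1d_layer(channels, kernels, bias=0.0):
    """One output feature map: sum per-channel convolutions, add bias, apply ReLU."""
    per_channel = [conv1d(x, k) for x, k in zip(channels, kernels)]
    summed = [sum(vals) + bias for vals in zip(*per_channel)]
    return relu(summed)


# Hypothetical input with C = 2 channels of length 4 and kernels of length 3.
single_channel = conv1d([1, 2, 3, 4], [1, 0, -1])
feature_map = conv1d_layer(
    [[1, 2, 3, 4], [4, 3, 2, 1]],
    [[1, 0, -1], [1, 0, -1]],
    bias=0.5,
)
```

Each output position depends only on a short window of the input, which is why a 1D layer needs far fewer parameters than a fully connected layer over the same features.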

Dataset Used
In this work, two publicly available churn prediction datasets of the telecommunication industry have been used for comparison of the performances of the machine learning models. In the first dataset, the initial number of instances was 99,999 and the number of features was 226. After the filtering of high value customers had been performed, the number of instances was brought down to 30,001 and, after the completion of the whole cleaning procedure, the number of features was brought down to 57. The final cleaned dataset was composed of 91.9% non-churn customers and 8.1% churn customers. For the second dataset, filtering was not required as a part of the preprocessing; hence, only a basic cleaning procedure was performed on the dataset. The number of instances and features in the final cleaned dataset were 3333 and 20, respectively. This dataset was composed of 85.5% non-churn and 14.5% churn customers. Figure 3 shows the churn distribution of both of the datasets.

Research Model
This work recommends a research design which forecasts customer churn, i.e., whether a subscriber is going to leave or not, by analyzing their behavioral pattern. In the first dataset, the churn is predicted only for the high value customers, who are determined by the amount of recharge over a certain period of time. A customer does not decide to churn instantly; it is a decision made over a certain period of time. This period is divided into three phases: the good phase, where the customer is happy and behaves in the usual way; the action phase, where the customer's experience starts to sour; and the churn phase, where the customer has churned. The dataset used in this work spans a window of four months: June, July, August, and September. Here, June and July are the good phase and, by analyzing the recharge amount spent in these two months, the high value customers are determined. A customer was considered high value when their average recharge amount was above the 70th percentile. As shown in Figure 4, after filtering for the high value customers, the churn tag was assigned to the instances of churn, and then new features were extracted. In the second dataset, these preprocessing steps were not necessary and it was instead cleaned by removing the unnecessary and redundant data.
Several machine learning approaches were applied to the dataset after cleaning, with the purpose of making predictions. One of the classification strategies for the predictive system was a stacking ensemble model. The stacking model had two levels, as shown in Figure 5. DT, RF, LR, kNN, and SVM made up level 0, whereas LR made up level 1. The final output was predicted using a soft voting technique. RF, ERT, AdaBoost, GBM, XGB, and bagging are the other ensemble models which were used.
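The high-value filter described for the first dataset can be sketched as follows. The customer identifiers, the recharge amounts, and the use of the inclusive percentile method are assumptions for illustration; the source only states that customers above the 70th percentile of the average recharge amount in the good phase (June and July) were kept.

```python
import statistics


def high_value_customers(recharges, percentile=70):
    """Select customers whose average good-phase recharge exceeds the given percentile.

    recharges maps a customer id to a (june_amount, july_amount) pair.
    Returns the set of selected ids and the cutoff value.
    """
    averages = {cid: (june + july) / 2 for cid, (june, july) in recharges.items()}
    # With n=100, quantiles() returns 99 cut points; index p - 1 is the p-th percentile.
    cut_points = statistics.quantiles(averages.values(), n=100, method="inclusive")
    cutoff = cut_points[percentile - 1]
    selected = {cid for cid, avg in averages.items() if avg > cutoff}
    return selected, cutoff


# Hypothetical recharge amounts: customer c0 spends 0, c1 spends 10, ..., c10 spends 100.
recharges = {f"c{i}": (10 * i, 10 * i) for i in range(11)}
selected, cutoff = high_value_customers(recharges)
```

In this toy example the cutoff is 70, so only the three customers with the largest average recharge remain for churn labeling.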
Other than ensemble learning, the conventional classifiers utilized here are DT, kNN, LR, and ANN, and the deep learning technique used is CNN. The split for training and testing was 80% and 20%, respectively, and 10-fold cross validation was used to validate the models. The simulations were performed on a system with an 11th Gen Intel(R) Core(TM) i7-1195G7 processor @ 2.90 GHz, 16.0 GB of installed RAM (15.8 GB usable), and a 64-bit operating system (x64-based processor) running Windows 11 Home Single Language, Version 22H2. The complete specifications of the various classifiers utilized in the study, which were determined through trial and error, are presented in Tables 2 and 3.
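The 80/20 split and the 10-fold cross validation can be sketched with index lists as shown below. Whether the folds were drawn from the training portion or from the whole dataset is not stated in the text; this sketch assumes the training portion, and the function names and the fixed seed are hypothetical.

```python
import random


def train_test_split_indices(n, test_fraction=0.2, seed=0):
    """Shuffle the indices 0..n-1 and split them into train and test lists."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    n_test = round(n * test_fraction)
    return indices[n_test:], indices[:n_test]


def k_fold_indices(indices, k=10):
    """Yield (train, validation) index lists for k roughly equal folds."""
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        validation = folds[i]
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, validation


# Hypothetical usage with the size of the second dataset (3333 instances).
train_idx, test_idx = train_test_split_indices(3333)
folds = list(k_fold_indices(train_idx, k=10))
```

Each instance of the training portion appears in exactly one validation fold, so every model is scored on data it did not see during that fold's training.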

Performance Measures
Different performance measures such as accuracy, sensitivity, precision, F1 score, specificity, and AUC-ROC value of the suggested predictive system for consumer behavior were assessed in this study.
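For reference, the threshold-based measures listed above follow directly from the confusion-matrix counts, as in the minimal sketch below. The function name and the example counts are hypothetical and are not results from the study.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute standard measures from confusion-matrix counts.

    tp, fp, tn, fn: true positives, false positives, true negatives, false negatives,
    with churn treated as the positive class.
    """
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0   # recall, true positive rate
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0   # true negative rate
    denom = precision + sensitivity
    f1 = 2 * precision * sensitivity / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "precision": precision,
        "specificity": specificity,
        "f1": f1,
    }


# Hypothetical counts for 100 customers, 10 of whom churned.
metrics = classification_metrics(tp=8, fp=2, tn=88, fn=2)
```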

Results and Discussion
In order to predict customer churn, various supervised learning approaches have been used in this study on behavioral data from consumers. The data were preprocessed to filter for the high value customers, as was covered in the previous section. The churn was labeled for classification after filtering. Here, CNN as a deep learning technique, several ensemble learning methods, and a few classical classifiers have all been used. The research model was verified using k-fold cross validation, where k = 10 for both of the models. Tables 4 and 5 show that CNN and ANN returned the best accuracy among all of the techniques which were applied, for both of the datasets: 99% and 98% on the first dataset, and 98% and 99% on the second dataset, respectively. The AUC score and AUC-ROC curve were used to visualize the performance of the classifiers. CNN and ANN achieved the best results among all of the machine learning techniques which were applied, with an AUC score of 0.99 using both techniques on the first dataset and 0.99 and 0.96 on the second. On the first dataset, all of the models exhibited an accuracy over 90% and the AUC scores for all of the ensemble models ranged between 0.70 and 0.76. On the second dataset, all of the ensemble models except for AdaBoost exhibited an accuracy over 90%, and the AUC values for all of the ensemble models except for AdaBoost were between 0.80 and 0.89. The traditional classifiers, except for ANN, exhibited an accuracy over 90% on the first dataset; however, when applied to the second dataset, the accuracy was below 90%. kNN and LR exhibited the lowest AUC scores on both of the datasets. Figures 6 and 7 show the accuracy, as a bar graph, and the AUC score, as a line graph, for all of the classifiers on both of the datasets used in this work. In both of the graphs, the distinct difference in the values for the two datasets can be seen.
The accuracy plot shows that AdaBoost, DT, kNN, and LR performed better on the first dataset than on the second, whereas bagging and stacking performed the same on both datasets. The AUC graph shows that all of the ensemble models, except AdaBoost, exhibited a better score on the second dataset, and all of the other models exhibited similar scores on both datasets.
The line graphs of the sensitivity and F1 scores for all of the classifiers on both datasets are shown in Figure 8. The F1 score is biased toward the true positive rate because it depends on the precision and sensitivity values. Therefore, it may be deduced from the F1 score and sensitivity that there were few false positive samples.
The accuracy of each classifier is displayed as a bar plot in Figure 9, along with the corresponding AUC scores. This graph shows that, despite having a high accuracy value, some models showed quite a low AUC score. Figures 10 and 11 show bar plots of the precision and specificity for both of the datasets. While the precision plot indicates a high proportion of true positives, the specificity graph shows a comparatively lower true negative rate, which is most evident for the ensemble models on the first dataset and for some of the traditional classifiers on the second dataset.
Figure 12 shows the AUC-ROC curves for XGB, the stacking ensemble model, ANN, and CNN on both datasets. These four models exhibited the best AUC scores among all of the techniques applied to both datasets. The curves for both ANN and CNN are almost perfect because of the high AUC values obtained. It can also be seen that XGB and stacking performed better on the second dataset than on the first.
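To illustrate why a model can show a high accuracy and a low AUC at the same time, the sketch below computes the AUC as the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one (ties count as one half). The labels and scores are made-up values on an imbalanced toy set, not results from the study, and the function names are hypothetical.

```python
def roc_auc(y_true, scores, positive=1):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for t, s in zip(y_true, scores) if t == positive]
    neg = [s for t, s in zip(y_true, scores) if t != positive]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5   # a tie counts as half a correctly ranked pair
    return wins / (len(pos) * len(neg))


def accuracy_at_threshold(y_true, scores, threshold=0.5, positive=1, negative=0):
    """Accuracy after turning scores into labels with a fixed threshold."""
    predictions = [positive if s >= threshold else negative for s in scores]
    return sum(p == t for p, t in zip(predictions, y_true)) / len(y_true)


# Made-up imbalanced labels: 95 majority-class (0) and 5 minority-class (1) samples.
labels = [0] * 95 + [1] * 5
constant_scores = [0.1] * 100            # uninformative: same score for everyone
ranked_scores = [0.2] * 95 + [0.4] * 5   # ranks every minority sample above the rest

constant_accuracy = accuracy_at_threshold(labels, constant_scores)
constant_auc = roc_auc(labels, constant_scores)
ranked_accuracy = accuracy_at_threshold(labels, ranked_scores)
ranked_auc = roc_auc(labels, ranked_scores)
```

Both score sets predict the majority class for every sample at the 0.5 threshold and therefore reach the same accuracy, yet only the second one ranks the minority class correctly. Accuracy alone cannot separate the two on imbalanced data, whereas the AUC can.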
Figure 13 shows a bar graph comparing existing studies on churn prediction with the proposed work. The graph shows that this work achieved better results than most of the existing works, predicting churning customers with much higher accuracy.
From the results above, it could be argued that AI-driven customer churn prediction, together with machine learning and smart data, could help businesses to know their customers and understand their demands. Such an understanding can help to retain existing customers, which is cost-effective, as retention is 5-10 times cheaper than acquiring new consumers [6]. Machine learning models can also significantly influence social media marketing in several ways, among which are:
- Targeted Advertising: machine learning models can analyze customer data such as demographics, interests, and online behavior to identify potential target audiences for specific products and services. This allows for more effective and efficient targeting, leading to higher conversion rates.
- Content Optimization: machine learning algorithms can help businesses determine the best times to post content and the types of content that perform best, leading to increased customer engagement and reach.
- Advertising Optimization: machine learning models can be used to automate and optimize advert placement and bid prices, saving time and resources while improving overall campaign performance.
- Sentiment Analysis: machine learning models can be used to analyze customer sentiment and feedback, providing valuable insights into customer opinions and preferences. Such information can then be used to improve the company's marketing strategies and campaigns.
As such, machine learning models can significantly improve the effectiveness and efficiency of social media marketing efforts.
Further, such machine learning-based tools are not only easy to use and quick to operate but are also capable of learning from their own data, which contributes to a positive experience for clients and encourages repeat business. Machine-learning-based churn prediction systems can learn from their previous errors and improve their results as historical data are collected [47]. Because of this, it is possible to determine the value of each individual consumer, forecast future expenses and income, and identify the areas in which the majority of marketing efforts on social media or other platforms should be focused.

Statistical Analysis
ANOVA and Wilcoxon rank-sum tests were applied to statistically assess the quality of the algorithms on the two public datasets (Southeast Asian telecom industry and American telecom market), as shown in Tables 6-9. In Tables 6 and 7, the p-values are less than 0.05, which indicates a statistically significant difference between the groups. In Tables 8 and 9, the p-values are greater than 0.05, which indicates no statistically significant difference between the results.
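The two tests can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions: the section does not say which samples were compared, so the groups here stand for any sets of scores (for example, per-fold accuracies), and all function names are hypothetical. The rank-sum test uses the large-sample normal approximation without tie or continuity correction. The ANOVA function returns only the F statistic and its degrees of freedom, because converting F to a p-value requires the F distribution, which the standard library does not provide.

```python
import math


def one_way_anova_f(groups):
    """Return (F statistic, df_between, df_within) for a one-way ANOVA."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    means = [sum(g) / len(g) for g in groups]
    grand_mean = sum(sum(g) for g in groups) / n_total
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Variation of the observations around their own group mean.
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    df_between, df_within = k - 1, n_total - k
    if ss_within == 0:
        raise ValueError("no within-group variation; F is undefined")
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1   # average of positions i..j, made 1-based
        for m in range(i, j + 1):
            ranks[order[m]] = shared
        i = j + 1
    return ranks


def wilcoxon_rank_sum(x, y):
    """Return (z, two-sided p) using the normal approximation."""
    n1, n2 = len(x), len(y)
    ranks = average_ranks(list(x) + list(y))
    w = sum(ranks[:n1])                        # rank sum of the first sample
    mean_w = n1 * (n1 + n2 + 1) / 2
    sd_w = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (w - mean_w) / sd_w
    p = math.erfc(abs(z) / math.sqrt(2))       # two-sided normal tail probability
    return z, p
```

A p-value below 0.05 from either test would be read, as in Tables 6 and 7, as evidence of a significant difference between the compared groups; a value above 0.05 would be read as in Tables 8 and 9.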

Conclusions
The telecom industry has become highly competitive over the last two decades, so a highly efficient churn prediction system is very important. In this work, several supervised machine learning techniques were tested on two churn datasets from the telecom markets of two different countries, with different features. In both scenarios, ANN and CNN returned the best results. The other techniques used were RF, ERT, AdaBoost, XGB, GBM, bagging, DT, kNN, and LR. Most of these techniques performed well on both datasets, with accuracies over 90%; the exceptions were AdaBoost, DT, kNN, and LR, which did not perform well on the second dataset even though they returned a satisfactory result on the first.
As shown in Section 3.2, the churn distribution is drastically imbalanced, which is expected given that churning customers are normally far fewer than non-churning customers. However, this imbalance in the data strongly disturbs the performance of the applied models. As inferred from the specificity values, also known as the true negative rate, these models detect non-churn customers at a higher rate than churners [48]. CNN overcame this problem, exhibiting a very high true negative rate, where the negative class corresponds to the churn customers in this scenario. The efficiency of the other models can be improved by data balancing; however, when the percentage of churn customers is too low, even data balancing may not improve the performance. Hence, training the models on datasets that contain a higher number of churn customers is the ideal scenario.
In the same vein, deep churn prediction is a very useful tool for the social media marketing activities of any company, as it helps to identify those customers who are at risk of leaving a business, or "churning."
By predicting churn, a business can take proactive steps and carry out marketing activities to retain its customers, which leads to increased customer loyalty and revenue and prevents the loss of valuable customers.
Thus, in the context of social media marketing, deep churn prediction models can analyze customer behavior, engagement, and other factors to pinpoint customers who are at risk of leaving. Such information can then be used to target these customers with personalized marketing campaigns and incentives to retain them. For example, a deep churn prediction model may identify that a customer who has stopped engaging with a company's social media accounts is at high risk of churning. The company can then target this customer with a personalized communication, offer, or special promotion to encourage them to remain a customer and reengage with the company. In conclusion, deep churn prediction is a valuable tool for social media marketers.