An Intelligent Hybrid Scheme for Customer Churn Prediction Integrating Clustering and Classiﬁcation Algorithms

Abstract: Nowadays, customer churn is regarded as one of the main concerns in the telecom sector, as it directly affects revenue. Telecom companies are looking to design novel methods to identify customers who are likely to churn, and hence require suitable systems to overcome the growing churn challenge. Recently, integrating different clustering and classification models to develop hybrid learners (ensembles) has gained wide acceptance. Ensembles are gaining approval in the domain of big data since they achieve excellent predictions compared to single classifiers. Therefore, in this study, we propose a customer churn prediction (CCP) model based on an ensemble system that fully incorporates clustering and classification learning techniques. The proposed churn prediction model uses an ensemble of clustering and classification algorithms to improve CCP performance. Initially, a few clustering algorithms, such as k-means, k-medoids, and Random, are employed on the churn prediction datasets. Next, to enhance the results, a hybridization technique is applied using different ensemble algorithms to evaluate the performance of the proposed system. The above-mentioned clustering algorithms, integrated with different classifiers including Gradient Boosted Tree (GBT), Decision Tree (DT), Random Forest (RF), Deep Learning (DL), and Naive Bayes (NB), are evaluated on two standard telecom datasets acquired from Orange and Cell2Cell. The experimental results reveal that, compared to the bagging ensemble technique, the stacking-based hybrid model (k-medoids-GBT-DT-DL) achieves top accuracies of 96% and 93.6% on the Orange and Cell2Cell datasets, respectively. The proposed method outperforms conventional state-of-the-art churn prediction algorithms.


Introduction
The size of the information and communication technology (ICT) market is growing as more communication companies enter the market as a result of globalization and liberalization. Customers are encouraged to switch from one service to another as a result of the rise of efficient and modern services [1]. Customer happiness and improvement are essential for customer retention; customer relationship management (CRM) relies on this [2,3]. User retention can only be achieved by measuring the numerous characteristics of a customer who is about to churn, but this computation is challenging due to manual data collection, which is an ineffective and laborious task [4]. Continuous customer turnover is extremely damaging to a company's business. If a mobile telecommunications business could predict customer defection, it would do well to take measures to retain customers who are about to switch, as is done in many sectors, for example e-commerce [5,6], finance [7], banking [8], and the telecommunications industry [9]. The industry strives to make minor adjustments in order to keep clients and revenue [10].
Losing a present customer is tough in any business situation. Customer churn occurs in three modes. First, some individuals may leave due to service provider issues such as network outages, costly rates, billing hassles, and so on. Second, certain clients have a habit of jumping from one company to another. Finally, some clients tend to switch service providers for reasons that the communication sector does not understand. This type of user, as well as the factors behind their shifting, should be determined in advance [11]. Registering a fresh client is a long and costly process. Therefore, churn prediction and anticipation are the only ways for a telecommunications firm to survive [12]. Once a potential churn customer has been detected, the provider might give incentives and presents to retain those clients. Leading telecommunications firms use the strategy of attracting clients with promotions and e-mail marketing. The churn client patterns are examined, and appropriate deals are offered to various groups based on their categorization. Some may receive call discounts, while others may receive data discounts. Customers who are important to the firm are recognized during this classification, and appropriate high-value incentives are offered to them [11].
Many machine learning-based classification algorithms, including decision trees (DT), k-nearest neighbor (KNN), naive Bayes (NB), neural networks (NN), and support vector machines (SVM), are used to identify customer churn [13][14][15][16]. Previously, researchers employed single classification methods for CCP. Ensemble-based classification algorithms have been the subject of recent development. With the use of a fusion method, these new strategies employ hybrid techniques that combine numerous single classifiers, and their predictions are integrated into a single aggregated result [17].
Therefore, this study proposes an ensemble method that exploits the advantages of clustering and classification learning algorithms for telecom churn prediction. We propose the integration of supervised and unsupervised learning approaches for CCP. In this research, a hybrid CCP approach entirely incorporating clustering (k-means, k-medoids, and Random) and classification (GBT, DT, RF, DL, and NB) models is suggested. We compare different ensemble models to identify the top-performing model.

• Clustering: the proposed model uses k-means, k-medoids, and Random in the clustering stage. This stage gives us the top clustering technique. The k-medoids technique performs better and overcomes several hitches of the k-means and Random algorithms.

The rest of our paper is organized as follows. Section 2 explains related work on customer churn prediction in the telecom industry. The materials and methods used in this research are elaborated in Section 3. In Section 4, the experimental results are analyzed. A performance comparison with other state-of-the-art techniques is discussed in Section 5. A discussion based on the experimental results is given in Section 6, followed by the conclusion and a few directions for future work in Section 7.

Related Work
Enterprise Architecture (EA) provides a whole vision, using sets of models or blueprints, of an organization along with its information technologies, business processes, and strategies [18][19][20][21]. Typically, classification methods anticipate future consumer behavior based on customer attributes indicated in personal demographics, account and billing information, and call details. Conventionally, data mining techniques for churn prediction were employed largely to effectively determine telecom churners.
For instance, robust churn prediction tools [22] have been developed using decision trees and neural networks. In [23], researchers designed a hybrid scheme that merges k-means and a rule-based method in an effort to improve prediction accuracy, reaching 89.70%. Similarly, ref. [24] presents a genetic algorithm-based neural network strategy to maximize the accuracy of telecom churner prediction; the authors compare their results with the z-score classification model. Before using a classification system, most churn prediction algorithms use feature extraction, sampling, or both. The authors of [25] published detailed research examining the effect of sampling approaches on CCP performance. In their research, they combined gradient boosting and weighted random forests with random undersampling and sophisticated undersampling algorithms to increase prediction performance. Furthermore, applying an advanced survey strategy does not result in substantial improvements in prediction accuracy, as shown in the same study employing the CUBE sampling methodology, corroborating the point stated in [26]. Likewise, Verbeke W. et al. [27] revealed that simply replicating instances using oversampling does not significantly improve classification results. This backs up the idea that simply replicating minority classes via random oversampling or eliminating majority classes via random undersampling may not increase CCP accuracy [25]. The problem with the above studies is that they cannot produce classification results that are acceptable for the telecom sector.
Additionally, numerous researchers have concentrated on developing classification techniques for telecom churn prediction employing only relevant information, rather than a broader set of features. A multi-objective feature extraction technique based on NSGA-II was presented by Huang B. et al. [28]. Another study [29] proposed a Bayesian Belief Network model to select and extract the most relevant characteristics that can be used to predict customer churn. A hybrid two-phase mechanism using feature extraction has been presented in [30]. An ensemble of classification techniques is proposed for CCP using a rotation-based classification network [31]. The authors applied two classifiers, Rotation Forest and RotBoost with AdaBoost, and compared these algorithms with the Bagging, Random Forest, RSM, CART, and C4.5 algorithms. They also explored various feature extraction approaches, and the experimental results reveal that ICA-based RF is the best of all the algorithms. Anouar Dalli [32] suggested a deep learning-based model for CCP in the telecommunication industry by changing the hyperparameters of a NN; the RMSProp optimizer outperformed other stochastic gradient descent (SGD)-based algorithms in terms of accuracy. Praveen Lalwani et al. [33] proposed a machine learning-based approach using logistic regression, NB, support vector machines, RF, DT, etc. to train the model. They also applied boosting and ensemble methods to check the performance of the proposed methods; the top AUC score of 84% was obtained by both the AdaBoost and XGBoost classifiers. Xin Hu et al. [34] designed a CCP model based on DT and neural networks (NN). The prediction results reveal that, compared with a single CCP model, the combined prediction approach produces greater accuracy and a better prediction effect. Hemlata Jain et al. [35] proposed a telecom churn prediction model using seven different machine learning-based experiments incorporating feature engineering and normalization techniques. This study proved to be more effective than previous models; RF surpassed the other models, attaining 95% accuracy.
A review of the literature reveals a few challenges which must be addressed to improve the performance of CCP. Researchers are still struggling to achieve high classification accuracy for CCP in the telecom industry. It is observed from previous work that ensemble-based classification models have not been explored much yet. This strategy substitutes single-algorithm classification, increases ensemble-based system prediction accuracy, and is applicable to all approaches. Therefore, in this study, we propose an ensemble-based hybrid system that uses the benefits of supervised and unsupervised learning algorithms to increase the performance of CCP.

Materials and Methods
The suggested approach improves the effectiveness of churn prediction by employing a hybrid technique that combines several clustering, classification, and ensemble methods, including bagging and stacking. Figure 1 illustrates the complete architecture of the proposed system for CCP.

Datasets Collection
The datasets included in this study are easily accessible over the internet and are widely employed in current research on telecom churn prediction. Orange telecom offers information on its web-based site [36]. The additional dataset, provided by Cell2Cell, is available on the website of Duke University's Center for Customer Relationship Management [27]. The Orange dataset is made available online with its original imbalanced class distribution, but the Cell2Cell dataset has been preprocessed and a balanced version is supplied for research purposes. The feature names in the Orange dataset are concealed to protect the confidentiality of client information. Some features in the Orange dataset have either no value or a single value; due to the lack of usefulness of these characteristics, they are eliminated from the dataset. Finally, both datasets contain a few nominal features that are translated into their corresponding numerical representation to maintain numerical format uniformity across the whole dataset. The properties of both datasets are displayed in Table 1.

Pre-Processing of the Dataset
The telecom dataset contains irregularities, such as missing values, duplicates, noise, and empty features, which are handled through the WEKA tool. In addition, the nominal characteristics of the sample are changed to a numerical representation by classifying the examples into small, medium, and large classes according to the number of observations in each class [37]. The dataset is preprocessed to provide a consistent numerical structure for the training phase.

Clustering Algorithms
Clustering is an unsupervised machine learning task. Because of its usefulness, this technique is often referred to by another name: cluster analysis. In order to use a clustering algorithm, you must first provide the algorithm with a large amount of unlabeled input data and then let the program discover whatever patterns it can in the data. These collections are called clusters. Data points that share similarities, as defined by their connections to neighboring data points, form a cluster. Cluster analysis has several applications, including feature engineering and pattern discovery. Clustering can help you gain understanding from data you know nothing about.
Clustering is performed after the data has been cleaned and polished through preprocessing. The suggested approach uses clustering to boost prediction accuracy. The suggested model employs the following clustering techniques.

K-Means Clustering
By definition, the k-means clustering algorithm separates an input data set of N rows into k subsets, where k is always smaller than N. The k cluster centres (means) are chosen randomly; the algorithm then assigns each point to its nearest centre, recalculates the average of each cluster, and repeats this procedure until the requisite clusters have been achieved. The quantity minimized is the within-cluster sum of squared distances [38]:

J = Σ_{i=1}^{N} Σ_{k=1}^{K} w_ik ||x_i − µ_k||²

where w_ik = 1 for data point x_i if it belongs to cluster k, else w_ik = 0, and µ_k is the centroid of x_i's cluster.
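To make the assignment/update loop concrete, here is a minimal stdlib-only sketch of k-means on toy 2-D points (illustrative only; the study itself uses standard tooling, and all data here is invented):

```python
import math
import random

def kmeans(points, k, iters=20, seed=0):
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # k randomly chosen initial centres
    for _ in range(iters):
        # assignment step: w_ik = 1 for the nearest centroid, 0 otherwise
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            clusters[nearest].append(p)
        # update step: each centroid mu_k becomes the mean of its cluster
        for c, members in enumerate(clusters):
            if members:
                centroids[c] = tuple(sum(v) / len(members) for v in zip(*members))
    return centroids, clusters

pts = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.0), (5.2, 4.9)]  # two obvious groups
cents, cls = kmeans(pts, 2)
```

On this well-separated data the loop converges to the two natural groups regardless of which initial centres are sampled.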

K-Medoids Clustering
Kaufman and Rousseeuw presented the k-medoids algorithm in 1987; it is a partition-based clustering method similar to k-means, except that each cluster centre is an actual data point (a medoid). Compared to k-means, k-medoids is less susceptible to being thrown off by noise and outliers [38]. The cost of each cluster can be determined by summing the dissimilarities |N_i − M_i| between each entity N_i and its cluster medoid M_i, where N_i and M_i are the entities for which the difference is computed.
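The robustness claim is easy to demonstrate in a tiny sketch (illustrative data, not from the paper): because a medoid must be an actual observation minimizing total absolute dissimilarity, an outlier drags the mean but not the medoid.

```python
def medoid(values):
    # the medoid is the actual data point minimising total absolute dissimilarity
    return min(values, key=lambda m: sum(abs(v - m) for v in values))

data = [1, 2, 3, 100]            # 100 is an outlier
m = medoid(data)                 # a real observation from the small group
mean = sum(data) / len(data)     # dragged far toward the outlier
```

Here the medoid stays inside the dense group (value 2), while the mean is 26.5. Full k-medoids (e.g. PAM) repeats this idea per cluster, swapping medoids to lower the total cost.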

Random Clustering
Random clustering performs a random flat clustering of any dataset: samples are placed in clusters in no particular order, and some clusters may be empty.
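A random flat clustering is trivial to express; this sketch (hypothetical parameters) simply draws a uniform cluster id per sample, so it serves as a baseline rather than a pattern finder:

```python
import random

def random_clustering(n_samples, k, seed=42):
    # assign every sample to a uniformly random cluster; clusters may end up empty
    rng = random.Random(seed)
    return [rng.randrange(k) for _ in range(n_samples)]

labels = random_clustering(10, 3)
```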

Classification Algorithms
Following the clustering algorithms, the suggested model performs a classification task on both datasets. In this study, we compare and contrast several clustering techniques and then combine the one that performs best with various classification algorithms. The suggested model begins by evaluating the effectiveness of individual classifiers. Next, we apply clustering and ensemble classifiers to ensure the maximum possible accuracy of the proposed model. The most effective method for predicting customer turnover is an amalgamation of clustering and ensemble classifiers, so this is the approach that is taken. The developed churn prediction algorithm makes use of the following classifiers.

K-Nearest Neighbor
For its simplicity, the KNN algorithm is widely used in the field of data mining, and it is utilized in the construction of the churn prediction model. This algorithm can perform classification tasks without requiring any prior information about the distribution of the data. The KNN approach is useful for making predictions about the characteristics of a substance by comparing those characteristics to the experimental results of the compounds most similar to it [39]. Mathematically, the distance between two instances n and m can be calculated as

d(n, m) = sqrt( Σ_{i=1}^{j} (n_i − m_i)² )

where j is the total number of attributes and n and m represent the instances for which the distance is being measured.
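A minimal stdlib-only KNN classifier along these lines (toy feature vectors and labels are invented for illustration):

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    # train: list of (features, label); classify by majority label of the k nearest points
    neighbours = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    return Counter(label for _, label in neighbours).most_common(1)[0][0]

train = [((1.0, 1.0), "churn"), ((1.2, 0.9), "churn"),
         ((8.0, 8.0), "stay"), ((8.1, 7.9), "stay"), ((7.9, 8.2), "stay")]
print(knn_predict(train, (1.1, 1.0)))  # "churn"
```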

Decision Tree
In 1993, Quinlan used the name "decision tree" to describe a "divide and conquer" technique [40]. The entropy of a dataset and the information gain of each feature are determined as follows:

Entropy(D) = − Σ_i p(i) log2 p(i)

Gain(D, A) = Entropy(D) − Σ_v (|D_v| / |D|) Entropy(D_v)

where D is the dataset, i ranges over the set of classes in D, p(i) is the probability of each class, and D_v is the subset of D for which feature A takes value v.
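These two quantities can be computed directly; a short sketch on invented labels (a perfect split yields a gain equal to the parent entropy):

```python
import math
from collections import Counter

def entropy(labels):
    # Entropy(D) = -sum_i p(i) * log2 p(i)
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(labels, split_groups):
    # gain = Entropy(D) minus the size-weighted entropy of the split's subsets
    n = len(labels)
    return entropy(labels) - sum(len(g) / n * entropy(g) for g in split_groups)

D = ["churn"] * 5 + ["stay"] * 5
gain = info_gain(D, [["churn"] * 5, ["stay"] * 5])  # a perfect split: gain = 1.0
```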

Gradient Boosted Tree
Gradient boosting is a machine learning technique that can be used for both regression and classification. The approach combines a large number of weak decision trees into a single conclusive prediction model [41]. The goal of generating an ensemble of decision trees, as GBT does, is to boost prediction performance: because GBT builds its ensemble of weak estimators sequentially, with each new tree correcting the errors of the previous ones, it can work better than a random forest. The prediction can be written as

p = f(x; θ)

where p is the prediction generated from input x and θ is the parameter set that best fits the model.
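The sequential error-correcting idea can be sketched with depth-1 stumps fit to residuals (a toy 1-D regression formulation of boosting, not the paper's GBT implementation; data and hyperparameters are invented):

```python
def fit_stump(x, r):
    # pick the threshold split of a 1-D feature minimising squared error on residuals r
    best = None
    for t in sorted(set(x)):
        left = [ri for xi, ri in zip(x, r) if xi <= t]
        right = [ri for xi, ri in zip(x, r) if xi > t]
        if not left or not right:
            continue
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        err = sum((ri - (lm if xi <= t else rm)) ** 2 for xi, ri in zip(x, r))
        if best is None or err < best[0]:
            best = (err, t, lm, rm)
    _, t, lm, rm = best
    return lambda xi: lm if xi <= t else rm

def gbt_fit(x, y, n_trees=20, lr=0.5):
    # start from the mean, then repeatedly fit weak stumps to the current residuals
    base = sum(y) / len(y)
    pred = [base] * len(y)
    trees = []
    for _ in range(n_trees):
        residuals = [yi - pi for yi, pi in zip(y, pred)]
        stump = fit_stump(x, residuals)
        trees.append(stump)
        pred = [pi + lr * stump(xi) for pi, xi in zip(pred, x)]
    return lambda xi: base + lr * sum(t(xi) for t in trees)

x = [1, 2, 3, 10, 11, 12]
y = [0, 0, 0, 1, 1, 1]
model = gbt_fit(x, y)  # predictions approach 0 for small x and 1 for large x
```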

Random Forest
Random forests [42], also known as random decision forests, are an ensemble learning technique for classification, regression, and other tasks. They build a large number of decision trees during training and then output the class that is the mode of the individual trees' classes (classification) or their average prediction (regression).
To calculate the Gini Index, start with one and deduct the sum of squared probabilities for each class:

Gini = 1 − Σ_x p_x²

where p_x represents the probability of an item being assigned to category x.

Information gain is based on entropy, computed as the negative sum, over the classes, of each class probability multiplied by the base-2 logarithm of that probability:

E = − Σ p log2(p)

where p denotes the class probability over which the entropy is computed.
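The Gini Index is a one-liner over the class counts; a quick check on invented labels (a 1:3 class split gives 1 − (0.25² + 0.75²) = 0.375, and a pure node gives 0):

```python
from collections import Counter

def gini(labels):
    # Gini = 1 - sum of squared class probabilities
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

print(gini(["churn", "stay", "stay", "stay"]))  # 0.375
print(gini(["stay", "stay"]))                   # 0.0 (pure node)
```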

Deep Learning
Deep learning is a "multi-layer" methodology comprising a large number of neuron layers. It is an artificial system employed to address the most challenging data mining issues. Using back-propagation and stochastic gradient descent, deep learning trains artificial neural networks (ANN) with many interconnected layers. Neurons with tanh, rectifier, and maxout activation functions can be found in the network's many hidden layers. High predictive accuracy is made possible by state-of-the-art features like adaptive learning rates, rate annealing, momentum training, dropout, and L1 or L2 regularization. Using multi-threaded (asynchronous) training, each compute node learns a subset of the global model's parameters on its own data, and then periodically contributes to the global model by way of model averaging across the network.
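The layered computation itself is simple to sketch; below is the forward pass of a toy 2-4-1 network with tanh hidden units and a sigmoid output (all weights are arbitrary illustrative values, not trained parameters from the paper):

```python
import math

def dense(x, weights, biases, act):
    # one fully connected layer: act(W·x + b), row by row
    return [act(sum(w * xi for w, xi in zip(row, x)) + b)
            for row, b in zip(weights, biases)]

def forward(x):
    h = dense(x, [[0.5, -0.3], [0.8, 0.1], [-0.6, 0.9], [0.2, 0.4]],
              [0.0] * 4, math.tanh)                       # hidden layer, tanh units
    out = dense(h, [[0.7, -0.5, 0.3, 0.9]], [0.1],
                lambda z: 1 / (1 + math.exp(-z)))          # sigmoid output
    return out[0]

p = forward([1.0, 2.0])  # a churn probability strictly between 0 and 1
```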

Naive Bayes
Abstractly, Naive Bayes is a conditional probability model: given a problem instance to be classified, represented by a vector x = (x_1, . . ., x_n) of n features (independent variables), it assigns to this instance probabilities p(C_k | x_1, . . ., x_n) for each of K possible outcomes or classes C_k. The issue with this formulation is that it becomes impossible to build a model using probability tables if the number of features n is large or if a feature can take on an enormous number of values, so the model must be restructured to make it more manageable. Bayes' theorem allows us to break the conditional probability down into its constituent parts, which we can then use to make predictions [43]:

p(C_k | x) = p(C_k) p(x | C_k) / p(x)

where C_k is the class whose probability is computed given the feature values x. Simply, in terms of Bayesian probability, the above equation can be written as

posterior = (prior × likelihood) / evidence. (9)
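Equation (9) in numbers, using made-up priors and likelihoods for a single observed feature vector x:

```python
# posterior = (prior * likelihood) / evidence, over classes "churn" and "stay"
priors = {"churn": 0.3, "stay": 0.7}
likelihoods = {"churn": 0.8, "stay": 0.1}   # p(x | C_k) for the observed x

unnormalised = {c: priors[c] * likelihoods[c] for c in priors}
evidence = sum(unnormalised.values())        # p(x) = 0.24 + 0.07 = 0.31
posterior = {c: v / evidence for c, v in unnormalised.items()}
# posterior["churn"] = 0.24 / 0.31 ≈ 0.774 — churn is the predicted class
```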

Ensemble Classifiers
The following ensemble techniques are employed in this study to increase the proposed system's performance.

Voting
A voting technique combines the outcomes of separate classification algorithms via majority vote. Each classifier individually assigns class labels to the test data, the results are aggregated via voting, and a final class prediction is made by taking the class with the highest number of votes [44]. Majority voting over a dataset selects the class i that receives the most decisions D(c, i) across the classifiers c = 1, . . ., C, where C represents the number of classifiers, D(c, i) is the decision of classifier c, and i ranges over the classes.
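The aggregation step is a one-line counting operation; a stdlib sketch with invented classifier outputs:

```python
from collections import Counter

def majority_vote(decisions):
    # decisions: one predicted class label per classifier c = 1..C
    return Counter(decisions).most_common(1)[0][0]

print(majority_vote(["churn", "stay", "churn"]))  # "churn" wins 2 votes to 1
```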

Bagging
Bagging is short for "Bootstrap Aggregation". It is an ensemble technique that uses a "bag" of alike or unlike entities. It improves the performance of prediction models by helping to reduce the variance of the classifiers that are employed in those models [45]. The bagged prediction is the class label w_i receiving the most votes among the trained classifiers h_t, where t indexes the set of bootstrap training samples, h_t is the set of trained classifiers, and w_i are the class labels.
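A minimal sketch of the bootstrap-then-vote procedure, with a deliberately weak base learner and an invented toy training set (not the paper's classifiers):

```python
import random
from collections import Counter

def bootstrap_sample(data, rng):
    # draw len(data) examples with replacement — one "bag"
    return [rng.choice(data) for _ in data]

def fit_majority(bag):
    # a weak base learner: always predict the majority label of its own bag
    label = Counter(l for _, l in bag).most_common(1)[0][0]
    return lambda x: label

def bagging_predict(x, train, fit, n_models=11, seed=0):
    # train one model per bootstrap replica, then aggregate their votes
    rng = random.Random(seed)
    models = [fit(bootstrap_sample(train, rng)) for _ in range(n_models)]
    return Counter(m(x) for m in models).most_common(1)[0][0]

train = [((i,), "stay") for i in range(9)] + [((i,), "churn") for i in range(2)]
print(bagging_predict((0,), train, fit_majority))  # "stay"
```

Averaging many bootstrap-trained models is what reduces the variance of an unstable base learner.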

Stacking
Instead of choosing one learner over another, stacking combines them all [46]; it can outperform any of the individual trained models. Trainable classifiers are fed bootstrapped samples of the training data. In stacking, classifiers are divided into two tiers: Tier-1 and Tier-2. Classifiers in the first tier are trained to make predictions using the bootstrapped data, and those predictions are used to train the second tier. This ensures that the training data is used effectively for learning.
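The two-tier flow can be sketched as follows. This is a deliberately simplified stand-in (fixed-rule tier-1 "learners", a combiner that trusts the most accurate one, invented data), not the stacking setup used in the paper's experiments:

```python
def stack_fit(train, tier1_fits, tier2_fit):
    # Tier-1: train each base learner on the training data
    models = [fit(train) for fit in tier1_fits]
    # Tier-2: the base learners' predictions become the meta-training set
    meta_train = [([m(x) for m in models], y) for x, y in train]
    meta = tier2_fit(meta_train)
    return lambda x: meta([m(x) for m in models])

# two illustrative tier-1 "learners" (their fit ignores the data for brevity)
fit_a = lambda train: (lambda x: "churn" if x < 5 else "stay")       # a good rule
fit_b = lambda train: (lambda x: "churn" if x % 2 == 0 else "stay")  # a bad rule

def tier2_fit(meta_train):
    # a simple trainable combiner: trust the tier-1 model with the best accuracy
    n = len(meta_train[0][0])
    accs = [sum(preds[i] == y for preds, y in meta_train) for i in range(n)]
    best = max(range(n), key=lambda i: accs[i])
    return lambda preds: preds[best]

train = [(1, "churn"), (2, "churn"), (8, "stay"), (9, "stay")]
model = stack_fit(train, [fit_a, fit_b], tier2_fit)
print(model(3))  # "churn" — tier-2 learned to trust the accurate rule
```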

Proposed Framework
The fusion of multiple models can improve performance by combining clustering methods such as k-means, k-medoids, and random clustering with the Gradient Boosted Tree, Decision Tree, Random Forest, Naive Bayes, and Deep Learning classifiers, and by integrating ensemble machine learning algorithms such as voting, bagging, boosting, and stacking. The steps involved in the proposed method are described below.

Map Clustering on the Label
After clustering, mapping is applied to produce TP, TN, FN, and FP counts from the Orange and Cell2Cell datasets. Cluster 0 is mapped to churners, while cluster 1 is mapped to non-churners. Say we have a table with three columns: cluster0, cluster1, and churn. Mapping produces a fourth column, prediction (churn), which contains classes 0 and 1. Classes 0 and 1 are used as a mapping for the cluster 0 and cluster 1 data, allowing TP, TN, FN, and FP to be produced. Cluster 0 and cluster 1 are the two groups that emerge from applying clustering to the dataset. The values of the predicted class are compared to the clusters to see whether they fall within either of them; clusters are therefore examined by first mapping them using the prediction class. A TN is produced if a value of cluster 0 falls into prediction class 0, and a TP is produced if a value of cluster 1 falls into prediction class 1. Likewise, an FP is produced if a value of cluster 0 falls into prediction class 1, and an FN is produced if a value of cluster 1 falls into prediction class 0. Hence, mapping is used for cluster analysis.
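The mapping rules above translate directly into code; a sketch with invented cluster assignments and class labels (the function names are ours, not the paper's):

```python
def confusion_from_clusters(cluster_ids, actual):
    # cluster 0 is mapped to churners (class 0), cluster 1 to non-churners (class 1)
    tp = tn = fp = fn = 0
    for c, y in zip(cluster_ids, actual):
        if c == 0 and y == 0:
            tn += 1      # cluster 0 value falls into prediction class 0
        elif c == 1 and y == 1:
            tp += 1      # cluster 1 value falls into prediction class 1
        elif c == 0 and y == 1:
            fp += 1      # cluster 0 value falls into prediction class 1
        else:
            fn += 1      # cluster 1 value falls into prediction class 0
    return tp, tn, fp, fn

tp, tn, fp, fn = confusion_from_clusters([0, 0, 1, 1, 0, 1], [0, 1, 1, 1, 0, 0])
# (tp, tn, fp, fn) == (2, 2, 1, 1)
```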

Results
In this section, we explain the evaluation metrics used in this study and comprehensively present the experimental results obtained with the proposed scheme. Additionally, a discussion part covers the overall analysis of the performance of the designed model and makes a brief comparison with conventional approaches published for CCP.

Evaluation Measures
This section of the paper concisely describes the performance assessment metrics of the proposed model. All exploited performance evaluation metrics are defined below.

Accuracy: the percentage of truly predicted examples among all predictions made by the model. Mathematically, it can be defined as:

Accuracy = (TP + TN) / (TP + TN + FP + FN)

Precision: estimates what fraction of the examples identified as positive by the algorithm truly belong to the positive class. It can be calculated as:

Precision = TP / (TP + FP)

Recall: estimates what fraction of the samples that actually belong to the positive class is truly predicted positive by the model. Mathematically, it can be computed as:

Recall = TP / (TP + FN)

F-measure: calculated by taking the harmonic mean of precision and recall. It can be defined as:

F-measure = 2 × (Precision × Recall) / (Precision + Recall)
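These four metrics can be computed from the confusion counts in a few lines (the counts below are invented for illustration, not results from the paper):

```python
def metrics(tp, tn, fp, fn):
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_measure = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f_measure

acc, p, r, f = metrics(tp=90, tn=870, fp=25, fn=15)  # acc = 0.96 on these toy counts
```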

Performance Analysis Based on Clustering Algorithms
In experiment 1, various unsupervised approaches are compared by employing popular clustering models: k-means, k-medoids, and random. The technique is built on clustering, in which customers with similar perspectives on the business are grouped to allow churn or non-churn categorization. The performance of the clustering techniques is shown as a bar graph in Figure 2. The x-axis indicates the clustering method and the y-axis reflects the performance values, including accuracy, recall, precision, and f-measure, achieved by each method. Figure 2a represents the performance on the Orange dataset, while Figure 2b represents the performance of the clustering techniques on the Cell2Cell dataset. The bar graph shows the k-medoids technique reaches 75.44% and 75.56% accuracy on the Orange and Cell2Cell datasets, respectively. Through this comparison, we can see that the k-medoids method outperforms the other methods.

Performance Analysis Based on Classification Algorithms
In the second experiment, five different classification algorithms (i.e., GBT, DT, RF, NB, and DL) are employed as single classifiers. Model competence is computed based on precision, recall, accuracy, and f-measure. Among the five single-classifier models, GBT demonstrated the highest accuracy of 92.98% and 93.19% on the Orange and Cell2Cell datasets, respectively, while the DT classifier showed the lowest results, achieving 85.40% and 87.04% on the Orange and Cell2Cell datasets. Figure 3a,b illustrate the results of the classifiers evaluated in this research.

Combining the k-Medoids Clustering Algorithm with Hybrid Classifiers
In the fourth experiment, we construct a hybrid network by combining multiple classifiers with the k-medoids clustering method to enhance the performance of the proposed CCP system. Classification results from the different algorithms are combined using a voting process: the data is classified by each classification algorithm, and the outcomes are then integrated via voting to produce a final class prediction based on the maximum number of votes for a given class. The experimental results of this hybrid model reveal that merging different classifiers with k-medoids clustering performs better than executing each single classifier alone. Moreover, among all the experiments, the hybrid classifier (GBT-DT-DL) combined with k-medoids clustering shows the best accuracy of 95.05% and 93.40% on the Orange and Cell2Cell datasets, respectively. Figure 5a,b show the results as bar graphs using various indicators, namely precision, recall, f-measure, and accuracy, for both datasets.

Ensemble Classifiers Combined with k-Medoids and Hybrid Classifiers
In the fifth experiment, the ensemble classifiers, namely bagging and stacking, are utilized to further improve the experimental results. In this method, we combine the ensemble classifiers with k-medoids and hybrid classifiers and achieve higher results compared to the above experiments. In bagging, a voting operator is employed to create a diverse combination of classifiers. In the testing phase, model application and classification performance operators are applied to calculate the performance of the hybrid approach. Figure 6a depicts that the ensemble method (k-medoids-GBT-DT-DL) with bagging obtains the top performance of 95.12% accuracy on the Orange dataset. Figure 6b illustrates that the ensemble method (k-medoids-GBT-DT-DL) with bagging obtains the top performance of 93.43% accuracy on the Cell2Cell dataset. Figure 6c shows a maximum accuracy of 96% on the Orange dataset using the ensemble method (k-medoids-GBT-DT-DL) with stacking, and Figure 6d indicates a top accuracy of 93.6% on the Cell2Cell dataset with the same stacking ensemble. Furthermore, all other measures, including precision, recall, and f-measure, also perform better using the ensemble technique in contrast to the other methods.

Performance Comparison with Other Existing Approaches
The proposed ensemble model integrates clustering- and classification-based algorithms to effectively handle the problems related to massive telecom datasets. This research endeavors to address the fundamental problems of the telecom CCP challenge in a more systematic and intuitive way. Some existing studies for CCP using the Orange and Cell2Cell datasets are explained below. Irina V. Pustokhina et al. [49] proposed an improved system for CCP using the Orange telecom dataset. The authors employed a hybrid method called ISMOTE-OWELM: the improved synthetic minority over-sampling technique (ISMOTE) is used to deal with the imbalanced dataset, and the rain optimization algorithm (ROA) is employed to estimate the ideal sampling rate. In the last step, the optimally weighted extreme learning machine (OWELM) method is used to establish the class labels of the employed sample. In this work, three different datasets were applied to evaluate model efficiency; the Orange dataset attained a 92% accuracy rate.
Muhammad Usman et al. [50] designed and implemented a system for CCP using a comparative analysis of learning networks. Two standard datasets, Cell2Cell and KDD Cup, were used in this study. They determined that long short-term memory (LSTM) networks obtained an accuracy of 72.7% on the Cell2Cell dataset and 89.3% classification accuracy on the KDD Cup dataset for CCP.
Samah Wael Fujo et al. [51] implemented a deep-learning-based artificial neural network (Deep-BP-ANN) for CCP in the telecommunication industry. The authors employed two feature selection techniques, variance thresholding and lasso regression. Moreover, the early stopping method was applied to overcome the overfitting issue and end training at the appropriate time. The IBM Telco and Cell2Cell datasets were used to evaluate the performance of the implemented model. Experimental results reveal that 88.12% accuracy was obtained on the IBM Telco dataset and 79.38% accuracy on the Cell2Cell dataset.
Praseeda, C. K., and Shivakumar, B. L. [52] proposed a model for CCP using both classification and clustering algorithms. The fuzzy particle swarm optimization (FPSO) technique was applied for feature selection, and the divergence kernel-based support vector machine (DKSVM) technique was used to categorize churn customers. After the classification step, the hybrid clustering-based kernel distance possibilistic fuzzy local information C-means (HKD-PFLICM) model was employed for cluster-based retention. The results of this study reveal that the proposed approach achieved a 76.51% accuracy rate on the Cell2Cell dataset and outperforms other existing algorithms.

Discussion
This study established an ensemble-based hybrid model using supervised and unsupervised learning approaches. The suggested framework produces the following results. Experiment 1 (Figure 2) reveals that the k-medoids clustering technique achieved the top performance of 75.44% and 75.56% on the Orange and Cell2Cell datasets. In experiment 2 (Figure 3), single classification algorithms were employed and obtained the following accuracies on the Orange and Cell2Cell datasets, respectively: GBT 92.98% and 93.19%, DT 85.40% and 87.04%, RF 86.62% and 88.24%, DL 91.38% and 92.41%, and NB 86.02% and 87.79%. Among the single classification algorithms, GBT outperforms the other classifiers in terms of the evaluation metrics, particularly accuracy. In experiment 3 (Figure 4), combinations of k-medoids and single classification algorithms were evaluated; k-medoids-GBT achieved the highest results of 94% and 92.25% accuracy on the two datasets, while the k-medoids-DL model performed well and took second position with 93.5% and 92.53% accuracy. Next, in experiment 4 (Figure 5), a voting-based hybrid classifier with the k-medoids clustering technique was applied to check the effect on accuracy and the other measures; k-medoids-GBT-DT-DL achieved 95.06% accuracy on the Orange dataset and 93.40% accuracy on the Cell2Cell dataset. The k-medoids-GBT-DT-NB model also obtained better results compared to the other hybrid models in experiment 4.
In experiment 5 (Figure 6a,b), bagging-based hybrid classifiers with k-medoids ensemble models were evaluated. The k-medoids-GBT-DT-DL model gave a superior performance of 95.12% and 93.43% accuracy on the two datasets compared to all other models. As in experiment 4, the k-medoids-GBT-DT-NB model performed second best after k-medoids-GBT-DT-DL, with accuracies of 93.64% and 92.98%. Lastly, in experiment 6, stacking-based hybrid classifiers with k-medoids ensemble models were developed to enhance performance. The bar graphs (Figure 6c,d) show that, as in experiment 5, the k-medoids-GBT-DT-DL combination obtained the highest results of all combinations applied to the Orange and Cell2Cell datasets. k-medoids-GBT-DT-DL achieved 95.34% accuracy, 77.09% recall, 83.51% precision, and 80.17% F-measure on the Orange dataset, and 93.43% accuracy, 67.45% recall, 79.10% precision, and 72.81% F-measure on the Cell2Cell dataset. Furthermore, Table 2 compares existing methods with the proposed ensemble method.
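The stacking scheme of experiment 6 can be sketched with scikit-learn's `StackingClassifier`: base learners produce out-of-fold predictions on which a meta-learner is trained. Everything here is a stand-in, assuming details the paper does not give: the data are synthetic (not Orange/Cell2Cell), an `MLPClassifier` stands in for the DL learner, and the logistic-regression meta-learner is an assumption.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for a churn dataset; the real features come from
# Orange/Cell2Cell and are not reproduced here.
X, y = make_classification(n_samples=600, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

stack = StackingClassifier(
    estimators=[
        ("gbt", GradientBoostingClassifier(random_state=0)),
        ("dt", DecisionTreeClassifier(random_state=0)),
        # MLP as a small stand-in for the paper's "Deep Learning" base model.
        ("dl", MLPClassifier(hidden_layer_sizes=(16,), max_iter=500, random_state=0)),
    ],
    final_estimator=LogisticRegression(),  # meta-learner; an assumption
    cv=5,  # out-of-fold predictions for the meta-learner
)
stack.fit(X_tr, y_tr)
print(round(stack.score(X_te, y_te), 3))
```

Swapping `StackingClassifier` for `BaggingClassifier` over a single base learner gives the experiment-5 counterpart.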
Table 2 illustrates that the proposed approach for CCP outperforms traditional approaches. Moreover, Figure 7 depicts the accuracy comparison among all methods used in this study. The bar graph shows the lowest performance for the single k-medoids clustering algorithm, while the Hybrid 4 (k-medoids-Stacking-GBT-DT-DL) algorithm obtains highly significant results on both datasets. The proposed stacking-based ensemble algorithm achieves the highest accuracy rate among all experiments. The experimental results indicate that the proposed ensemble model achieved 96%, 91.61%, and 90.23% accuracy, recall, and F-measure, respectively, on the Orange dataset, and 93.6%, 85.45%, and 83.72% accuracy, recall, and F-measure, respectively, on the Cell2Cell dataset. To summarize, the results of all experiments demonstrate that hybrid models consistently outperform single clustering/classifier-based approaches. On both datasets, the proposed model exhibits superior classification performance in terms of accuracy, recall (sensitivity), and F-measure. Multiple existing techniques were applied to the two datasets used in this research; however, our stacking-based ensemble k-medoids-GBT-DT-DL model produced higher outcomes.
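The evaluation measures reported above follow the standard confusion-matrix definitions. A minimal sketch, with illustrative counts that are not taken from the paper's experiments:

```python
def churn_metrics(tp, fp, fn, tn):
    """Accuracy, recall, precision, and F-measure from confusion-matrix counts,
    treating churn as the positive class."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    recall = tp / (tp + fn)          # sensitivity: churners correctly flagged
    precision = tp / (tp + fp)       # flagged customers who actually churn
    f_measure = 2 * precision * recall / (precision + recall)
    return accuracy, recall, precision, f_measure

# Illustrative counts only (hypothetical, not from the paper).
acc, rec, prec, f1 = churn_metrics(tp=60, fp=15, fn=20, tn=405)
print(acc, rec, prec, f1)  # accuracy 0.93, recall 0.75, precision 0.80
```

Note that on imbalanced churn data a high accuracy can coexist with modest recall, which is why the study reports recall and F-measure alongside accuracy.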

Conclusions
In summary, the current work focuses on the development of highly effective ensemble-based customer churn prediction models. In this study, the Orange and Cell2Cell telecom churn prediction datasets are employed to develop CCP models. The initial data collection contains a wide range of data values, so the samples are normalized. After preprocessing, the data are clustered using unsupervised methods including k-means, k-medoids, and random clustering. The classification procedure is then carried out with the set of classifiers examined in this study. The stacking-based ensemble model (k-medoids-Stacking-GBT-DT-DL) is the most accurate of all ensemble models. Performance is evaluated in terms of precision, recall, and accuracy, and the ensemble system outperforms the single classification methods in terms of accuracy. The results could be further improved using larger datasets and state-of-the-art deep learning-based methods; however, such methods require more computational resources because of their deep architectures. In the future, this study can be extended using deep learning-based approaches to enhance customer churn prediction for the telecom sector, and future efforts will be devoted to enhancing the speed and effectiveness of this system.
• Classification: the proposed system evaluates different classifiers as single and hybrid models using two datasets. Through the classification stage, we select the most appropriate individual and hybrid classification models.
• Ensemble-based churn prediction: in the churn prediction stage, the best hybrid clustering- and classification-based models are combined with an ensemble classifier to select the top CCP ensemble approach.

Figure 2. (a) Results analysis using single clustering methods on the Orange dataset. (b) Results analysis using single clustering methods on the Cell2Cell dataset.

Figure 3. (a) Results analysis using single classification methods on the Orange dataset. (b) Results analysis using single classification methods on the Cell2Cell dataset.

4.4. Combining the k-Medoids Clustering Algorithm with Each Single Classifier
In the third experiment, we build a hybrid network by merging the k-medoids clustering method with every single classifier. DT, RF, GBT, DL, and NB are the key classifiers hybridized with k-medoids. The dataset is first sorted into relative clusters using the k-medoids algorithm; afterwards, the clusters are separated into training examples and testing examples. Based on the training data, the classification algorithms construct the trained network of the CCP system, which is then used to evaluate the system. Precision, recall, F-measure, and accuracy are the key performance indicators used to measure the system's performance, as shown in Figure 4a,b. The performance of the GBT classifier is superior to that of the other classifiers, as it improves classification performance: k-medoids with GBT achieves 94% and 92.25% accuracy on the Orange and Cell2Cell datasets, respectively. The other indicators are also shown in Figure 4a,b in the form of a bar graph.
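The cluster-then-classify procedure described above can be sketched as follows. This is a simplified illustration under stated assumptions: the k-medoids routine is a minimal alternating variant (not full PAM), the data and labels are synthetic stand-ins for the churn datasets, and a decision tree represents the per-cluster classifier.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

def kmedoids(X, k, n_iter=20, seed=0):
    """Minimal alternating k-medoids (a simplification of PAM; sketch only)."""
    rng = np.random.default_rng(seed)
    # Pairwise Euclidean distances between all samples.
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    medoids = rng.choice(len(X), size=k, replace=False)
    for _ in range(n_iter):
        labels = np.argmin(dist[:, medoids], axis=1)
        new_medoids = medoids.copy()
        for c in range(k):
            members = np.where(labels == c)[0]
            if len(members):
                # New medoid: cluster member minimising total within-cluster distance.
                new_medoids[c] = members[np.argmin(dist[np.ix_(members, members)].sum(axis=1))]
        if np.array_equal(new_medoids, medoids):
            break
        medoids = new_medoids
    return np.argmin(dist[:, medoids], axis=1)

# Synthetic stand-in data (illustrative, not Orange/Cell2Cell).
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 5))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Cluster first, then split each cluster and train one classifier per cluster.
labels = kmedoids(X, k=3)
scores = []
for c in range(3):
    idx = np.where(labels == c)[0]
    if len(idx) < 10:  # skip degenerate clusters
        continue
    X_tr, X_te, y_tr, y_te = train_test_split(X[idx], y[idx], random_state=0)
    clf = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
    scores.append(clf.score(X_te, y_te))
print([round(s, 2) for s in scores])  # per-cluster test accuracies
```

A production implementation would use a tested library such as `KMedoids` from scikit-learn-extra rather than this hand-rolled loop.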

Figure 4. Clustering and classification-based hybrid model performance. (a) Hybrid model results using the Orange dataset. (b) Hybrid model results using the Cell2Cell dataset.

Figure 5. Voting-based ensemble classifier results. (a) Voting-based method results on the Orange dataset. (b) Voting-based method results on the Cell2Cell dataset.

Figure 6. Performance of different algorithms based on ensemble techniques. (a) Bagging-based method results on the Orange dataset. (b) Bagging-based method results on the Cell2Cell dataset. (c) Stacking-based method results on the Orange dataset. (d) Stacking-based method results on the Cell2Cell dataset.
Ahmed, A., & Maheswari [47] developed a metaheuristic-based churn prediction model using the Orange dataset. A hybrid firefly algorithm was applied as the classifier for CCP, with the compute-intensive comparison module of the firefly algorithm replaced by simulated annealing. This model delivers efficient results and achieves 86.38% accuracy on the Orange dataset. Idris, A., & Khan, A. [46] proposed a novel intelligent model using Filter Wrapper-based Churn Prediction (FW-ECP) for the telecom sector. The originality of FW-ECP lies in its capability to integrate filter- and wrapper-based feature selection with the learning potential of an ensemble classifier constructed from multiple base classifiers. They evaluated and compared the results of the designed FW-ECP model using two publicly accessible datasets, Orange and Cell2Cell. The presented model attained 79.4% accuracy on the Orange dataset and 84.9% accuracy on the Cell2Cell dataset. Vijaya, J., & Sivasankar, E. [48] designed an efficient model for CCP by combining particle swarm optimization (PSO) and feature selection with simulated annealing (FSSA). The authors designed three variants of PSO: PSO combined with feature selection as a pre-processing scheme, PSO embedded with simulated annealing, and PSO with a hybridization of both feature selection and simulated annealing. Various classification algorithms, including DT, NB, KNN, SVM, and RF, as well as three hybrid algorithms, were applied to analyze the model performance. The proposed PSO-FSSA model achieves 94.08% accuracy, and the hybrid ANN-MLR model achieves 85.44% accuracy.

Table 1. Features of the datasets applied in this study.

Table 2. Comparison with state-of-the-art approaches.
Figure 7. Illustration of accuracy comparison among all experiments using the Orange and Cell2Cell datasets.