An Enhanced Spectral Clustering Algorithm with S-Distance

Calculating and monitoring customer churn metrics is important for companies to retain customers and earn more profit in business. In this study, a churn prediction framework is developed using a modified spectral clustering (SC) algorithm. The similarity measure plays an imperative role in clustering for predicting churn with better accuracy from industrial data. The linear Euclidean distance in traditional SC is therefore replaced by the non-linear S-distance (Sd), which is deduced from the concept of S-divergence (SD). Several characteristics of Sd are discussed in this work. Experiments are conducted to validate the proposed clustering algorithm on four synthetic databases, eight UCI databases, two industrial databases and one telecommunications database related to customer churn. Three existing clustering algorithms, namely k-means, density-based spatial clustering of applications with noise (DBSCAN) and conventional SC, are also implemented on the above-mentioned 15 databases. The empirical outcomes show that the proposed clustering algorithm outperforms the three existing clustering algorithms in terms of the Jaccard index, f-score, recall, precision and accuracy. Finally, we also test the significance of the clustering results with the Wilcoxon signed-rank test, the Wilcoxon rank-sum test and the sign test. The comparative study shows that the outcomes of the proposed algorithm are promising, especially in the case of clusters of arbitrary shape.


Introduction
Advancements in information technology have given rise to digital innovation in the service industry, e.g., the e-commerce, banking, telecom and airline industries [1]. Customers now have easy access to enormous amounts of data on their desired services or consumables. This in turn has generated a scenario in which companies find it very difficult to retain their existing customer base. Companies have thus become more focused on increasing customer acquisition and controlling customer churn. Customers switching from one firm to another within a specified period negatively impact the company's revenue. Thus, customer acquisition and churn management have become key factors for the service sector. Several methods exist to effectively increase customer acquisition and to manage customer churn, such as nurturing effective relationships with customers, identifying the customers who are likely to leave and giving proactive solutions to the causes of their dissatisfaction, and improving sales approaches, marketing strategies and customer services. Technology is also responsible for the reframing of marketing to increase customer loyalty through the examination of stored information and customer metrics. It also allows customer relations to be connected with business demand [2]. However, the problem of identifying the best set of clients who are likely to subscribe to a product or service is NP-hard [3].
It is important to utilize and allocate resources effectively and efficiently by distinguishing high-value customers. It is also imperative for industrial enterprises to customize marketing strategies in such a way that they can achieve an edge over their competitors. There is a need for an unsupervised machine learning clustering algorithm to group customers according to some similarity or common trend, especially when the customer database is growing constantly, when the average transaction size increases or when the frequency of transactions per customer increases. In other words, a clustering algorithm helps to analyze the different needs of different groups of customers or to customize the marketing strategies of an organization to acquire customers and to manage customer churn. These problems can be handled using analytical methods, which draw on concepts from statistics, machine learning and data mining [4]. In [5], data mining by evolutionary learning using the genetic algorithm was presented for churn prediction on telecom subscriber data. Machine learning algorithms, for instance decision trees (DT) and neural networks (NN), have been exploited to predict customer churn by considering billing information, demographics, call details, contract/service status, records and service change logs [6]. An approach to recognizing potential bank customers who may react to a promotional offer in direct marketing based on customer historical data using a support vector machine (SVM) was presented in [7]. Churn prediction in online games using records of players' logins was addressed with the k-nearest neighbors (KNN) algorithm in [8,9]. Various machine learning algorithms such as logistic regression, DT, NN and SVM were adopted and compared to anticipate the success of telemarketing calls for selling bank long-term investments in [10]. SVM [11] and KNN [12] were also used to predict potential online buyers based on browser session data and hypertext transfer protocol level information. In [13], determining the active demand reduction potential of wet appliances was considered and solved using the expectation-maximization clustering algorithm. Hierarchical and fuzzy k-means clustering were compared in order to improve business models in demand response programs [14]. In [15], density and grid-based (DGB), density-based spatial clustering of applications with noise (DBSCAN), fast search and find of density peaks (FSFDP) and other clustering algorithms were applied to DNA microarray industrial data, finding that DGB is more suitable for clustering databases with arbitrary shapes than the traditional approaches. E-customer behavior characterization was performed by utilizing Web server log data and association rules in [16].
It can be observed from the literature that almost all the conventional unsupervised machine learning algorithms have been exploited in industrial applications, especially in churn prediction, by analyzing the behaviors of customers. However, the performance of an unsupervised clustering algorithm relies on the data/features, the similarity/distance measure, the objective function, the initial cluster centers and the clustering algorithm itself. The similarity measure plays an important role in disclosing hidden patterns and properly understanding massive industrial data. A substantial amount of research has studied clustering with various linear distance measures such as the Euclidean, Manhattan, Pearson correlation, Eisen cosine correlation, Spearman correlation, Kendall correlation, Bit-Vector, Hamming, Jaccard index and Dice index measures, but little attention has been paid to introducing non-linearity into similarity measures for data clustering [17,18]. Notably, a few of these approaches do not abide by the triangle inequality property [19]. The aim of investigating non-linearity in clustering algorithms is to identify a more accurate boundary between two groups. The Bregman divergence was considered as a measure of similarity and merged with traditional k-means to increase its efficacy in [19]. Currently, a few studies on various divergence-based similarity measures in clustering are underway [20,21]. In this work, the spectral clustering (SC) algorithm is adopted and modified using the non-linear S-distance (Sd), which is obtained from the S-divergence (SD). Some characteristics of Sd are also discussed in this study. The proposed SC algorithm is implemented on four toy databases, eight real-world UCI databases, two service industry databases and one telecommunications database related to customer churn. The proposed SC algorithm is compared with the conventional SC algorithm, i.e., SC with the linear Euclidean distance (Ed) [22], k-means [22] and DBSCAN [15]. All the achieved outcomes show that the proposed clustering algorithm performs better than the three existing approaches.
The rest of the article is structured as follows: Sd and its properties are presented in Section 2. The graph Laplacian and its characteristics are shown in Section 3. The modified SC algorithm and its proof of convergence are addressed in Section 4. Section 5 presents empirical outcomes and discussion. Section 6 concludes the work.

S-Distance and Its Properties
Let p and q be two points in the d-dimensional positive Euclidean space ℝ^d_+ [23]. Equation (1) is employed to compute the Sd:

dist_s(p, q) = [ Σ_{l=1}^{d} ( log((p_l + q_l)/2) − (1/2)(log p_l + log q_l) ) ]^{1/2}. (1)

Let f be an injective function f : ℝ^d_+ → M_d such that f(p) = diag(p_1, p_2, ..., p_d), where M_d represents the set of positive definite matrices of size d × d. Thus, the Sd is well-defined. The Sd is obtained from the idea of the SD, which is expressed by Equation (2):

dist_sd(X, Y) = log |(X + Y)/2| − (1/2) log |XY|, (2)

where |·| is the determinant of a matrix and dist_s(p, q) = dist_sd(f(p), f(q)). We now verify that Sd meets all the characteristics of a metric. The characteristics are given below:

Proposition 1. Non-negativity: dist_s(p, q) ≥ 0 for all p, q ∈ ℝ^d_+.

Proof. By the AM-GM inequality, (p_l + q_l)/2 ≥ (p_l q_l)^{1/2} for every coordinate l, so each term log((p_l + q_l)/2) − (1/2)(log p_l + log q_l) is non-negative, and hence dist_s(p, q) ≥ 0.

Proposition 2. Identity of indiscernibles: dist_s(p, q) = 0 if and only if p = q.

Proof. Proposition 2 can be written as dist²_s(p, q) = Σ_{l=1}^{d} [ log((p_l + q_l)/2) − (1/2)(log p_l + log q_l) ]. If p and q are the same, then q can be substituted by p in the above equation and every term vanishes, so dist_s(p, p) = 0. Conversely, equality holds in the AM-GM inequality only when p_l = q_l for every l.

Proposition 3. Symmetry: the Sd between p and q satisfies dist_s(p, q) = dist_s(q, p), since every term of Equation (1) is symmetric in p_l and q_l. This implies that the Sd also abides by the symmetric metric property.
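For concreteness, Equation (1) translates directly into a few lines of NumPy. The following is a minimal sketch assuming positive inputs; the function name and the small epsilon guard against log(0) are illustrative choices, not part of the original formulation.

```python
import numpy as np

def s_distance(p, q, eps=1e-12):
    """S-distance (Sd) between two points of R^d_+, following Equation (1).

    Each coordinate contributes log((p_l + q_l)/2) - (log p_l + log q_l)/2,
    which is non-negative by the AM-GM inequality (Proposition 1);
    eps guards against log(0) on boundary values.
    """
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    terms = np.log((p + q) / 2.0) - 0.5 * (np.log(p) + np.log(q))
    return float(np.sqrt(max(terms.sum(), 0.0)))

print(s_distance([1.0, 2.0], [1.0, 2.0]))  # ~0.0 (Proposition 2)
print(s_distance([1.0, 2.0], [4.0, 8.0]))  # > 0
```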

Proposition 4.
Triangle inequality: In the d-dimensional Euclidean space ℝ^d_+, let p, q and o be any three points. This proposition states that the sum of any two sides, namely dist_s(p, o) and dist_s(o, q), of a triangle is equal to or exceeds the length of the third side dist_s(p, q). Mathematically, dist_s(p, q) ≤ dist_s(p, o) + dist_s(o, q).
Proof. The claim follows by utilizing Propositions 1 and 2 together with the definition of Sd. Thus, Sd is a metric.

Next, some further characteristics of Sd are presented.

Theorem 1. Sd is not a Bregman divergence.

Proof. This may be demonstrated by refutation. At the beginning, assume that the Sd is a Bregman divergence. This implies that dist²_s(p, q) is strictly convex in p. It is therefore sufficient to demonstrate that dist²_s(p, q) is not convex in p.
Taking the second partial derivative of dist²_s(p, q) with respect to p_l gives

∂²dist²_s(p, q)/∂p_l² = 1/(2p_l²) − 1/(p_l + q_l)²,

which is negative whenever q_l < (√2 − 1)p_l. For such q, the Hessian matrix has negative diagonal entries and therefore cannot be positive semidefinite. So, we have verified that the Sd is not a Bregman divergence.
Theorem 2. Let a • p denote the Hadamard product of a ∈ ℝ^d_+ and p. Then dist²_s(a • p, a • q) = dist²_s(p, q); i.e., the Sd is invariant under coordinate-wise positive scaling.

Proof. By the definition of the Hadamard product, a • p = (a_1 p_1, ..., a_d p_d). Thus, for each coordinate,

log((a_l p_l + a_l q_l)/2) − (1/2)(log(a_l p_l) + log(a_l q_l)) = log((a_l(p_l + q_l))/2) − (1/2)(log p_l + log q_l + 2 log a_l) = log((p_l + q_l)/2) − (1/2)(log p_l + log q_l),

since the log a_l terms cancel. Summing over l proves the claim.

Remark 1. Figure 1 displays the norm-balls of the Sd and Ed around the point (5000, 5000) in ℝ²_+. One can observe from Figure 1 that the contour lines of Sd and Ed look like distorted triangles and concentric circles, respectively. Further, the contour lines of Sd approach each other as we get close to the origin. When two points are close to the origin, their Sd is high; on the contrary, the Sd of two points is low when they are far from the origin, whereas the Ed of the two points would be the same in both cases. Thus, Sd works well if the larger clusters are far from the origin and the smaller clusters are nearer to the origin.
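Remark 1 and Theorem 2 can be checked numerically. The sketch below, assuming the s_distance helper above, compares two pairs of points with an identical Euclidean separation of 1: Sd is much larger for the pair near the origin, while scaling both points of a pair by a common factor leaves Sd unchanged.

```python
import numpy as np

near = ([1.0, 1.0], [2.0, 1.0])             # close to the origin, Ed = 1
far = ([5000.0, 5000.0], [5001.0, 5000.0])  # far from the origin, Ed = 1

print(s_distance(*near))   # ~0.24: large Sd near the origin
print(s_distance(*far))    # ~7e-5: small Sd far from the origin

# Theorem 2: Hadamard scaling by a = (10, 10) does not change Sd.
a = np.array([10.0, 10.0])
print(s_distance(a * np.array(near[0]), a * np.array(near[1])))  # same ~0.24
```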

Graph Laplacian and Its Properties
Consider a database D = {p_1, ..., p_n} with n points in the d-dimensional Euclidean space, where p_i ∈ ℝ^d_+ denotes the ith point. W = (ℵ, Ψ, A) is another representation of the same database, where ℵ and Ψ are the sets of points and of edges between these points, respectively. A denotes the affinity matrix, or symmetric weighted matrix, of the graph W. In order to build W, we consider the local neighborhood relationships of the points. Several approaches are available in the literature to construct affinity matrices [24]. Here, we utilize a symmetry-favored KNN graph to improve the modeling of the graph and to reduce the effect of outliers and noise. The graph W may be expressed by the underlying manifold characteristics of the data space [25,26]. In SC, the proper selection of the pairwise similarity measure is crucial [24,26]. Equation (3) is employed to produce an asymmetric weighted matrix Π ∈ ℝ^{n×n} associated with W,
where dist_s(·) is the Sd between two data points p_i and p_j, k_{p_i} represents the kth nearest neighbor of p_i ∈ ℵ and KNN(p_i) is the set of k nearest neighbors of p_i. The symmetric weighted matrix of the graph W is obtained from Π using Equation (4).
Equation (4) is adopted to build the symmetry-favored KNN graph W. Figure 2 shows a pictorial representation of the difference between a symmetry-favored KNN graph and a plain KNN graph. The weights of symmetric edges of W are higher than those of asymmetric edges because the points associated with symmetric edges belong to the same sub-manifold.
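Since the exact forms of Equations (3) and (4) are not reproduced here, the sketch below illustrates one common construction under stated assumptions: Gaussian weights on Sd over the k nearest neighbors, followed by a simple symmetrization. The kernel width sigma and the averaging symmetrization are illustrative stand-ins, not the paper's exact symmetry-favored weighting.

```python
import numpy as np

def s_distance_matrix(X, eps=1e-12):
    """Pairwise Sd between the rows of X (entries assumed positive)."""
    X = np.asarray(X, dtype=float) + eps
    n = X.shape[0]
    D = np.zeros((n, n))
    for i in range(n):
        terms = np.log((X[i] + X) / 2.0) - 0.5 * (np.log(X[i]) + np.log(X))
        D[i] = np.sqrt(np.maximum(terms.sum(axis=1), 0.0))
    return D

def knn_affinity(X, k=20, sigma=1.0):
    """Sd-based KNN affinity matrix: a stand-in for Equations (3) and (4)."""
    D = s_distance_matrix(X)
    n = D.shape[0]
    Pi = np.zeros((n, n))                       # asymmetric weighted matrix
    for i in range(n):
        nbrs = np.argsort(D[i])[1:k + 1]        # k nearest neighbors (skip self)
        Pi[i, nbrs] = np.exp(-D[i, nbrs] ** 2 / (2.0 * sigma ** 2))
    return 0.5 * (Pi + Pi.T)                    # symmetric weighted matrix A
```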
The essential components of SC are graph Laplacian matrices, which are of two types: unnormalized and normalized [27]. Equation (6) is employed to compute the unnormalized graph Laplacian matrix,

W_un = ζ − A, (6)

where ζ is the diagonal degree matrix with ζ_ii = Σ_j A_ij. In contrast, Equation (7) is exploited to calculate the normalized graph Laplacian matrix,

W_no = I − ζ^{−1/2} A ζ^{−1/2}, (7)

where I is an identity matrix. µ_0, ..., µ_{n−1} and τ_0, ..., τ_{n−1} are the eigenvalues and eigenvectors of W_no, respectively. Proposition 5 presents the properties of W_no.
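In the notation of this section, the two Laplacians can be computed as follows; these are the standard forms from the SC literature [27], with ζ the diagonal degree matrix of A.

```python
import numpy as np

def graph_laplacians(A):
    """Unnormalized and normalized graph Laplacians of an affinity matrix A.

    W_un = zeta - A (Equation (6)) and
    W_no = I - zeta^{-1/2} A zeta^{-1/2} (Equation (7)),
    where zeta_ii = sum_j A_ij is the degree of point i.
    """
    deg = A.sum(axis=1)
    W_un = np.diag(deg) - A
    d = 1.0 / np.sqrt(np.maximum(deg, 1e-12))   # zeta^{-1/2}, guarded
    W_no = np.eye(A.shape[0]) - d[:, None] * A * d[None, :]
    return W_un, W_no
```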

Proposition 5.
Three properties of W_no are given below:
1. For every g ∈ ℝ^n, we have g^T W_no g = (1/2) Σ_{i,j=1}^{n} A_ij ( g_i/√ζ_ii − g_j/√ζ_jj )².
2. W_no is symmetric and positive semidefinite.
3. W_no has n non-negative, real-valued eigenvalues 0 = µ_0 ≤ · · · ≤ µ_{n−1}, where n is the number of points in D.
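Properties 2 and 3 are easy to verify numerically on a random affinity matrix; this small check is illustrative only and reuses the graph_laplacians helper sketched above.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.random((6, 6))
A = 0.5 * (A + A.T)                  # symmetric affinities
np.fill_diagonal(A, 0.0)

_, W_no = graph_laplacians(A)
mu = np.linalg.eigvalsh(W_no)        # real eigenvalues, ascending

print(np.allclose(W_no, W_no.T))     # True: W_no is symmetric
print(mu)                            # non-negative, with mu_0 ~ 0
```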

Proposed Spectral Clustering Algorithm and Analysis
In SC, a graph partitioning problem is approximated in such a manner that low weights are assigned to edges between clusters. This means that the association between clusters is low, or that the clusters are not similar. On the other hand, high edge weights are assigned to edges within clusters, whose points are similar. In [28], a similarity graph with an adjacency matrix A is partitioned by solving the mincut problem. This consists of selecting the partition B_1, ..., B_k that minimizes Equation (9):

cut(B_1, ..., B_k) = (1/2) Σ_{i=1}^{k} A(B_i, B̄_i), (9)

where A(B, B̄) = Σ_{p∈B, q∈B̄} A_{pq}. Here, B̄_i is the complement of B_i, where B_i is a disjoint subset of the points ℵ. In practice, the mincut problem does not give satisfactory partitions, as it often separates an individual point from the rest of the graph. This problem can be addressed using the normalized cut, Ncut, which is defined by Equation (10):

Ncut(B_1, ..., B_k) = Σ_{i=1}^{k} cut(B_i, B̄_i)/vol(B_i), (10)

where vol(B_i) = Σ_{p∈B_i} ζ_pp measures the volume of B_i.
The Ncut problem can be relaxed, and this relaxation yields the normalized SC [24]. Equation (11) defines the cluster indicator vectors g_j = (g_{1,j}, ..., g_{n,j})^T:

g_{i,j} = 1/√vol(B_j) if p_i ∈ B_j and g_{i,j} = 0 otherwise, (11)

where 1 ≤ i ≤ n and 1 ≤ j ≤ k. A matrix G can be constructed as G = (g_{i,j})_{1≤i≤n, 1≤j≤k}, which satisfies G^T ζ G = I; i.e., g_j^T ζ g_j = 1 for each j. So, Equation (12) is utilized to denote the minimization of Ncut:

min_{B_1,...,B_k} Tr(G^T W_un G) subject to G^T ζ G = I, (12)
where Tr is the trace of a matrix. After relaxing the discreteness condition and substituting V = ζ^{1/2} G, the relaxed problem is as shown in Equation (13):

min_{V ∈ ℝ^{n×k}} Tr(V^T W_no V) subject to V^T V = I. (13)

Equation (13) is a trace minimization problem that is solved by the matrix V containing the first k eigenvectors of W_no as columns. Let V = {v_1, ..., v_n} be the resulting set of embedded points in ℝ^k_+. We want to assign each v_i ∈ V to exactly one of k mutually exclusive classes C_1, ..., C_k, such that 2 ≤ k ≤ n. A mathematical way to formulate this problem is as Equation (14):

χ = min_{C_1,...,C_k} Σ_{j=1}^{k} Σ_{v_i ∈ C_j} dist²_s(v_i, m_j), (14)

where m_j is the center of class C_j. The solution to the above problem χ uses k-means with Sd, which converges to a local optimal solution of χ in finitely many iterations [29]. Algorithm 1 shows the modified SC.

Algorithm 1 The proposed SC algorithm.
Input: ℵ, k_ℵ, k: a set of points, the number of nearest neighbors for the affinity matrix and the number of clusters
Output: Cluster labels of all the points
1. Build the symmetry-favored KNN affinity matrix A of the graph W using Sd (Equations (3) and (4)).
2. Compute the normalized graph Laplacian matrix W_no using Equation (7).
3. Compute the first k eigenvectors τ_0, ..., τ_{k−1} of W_no and stack them as the columns of a matrix Υ ∈ ℝ^{n×k}.
4. Convert the Υ matrix to Γ ∈ ℝ^{n×k} by normalizing Υ such that its rows have unit norm.
5. Cluster the data points Γ_{i=1,...,n} ∈ ℝ^k into k clusters via k-means clustering with either Ed or Sd.
6. At the end, allot each point p_i to cluster j if and only if the ith row of matrix Γ was allotted to cluster j.
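Putting the pieces together, the following is an end-to-end sketch of Algorithm 1, reusing the knn_affinity and graph_laplacians helpers sketched above. For brevity, step 5 uses scikit-learn's KMeans, which is Euclidean; the paper also allows k-means with Sd in this step.

```python
import numpy as np
from sklearn.cluster import KMeans

def sc_s(X, n_clusters, k_nn=20):
    """Sketch of the proposed SC-S algorithm (Algorithm 1)."""
    A = knn_affinity(X, k=k_nn)                      # step 1: Sd-based affinity
    _, W_no = graph_laplacians(A)                    # step 2: normalized Laplacian
    _, tau = np.linalg.eigh(W_no)                    # step 3: eigenvectors, ascending
    Upsilon = tau[:, :n_clusters]                    # first k eigenvectors as columns
    norms = np.linalg.norm(Upsilon, axis=1, keepdims=True)
    Gamma = Upsilon / np.maximum(norms, 1e-12)       # step 4: unit-norm rows
    return KMeans(n_clusters=n_clusters, n_init=10,  # steps 5-6: cluster the rows
                  random_state=0).fit_predict(Gamma)
```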

Experimental Results and Discussion
A laptop with an Intel(R) Core(TM) i7-2620M CPU @ 2.70 GHz and 4 GB of RAM, running 64-bit Windows 10 with Python 3.6.5, was used for this study. Every aspect of the work was done in the Spyder 3.2.8 Python development environment.

Database Description
In total, 15 databases of three types are considered in this work to compare the performance of the proposed clustering algorithm with that of three existing approaches.

Synthetic Databases
Four synthetic/toy databases were considered. In the varied distributed database (DB1), data points are distributed with varied variances. Four concentric circles are present in the noisy four-circles database (DB2), where each circle represents a class. The blob database (DB3) consists of isotropic Gaussian blobs with three classes. The data point distribution is anisotropic in nature for the anisotropically distributed database (DB4). Table 1 presents the titles of the toy databases, the number of sample points in each database, the number of features per point and the number of clusters. The data distribution in two-dimensional Euclidean space for each of these four synthetic databases is shown in Figure 3. The x-axis and y-axis of each plotted distribution denote feature 1 and feature 2, respectively, as two features are present in each of the toy databases.

UCI and Industrial Databases
Eight popular real-world databases, namely Digits, Iris, Breast cancer, Wine, Avila, Letter, Poker and Shuttle, were adopted from the UCI repository [30,31]. A brief portrayal of these UCI databases is given in Table 1. In addition, two industrial databases, namely Bank telemarketing [10] and Purchasing intention [32], are considered in this work. A database related to telecommunications customer churn was adopted from the Kaggle repository to study customer data for retaining customers and maximizing benefit by devising suitable business plans. Brief details of these databases are given in Table 1. Outliers and data reconciliation are not handled separately in this work. However, normalization was carried out before applying the proposed algorithm in order to model the data correctly. As mentioned in Section 2, Sd is defined in the d-dimensional Euclidean space ℝ^d_+; thus, raw data were normalized to a positive scale by shifting the data by the absolute value of the most negative entry, such that the most negative value becomes the minimum positive non-zero value and all other data points become positive.
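A minimal sketch of this positive shift is given below; the small offset value is an illustrative choice ensuring that the most negative entry maps to a strictly positive value, as Sd requires.

```python
import numpy as np

def shift_to_positive(X, offset=1e-6):
    """Shift raw data so that every entry is strictly positive (needed for Sd)."""
    X = np.asarray(X, dtype=float)
    m = X.min()
    return X - m + offset if m <= 0 else X
```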

Evaluation Indices
Accuracy is one of the most widely adopted validation indices. It denotes the ratio of correct outcomes that a machine learning algorithm has attained. The higher the accuracy obtained by an algorithm, the better and more useful that algorithm is. However, accuracy alone may mislead researchers due to the accuracy paradox, so it is adopted along with other indices, for instance the Jaccard index, f-score, recall and precision [33][34][35]. Interested readers are referred to [36] to learn about the various validation indices in depth. Non-parametric statistical hypothesis tests, namely the Wilcoxon signed-rank test, the Wilcoxon rank-sum test and the sign test, were conducted as well at the 5% significance level to determine whether two paired samples come from the same distribution [37,38].
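A sketch of the five indices using scikit-learn is shown below. One assumption worth flagging: cluster labels are arbitrary, so they are first aligned to the ground-truth classes with the Hungarian method before the supervised-style indices are computed.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             jaccard_score, precision_score, recall_score)

def align_labels(y_true, y_pred):
    """Map arbitrary cluster labels onto ground-truth classes (Hungarian method)."""
    labels = np.unique(np.concatenate([y_true, y_pred]))
    C = confusion_matrix(y_true, y_pred, labels=labels)
    row, col = linear_sum_assignment(-C)               # maximize matched counts
    mapping = {labels[c]: labels[r] for r, c in zip(row, col)}
    return np.array([mapping[c] for c in y_pred])

def report(y_true, y_pred):
    y_hat = align_labels(y_true, y_pred)
    return {"jaccard": jaccard_score(y_true, y_hat, average="macro"),
            "f-score": f1_score(y_true, y_hat, average="macro"),
            "recall": recall_score(y_true, y_hat, average="macro"),
            "precision": precision_score(y_true, y_hat, average="macro"),
            "accuracy": accuracy_score(y_true, y_hat)}
```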

Results and Discussion
In this study, the proposed SC, i.e., SC with Sd (SC-S), is compared with the conventional SC (SC-E) [22], k-means [22] and DBSCAN [15] on the 15 databases described above. The affinity matrix represents the data points graphically and depends on the symmetry-favored KNN. So, in the first experiment, two methods, SC-S and SC-E, were executed on the four synthetic databases only, and their performance was judged on five validation indices, namely the Jaccard index, f-score, recall, precision and accuracy. A significant amount of time has been devoted by the research community to deciding the best value of k for KNN; this is still an open problem. So, the value of k is determined empirically in this work: starting at k = 10, it is increased to 30 with a step size of 5. The Jaccard index, f-score, recall, precision and accuracy achieved using SC-S and SC-E are shown in Figures 4-8, respectively. It is observed from Figures 4-8 that SC-S always outperforms SC-E on the five evaluation metrics. Moreover, KNN was stable when the value of k was 20, which is therefore used for the rest of the work [25].

In the second experiment, SC-S was compared with SC-E, k-means and DBSCAN on the 15 databases. Figures 9-12 show the data distribution of each toy database separately after applying the four clustering algorithms. Here, different colors denote different clusters; the number of colors depends on the number of clusters in each database, and the colors are assigned randomly, so no color is fixed to a particular cluster. It is clear from Figures 9-12 that k-means performs worst compared to the other three methods, but it is difficult to distinguish those three methods from Figures 9-12 alone. In Figure 10, the result of k-means shows that k-means works well in the case of spherical data only. While the other methods perform better than k-means, more information is required to compare the four clustering algorithms further. Figure 13 shows the Jaccard index, f-score, recall, precision and accuracy obtained using the four clustering algorithms. Here, two parameters, the radius (Eps) and the minimum number of points (MinPts), are required to execute DBSCAN; the values of Eps and MinPts are 0.5 and 3, respectively [39]. Figure 13 illustrates that the proposed clustering algorithm SC-S is the best among the four clustering algorithms in terms of the five evaluation metrics.

The proposed method, along with the three existing approaches, was also executed on the eight UCI databases, as shown in Figure 14. In addition, two industrial databases and one telecommunications database were used for customer churn analysis, and the achieved results are displayed in Figure 15. Figure 15 shows the accuracy and TP rates obtained for the test cases with regard to the prediction horizon, which is calculated as the number of tasks performed by the users before leaving the commercial web site. As shown in Figure 15, SC-S has the highest accuracy compared to the other existing approaches. Long-term deposits are favored by banks to maintain funds with minimal interest. Thus, this long-term deposit policy is better at generating more successful sales, even if it requires some effort in communicating with customers. Under these circumstances, the proposed model SC-S shows higher accuracy compared to the other existing approaches, as shown in Figure 15.
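For reference, the baselines can be reproduced with scikit-learn using the parameter values stated above (k = 20 neighbors, Eps = 0.5, MinPts = 3). The blob data here are a runnable toy stand-in, not the paper's databases, and sc_s refers to the sketch given after Algorithm 1.

```python
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_blobs

X, y_true = make_blobs(n_samples=300, centers=2, random_state=0)
X = X - X.min() + 1e-6                                         # positive shift for Sd

kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
dbscan_labels = DBSCAN(eps=0.5, min_samples=3).fit_predict(X)  # Eps = 0.5, MinPts = 3
sc_s_labels = sc_s(X, n_clusters=2, k_nn=20)                   # proposed SC-S (sketch above)
```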
In this type of database, human agents have a low probability of converting any call into a successful sale. The telecommunications database is clustered into two clusters, namely stable customers and churning customers. The objective is to predict customer behavior in the future based on these features. This clustering analysis can help enterprises to develop efficient marketing strategies, to select valuable customers and those that are necessary to retain, and to contact customers that are going to churn with appropriate retention measures to maximize profits. Further, enterprises can perform a deeper analysis of the stable customers and can target more similar customers to increase their market share. In this experiment, non-parametric significance tests comparing SC-S with the other methods, SC-E, k-means and DBSCAN, were performed. First, a pairwise comparison of SC-S with SC-E was performed and is labeled "M1". Second, a pairwise comparison was done with k-means and marked "M2". Finally, SC-S was compared with DBSCAN and denoted "M3". This pairwise experiment was performed for three tests, namely the Wilcoxon signed-rank test, the Wilcoxon rank-sum test and the sign test [40]. These pairwise tests are among the simplest statistical tests that a researcher can conduct within the framework of an empirical study. These non-parametric tests were executed on the accuracy values only, and the obtained p-values are reported in Table 2. The results in Table 2 allow us to refute the null hypothesis at the 5% level of significance, so SC-S is statistically superior to the three existing approaches. A few insignificant p-values higher than 0.05 are also reported in Table 2.
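The three pairwise tests can be run with SciPy, the sign test being realized as an exact binomial test on the number of wins. The per-database accuracy vectors below are placeholders, not the paper's reported numbers.

```python
import numpy as np
from scipy.stats import binomtest, ranksums, wilcoxon

acc_sc_s = np.array([0.91, 0.88, 0.95, 0.83, 0.90, 0.87, 0.92, 0.85])  # placeholder
acc_sc_e = np.array([0.86, 0.84, 0.91, 0.80, 0.85, 0.83, 0.88, 0.82])  # placeholder

print(wilcoxon(acc_sc_s, acc_sc_e))       # Wilcoxon signed-rank test (paired)
print(ranksums(acc_sc_s, acc_sc_e))       # Wilcoxon rank-sum test
wins = int(np.sum(acc_sc_s > acc_sc_e))
print(binomtest(wins, n=len(acc_sc_s), p=0.5))  # sign test
```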

Conclusions
In this work, an enhanced SC based on Sd is proposed to predict customer churn with better accuracy by analyzing industrial data. The traditional KNN graph is replaced by a symmetry-favored KNN graph in the proposed algorithm in order to increase the efficacy of clustering. Extensive experiments were performed on four synthetic, eight UCI, two industrial and one telecommunications database for customer churn analysis, validating the proposed algorithm through comparison with three existing clustering algorithms, namely SC-E, k-means and DBSCAN. All the outcomes show that the proposed algorithm performs better than the three existing approaches in terms of five validation metrics: the Jaccard index, f-score, recall, precision and accuracy. The statistical significance of SC-S is also measured using the Wilcoxon signed-rank test, the Wilcoxon rank-sum test and the sign test. This study can be extended to large databases by optimizing the eigenvalue computation step using either the Hadoop architecture or parallel computation. Real-world databases consist of categorical as well as numerical attributes; this study shows that SC-S works well on databases with numerical attributes only. Extending SC-S to databases with categorical attributes deserves further study.