A Genetic XK-Means Algorithm with Empty Cluster Reassignment

Abstract: K-Means is a well-known and widely used classical clustering algorithm. However, it easily falls into local optima and is sensitive to the initial choice of cluster centers. XK-Means (eXploratory K-Means) has been introduced in the literature, adding an exploratory disturbance onto the vector of cluster centers so as to jump out of local optima and reduce the sensitivity to the initial centers. However, empty clusters may appear during the iteration of XK-Means, damaging the efficiency of the algorithm. The aim of this paper is to introduce an empty-cluster-reassignment technique and use it to modify XK-Means, resulting in an EXK-Means clustering algorithm. Furthermore, we combine the EXK-Means with a genetic mechanism to form a genetic XK-Means algorithm with empty-cluster-reassignment, referred to as the GEXK-Means clustering algorithm. The convergence of GEXK-Means to the global optimum is theoretically proved. Numerical experiments on a few real-world clustering problems are carried out, showing the advantage of EXK-Means over XK-Means, and the advantage of GEXK-Means over EXK-Means, XK-Means, K-Means and GXK-Means (genetic XK-Means).


Introduction
Clustering algorithms are a class of unsupervised classification methods for a data set (cf. [1][2][3][4][5]). Roughly speaking, a clustering algorithm classifies the vectors in the data set such that distances of the vectors in the same cluster are as small as possible, and the distances of the vectors belonging to different clusters are as large as possible. Therefore, the vectors in the same cluster have the greatest similarity, while the vectors in different clusters have the greatest dissimilarity.
A clustering technique called K-Means is proposed and discussed in [1,2] among many others. Because of its simplicity and fast convergence speed, K-Means is widely used in various research fields. For instance, K-Means is used in [6] for removing the noisy data. A disadvantage of K-Means is that it is easy to fall into local optima. As a remedy, a popular trend is to integrate the genetic algorithm [7,8] with K-means to obtain genetic K-means algorithms [9][10][11][12][13][14][15][16][17][18][19][20][21][22][23]. K-Means is also combined with fuzzy mechanism to obtain fuzzy C-Means [24,25].
A successful modification of K-Means is proposed in [26], referred to as XK-Means (eXploratory K-Means). It adds an exploratory disturbance onto the vector of the cluster centers so as to jump out of local optima and to reduce the sensitivity to the initial centers. However, empty clusters may appear during the iteration of XK-Means, which violates the condition that the number of clusters should be K.

Algorithms
In this section, we first give some notations and describe some evaluation tools. Then, we define the clustering algorithms used in this paper.

Notations
Let us introduce some notations. Our task is to cluster a set of n genes {x_i, i = 1, 2, ..., n} into K clusters. Each gene is expressed as a vector of dimension D: x_i = (x_{i1}, x_{i2}, ..., x_{iD})^T. For i = 1, 2, ..., n and k = 1, 2, ..., K, we define
$$w_{ik} = \begin{cases} 1, & \text{if the } i\text{-th gene belongs to the } k\text{-th cluster}, \\ 0, & \text{otherwise}. \end{cases}$$
Denote the center of the k-th cluster by c_k = (c_{k1}, c_{k2}, ..., c_{kD})^T, defined as
$$c_k = \frac{\sum_{i=1}^{n} w_{ik}\, x_i}{\sum_{i=1}^{n} w_{ik}}.$$
The Euclidean norm ||·|| will be used in our paper. Then, for any two D-dimensional vectors y = (y_1, y_2, ..., y_D)^T and z = (z_1, z_2, ..., z_D)^T in R^D, the distance is
$$||y - z|| = \sqrt{\sum_{d=1}^{D} (y_d - z_d)^2}.$$
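For concreteness, the center and distance computations above can be sketched in code as follows. This is a minimal illustration; the paper's experiments were implemented in Matlab, and the Python function names here are our own:

```python
import numpy as np

def cluster_centers(X, labels, K):
    """Compute each c_k as the mean of the genes assigned to cluster k."""
    centers = np.zeros((K, X.shape[1]))
    for k in range(K):
        members = X[labels == k]
        if len(members) > 0:
            centers[k] = members.mean(axis=0)
    return centers

def euclidean(y, z):
    """Euclidean distance ||y - z|| between two D-dimensional vectors."""
    return float(np.linalg.norm(np.asarray(y, float) - np.asarray(z, float)))
```

Note that clusters are indexed from 0 here, whereas the paper indexes them from 1.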

Evaluation Strategies
In our numerical simulation, we will use the following evaluation tools: the mean squared error (MSE), the Xie-Beni index (XB) [12], the Davies-Bouldin index (DB) [13,27], and the separation index (S) [4]. The aim of the clustering algorithms discussed in this paper is to choose the optimal centers c k 's and the optimal label matrix W so as to minimize the mean square error MSE. Then, MSE together with the indexes XB, DB and S will be applied to evaluate the outcome of the clustering algorithms.
MSE is defined by
$$\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} \sum_{k=1}^{K} w_{ik}\, ||x_i - c_k||^2.$$
MSE will be used as the evaluation function in the genetic operation of the numerical simulation later on. Generally speaking, a lower MSE means a better clustering result. The XB index [12] is defined as
$$\mathrm{XB} = \frac{\mathrm{MSE}}{d_{\min}^2},$$
where d_min is the shortest distance between cluster centers. A higher d_min means a better clustering result. As mentioned above, the MSE is the lower the better. Therefore, a lower XB implies better clustering results.
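The MSE and XB computations can be sketched as below, assuming the form XB = MSE / d_min² described above (a hedged reconstruction; the paper's code was written in Matlab):

```python
import numpy as np

def mse(X, labels, centers):
    """Mean squared error: average squared distance of each gene to its center."""
    d = X - centers[labels]
    return float(np.mean(np.sum(d * d, axis=1)))

def xie_beni(X, labels, centers):
    """XB = MSE / d_min^2, d_min being the shortest distance between centers."""
    K = len(centers)
    d_min = min(np.linalg.norm(centers[i] - centers[j])
                for i in range(K) for j in range(i + 1, K))
    return mse(X, labels, centers) / d_min ** 2
```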
To define the DB index [13,27], we first define the within-cluster separation S_k as
$$S_k = \frac{1}{|C_k|} \sum_{x \in C_k} ||x - c_k||,$$
where C_k (resp. |C_k|) denotes the set (resp. the number) of the samples belonging to cluster k. Next, we define a term R_k for cluster c_k as
$$R_k = \max_{j \neq k} \frac{S_k + S_j}{||c_k - c_j||}.$$
Then, the DB index is defined as
$$\mathrm{DB} = \frac{1}{K} \sum_{k=1}^{K} R_k.$$
Generally speaking, a lower DB implies better clustering results.
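A short sketch of the standard Davies-Bouldin computation, following the definitions of S_k and R_k above (our own Python rendering, not the paper's Matlab code):

```python
import numpy as np

def davies_bouldin(X, labels, centers):
    """DB index: average over clusters of R_k = max_{j != k} (S_k + S_j)/||c_k - c_j||."""
    K = len(centers)
    # Within-cluster separation S_k: mean distance of members to their center.
    S = np.array([np.mean(np.linalg.norm(X[labels == k] - centers[k], axis=1))
                  for k in range(K)])
    R = [max((S[k] + S[j]) / np.linalg.norm(centers[k] - centers[j])
             for j in range(K) if j != k) for k in range(K)]
    return float(np.mean(R))
```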
The separation index S [4] measures the separation between clusters; generally speaking, a higher S implies better clustering results. The Nemenyi test [28][29][30] will be used for evaluating the significance of the differences of XK-Means vs. EXK-Means and GXK-Means vs. GEXK-Means, respectively. The function cdf.chisq of SPSS software (SPSS Statistics 17.0, IBM, New York, NY, USA) is used to compute the significance probability Pr. The value of Pr lies between 0 and 1. A smaller value of Pr implies a greater significance of the difference between the two groups. One can say that the difference between the two groups is significant if Pr is less than a particular threshold value. The most often used threshold values are 0.01, 0.05 and 0.1. The threshold value 0.05 will be adopted in this paper.
The relative error ReError defined below will be used as a stop criterion in our numerical iteration process:
$$\mathrm{ReError} = \frac{|\mathrm{MSE}_t - \mathrm{MSE}_{t-1}|}{\mathrm{MSE}_{t-1}},$$
where MSE_t and MSE_{t−1} denote the values of MSE in the current and previous iteration steps, respectively.
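The stop criterion can be sketched as a small helper, combining the relative-error test with the iteration cap T used later in the algorithms (function names are ours):

```python
def relative_error(mse_t, mse_prev):
    """ReError = |MSE_t - MSE_{t-1}| / MSE_{t-1}."""
    return abs(mse_t - mse_prev) / mse_prev

def should_stop(mse_t, mse_prev, t, T, tol):
    """Stop when the iteration cap T is reached or ReError <= tol."""
    return t >= T or relative_error(mse_t, mse_prev) <= tol
```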

XK-Means
Trying to jump out of local minima, the XK-Means algorithm is proposed in [26], where the usual K-Means is modified by adding an exploratory vector onto each cluster center:
$$c_k \leftarrow c_k + \theta_k,$$
where θ_k is a D-dimensional exploratory vector at the current step. It is used to disturb the center produced by the K-Means operation, and its i-th component is randomly chosen from a range determined by a given positive number b_i and a given factor β ∈ [0, 1). In general, the disturbance should decrease as the iteration proceeds. Thus, for a new iteration step, the new value of b_i is set to b_i ← α b_i with a given factor α ∈ [0, 1).
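A minimal sketch of the exploratory disturbance, under the assumption that each component of θ_k is drawn uniformly from [−b_i, b_i] and that the ranges b are contracted by α after each step (the exact sampling rule, involving β, is given in [26]; the code below is an illustrative simplification):

```python
import numpy as np

rng = np.random.default_rng(0)

def perturb_centers(centers, b, alpha=0.9):
    """Add an exploratory vector theta_k to each center.

    Assumption: theta components are uniform on [-b_i, b_i]; the exploration
    ranges b are then contracted by alpha for the next iteration step.
    """
    theta = rng.uniform(-b, b, size=centers.shape)
    return centers + theta, alpha * b
```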

Remark 1.
Empty clusters will not appear in a usual K-Means iteration process. However, it is possible for XK-Means to produce an empty cluster in the iteration process. This happens when the exploratory vector θ_k in Formula (13) drives the center c_k away from the genes in the k-th cluster, such that all these genes join other clusters in the re-organization stage of XK-Means and leave the k-th cluster empty. Then, the XK-Means iteration will end up with fewer than K clusters, which violates the condition that the number of clusters should be K.

EXK-Means
Due to the disturbance θ_k, the XK-Means algorithm may produce empty clusters during the iteration process, which violates condition (3). The reason for such a cluster to become empty is that it is too close to, and is absorbed into, another cluster when the centers are disturbed by the θ_k's. In this sense, it may seem reasonable for such a cluster to "disappear". On the other hand, however, empty clusters damage the clustering efficiency by decreasing the number of working clusters.
To resolve this problem, our idea is to re-assign to such an empty cluster the vector that is farthest from its center. Specifically, our EXK-Means modifies XK-Means by applying the following empty-cluster-reassignment procedure whenever empty clusters appear after an XK-Means iteration step.
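The empty-cluster-reassignment idea can be sketched as follows. This is our own reading of the procedure: each empty cluster receives the gene farthest from its current cluster's center; tie-breaking and the order of processing multiple empty clusters are assumptions:

```python
import numpy as np

def reassign_empty_clusters(X, labels, centers):
    """For every empty cluster k, move into it the gene farthest from the
    center of the cluster it currently belongs to (a sketch of the
    empty-cluster-reassignment procedure)."""
    K = len(centers)
    labels = labels.copy()
    centers = centers.copy()
    for k in range(K):
        if np.any(labels == k):
            continue
        # Distance of each gene to the center of its current cluster.
        dists = np.linalg.norm(X - centers[labels], axis=1)
        far = int(np.argmax(dists))
        labels[far] = k
        centers[k] = X[far]
    return labels, centers
```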

Genetic Operations
As argued in the Introduction, although the EXK-Means and XK-Means algorithms improve on K-Means with respect to the local minimum issue, the possibility remains for them to fall into local optima. We therefore combine a genetic mechanism with the EXK-Means to obtain global convergence. In particular, we propose to use the following genetic operations:

Label Vectors
For the convenience of the genetic operation, in place of the label matrix W, let us introduce the n-dimensional label vector L = (l_1, l_2, ..., l_n), where each component l_i ∈ {1, 2, ..., K} represents the cluster label of x_i, as in [10]. Let N denote the population size. Then, we write the population set as {L_j, j = 1, 2, ..., N}.

Initialization
To avoid empty clusters in the initialization stage, we initialize the population as follows. First, the first K components of each L_j are randomly assigned as a permutation of {1, 2, ..., K}, so that every cluster is non-empty. Secondly, each of the remaining components of L_j is assigned a cluster label selected uniformly at random from {1, 2, ..., K}.
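The initialization scheme can be sketched directly (labels are 0-based here, whereas the paper uses 1-based labels):

```python
import numpy as np

rng = np.random.default_rng(1)

def init_label_vector(n, K):
    """First K components: a random permutation of the K cluster labels,
    so no cluster starts empty. Remaining n-K components: labels drawn
    uniformly from {0, ..., K-1}."""
    head = rng.permutation(K)
    tail = rng.integers(0, K, size=n - K)
    return np.concatenate([head, tail])
```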

Selection
The usual roulette strategy is used for the random selection. The probability that an individual L_j is selected from the existing population to breed the next generation is given by
$$P(L_j) = \frac{F(L_j)}{\sum_{j'=1}^{N} F(L_{j'})},$$
where F(L_j), the reciprocal of MSE, represents the fitness value of the individual L_j in the population.
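Roulette selection with fitness F = 1/MSE can be sketched as below (a standard fitness-proportional sampler; function names are ours):

```python
import numpy as np

rng = np.random.default_rng(2)

def roulette_select(population, mse_values, N):
    """Select N individuals with probability proportional to fitness F = 1/MSE."""
    fitness = 1.0 / np.asarray(mse_values, float)
    probs = fitness / fitness.sum()
    idx = rng.choice(len(population), size=N, p=probs)
    return [population[i] for i in idx]
```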

Mutation
The mutation probability is denoted by P_m; it determines whether an individual L_j will be mutated. If an individual L_j is to be mutated, the translation probability of its component l_i becoming k is defined for i = 1, 2, ..., n and k = 1, 2, ..., K. To avoid empty clusters after the mutation operation, l_i is mutated only when the l_i-th cluster contains more than two genes.
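The mutation guard can be sketched as follows. For simplicity the new label is drawn uniformly here; the paper defines a translation probability over the K clusters, so this is an illustrative simplification of that rule:

```python
import numpy as np

rng = np.random.default_rng(3)

def mutate(labels, K, p_m):
    """Mutate each component with probability p_m, but only while its current
    cluster retains more than two genes, so no cluster is emptied.
    Simplification: the new label is uniform over {0, ..., K-1}."""
    labels = labels.copy()
    counts = np.bincount(labels, minlength=K)
    for i in range(len(labels)):
        if counts[labels[i]] > 2 and rng.random() < p_m:
            new = int(rng.integers(0, K))
            counts[labels[i]] -= 1
            counts[new] += 1
            labels[i] = new
    return labels
```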

Three-Step EXK-Means
A three-step EXK-Means is applied for rapid convergence. An individual L is updated through the following operations: calculate the cluster centers by using (4) for the given L; add the exploratory vector and update the cluster centers by using (13); reassign each gene to the cluster with the closest center to form a new individual L; and correct the new L by using the empty-cluster-reassignment procedure in Section 2.4 if it contains empty cluster(s) at this point. This process is repeated three times, finally forming an individual L of the next generation.

Genetic XK-Means (GXK-Means)
The GXK-Means is briefly described as follows:
1. Initialization: Set the population size N, the maximum number of iterations T, the mutation probability P_m, the number of clusters K and the error tolerance E_Tol. Let t = 0, and choose the initial population P(0) according to Section 2.5.2. In addition, choose the best individual from P(0) and denote it as the super individual L*(0).
2. Selection: Select a new population from P(t) according to Section 2.5.3, and denote it by P_1(t).
3. Mutation: Mutate each individual in P_1(t) according to Section 2.5.4, and obtain a new population denoted by P_2(t).
4. XK-Means: Perform XK-Means on P_2(t) three times to get the next-generation population, denoted by P(t + 1).
5. Update the super individual: Choose the best individual from P(t + 1) and compare it with L*(t) to get L*(t + 1).
6. Stop if either t = T or ReError ≤ E_Tol (see (12)); otherwise, go to Step 2 with t ← t + 1.

GEXK-Means (Genetic EXK-Means)
The process of the GEXK-Means proposed in this paper is as follows:
1. Initialization: Set the population size N, the maximum number of iterations T, the mutation probability P_m, the number of clusters K and the error tolerance E_Tol. Let t = 0, and choose the initial population P(0) according to Section 2.5.2. In addition, choose the best individual from P(0) and denote it as the super individual L*(0).
2. Selection: Select a new population from P(t) according to Section 2.5.3, and denote it by P_1(t).
3. Mutation: Mutate each individual in P_1(t) according to Section 2.5.4, and obtain a new population denoted by P_2(t).
4. EXK-Means: Perform the three-step EXK-Means on P_2(t) according to Section 2.5.5 to get the next-generation population, denoted by P(t + 1).
5. Update the super individual: Choose the best individual from P(t + 1) and compare it with L*(t) to get L*(t + 1).
6. Stop if either t = T or ReError ≤ E_Tol (see (12)); otherwise, go to Step 2 with t ← t + 1.
Let us explain the functions of the four operations in the GEXK-Means: selection, mutation, EXK-Means and the updating of the super individual. The selection operation encourages the population to evolve in a good direction. The function of the EXK-Means operation is a local search for better individuals. The mutation operation guarantees the ergodicity of the evolution process, which in turn guarantees the appearance of a global optimal individual in the evolution process. Finally, the updating operation of the super individual retains the global optimal individual forever once it appears.

Data Sets and Parameters
Thirteen data sets, shown in Table 1, are used for evaluating our algorithms. The first five are gene expression data sets, including Sporulation [31], Yeast Cell Cycle [32], Lymphoma [33], and the two UCI data sets Yeast and Ecoli. The other eight are UCI data sets that are not gene expression data sets.
As shown in Table 1, Sporulation, Yeast Cell Cycle and Lymphoma data sets contain some sample vectors with missing component values. To rectify these defective data, we follow the strategy adopted in [34][35][36]: the sample vectors with more than 20% missing components are removed from the data sets. In addition, for the sample vectors with less than 20% missing components, the missing component values are estimated by the KNN algorithm with the parameter k = 15 as in [35], where k is the number of the neighboring vectors used to estimate the missing component value (see [34][35][36] for details).
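The KNN imputation step can be sketched as below. This is a simplified rendering of the KNNimpute idea from [34][35][36] (distances computed only over the observed columns of the defective row; the handling of rows with multiple missing patterns is an assumption):

```python
import numpy as np

def knn_impute(X, k=15):
    """Fill missing entries (NaN) of each row with the column-wise average of
    the k nearest complete rows, nearest in the row's observed columns."""
    X = X.astype(float).copy()
    complete = X[~np.isnan(X).any(axis=1)]
    for i in range(len(X)):
        miss = np.isnan(X[i])
        if not miss.any():
            continue
        obs = ~miss
        # Distance to each complete row, restricted to the observed columns.
        d = np.linalg.norm(complete[:, obs] - X[i, obs], axis=1)
        nbrs = complete[np.argsort(d)[:k]]
        X[i, miss] = nbrs[:, miss].mean(axis=0)
    return X
```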
Here, we point out that the parameter k here is different from the index k used elsewhere in this paper to denote the k-th cluster. In the experiments, we use two different computers: M1 (Intel(R) Core(TM) i3-8100 CPU and 4 GB RAM, Santa Clara, CA, USA) and M2 (Intel(R) Core(TM) i5-7400 CPU and 8 GB RAM). The software Matlab (Matlab 2017b, MathWorks, Natick, MA, USA) is used to implement the clustering algorithms.

Experimental Results and Discussion
We divide this subsection into three parts. The first part concerns the performances of the algorithms in terms of MSE, S, DB and XB. The second part demonstrates the significance of the differences between the algorithms in terms of the Nemenyi test. The third part presents the computational times of the algorithms. We shall focus mainly on the comparisons of EXK-Means vs. XK-Means and GEXK-Means vs. GXK-Means, respectively, so as to show the benefit of introducing our empty-cluster-reassignment technique.

MSE, S, DB and XB Performances
Each of the five algorithms was run for fifty trials on the thirteen data sets. The averages over the fifty trials for the four evaluation criteria (MSE, S, DB and XB) are listed in Tables 2 and 3, devoted to the five gene expression data sets and the other eight UCI data sets, respectively. From Tables 2 and 3, we see that our GEXK-Means achieves the highest S and the lowest MSE, DB and XB for all thirteen data sets. Therefore, GEXK-Means performs better than the other four algorithms.
We also observe that the overall performance of our EXK-Means is somewhat better than that of XK-Means: EXK-Means is better than XK-Means in terms of all four clustering criteria (MSE, S, XB and DB) for three of the thirteen data sets; better in terms of three criteria for three data sets; better in terms of two criteria for four data sets; and better in terms of one criterion for three data sets. This means that EXK-Means performs better than XK-Means in nearly two thirds of the cases (in the total of 13 × 4 = 52 cases, EXK-Means is better than XK-Means in 3 × 4 + 3 × 3 + 2 × 4 + 1 × 3 = 32 cases). The better cases are marked in boldface in Tables 2 and 3.
To see the overall performance more clearly, Figures 1-4 present the average performances of the five algorithms over the thirteen data sets for the MSE, DB, XB and S evaluations, respectively. These figures clearly show that, in the sense of average performance, the proposed GEXK-Means outperforms the other four algorithms, and EXK-Means outperforms K-Means and XK-Means. As an example of what happens in the iteration processes, a typical iteration process on the Yeast Cell Cycle data set is shown in Figures 5-8, presenting the MSE, XB, DB and S curves, respectively, for the five algorithms.
Table 4 shows the results of the Nemenyi test on the MSE, S, DB and XB indexes. We use the threshold value 0.05 for the significance evaluation. For the DB index, EXK-Means shows a significant difference compared with XK-Means, while for the other three indexes it does not. For all four indexes, GEXK-Means shows a significant difference compared with GXK-Means.
Table 5 gives the average computational times over the fifty runs for each data set. It shows that the computational times of EXK-Means are a little longer than those of K-Means and XK-Means, and the computational times of GEXK-Means are a little longer than those of GXK-Means. This indicates that the introduction of our empty-cluster-reassignment technique increases the computational time. However, our algorithms are preferable when a small increase in computational time is acceptable and accuracy is the main concern.

Convergence
In this section, the convergence properties of GEXK-Means are analyzed. Clearly, there exist m = K^n possible solutions when classifying n genes into K clusters. As mentioned in Section 2.5, every possible solution can be denoted as a label vector L. Therefore, the number of all possible individuals is m. Let L* be the set of global optimal individuals with the maximum fitness value.
Let P_Mutation stand for the probability of generating an optimal individual in P_2(t) by the mutation operation. Then, P_Mutation > 0 (see (22)), where P_m > 0 is the mutation probability, P(L_j(t)) > 0 is the selection probability defined by Equation (18), and ∑_{j=1}^{N} P(L_j(t)) = 1.

Theorem 1.
When the GEXK-Means defined in Section 2.7 is applied for the classification of a given data set, the global optimal classification result for the data set will appear and will be caught with probability 1 in an infinite evolution iteration process of the GEXK-Means.
Proof. Along the evolution process, the updating operation of the super individual keeps a super individual, denoted by L*(t), for every generation t = 0, 1, 2, .... According to (22), L*(t) becomes a global optimal individual with positive probability. According to the Small Probability Event Principle [37,38], the global optimal individual will appear in the super individual sequence with probability 1 as the evolution iteration process goes to infinity. This proves the global convergence of GEXK-Means.
We remark that the global convergence stated above is of a theoretical and probabilistic nature. It does not guarantee that convergence to a global optimum can be reached in a finite number of GEXK-Means iterations.

Conclusions
XK-Means (eXploratory K-Means) is a popular data clustering algorithm. However, empty clusters may appear during the iteration of XK-Means, which violates the condition that the number of clusters should be K and causes damage to the efficiency of the algorithm. As a remedy, we define an empty-cluster-reassignment technique to modify XK-Means when empty clusters appear, resulting in an EXK-Means clustering algorithm. Furthermore, we combine the EXK-Means with genetic mechanism to form a GEXK-Means clustering algorithm.
Numerical simulations are carried out to compare K-Means, XK-Means, EXK-Means, GXK-Means (genetic XK-Means) and GEXK-Means. The evaluation tools include the mean squared error (MSE), the Xie-Beni index (XB), the Davies-Bouldin index (DB) and the separation index (S). The Nemenyi test for multiple comparisons is also performed on MSE, S, DB and XB, respectively. Thirteen real-world data sets are used for the simulation. The running times of these algorithms are also considered.
The conclusions we draw from the simulation results are as follows. First, the overall performances of EXK-Means in terms of the four indexes outperform those of XK-Means, and the overall performances of GEXK-Means outperform those of GXK-Means. This shows the effectiveness of the introduced empty-cluster-reassignment technique. Secondly, if we take the threshold value 0.05 for the Nemenyi test, then GEXK-Means shows a significant difference compared with GXK-Means for all four indexes. However, EXK-Means shows a significant difference compared with XK-Means only for the DB index. Thirdly, our EXK-Means and GEXK-Means take a little more computational time than XK-Means and GXK-Means, respectively.
The following global convergence of the GEXK-Means is also theoretically proved: the global optimum will appear and will be retained in the evolution process of GEXK-Means with probability 1.
Author Contributions: C.H. developed the mathematical model, carried out the numerical simulations and wrote the manuscript; W.W. advised on developing the learning algorithms and supervised the work; F.L. contributed to the theoretical analysis; C.Z. and J.Y. helped in the numerical simulations.