Article

A Genetic XK-Means Algorithm with Empty Cluster Reassignment

1 School of Mathematical Sciences, Dalian University of Technology, Dalian 116024, China
2 School of Computer Sciences and Technology, Inner Mongolia University for Nationalities, Tongliao 028043, China
* Author to whom correspondence should be addressed.
Symmetry 2019, 11(6), 744; https://doi.org/10.3390/sym11060744
Submission received: 11 April 2019 / Revised: 15 May 2019 / Accepted: 24 May 2019 / Published: 2 June 2019

Abstract:
K-Means is a well-known and widely used classical clustering algorithm. However, it easily falls into local optima and is sensitive to the initial choice of cluster centers. XK-Means (eXploratory K-Means) has been introduced in the literature by adding an exploratory disturbance onto the vector of cluster centers, so as to jump out of local optima and reduce the sensitivity to the initial centers. However, empty clusters may appear during the iteration of XK-Means, harming the efficiency of the algorithm. The aim of this paper is to introduce an empty-cluster-reassignment technique and use it to modify XK-Means, resulting in an EXK-Means clustering algorithm. Furthermore, we combine EXK-Means with a genetic mechanism to form a genetic XK-Means algorithm with empty-cluster-reassignment, referred to as the GEXK-Means clustering algorithm. The convergence of GEXK-Means to the global optimum is theoretically proved. Numerical experiments on a few real-world clustering problems are carried out, showing the advantage of EXK-Means over XK-Means, and the advantage of GEXK-Means over EXK-Means, XK-Means, K-Means and GXK-Means (genetic XK-Means).

1. Introduction

Clustering algorithms are a class of unsupervised classification methods for a data set (cf. [1,2,3,4,5]). Roughly speaking, a clustering algorithm classifies the vectors in the data set such that distances of the vectors in the same cluster are as small as possible, and the distances of the vectors belonging to different clusters are as large as possible. Therefore, the vectors in the same cluster have the greatest similarity, while the vectors in different clusters have the greatest dissimilarity.
A clustering technique called K-Means is proposed and discussed in [1,2], among many others. Because of its simplicity and fast convergence, K-Means is widely used in various research fields. For instance, K-Means is used in [6] for removing noisy data. A disadvantage of K-Means is that it easily falls into local optima. As a remedy, a popular trend is to integrate the genetic algorithm [7,8] with K-Means to obtain genetic K-Means algorithms [9,10,11,12,13,14,15,16,17,18,19,20,21,22,23]. K-Means has also been combined with a fuzzy mechanism to obtain fuzzy C-Means [24,25].
A successful modification of K-Means is proposed in [26], referred to as XK-Means (eXploratory K-Means). It adds an exploratory disturbance onto the vector of the cluster centers so as to jump out of the local optimum and to reduce the sensitivity to the initial centers. However, empty clusters may appear during the iteration of XK-Means, which violates the condition that the number of clusters should be a pre-given number K and causes damage to the efficiency of the algorithm (see Remark 1 in Section 2.3 below for details). As a remedy, we propose in this paper to modify XK-Means in terms of an empty-cluster-reassignment technique, resulting in an EXK-Means clustering algorithm.
The involvement of the exploratory disturbance in EXK-Means helps to jump out of the local optimum during the iteration. However, in order to guarantee the convergence of the iteration process, the exploratory disturbance has to decrease and tend to zero in the iteration process. Therefore, it is still possible for EXK-Means to fall into local optimum. To further resolve this problem, we follow the aforementioned strategy to combine the genetic mechanism with our EXK-Means, resulting in a clustering algorithm called GEXK-Means.
Numerical experiments on thirteen real world data sets are carried out, showing the higher accuracies of our EXK-Means over XK-Means, and our GEXK-Means over GXK-Means, EXK-Means, XK-Means and K-Means: first, our GEXK-Means achieves the highest S, and the lowest MSE, DB and XB (see the next section for definitions of these evaluation tools) for all of the thirteen data sets. Therefore, GEXK-Means performs better than the other four algorithms. Second, the overall performance of our EXK-Means is a little bit better than that of XK-Means, which shows the benefit of the introduction of our empty cluster reassignment technique.
The numerical experiments also show that the execution times of EXK-Means are a little longer than those of K-Means and XK-Means, and the execution times of GEXK-Means are the longest among the five algorithms. This is a disadvantage of EXK-Means and GEXK-Means. However, computers keep getting faster, and the computational time often does not matter very much in practice if the data set is not very large. When a modest increase in computational time is acceptable and accuracy is the main concern, our algorithms may be of value.
A probabilistic convergence of our GEXK-Means to the global optimum is theoretically proved.
This paper is organized as follows. In Section 2, we describe the K-Means, XK-Means, GXK-Means, and our proposed EXK-Means and GEXK-Means. In Section 3, numerical experiments are shown on GEXK-Means and its comparison with K-Means, XK-Means, GXK-Means and EXK-Means. The convergence of GEXK-Means to a globally optimal solution is theoretically proved in Section 4. Some short conclusions are drawn in Section 5.

2. Algorithms

In this section, we first give some notations and describe some evaluation tools. Then, we define the clustering algorithms used in this paper.

2.1. Notations

Let us introduce some notations. Our task is to cluster a set of $n$ genes $\{x_i,\ i = 1, 2, \ldots, n\}$ into $K$ clusters. Each gene is expressed as a vector of dimension $D$: $x_i = (x_{i1}, x_{i2}, \ldots, x_{iD})^T$. For $i = 1, 2, \ldots, n$ and $k = 1, 2, \ldots, K$, we define
$$w_{ik} = \begin{cases} 1, & \text{if the } i\text{-th gene belongs to the } k\text{-th cluster}, \\ 0, & \text{otherwise}. \end{cases} \qquad (1)$$
In addition, we define the label matrix $W = [w_{ik}]$. We require that each gene belongs to precisely one cluster, and each cluster contains at least one gene. Therefore,
$$\sum_{k=1}^{K} w_{ik} = 1, \quad i = 1, 2, \ldots, n, \qquad (2)$$
$$1 \le \sum_{i=1}^{n} w_{ik} < n, \quad k = 1, 2, \ldots, K. \qquad (3)$$
Denote the center of the $k$-th cluster by $c_k = (c_{k1}, c_{k2}, \ldots, c_{kD})^T$, defined as
$$c_k = \frac{\sum_{i=1}^{n} w_{ik} x_i}{\sum_{i=1}^{n} w_{ik}}. \qquad (4)$$
The Euclidean norm $\|\cdot\|$ will be used in our paper. Then, for any two $D$-dimensional vectors $y = (y_1, y_2, \ldots, y_D)^T$ and $z = (z_1, z_2, \ldots, z_D)^T$ in $\mathbb{R}^D$, the distance is
$$\|y - z\| = \Big( \sum_{i=1}^{D} |y_i - z_i|^2 \Big)^{1/2}. \qquad (5)$$
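To make this notation concrete, the following minimal NumPy sketch (our own illustration, not code from the paper; the names X, labels and K are assumptions) builds the label matrix $W$ and computes the cluster centers of Equation (4). It assumes every cluster is non-empty, as required by constraint (3).

```python
import numpy as np

def cluster_centers(X, labels, K):
    """X: (n, D) data matrix; labels: length-n integer array with values in {0, ..., K-1}."""
    n = X.shape[0]
    W = np.zeros((n, K))
    W[np.arange(n), labels] = 1.0        # w_ik = 1 iff the i-th gene belongs to the k-th cluster
    counts = W.sum(axis=0)               # cluster sizes; constraint (3) requires 1 <= count < n
    return (W.T @ X) / counts[:, None]   # Eq. (4): c_k = sum_i w_ik x_i / sum_i w_ik
```

Note that the sketch uses 0-based cluster labels, whereas the paper writes the clusters as $1, \ldots, K$.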

2.2. Evaluation Strategies

In our numerical simulation, we will use the following evaluation tools: the mean squared error (MSE), the Xie–Beni index (XB) [12], the Davies–Bouldin index (DB) [13,27], and the separation index (S) [4]. The aim of the clustering algorithms discussed in this paper is to choose the optimal centers $c_k$ and the optimal label matrix $W$ so as to minimize the mean squared error (MSE). Then, the MSE together with the indexes XB, DB and S will be applied to evaluate the outcome of the clustering algorithms.
MSE is defined by
$$MSE = \frac{1}{n} \sum_{k=1}^{K} \sum_{i=1}^{n} w_{ik}\, \|x_i - c_k\|^2. \qquad (6)$$
MSE will be used as the evaluation function in the genetic operation of the numerical simulation later on. Generally speaking, lower MSE means better clustering result.
The XB index [12] is defined as follows:
$$XB = \frac{MSE}{d_{\min}}, \qquad (7)$$
where $d_{\min}$ is the shortest distance between cluster centers. A higher $d_{\min}$ means a better clustering result and, as mentioned above, a lower MSE is better. Therefore, a lower XB implies a better clustering result.
To define the DB index [13,27], we first define the within-cluster separation $S_k$ as
$$S_k = \Big( \frac{1}{|C_k|} \sum_{x_i \in C_k} \|x_i - c_k\|^2 \Big)^{1/2}, \qquad (8)$$
where $C_k$ (resp. $|C_k|$) denotes the set (resp. the number) of the samples belonging to cluster $k$. Next, we define a term $R_k$ for the $k$-th cluster as
$$R_k = \max_{j,\ j \neq k} \frac{S_k + S_j}{\|c_k - c_j\|}. \qquad (9)$$
Then, the DB index is defined as
$$DB = \frac{1}{K} \sum_{k=1}^{K} R_k. \qquad (10)$$
Generally speaking, lower DB implies better clustering results.
The separation index S [4] is defined as follows:
$$S = \frac{1}{\sum_{k,j=1;\ k \neq j}^{K} |C_k|\,|C_j|} \sum_{k,j=1;\ k \neq j}^{K} |C_k|\,|C_j|\,\|c_k - c_j\|. \qquad (11)$$
Generally speaking, higher S implies better clustering results.
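For concreteness, here is a rough sketch (our own NumPy code, not the authors' implementation) of how the four indices of Equations (6)-(11) can be computed for one partition; X, labels and centers are assumed inputs, and every cluster is assumed non-empty.

```python
import numpy as np

def evaluate(X, labels, centers):
    K, n = centers.shape[0], X.shape[0]
    d2 = np.sum((X - centers[labels]) ** 2, axis=1)       # squared distance of each gene to its own center
    mse = d2.sum() / n                                    # Eq. (6)
    cdist = np.linalg.norm(centers[:, None, :] - centers[None, :, :], axis=2)
    off = ~np.eye(K, dtype=bool)
    xb = mse / cdist[off].min()                           # Eq. (7): d_min is the shortest center distance
    Sk = np.array([np.sqrt(d2[labels == k].mean()) for k in range(K)])   # Eq. (8)
    db = np.mean([max((Sk[k] + Sk[j]) / cdist[k, j] for j in range(K) if j != k)
                  for k in range(K)])                     # Eqs. (9)-(10)
    sizes = np.bincount(labels, minlength=K)
    w = np.outer(sizes, sizes)[off]
    s = float((w * cdist[off]).sum() / w.sum())           # Eq. (11)
    return mse, xb, db, s
```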
The Nemenyi test [28,29,30] will be used to evaluate the significance of the differences of XK-Means vs. EXK-Means and GXK-Means vs. GEXK-Means, respectively. The function cdf.chisq of SPSS software (SPSS Statistics 17.0, IBM, New York, USA) is used to compute the significance probability Pr. The value of Pr lies between 0 and 1; a smaller Pr implies a more significant difference between the two groups. One can say that the difference between the two groups is significant if Pr is less than a particular threshold value. The most often used threshold values are 0.01, 0.05 and 0.1. The threshold value 0.05 will be adopted in this paper.
The relative error $ReError$ defined below will be used as a stopping criterion in our numerical iteration process:
$$ReError = \left| \frac{MSE_{t-1} - MSE_t}{MSE_t} \right|, \qquad (12)$$
where $MSE_t$ and $MSE_{t-1}$ denote the values of MSE in the current and previous iteration steps, respectively.

2.3. XK-Means

To help the iteration jump out of local minima, the XK-Means algorithm was proposed in [26], where the usual K-Means is modified by adding an exploratory vector onto each cluster center as follows:
$$c_k^* = c_k + \theta_k, \qquad (13)$$
where $\theta_k$ is a $D$-dimensional exploratory vector at the current step. It is used to disturb the center produced by the K-Means operation, and its components are randomly chosen as
$$(\theta_k)_i = \mathrm{rand}(a_i, b_i) \cdot \mathrm{randsign}(i), \quad i = 1, 2, \ldots, D, \qquad (14)$$
where $b_i$ is a given positive number, and
$$a_i = \beta\, b_i, \qquad (15)$$
with a given factor $\beta \in [0, 1)$. In general, the disturbance should decrease as the iteration proceeds. Thus, for a new iteration step, the new value of $b_i$ is set to
$$b_i^* = \alpha\, b_i, \qquad (16)$$
with a given factor $\alpha \in [0, 1)$.
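The following is a small sketch (our own code and naming, not the implementation of [26]) of one disturbance step following Equations (13)-(16); storing the per-dimension range in an array b is our assumption.

```python
import numpy as np

def disturb_centers(centers, b, alpha=0.3, beta=0.95, seed=None):
    """Add the exploratory vectors theta_k to the centers and shrink the range for the next step."""
    rng = np.random.default_rng(seed)
    a = beta * b                                           # Eq. (15): a_i = beta * b_i
    theta = (rng.uniform(a, b, size=centers.shape)
             * rng.choice([-1.0, 1.0], size=centers.shape))   # Eq. (14): rand(a_i, b_i) * randsign(i)
    return centers + theta, alpha * b                      # Eqs. (13) and (16)
```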
Remark 1.
Empty clusters will not appear in a usual K-Means iteration process. However, it is possible for XK-Means to produce an empty cluster during the iteration. This happens when the exploratory vector $\theta_k$ in Formula (13) drives the center $c_k$ away from the genes of the $k$-th cluster, so that all these genes join other clusters in the re-organization stage of XK-Means and leave the $k$-th cluster empty. The XK-Means iteration then ends up with fewer than $K$ clusters, which violates the condition that the number of clusters should be $K$.

2.4. EXK-Means

Due to the disturbance $\theta_k$, the XK-Means algorithm may produce empty clusters during the iteration process, which violates condition (3). The reason such a cluster becomes empty is that it lies too close to another cluster and its genes are absorbed into it when the centers are disturbed by the $\theta_k$'s. In this sense, it may seem reasonable for such a cluster to simply "disappear". On the other hand, however, the empty clusters damage the clustering efficiency, because the number of working clusters decreases.
To resolve this problem, our idea is to re-seed each empty cluster with a gene that lies farthest from the center of its current (non-empty) cluster. Specifically, our EXK-Means modifies XK-Means by applying the following empty-cluster-reassignment procedure whenever empty clusters appear after an XK-Means iteration step; a code sketch follows the procedure.
Empty-cluster-reassignment procedure:
  • Let $K_0$ be the number of empty clusters, $1 \le K_0 < K$.
  • For each non-empty cluster, find its most marginal point: $x_k^* = \arg\max_{x_i \in C_k} \|x_i - c_k\|$, where $C_k$ is the set of genes in the $k$-th cluster.
  • Sort the points $x_k^*$ in descending order of their distances to the corresponding centroids to obtain $\{x_1^{**}, x_2^{**}, \ldots\}$.
  • Take the first $K_0$ genes $x_1^{**}, \ldots, x_{K_0}^{**}$ as the centers of the $K_0$ empty clusters.
  • Re-partition the genes according to the original centers and the $K_0$ new centers.
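A minimal sketch of this procedure (our own illustrative code; X, labels and centers are assumed inputs) could look as follows.

```python
import numpy as np

def reassign_empty_clusters(X, labels, centers):
    K = centers.shape[0]
    empty = [k for k in range(K) if not np.any(labels == k)]        # the K_0 empty clusters
    if not empty:
        return labels, centers
    # most marginal point of each non-empty cluster, sorted by distance to its centroid
    margins = []
    for k in range(K):
        idx = np.flatnonzero(labels == k)
        if idx.size:
            d = np.linalg.norm(X[idx] - centers[k], axis=1)
            margins.append((d.max(), idx[d.argmax()]))
    margins.sort(reverse=True)
    for k, (_, i) in zip(empty, margins):                            # first K_0 marginal genes become new centers
        centers[k] = X[i]
    dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    return dists.argmin(axis=1), centers                             # re-partition w.r.t. old and new centers
```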

2.5. Genetic Operations

As argued in the Introduction, although the EXK-Means and XK-Means algorithms improve on K-Means with respect to the local-minimum issue, they may still fall into local optima. We therefore combine a genetic mechanism with EXK-Means to obtain global convergence. In particular, we propose to use the following genetic operations:

2.5.1. Label Vectors

For the convenience of genetic operation, in place of the label matrix $W$, let us introduce the $n$-dimensional label vector
$$L = (l_1, l_2, \ldots, l_i, \ldots, l_n)^T, \qquad (17)$$
where each component $l_i \in \{1, 2, \ldots, K\}$ represents the cluster label of $x_i$, as in [10]. Let $N$ denote the population size. Then, we write the population set as $\{L_j,\ j = 1, 2, \ldots, N\}$.

2.5.2. Initialization

To avoid empty clusters in the initialization stage, we initialize the population as follows. First, the first $K$ components of each $L_j$ are assigned a random permutation of $\{1, 2, \ldots, K\}$. Second, each of the remaining components of $L_j$ is assigned a cluster label drawn uniformly at random from $\{1, 2, \ldots, K\}$.
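A small sketch of this initialization (our own code; it assumes $n \ge K$ and uses 0-based labels):

```python
import numpy as np

def init_population(N, n, K, seed=None):
    rng = np.random.default_rng(seed)
    pop = np.empty((N, n), dtype=int)
    for j in range(N):
        pop[j, :K] = rng.permutation(K)                 # first K genes: a permutation of all K labels
        pop[j, K:] = rng.integers(0, K, size=n - K)     # remaining genes: labels drawn uniformly from {0, ..., K-1}
    return pop
```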

2.5.3. Selection

The usual roulette strategy is used for the random selection. The probability that an individual L j is selected from the existing population to breed the next generation is given by
$$P(L_j) = \frac{F(L_j)}{\sum_{h=1}^{N} F(L_h)}, \quad j = 1, 2, \ldots, N, \qquad (18)$$
$$F(L_j) = \left( \frac{1}{n} \sum_{k=1}^{K} \sum_{i=1}^{n} w_{ik}\, \|x_i - c_k\| \right)^{-1}, \qquad (19)$$
where $F(L_j)$, the reciprocal of the mean clustering error of $L_j$, represents the fitness value of the individual $L_j$ in the population.
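A sketch of the roulette-wheel selection (our own code; the fitness values are assumed to be precomputed from Equation (19)):

```python
import numpy as np

def select_population(pop, fitness, seed=None):
    rng = np.random.default_rng(seed)
    p = fitness / fitness.sum()                          # Eq. (18): selection probabilities
    idx = rng.choice(len(pop), size=len(pop), replace=True, p=p)
    return pop[idx]
```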

2.5.4. Mutation

The mutation probability is denoted by $P_m$; it determines whether an individual $L_j$ will be mutated. If an individual $L_j$ is to be mutated, the transition probability of its component $l_i$ to the value $k$ is defined as
$$P_{ik} = P\{l_i = k\} = \frac{2\, d_{\max}^{\,i} - \|x_i - c_k\|}{\sum_{l=1}^{K} \big( 2\, d_{\max}^{\,i} - \|x_i - c_l\| \big)}, \qquad (20)$$
$$d_{\max}^{\,i} = \max_{k} \{ \|x_i - c_k\| \}, \qquad (21)$$
where $i = 1, 2, \ldots, n$ and $k = 1, 2, \ldots, K$. To avoid empty clusters after the mutation operation, $l_i$ is mutated only when the $l_i$-th cluster contains more than two genes.
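A sketch of this mutation operation (our own code; centers are the current cluster centers of the individual, and labels are 0-based):

```python
import numpy as np

def mutate(labels, X, centers, P_m=0.1, seed=None):
    rng = np.random.default_rng(seed)
    labels = labels.copy()
    if rng.random() >= P_m:                              # the whole individual is mutated with probability P_m
        return labels
    sizes = np.bincount(labels, minlength=centers.shape[0])
    for i in range(len(labels)):
        if sizes[labels[i]] <= 2:                        # keep mutation from emptying small clusters
            continue
        d = np.linalg.norm(centers - X[i], axis=1)
        w = 2.0 * d.max() - d                            # Eqs. (20)-(21): weight 2*d_max^i - ||x_i - c_k||
        new = rng.choice(len(d), p=w / w.sum())
        sizes[labels[i]] -= 1
        sizes[new] += 1
        labels[i] = new
    return labels
```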

2.5.5. Three-Step EXK-Means

A three-step EXK-Means is applied for rapid convergence. A given individual $L$ is updated through the following operations: calculate the cluster centers from $L$ using (4); add the exploratory vector and update the cluster centers using (13); reassign each gene to the cluster with the closest cluster center to form a new individual $L$; if the new $L$ contains empty cluster(s) at this point, correct it using the empty-cluster-reassignment procedure of Section 2.4. This process is repeated three times, and the result is an individual $L$ of the next generation.
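A sketch of one such three-step pass, reusing the helper sketches from Sections 2.1, 2.3 and 2.4 (our own naming; not the authors' code):

```python
import numpy as np

def exk_step(X, labels, K, b, alpha=0.3, beta=0.95, steps=3):
    for _ in range(steps):
        centers = cluster_centers(X, labels, K)                       # Eq. (4)
        centers, b = disturb_centers(centers, b, alpha, beta)         # Eqs. (13)-(16)
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)                                  # reassign to the closest disturbed center
        labels, centers = reassign_empty_clusters(X, labels, centers)  # Section 2.4
    return labels, b
```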

2.6. Genetic XK-Means (GXK-Means)

The GXK-Means is briefly described as follows:
  • Initialization: Set the population size $N$, the maximum number of iterations $T$, the mutation probability $P_m$, the number of clusters $K$ and the error tolerance $E_{Tol}$. Let $t = 0$, and choose the initial population $P(0)$ according to Section 2.5.2. In addition, choose the best individual from $P(0)$ and denote it as the super individual $L^*(0)$.
  • Selection: Select a new population from $P(t)$ according to Section 2.5.3, and denote it by $P_1(t)$.
  • Mutation: Mutate each individual in $P_1(t)$ according to Section 2.5.4, and get a new population denoted by $P_2(t)$.
  • XK-Means: Perform XK-Means on $P_2(t)$ three times to get the next-generation population, denoted by $P(t+1)$.
  • Update the super individual: choose the best individual from $P(t+1)$ and compare it with $L^*(t)$ to get $L^*(t+1)$.
  • Stop if either $t = T$ or $ReError \le E_{Tol}$ (see (12)); otherwise, set $t \leftarrow t + 1$ and go to the Selection step.

2.7. GEXK-Means (Genetic EXK-Means)

The process of GEXK-Means proposed in this paper is as follows:
  • Initialization: Set the population size $N$, the maximum number of iterations $T$, the mutation probability $P_m$, the number of clusters $K$ and the error tolerance $E_{Tol}$. Let $t = 0$, and choose the initial population $P(0)$ according to Section 2.5.2. In addition, choose the best individual from $P(0)$ and denote it as the super individual $L^*(0)$.
  • Selection: Select a new population from $P(t)$ according to Section 2.5.3, and denote it by $P_1(t)$.
  • Mutation: Mutate each individual in $P_1(t)$ according to Section 2.5.4, and get a new population denoted by $P_2(t)$.
  • EXK-Means: Perform the three-step EXK-Means on $P_2(t)$ according to Section 2.5.5 to get the next-generation population, denoted by $P(t+1)$.
  • Update the super individual: choose the best individual from $P(t+1)$ and compare it with $L^*(t)$ to get $L^*(t+1)$.
  • Stop if either $t = T$ or $ReError \le E_{Tol}$ (see (12)); otherwise, set $t \leftarrow t + 1$ and go to the Selection step.
Let us explain the functions of the four operations in GEXK-Means: selection, mutation, EXK-Means and updating of the super individual. The selection operation encourages the population to evolve in a good direction. The EXK-Means operation performs a local search for better individuals. The mutation operation guarantees the ergodicity of the evolution process, which in turn guarantees that a globally optimal individual appears during the evolution. Finally, the updating operation of the super individual retains the globally optimal individual once it appears.
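Putting the pieces together, the following outline of the whole GEXK-Means loop is a sketch under our own assumptions (it reuses the helper sketches from Sections 2.1 and 2.5; the defaults b0 and seed, and the use of the MSE as the evaluation function, follow the descriptions above, but the code itself is not the authors' implementation).

```python
import numpy as np

def gexk_means(X, K, N=20, T=150, P_m=0.1, E_tol=1e-3, b0=1.0, seed=0):
    rng = np.random.default_rng(seed)
    n, D = X.shape
    b = np.full(D, b0)                                   # initial exploration range (our choice)

    def mse(L):                                          # Eq. (6), used as the evaluation function
        c = cluster_centers(X, L, K)
        return np.sum((X - c[L]) ** 2) / n

    pop = init_population(N, n, K, rng)                  # Section 2.5.2
    best = min(pop, key=mse)                             # super individual L*(0)
    prev = mse(best)
    for t in range(T):
        fitness = np.array([1.0 / mse(L) for L in pop])
        pop = select_population(pop, fitness, rng)                                       # Section 2.5.3
        pop = np.array([mutate(L, X, cluster_centers(X, L, K), P_m, rng) for L in pop])  # Section 2.5.4
        pop = np.array([exk_step(X, L, K, b.copy())[0] for L in pop])                    # Section 2.5.5
        cand = min(pop, key=mse)
        if mse(cand) < mse(best):
            best = cand                                  # update the super individual
        cur = mse(best)
        if abs((prev - cur) / cur) <= E_tol:             # ReError stopping criterion, Eq. (12)
            break
        prev = cur
    return best
```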

3. Experimental Evaluation and Results

3.1. Data Sets and Parameters

Thirteen data sets, shown in Table 1, are used for evaluating our algorithms. The first five are gene expression data sets, including Sporulation [31], Yeast Cell Cycle [32], Lymphoma [33], and the two UCI data sets Yeast and Ecoli. The other eight are UCI data sets that are not gene expression data sets.
As shown in Table 1, the Sporulation, Yeast Cell Cycle and Lymphoma data sets contain some sample vectors with missing component values. To rectify these defective data, we follow the strategy adopted in [34,35,36]: the sample vectors with more than 20% missing components are removed from the data sets, and for the sample vectors with less than 20% missing components, the missing component values are estimated by the KNN algorithm with parameter k = 15 as in [35], where k is the number of neighboring vectors used to estimate a missing component value (see [34,35,36] for details). We point out that this parameter k is different from the index k used elsewhere in this paper to denote the k-th cluster.
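As an illustration of this preprocessing, here is a minimal sketch (our own code, not the exact procedure of [34,35,36]): it removes vectors with at least 20% missing components and imputes each remaining missing value from the k = 15 nearest vectors.

```python
import numpy as np

def knn_impute(X, max_missing=0.2, k=15):
    """X: (n, D) array with np.nan marking missing components."""
    X = X[np.isnan(X).mean(axis=1) < max_missing]         # drop heavily incomplete vectors
    miss = np.isnan(X)
    filled = np.where(miss, np.nanmean(X, axis=0), X)     # provisional column-mean fill for distance computation
    out = X.copy()
    for i, j in zip(*np.nonzero(miss)):
        d = np.linalg.norm(filled - filled[i], axis=1)
        neighbours = np.argsort(d)[1:k + 1]               # k nearest vectors (excluding the vector itself)
        out[i, j] = filled[neighbours, j].mean()          # impute from the neighbours' j-th components
    return out
```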
The values of the parameters used in the computation are set as follows:
  • Population size $N = 20$ (cf. Section 2.6 and Section 2.7)
  • Mutation probability $P_m = 0.1$ (cf. Section 2.6 and Section 2.7)
  • Error tolerance $E_{Tol} = 0.001$ (cf. Section 2.2, Section 2.6 and Section 2.7)
  • $\alpha = 0.3$ (cf. (14), (16))
  • $\beta = 0.95$ (cf. (14), (15))
  • $T = 150$ (cf. Section 2.6 and Section 2.7)
In the experiments, we use two different computers: M1 (Intel (R), Core (TM) i3-8100 CPU and 4 GB RAM, Santa Clara, CA, USA) and M2 (Intel (R), Core (TM) i5-7400 CPU and 8 GB RAM). The software Matlab (Matlab 2017b, Math Works, Natick, MA, USA) is used to implement the clustering algorithms.

3.2. Experimental Results and Discussion

We divide this subsection into three parts. The first part concerns the performance of the algorithms in terms of MSE, S, DB and XB. The second part examines the significance of the differences between the algorithms using the Nemenyi test. The third part presents the computational times of the algorithms. We focus mainly on the comparisons EXK-Means vs. XK-Means and GEXK-Means vs. GXK-Means, so as to show the benefit of the introduction of our empty-cluster-reassignment technique.

3.2.1. MSE, S, DB and XB Performances

Each of the five algorithms was run fifty times on each of the thirteen data sets. The averages over the fifty trials for the four evaluation criteria (MSE, S, DB and XB) are listed in Table 2 and Table 3, devoted to the five gene expression data sets and the other eight UCI data sets, respectively.
From Table 2 and Table 3, we see that our GEXK-Means achieves the highest S, and the lowest MSE, DB and XB for all the thirteen data sets. Therefore, GEXK-Means performs better than the other four algorithms.
We also observe that the overall performance of our EXK-Means is a bit better than that of XK-Means: EXK-Means is better than XK-Means in terms of all four clustering criteria (MSE, S, XB and DB) for three of the thirteen data sets; it is better in terms of three criteria for three data sets, in terms of two criteria for four data sets, and in terms of one criterion for three data sets. This means that EXK-Means performs better than XK-Means in nearly two thirds of the cases (of the total 13 × 4 = 52 cases, EXK-Means is better than XK-Means in 3 × 4 + 3 × 3 + 2 × 4 + 1 × 3 = 32 cases). The better case of each comparison is marked in boldface in Table 2 and Table 3.
To see the overall performance more clearly, Figure 1, Figure 2, Figure 3 and Figure 4 present the average performance of the five algorithms over the thirteen data sets for the MSE, DB, XB and S evaluations, respectively. These figures clearly show that, in the sense of average performance, the proposed GEXK-Means outperforms the other four algorithms, and EXK-Means outperforms K-Means and XK-Means.
As an example of what happens during the iterations, a typical iteration process on the Yeast Cell Cycle data set is shown in Figure 5, Figure 6, Figure 7 and Figure 8, presenting the MSE, DB, XB and S curves, respectively, for the five algorithms.

3.2.2. Nemenyi Test

Table 4 shows the results of the Nemenyi test on the MSE, S, DB and XB indexes. We use the threshold value 0.05 for the significance evaluation. For the DB index, EXK-Means shows a significant difference compared with XK-Means, while for the other three indexes it does not. For all four indexes, GEXK-Means shows a significant difference compared with GXK-Means.

3.2.3. Computational Time

Table 5 gives the average computational times over the fifty runs for each data set. It shows that the computational times of EXK-Means are a little longer than those of K-Means and XK-Means, and the computational times of GEXK-Means are a little longer than those of GXK-Means. This indicates that the introduction of our empty-cluster-reassignment technique increases the computational time. However, our algorithms remain preferable when a modest increase in computational time is acceptable and accuracy is the main concern.

4. Convergence

In this section, the convergence properties of GEXK-Means are analyzed. Clearly, there exist $m = K^n$ possible solutions when classifying $n$ genes into $K$ clusters. As mentioned in Section 2.5, every possible solution can be denoted by a label vector $L$. Therefore, the number of all possible individuals is $m$. Let $L^*$ be the set of globally optimal individuals with maximum fitness value.
For an individual $L$ that is to be mutated, according to Equation (20), we have
$$P_{ik} = P\{l_i = k\} = \frac{2\, d_{\max}^{\,i} - \|x_i - c_k\|}{\sum_{l=1}^{K} \big( 2\, d_{\max}^{\,i} - \|x_i - c_l\| \big)}, \qquad (20)$$
where $i = 1, 2, \ldots, n$, $k = 1, 2, \ldots, K$, and $d_{\max}^{\,i} \ge \|x_i - c_k\| \ge 0$. Therefore, $2\, d_{\max}^{\,i} - \|x_i - c_k\| > 0$, and $P_{ik} > 0$ for all $i = 1, 2, \ldots, n$ and $k = 1, 2, \ldots, K$. We note that the numbers of genes and of clusters are finite. Therefore, the $P_{ik}$'s have a positive lower bound, denoted by $M > 0$. This means that every gene can be mutated into any cluster with positive probability. In particular, $L$ can be mutated into any other individual with positive probability. Recall that $P(t) = \{L_1(t), L_2(t), \ldots, L_N(t)\}$ is the population at step $t$. Let $P_{L_j(t) \to L^*}$ stand for the probability that $L_j(t)$ is mutated into one of the globally optimal individuals. Then
$$P_{L_j(t) \to L^*} > M^n, \quad j = 1, 2, \ldots, N. \qquad (22)$$
Let $P_{Mutation}$ stand for the probability of generating an optimal individual in $P_2(t)$ by the mutation operation. Then,
$$P_{Mutation} = \sum_{j=1}^{N} P_m\, P(L_j(t))\, P_{L_j(t) \to L^*} > P_m \sum_{j=1}^{N} P(L_j(t))\, M^n = P_m\, M^n > 0, \qquad (23)$$
where $P_m > 0$ is the mutation probability, $P(L_j(t)) > 0$ is the selection probability defined by Equation (18), and $\sum_{j=1}^{N} P(L_j(t)) = 1$.
Theorem 1.
When the GEXK-Means defined in Section 2.7 is applied for the classification of a given data set, the global optimal classification result for the data set will appear and will be caught with probability 1 in an infinite evolution iteration process of the GEXK-Means.
Proof. 
Along the evolution process, the updating operation of the super individual keeps a super individual, denoted by $L^*(t)$, for every generation $t = 0, 1, 2, \ldots$. According to (22) and (23), at each generation $L^*(t)$ becomes a globally optimal individual with positive probability. According to the Small Probability Event Principle [37,38], a globally optimal individual will appear in the super-individual sequence with probability 1 as the evolution iteration process goes to infinity. This proves the global convergence of GEXK-Means. □
We remark that the global convergence stated above is of a theoretical and probabilistic nature. It does not guarantee that convergence to a global optimum can be reached in a finite number of GEXK-Means iterations.

5. Conclusions

XK-Means (eXploratory K-Means) is a popular data clustering algorithm. However, empty clusters may appear during the iteration of XK-Means, which violates the condition that the number of clusters should be K and causes damage to the efficiency of the algorithm. As a remedy, we define an empty-cluster-reassignment technique to modify XK-Means when empty clusters appear, resulting in an EXK-Means clustering algorithm. Furthermore, we combine the EXK-Means with genetic mechanism to form a GEXK-Means clustering algorithm.
Numerical simulations are carried out on the comparison of K-Means, XK-Means, EXK-Means and GXK-Means (genetic XK-Means) and GEXK-Means. The evaluation tools include the mean squared error (MSE), the Xie–Beni index (XB), the Davies–Bouldin index (DB) and the separation index (S). The Nemenyi Test for multiple comparisons is also done on MSE, S, DB and XB, respectively. Thirteen real world data sets are used for the simulation. The running times of these algorithms are also considered.
The conclusions we draw from the simulation results are as follows: first, the overall performances of EXK-Means in terms of the four indexes outperform those of XK-Means, and the overall performances of GEXK-Means outperform those of GXK-Means. This shows the effectiveness of the introduction of the empty-cluster-reassignment technique. Secondly, if we take the threshold value as 0.05 for the Nemenyi Test, then GEXK-Means shows a significant difference compared with GXK-Means for all four of the indexes. However, EXK-Means shows a significant difference compared with XK-Means only for the DB index. Thirdly, our EXK-Means and GEXK-Means take a little bit more computational time than XK-Means and GXK-Means, respectively.
The following global convergence of the GEXK-Means is also theoretically proved: the global optimum will appear and will be caught in the evolution process of GEXK-Means with probability 1.

Author Contributions

C.H. developed the mathematical model, carried out the numerical simulations and wrote the manuscript; W.W. advised on developing the learning algorithms and supervised the work; F.L. contributed to the theoretical analysis; C.Z. and J.Y. helped in the numerical simulations.

Funding

This research was supported by the National Science Foundation of China (No: 61473059), and the Fundamental Research Funds for the Central Universities of China.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

References

  1. Steinhaus, H. Sur la division des corp materiels en parties. Bull. Acad. Polon. Sci. 1956, 3, 801–804.
  2. Macqueen, J. Some Methods for Classification and Analysis of MultiVariate Observations. In Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability, Los Angeles, CA, USA, 21 June–18 July 1967.
  3. Wlodarczyk-Sielicka, M. Importance of Neighborhood Parameters During Clustering of Bathymetric Data Using Neural Network. In Proceedings of the 22nd International Conference, Duruskininkai, Lithuania, 13–15 October 2016.
  4. Du, Z.; Wang, Y. PK-Means: A New Algorithm for Gene Clustering. Comput. Biol. Chem. 2008, 32, 243–247.
  5. Lin, F.; Du, Z. A Novel Parallelization Approach for Hierarchical Clustering. Parallel Comput. 2005, 31, 523–527.
  6. Santhanam, T.; Padmavathi, M.S. Application of K-Means and Genetic Algorithms for Dimension Reduction by Integrating SVM for Diabetes Diagnosis. Procedia Comput. Sci. 2015, 47, 76–83.
  7. Deep, K.; Thakur, M. A New Mutation Operator for Real Coded Genetic Algorithms. Appl. Math. Comput. 2007, 193, 211–230.
  8. Ming, L.; Wang, Y. On Convergence Rate of a Class of Genetic Algorithms. In Proceedings of the World Automation Congress, Budapest, Hungary, 24–26 July 2006.
  9. Maulik, U. Genetic Algorithm Based Clustering Technique. Pattern Recognit. 2000, 33, 1455–1465.
  10. Jones, D.R.; Beltramo, M.A. Solving Partitioning Problems with Genetic Algorithms. In Proceedings of the 4th International Conference on Genetic Algorithms, San Diego, CA, USA, 13–16 July 1991.
  11. Zheng, Y.; Jia, L.; Cao, H. Multi-Objective Gene Expression Programming for Clustering. Inf. Technol. Control 2012, 41, 283–294.
  12. Xie, X.L.; Beni, G. A Validity Measure for Fuzzy Clustering. IEEE Trans. Pattern Anal. Mach. Intell. 1991, 13, 841–847.
  13. Liu, Y.G. Automatic Clustering Using Genetic Algorithms. Appl. Math. Comput. 2011, 218, 1267–1279.
  14. Krishna, K.; Murty, M.N. Genetic K-Means Algorithm. IEEE Trans. Syst. Man Cybern. 1999, 29, 433–439.
  15. Bouhmala, N.; Viken, A. Enhanced Genetic Algorithm with K-Means for the Clustering Problem. Int. J. Model. Optim. 2015, 5, 150–154.
  16. Sheng, W.G.; Tucker, A. Clustering with Niching Genetic K-means Algorithm. In Proceedings of the 6th Annual Genetic and Evolutionary Computation Conference (GECCO 2004), Seattle, WA, USA, 26–30 June 2004.
  17. Zhou, X.B.; Gu, J.G. An Automatic K-Means Clustering Algorithm of GPS Data Combining a Novel Niche Genetic Algorithm with Noise and Density. ISPRS Int. J. Geo-Inf. 2017, 6, 392.
  18. Islam, M.Z.; Estivill-Castro, V.; Rahman, M.A.; Bossomaier, T. Combining K-Means and a Genetic Algorithm through a Novel Arrangement of Genetic Operators for High Quality Clustering. Expert Syst. Appl. 2018, 91, 402–417.
  19. Michael, L.; Sumitra, M. A Genetic Algorithm that Exchanges Neighboring Centers for K-Means Clustering. Pattern Recognit. Lett. 2007, 28, 2359–2366.
  20. Ishibuchi, H.; Yamamoto, T. Fuzzy Rule Selection by Multi-objective Genetic Local Search Algorithms and Rule Evaluation Measures in Data Mining. Fuzzy Sets Syst. 2004, 141, 59–88.
  21. Zubova, J.; Kurasova, O. Dimensionality Reduction Methods: The Comparison of Speed and Accuracy. Inf. Technol. Control 2018, 47, 151–160.
  22. Wozniak, M.; Polap, D. Object Detection and Recognition via Clustered Features. Neurocomputing 2018, 320, 76–84.
  23. Anusha, M.; Sathiaseelan, G.R. Feature Selection Using K-Means Genetic Algorithm for Multi-objective Optimization. Procedia Comput. Sci. 2015, 57, 1074–1080.
  24. Bezdek, J.C.; Ehrlich, R. FCM: The Fuzzy C-Means Clustering Algorithm. Comput. Geosci. 1984, 10, 191–203.
  25. Indrajit, S.; Ujjwal, M. A New Multi-objective Technique for Differential Fuzzy Clustering. Appl. Soft Comput. 2011, 11, 2765–2776.
  26. Lam, Y.K.; Tsang, P.W.M. eXploratory K-Means: A New Simple and Efficient Algorithm for Gene Clustering. Appl. Soft Comput. 2012, 12, 1149–1157.
  27. Davies, D.L.; Bouldin, D.W. A Cluster Separation Measure. IEEE Trans. Pattern Anal. Mach. Intell. 1979, PAMI-1, 224–227.
  28. Liu, Y.W.; Chen, W.H. A SAS Macro for Testing Differences among Three or More Independent Groups Using Kruskal-Wallis and Nemenyi Tests. J. Huazhong Univ. Sci. Tech.-Med. 2012, 32, 130–134.
  29. Nemenyi, P. Distribution-Free Multiple Comparisons. Ph.D. Thesis, Princeton University, Princeton, NJ, USA, 1963.
  30. Fan, Y.; Hao, Z.O. Applied Statistics Analysis Using SPSS, 1st ed.; China Water Conservancy and Hydroelectricity Publishing House: Beijing, China, 2003; pp. 138–152.
  31. Chu, S.; DeRisi, J. The Transcriptional Program of Sporulation in Budding Yeast. Science 1998, 282, 699–705.
  32. Spellman, P.T. Comprehensive Identification of Cell Cycle-regulated Genes of the Yeast Saccharomyces Cerevisiae by Microarray Hybridization. Mol. Biol. Cell 1998, 9, 3273–3297.
  33. Alizadeh, A.A.; Eisen, M.B. Distinct Types of Diffuse Large B-cell Lymphoma Identified by Gene Expression Profiling. Nature 2000, 403, 503–511.
  34. Yoon, D.; Lee, E.K. Robust Imputation Method for Missing Values in Microarray Data. BMC Bioinform. 2007, 8, 6–12.
  35. Troyanskaya, O.; Cantor, M. Missing Value Estimation Methods for DNA Microarrays. Bioinformatics 2001, 17, 520–525.
  36. Corso, D.E.; Cerquitelli, T. METATECH: Meteorological Data Analysis for Thermal Energy CHaracterization by Means of Self-Learning Transparent Models. Energies 2018, 11, 1336.
  37. Liu, G.G.; Zhuang, Z.; Guo, W.Z. A novel particle swarm optimizer with multi-stage transformation and genetic operation for VLSI routing. Energies 2018, 11, 1336.
  38. Rudolph, G. Convergence Analysis of Canonical Genetic Algorithms. IEEE Trans. Neural Netw. 1994, 5, 96–101.
Figure 1. Average MSE over the thirteen data sets (lower is better).
Figure 2. Average DB over the thirteen data sets (lower is better).
Figure 3. Average XB over the thirteen data sets (lower is better).
Figure 4. Average S over the thirteen data sets (higher is better).
Figure 5. MSE curves of the algorithms for the Yeast Cell Cycle data set (lower is better).
Figure 6. DB curves of the algorithms for the Yeast Cell Cycle data set (lower is better).
Figure 7. XB curves of the algorithms for the Yeast Cell Cycle data set (lower is better).
Figure 8. S curves of the algorithms for the Yeast Cell Cycle data set (higher is better).
Table 1. Data sets used in experiments.

Data Set | No. of Vectors n | No. of Vectors with Missing Components <20% | No. of Vectors with Missing Components ≥20% | No. of Attributes D | No. of Classes K
Sporulation | 6023 | 413 | 198 | 7 | 16
Yeast Cell Cycle | 6078 | 5498 | 680 | 77 | 256
Lymphoma | 4022 | 3166 | 3 | 96 | 150
Yeast | 1484 | 0 | 0 | 8 | 10
Ecoli | 336 | 0 | 0 | 8 | 7
Dermatology | 366 | 8 | 0 | 34 | 6
Glass Identification | 214 | 0 | 0 | 10 | 7
Image Segmentation | 2310 | 0 | 0 | 20 | 7
Wine Quality | 4898 | 0 | 0 | 12 | 7
Wireless Indoor Localization | 2000 | 0 | 0 | 7 | 4
Statlog Vehicle | 946 | 0 | 0 | 18 | 4
Page Blocks Classification | 5473 | 0 | 0 | 10 | 6
Wine | 178 | 0 | 0 | 13 | 3
Table 2. Average MSE, S, DB and XB on the gene data sets.

Data Set | Algorithm | MSE (lower is better) | S (higher is better) | DB (lower is better) | XB (lower is better)
Yeast Cell Cycle | K-Means | 2.3092 | 3.1637 | 2.6990 | 5.0987 × 10⁻⁴
Yeast Cell Cycle | XK-Means | 2.2753 | 3.1773 | 2.5595 | 4.9905 × 10⁻⁴
Yeast Cell Cycle | EXK-Means | 2.2728 | 3.2170 | 2.0686 | 4.7811 × 10⁻⁴
Yeast Cell Cycle | GXK-Means | 2.2623 | 3.2560 | 2.4612 | 3.8156 × 10⁻⁴
Yeast Cell Cycle | GEXK-Means | 2.2572 | 3.2625 | 1.8070 | 3.6357 × 10⁻⁴
Sporulation | K-Means | 0.8959 | 2.7612 | 1.5781 | 3.0905 × 10⁻⁴
Sporulation | XK-Means | 0.8968 | 2.7556 | 1.5261 | 3.0761 × 10⁻⁴
Sporulation | EXK-Means | 0.8987 | 2.7315 | 1.6226 | 2.8585 × 10⁻⁴
Sporulation | GXK-Means | 0.8961 | 2.7510 | 1.5207 | 1.7465 × 10⁻⁴
Sporulation | GEXK-Means | 0.8951 | 2.7674 | 1.3285 | 1.5321 × 10⁻⁴
Lymphoma | K-Means | 4.8762 | 7.1725 | 2.8532 | 7.4561 × 10⁻⁴
Lymphoma | XK-Means | 4.7683 | 7.2998 | 2.7294 | 7.1749 × 10⁻⁴
Lymphoma | EXK-Means | 4.7764 | 7.2890 | 2.5627 | 5.7017 × 10⁻⁴
Lymphoma | GXK-Means | 4.7520 | 7.3296 | 2.3285 | 5.7929 × 10⁻⁴
Lymphoma | GEXK-Means | 4.7244 | 7.3787 | 2.1489 | 5.0597 × 10⁻⁴
Yeast | K-Means | 0.1613 | 0.3106 | 1.6854 | 9.6311 × 10⁻⁴
Yeast | XK-Means | 0.1586 | 0.3192 | 1.3592 | 8.6864 × 10⁻⁴
Yeast | EXK-Means | 0.1607 | 0.3234 | 1.3701 | 9.0603 × 10⁻⁴
Yeast | GXK-Means | 0.1581 | 0.3194 | 1.2621 | 6.3508 × 10⁻⁴
Yeast | GEXK-Means | 0.1560 | 0.3296 | 0.9154 | 4.2514 × 10⁻⁴
Ecoli | K-Means | 0.2914 | 3.3151 | 1.4452 | 7.6 × 10⁻³
Ecoli | XK-Means | 0.2880 | 3.2775 | 1.0973 | 5.9 × 10⁻³
Ecoli | EXK-Means | 0.2382 | 3.7191 | 0.7127 | 3.3 × 10⁻³
Ecoli | GXK-Means | 0.2321 | 3.4022 | 0.3364 | 3.1 × 10⁻³
Ecoli | GEXK-Means | 0.2268 | 3.7791 | 0.3021 | 1.1 × 10⁻³
Table 3. Average MSE, S, DB and XB on the non-gene data sets.

Data Set | Algorithm | MSE (lower is better) | S (higher is better) | DB (lower is better) | XB (lower is better)
Glass Identification | K-Means | 1.2886 | 3.8306 | 2.0521 | 8.7987 × 10⁻³
Glass Identification | XK-Means | 1.1556 | 4.1060 | 1.3489 | 7.7299 × 10⁻³
Glass Identification | EXK-Means | 1.2113 | 4.1161 | 1.4594 | 8.3280 × 10⁻³
Glass Identification | GXK-Means | 1.1268 | 4.1218 | 0.9689 | 3.7220 × 10⁻³
Glass Identification | GEXK-Means | 1.0250 | 4.1293 | 0.7230 | 1.5982 × 10⁻³
Image Segmentation | K-Means | 63.8058 | 168.3603 | 1.4437 | 3.7125 × 10⁻⁴
Image Segmentation | XK-Means | 66.8793 | 169.0878 | 1.3625 | 4.7787 × 10⁻⁴
Image Segmentation | EXK-Means | 59.8254 | 169.7355 | 1.2243 | 4.7210 × 10⁻⁴
Image Segmentation | GXK-Means | 59.7865 | 186.6527 | 1.0697 | 2.9622 × 10⁻⁴
Image Segmentation | GEXK-Means | 59.5037 | 187.7843 | 1.0269 | 2.4321 × 10⁻⁴
Page Blocks Classification | K-Means | 645.1506 | 5.6031 × 10³ | 1.3881 | 6.8264 × 10⁻⁴
Page Blocks Classification | XK-Means | 643.3151 | 5.6031 × 10³ | 1.6223 | 7.2708 × 10⁻⁴
Page Blocks Classification | EXK-Means | 640.8521 | 5.6301 × 10³ | 1.1316 | 6.7756 × 10⁻⁴
Page Blocks Classification | GXK-Means | 605.8574 | 5.6896 × 10³ | 0.8693 | 1.6157 × 10⁻⁴
Page Blocks Classification | GEXK-Means | 601.7767 | 5.6964 × 10³ | 0.7865 | 1.0283 × 10⁻⁴
Wireless Indoor Localization | K-Means | 10.2066 | 28.9495 | 1.6324 | 4.4231 × 10⁻⁴
Wireless Indoor Localization | XK-Means | 10.1989 | 28.9495 | 1.7512 | 3.4610 × 10⁻⁴
Wireless Indoor Localization | EXK-Means | 10.1962 | 28.9490 | 1.14556 | 5.521 × 10⁻⁴
Wireless Indoor Localization | GXK-Means | 10.1849 | 28.9210 | 0.9275 | 2.3561 × 10⁻⁴
Wireless Indoor Localization | GEXK-Means | 10.0854 | 28.9840 | 0.8816 | 2.2878 × 10⁻⁴
Dermatology | K-Means | 5.7441 | 20.8749 | 1.4462 | 3.3 × 10⁻³
Dermatology | XK-Means | 5.8551 | 20.6550 | 1.2770 | 5.7 × 10⁻³
Dermatology | EXK-Means | 5.7397 | 20.7767 | 1.2425 | 2.9 × 10⁻³
Dermatology | GXK-Means | 5.7420 | 20.7639 | 1.2201 | 2.2 × 10⁻³
Dermatology | GEXK-Means | 5.7252 | 20.9325 | 0.8816 | 1.6 × 10⁻³
Statlog (Vehicle Silhouettes) | K-Means | 53.8433 | 271.5335 | 0.8871 | 7.005 × 10⁻⁴
Statlog (Vehicle Silhouettes) | XK-Means | 53.8433 | 271.5335 | 1.2066 | 8.8111 × 10⁻⁴
Statlog (Vehicle Silhouettes) | EXK-Means | 53.6535 | 269.1440 | 1.0323 | 6.5618 × 10⁻⁴
Statlog (Vehicle Silhouettes) | GXK-Means | 53.5880 | 270.6111 | 0.6289 | 6.3654 × 10⁻⁴
Statlog (Vehicle Silhouettes) | GEXK-Means | 53.4423 | 301.5472 | 0.4893 | 5.6564 × 10⁻⁴
Wine Quality | K-Means | 14.2767 | 58.1880 | 0.9625 | 2.1481 × 10⁻⁴
Wine Quality | XK-Means | 14.2021 | 58.2228 | 0.9658 | 2.1777 × 10⁻⁴
Wine Quality | EXK-Means | 14.2090 | 58.5258 | 0.9420 | 2.5431 × 10⁻⁴
Wine Quality | GXK-Means | 14.1382 | 58.5540 | 0.8499 | 2.3418 × 10⁻⁴
Wine Quality | GEXK-Means | 14.1032 | 58.6269 | 0.6973 | 1.7011 × 10⁻⁴
Wine | K-Means | 93.0094 | 470.2573 | 0.8236 | 4.1 × 10⁻³
Wine | XK-Means | 93.2120 | 470.2573 | 0.5436 | 2.6 × 10⁻³
Wine | EXK-Means | 93.0092 | 469.4700 | 0.6360 | 2.2 × 10⁻³
Wine | GXK-Means | 92.9745 | 470.3015 | 0.5275 | 1.8 × 10⁻³
Wine | GEXK-Means | 92.8682 | 504.1908 | 0.3211 | 1.5 × 10⁻³
Table 4. Nemenyi Test for multiple comparisons.

Groups for Comparison | Evaluation Technique | Pr
XK-Means vs. EXK-Means | MSE | 0.8994
XK-Means vs. EXK-Means | S | 0.6691
XK-Means vs. EXK-Means | DB | 0.0421
XK-Means vs. EXK-Means | XB | 0.4539
GXK-Means vs. GEXK-Means | MSE | 0.0492
GXK-Means vs. GEXK-Means | S | 0.0412
GXK-Means vs. GEXK-Means | DB | 0.0409
GXK-Means vs. GEXK-Means | XB | 0.0403
Table 5. Average running times (in seconds).

Data Set | Evaluation Technique | K-Means | XK-Means | EXK-Means | GXK-Means | GEXK-Means | Machine
Sporulation | MSE | 11.368 | 11.689 | 12.56 | 289.656 | 688.552 | M2
Sporulation | S | 11.464 | 11.792 | 12.61 | 256.782 | 689.457 | M2
Sporulation | DB | 10.424 | 10.569 | 10.480 | 239.529 | 662.183 | M2
Sporulation | XB | 9.632 | 9.881 | 9.878 | 269.364 | 396.33 | M2
Yeast Cell Cycle | MSE | 58.65 | 60.213 | 64.524 | 867.560 | 5432.23 | M2
Yeast Cell Cycle | S | 58.87 | 60.351 | 65.114 | 789.108 | 5433.69 | M2
Yeast Cell Cycle | DB | 69.425 | 72.37 | 73.064 | 965.682 | 5654.75 | M2
Yeast Cell Cycle | XB | 52.584 | 56.321 | 59.664 | 754.538 | 5014.56 | M2
Lymphoma | MSE | 69.425 | 70.43 | 74.332 | 4962.224 | 5321.6 | M2
Lymphoma | S | 69.661 | 71.34 | 74.994 | 4987.300 | 5324.1 | M2
Lymphoma | DB | 72.135 | 77.483 | 74.286 | 4753.626 | 5332.42 | M2
Lymphoma | XB | 62.526 | 66.36 | 68.665 | 4665.186 | 5026.559 | M2
Glass Identification | MSE | 0.3675 | 0.368 | 0.2975 | 19.566 | 20.824 | M1
Glass Identification | S | 0.3678 | 0.371 | 0.2980 | 19.960 | 20.885 | M1
Glass Identification | DB | 0.505 | 0.535 | 0.536 | 19.325 | 27.687 | M1
Glass Identification | XB | 0.267 | 0.303 | 0.304 | 16.263 | 19.806 | M1
Image Segmentation | MSE | 4.552 | 4.230 | 4.336 | 298.128 | 361.755 | M1
Image Segmentation | S | 4.578 | 4.257 | 4.402 | 296.867 | 363.455 | M1
Image Segmentation | DB | 4.564 | 4.663 | 4.718 | 286.692 | 382.924 | M1
Image Segmentation | XB | 4.565 | 4.131 | 4.183 | 263.960 | 332.092 | M1
Page Blocks Classification | MSE | 4.039 | 4.078 | 4.122 | 382.663 | 395.273 | M1
Page Blocks Classification | S | 4.165 | 4.216 | 4.269 | 378.632 | 397.421 | M1
Page Blocks Classification | DB | 7.859 | 7.649 | 7.947 | 376.657 | 520.118 | M1
Page Blocks Classification | XB | 4.596 | 4.622 | 4.264 | 296.186 | 390.57 | M1
Yeast | MSE | 1.810 | 1.964 | 1.870 | 159.200 | 165.08 | M1
Yeast | S | 1.914 | 1.998 | 1.978 | 161.230 | 167.1 | M1
Yeast | DB | 3.630 | 3.784 | 3.800 | 163.620 | 241.346 | M1
Yeast | XB | 1.718 | 1.817 | 1.847 | 148.960 | 161.346 | M1
Wireless Indoor Localization | MSE | 1.428 | 1.430 | 1.433 | 119.630 | 122.278 | M1
Wireless Indoor Localization | S | 1.473 | 1.459 | 1.62 | 120.775 | 124.512 | M1
Wireless Indoor Localization | DB | 2.843 | 2.760 | 2.840 | 124.360 | 176.689 | M1
Wireless Indoor Localization | XB | 1.436 | 1.445 | 1.418 | 98.641 | 120.628 | M1
Ecoli | MSE | 0.408 | 0.437 | 0.437 | 29.611 | 30.494 | M1
Ecoli | S | 0.409 | 0.441 | 0.439 | 29.845 | 30.569 | M1
Ecoli | DB | 0.736 | 0.858 | 0.882 | 28.010 | 43.321 | M1
Ecoli | XB | 0.399 | 0.447 | 0.490 | 29.380 | 31.22 | M1
Dermatology | MSE | 0.634 | 0.635 | 0.710 | 23.450 | 24.829 | M1
Dermatology | S | 0.639 | 0.651 | 0.786 | 24.687 | 25.854 | M1
Dermatology | DB | 0.78 | 0.832 | 0.857 | 24.312 | 26.758 | M1
Dermatology | XB | 0.406 | 0.542 | 0.516 | 23.720 | 26.811 | M1
Statlog (Vehicle Silhouettes) | MSE | 0.705 | 0.726 | 0.768 | 53.856 | 57.061 | M1
Statlog (Vehicle Silhouettes) | S | 0.712 | 0.732 | 0.798 | 55.289 | 58.067 | M1
Statlog (Vehicle Silhouettes) | DB | 1.298 | 1.303 | 1.351 | 58.450 | 77.966 | M1
Statlog (Vehicle Silhouettes) | XB | 0.590 | 0.634 | 0.642 | 43.892 | 51.137 | M1
Wine Quality | MSE | 4.538 | 4.658 | 4.758 | 386.680 | 444.963 | M1
Wine Quality | S | 4.563 | 4.713 | 4.799 | 379.668 | 446.921 | M1
Wine Quality | DB | 8.435 | 8.512 | 8.571 | 412.654 | 565.498 | M1
Wine Quality | XB | 4.463 | 4.499 | 4.635 | 356.215 | 421.432 | M1
Wine | MSE | 0.159 | 0.182 | 0.179 | 10.226 | 11.133 | M1
Wine | S | 0.161 | 0.188 | 0.182 | 10.129 | 11.186 | M1
Wine | DB | 0.267 | 0.304 | 0.309 | 11.636 | 13.852 | M1
Wine | XB | 0.152 | 0.172 | 0.174 | 8.385 | 10.130 | M1
