Algorithms
  • Article
  • Open Access

17 December 2023

Improving Clustering Accuracy of K-Means and Random Swap by an Evolutionary Technique Based on Careful Seeding

1 Engineering Department of Informatics, Modelling, Electronics and Systems Science, University of Calabria, 87036 Rende, Italy
2 CNR—National Research Council of Italy, Institute for High Performance Computing and Networking (ICAR), 87036 Rende, Italy
* Author to whom correspondence should be addressed.
This article belongs to the Collection Feature Paper in Metaheuristic Algorithms and Applications

Abstract

K-Means is a “de facto” standard clustering algorithm due to its simplicity and efficiency. K-Means, though, strongly depends on the initialization of the centroids (seeding method) and often gets stuck in a local sub-optimal solution. In fact, K-Means mainly acts as a local refiner of the centroids and is unable to move centroids across the whole data space. Random Swap was designed to go beyond K-Means: it integrates K-Means into a global strategy of centroid management, which can often generate a clustering solution close to the global optimum. This paper proposes an approach which extends both K-Means and Random Swap and improves the clustering accuracy through an evolutionary technique and careful seeding. Two new algorithms are proposed: Population-Based K-Means (PB-KM) and Population-Based Random Swap (PB-RS). Both algorithms consist of two steps: first, a population of J candidate solutions is built, and then the candidate centroids are repeatedly recombined toward a final accurate solution. The paper motivates the design of PB-KM and PB-RS, outlines their current implementation in Java based on parallel streams, and demonstrates the achievable clustering accuracy using both synthetic and real-world datasets.

1. Introduction

Clustering is a fundamental machine learning [1] approach for extracting useful information from the data of such application domains as pattern recognition, image segmentation, text analysis, medicine, bioinformatics, and Artificial Intelligence. K-Means [2,3,4] is a classical clustering algorithm often used due to its simplicity and efficiency.
The aim of K-Means is to partition N data points X = {x_1, …, x_N}, with x_i ∈ R^D, into K clusters, with 2 ≤ K ≤ N, by ensuring that points belonging to the same cluster are similar to one another, while points in different clusters are dissimilar. Similarity is usually expressed by the Euclidean distance between data points. Every cluster is represented by its central point, or centroid. K-Means aims to optimize (minimize) the Sum-of-Squared Errors (SSE) cost, a sort of internal variance (or distortion) of the clusters.
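For reference, the SSE objective minimized by K-Means can be written explicitly as follows (a standard formulation consistent with the definitions above, not quoted from the article): given a centroid configuration C = {c_1, …, c_K},

SSE(C) = Σ_{i=1}^{N} min_{1≤k≤K} ‖x_i − c_k‖²,

that is, every point contributes the squared Euclidean distance to its nearest centroid.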
Recently, K-Means properties have been studied in depth [5,6,7]. A crucial aspect is the procedure which initializes the centroids (seeding method). In fact, the accuracy of a clustering solution strongly depends on the initial centroids. A basic limitation of K-Means is its local strategy for managing centroids, which often causes the algorithm to get stuck in a sub-optimal solution. Random Swap [8,9] and genetic/evolutionary approaches [10,11,12] are examples of more sophisticated clustering algorithms that try to remedy this situation by adopting a global strategy of centroid management. At each iteration of Random Swap, a centroid is randomly selected and replaced by a randomly chosen data point of the dataset. The SSE of the new configuration is then compared to that of the previous centroid configuration and, if it has diminished, the new configuration becomes current for the next iteration. The algorithm can be iterated up to a maximum number of times. Random Swap has been demonstrated, in many cases, to reach a solution close to the optimal one, although possibly at the price of an increased computational time.
The contribution of this paper is the development of two new clustering algorithms which extend K-Means and Random Swap through an evolutionary technique [10,13] and careful seeding. The two algorithms are Population-Based K-Means (PB-KM) and Population-Based Random Swap (PB-RS). They borrow ideas from the evolutionary algorithms underlying GA-K-Means [10] and Recombinator-K-Means [11,12] and consist of two steps. In the first step, a population of candidate centroid solutions is built by executing Lloyd’s K-Means or Random Swap J times, together with the Greedy-K-Means++ (GKM++) [11,12,14] seeding method, which is effective for producing a careful configuration of centroids with a reduced SSE cost. In the second step, PB-KM and PB-RS start from a configuration of centroids extracted by applying GKM++ to the candidate centroids of the population. Then, they repeatedly recombine centroids until a satisfactory solution that minimizes the SSE cost emerges.
The greater the number of repetitions in the second step (independent restarts for PB-KM, swap iterations for PB-RS), the higher the possibility of obtaining a combined solution near the best one.
Regarding reliability and accuracy, PB-KM significantly improves upon classical repeated Lloyd’s K-Means on datasets with globular, spherical, and Gaussian-shaped clusters [5,6]. PB-RS is better suited for studying general datasets with an irregular distribution of points. A common issue of PB-KM and PB-RS, though, is the assumption that good clustering follows from minimizing the SSE cost. Unfortunately, this is not true for some challenging datasets [9], which can only be approximated through the proposed and similar tools.
To cope with large datasets, PB-KM and PB-RS systematically use parallel computing. Currently, the two algorithms are developed in Java using lambda expressions and parallel streams [15,16]. This way, it is possible to exploit today’s multi/many-core machines transparently.
This paper extends the preliminary paper [17] presented at the Simultech 2023 conference, where the basic idea of PB-KM was introduced. Differences from the conference paper are indicated in the following.
  • A more complete description of the evolutionary approach, which is the basis of the proposed clustering algorithms, is provided.
  • PB-KM now includes a mutation operation in the second step of recombination.
  • An original development of PB-RS is presented which, with respect to standard Random Swap [8,9], is more apt to move directly to a good clustering solution.
  • More details about the Java implementations are furnished.
  • All previous execution experiments were reworked, and new challenging case studies were added to the experimental framework, exploiting synthetic (benchmark) and real-world datasets.
The paper first describes the evolutionary approach of PB-KM and PB-RS, then reports the experimental results of applying them to several datasets. The simulation results confirm that the new algorithms can ensure accurate clustering and good execution times.
The paper is structured as follows. Section 2 reviews the related work about the K-Means and methods for seeding. It also overviews the fundamental aspects of Random Swap, the evolutionary clustering of GA-K-means and Recombinator-K-Means, which inspired the algorithms proposed here. The section also describes some external measures suited to assess the clustering quality. The operations of the PB-KM and PB-RS algorithms are presented in Section 3. Some implementation issues in Java are discussed in Section 4. Section 5 reports the chosen experimental setup made up of synthetic datasets and real-world ones, and the experimental results tied to the practical applications of the developed tools. The execution performance of the new algorithms is also demonstrated. Finally, conclusions are drawn together with an indication of ongoing and future work.

3. Population-Based Clustering Algorithms

3.1. PB-KM

The proposed Population-Based K-Means (PB-KM) algorithm was inspired by the operation of both Recombinator-K-Means [11,12] and GA-K-Means [10]. Its design is based on a simpler, yet effective, clustering approach. PB-KM is organized into two steps (see Algorithm 4). The first step is devoted to the initialization of the population. The second step recombines the centroids of the population toward a final accurate clustering solution.
As in GA-K-Means, an elitist approach is usually used for managing the population. J solutions achieved via Lloyd’s K-Means seeded by GKM++ are used to initialize the population, which thus contains J × K points. Each initial solution is the best one emerging after R1 repetitions of K-Means. In the case R1 = 1, the population is set up as in Recombinator-K-Means. A value R1 > 1 allows the population to be preliminarily established with J “best” solutions, as can happen with GA-K-Means. It is worth noting that, unlike Recombinator-K-Means, PB-KM rests on the basic GKM++ seeding without using a weighting mechanism.
The evolutionary iterations in the second step of PB-KM consist of R2 repetitions of K-Means, fed by GKM++ seeding applied to the population instead of the dataset. The clustering result is the best solution emerging from the R2 executions of K-Means. In other words, the crossover operation coincides with applying the GKM++ seeding followed by the K-Means optimization. If the emerging solution has an SSE value less than that of the current best solution, it becomes the current one and its centroids replace, in the population, the centroid configuration that was selected by GKM++ (mutation operation).
Algorithm 4 describes the two steps of PB-KM, which depend on three parameters: J, R1 and R2. In step 1, the notation run(K-Means, GKM++, X) expresses that K-Means is executed with the GKM++ seeding method applied to the data points of the dataset X. In step 2, K-Means is seeded by GKM++ applied to the centroid points of the population. The SSE cost, though, is always computed on the entire dataset X, partitioned according to the candidate solution (cand) suggested by K-Means.
Algorithm 4. The PB-KM operation.
1. Setup population
   population ← ∅
   repeat J times {
      costBest ← ∞, candBest ← undefined
      repeat R1 times {
         cand ← run(K-Means, GKM++, X)
         cost ← SSE(cand, X)
         if (cost < costBest) {
            costBest ← cost
            candBest ← cand
         }
      }
      population ← population ∪ {candBest}
   }
2. Recombination
   costBest ← ∞
   candBest ← undefined
   repeat R2 times {
      cand ← run(K-Means, GKM++, population)
      cost ← SSE(cand, X)
      if (cost < costBest) {
         costBest ← cost
         candBest ← cand
         replace in the population the GKM++-selected centroids by the cand centroids
      }
      check candBest accuracy by clustering indexes
   }
Generally, following GKM++ seeding in step 1, each identified solution has limited chances of aligning precisely with the optimal solution. However, as discussed in Section 2.3, it may encompass “exemplars”, i.e., centroids near the optimal ones. These exemplars tend to aggregate in dense regions surrounding the ground truth centroids. In step 2, the likelihood of selecting an exemplar by GKM++ in a peak is influenced by the density of that area. Conversely, when an exemplar is chosen, the probability of selecting a point in the same peak area or its vicinity as a subsequent centroid is minimal, thanks to GKM++ ensuring that candidate centroids are far from one another. Consequently, the R 2 repetitions in step 2 have a favorable prospect of detecting a solution closely resembling the optimal one in practical scenarios (as shown later in this paper).
The parameters J, R1 and R2 depend on the handled dataset and the number of clusters K. In many cases, a value J = 25 was found to be sufficient for approaching an accurate solution. For regular datasets, e.g., with spherical clusters regularly located in the data space, even R1 = 1 can be adopted. A small or moderate value, e.g., R1 = 3, can be used for more complex datasets. Generally speaking, the greater the value of R2, the higher the chance of hitting a solution close to the optimal one.
The computational cost of the two steps of PB-KM derives directly from the Repeated K-Means behavior and the use of GKM++. In particular, the first step has a linear cost O(J·R1·[K·S·N·D + K·N·I·D + N·D]), where, inside the square brackets, the first term is the GKM++ cost, the second is the K-Means cost (I is the number of iterations needed to reach convergence), and the last one is the cost of computing the SSE value. The second step has a similar cost when one considers that the seeding is fed by the population, which has J·K points: O(R2·[J·K²·S·D + K·N·I·D + N·D]).

3.2. PB-RS

The setup population step of PB-RS consists of running, J times, the parallel version of Random Swap described in [9], each run continued for T swap iterations (e.g., T = 5000), and storing each emerging solution in the population.
Algorithm 5 shows the recombination step of PB-RS. An initial configuration of K centroids is set up by applying GKM++ to the population. The corresponding partition of the dataset points is then built and its SSE cost is taken as the current cost. Then, T swap iterations are executed. The value of T depends on the dataset and the number of clusters K.
Algorithm 5. The PB-RS recombination step.
cand ← GKM++(population)
partition X data points according to cand
cost ← SSE(cand, X)
repeat T times {
   save cand
   cand' ← swap(cand), that is: c_s ← p_j, p_j ∈ population, s ← unif_rand(1..K), j ← unif_rand(1..J·K)
   refine cand' by a few K-Means iterations (e.g., 5)
   new_cost ← SSE(cand', X)
   if (new_cost < cost) {
      accept cand': cand ← cand'
      cost ← new_cost
   }
   else {
      restore saved cand and its previous partitioning
   }
}
check the accuracy of the final cand by further clustering indexes
At each swap iteration, a centroid in the current configuration (cand) is randomly selected and replaced by a randomly chosen candidate point taken from the population.
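As an illustration, a minimal Java sketch of this swap step is shown below, in the style of the code fragments of Section 4. The class and method names are hypothetical and the centroid/population representation is an assumption; the sketch only exemplifies replacing a randomly selected centroid with a randomly drawn candidate from the population.

import java.util.concurrent.ThreadLocalRandom;

// Hypothetical sketch of the PB-RS swap step: replace a randomly chosen centroid
// with a randomly chosen candidate point taken from the population.
final class SwapSketch {
   // centroids: current configuration of K centroids (cand)
   // population: the J*K candidate centroids collected in the setup step
   static void swapFromPopulation(double[][] centroids, double[][] population) {
      ThreadLocalRandom rnd = ThreadLocalRandom.current();
      int s = rnd.nextInt(centroids.length);   // which centroid to replace (s in 1..K)
      int j = rnd.nextInt(population.length);  // which candidate to use (j in 1..J*K)
      centroids[s] = population[j].clone();    // c_s <- p_j
   }
}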
It is worth noting that the population remains unaltered during the swap iterations of PB-RS. After the initial selection of centroids via GKM++, the mutation and crossover operations are represented, respectively, by the swap operation and the K-Means refinement.
The cost of the first step of PB-RS can be summarized as O(J·[K·S·N·D + K·N·D + T·(N·K·D + K·N·i·D + τ·K + (1 − τ)·(N + K))]), where, for each of the J solutions, the cost of GKM++ is accounted for first, then the cost of partitioning the dataset according to the initial centroids, and finally the cost of the T swap iterations. Here, i is the small number of K-Means iterations executed at each swap to optimize the new centroid configuration, and τ is the probability of accepting the new centroid configuration. If the new configuration is rejected, N + K is the cost of restoring the previous centroids and the associated partitioning. The cost of the second step of PB-RS is similar to that of the first step, considering one single run of Random Swap (J = 1) and that the single GKM++ seeding is fed from the population and costs J·K²·S·D. The number of swap iterations, T, is expected to depend on the particular dataset adopted.
With respect to the standard Random Swap operation [8,9], PB-RS recombination tends to move more “directly” (as experimentally confirmed) toward a good clustering solution by avoiding many unproductive iterations. This is because, at each swap iteration, only a candidate centroid in the population, and not a point of the whole dataset, is considered for replacing a centroid in the current vector of centroids.

4. JAVA Implementation Notes

The realized Java implementation of PB-KM was designed to tackle the important task of facilitating the parallel execution of recurring operations. These include the partitioning and centroids update steps of K-Means (refer to Algorithm 1), the computation of the SSE cost, and the fundamental operations of GKM++, among others. To achieve this, parallel streams and lambda expressions [9,15,16,20] were leveraged. A parallel stream is orchestrated by the fork/join mechanism, allowing for arrays/collections like datasets, populations, centroid vectors and so forth, to be divided into multiple segments. Separate threads are then spawned to process these segments independently, and the results are eventually combined. Lambda expressions serve as functional units specifying operations on a data stream concisely and efficiently.
While the use of such popular parallelism can be straightforward in practical scenarios, it necessitates caution from the designer to avoid using shared data in lambda expressions, as this could introduce subtle data inconsistency issues, rendering the results meaningless.
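The following fragment, which is purely illustrative and not part of the PB-KM/PB-RS code, shows the kind of pitfall meant here: accumulating into a shared variable from a parallel stream lambda is a data race, whereas expressing the same computation as a reduction keeps each thread free of shared mutable state.

import java.util.stream.DoubleStream;

final class SharedStateExample {
   // Unsafe: the lambda mutates a shared accumulator, so parallel results are unreliable.
   static double unsafeSum(double[] values) {
      final double[] acc = {0.0};
      DoubleStream.of(values).parallel().forEach(v -> acc[0] += v); // data race
      return acc[0];
   }

   // Safe: the same computation expressed as a reduction; no shared data is modified.
   static double safeSum(double[] values) {
      return DoubleStream.of(values).parallel().sum();
   }
}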
Supporting classes for PB-KM/PB-RS encompass a foundational environment class G, exposing global data (see Table 1) such as N (dataset size), D (number of data point coordinates or dimensions), K (number of clusters/centroids), S (GKM++ accuracy degree), J (population size), the available seeding methods, methods for loading the dataset into memory, the ground truth information (centroids or partition labels), if any, the population and more. The helper DataPoint class enables common operations on data points, like the Euclidean distance, and offers some method references (equivalent to lambda expressions) employed in point stream operations.
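To make the following fragments easier to read, a minimal sketch of what the DataPoint helper might look like is given below. Only the members referenced by Algorithms 6 and 7 are included, and their bodies are assumptions for illustration, not the actual implementation (in particular, it is assumed here that add2Dist squares the stored distance while accumulating, which matches the “sum of squared distances” computed in Algorithm 6).

// Hypothetical sketch of the DataPoint helper; bodies are illustrative assumptions.
final class DataPoint {
   private final double[] coord;  // the D coordinates of the point
   private double dist;           // per-point scratch value (distance to a centroid)
   private int cid;               // partition label: index of the nearest centroid

   DataPoint() { this(new double[0]); }  // "neutral" point with dist = 0
   DataPoint(double[] coord) { this.coord = coord; }

   double distance(DataPoint other) {    // Euclidean distance
      double s = 0.0;
      for (int d = 0; d < coord.length; ++d) {
         double diff = coord[d] - other.coord[d];
         s += diff * diff;
      }
      return Math.sqrt(s);
   }

   double getDist() { return dist; }
   void setDist(double d) { dist = d; }
   int getCID() { return cid; }
   void setCID(int k) { cid = k; }

   // Accumulator for reduce(): adds the squared distance stored in p to a partial sum.
   static DataPoint add2Dist(DataPoint partial, DataPoint p) {
      DataPoint r = new DataPoint();
      r.setDist(partial.getDist() + p.getDist() * p.getDist());
      return r;
   }

   // Combiner for reduce(): merges the partial sums produced by different threads.
   static DataPoint add2DistCombiner(DataPoint a, DataPoint b) {
      DataPoint r = new DataPoint();
      r.setDist(a.getDist() + b.getDist());
      return r;
   }
}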
To illustrate the Java programming style, Algorithm 6 presents a snippet of the K-Means++/Greedy-K-Means++ methods operating on a source, which can be the entire dataset or the population. The operations pertain to calculating the common denominator (see Algorithm 2) of the probabilities of the data points being chosen as the next centroid.
Algorithm 6. Code fragment of K-Means++/Greedy_K-Means++ operating on a source of data points.

final int l=L;//turn L into a final variable l
Stream<DataPoint> pStream=
      (PARALLEL) ? Arrays.stream(source).parallel(): Arrays.stream(source);
DataPoint ssd=pStream//sum of squared distances
   .map(p->{
      p.setDist(Double.MAX_VALUE);
      for(int k=0; k<l; ++k) {//existing centroids
            double d=p.distance(centroids[k]);
            if(d<p.getDist()) p.setDist(d);
      }
      return p; })
   .reduce(new DataPoint(), DataPoint::add2Dist, DataPoint::add2DistCombiner);
double denP=ssd.getDist();
//common denominator of points probability

//random switch
Initially, a stream (a view, not a copy of the data) pStream is extracted from the source of data points. The value of the G’s PARALLEL parameter determines whether pStream should be operated in parallel. In the following, it is normally assumed that PARALLEL = true.
The intermediate map() operation on pStream processes the points of the source in parallel by recording, into each point p, the minimal distance to the existing centroids (indexes from 1 to L). This is achieved as part of the Function’s lambda expression of the map() operation. Notably, each point only modifies itself and avoids modifications to any shared data.
The map() operation yields a new stream operated on by the reduce() terminal operation. The reduce() operation concretely initiates the parallel processing, including the map executions. It instructs the underlying threads to add the squared point distances, utilizing the method reference add2Dist of the DataPoint class. The partial results from the threads are ultimately combined by the method reference add2DistCombiner of DataPoint, adding them and producing a new DataPoint ssd, whose distance field contains the desired calculation (denP).
Following the calculations in Algorithm 6, a random switch based on point probabilities finally selects the next (not yet chosen) centroid.
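The random switch itself is not shown in Algorithm 6. A possible sketch of such a probability-proportional (roulette-wheel) selection, driven by the denominator denP, is given below; the class and method names are hypothetical and the details are assumptions, not the actual implementation.

import java.util.concurrent.ThreadLocalRandom;

// Hypothetical sketch of the "random switch": the next centroid is chosen with
// probability proportional to the squared distance of each point to its nearest
// existing centroid (the K-Means++ rule), normalized by denP.
final class RouletteSketch {
   static int selectNextCentroid(DataPoint[] source, double denP) {
      double r = ThreadLocalRandom.current().nextDouble() * denP;
      double cumulative = 0.0;
      for (int i = 0; i < source.length; ++i) {
         double d = source[i].getDist();     // minimal distance stored by Algorithm 6
         cumulative += d * d;                // contribution proportional to d^2
         if (r <= cumulative) return i;      // index of the selected point
      }
      return source.length - 1;              // numerical safety fallback
   }
}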
Parallel streams are also used to implement K-Means (see also [9,15]), particularly for realizing the basic steps 2 and 3 of Algorithm 1. In addition, parallelism is exploited for computing the SSE cost and in many similar operations.
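As a hint of how such a step can be phrased, the following sketch shows a possible parallel assignment (partitioning) step in the style of Algorithms 6 and 7; it is illustrative only, with assumed names, and is not the actual implementation.

import java.util.Arrays;
import java.util.stream.Stream;

// Hypothetical sketch of the K-Means assignment step: each point records the index
// of its nearest centroid; points are processed in parallel without shared state.
final class AssignmentSketch {
   static void assign(DataPoint[] dataset, DataPoint[] centroids, boolean parallel) {
      Stream<DataPoint> pStream =
            parallel ? Arrays.stream(dataset).parallel() : Arrays.stream(dataset);
      pStream.forEach(p -> {
         int best = 0;
         double bestDist = Double.MAX_VALUE;
         for (int k = 0; k < centroids.length; ++k) {
            double d = p.distance(centroids[k]);
            if (d < bestDist) { bestDist = d; best = k; }
         }
         p.setCID(best);      // each point only modifies itself
         p.setDist(bestDist);
      });
   }
}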
Algorithm 7 illustrates the function which computes the SSE cost of a given centroid configuration (the current contents of the centroids vector) and its corresponding partitioning of the dataset points. First, the squared distance to its nearest centroid is stored into each point. Then, all the squared distances are accumulated into a point s, through a reduce() operation which receives a neutral point (located in the origin of the data space) and a lambda expression that creates and returns a new point carrying the sum of the squared distances of the two parameter points p1 and p2. In Algorithm 7, all the dataset points can be processed in parallel.
Algorithm 7. Java function which calculates the SSE cost of a given partitioning.
Stream<DataPoint> pStream=
      (PARALLEL) ? Stream.of(dataset).parallel(): Stream.of(dataset);
DataPoint s=pStream
   .map(p ->{
         int k=p.getCID();//retrieve partition label (centroid index) of p
         double d=p.distance(centroids[k]);
          p.setDist(d*d);//store locally to p the squared distance of p to its (nearest) centroid
         return p;
    } )
   .reduce(new DataPoint(),
          (p1,p2)->{ DataPoint ps=new DataPoint(); ps.setDist(p1.getDist()+p2.getDist());
                           return ps; }
    );
return s.getDist();
The systematic use of parallelism in PB-KM/PB-RS purposely reduces the time required, e.g., for computing all the distances between the points and the associated centroids, which, in turn, can significantly reduce the program execution time on a multi-core machine.
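Although not discussed in the paper, it may be useful to recall that Java parallel streams draw their worker threads from the common fork/join pool, whose degree of parallelism can be tuned through a standard system property before the pool is first used; for example:

public final class ParallelismConfig {
   public static void main(String[] args) {
      // The common fork/join pool defaults to (available cores - 1) worker threads;
      // it can be resized before any parallel stream is executed, e.g., to 8 threads:
      System.setProperty("java.util.concurrent.ForkJoinPool.common.parallelism", "8");
      System.out.println(java.util.concurrent.ForkJoinPool.commonPool().getParallelism());
   }
}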

5. Experimental Framework

For comparison purposes with Recombinator-K-Means, all the synthetic (benchmark) and real-world datasets used in [11,12], plus others, were chosen to test the behavior of PB-KM/PB-RS. The datasets are split into four groups.
The first group (see Table 2) contains some basic benchmark datasets taken from [24], often used to check the clustering capabilities of algorithms based on K-Means. All the datasets come with ground truth centroids and will be processed, as in [11,12], by scaling down the data entries by the overall maximum. A brief description of the datasets is shown in Table 2.
Table 2. The first group of synthetic datasets [24].
The A3 dataset comprises 7500 2-d points distributed across 50 spherical clusters. S3 contains 5000 2-d points divided into 15 Gaussian-distributed clusters with limited overlap. As discussed in [5,6], cluster overlapping is the key factor which can favor centroid movement during the K-Means refinement, and thus the obtainment of an accurate clustering. Dim1024 is an example of a dataset with high-dimensional points. It contains 1024 Gaussian-distributed points in 16 well-separated clusters. Unbalance is made up of 6500 2-d points split into eight Gaussian clusters, articulated in two neatly separated groups of clusters containing 2000 and 100 points, respectively. The Birch datasets contain 100,000 (10^5) 2-dimensional points distributed into 100 clusters. In particular, Birch1 places its clusters on a 10 × 10 grid. Birch2, instead, puts the clusters on a sine curve. Birch1 and Birch2 have spherical clusters of the same size.
The synthetic datasets presented in Table 2 can be studied using classical Repeated K-Means and careful seeding. However, in many cases, only an imperfect solution will emerge from the experiments (see Section 5.1).
The second group of datasets (see Table 3) includes two real-world datasets taken from the UCI Repository [25]. Musk concerns a molecule identification problem, i.e., whether a molecule is musk or not. Although limited in the number of data points, N, and the number of clusters, K, the dataset admits a high number, D, of features (coordinates) per point, which complicates the identification problem. The MiniBooNE dataset, instead, regards a signal identification problem, i.e., whether signals are neutrinos or background. In this case, the challenge is represented by the high values of both N and D. The two datasets were also used in [14]. In particular, the solutions documented in [14] will be assumed in this paper as “golden” solutions, from which ground truth centroids are inferred and used to qualify the correctness of the solutions achieved via PB-KM/PB-RS. For comparison purposes with the results in [14], the two datasets will be processed by first scaling all the data entries via min–max normalization.
Table 3. The second group of real-world datasets.
The third group of datasets (see Table 4) contains three synthetic datasets taken from [24] whose good clustering does not necessarily follow from the minimization of the SSE cost (see also [9]).
Table 4. The third group of synthetic datasets [24].
Birch3 (see Figure 1) differs from the regular Birch1 and Birch2 because it admits clusters of random size, randomly located in the data space. Worms_2d (see Figure 2) is composed of 35 clusters of 2-dimensional data points. Worms_64d is characterized by 25 clusters with data points of 64 dimensions. The geometrical shapes of the worm datasets are determined by starting at a random position and moving in a random direction. At any moment, the points follow a Gaussian distribution, whose variance gradually increases step-by-step. In addition, the movement direction is continually changed toward an orthogonal direction. In Worms_64d, though, the orthogonal direction gets randomly re-defined at every step.
Figure 1. The Birch3 dataset [24].
Figure 2. The Worms_2d dataset [24].
Clustering of the worm datasets was investigated in [26] using an enhanced and careful density peak-based algorithm [27]. Birch3 was analyzed, e.g., in [9,11,12]. Such previous results will be used as a reference to assess the accuracy of the clustering solutions documented in this paper. To compare with the results in [11,12], the three datasets will be processed by first scaling down the data entries by the overall maximum.
The fourth group (see Table 5) comprises some challenging real-world datasets, many of which are without ground truth information. Clustering difficulties arise from the large number of data points, N , the number of point features, D , and the number of needed clusters, K .
Table 5. The fourth group of real-world datasets.
The non-binarized version of the Bridge dataset, the 8 bits per color version of the House dataset, and the frame 1 vs. 2 version of the Miss America dataset were downloaded from the repository [24]. They all refer to image data processing.
The facial image recognition problem of the Olivetti dataset, from the AT&T Laboratories Cambridge, handles 40 human subjects, each portrayed in 10 different poses. Every facial photo is stored as 64 × 64 = 4096 pixels.
UrbanGB is a large dataset consisting of the geographical coordinates of car accidents in Great Britain’s urban areas. The dataset can be downloaded from [28].
The solutions for Olivetti and UrbanGB reported in [11,12] were used to infer “ground truth” information about these datasets. In all cases, though, the SSE cost and its evolution vs. the real time can be used for comparison purposes.
As in [11,12], the datasets of the fourth group are processed without scaling, except for UrbanGB, where the first dimension of the data entries is scaled down by a factor of 1.7.
The following simulation experiments were executed on a Win11 Pro platform, Dell XPS 8940, Intel i7-10700 (8 physical cores), CPU@2.90 GHz, 32GB RAM and Java 17.

5.1. Clustering the A3 Dataset

For a preliminary study, the A3 dataset was chosen (see Table 2 and Figure 3). The goal was to compare the performance of classical Repeated K-Means, driven by different seeding methods, with that achievable by PB-KM. In particular, A3 was first clustered via Repeated K-Means (RKM) separately fed by the uniform random (RKM-Unif), K-Means++ (RKM-KM++) and Greedy K-Means++ (RKM-GKM++) seeding procedures.
Figure 3. The A3 synthetic dataset [24].
Then, 10^4 repetitions of K-Means were executed and the following quantities were monitored: (a) the minimal value of the SSE cost (SSE_min); (b) the corresponding Centroid Index (CI) value (see Section 2.5) (CI_min(SSE)); (c) the minimal value of the observed CI (CI_min) and the corresponding value of the SSE cost (SSE_min(CI)); (d) the emerging average CI value (avg_CI); and (e) the success_rate, that is, the number of runs which ended with CI = 0, divided by 10^4. In addition, the Parallel Execution Time (PET), in seconds, needed by Repeated K-Means to complete its runs was also observed. Table 6 collects all the achieved results.
Table 6. Clustering experimental results on the A3 dataset.
The experimental data in Table 6 clearly confirm the superior behavior ensured by the GKM++ seeding, which makes RKM capable of outperforming the scenarios where the K-Means++ (KM++) or the uniform random (Unif) centroid initialization is adopted. The observed average CI and the success_rate are worth noting. As one can see in Table 6, the minimum CI value always occurs at the minimum SSE cost.
The RKM-GKM++ results in Table 6 comply with the results reported, e.g., in [9,11].
Table 7 collects the results observed when using PB-KM with the parameter values J = 25 and R1 = 3 adopted in the first step (see Algorithm 4), and the value R2 = 40 used for the second step. Only the PET was annotated for the first step, which creates the population of candidate solutions. The second-step results clearly confirm that PB-KM is able to correctly solve the A3 dataset. In fact, a success_rate of 100% and CI = 0 were observed. The minimal SSE value coincides with that obtained with RKM-GKM++ in Table 6.
Table 7. PB-KM experimental results on the A3 dataset.
The results in Table 7 also show that the execution time of PB-KM outperforms that achievable by straight Repeated K-Means (see Table 6). The same results in terms of minimal SSE, CI, and a 100% success_rate were also observed when using R2 = 10. In reality, one single recombination iteration is sufficient for obtaining the minimal SSE and CI. All of this was precisely confirmed by using the PB-RS recombination on the same population created by PB-KM.
The following sections report the experimental results collected by applying PB-KM/PB-RS to the four groups of selected datasets. A common point of all the experiments concerns using the GKM++ seeding method both in the first step of the population set-up and in the second step of recombination.

5.2. First Group of Synthetic Datasets (Table 2)

Table 8 shows the experimental results collected when applying PB-KM to all the benchmark datasets reported in Table 2 (entries are preliminarily scaled down by the overall maximum). It is worth noting that all these datasets have a success_rate of 0% when clustered by Repeated K-Means with uniform random seeding, as also documented in [5]. The experimental results confirm that CI = 0 always occurs at the minimum value of the objective SSE cost.
Table 8. PB-KM results on the synthetic datasets of Table 2.
The S3, Dim1024 and Unbalance datasets were studied using J = 25 and R1 = 3 for the first step (three independent repetitions of K-Means are used for defining each solution of the population), and R2 = 40 for the second step (see Algorithm 4). Due to the higher number of clusters K, Birch1 and Birch2 were instead studied by using J = 20 and R1 = 3 for the first step, and R2 = 40 for the second step. Table 8 reports the Parallel Elapsed Time (PET), in seconds, required by the first and second steps of PB-KM.
In reality, PB-KM was capable of detecting the “best” solution, that is, one with minimal SSE and CI, just after a few iterations (in some cases after one iteration) of the recombination step. All of this was also confirmed by the PB-RS recombination. The results (e.g., CI = 0) in Table 8 are the same as those reported in [9], where Random Swap was used, and in [11], where the Recombinator-K-Means tool was exploited.

5.3. Second Group of Real-World Datasets (Table 3)

The Musk and MiniBooNE real-world datasets (data entries preliminarily scaled by min–max normalization), together with the ground truth information inferred from the solutions reported in [14], were easily clustered using PB-KM with J = 25 and R1 = 3 for the first step, and R2 = 20 for the second step. The results, which coincide with those reported in [11,12,14], are shown in Table 9. Very few iterations of PB-RS also confirmed them.
Table 9. PB-KM results on the real-world datasets of Table 3.

5.4. Third Group of Synthetic Datasets (Table 4)

All the entries of the datasets in Table 4 were preliminarily scaled by the overall maximum. The datasets of this group were processed via PB-RS because it provided the most accurate results. The initial population for Birch3 was built using J = 20 and T = 5000 swap iterations of Random Swap, always seeded by GKM++, requiring a parallel elapsed time of PET = 5041 s. The recombination step was carried out using T = 10,000 iterations. Figure 4 depicts the SSE cost vs. the real time PET (s). Figure 5 shows the Centroid Index CI (recall that Birch3 comes with ground truth centroids, see also Section 2.5) vs. the real time PET (s).
Figure 4. SSE cost vs. time for the Birch3 dataset.
Figure 5. Centroid Index (CI) vs. time for the Birch3 dataset.
Notably, a CI = 12 was estimated in [9] by using standard Parallel Random Swap executed for 10^5 iterations, requiring a PET = 7341 s. Figure 5 suggests a final value of CI = 11, reached after a significantly smaller time.
To ensure a proper number of candidate centroids for the Worms_2d/Worms_64d datasets, which have K = 35 and K = 25 clusters, respectively, a population of J = 40 solutions with T = 5000 swap iterations was preliminarily created with PB-RS for Worms_2d, requiring PET = 4485 s, and a population of J = 40 solutions with R1 = 3 was created with PB-KM for Worms_64d, requiring PET = 5156 s.
The PB-RS recombination step was run for T = 10^4 iterations for Worms_2d. Figure 6 and Figure 7 show the measured SSE vs. time and the CI vs. time for Worms_2d, respectively. Since the worm datasets come with partition labels as ground truth, the CI is, in reality, a Generalized CI [23] based on the Jaccard distance among the partitions (see also [9]).
Figure 6. SSE cost vs. time for the Worms_2d dataset.
Figure 7. (Generalized) Centroid Index (CI) vs. time for the Worms_2d dataset.
Figure 6 confirms that, for a dataset like Worms_2d, the minimization of the SSE cost does not necessarily imply that the most accurate solution is achieved (here assessed according to the Centroid Index CI). In fact, for lower values of the SSE (see Figure 6), the CI increases. The average value CI = 7.6 in Figure 7 complies with a similar result documented in [26].
As discussed in [9], Worms_64d, despite its higher dimensionality w.r.t. Worms_2d, is more amenable to clustering and can be correctly solved via Random Swap. This was confirmed using PB-RS with a recombination step of T = 1000 swap iterations (fewer iterations could have been used as well).
Figure 8 and Figure 9 show the SSE vs. time and the CI vs. time for Worms_64d, respectively. As shown in Figure 9, the ultimate value of CI is 0, which starts occurring at the minimum SSE (see Figure 8), thus witnessing the obtainment of a solution with correctly structured clusters.
Figure 8. SSE cost vs. time for the Worms_64d dataset.
Figure 9. (Generalized) Centroid Index (CI) vs. time for the Worms_64d dataset.

5.5. Fourth Group of Real-World Datasets (Table 5)

The challenging real-world datasets in Table 5, which have many clusters, were clustered by PB-RS after preparing a population of J = 5 solutions, each emerging after T = 5000 iterations of Random Swap. PB-RS was chosen because it provided better experimental results (in terms of accuracy and time efficiency) w.r.t. PB-KM.
The dataset entries were used without scaling, except for UrbanGB, where the first dimension of the data entries is preliminarily scaled down by a factor of 1.7. The recombination step runs for T = 5 × 10^4 iterations. However, it can be terminated earlier, when the current cost differs from the previous one by a quantity less than a given numerical threshold (e.g., 10^-4).
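A minimal sketch of such an early-termination test is shown below; the class and method names, and the way costs are passed, are illustrative assumptions, not the actual implementation.

// Hypothetical early-termination check for the recombination loop:
// stop when the SSE improvement between consecutive iterations falls below eps.
final class ConvergenceSketch {
   static boolean converged(double previousCost, double currentCost, double eps) {
      return Math.abs(previousCost - currentCost) < eps;  // e.g., eps = 1e-4
   }
}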
Figure 10, Figure 11, Figure 12, Figure 13, Figure 14, Figure 15 and Figure 16 report the performance curves (the SSE cost vs. time and, when possible, the CI vs. time) observed for the Bridge, House, Miss America, Olivetti and UrbanGB [28] datasets. Since Olivetti and UrbanGB are provided with ground truth information (both centroids and partition labels), Figure 14 and Figure 16 portray the observed Centroid Index CI vs. time for Olivetti and UrbanGB, respectively.
Figure 10. SSE vs. time for the Bridge dataset.
Figure 11. SSE vs. time for the Miss America dataset.
Figure 12. SSE vs. time for the House dataset.
Figure 13. SSE vs. time for the Olivetti dataset.
Figure 14. CI vs. time for the Olivetti dataset.
Figure 15. SSE vs. time for the UrbanGB dataset.
Figure 16. CI vs. time for the UrbanGB dataset.
From Figure 14, it emerges that the clustering algorithm could not recognize 7 faces out of 40. Similarly, the results in Figure 16 indicate that, in the best case, 143 of 469 cases were not correctly handled in the UrbanGB dataset.
The shown experimental results agree with those reported in [11,12].

5.6. Time Efficiency of PB-KM

The computational efficiency of the developed tools was assessed, in one case, using the PB-KM recombination step on the Worms_64d dataset (see Table 4), via the GKM++ seeding method, with J = 40 and R2 = 200 repetitions, separately in parallel (the PARALLEL parameter set to true) and sequential (PARALLEL = false) modes. The total elapsed time tET (in ms) needed by the PB-KM recombination to complete in the serial (tET_S) and parallel (tET_P) cases was measured, together with the total number of executed K-Means iterations (respectively, tIT_S and tIT_P), as reported in Table 10.
Table 10. The sequential and parallel execution of PB-KM recombination on Worms_64d (8 physical cores).
From the data in Table 10, the average elapsed time per iteration was computed as avETit_S = tET_S / tIT_S = 292.73 and avETit_P = tET_P / tIT_P = 43.70, and the speedup was estimated as
speedup = avETit_S / avETit_P = 292.73 / 43.70 ≈ 6.7
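As an illustration of how such elapsed times might be collected (this is not the actual benchmarking code), one mode of the recombination step can simply be bracketed with System.nanoTime():

// Hypothetical timing sketch: measure the elapsed time of one recombination run.
final class TimingSketch {
   static long elapsedMillis(Runnable recombinationStep) {
      long t0 = System.nanoTime();
      recombinationStep.run();   // e.g., PB-KM recombination with PARALLEL set as desired
      return (System.nanoTime() - t0) / 1_000_000;  // elapsed milliseconds
   }
}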

6. Conclusions

This paper proposes two evolutionary-based clustering algorithms: Population-Based K-Means (PB-KM) and Population-Based Random Swap (PB-RS). The two algorithms were inspired by Recombinator-K-Means [11,12] and the genetic algorithm of Fränti [10], plus the use of the careful seeding ensured by the Greedy K-Means++ (GKM++) method [11,14].
However, PB-KM and PB-RS are based on a simpler, yet effective, approach which rests on two steps. In the first step, a population of J candidate “best” centroid solutions is created. The second step recombines the population’s centroids toward obtaining a careful solution. This is achieved in PB-KM through a number of independent repetitions of Lloyd’s K-Means [2,3,4], and in PB-RS through a certain number of iterations of Random Swap [8,9]. In both cases, the starting point is a centroid configuration achieved by applying GKM++ to the population. The refinement of the initial solution is then driven by partitioning the points of the dataset and moving toward the minimization of the Sum-of-Squared Errors (SSE) cost.
A key factor of PB-KM and PB-RS concerns their implementation in Java based on parallel streams [9,15,16], which enables the exploitation of the parallel computing potential of modern multi/many-core machines.
The paper documents the reliable and efficient clustering capabilities of PB-KM and PB-RS by applying them to a collection of challenging benchmark and real-world datasets.
Ongoing and future work aims to address the following points.
First, it aims to experiment with the two developed algorithms for clustering sets [20,29] and, more generally, categorical and text-based datasets.
Second, it aims to port the implementations on top of the efficient Theatre actor system [30], which allows for better control and exploitation of the parallel resources of a multi/many-core machine.
Third, the aim is to adapt PB-KM by replacing Lloyd’s K-Means with the Hartigan and Wong variation of K-Means [31,32]. The idea is to experiment with an incremental technique which constrains the switching of a data point from its source cluster to a destination cluster also on the basis of its Silhouette coefficient [33]. The goal is to favor the definition of well-separated clusters.
The fourth aim is to compare the two developed algorithms with affinity propagation clustering [34] algorithms, e.g., for studying the seismic consequences caused by earthquakes [35]. In addition, the influence on the clustering work described here of the method for distributing points uniformly on a hypersphere [36] deserves particular attention.

Author Contributions

Conceptualization, L.N. and F.C.; Methodology, L.N. and F.C.; Software, L.N. and F.C.; Validation, L.N. and F.C.; Writing–original draft, L.N.; Writing–review & editing, F.C. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

All data are available in online repositories [24,25,28].

Acknowledgments

The authors thank the colleague Carlo Baldassi for providing the O l i v e t t i and U r b a n G B datasets, along with ground truth information.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Bell, J. Machine Learning: Hands-on for Developers and Technical Professionals; John Wiley & Sons: Hoboken, NJ, USA, 2020. [Google Scholar]
  2. Lloyd, S.P. Least squares quantization in PCM. IEEE Trans. Inf. Theory 1982, 28, 129–137. [Google Scholar] [CrossRef]
  3. MacQueen, J. Some methods for classification and analysis of multivariate observations. In Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability; University of California: Berkeley, CA, USA, 1967; pp. 281–297. [Google Scholar]
  4. Jain, A.K. Data clustering: 50 years beyond k-means. Pattern Recognit. Lett. 2010, 31, 651–666. [Google Scholar] [CrossRef]
  5. Fränti, P.; Sieranoja, S. K-means properties on six clustering benchmark datasets. Appl. Intell. 2018, 48, 4743–4759. [Google Scholar] [CrossRef]
  6. Fränti, P.; Sieranoja, S. How much can k-means be improved by using better initialization and repeats? Pattern Recognit. 2019, 93, 95–112. [Google Scholar] [CrossRef]
  7. Vouros, A.; Langdell, S.; Croucher, M.; Vasilaki, E. An empirical comparison between stochastic and deterministic centroid initialization for K-means variations. Mach. Learn. 2021, 110, 1975–2003. [Google Scholar] [CrossRef]
  8. Fränti, P. Efficiency of random swap algorithm. J. Big Data 2018, 5, 1–29. [Google Scholar] [CrossRef]
  9. Nigro, L.; Cicirelli, F.; Fränti, P. Parallel random swap: An efficient and reliable clustering algorithm in Java. Simul. Model. Pract. Theory 2023, 124, 102712. [Google Scholar] [CrossRef]
  10. Fränti, P. Genetic algorithm with deterministic crossover for vector quantization. Pattern Recognit. Lett. 2000, 21, 61–68. [Google Scholar] [CrossRef]
  11. Baldassi, C. Recombinator K-Means: A population based algorithm that exploits k-means++ for recombination. arXiv 2020, arXiv:1905.00531v3. [Google Scholar]
  12. Baldassi, C. Recombinator K-Means: An evolutionary algorithm that exploits k-means++ for recombination. IEEE Trans. Evol. Comput. 2022, 26, 991–1003. [Google Scholar] [CrossRef]
  13. Hruschka, E.R.; Campello, R.J.; Freitas, A.A. A survey of evolutionary algorithms for clustering. IEEE Trans. Syst. Man Cybern. Part C (Appl. Rev.) 2009, 39, 133–155. [Google Scholar] [CrossRef]
  14. Celebi, M.E.; Kingravi, H.A.; Vela, P.A. A comparative study of efficient initialization methods for the k-means clustering algorithm. Expert Syst. Appl. 2013, 40, 200–210. [Google Scholar] [CrossRef]
  15. Nigro, L. Performance of parallel K-means algorithms in Java. Algorithms 2022, 15, 117. [Google Scholar] [CrossRef]
  16. Urma, R.G.; Fusco, M.; Mycroft, A. Modern Java in Action; Manning: Shelter Island, NY, USA, 2018. [Google Scholar]
  17. Nigro, L.; Cicirelli, F. Performance of a K-Means algorithm driven by careful seeding. In Proceedings of the 13th International Conference on Simulation and Modeling Methodologies, Technologies and Applications (SIMULTECH) 2023, Rome, Italy, 12–14 July 2023; pp. 27–36, ISBN 978-989-758-668-2. [Google Scholar]
  18. Arthur, D.; Vassilvitskii, S. K-Means++: The advantages of careful seeding. In Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms 2007; pp. 1027–1035.
  19. Goldberg, D.E. Genetic Algorithms in Search Optimization and Machine Learning; Addison Wesley: Boston, MA, USA, 1989. [Google Scholar]
  20. Nigro, L.; Fränti, P. Two Medoid-Based Algorithms for Clustering Sets. Algorithms 2023, 16, 349. [Google Scholar] [CrossRef]
  21. Rezaei, M.; Franti, P. Set Matching Measures for External Cluster Validity. IEEE Trans. Knowl. Data Eng. 2016, 28, 2173–2186. [Google Scholar] [CrossRef]
  22. Fränti, P.; Rezaei, M.; Zhao, Q. Centroid index: Cluster level similarity measure. Pattern Recognit. 2014, 47, 3034–3045. [Google Scholar] [CrossRef]
  23. Fränti, P.; Rezaei, M. Generalized centroid index to different clustering models. In Proceedings of the Joint IAPR International Workshops on Statistical Techniques in Pattern Recognition (SPR) and Structural and Syntactic Pattern Recognition (SSPR), Mérida, Mexico, 29 November–2 December 2016; LNCS 10029. pp. 285–296. [Google Scholar]
  24. Fränti, P. Repository of Datasets. Available online: http://cs.uef.fi/sipu/datasets/ (accessed on 1 August 2023).
  25. UCI Machine Learning Repository. Available online: http://archive.ics.uci.edu/ml (accessed on 1 August 2023).
  26. Sieranoja, S.; Fränti, P. Fast and general density peaks clustering. Pattern Recognit. Lett. 2019, 128, 551–558. [Google Scholar] [CrossRef]
  27. Rodriguez, A.; Laio, A. Clustering by fast search and find of density peaks. Science 2014, 344, 1492–1496. [Google Scholar] [CrossRef]
  28. Baldassi, C. UrbanGB Dataset. Available online: https://github.com/carlobaldassi/UrbanGB-dataset (accessed on 1 August 2023).
  29. Rezaei, M.; Fränti, P. K-sets and k-swaps algorithms for clustering sets. Pattern Recognit. 2023, 139, 109454. [Google Scholar] [CrossRef]
  30. Nigro, L. Parallel Theatre: A Java actor-framework for high-performance computing. Simul. Model. Pract. Theory 2021, 106, 102189. [Google Scholar] [CrossRef]
  31. Hartigan, J.A.; Wong, M.A. Algorithm as 136: A k-means clustering algorithm. J. R. Stat. Soc. Ser. C (Appl. Stat.) 1979, 28, 100–108. [Google Scholar] [CrossRef]
  32. Slonim, N.; Aharoni, E.; Crammer, K. Hartigan’s k-means versus Lloyd’s k-means-is it time for a change? In Proceedings of the Twenty-Third International Joint Conference on Artificial Intelligence (IJCAI 2013), Beijing, China, 3–9 August 2013; pp. 1677–1684. [Google Scholar]
  33. Bagirov, A.M.; Aliguliyev, R.M.; Sultanova, N. Finding compact and well separated clusters: Clustering using silhouette coefficients. Pattern Recognit. 2023, 135, 109144. [Google Scholar] [CrossRef]
  34. Frey, B.J.; Dueck, D. Clustering by passing messages between data points. Science 2007, 315, 972–976. [Google Scholar] [CrossRef] [PubMed]
  35. Moustafa, S.S.; Abdalzaher, M.S.; Khan, F.; Metwaly, M.; Elawadi, E.A.; Al-Arifi, N.S. A quantitative site-specific classification approach based on affinity propagation clustering. IEEE Access 2021, 9, 155297–155313. [Google Scholar] [CrossRef]
  36. Lovisolo, L.; Da Silva, E.A.B. Uniform distribution of points on a hyper-sphere with applications to vector bit-plane encoding. IEE Proc.-Vis. Image Signal Process. 2001, 148, 187–193. [Google Scholar] [CrossRef]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
