Self-Adjusting Variable Neighborhood Search Algorithm for Near-Optimal k-Means Clustering

Abstract: The k-means problem is one of the most popular models in cluster analysis: it minimizes the sum of the squared distances from clustered objects to the sought cluster centers (centroids). The simplicity of its algorithmic implementation encourages researchers to apply it in a variety of engineering and scientific branches. Nevertheless, the problem is proven to be NP-hard, which makes exact algorithms inapplicable for large-scale problems, and the simplest and most popular algorithms result in very poor values of the squared distances sum. If a problem must be solved within a limited time with the maximum accuracy, which would be difficult to improve using known methods without increasing computational costs, the variable neighborhood search (VNS) algorithms, which search in randomized neighborhoods formed by the application of greedy agglomerative procedures, are competitive. In this article, we investigate the influence of the most important parameter of such neighborhoods on the computational efficiency and propose a new VNS-based algorithm (solver), implemented on the graphics processing unit (GPU), which adjusts this parameter. Benchmarking on data sets composed of up to millions of objects demonstrates the advantage of the new algorithm in comparison with known local search algorithms, within a fixed time, allowing for online computation.


Problem Statement
The aim of clustering is to divide a given set (sample) of objects (data vectors) into disjoint subsets, called clusters, so that each cluster consists of similar objects, while the objects of different clusters differ significantly [1,2]. The clustering problem belongs to the wide class of unsupervised machine learning problems. Clustering models involve various similarity or dissimilarity measures. The k-means model with the squared Euclidean distance as a dissimilarity measure is based exclusively on the maximum similarity (minimum sum of squared distances) among objects within clusters.
Clustering methods can be divided into two main categories: hierarchical and partitioning [1,3]. Partitioning clustering, such as k-means, aims at optimizing the clustering result in accordance with a pre-defined objective function [3].
The k-means problem [4,5], also known as minimum sum-of-squares clustering (MSSC), assumes that the objects being clustered are described by numerical features. Each object is represented by a point in the feature space R^d (data vector). It is required to find a given number k of cluster centers (called centroids) so as to minimize the sum of the squared distances from the data vectors to the nearest centroid.
Let $A_1, \ldots, A_N \in \mathbb{R}^d$ be data vectors, N be the number of them, and $S = \{X_1, \ldots, X_k\} \subset \mathbb{R}^d$ be the set of sought centroids. The objective function (sum of squared errors, SSE) of the k-means optimization problem formulated by MacQueen [5] is:

$$\mathrm{SSE}(X_1,\ldots,X_k)=\sum_{i=1}^{N}\min_{X\in\{X_1,\ldots,X_k\}}\|A_i-X\|^2\;\to\;\min_{X_1,\ldots,X_k\in\mathbb{R}^d}. \qquad (1)$$

Here, $\|\cdot\|$ is the Euclidean distance, and the integer k must be known in advance.
A cluster in the k-means problem is a subset of data vectors for which the specified centroid is the nearest one:

$$C_j=\big\{A_i,\;i\in\{1,\ldots,N\}:\;\|A_i-X_j\|\le\|A_i-X_l\|\;\;\forall l\in\{1,\ldots,k\}\big\},\quad j=1,\ldots,k. \qquad (2)$$

We assume that a data vector cannot belong to two clusters at the same time. If several centroids are equidistant from a data vector, the question of assignment to a cluster can be solved by clustering algorithms in different ways; for example, the data vector may be assigned to the cluster with the lowest index among the nearest centroids. Usually, for practical problems with sufficiently accurately measured data vectors, the assignment to a specific cluster among equidistant ones is not very important.
The objective function may also be formulated as follows:

$$\mathrm{SSE}(X_1,\ldots,X_k)=\sum_{j=1}^{k}\;\sum_{i\in\{1,\ldots,N\}:\,A_i\in C_j}\|A_i-X_j\|^2\;\to\;\min_{X_1,\ldots,X_k\in\mathbb{R}^d}, \qquad (3)$$

or, with the partition $\{C_1,\ldots,C_k\}$ as the sought variable and each centroid fixed to the mean of its cluster,

$$\sum_{j=1}^{k}\;\sum_{A_i\in C_j}\Big\|A_i-\frac{1}{|C_j|}\sum_{A_{i'}\in C_j}A_{i'}\Big\|^2\;\to\;\min_{C_1,\ldots,C_k}. \qquad (4)$$
Equations (3) and (4) correspond to the continuous and discrete statements of our problem, respectively. Such clustering problem statements have a number of drawbacks. In particular, the number of clusters k must be given in advance, which is hardly possible for the majority of practically important problems. Furthermore, the adequacy of the result in the case of complex cluster shapes is questionable (the model is proven to work well with ball-shaped clusters [6]). The result is sensitive to outliers (standalone objects) [7,8] and depends on the chosen distance measure and the data normalization method. The model does not take into account the dissimilarity between the objects in different clusters, and the application of the k-means model results in some solution X_1, …, X_k even in cases with no cluster structure in the data [9,10]. Moreover, the NP-hardness [11,12] of the problem makes the exact methods [6] applicable only to very small problems.
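For concreteness, evaluating objective (1)/(3) for a candidate set of centroids is straightforward. The following minimal C++ sketch (our own illustration; the function names are ours and are not from the original solver) computes the SSE value directly from its definition:

```cpp
#include <cstddef>
#include <limits>
#include <vector>

// Sum of squared Euclidean distances from each data vector to its
// nearest centroid -- the k-means objective (1)/(3).
double sse(const std::vector<std::vector<double>>& data,
           const std::vector<std::vector<double>>& centroids) {
    double total = 0.0;
    for (const auto& a : data) {
        double best = std::numeric_limits<double>::max();
        for (const auto& x : centroids) {
            double s = 0.0;
            for (std::size_t t = 0; t < a.size(); ++t) {
                double diff = a[t] - x[t];
                s += diff * diff;
            }
            if (s < best) best = s; // keep the nearest centroid's distance
        }
        total += best;
    }
    return total;
}
```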
Nevertheless, the simplicity of the most commonly used algorithmic realization, as well as the interpretability of the results, makes the k-means problem the most popular clustering model. Developers' efforts are focused on the design of heuristic algorithms that provide acceptable and attainable values of the objective function. For many problems, e.g., those with geographic/geometrical data, the most popular of these algorithms, Lloyd's procedure, results in a solution which is very far from the global minimum of the objective function (1), and the multi-start operation mode does not improve the result significantly. More accurate k-means clustering methods are much slower. Nevertheless, recent advances in high-performance computing and the use of massively parallel systems enable us to work through a large amount of computation using Lloyd's procedure embedded into more complex algorithmic schemes. Thus, the demand for clustering algorithms that balance the time spent on computation against the resulting value of the objective function (1) is apparent. In some cases, when solving problem (1), it is required to obtain a result (a value of the objective function) within a limited fixed time, which would be difficult to improve on by known methods without a significant increase in computational costs. Such results are required when the cost of error is high, as well as to provide reference solutions for evaluating faster algorithms.
Agglomerative procedures, despite their relatively high computational complexity, can be successfully integrated into more complex search schemes. They can be used as a part of the crossover operator of genetic algorithms [46,88] and as a part of the VNS algorithms. Moreover, such algorithms are a compromise between the solution accuracy and time costs. In this article, by accuracy, we mean exclusively the ability of the algorithm (solver) to obtain the minimum values of the objective function (1).
The use of VNS algorithms that search in neighborhoods formed by applying greedy agglomerative procedures to a known (current) solution S enables us to obtain good results within a fixed time acceptable for interactive modes of operation. However, the selection of such procedures, their sequence, and their parameters remains an open question. The efficiency of such procedures has been experimentally shown on some test and practical problems. Various versions of VNS algorithms based on greedy agglomerative procedures differ significantly in their results, which makes such algorithms difficult to use in practical problems. It is practically impossible to forecast the relative performance of a specific VNS algorithm based on such generalized numerical features of the problem as the sample size and the number of clusters. Moreover, the efficiency of such procedures depends on their parameters. However, the type and nature of this dependence has not been studied.

Our Contribution
In this article, we systematize approaches to the construction of search algorithms in neighborhoods formed by the use of greedy agglomerative procedures.
In this work, we proceeded from the following assumptions:
(a) The value chosen for parameter r (the number of excessive centroids, see above) of the greedy agglomerative heuristic procedure significantly affects the efficiency of the procedure.
(b) Since it is hardly possible to determine the optimal value of this parameter from such numerical parameters of the k-means problem as the number of data vectors and the number of clusters, a reconnaissance (exploratory) search over various values of r can be useful.
(c) Unlike the well-known VNS algorithms that use greedy agglomerative heuristic procedures with an increasing value of parameter r, a gradual decrease in the value of this parameter may be more effective.
Based on these assumptions, we propose a new VNS algorithm involving greedy agglomerative procedures for the k-means problem which, by adjusting the initial value of the r parameter of such procedures, enables us to obtain, within a fixed time, results that exceed those of known VNS algorithms. Due to its self-adjusting capabilities, such an algorithm should be more versatile, which should increase its applicability to a wider range of problems in comparison with known VNS algorithms based on greedy agglomerative procedures.

Structure of this Article
The rest of this article is organized as follows. In Section 2, we present an overview of the most common local search algorithms for k-means and similar problems, and introduce the notion of the SWAP_r and GREEDY_r neighborhoods. It is shown experimentally that the search result in these neighborhoods strongly depends on the neighborhood parameter r (the number of simultaneously replaced or added centroids). In addition, we present a new VNS algorithm which performs the local search in alternating GREEDY_r neighborhoods with a decreasing value of r, whose initial value is estimated by a special auxiliary procedure. In Section 3, we describe our computational experiments with the new and known algorithms. In Section 4, we consider the applicability of the results on the adjustment of the GREEDY_r neighborhood parameter in algorithmic schemes other than VNS, in particular, in evolutionary algorithms with a greedy agglomerative crossover operator. The conclusions are given in Section 5.

Materials and Methods
For constructing a more efficient algorithm (solver), we used a combination of such algorithms as Lloyd's procedure, greedy agglomerative clustering procedures, and the variable neighborhood search. The most computationally expensive part of this new algorithmic construction, Lloyd's procedure, was implemented on graphic processing units (GPU).

The Simplest Approach
Lloyd's procedure, the simplest and most popular algorithm for solving the k-means problem, is described as follows (see Algorithm 1).

Algorithm 1. Lloyd(S)
Require: Set of initial centroids S = {X_1, …, X_k}. If S is not given, then the initial centroids are selected randomly from the set of data vectors {A_1, …, A_N}.
repeat
1. For each centroid X_j, j = 1,…,k, define its cluster C_j in accordance with (2); // i.e., assign each data vector to the nearest centroid
2. For each cluster C_j, j = 1,…,k, calculate its centroid as follows: X_j ← (1/|C_j|) · Σ_{A_i ∈ C_j} A_i;
until all centroids stay unchanged.
Formally, the k-means problem in its formulation (1) or (3) is a continuous optimization problem. With a fixed composition of the clusters C_j, the optimal solution is found in an elementary way (see Step 2 in Algorithm 1), and this solution is a local optimum of the problem in terms of continuous optimization theory, i.e., a local optimum in the ε-neighborhood. A large number of such optima forces algorithm designers to systematize their search in some way. The first step of Lloyd's algorithm solves a simple combinatorial optimization problem (4) on the redistribution of data vectors among clusters, that is, it searches in a different neighborhood.
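For illustration, a minimal single-threaded C++ sketch of Algorithm 1 may look as follows. This is a simplified reference version (the names and the empty-cluster handling are our assumptions), not the GPU implementation used in our experiments.

```cpp
#include <cstddef>
#include <limits>
#include <vector>

using Vec = std::vector<double>; // one data vector in R^d

static double sqDist(const Vec& a, const Vec& b) {
    double s = 0.0;
    for (std::size_t t = 0; t < a.size(); ++t) {
        double diff = a[t] - b[t];
        s += diff * diff;
    }
    return s;
}

// Lloyd's procedure (Algorithm 1): alternate the assignment step and the
// centroid recalculation step until no centroid moves.
void lloyd(const std::vector<Vec>& data, std::vector<Vec>& centroids) {
    const std::size_t N = data.size(), k = centroids.size(), d = data[0].size();
    std::vector<std::size_t> label(N, 0);
    bool changed = true;
    while (changed) {
        changed = false;
        // Step 1: assign each data vector to the nearest centroid (Eq. (2))
        for (std::size_t i = 0; i < N; ++i) {
            double best = std::numeric_limits<double>::max();
            std::size_t bestJ = 0;
            for (std::size_t j = 0; j < k; ++j) {
                double dist = sqDist(data[i], centroids[j]);
                if (dist < best) { best = dist; bestJ = j; }
            }
            label[i] = bestJ;
        }
        // Step 2: recompute each centroid as the mean of its cluster
        for (std::size_t j = 0; j < k; ++j) {
            Vec sum(d, 0.0);
            std::size_t count = 0;
            for (std::size_t i = 0; i < N; ++i)
                if (label[i] == j) {
                    for (std::size_t t = 0; t < d; ++t) sum[t] += data[i][t];
                    ++count;
                }
            if (count == 0) continue; // empty cluster: keep the old centroid (our choice)
            for (std::size_t t = 0; t < d; ++t) {
                double m = sum[t] / count;
                if (m != centroids[j][t]) { centroids[j][t] = m; changed = true; }
            }
        }
    }
}
```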
The simplicity of Lloyd's procedure enables us to apply it to a wide range of problems, including face detection, image segmentation, signal processing, and many others [92]. Frackiewicz et al. [93] presented a color quantization method based on downsampling the original image and k-means clustering of the downsampled image. The k-means clustering algorithm used in [94] was proposed for identifying the electrical equipment of a smart building. In many cases, researchers do not distinguish between the k-means model and the k-means algorithm, as Lloyd's procedure is also called. Nevertheless, the result of Lloyd's procedure may differ from the results of other, more advanced algorithms many times over in the objective function value (1). For finding a more accurate solution, a wide range of heuristic methods have been proposed [55]: evolutionary and other bio-inspired algorithms, as well as local search in various neighborhoods. The modern scientific literature offers many algorithms to speed up the solution of the k-means problem. The k-indicators algorithm [95] promoted by Chen et al. is a semi-convex-relaxation algorithm for the approximate solution of big-data clustering problems. The distributed implementation of the k-means algorithm proposed in [96] considers a set of agents, each of which is equipped with a possibly high-dimensional piece of information or set of measurements. In [97,98], the researchers improved algorithms for data streams. In [99], Hedar et al. present a hierarchical k-means method for better clustering performance in the case of big-data problems; this approach enables us to mitigate the poor scaling behavior with regard to computing time and memory requirements. A fast adaptive k-means subspace clustering algorithm with an adaptive loss function for high-dimensional data was proposed by Wang et al. [100]. Nevertheless, the use of massively parallel systems is the most efficient way to achieve the most significant acceleration of computations, and the original Lloyd's procedure (Algorithm 1) can be seamlessly parallelized on such systems [101,102].
Metaheuristic approaches for the k-means and similar problems include genetic algorithms [46,103,104], the ant colony clustering hybrid algorithm proposed in [105], and particle swarm optimization algorithms [106]. Almost all of these algorithms use, in one way or another, Lloyd's procedure or other local search procedures. Our new algorithm (solver) is no exception.

Local Search in SWAP Neighborhoods
Local search algorithms differ in the form of the neighborhood function n(S). A local minimum in one neighborhood may not be a local minimum in another neighborhood [50]. The choice of a neighborhood of lower cardinality leads to a decrease in the complexity of the search step; however, a wider neighborhood can lead to a better local minimum. We have to find a balance between these conflicting requirements [50].
A popular idea when solving the k-means, k-medoids, and k-median problems is to search for a better solution in SWAP neighborhoods. This idea was realized, for instance, in the J-means procedure [80] proposed by Hansen and Mladenovic, and in the similar I-means algorithm [107]. In SWAP neighborhoods, the set n(S) is the set of solutions obtained from S by replacing one or more centroids with some data vectors.
Let us denote the neighborhood where r centroids must be simultaneously replaced by SWAP_r(S). The SWAP_r neighborhood search can be regular (all possible substitutions are sequentially enumerated), as in the J-means algorithm, or randomized (centroids and data vectors for replacement are selected randomly). In both cases, the search in the SWAP neighborhood always alternates with Lloyd's procedure: if an improved solution is found in the SWAP neighborhood, Lloyd's procedure is applied to this new solution, and then the algorithm returns to the SWAP neighborhood search. Except for very small problems, regular search in SWAP neighborhoods, with the exception of the SWAP_1 neighborhood and sometimes SWAP_2, is almost never used due to its computational complexity: in each iteration, all possible replacement options must be tested. A randomized search in SWAP_r neighborhoods can be highly efficient for sufficiently large problems, which is demonstrated by the experiment described below. Here, the correct choice of r is of great importance.
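For illustration, one randomized SWAP_r move could be sketched in C++ as follows (our own sketch; it assumes the lloyd() and sse() helpers from the previous listings and keeps the candidate solution only if it improves the SSE value):

```cpp
#include <random>
#include <vector>

// One randomized SWAP_r move: replace r randomly chosen centroids with r
// randomly chosen data vectors, run Lloyd's procedure on the candidate,
// and accept the candidate only if it improves the SSE value.
bool swapMove(const std::vector<std::vector<double>>& data,
              std::vector<std::vector<double>>& centroids,
              int r, std::mt19937& rng) {
    auto candidate = centroids;
    std::uniform_int_distribution<std::size_t> pickC(0, candidate.size() - 1);
    std::uniform_int_distribution<std::size_t> pickA(0, data.size() - 1);
    for (int s = 0; s < r; ++s)
        candidate[pickC(rng)] = data[pickA(rng)]; // replace a centroid by a data vector
    lloyd(data, candidate);
    if (sse(data, candidate) < sse(data, centroids)) {
        centroids = candidate; // improved solution found in SWAP_r
        return true;
    }
    return false;
}
```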
As can be seen in Figure 1, for various problems from the clustering benchmark repository [108,109], the best results are achieved with different values of r, although, in general, such a search provides better results in comparison with Lloyd's procedure. Our computational experiments are described in detail in Sections 2.5-2.7.

Agglomerative Approach and GREEDY_r Neighborhoods
When solving the k-means and similar problems, the agglomerative approach is often successful. In [86], Sun et al. propose a parallel clustering method based on the MapReduce model which implements the information bottleneck clustering (IBC) idea. In the IBC and other agglomerative clustering algorithms, clusters are sequentially removed one by one, and objects are redistributed among the remaining clusters. Alp et al. [88] presented a genetic algorithm for facility location problems, where evolution is facilitated by a greedy agglomerative heuristic procedure. A genetic algorithm with a faster greedy heuristic procedure for clustering and location problems was also proposed in [90]. In [46], two genetic algorithm approaches with different crossover procedures are used to solve the k-median problem in continuous space.
Greedy agglomerative procedures can be used as independent algorithms, as well as being embedded into genetic operators [110] or VNS algorithms [79]. The basic greedy agglomerative procedure for the k-means problem can be described as follows (see Algorithm 2).

Algorithm 2. BasicGreedy(S)
Require: Set of initial centroids S = {X_1, …, X_K}, K > k; the required final number of centroids k.
S ← Lloyd(S);
while |S| > k do
  for all X_i ∈ S do F_i ← SSE(S\{X_i}); end for
  Select a subset S′ ⊂ S of r_toremove centroids with the minimum values of the corresponding variables F_i; // By default, r_toremove = 1.
  S ← Lloyd(S\S′);
end while

In its most commonly used version, with r_toremove = 1, this procedure is rather slow for large-scale problems. It removes the centroids one by one: at each iteration, it eliminates the centroid whose elimination results in the least significant increase in the SSE value. Furthermore, this procedure involves Lloyd's procedure, which can also be slow in the case of rather large problems with many clusters. To improve the performance of such a procedure, the number of simultaneously eliminated centroids can be calculated as r_toremove = max{1, ⌊(|S| − k) · r_coef⌋}. In [90], Kazakovtsev and Antamoshkin used the elimination coefficient value r_coef = 0.2. This means that at each iteration, up to 20% of the excessive centroids are eliminated, and such values were shown to make the algorithm faster. In this research, we use the same value.
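A compact C++ sketch of Algorithm 2 (our illustration, assuming the lloyd() and sse() helpers above; a production version would use the incremental SSE recalculation of Equation (5) in the CUDA section below rather than recomputing the SSE from scratch):

```cpp
#include <algorithm>
#include <utility>
#include <vector>

// Greedy agglomerative procedure (Algorithm 2): starting from |S| = K > k
// centroids, repeatedly eliminate the centroids whose removal increases
// the SSE the least, then re-run Lloyd's procedure. rCoef = 0.2 as in [90].
void basicGreedy(const std::vector<std::vector<double>>& data,
                 std::vector<std::vector<double>>& centroids,
                 std::size_t k, double rCoef = 0.2) {
    lloyd(data, centroids);
    while (centroids.size() > k) {
        const std::size_t K = centroids.size();
        // F_i = SSE of the solution with centroid i removed
        std::vector<std::pair<double, std::size_t>> F(K);
        for (std::size_t i = 0; i < K; ++i) {
            auto reduced = centroids;
            reduced.erase(reduced.begin() + i);
            F[i] = { sse(data, reduced), i };
        }
        // eliminate up to rCoef * (K - k) centroids at once, at least one
        std::size_t nRemove = std::max<std::size_t>(
            1, static_cast<std::size_t>((K - k) * rCoef));
        nRemove = std::min(nRemove, K - k); // never drop below k centroids
        std::sort(F.begin(), F.end());      // smallest SSE increase first
        std::vector<std::size_t> toErase;
        for (std::size_t s = 0; s < nRemove; ++s) toErase.push_back(F[s].second);
        std::sort(toErase.rbegin(), toErase.rend()); // erase from the back
        for (std::size_t idx : toErase) centroids.erase(centroids.begin() + idx);
        lloyd(data, centroids);
    }
}
```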
In [79,90,110], the authors embed the BasicGreedy() procedure into three algorithms which differ in the value of r only. All of these algorithms can be described as follows (see Algorithm 3). Such procedures use various values of r, from 1 up to k. If r = 1, then the algorithm selects a subset (actually, a single element) of S_2 regularly: {X_1} in the first iteration, {X_2} in the second one, etc.; in this case, n_repeats = k. If r = k then, obviously, S′ = S_2 and n_repeats = 1. Otherwise, r is selected randomly, r ∈ {2, …, k − 1}, and n_repeats depends on r: n_repeats = max{1, ⌊k/r⌋}.
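A C++ sketch of the Greedy(S, S_2, r) procedure, reconstructed from the description above (the subset-selection details are our assumptions; basicGreedy() and sse() are the helpers from the previous sketches):

```cpp
#include <algorithm>
#include <random>
#include <vector>

// Greedy(S, S2, r): add a subset S' of r centroids taken from a second
// known solution S2 to the current solution S, reduce the merged solution
// back to k centroids with basicGreedy(), and keep the best result found.
// Assumes |S| = |S2| = k.
void greedy(const std::vector<std::vector<double>>& data,
            std::vector<std::vector<double>>& S,
            const std::vector<std::vector<double>>& S2,
            std::size_t r, std::mt19937& rng) {
    const std::size_t k = S.size();
    std::size_t nRepeats =
        (r == 1) ? k : (r >= k ? 1 : std::max<std::size_t>(1, k / r));
    for (std::size_t rep = 0; rep < nRepeats; ++rep) {
        auto merged = S;
        if (r == 1) {
            merged.push_back(S2[rep]);            // regular enumeration of S2
        } else if (r >= k) {
            merged.insert(merged.end(), S2.begin(), S2.end()); // S' = S2
        } else {                                   // random subset of size r
            std::vector<std::size_t> idx(S2.size());
            for (std::size_t i = 0; i < idx.size(); ++i) idx[i] = i;
            std::shuffle(idx.begin(), idx.end(), rng);
            for (std::size_t s = 0; s < r; ++s) merged.push_back(S2[idx[s]]);
        }
        basicGreedy(data, merged, k);             // eliminate excessive centroids
        if (sse(data, merged) < sse(data, S)) S = merged; // keep improvement
    }
}
```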
If the solution S_2 is fixed, then all possible results of applying the Greedy(S, S_2, r) procedure form a neighborhood of the solution S, and S_2 as well as r are parameters of this neighborhood. If S_2 is a locally optimal solution obtained by the Lloyd(S_2′) procedure applied to a randomly chosen subset S_2′ ⊂ {A_1, …, A_N}, |S_2′| = k, then we deal with a randomized neighborhood.
Let us denote such a neighborhood by GREEDY_r(S). Our experiments in Section 3 demonstrate that the result of the local search in GREEDY_r neighborhoods strongly depends on r.

Variable Neighborhood Search
The dependence of the local search result on the neighborhood selection is reduced if we use a certain set of neighborhoods and alternate them. This approach is the basis of VNS algorithms. The idea of alternating neighborhoods is easy to adapt to various problems [76-78] and highly efficient, which makes it very useful for solving NP-hard problems, including clustering, location, and vehicle routing problems. In [111,112], Brimberg and Mladenovic and Miskovic et al. used VNS for solving various facility location problems. Crainic et al. [113] as well as Hansen and Mladenovic [114] proposed and developed a parallel VNS algorithm for the k-median problem. In [115], a VNS algorithm was used for vehicle routing and driver scheduling problems by Wen et al.
The ways of neighborhood alternation may differ significantly. Many VNS algorithms are not even classified by their authors as VNS algorithms. For example, the algorithm in [57] alternates between discrete and continuous problems: when solving a discrete problem, the set of local optima is replenished, and then such local optima are chosen as elements of the initial solution of the continuous problem. A similar idea of the recombinator k-means algorithm was proposed by C. Baldassi [116]. This algorithm restarts the k-means procedure, using the results of previous runs as a reservoir of candidates for the new initial solutions, exploiting the popular k-means++ seeding algorithm to piece them together into new, promising initial configurations. Thus, the k-means search alternates with the discrete problem of finding an optimal initial centroid combination.
The VNS class includes the abovementioned, very efficient J-Means algorithm [80], which alternates the search in a SWAP neighborhood with the use of Lloyd's procedure. Even when searching only in the SWAP_1 neighborhood, the J-Means results can be many times better than the results of Lloyd's procedure launched in the multi-start mode, as shown in [62,97].

Algorithm 5. RVNS(S)
Require: Initial solution S, selected neighborhoods n_l, l = 1,…,l_max.
repeat
  l ← 1;
  while l ≤ l_max do
    select randomly S′ ∈ n_l(S);
    if f(S′) < f(S) then S ← S′; l ← 1;
    else l ← l + 1;
    end if;
  end while;
until the stop conditions are satisfied.
Algorithms of the RVNS scheme are more efficient when solving large-scale problems [50], for which the use of the deterministic VND scheme requires too large a computational cost per iteration. In many efficient algorithms, l_max = 2. For example, the J-Means algorithm combines a SWAP neighborhood search with Lloyd's procedure.
As a rule, algorithm developers propose to move from neighborhoods of lower cardinality to wider neighborhoods. For instance, in [79], the authors propose a sequential search in the neighborhoods GREEDY_1 → GREEDY_random → GREEDY_k → GREEDY_1 → …. Here, GREEDY_random is a neighborhood with randomly selected r ∈ {2, …, k − 1}. In this case, the initial neighborhood type has a strong influence on the result [79]. However, the best initial value of parameter r is hardly predictable.
In this article, we propose a new RVNS algorithm which involves GREEDY r neighborhood search with a gradually decreasing r and automatic adjustment of the initial r value. Computational experiments show the advantages of this algorithm in comparison with the algorithms searching in SWAP neighborhoods as well as in comparison with known search algorithms with GREEDY r neighborhoods.

New Algorithm
A search in a GREEDY_r neighborhood with a fixed value of r, on various practical problems listed in the repositories [108,109,118], shows that the result (the value of the objective function) essentially depends on r, and this dependence differs for various problems, even if the problems have similar basic numerical characteristics, such as the number of data vectors N, their dimension d, and the number of clusters k. The results are shown in Figures 2 and 3. At the same time, our experiments show that at the first iterations, the use of Algorithm 3 almost always leads to an improvement in the SSE value, and then the probability of such a success decreases. Moreover, the search in neighborhoods with large r values stops giving improving results sooner, while the search in neighborhoods with small r, in particular, with r = 1, enables us to obtain improved solutions over a longer time. The search in the GREEDY_1 neighborhood corresponds to the adjustment of individual centroid positions. Thus, the possible decrement of the objective function value is not the same for different values of r.

We propose the following sequence of neighborhoods: GREEDY_{r_0} → GREEDY_{r_1} → GREEDY_{r_2} → … → GREEDY_1 → GREEDY_k → …. Here, the r values gradually decrease: r_0 > r_1 > r_2 > …. After reaching r = 1, the search continues in the GREEDY_k neighborhood, and after that the value of r starts decreasing again. Moreover, the r value fluctuates within certain limits at each stage of the search. This algorithm can be described as follows (Algorithm 6).
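A C++ sketch of this decreasing-r search scheme (Algorithm 6); the halving schedule r ← ⌈r/2⌉ and the random generation of S_2 are our simplifying assumptions, since the exact schedule fluctuates within certain limits at each stage:

```cpp
#include <chrono>
#include <random>
#include <vector>

// Decreasing-r VNS (a sketch of Algorithm 6): search in GREEDY_r with r
// decreasing from r0 down to 1, then restart from r = k, until the time
// budget is exhausted. Assumes lloyd() and greedy() as sketched above.
void greedyVNS(const std::vector<std::vector<double>>& data,
               std::vector<std::vector<double>>& S,
               std::size_t r0, double timeLimitSec, std::mt19937& rng) {
    const std::size_t k = S.size();
    auto start = std::chrono::steady_clock::now();
    auto elapsed = [&] {
        return std::chrono::duration<double>(
            std::chrono::steady_clock::now() - start).count();
    };
    std::size_t r = r0;
    std::uniform_int_distribution<std::size_t> pickA(0, data.size() - 1);
    while (elapsed() < timeLimitSec) {
        // S2: Lloyd's local optimum grown from a random initial centroid set
        std::vector<std::vector<double>> S2;
        for (std::size_t j = 0; j < k; ++j) S2.push_back(data[pickA(rng)]);
        lloyd(data, S2);
        greedy(data, S, S2, r, rng);      // one GREEDY_r neighborhood search
        r = (r > 1) ? (r + 1) / 2 : k;    // decrease r; after r = 1, restart at k
    }
}
```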
Genetic algorithms with greedy agglomerative heuristics are known to perform better than VNS algorithms when given sufficient computation time [79,90], which results in better SSE values. Despite this, under a limited time budget, the computational complexity of the Greedy() procedure as a genetic crossover operator leads to a situation where genetic algorithms may have enough time to complete only a very limited number of crossover operations and often only reach the second or third generation of solutions. Under these conditions, VNS algorithms are a reasonable compromise between computation cost and accuracy.
The choice of the initial value of parameter r 0 is highly important. Such a choice is quite simply carried out by a reconnaissance search with different r 0 values. The algorithm with such an automatic adjustment of the parameter r 0 by performing a reconnaissance search is described as follows (Algorithm 7).

Algorithm 7. AdaptiveGreedy(S) solver
Require: the number of reconnaissance search iterations n_recon.
select randomly S ⊂ {A_1, …, A_N}, |S| = k;
S ← Lloyd(S);
…

Results of the computational experiments described in the next section show that our new algorithm, which sequentially decreases the value of the parameter r_0, has an advantage over the known VNS algorithms.
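A compact C++ sketch of the overall AdaptiveGreedy() logic, under our assumptions about the reconnaissance phase (the candidate r_0 grid below is hypothetical; lloyd(), sse(), greedy(), and greedyVNS() are the helpers sketched above):

```cpp
#include <limits>
#include <random>
#include <vector>

// AdaptiveGreedy solver (a sketch of Algorithm 7): a short reconnaissance
// phase tries several candidate initial values r0 on the same starting
// solution, keeps the value yielding the best SSE, and then the main
// decreasing-r VNS (greedyVNS) runs with that r0.
void adaptiveGreedy(const std::vector<std::vector<double>>& data,
                    std::vector<std::vector<double>>& S,
                    int nRecon, double timeLimitSec, std::mt19937& rng) {
    const std::size_t k = S.size();
    lloyd(data, S);
    // Hypothetical candidate grid for r0; the article estimates r0 by a
    // reconnaissance search over several values of r.
    std::vector<std::size_t> candidates = { 1, k / 4 + 1, k / 2 + 1, k };
    std::size_t bestR0 = candidates.front();
    double bestSSE = std::numeric_limits<double>::max();
    std::uniform_int_distribution<std::size_t> pickA(0, data.size() - 1);
    for (std::size_t r0 : candidates) {
        auto trial = S;
        for (int it = 0; it < nRecon; ++it) {   // a few trial iterations per r0
            std::vector<std::vector<double>> S2;
            for (std::size_t j = 0; j < k; ++j) S2.push_back(data[pickA(rng)]);
            lloyd(data, S2);
            greedy(data, trial, S2, r0, rng);
        }
        double f = sse(data, trial);
        if (f < bestSSE) { bestSSE = f; bestR0 = r0; S = trial; }
    }
    greedyVNS(data, S, bestR0, timeLimitSec, rng); // main search phase
}
```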

CUDA Implementation
The greedy agglomerative procedure (BasicGreedy) is computationally expensive. In Algorithm 2, the objective function calculation F_i ← SSE(S\{X_i}) is performed more than (K − k) · k times, and after each elimination step, the Lloyd() procedure is executed. Therefore, such algorithms are traditionally considered methods for solving comparatively small problems (hundreds of thousands of data points and hundreds of clusters). However, the rapid development of massively parallel processing systems (GPUs) enables us to solve large-scale problems with reasonable time expenses (seconds). A parallel (CUDA) implementation of the Lloyd() procedure is known [101,102], and we used this approach in our experiments.
Graphics processing units (GPUs) accelerate computations with the use of a multi-core computing architecture. CUDA (compute unified device architecture) is the most popular programming platform which enables us to use general-purpose programming languages (e.g., C++) for compiling GPU programs. The programming model uses the single instruction multiple threads (SIMT) principle [119]. A function in a CUDA program can be declared a "kernel" function and run on the streaming multiprocessors. The threads are divided into blocks, and several instances of a kernel function are executed in parallel on different nodes (blocks) of a computation grid. Each thread can be identified by the special threadIdx variable, each thread block by the blockIdx variable, and the number of threads in a block by the blockDim variable. All these variables are 3-dimensional vectors (dimensions x, y, z). Depending on the problem solved, the interpretation of these dimensions may differ; for processing 2D graphical data, x and y are used for identifying pixel coordinates.
The most computationally expensive part of Lloyd's procedure is distance computation and comparison (Step 1 of Algorithm 1). This step can be seamlessly parallelized if we calculate the distances from each individual data vector in a separate thread. Thus, threadIdx.x and blockIdx.x must indicate a data vector. The same kernel function prepares the data needed for centroid calculation (Step 2 of Algorithm 1). Such data are the sum of the data vector coordinates in a specific cluster, $\mathrm{sum}_j=\sum_{i\in\{1,\ldots,N\}:\,A_i\in C_j}A_i$, and the cardinality of the cluster, $\mathrm{counter}_j=|C_j|$. Here, j is the cluster number.
The variable sum_j is a vector (a 1-dimensional array in the program realization). To perform Step 1 of Algorithm 1 on a GPU, after the initialization sum_j ← 0 and counter_j ← 0, the following procedure (Algorithm 8) runs on (N + blockDim.x)/blockDim.x nodes of the computation grid, with blockDim.x threads in each block (in our experiments, blockDim.x = 512):

Algorithm 8. Kernel function (one thread per data vector)
l ← blockIdx.x · blockDim.x + threadIdx.x;
if l > N then return; end if
D_nearest ← +∞; n ← 1; // distance from A_l to the nearest centroid and its index
for j = 1,…,k do
  if ‖A_l − X_j‖ < D_nearest then D_nearest ← ‖A_l − X_j‖; n ← j; end if
end for;
sum_n ← sum_n + A_l;
counter_n ← counter_n + 1;
SSE ← SSE + D²_nearest. // objective function adder

If sum_j and counter_j are pre-calculated for each cluster, then Step 2 of Algorithm 1 is reduced to a single arithmetic operation for each cluster: X_j = sum_j / counter_j. If the number of clusters is not huge, this operation does not take significant computational resources. Nevertheless, its parallel implementation is even simpler: we organize k threads, and each thread calculates X_j for an individual cluster. Outside Lloyd's procedure, we use Algorithm 8 for the SSE value estimation (the variable SSE must be initialized to 0 in advance).
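A simplified CUDA sketch of these two kernels (our own illustration, not the exact kernels from our solver; a tuned implementation would use shared-memory reductions instead of plain atomics):

```cuda
#include <cfloat>

// Sketch of Algorithm 8: one thread per data vector finds its nearest
// centroid, accumulates the per-cluster coordinate sums and counters,
// and adds the squared distance to the global SSE adder.
// Launch: assignKernel<<<(N + 511) / 512, 512>>>(...), blockDim.x = 512.
__global__ void assignKernel(const float* data,      // N x d, row-major
                             const float* centroids, // k x d, row-major
                             int N, int d, int k,
                             float* sums,            // k x d, zeroed before launch
                             int* counters,          // k, zeroed before launch
                             float* sse)             // 1, zeroed before launch
{
    int l = blockIdx.x * blockDim.x + threadIdx.x;
    if (l >= N) return;
    float dNearest = FLT_MAX; // squared distance to the nearest centroid
    int n = 0;                // index of the nearest centroid
    for (int j = 0; j < k; ++j) {
        float s = 0.0f;
        for (int t = 0; t < d; ++t) {
            float diff = data[l * d + t] - centroids[j * d + t];
            s += diff * diff;
        }
        if (s < dNearest) { dNearest = s; n = j; }
    }
    for (int t = 0; t < d; ++t)                  // sum_n <- sum_n + A_l
        atomicAdd(&sums[n * d + t], data[l * d + t]);
    atomicAdd(&counters[n], 1);                  // counter_n <- counter_n + 1
    atomicAdd(sse, dNearest);                    // objective function adder
}

// Step 2 of Algorithm 1: one thread per cluster recomputes its centroid
// as X_j = sum_j / counter_j.
__global__ void updateKernel(float* centroids, const float* sums,
                             const int* counters, int d, int k)
{
    int j = blockIdx.x * blockDim.x + threadIdx.x;
    if (j >= k || counters[j] == 0) return;
    for (int t = 0; t < d; ++t)
        centroids[j * d + t] = sums[j * d + t] / counters[j];
}
```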
The second computationally expensive part of the BasicGreedy() algorithm is the estimation of the objective function value after eliminating a centroid [120]: F_i = SSE(S\{X_i}). Having calculated SSE(S), we may calculate SSE(S\{X_i}) as

$$\mathrm{SSE}(S\setminus\{X_i\})=\mathrm{SSE}(S)+\sum_{A_l\in C_i}\Big(\min_{j\ne i}\|A_l-X_j\|^2-\|A_l-X_i\|^2\Big), \qquad (5)$$

where C_i is the cluster of the eliminated centroid X_i. For calculating (5) on a GPU, after initializing F_i ← SSE(S), the following kernel function (Algorithm 9) runs for each data vector:

Algorithm 9. Kernel function (one thread per data vector)
Require: index i of the centroid being eliminated.
l ← blockIdx.x · blockDim.x + threadIdx.x;
if l > N then return; end if
D_nearest ← +∞; // distance from A_l to the nearest centroid except X_i
for j = 1,…,k do
  if j ≠ i and ‖A_l − X_j‖ < D_nearest then D_nearest ← ‖A_l − X_j‖; end if
end for;
if A_l ∈ C_i then F_i ← F_i + D²_nearest − ‖A_l − X_i‖²; end if.

All distance calculations for the GREEDY_r neighborhood search are performed by Algorithms 8 and 9. A similar kernel function was used for accelerating the local search in SWAP neighborhoods. In this function, after eliminating a centroid, a data point is included in the solution S as a new centroid.
All other parts of new and known algorithms were implemented on the CPU.

Benchmarking Data
In all our experiments, we used the classic data sets from the UCI Machine Learning and Clustering basic benchmark repositories [108,109,118]:
(a) Individual household electric power consumption (IHEPC): energy consumption data of households during several years (more than 2 million data vectors, 7 dimensions); 0-1 normalized data, with the "date" and "time" columns removed;
(b) BIRCH3 [121]: one hundred groups of points of random size on a plane (10^5 data vectors, 2 dimensions);
(c) S1: Gaussian clusters with cluster overlap (5000 data vectors, 2 dimensions);
(d) Mopsi-Joensuu and Mopsi-Finland: "geographic" data sets with a complex cluster structure, formed under the influence of natural factors such as the geometry of the city, transport communications, and urban infrastructure (Figure 4).

In our study, we do not take into account the true labeling provided by the data set (if it is known), i.e., the given predictions for known classes, and focus on the minimization of SSE only.


Computational Environment
For our computational experiments, we used the following test system: Intel Core 2 Duo E8400 CPU, 16 GB RAM, NVIDIA GeForce GTX1050ti GPU with 4096 MB RAM and a floating-point performance of 2138 GFLOPS. This choice of GPU hardware was made due to its prevalence and one of the best price/performance ratios. The program code was written in C++. We used the Visual C++ 2017 compiler embedded into Visual Studio v. 15.

Results
For all data sets, 30 attempts were made to run each of the algorithms (see Table 1, and Tables A1-A11 in Appendix A). Note: "↑", "⇑": the advantage of the new algorithm over the known algorithms is statistically significant ("↑" for the t-test and "⇑" for the Mann-Whitney U test); "↓", "⇓": the disadvantage of the new algorithm is statistically significant; "↔", "⇔": the advantage or disadvantage is statistically insignificant. The significance level is 0.01.
For comparison, we ran the local search in various GREEDY_r neighborhoods with a fixed r value. In addition, we ran various known variable neighborhood search (VNS) algorithms with GREEDY_r neighborhoods [79], see algorithms GH-VNS1-3. These algorithms use the same sequence of neighborhood types (GREEDY_1 → GREEDY_random → GREEDY_k) and differ in the initial neighborhood type: GREEDY_1 for GH-VNS1, GREEDY_random for GH-VNS2, and GREEDY_k for GH-VNS3. Unlike our new AdaptiveGreedy() algorithm, the GH-VNS1-3 algorithms increase r values, and this increase is not gradual. In addition, we included the genetic algorithm (denoted "GA-1" in Tables A1-A11) with the single-point crossover [103], real-valued genes encoded by centroid positions, and the uniform random mutation (probability 0.01). For algorithms launched in the multi-start mode (the j-Means algorithm and Lloyd's procedure), only the best results achieved in each attempt were recorded. In Tables A1-A11, such algorithms are denoted Lloyd-MS and j-Means-MS, respectively. The minimum, maximum, average, and median objective function values and the standard deviation were summarized after 30 runs. For all algorithms, we used the same realization of the Lloyd() procedure, which consumes the absolute majority of the computation time.
The best average and median values of the objective function (1) are underlined. We compared the new AdaptiveGreedy() algorithm with the known algorithm which demonstrated the best median and average results (Table 1). For comparison, we used the t-test [122,123] and non-parametric Wilcoxon-Mann-Whitney U test (Wilcoxon rank sum test) [124,125] with z approximation.
To compare the results obtained by our new algorithm, we tested the null hypothesis H_0: SSE_AdaptiveGreedy = SSE_known (the difference in the results is statistically insignificant) against the single-tailed research hypothesis H_1: SSE_AdaptiveGreedy < SSE_known (statistically different results, the new algorithm has an advantage). Here, SSE_AdaptiveGreedy are the results obtained by the AdaptiveGreedy() algorithm, and SSE_known are the results of the best known algorithm. For the t-test comparison, we selected the algorithm with the lowest average SSE value, and for the Wilcoxon-Mann-Whitney U test comparison, we selected the algorithm with the lowest SSE median value. For both tests, we calculated the p-values (the probability of accepting the null hypothesis), see p_t for the t-test and p_U for the Wilcoxon-Mann-Whitney U test in Table 1, respectively. At the selected significance level p_sig = 0.01, the null hypothesis is accepted if p_t > 0.01 or p_U > 0.01. Otherwise, the difference in algorithm results should be considered statistically significant. If the null hypothesis was accepted, we also tested a pair of single-tailed hypotheses SSE_AdaptiveGreedy = SSE_known and SSE_AdaptiveGreedy > SSE_known.
In some cases, the Wilcoxon-Mann-Whitney test shows the statistical significance of the differences in results, while the t-test does not confirm the benefits of the new algorithm. Figure 5 illustrates such a situation. Both algorithms demonstrate approximately the same results. Both algorithms periodically produce results that are far from the best SSE values, which is expressed in a sufficiently large value of the standard deviation. However, the results of the new algorithm are often slightly better, which is confirmed by the rank test.

In the comparative analysis of algorithm efficiency, the choice of the unit of time plays an important role. The astronomical time spent by an algorithm strongly depends on its implementation, the ability of the compiler to optimize the program code, and the fitness of the hardware to execute the code of a specific algorithm. Algorithms are often estimated by comparing the number of iterations performed (for example, the number of population generations for a GA) or the number of evaluations of the objective function. However, the time consumption for a single iteration of a local search algorithm depends on the neighborhood type and the number of elements in the neighborhood, and this dependence can be exponential. Therefore, comparing the number of iterations is unacceptable. Comparing the number of objective function calculations is also not quite correct. Firstly, the Lloyd() procedure, which consumes almost all of the processor time, does not calculate the objective function (1) directly. Secondly, during the operation of the greedy agglomerative procedure, the number of centroids changes (decreases from k + r down to k), and the time spent on computing the objective function also varies. Therefore, we nevertheless chose astronomical time as the scale for comparing algorithms. Moreover, all the algorithms use the same implementation of Lloyd's algorithm launched under the same conditions.
In our computational experiments, the time limitation was used as the stop condition for all algorithms. For all data sets except the largest one, we chose a time limit reasonable for using the new algorithm in interactive modes. For the IHEPC data and 50 clusters, a single run of the BasicGreedy() algorithm on the specified hardware took approximately 0.05 to 0.5 s. It is impossible to evaluate the comparative efficiency of the new algorithm within several such iterations, since in this case it does not have enough time to change the neighborhood parameter r at least once. We therefore increased the time limit to a few minutes. This time limit does not correspond to modern concepts of interactive modes of operation. Nevertheless, the rapid development of parallel computing calls for the early creation of efficient algorithmic schemes. Our experiments were performed on a mass-market system; advanced systems may cope with such large problems much faster.
As can be seen from Figure 6, the result of each algorithm depends on the elapsed time. Nevertheless, an advantage of the new algorithm is evident regardless of the chosen time limit.

The difference with the results in Table 1 is obviously insignificant (see Table 2). The ranges of SSE values in the majority of Tables A1-A11 are narrow; nevertheless, the differences are statistically significant in several cases, see Table 1. In all cases, our new algorithm outperforms the known ones or demonstrates approximately the same efficiency (the difference in the results is statistically insignificant). Moreover, the new algorithm demonstrates the stability of its results (a narrow range of objective function values).
Search results in both SWAP r and GREEDY r neighborhoods depend on a correct choice of parameter r (the number of replaced or added centroids). However, in general, local search algorithms with GREEDY r neighborhoods outperform the SWAP r neighborhood search. A simple reconnaissance search procedure enables the further improvement of the efficiency.

Discussion
The advantages of our algorithm are statistically significant for a large problem (IHEPC data), as well as for problems with a complex data structure (Mopsi-Joensuu and Mopsi-Finland data). The Mopsi data sets contain geographic coordinates of Mopsi users, which are extremely unevenly distributed in accordance with the natural organization of the urban environment, depending on street directions and urban infrastructure (Figure 6). In this case, the aim of clustering is to find some natural groups of users according to a geometric/geographic principle for assigning them to k service centers (hubs) such as shopping centers, bus stops, wireless network base stations, etc.
Geographical data sets often expose such a disadvantage of Lloyd's procedure as its inability to find a solution close to the exact one. On such data, the value of the objective function found by Lloyd's procedure in the multi-start mode often turns out to be many times greater than the values obtained by other algorithms, such as the J-Means or RVNS algorithms with SWAP neighborhoods. As can be seen from Tables A2, A3 and A5 in Appendix A, for such data, the GREEDY_r neighborhood search provides significant advantages within a limited time, and our new self-adjusting AdaptiveGreedy() solver enhances these advantages.
The VNS algorithmic framework is useful for creating effective computational tools intended to solve complex practical problems. Embedding the most efficient types of neighborhoods in this framework depends on the problem type being solved. In problems such as k-means, the search in neighborhoods with specific parameters strongly depends not only on the generalized numerical parameters of the problems, such as the number of clusters, number of data vectors, and the search space dimensionality, but also on the internal data structure. In general, the comparative efficiency of the search in GREEDY r neighborhoods for certain types of practical problems and for specific data sets remains an open question. Nevertheless, the algorithm presented in this work, which automatically performs the adjustment of the most important parameter of such neighborhoods, enables its user to obtain the best result which the variable neighborhood search in GREEDY r is able to provide, without preliminary experiments in all possible GREEDY r neighborhoods. Thus, the new algorithm is a more versatile computational tool in comparison with the known VNS algorithms.
Greedy agglomerative procedures are widely used as crossover operators in genetic algorithms [46,88,90,110]. In this case, most often, the "parent" solutions are merged completely to obtain an intermediate solution with an excessive number of centers or centroids [46,88], which corresponds to the search in the GREEDY_k neighborhood (one of the crossed "parent" solutions acts as the parameter S_2), although other versions of the greedy agglomerative crossover operator are also possible [90,110]. Such algorithms successfully compete with the advanced local search algorithms discussed in this article.
Self-configuring evolutionary algorithms [126][127][128] have been widely used for solving various optimization problems. An important direction of the further research is to study the possibility of adjusting the parameter r in greedy agglomerative crossover operators of genetic algorithms. Such procedures with self-adjusting parameter r could lead to a further increase in the accuracy of solving the k-means problem with respect to the achieved value of the objective function. Such evolutionary algorithms could also involve a reconnaissance search, which would then continue by applying the greedy agglomerative crossover operator with r values chosen from the most favorable range.
In addition, the similarity in the problem statements of the k-means, k-medoids, and k-median problems gives us reasonable hope for the applicability of the same approaches to improving the accuracy of algorithms, including VNS algorithms, by adjusting the parameter r of neighborhoods similar to GREEDY_r.

Conclusions
The process of introducing machine learning methods into all spheres of life determines the need to develop not only fast, but also highly accurate algorithms for solving the related optimization problems. As practice, including this study, shows, for some problems the most popular clustering algorithm gives a result extremely far from the optimal solution of the k-means problem.
In this research, we introduced GREEDY r search neighborhoods and found that searching in both SWAP and GREEDY r neighborhoods has advantages over the simplest Lloyd's procedure. However, the results strongly depend on the parameters of such neighborhoods, and the optimal values of these parameters differ significantly for test problems. Nevertheless, searching in GREEDY r neighborhoods outperforms searching in SWAP neighborhoods in terms of accuracy.
We hope that our new variable neighborhood search algorithm (solver) for GPUs, which is more versatile due to its self-adjusting capability and has an advantage with respect to the accuracy of solving the k-means problem over known algorithms, will encourage researchers and practitioners in the field of machine learning to build competitive systems with the lowest possible error within a limited time. Such systems should be in demand when clustering geographic data, as well as when solving a wide range of problems with the highest cost of error.

Conflicts of Interest:
The authors declare no conflict of interest. The funder had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, or in the decision to publish the results.