Self-Configuring (1 + 1)-Evolutionary Algorithm for the Continuous p-Median Problem with Agglomerative Mutation †

Abstract: The continuous p-median problem (CPMP) is one of the most popular and widely used models in location theory that minimizes the sum of distances from known demand points to the sought points called centers or medians. This NP-hard location problem is also useful for clustering (automatic grouping). In this case, the sought points are considered as cluster centers. Unlike the similar k-means model, p-median clustering is less sensitive to noisy data and the appearance of outliers (separately located demand points that do not belong to any cluster). Local search algorithms, including Variable Neighborhood Search, as well as evolutionary algorithms demonstrate rather precise results. Various algorithms based on the use of greedy agglomerative procedures are capable of obtaining very accurate results that are difficult to improve on with other methods. The computational complexity of such procedures limits their use for large problems, although computations on massively parallel systems significantly expand their capabilities. In addition, the efficiency of agglomerative procedures is highly dependent on the setting of their parameters. For the majority of practically important p-median problems, one can choose a very efficient algorithm based on the agglomerative procedures. However, the parameters of such algorithms, which ensure their high efficiency, are difficult to predict. We introduce the concept of the AGGLr neighborhood based on the application of the agglomerative procedure, and investigate the search efficiency in such a neighborhood depending on its parameter r. Using the similarities between local search algorithms and (1 + 1)-evolutionary algorithms, as well as the ability of the latter to adapt their search parameters, we propose a new algorithm based on a greedy agglomerative procedure with the automatically tuned parameter r. Our new algorithm does not require preliminary tuning of the parameter r of the agglomerative procedure, adjusting this parameter online, and thus represents a more versatile computational tool. The advantages of the new algorithm are shown experimentally on problems with a data volume of up to 2,000,000 demand points. The time limitation plays an important role: the fastest algorithms may stop their convergence after several iterations and, vice versa, the slowest algorithms may continue their slow convergence and outperform the fastest algorithms if we enlarge their time limits. However, our new algorithms demonstrate their comparative advantage within wide time intervals (see Figure 3).


Problem Statement
One of the central problems of location theory is the p-median problem. The goal of the continuous p-median problem is to find p points (centers, medians) such that the sum of the distances from N known points (called demand points or data vectors) to the nearest of the p centers reaches its minimum. Let A_i = (a_{i,1}, ..., a_{i,d}), A_i ∈ R^d, be the known demand points, and let S = {X_1, ..., X_p} ⊂ R^d be the set of sought points (medians, centers). The objective function (sum of distances) of the p-median problem is [1]:

F(X_1, \dots, X_p) = F(S) = \sum_{i=1}^{N} \min_{j=\overline{1,p}} L(A_i, X_j) \rightarrow \min_{X_1, \dots, X_p \in \mathbb{R}^d}.  (1)

Here, the integer p must be known in advance. The distance metric L(·,·) between objects is the key concept in location theory. As a rule, a metric is understood as a function or an equation that determines the distance between any points and classes in a metric space [2]. A metric space is a set of points with a defined distance function. The distance of order l between two points is determined by the Minkowski function [3]:

L(A_i, X_j) = \left( \sum_{k=1}^{d} \left| a_{i,k} - x_{j,k} \right|^{l} \right)^{1/l}.  (2)

The parameter l is determined by the researcher. We can use it to progressively increase or decrease the weight of the ith variable. Special cases of the Minkowski function are distinguished by the value of l. For l = 2, the function calculates the Euclidean metric between two points. By default, location problems use the Euclidean metric:

L(A_i, X_j) = \sqrt{ \sum_{k=1}^{d} \left( a_{i,k} - x_{j,k} \right)^{2} }.  (3)

Here, X_j = (x_{j,1}, ..., x_{j,d}) ∀ j = 1, ..., p are the sought centers, also called medians, and A_i = (a_{i,1}, ..., a_{i,d}) ∀ i = 1, ..., N are the known points called demand points or data vectors. For l = 1, the function calculates the Manhattan distance [3]. For l = +∞, the function calculates the Chebyshev distance (L∞ metric), and for l = 0, the function calculates the Hamming distance [4].
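For illustration, the objective function (1) with the Euclidean metric can be computed as in the following minimal C++ sketch; all identifiers are illustrative and not part of any reference implementation:

```cpp
// A minimal sketch of the objective function (1) with the Euclidean metric;
// each demand point contributes the distance to its nearest center.
#include <algorithm>
#include <cmath>
#include <limits>
#include <vector>

using Point = std::vector<double>;

double euclidean(const Point& a, const Point& b) {
    double s = 0.0;
    for (size_t k = 0; k < a.size(); ++k) s += (a[k] - b[k]) * (a[k] - b[k]);
    return std::sqrt(s);
}

// F(S): sum over all demand points of the distance to the nearest center.
double objective(const std::vector<Point>& demand, const std::vector<Point>& centers) {
    double f = 0.0;
    for (const Point& a : demand) {
        double best = std::numeric_limits<double>::max();
        for (const Point& x : centers) best = std::min(best, euclidean(a, x));
        f += best;
    }
    return f;
}
```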
Weber, in his work [5], investigated the problem of finding the center for a set of weighted points (the Weber problem or 1-median problem [5]) which is a special (simplest) case of the problem (1) for p = 1. At the same time, the Weber problem is a generalization of a similar Fermat problem [6] with three demand points (N = 3, p = 1). Weiszfeld, in his work [7], proved a theorem formulated by Sturm [8] and derived a sequence that converged to the optimal solution of the Weber problem. This was a version of the gradient descent algorithm [9] for the Weber problem.
The clustering problem is to divide a given set (sample) of N objects (data vectors) into p disjoint subsets, called clusters, so that each cluster consists of similar objects, and the objects of different clusters differ significantly. In the process of grouping a set of objects into certain groups (subsets), the general features of the object and the applied algorithms play an important role. Some of the automatic grouping problems can be considered from the point of view of location problems. The most popular k-means clustering model [10] can be described by Equation (1) where L (·,·) is the squared Euclidean distance between two points. The existence of a trivial non-iterative solution of the corresponding Weber problem with the squared Euclidean distances [11] makes the k-means problem one of the most popular optimization models for clustering. A disadvantage of this model is its high sensitivity to the outliers (separately located data points), and the p-median model is free from this drawback.
The network p-median problem [12] is to find the p nodes on a network that minimize the sum of the weighted distances from all nodes to the nearest of the p nodes selected as centers (medians). The network and continuous p-median problems are proved to be NP-hard [13,14].
Evolutionary algorithms are a class of search algorithms that are often used as function optimizers for static objective functions. Using specific evolutionary operators, the evolutionary algorithms recombine and modify a set (population) of candidate solutions to a problem. Various efficient evolutionary algorithms have been developed for almost all types of optimization problems, including machine learning problems. The use of genetic and other evolutionary algorithms for solving discrete and continuous location problems [55][56][57][58][59][60] is also a popular approach. For example, in [61], the authors proposed an online-learning-based reference vector evolutionary many-objective algorithm guided by a reference-vector-based decomposition strategy, which employs learning-based technology to enhance its generalization capability. This strategy employs a learning automaton that, according to feedback information from the environment, captures the features of the problem and the search state in order to adjust the mutation strategy for each sub-problem in different situations and enhance the optimization performance. The authors use clustering to form groups of reference vectors with the same mutation strategy. The work [62] focuses on a discrete optimization problem of improving the seaside operations at marine container terminals. The authors proposed a new Adaptive Island Evolutionary Algorithm for the berth scheduling problem, aiming to minimize the total weighted service cost of vessels. The developed algorithm simultaneously executes separate evolutionary algorithms in parallel on its "islands" (subpopulations of solutions) and exchanges individuals between the "islands" based on an adaptive mechanism, which enables more efficient exploration of the problem search space. The authors of [63] proposed an alternative many-objective EA, called AnD. In evolutionary many-objective optimization, there exists a problem of selecting some promising individuals from the population. The AnD uses two strategies, angle-based selection and shift-based density estimation, which are employed to remove poor individuals one by one. In a pair of individuals with the minimum vector angle (most similar search directions), one of the individuals is considered as a candidate for elimination. In addition, the EA is a popular solution methodology, showing promising performance in solving different types of vehicle routing problems [64,65]. In [66], the authors presented an efficient EA developed to solve the mathematical model, which addresses the vehicle routing problem with a "factory-in-a-box". In [67], the authors proposed an approach to distinguish between bacterial and viral meningitis using genetic programming and decision trees able to determine the typology of meningitis for any combination of clinical parameters, achieving 100% sensitivity.
The EAs are also popular for training various machine learning methods. In [68], the Salp Swarm Algorithm (SSA) was deployed to train a Multilayer Perceptron (MLP) for data classification. The proposed method shows superior results compared with other evolutionary-algorithm-based machine learning methods.
In the case of p-median and similar location problems, evolutionary algorithms recombine the initial solution obtained by the ALA procedure or a local search algorithm. In genetic algorithms, a set (population) of solutions is successively improved with the use of special genetic operators (algorithms) of initialization, selection, crossover, and mutation. The crossover operator recombines the elements ("genes") of two parent solutions for generating a new ("offspring") solution. One-point and two-point crossover operators, which are standard for many other genetic algorithms, have been proved to present drawbacks when applied to problems such as the k-means and p-median problems [69] due to generating offspring solutions too far from their parents. The GAs with the greedy agglomerative crossover operator demonstrate better results [58,70,71]. In this crossover operator, as well as in the IBC algorithms [53,54], the number of centers is successively reduced down to the desired number p. Being originally developed for the network p-median problems, they were adapted for continuous p-median and k-means problems [72]. Metaheuristic approaches, such as genetic algorithms [73], are aimed at finding the global optimum. However, in large-scale instances, such approaches require very significant computational costs, especially if they are adapted to solving continuous problems [58]. The greedy agglomerative procedure is a computationally expensive algorithm, especially in the case of continuous problems when it includes multiple executions of a local search algorithm. However, this procedure allows us to find solutions that are difficult to improve by other methods without significantly increasing the calculation time.
The concept of using greedy agglomerative procedures as the crossover operator in genetic algorithms is as follows. First, for two selected parent solutions (sets of medians) S_1 and S_2, the algorithm constructs an intermediate solution S' by adding r randomly selected medians (centers) from the second parent solution S_2 to the first solution S_1: S' = S_1 ∪ S_2'. Here, S_2' is a randomly chosen subset of S_2, |S_2'| = r. The integer parameter r can be given in advance (often, r = 1 or r = p). After improving this new intermediate solution with the ALA or a local search algorithm, one or more excessive medians are removed until |S'| = p. At each iteration, the greedy agglomerative procedure removes the median whose elimination results in the least significant increase of the objective function (1).
In [74], the authors systematized approaches to constructing algorithms for searching in neighborhoods (denoted as GREEDYr) formed using greedy algorithmic procedures for the k-means problem. Searching in SWAP and GREEDYr neighborhoods has advantages over the simplest Lloyd's procedure [75,76]. However, the results strongly depend on the parameter of the greedy agglomerative procedures, and the optimal values of these parameters differ significantly among test problems. At the same time, the GREEDYr search outperforms the SWAP search in terms of accuracy. Moreover, such algorithms often outperform more complex genetic algorithms.
Embedding the computationally expensive greedy agglomerative procedures into metaheuristics is even more computationally expensive, which limits the application of such a combination of algorithms to large problems. Nevertheless, the development of massively parallel systems expands the capabilities of such algorithmic combinations [74,77].
The (1 + 1)-evolutionary algorithms, or (1 + 1)-EAs, are similar to local search and VNS algorithms, but they have no clearly defined neighborhood; therefore, they can reach any point in the search space in a single step [78]. Unlike genetic algorithms, (1 + 1)-EAs have no true population of solutions, and usually, they successively improve a single current solution by applying the mutation operator to it.
In a (1 + 1)-evolutionary algorithm [78] for pseudo-Boolean optimization problems, the bitwise mutation operator flips each bit independently of the others with some probability p_m that depends on the length of the bit string. The current bit string is replaced by the new one if the fitness of the current bit string is not superior to the fitness of the new string.
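This scheme can be illustrated by a minimal C++ sketch of a (1 + 1)-EA with bitwise mutation for a pseudo-Boolean function; the OneMax fitness used here is only an example:

```cpp
// A minimal sketch of the classic (1 + 1)-EA for pseudo-Boolean maximization
// with bitwise mutation (each bit flipped with probability 1/n).
#include <random>
#include <vector>

int onemax(const std::vector<int>& x) {             // example fitness function
    int s = 0;
    for (int b : x) s += b;
    return s;
}

std::vector<int> one_plus_one_ea(int n, int iterations, unsigned seed = 42) {
    std::mt19937 gen(seed);
    std::bernoulli_distribution init(0.5), flip(1.0 / n);
    std::vector<int> x(n);
    for (int& b : x) b = init(gen);                  // random initial string
    int fx = onemax(x);
    for (int t = 0; t < iterations; ++t) {
        std::vector<int> y = x;
        for (int& b : y) if (flip(gen)) b = 1 - b;   // bitwise mutation
        int fy = onemax(y);
        if (fy >= fx) { x = y; fx = fy; }            // accept if not worse
    }
    return x;
}
```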
In their works, Borisovsky and Eremeev studied the performance of the (1 + 1)-EA [79] and compared other evolutionary algorithms to the (1 + 1)-EA [80]. In [80], the authors studied the conditions under which the (1 + 1)-EA is competitive with other evolutionary algorithms (EAs) in terms of the distribution of the fitness function at a given iteration and the average optimization time. Their approach is applicable when the (1 + 1)-EA mutation operator dominates the reproduction operator of the evolutionary algorithm. In this case, the lower bounds obtained for the expected optimization time of the (1 + 1)-EA can be extended to other EAs based on the dominant operator. They proved that, under the domination condition, it is an optimal search technique with respect to the probability of finding solutions of sufficient quality after a given number of iterations. In the case of domination, the (1 + 1)-EA is also preferable with respect to the expected fitness at any iteration and the expected optimization time.
Reference [81] considers the scenario of the (1 + 1)-EA and randomized local search (RLS) with memory. The authors present two new algorithms: (1 + 1)-EA-m (with a raw list and hashtable option) and RLS-m+ (and RLS-m if the function is known in advance to be unimodal). These algorithms can be regarded as very simple forms of tabu search. Empirical results, under a reasonable fitness evaluation time assumption, verify that (1 + 1)-EA-m and RLS-m+ are superior to their conventional counterparts. In paper [82], the authors investigate the (1 + 1)-EA for optimizing functions over the space {0, ..., r}^n, where r is a positive integer. They show that for linear functions over {0, 1, 2}^n, the expected runtime of this algorithm is O(n log n). This result generalizes an existing result on pseudo-Boolean functions and is derived using drift analysis. They also show that for large values of r, no upper bound for the runtime of the (1 + 1)-EA for linear functions on {0, ..., r}^n can be obtained with this approach, nor with any other approach based on drift analysis with weight-independent linear potential functions.
In [83], the authors investigate the performance of the (1 + 1)-EA on the maximum independent set problem (MISP) from a theoretical point of view. They showed that the (1 + 1)-EA can obtain an approximation ratio of (Δ + 1)/2 on this problem in expected time O(n^4), where Δ and n denote the maximum vertex degree and the number of nodes in a graph, respectively. They reveal that the (1 + 1)-EA has a better performance than the local search algorithm on an instance of MISP and demonstrate that the local search algorithm with the 3-flip neighborhood can be trapped in a local optimum, while the (1 + 1)-EA can find the global optimum in expected running time O(n^4).
In [84], the authors theoretically investigate the approximation performance of the (1 + 1)-EA on the minimum degree spanning tree (MDST) problem, which is a classical NP-hard optimization problem, and show its capability of obtaining an approximate solution for the MDST problem with a limited maximum degree in expected polynomial time.
In [85], the authors analyze the expected runtime of the (1 + 1)-EA solving robust linear optimization problems (i.e., linear problems under robust scenarios) with a cardinality constraint. They consider two common robust scenarios, i.e., deletion-robust and worst-case, and disclose the potential of (1 + 1)-EAs for robust optimization.
An interesting approach was proposed in [86]. The authors propose a (1 + 1)-fast evolutionary algorithm ((1 + 1)-FEA) with an original approach to the adjustment of the mutation rate. This algorithm uses a random variable λ ∈ {1, ..., n} which takes its values in accordance with some distribution D. The mutation operator in this (1 + 1)-FEA generates an instance λ' of λ in accordance with this distribution, applies the mutation with the mutation rate λ' to the current solution (replacing λ' bits with random values), and, if the newly obtained solution improves the objective function value, replaces the current solution with the new one and corrects the probability distribution D so that the probability of generating the current value λ' increases. The worst-case estimation of the (1 + 1)-FEA for any black-box function is significantly smaller than that for the original (1 + 1)-EA with a random mutation rate.
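The distribution-adjustment idea can be sketched as follows; the multiplicative reward rule and its constant are illustrative assumptions, not the exact rule of [86]:

```cpp
// A minimal sketch of the mutation-rate adaptation idea of the (1 + 1)-FEA:
// λ is drawn from a discrete distribution D over {1, ..., n}, and D is biased
// toward values of λ that produced improvements. The concrete update rule
// (multiplicative reward with renormalization) is an illustrative assumption.
#include <random>
#include <vector>

struct AdaptiveLambda {
    std::vector<double> weights;   // unnormalized distribution D
    std::mt19937 gen{12345};

    explicit AdaptiveLambda(int n) : weights(n, 1.0) {}

    int sample() {                 // draw λ' in {1, ..., n} according to D
        std::discrete_distribution<int> d(weights.begin(), weights.end());
        return d(gen) + 1;
    }

    void reward(int lambda, double factor = 1.1) {
        weights[lambda - 1] *= factor;  // successful λ' becomes more probable
        double sum = 0.0;
        for (double w : weights) sum += w;
        for (double& w : weights) w /= sum;  // keep the weights normalized
    }
};
```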

Research Gap and Our Contribution
Many important practical problems require obtaining a solution with a minimum error. For the p-median problem, such cases may include optimal location problems with a high cost of error, determined, for example, by the cost of transportation between sites. If the p-median problem is considered as a mathematical statement of clustering problems, it may be necessary to obtain a solution that would be difficult to improve by known methods without a multiple increase in the computation time. When comparing the accuracy of various algorithms, we need a reference solution, which is not necessarily the exact global optimum of the problem but is the best known solution. In this work, aimed at obtaining the most accurate solutions, we understand the accuracy of an algorithm exclusively as the achieved value of the objective function.
Algorithms based on the use of greedy agglomerative procedures, despite their computational costs and the absence of guarantees of an exact result or even a result with a predetermined accuracy, enable one to obtain solutions to practical p-median problems that are difficult to improve by other known methods in comparable time [71,77]. Such algorithms are extremely sensitive to the parameter r of these procedures, which determines the number of centers (medians) added to the intermediate solution [77]. In the field of (1 + 1)-evolutionary algorithms for pseudo-Boolean optimization problems, there are known approaches that can adjust a similar mutation parameter that determines the number of replaced bits in the current solution.
The working hypotheses of this article are as follows:
1. The efficiency of the greedy agglomerative procedure applied to the successive improvement of a p-median problem solution embedded into a more complex algorithm, such as an evolutionary algorithm, highly depends on its parameter r (the number of excessive centers to be eliminated), and this dependence is hardly predictable.
2. The principle of adjusting the numerical parameter of the mutation operator by changing the probability distribution of its values, which is used in (1 + 1)-evolutionary algorithms with 0-1 coding of solutions for pseudo-Boolean optimization problems, can also be effectively applied to adjust the numerical parameter of the agglomerative mutation operator based on the use of agglomerative procedures in an evolutionary algorithm with real coding of solutions when solving p-median problems.
We propose a new algorithm based on the ideas of known search algorithms with greedy agglomerative procedures and (1 + 1)-evolutionary algorithms, which successively improves the current solution with the use of greedy agglomerative procedures with a randomly chosen value of the parameter r and simultaneously adjusts the probability distribution of the parameter r in such procedures.
This article is an extended version of the paper presented at the International Workshop on Mathematical Models and their Applications (IWMMA'2020, 16-18 November 2020, Krasnoyarsk, Russia) [79].
The rest of this article is organized as follows. In Section 2, we describe known algorithms for the p-median problem, including the algorithms with greedy agglomerative procedures. We investigate the dependence of their efficiency on the parameter r. We propose a massively parallel (CUDA) version of the greedy agglomerative procedures. In addition, using the similarities between local search algorithms and (1 + 1)-evolutionary algorithms, as well as the ability of the latter to adapt their search parameters, we propose a new algorithm based on the use of greedy agglomerative procedures with the automatically tuned parameter r. In Section 3, we describe the results of our computational experiments on various datasets of up to 2 million demand points. The experiments demonstrate the advantage of our new algorithm in comparison with known ones. In Section 4, we discuss the applicability of the new algorithm to practical problems and propose possible directions for the further development of such algorithms. In Section 5, we give a short conclusion.

The Basic Algorithmic Approaches
Based on the algorithm proposed by Lloyd [75] for discrete p-median problems on a network, the Alternate Location-Allocation (ALA) procedure [87,88], also known as the standard k-means procedure [76], is the most popular algorithm for the k-means and p-median problems. This simple algorithm can be considered as a special case of the similar Expectation Maximization (EM) procedure [89–92]. The ALA procedure sequentially improves an intermediate solution, enabling us to find a local minimum of the objective function (1).
In terms of continuous optimization, the ALA procedure is not a true local search algorithm since it searches for a new solution not necessarily in the ε-neighborhood of the existing solution.
However, it usually outperforms simpler local optimization methods: gradient or coordinate descent, etc.
In the case of a p-median problem, Algorithm 1 solves the Weber problem (i.e., a center search problem or 1-median problem) for each cluster. The iterative Weiszfeld procedure gives an approximate solution to the Weber problem with a given accuracy ε₁. At each subsequent iteration, the coordinates of the solution (center) X of a subset S' = {A'_1, ..., A'_{N'}} ⊂ {A_1, ..., A_N} of N' demand points are calculated from the previous solution X' as follows:

X = \left( \sum_{i=1}^{N'} \frac{A'_i}{L(A'_i, X')} \right) \Big/ \left( \sum_{i=1}^{N'} \frac{1}{L(A'_i, X')} \right).  (4)

Algorithm 1 ALA(): Alternate Location-Allocation (Lloyd's procedure)
Require: Set S of p initial centers, S = {X_1, ..., X_p}.
1. For each center X_i, i = 1, ..., p, define its cluster C_i ⊂ {A_1, ..., A_N} as the subset of demand points having the same closest center X_i.
2. For each cluster C_i, i = 1, ..., p, recalculate its center X_i, i.e., solve the Weber problem.
3. Repeat from Step 1 if Steps 1 and 2 made any changes.
These calculations are repeated until L(X, X') < ε₁. Usually, the Weiszfeld procedure converges quickly. For the majority of problems, no more than 50 iterations are required in order to reach the accuracy limits of the double data type, except the cases where the solution coincides with one of the demand points. Such situations are easily bypassed by treating points at a distance smaller than ε₂ as coinciding; here, ε₂ < ε₁ is the minimum distance at which two points are considered different. Embedding the Weiszfeld iterative procedure in an iterative ALA algorithm, which, in turn, is embedded in more complex algorithms, makes such algorithmic constructions difficult to use effectively with large-scale problems. Steps 1 and 2 of the ALA algorithm are aimed at the reduction of the objective function. Knowing the coordinates of the centers X_i, without changing their coordinates, Step 1 redistributes the demand points A_1, ..., A_N among the centers (i.e., forms clusters of demand points around the centers) so that the sum of the distances from each of the demand points to the nearest center becomes minimal.
Step 1 solves a simple discrete optimization problem.
Step 2 solves a series of continuous optimization problems (Weber problems) which minimize the total distance from the center X_i to the demand points of its cluster. Each iteration of the Weiszfeld procedure (4) is also aimed at a gradual improvement of the current solution X. For a gradual improvement of solutions, the Weber problem does not have to be solved to a given accuracy ε₁ at each iteration of the ALA algorithm. Thus, Steps 2 and 3 can take the following form:
Step 2. For each cluster C_i, i = 1, ..., p, correct its center X_i with one iteration of the Weiszfeld procedure (4).
Step 3. Repeat from Step 1 if Steps 1 and 2 moved at least one of the centers X_i by a distance exceeding ε₁.
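A minimal C++ sketch of one such ALA iteration with a single Weiszfeld step per cluster may look as follows; it reuses Point and euclidean() from the sketch above, and the tolerances are illustrative:

```cpp
// A minimal sketch of one ALA iteration: Step 1 assigns each demand point to
// its nearest center; Step 2 applies one Weiszfeld iteration (4) per cluster.
// eps2 guards against division by zero when a center coincides with a point.
#include <algorithm>
#include <cmath>
#include <limits>
#include <vector>

bool ala_iteration(const std::vector<Point>& demand, std::vector<Point>& centers,
                   double eps1 = 1e-9, double eps2 = 1e-12) {
    const size_t p = centers.size(), d = centers[0].size();
    // Step 1: nearest-center assignment.
    std::vector<int> assign(demand.size());
    for (size_t i = 0; i < demand.size(); ++i) {
        double best = std::numeric_limits<double>::max();
        for (size_t j = 0; j < p; ++j) {
            double dist = euclidean(demand[i], centers[j]);
            if (dist < best) { best = dist; assign[i] = (int)j; }
        }
    }
    // Step 2: one Weiszfeld iteration per cluster.
    bool moved = false;
    for (size_t j = 0; j < p; ++j) {
        Point num(d, 0.0);
        double den = 0.0;
        for (size_t i = 0; i < demand.size(); ++i) {
            if (assign[i] != (int)j) continue;
            double w = 1.0 / std::max(euclidean(demand[i], centers[j]), eps2);
            for (size_t k = 0; k < d; ++k) num[k] += w * demand[i][k];
            den += w;
        }
        if (den == 0.0) continue;              // empty cluster: keep the center
        Point next(d);
        for (size_t k = 0; k < d; ++k) next[k] = num[k] / den;
        if (euclidean(next, centers[j]) > eps1) moved = true;
        centers[j] = next;
    }
    return moved;   // Step 3: repeat while at least one center moved by > eps1
}
```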
The essence of the VNS [42] is that, for some intermediate solution, we determine a set of neighborhoods of this solution. From this set, the next type of neighborhood is selected, and we apply the corresponding local search algorithm for searching in this neighborhood. If this algorithm finds an improved solution, the intermediate solution is replaced by this new solution, and the search continues in the same neighborhood. If the next local search algorithm cannot improve the solution, a new search neighborhood is selected from the set of neighborhoods. The stop criterion is the time limitation.
Several algorithms use only two types of neighborhoods: the first is the ε-neighborhood (i.e., at this step, the problem of continuous optimization is solved), and the second is the SWAP neighborhood described below. The search in SWAP neighborhoods is a popular method for solving p-median and k-means problems. The j-means [93] algorithm using these neighborhoods is one of the most accurate methods for solving such problems. The essence of the search in SWAP neighborhoods is as follows.
Let S = {X_1, ..., X_p} be a local optimum of the p-median problem in the ε-neighborhood (such a solution can be obtained by the ALA algorithm or other algorithms such as gradient descent). The search algorithm makes an attempt to replace one of the medians X_1, ..., X_p with one of the data vectors A_1, ..., A_N. If the value of the objective function has improved due to this replacement, then the search continues in the ε-neighborhood of the newly obtained solution (the problem of continuous optimization is solved again). Search in the SWAP neighborhoods can be regular (all possible replacements are enumerated) or randomized (the medians and demand points are randomly picked for the replacement).
Let us denote by SWAPr a neighborhood (set of solutions) of a current solution obtained by the replacement of r medians by demand points.
In our computational experiments (described in detail in Section 3) on various problems from the repositories [94,95], we investigated the dependence of the efficiency (ability to reach the minimum values of the objective function) of the search in the SWAPr neighborhood on the parameter r. The results obtained after a fixed runtime are given in Figure 1. Obviously, these results are highly dependent on r.
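A minimal C++ sketch of one randomized SWAPr move, reusing objective() and ala_iteration() from the sketches above, may look as follows:

```cpp
// A minimal sketch of one randomized SWAP_r move: r randomly chosen medians
// are replaced by r randomly chosen demand points, the candidate solution is
// polished by the ALA procedure, and it is accepted only if the objective
// function (1) improves.
#include <random>
#include <vector>

void swap_r_move(const std::vector<Point>& demand, std::vector<Point>& centers,
                 int r, std::mt19937& gen) {
    std::uniform_int_distribution<size_t> pickC(0, centers.size() - 1);
    std::uniform_int_distribution<size_t> pickA(0, demand.size() - 1);
    std::vector<Point> cand = centers;
    for (int t = 0; t < r; ++t)
        cand[pickC(gen)] = demand[pickA(gen)];     // random replacement
    while (ala_iteration(demand, cand)) {}         // local improvement
    if (objective(demand, cand) < objective(demand, centers))
        centers = cand;                            // accept the improvement
}
```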

Greedy Agglomerative Procedures
The agglomerative approach to solving the p-median problem is often successful [70]. To solve the p-median problem in a continuous space, the authors of [60] use genetic algorithms with various crossover operators based on greedy agglomerative procedures. Alp, Erkut, and Drezner in [70] presented a genetic algorithm for facility location problems, where evolution is facilitated by a greedy agglomerative heuristic procedure. A genetic algorithm with a faster greedy heuristic procedure for clustering and location problems was also proposed in [71].
Greedy agglomerative procedures can be used as independent algorithms or embedded into VNS or evolutionary algorithms [52,96]. Such procedures can be described as follows (Algorithm 2):

Algorithm 2 BasicAggl(): Greedy agglomerative procedure
Require: Set of initial centers S = {X_1, ..., X_K}, |S| = K > p; required final number of centers p.
1. Improve S with the two-step local search algorithm if possible;
2. while |S| > p do
3. Calculate F_i ← F(S \ {X_i}) for each X_i ∈ S;
4. Select a subset S' ⊂ S of r_toremove centers with the minimum values of the corresponding variables F_i; // By default, r_toremove = 1;
5. S ← S \ S'; improve this new solution with the two-step local search algorithm;
end while.
To improve the performance of such a procedure, the number of simultaneously eliminated centers can be calculated as r_toremove = max{1, ⌊(|S| − p) · r_coef⌋}. In [52,97], the authors used the elimination coefficient value r_coef = 0.2. This means that at each iteration, up to 20% of the excessive centers are eliminated, and such values are proved to make the algorithm faster.
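A naive C++ sketch of this greedy agglomerative procedure is given below; it recomputes F_i directly via objective() for clarity, whereas an efficient implementation computes F_i incrementally, as discussed in the CUDA implementation below:

```cpp
// A minimal sketch of the greedy agglomerative procedure (Algorithm 2):
// starting from K > p centers, repeatedly remove the centers whose
// elimination increases the objective (1) the least, improving the solution
// with ALA after each elimination round.
#include <algorithm>
#include <vector>

void basic_aggl(const std::vector<Point>& demand, std::vector<Point>& S,
                size_t p, double r_coef = 0.2) {
    while (ala_iteration(demand, S)) {}                    // initial improvement
    while (S.size() > p) {
        std::vector<double> F(S.size());                   // F_i = F(S \ {X_i})
        for (size_t i = 0; i < S.size(); ++i) {
            std::vector<Point> reduced = S;
            reduced.erase(reduced.begin() + i);
            F[i] = objective(demand, reduced);
        }
        size_t to_remove = std::max<size_t>(
            1, (size_t)((S.size() - p) * r_coef));         // r_toremove
        to_remove = std::min(to_remove, S.size() - p);
        for (size_t t = 0; t < to_remove; ++t) {           // greedy elimination
            size_t worst = std::min_element(F.begin(), F.end()) - F.begin();
            S.erase(S.begin() + worst);
            F.erase(F.begin() + worst);
        }
        while (ala_iteration(demand, S)) {}                // improve new solution
    }
}
```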
The agglomerative procedure for obtaining an AGGLr neighborhood can be defined as follows (Algorithm 3):

Algorithm 3 AGGLr(S, S_2): Agglomerative procedure with partial merging
Require: Two sets of centers S, S_2, |S| = |S_2| = p; the number r of centers of the solution S_2 which are used to obtain the resulting solution, r ∈ {1, ..., p}.
for i = 1, ..., n_repeats do
1. Select a random subset S'' ⊂ S_2, |S''| = r;
2. S''' ← BasicAggl(S ∪ S'', p);
3. if F(S''') < F(S) then S ← S''';
end for.

Such procedures use various values of r ∈ {1, ..., p}, and n_repeats depends on r. If the solution S_2 is fixed, then all possible results of applying the AGGLr(S, S_2) procedure form a neighborhood of the solution S, and S_2 as well as r are parameters of such a neighborhood. If S_2 is a randomly chosen locally optimal solution obtained by the ALA(S_2) procedure applied to a randomly chosen subset S_2 ⊂ {A_1, ..., A_N}, |S_2| = p, then we deal with a randomized neighborhood.
Let us denote such a neighborhood by AGGLr(S). Our experiments demonstrate that the obtained result of the local search in AGGLr neighborhoods strongly depends on r (see Figure 2).
As mentioned above, the greedy agglomerative procedure AGGLr can be used as the crossover operator of genetic algorithms. Algorithms proposed in [52,71] use such procedures with r = 1 and r = p for solving the k-means and p-median problems. As a rule, the VNS algorithms move from neighborhoods of lower cardinality to wider neighborhoods. For instance, in [52], the authors propose a sequential search in the neighborhoods AGGL1 → AGGLrandom → AGGLp → AGGL1 → ... Here, AGGLrandom is a neighborhood with randomly selected r ∈ {2, ..., p − 1}. In this case, the initial neighborhood type has a strong influence on the result [52]. Figure 1 shows that the dependence of the obtained objective function value on r is complex. For specific problems, there may exist specific values of the parameter r which enable us to obtain better results and which are hardly predictable. Nevertheless, a search algorithm may collect information on the most efficient values of r during its runtime.
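Under the reconstruction of Algorithm 3 given above, the AGGLr procedure can be sketched in C++ as follows; here, n_repeats is passed as a parameter, since its exact dependence on r is implementation-specific:

```cpp
// A minimal sketch of the AGGL_r procedure (Algorithm 3): r randomly chosen
// centers of the second solution S2 are merged into the current solution S,
// basic_aggl() trims the merged set back to p centers, and the result is
// kept only if it improves the objective (1).
#include <algorithm>
#include <random>
#include <vector>

void aggl_r(const std::vector<Point>& demand, std::vector<Point>& S,
            const std::vector<Point>& S2, int r, int n_repeats, std::mt19937& gen) {
    const size_t p = S.size();
    for (int it = 0; it < n_repeats; ++it) {
        std::vector<Point> pool = S2;
        std::shuffle(pool.begin(), pool.end(), gen);       // random subset S''
        std::vector<Point> merged = S;                     // S ∪ S''
        merged.insert(merged.end(), pool.begin(), pool.begin() + r);
        basic_aggl(demand, merged, p);                     // trim back to p
        if (objective(demand, merged) < objective(demand, S))
            S = merged;                                    // keep improvement
    }
}
```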

CUDA Implementation of Greedy Agglomerative Procedures
Compute Unified Device Architecture (CUDA) is a hardware-software parallel computing architecture designed for the effective use of non-graphical computing on graphics processing units (GPUs), which can significantly increase computational performance [98]. A GPU is a set of multiprocessors that simultaneously execute parallel threads grouped into data blocks. The threads execute the same instructions on different data in parallel. Communication and synchronization between blocks are impossible.
A thread is assigned an identifier (threadIdx.x) depending on its position in the block, and a block identifier (blockIdx.x) is also assigned to a thread depending on its position in the grid. The thread and block identifiers are available at run time, which allows us to set specific memory access patterns. Each thread in the GPU performs the same procedure, known as the kernel [98].
CUDA versions of the algorithms for solving discrete p-median problems [99][100][101] (which actually solve rather small discrete problems from the OR (Operations Research) Library), as well as parallel implementations of the ALA algorithm for the k-means problems, are described in the modern literature [74,98,102]. In our research, we take into account the specificity of the continuous problems.
For large amounts of data, calculating the distances L(X_i, A_j) from demand points to cluster centers is the most computationally expensive part of the ALA algorithm. The distances should be calculated to all centers, since the algorithm does not know in advance which center is the nearest one. This approach, traditionally used in solving the k-means problem, can be accelerated by using the triangle inequality [103,104] in the case of p-median problems: if the displacement of the ith center relative to the previous iteration was ΔX_i = L(X_i, X_i'), and its distance to the jth demand point was L_ij = L(X_i, A_j), then the new distance lies within L_ij ± ΔX_i. If there exists i* ∈ {1, ..., p} such that the upper bound L_{i*j} + ΔX_{i*} does not exceed the lower bound L_ij − ΔX_i for all other centers, then the nearest center for the new distances is known without recalculating them. Knowing all distances L_ij from the previous iteration and the shifts of the centers X_i, we can avoid calculating a significant part of the distances at each iteration, which is especially important when processing multidimensional data. Nevertheless, for large-scale problems, the matrix L_ij requires a significant additional amount of memory (pN values). Moreover, checking this condition for each pair (A_j, X_i), although it requires significantly less computational resources than the distance calculation, will not accelerate the execution of the algorithm in the case of its CUDA implementation, which assumes strictly simultaneous execution of all instructions by all threads. Therefore, in this work, we do not use the triangle inequality to accelerate the CUDA version of the ALA algorithm, although more complex CUDA implementations could probably take advantage of it. Note that these distances are calculated both in Step 1 of Algorithm 1 and in its Step 2, which includes the iteration of the Weiszfeld procedure (4).
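The pruning test based on these bounds can be illustrated by the following C++ sketch (not used in our CUDA version); all identifiers are illustrative:

```cpp
// A minimal sketch of the triangle-inequality pruning described above: if the
// distances from the previous iteration, corrected by the center
// displacements, already certify a unique nearest center, the exact distances
// need not be recomputed for this demand point.
#include <vector>

// Returns the index of a certified nearest center, or -1 if the exact
// distances must be recalculated. Lij[i] is the distance of the point to
// center i at the previous iteration; dX[i] is the displacement of center i.
int certified_nearest(const std::vector<double>& Lij, const std::vector<double>& dX) {
    size_t best = 0;
    for (size_t i = 1; i < Lij.size(); ++i)       // smallest upper bound
        if (Lij[i] + dX[i] < Lij[best] + dX[best]) best = i;
    double upper = Lij[best] + dX[best];
    for (size_t i = 0; i < Lij.size(); ++i)       // compare with lower bounds
        if (i != best && Lij[i] - dX[i] < upper) return -1;  // bounds overlap
    return (int)best;
}
```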
We used the following CUDA implementation of Step 1 of Algorithm 1. The step is divided into two parts. First, the variables common to all threads are initialized (Algorithm 4).

Algorithm 4 CUDA implementation of Step 1 in the ALA algorithm, part 1 (initialization)
X'_j ← 0 ∀ j = 1, ..., p // Here, X'_j are vectors used for recalculation of the centers.
counter_j ← 0 ∀ j = 1, ..., p // Counters of objects assigned to each of the centers.
For the rest of the CUDA implementation of Step 1 of the ALA algorithm, we used the number of threads N_threads = 512 for each CUDA block. The number of blocks is calculated as N_blocks = (N + N_threads − 1)/N_threads.
Thus, each thread processes a single demand point A_j (Algorithm 5):

Algorithm 5 CUDA implementation of Step 1 in the ALA algorithm, part 2 (one thread per demand point)
1. c ← index of the center nearest to A_j;
2. X'_c ← X'_c + A_j; counter_c ← counter_c + 1 // atomic operations;
3. Synchronize threads.
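A minimal CUDA kernel consistent with Algorithms 4 and 5 may look as follows; the row-wise array layout and identifiers are illustrative assumptions, and atomicAdd() on double values requires a GPU with compute capability 6.0 or higher:

```cuda
// A minimal CUDA sketch of the assignment step: one thread per demand point
// finds the nearest center and atomically accumulates the data needed to
// recalculate the centers.
__global__ void assignStep(const double* A, const double* X, int N, int p, int d,
                           double* Xsum, int* counter, int* assign) {
    int j = blockIdx.x * blockDim.x + threadIdx.x;
    if (j >= N) return;                        // excess threads do nothing
    int best = 0;
    double bestDist = 1e308;
    for (int i = 0; i < p; ++i) {              // nearest center for point j
        double s = 0.0;
        for (int k = 0; k < d; ++k) {
            double diff = A[j * d + k] - X[i * d + k];
            s += diff * diff;
        }
        if (s < bestDist) { bestDist = s; best = i; }
    }
    assign[j] = best;
    for (int k = 0; k < d; ++k)                // accumulate for recalculation
        atomicAdd(&Xsum[best * d + k], A[j * d + k]);
    atomicAdd(&counter[best], 1);
}
```

With N_threads = 512, such a kernel would be launched as assignStep<<<N_blocks, 512>>>(...), with N_blocks computed as above.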
The parallel implementation of Step 2 of the ALA algorithm (Algorithm 6, CUDA implementation of Step 2 in the ALA algorithm) uses N_threads = 512 threads for each CUDA block. The number of blocks is N_blocks2 = (p + N_threads − 1)/N_threads. The variable changed, common to all threads, is initialized to 0; it is used at the last step of the ALA algorithm and signals the need to continue the iterations.
In addition, we implemented Step 3 of the BasicAggl algorithm on the GPU. At this step, Algorithm 2 calculates the sum of the distances after removing a single center with index i: F_i = F(S \ {X_i}). Knowing the objective function value F(S) for the set S of centers, we can calculate its new value as

F_i = F(S \setminus \{X_i\}) = F(S) + \sum_{j: C_j = i} \left( \min_{k \ne i} L(X_k, A_j) - L(X_i, A_j) \right),  (10)

where C_j is the number of the center which the demand point A_j is assigned to (nearest to A_j). We also used 512 threads per block; the number of blocks is calculated in the same way as for Step 1.
After the initialization of the common variable Dsum ← 0, Algorithm 7 is started in a separate thread for each data point. After running this algorithm in parallel, F_i is calculated in accordance with (10) at Step 3 of Algorithm 2. The remaining parts of our algorithms, such as the selection, are implemented on the central processing unit and do not have any significant effect on the calculation speed. This CUDA implementation of the ALA algorithm enables us to use both the ALA algorithm and more complex algorithms that include it on large amounts of data, up to millions of demand points.
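The per-point work behind Algorithm 7 and Equation (10) can be sketched as the following CUDA kernel; the identifiers are illustrative assumptions, and Dsum is assumed to be an array of p accumulators:

```cuda
// A minimal CUDA sketch of the per-point contribution to F_i: each thread
// adds, for its demand point, the difference between the distances to the
// second-nearest and the nearest center into the accumulator of the nearest
// center, since removing that center re-assigns the point to the
// second-nearest one.
__device__ double distTo(const double* A, const double* X, int j, int i, int d) {
    double s = 0.0;
    for (int k = 0; k < d; ++k) {
        double diff = A[j * d + k] - X[i * d + k];
        s += diff * diff;
    }
    return sqrt(s);
}

__global__ void removalCost(const double* A, const double* X, int N, int p, int d,
                            double* Dsum) {
    int j = blockIdx.x * blockDim.x + threadIdx.x;
    if (j >= N) return;
    int nearest = 0;
    double d1 = 1e308, d2 = 1e308;            // nearest and second-nearest
    for (int i = 0; i < p; ++i) {
        double dist = distTo(A, X, j, i, d);
        if (dist < d1)      { d2 = d1; d1 = dist; nearest = i; }
        else if (dist < d2) { d2 = dist; }
    }
    atomicAdd(&Dsum[nearest], d2 - d1);       // contribution to F_nearest
}
```

After the kernel finishes, F_i = F(S) + Dsum[i] in accordance with (10).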

New Algorithm
As in local search algorithms, and in accordance with the general idea of (1 + 1)-EAs, our new algorithm successively improves a single intermediate solution. To improve it, the AGGLr procedure (see Algorithm 3) is applied with an additional intermediate solution, randomly generated and improved by the ALA procedure, as the second parameter S_2. A similar idea was applied for the k-means problem in VNS algorithms with randomized neighborhoods [54], which we included in the list of algorithms for comparison. The key feature of the new algorithm is that it neither fixes the value of the parameter r nor generates r values with equal probability; instead, it adjusts the generation probabilities for each of the possible r values in accordance with the results obtained. If the use of the AGGLr procedure has resulted in an improvement in the value of the objective function, our algorithm increases the probability P_r for the used r and for values close to it.
The new algorithm can be described as follows in Algorithm 8.
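A minimal C++ sketch of this adaptive scheme, reusing AdaptiveLambda, ala_iteration(), and aggl_r() from the earlier sketches, is given below; the reward rule and the time-based stopping are illustrative, and, unlike this simplified sketch, the full algorithm also increases the probabilities of values close to the used r:

```cpp
// A minimal sketch of the adaptive (1 + 1)-EA with agglomerative mutation:
// the parameter r is drawn from a probability distribution over {1, ..., p}
// that is reinforced whenever the mutated solution improves the objective.
#include <chrono>
#include <random>
#include <vector>

void aggl_ea(const std::vector<Point>& demand, std::vector<Point>& S,
             double time_limit_s, std::mt19937& gen) {
    const size_t p = S.size();
    AdaptiveLambda dist((int)p);               // distribution over r in {1,...,p}
    std::uniform_int_distribution<size_t> pickA(0, demand.size() - 1);
    auto start = std::chrono::steady_clock::now();
    while (std::chrono::duration<double>(std::chrono::steady_clock::now() - start)
               .count() < time_limit_s) {
        // Random second solution S2: p random demand points improved by ALA.
        std::vector<Point> S2;
        for (size_t i = 0; i < p; ++i) S2.push_back(demand[pickA(gen)]);
        while (ala_iteration(demand, S2)) {}
        int r = dist.sample();                 // adaptive choice of r
        double before = objective(demand, S);
        aggl_r(demand, S, S2, r, /*n_repeats=*/1, gen);
        if (objective(demand, S) < before)
            dist.reward(r);                    // make this r more probable
    }
}
```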

Results of Computational Experiments
In all our experiments, we used the classic datasets from the UCI Machine Learning and Clustering basic benchmark repositories [94,95,105]: (a) Individual Household Electric Power Consumption (IHEPC): energy consumption data of households during several years, more than 2 million demand points (data vectors) in R^2, 0-1 normalized data, "date" and "time" columns removed; (b) BIRCH3: 100 groups of points of random size, 100,000 demand points in R^2; (c,d) S1 and S4 datasets, respectively: Gaussian clusters with cluster overlap (5000 demand points in R^2); (e) Mopsi-Joensuu: geographic locations of users (6014 data vectors, 2 dimensions) in Joensuu city; (f) Mopsi-Finland: geographic locations of users (13,467 data vectors, 2 dimensions) in Finland.
For our computational experiments, we used the following test system: Intel Core 2 Duo E8400 CPU, 16 GB RAM, NVIDIA GeForce GTX1050Ti GPU with 4096 MB RAM, floating-point performance 2138 GFLOPS. This choice of GPU hardware was made due to its prevalence, as well as one of the best price/performance ratios. The program code was written in C++. We used the Visual C++ 2017 compiler embedded into Visual Studio v.15.9.5, NVIDIA CUDA 10.0 Wizards, and NVIDIA Nsight Visual Studio Edition CUDA Support v.6.0.0.
We examined the following algorithms: (a) Lloyd: the ALA algorithm in the multi-start mode; (b) j-means: the j-means algorithm (regular search in the SWAP1 neighborhood in combination with the ALA algorithm) in the multi-start mode; (c) AGGLr: randomized search in the AGGLr neighborhood, r ∈ {1, ..., p}; (d) SWAPr: local search in SWAPr neighborhoods, r ∈ {1, ..., p} (only the best result over r is given); (e-g) GH-VNS1, GH-VNS2, GH-VNS3: Variable Neighborhood Search algorithms with neighborhoods formed by the application of the AGGLr() procedure, see [52]; (h) GA-1POINT: genetic algorithm with a standard 1-point crossover; (i) GA-UNIFORM: genetic algorithm with a standard 1-point crossover and uniform random mutation [56]; (j) Aggl-EA: our new algorithm.
For all datasets, 30 attempts were made to run each of the algorithms (see Table 1 and Tables A1-A15 in Appendix A). In Table 1, we present the results of our new algorithm Aggl-EA() and the best of the listed known algorithms, i.e., the known algorithms which provided the best average or median values of the objective function (1) after 30 runs.
For the smallest and simplest test problems (see Tables A1 and A4 in Appendix A), several algorithms, including Aggl-EA() and the search in the AGGL neighborhoods, resulted in the same objective function value (the standard deviation of the result is 0.0), which is probably the global minimum. Nevertheless, the Aggl-EA() algorithm outperforms both genetic algorithms and the simplest local search algorithms such as Lloyd() or j-means, which do not reach this minimum value in all attempts. For more complex test problems, the comparative efficiency varies. However, there were no test problems for which our new algorithm demonstrated a statistically significant disadvantage in comparison with the best of the known algorithms.
The best (minimum), worst (maximum), average, and median values of the achieved objective function were calculated after 30 runs of each algorithm with a fixed time limitation. The best average and median values of the objective function (1) are underlined. We compared our new Aggl-EA() algorithm with the known examined algorithm having the best median or average results. The significance of the advantage/disadvantage of the Aggl-EA() algorithm was estimated with the t-test [106,107] and the nonparametric Wilcoxon rank sum test [108,109].
In the analysis of algorithm efficiency, researchers often estimate the computational expenses as the number of objective function calculations. The ALA procedure is the most computationally expensive part of all examined algorithms. In the ALA algorithm, the objective function (1) is never calculated directly. The program code profiling results show that Step 2 of Algorithm 1 and Step 3 of Algorithm 2 occupy more than 99% of computational resources (processor time). These steps were implemented on GPUs. Moreover, we use the same implementation of the ALA procedure and the BasicAggl() algorithm for all examined algorithms. Thus, the consumed time could be proportional to the number of iterations of the ALA algorithm (its Step 2) and the most time-consuming iterations of Algorithm 2 (its Step 3). However, the time consumption of these iterations depends on the number of centers, which successively decreases during the work of the BasicAggl() algorithm. Thus, for the p-median problems, we used the astronomical (wall-clock) time as the most informative measure of the computational expenses.
The time limitation plays an important role: the fastest algorithms may stop their convergence after several iterations and, vice versa, the slowest algorithms may continue their slow convergence and outperform the fastest algorithms if we enlarge their time limits. However, our new algorithms demonstrate their comparative advantage within wide time intervals (see Figure 3).

Discussion
The genetic algorithms and Variable Neighborhood Search are algorithmic frameworks useful for creating efficient solvers of various problems, including location problems such as the p-median problem. The greedy agglomerative approach can be efficiently used as the crossover genetic operator [58,70,71,102] as well as in the VNS [52]. However, as our experiments show (Appendix A), the correct choice of the parameter r in such procedures plays the most important role, and simple search algorithms with a constant value of r may outperform more complex VNS algorithms if r is tuned correctly.
Unlike the majority of evolutionary algorithms, the (1 + 1)-evolutionary algorithms, as well as local search algorithms, focus on the successive improvement of a single intermediate solution. The successful application of various local search algorithms, such as the SWAP search, j-means, or various VNS algorithms, to the p-median problem shows that the correct choice of a neighborhood or an algorithmic procedure for the successive improvement of a single current solution can be a more profitable strategy for solving practical problems than operating with a large population of solutions. In this work, we have increased the efficiency of one of these procedures by adjusting its parameter. Nevertheless, adjusting the parameters of greedy agglomerative procedures can also enhance the capabilities of genetic algorithms with a greedy agglomerative crossover operator, which, in our opinion, is an urgent area of further research. In addition, similar algorithms can be developed for a wider range of problems, including k-means, k-medoids, and the problem of separating a mixture of probability distributions.
Our new algorithm demands significant computational resources. Nevertheless, obtaining the most accurate results is not only of immediate practical importance in the case of a high cost of error: it also solves the problem of obtaining reference solutions with which the results of other, less computationally expensive algorithms can be compared.
As in the case of the k-means problem [74], the most complex dependence of the greedy procedure efficiency on its parameter r, and the most important advantage of our new algorithm, are observed for the largest tested problems (the IHEPC dataset), as well as for the problems of "geographic" location (the Mopsi datasets), with the sets of demand points formed under the influence of natural factors as well as factors associated with the development of urban infrastructure.
For the majority of test problems, our computational experiments demonstrate the advantage of our new algorithm or its approximately equal effectiveness in comparison with known algorithms. For large-scale problems, the effect of our new algorithm is more significant. Nevertheless, even with equal results in comparison with known algorithms, our new Aggl-EA algorithm is a more versatile tool due to its ability to adjust its parameter in accordance with its behavior. Sometimes, our new algorithm demonstrates less stable results (a higher standard deviation of the objective function), which may limit its scope or demand running it in the multi-start mode. Thus, further study of the causes and factors leading to the instability of the result is required.
We considered the use of self-adjusted agglomerative procedures as the mutation operator of the simplest evolutionary algorithm with no true population of solutions. The efficiency of embedding similar mutation operators (without any self-adjustment) was shown in [96]. Thus, the investigation of the self-adjustment features of other evolutionary operators (first of all, the crossover operator of genetic algorithms) is a promising direction for further research.

Conclusions
In this article, we introduced the concept of the AGGLr neighborhood based on the application of the agglomerative procedure and investigated the search efficiency in such a neighborhood depending on the parameter r. Using the similarities between local search algorithms and (1 + 1)-evolutionary algorithms, as well as the ability of the latter to adapt their search parameters, we introduced our new Aggl-EA() algorithm with an embedded greedy agglomerative procedure with the automatically tuned parameter r. Our computational experiments on large-scale and medium-scale problems demonstrate the advantages of the new algorithm in comparison with known algorithms, including the algorithms based on greedy agglomerative procedures.
Our computational experiments confirmed the proposed working hypotheses and led to the following conclusions. The agglomerative mutation operator, when used as part of an evolutionary algorithm, is not only able to improve its solutions: it can also be efficiently used as the only evolutionary operator in a (1 + 1)-evolutionary algorithm, with results outperforming more complex genetic algorithms. Therefore, the self-adjustment capability of other evolutionary operators based on agglomerative procedures is a promising direction for further research.
The algorithm with the adjustment of the parameter r (the number of excess centers to be removed) of the agglomerative procedure shows the most impressive results when solving large problems, as well as problems of a "geographic" nature, where the set of demand points has a complex structure formed under the influence of natural factors.
Such algorithms, implemented for massively parallel systems, are capable of solving problems with several million demand points in a reasonable time, despite the high computational complexity of the embedded agglomerative procedure, the ALA algorithm, and the Weiszfeld procedure.

Conflicts of Interest:
The authors declare no conflict of interest. The funder had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

Table A1. Comparative results for the BIRCH3 dataset. 10^5 data vectors in R^2, p = 30 centers, time limitation 10 s.