Dimension-Wise Particle Swarm Optimization: Evaluation and Comparative Analysis

This article evaluates a recently introduced algorithm that adjusts each dimension in particle swarm optimization semi-independently and compares it with traditional particle swarm optimization. The comparison is also extended to differential evolution and the genetic algorithm. This comparative study provides a clear exposition of the effects introduced by the proposed algorithm. Each optimizer is assessed on how well it finds the global minima of 24 multi-dimensional benchmark functions, each having 7, 14, or 21 dimensions. Each algorithm is put through a self-tuning session of 100 iterations to ensure convergence of its optimization parameters. The results confirm that the new variant is a significant improvement over the traditional algorithm. It also obtained notably better results than differential evolution on problems with high-dimensional spaces relative to the number of available particles.


Introduction
Particle swarm optimization (PSO) algorithms are reputed for optimizing complex multidimensional problems with a balanced trade-off between reliability of convergence and computational efficiency. PSO relies on momentum, local/personal best attraction, and global best attraction to find the global optimum. Although PSO tends to be slightly slower than differential evolution (DE), it can produce similar or even slightly better convergence rates depending on the problem at hand [1,2]. PSO also tends to be faster and more efficient than genetic algorithms (GAs), which rely heavily on random mutation and crossover.
The work reported in this article builds on prior work in which PSO was modified to optimize each dimension semi-independently [3]. This variant, called dimension-wise particle swarm optimization (DPSO), aims to increase convergence reliability at the cost of some additional time complexity. The focus of the original article [3] was on how varying the ratio of particles to dimensions and per-dimension sensitivity (henceforth referred to as ill-conditioning) affected the algorithm's ability to find the global optimum. These evaluations were carried out with scarce populations, i.e., with more dimensions than particles, which increased the difficulty and importance of exploration. The same article also introduced self-tuning as a method of PSO parameter selection. These configurations are carried over to this work with some changes to the rules, the inclusion of DE and GA in the evaluation, and an increase in the number of functions to be optimized.
An in-sample evaluation with some degree of similarity to each out-of-sample function is used for each optimizer's parameter optimization process. The most suitable optimizer parameters are determined using self-tuning. This way, the algorithm of interest can explore and change its parameters during the in-sample test should it find a statistically better set w.r.t. fitness. For the out-of-sample tests, each algorithm must rely on the generally optimized parameters derived from the self-tuning process. The results are evaluated based on the statistical mean and standard deviation of each algorithm's global optimum results for the 24 problem functions used. These benchmark functions are composed of 7-, 14-, and 21-dimensional spaces with randomly generated offsets, ill-conditioning, rotation, and overlaps. To ensure the results are reliable, each problem function is randomly generated and evaluated 30 times.

Related Work
PSO has a wide range of modifications that address specific aspects of the algorithm or the problem of interest. PSO can be improved by integrating other algorithms into its process (such as neural networks and support vector machines) or by modifying the fundamental rules for particle movement [4][5][6][7][8]. For relatively high-dimensional problems, it is usually preferred to selectively optimize a subset of dimensions at a time [9]. By occasionally changing the subset, progressive convergence towards the overall global optimum can be improved.
DPSO differs from other variants such as dual gbest PSO (DGPSO) in that it does not rely on preprocessing steps to identify, order, and select the dimensions with the greatest sensitivity [9]. A simpler method of feature selection borrows from the GA, where crossover is used to combine the particle's current location, a predetermined relevant feature vector, the global best position, and its local best position. However, this approach uses a lookahead approach, evaluating the three candidate locations before choosing the best one as the new individual, effectively tripling the population size during evaluation [10]. PSO with an enhanced learning strategy and crossover operator uses a weighted sum of local bests over all particles based on a normalized fitness distribution to achieve variations in attractor locations [11]. DGPSO's approach can be described as a divide-and-conquer approach, while PSO with crossover hybridizes features from GA into PSO. These methods improve PSO's ability to find the optimum; however, many come at the expense of notably larger preprocessing requirements or by superficially increasing the population size. The aim of DPSO is to improve the fitness with relatively small increases in computational demand by limiting the scale of the modifications. DPSO's approach is less complicated, randomly selecting dimensions to apply global and local best attractions based on a fixed probability value. The selection of dimensions is not of a fixed number, nor is it based on pre-processed information about the local search space, thus maintaining a relatively low computational cost. The binary nature of DPSO's attractors is closer to casting a net-like structure of potential attraction points on the search space solely based on the particles' current location and the location of the local and global best points; i.e., it does not require pre-processing or a look-ahead feature.
The velocity limitations imposed on DPSO are similar to the boundary restrictions of population migration PSO (PMPSO). However, DPSO's limits do not change in scale over time and only restrict each particle's maximum viable search space for the next iteration relative to its current location [12]. Velocity restriction-based improvised PSO (VRIPSO) uses a dynamic method of limiting velocity and relies on an escape velocity mechanism to explore beyond its velocity limit [13]. In contrast, DPSO warps individuals that appear to be stagnating, while VRIPSO occasionally allows particles to escape the imposed velocity limit. Warping has been applied to all algorithms in this article as a general rule to ensure that it does not become a primary difference in results. Although VRIPSO's approach may allow for faster convergence in some circumstances, DPSO's warping condition recycles particles to increase the exploration of dimensions that may not have been sufficiently covered.
Tabu search PSO has an interesting feature, where the fitness results of particles and the respective locations are saved for a specified number of iterations [8]. This type of logging has been applied for all algorithms covered in this article but without removing older records. By recording the fitness of a location for later, it is possible to skip the function evaluation step and, over time, to significantly reduce the time spent on iterations with repeated positions. An element of reserving the position was also included in all algorithms to ensure that multiple particles on the same position would only require one shared evaluation.
Another feature found in VRIPSO is that it adjusts momentum over time [13]. Continuous PSO has a more in-depth version of this momentum feature in that it determines a gradient by which to adjust the momentum factors, encouraging movement in directions with greater improvements and not just in the current direction [14]. Though these dynamic methods are appealing, DPSO was limited to using random noise injection, which can either dampen or excite the particle's movement irrespective of the perceived gradient or its direction of travel. These random perturbations allow for greater degrees of exploration along the dimensions lacking in representation or for which movement has stagnated. Given that this research is interested in solving high-dimensional problems with relatively sparse population sizes, it is important to ensure that particles do not restrict themselves to a subset of searchable dimensions while neglecting the rest.

Algorithms
For all algorithms, the local best position is the individual's recorded best value such that, in the case of minimizing the reward r t, p local ← p t if r t ≤ r local, for each of the N individuals, where N is the number of individuals used in the algorithm. The conditional replacement of the position when r t == r local allows equivalent locations to replace the current attraction point. This comparison can also be made for the global best by evaluating across each individual i of the current iteration, such that p global ← p local,i for i = argmin i (r local,i). The reward r is an evaluation of fitness w.r.t. the parameters p when they are applied to the problem of interest. As the chosen problems are evaluated over 30 runs, r is the average plus the standard deviation, so as not only to prefer parameters that perform well on average, but also to give some preference to parameters that produce consistent results. By recording with six decimal places of resolution and a range of (−1, +1) for problem functions and (0, 1) for self-tuning algorithms, the search spaces become partially discretized, ensuring the recorded values are precise in their outcomes. This setup ensures that differences in the reward do not arise from rounding off the seventh decimal place, and allows for some extrapolation into more granular ranges such as integers. As the algorithms iterate over a relatively long series of optimization steps, a log of fitness values is included to save processing time at the expense of memory. In the situation where the algorithm fails to process the current position, the fitness value is set to +∞. Should a given particle stagnate, i.e., when |∆ p t | = 0, the individual is warped to a new random position, maintaining a degree of random exploration.
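The reward aggregation and the stagnation-warping rule above can be sketched as follows (helper names are ours; the exact warping bounds are an assumption):

```python
import random
import statistics

def reward(run_results):
    # Prefer parameters that are good on average AND consistent:
    # the reward is the mean plus the standard deviation over the runs.
    return statistics.mean(run_results) + statistics.stdev(run_results)

def maybe_warp(position, delta, lower, upper, rng=random):
    # A particle whose position did not change at all (|delta p| == 0)
    # is warped to a new uniformly random position within the bounds.
    if all(d == 0 for d in delta):
        return [rng.uniform(lo, hi) for lo, hi in zip(lower, upper)]
    return position
```

A failed evaluation would simply store `float("inf")` in the fitness log, which the minimization comparisons above handle naturally.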

Traditional PSO
In the traditional PSO algorithm, each individual has a velocity and position update method [15], v t+1 = c 0 · v t + c 1 · rand(0, 1) · ( p global − p t ) + c 2 · rand(0, 1) · ( p local − p t ) and p t+1 = p t + v t+1 , respectively, where c 0 is the momentum constant, and c 1 and c 2 are the global and local attraction strengths. rand(0, 1) is a randomly generated value from a uniform distribution within the range [0, 1), which varies the degree of attraction to the respective best value every iteration, up to c 1 and c 2 , respectively. The parameter labels have been changed for ease of equating to data arrays in code; this applies to all later algorithms as well, though these parameters are entirely independent. Velocity and position vectors are subject to the dimension-wise limitations |v t+1 | ≤ v lim and p min ≤ p t+1 ≤ p max , respectively. Allowing particles to bounce off the upper and lower limits maintains a degree of activity, preventing early termination of exploration due to boundary collisions. Additionally, given that the position is stopped at the respective limit, a measure of performance at that limit will still be obtained. The general process flow for PSO is shown in Figure 1a.
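A minimal sketch of one PSO update for a single particle, following the velocity equation above. The clipping and "bounce" details are our reading of the limit rules rather than a reproduction of the original equations:

```python
import random

def pso_step(p, v, p_local, p_global, c0, c1, c2, v_lim, p_min, p_max):
    """One PSO update: velocity is clipped per dimension, and a particle
    that reaches a position boundary is stopped there with its velocity
    reversed (the "bouncing" behavior described in the text)."""
    new_p, new_v = [], []
    for d in range(len(p)):
        vel = (c0 * v[d]
               + c1 * random.random() * (p_global[d] - p[d])
               + c2 * random.random() * (p_local[d] - p[d]))
        vel = max(-v_lim[d], min(v_lim[d], vel))  # dimension-wise velocity limit
        pos = p[d] + vel
        if pos < p_min[d]:                        # bounce off lower limit
            pos, vel = p_min[d], -vel
        elif pos > p_max[d]:                      # bounce off upper limit
            pos, vel = p_max[d], -vel
        new_p.append(pos)
        new_v.append(vel)
    return new_p, new_v
```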

Dimension-Wise PSO
The dimension-wise PSO (DPSO) algorithm has been designed with the intention of handling relatively large dimensional spaces with a sparse population size. This is accomplished by determining the global and local attraction points separately for each dimension. DPSO's velocity is defined by Equation (7) and is also subject to dimension-wise limitations, followed by Equation (5). As with PSO, the individual's position update relies on Equation (4), subject to Equation (6). Percent noise injection, regulated by c 1 , is used as a precaution to improve variations in movement and increase the likelihood of escaping from local minima without causing a large change in course. Noise injection replaces the random values affecting attractor strengths and has an effect independent of a particle's distance from the optima. Visually, it can be likened to roughening a surface in proportion to the speed of a ball rolling down it, causing the ball to deviate from a trivial path. The trajectory and speed are often disrupted, but the general direction of travel is not necessarily changed, as three other elements are largely responsible for determining velocity: momentum, global best attraction, and local best attraction. The velocity limit, v lim , partially opposes the effect of percent noise injection, as it attempts to restrict the overshooting caused by an excessive build-up of momentum and attractive strength along a given dimension. Situations where a particle builds up too much momentum and is slingshot further away from the known optimal locations can help with increasing diversity in exploration, but scattering particles away from the optimal attractors can also impair the rate of convergence [16].
To prevent an excessive expansion in the required search parameters, v lim is simplified to v lim = c 6 · ( p max − p min ), where c 6 is the percentage of a given dimension's span a particle can traverse in one iteration, and p max and p min are the position boundaries bracketing the valid range of exploration along the given dimension. The primary change that gives DPSO the ability to search on a per-dimension basis is that, in Equation (7), the uniform random attraction values are replaced with probabilistic binary values, b 3 and b 5 , respectively. These values determine whether the individual should move toward the respective best position along a given dimension. The activation of b 3 and b 5 can be described as b n = 1 if rand(0, 1) < c n and b n = 0 otherwise, where c 3 and c 5 denote the respective probabilities of activation. It must be stressed that b 3 and b 5 are evaluated on a per-dimension basis, as opposed to being single values applied to every dimension. Although this discretized approach reduces the coverage of points between the two known best locations, it increases the exploration of points in the surrounding neighborhood (see Figure 2). When momentum and noise injection are accounted for, the individual is less likely to become fixated on a single vector and can partially explore the area around each point. Without noise or momentum, the probability of moving to a given point is the product, over all dimensions, of the corresponding activation terms b 3 , b 5 , 1 − b 3 , and/or 1 − b 5 . For the top-left example in Figure 2, if the lower '×' marks the global optimum, the transition probability of the point in the top-left corner is the product of one such term per dimension, where hor and vir denote the dimension with which the binary attractor is associated; the probability of transitioning to the global optimum's mark is likewise the product of the terms that direct both dimensions toward it.
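The per-dimension attractor selection can be sketched as below. The exact form of Equation (7), including how noise injection and the attraction strengths enter, is paraphrased from the surrounding text rather than reproduced; in particular, the use of c 2 and c 4 as attraction strengths is an assumption:

```python
import random

def velocity_limit(c6, p_min, p_max):
    # v_lim from the text: c6 is the fraction of a dimension's span
    # a particle may traverse in one iteration.
    return [c6 * (hi - lo) for lo, hi in zip(p_min, p_max)]

def dpso_velocity(p, v, p_local, p_global, c0, c1, c2, c3, c4, c5, v_lim):
    """Per-dimension DPSO velocity sketch: b3/b5 are Bernoulli draws with
    probabilities c3/c5 deciding whether to move toward the global/local
    best in that dimension, and c1 regulates percent noise injection,
    which can either dampen or excite movement."""
    new_v = []
    for d in range(len(p)):
        b3 = 1 if random.random() < c3 else 0  # global-best attractor on/off
        b5 = 1 if random.random() < c5 else 0  # local-best attractor on/off
        vel = (c0 * v[d]
               + c2 * b3 * (p_global[d] - p[d])
               + c4 * b5 * (p_local[d] - p[d]))
        vel *= 1.0 + c1 * random.uniform(-1.0, 1.0)  # percent noise injection
        new_v.append(max(-v_lim[d], min(v_lim[d], vel)))
    return new_v
```

Note that b3 and b5 are drawn inside the per-dimension loop, which is the defining difference from the traditional PSO update.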

Genetic Algorithm
The genetic algorithm (GA) differs substantially from PSO in its method of optimization, as it does not rely on momentum or points of attraction. Instead, it uses the fitness of individuals from the last generation [17]. The first step in modifying the parameters is selecting parents. In this study, this is carried out through a roulette-style competition without replacement, where each individual's selection fitness incorporates len(argmin( r)), the number of individuals that obtained the current minimum fitness; this allows the fitness distribution to approach a uniform distribution as more individuals produce equivalent fitness values.
As there is no order imposed on the resulting list of parent indices, they are considered sufficiently mixed. These parents are used in pairs to produce the kth child, where mod(k, q) is the modulus of k by the number of selected parents q; parent1 and parent2 are the indices of the chosen parents, whose genes are p parent1 and p parent2 , respectively. The number of parents permitted to produce offspring is defined by q = c 0 · N , where c 0 is the percentage of the population allowed to reproduce. The percent replacement is determined by (1 − c 0 ); i.e., the portion of the population that was not selected is replaced to minimize redundant evaluations. After the parents are selected, crossover occurs with b g = 1 for the random permutation g = P(G, G/2) and b g = 0 otherwise. G is the length of the genome, and P(G, X) denotes a random choice of X elements of G without replacement, while P R (G, X) is with replacement. Gene mutation for each child is determined by a random permutation with replacement g = P R (G, c 1 × G), where c 1 is the maximum allowed percent mutation a genome can undergo (the minimum is one mutated gene). For each unique gene/dimension selected for mutation, a uniform random value within the respective limits is applied. The general process flow for GA is shown in Figure 1b.
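The crossover and mutation rules above can be sketched directly; this is an illustrative reading of P(G, X) and P R (G, X), with helper names of our choosing:

```python
import random

def crossover(parent1, parent2):
    # Uniform crossover as described: half of the gene indices, chosen
    # without replacement (P(G, G/2)), come from parent1; the rest from
    # parent2.
    G = len(parent1)
    from_p1 = set(random.sample(range(G), G // 2))
    return [parent1[g] if g in from_p1 else parent2[g] for g in range(G)]

def mutate(child, c1, limits):
    # Mutation indices are drawn with replacement (P_R(G, c1*G)), with a
    # minimum of one mutated gene; each unique selected gene is replaced
    # by a uniform random value within that dimension's limits.
    G = len(child)
    k = max(1, int(c1 * G))
    for g in set(random.choices(range(G), k=k)):
        lo, hi = limits[g]
        child[g] = random.uniform(lo, hi)
    return child
```

Because the indices are drawn with replacement, the realized number of mutated genes can be smaller than c 1 × G, which matches the "maximum allowed percent mutation" wording.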

Differential Evolution
Differential evolution (DE) tends to be faster in processing due to its relative simplicity. A key difference from PSO and GA is that DE mutates its population before evaluating the fitness of its individuals [18]. The first step is choosing a random permutation a, b, c = P(N, 3) for the genetic material to be used for mutation, producing the mutant p m = p a + c 0 · ( p b − p c ), where c 0 is the degree of mutation that can be imposed on the base material provided by sample 'a'. After a mutant is generated, each gene/dimension has the possibility of going through crossover, determined by the crossover probability c 1 , where the respective mutated gene is clipped to fit within the dimension's limits and applied to the targeted individual, i.e., p i,s = p m,s . If no genes are selected for crossover, one is selected at random. The general process flow of DE is shown in Figure 1c. The random permutation of three genetic sources has a time complexity of O(3N), mutated gene production is O(3D × N), crossover using a random permutation is O(D × N), and each dimension limit check is O(D × N), resulting in a total time complexity per iteration of O((5D + 3) × N).
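A sketch of this DE/rand/1-style step, assembled from the description above (the mutation form p m = p a + c 0 · ( p b − p c ) and the forced single crossover gene follow the text; the function name and structure are ours):

```python
import random

def de_candidate(pop, i, c0, c1, limits):
    """Build a trial vector for target i: a mutant is formed from three
    distinct randomly chosen individuals, then crossover with probability
    c1 mixes it into the target (at least one gene always crosses over),
    clipping each crossed gene to its dimension limits."""
    a, b, c = random.sample([j for j in range(len(pop)) if j != i], 3)
    D = len(pop[i])
    mutant = [pop[a][d] + c0 * (pop[b][d] - pop[c][d]) for d in range(D)]
    forced = random.randrange(D)          # guarantee one crossover gene
    trial = list(pop[i])
    for d in range(D):
        if random.random() < c1 or d == forced:
            lo, hi = limits[d]
            trial[d] = min(hi, max(lo, mutant[d]))
    return trial
```

Fitness would be evaluated on the trial afterwards, replacing the target only if it improves, consistent with DE mutating before evaluation.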

Methodology
The configuration for this experiment is such that the algorithm is assigned an initial estimate of the best parameter combination, which is randomly seeded with one of the individuals to be evaluated. This algorithm (Alg 1 ) is set to optimize 5 particles for up to 100 iterations. Thirty copies of the algorithm (Alg 2 ) are generated as the problem function, each optimizing 5 particles for up to 500 iterations. For each of these optimized algorithms, 30 randomly offset and ill-conditioned copies of the in-sample problem function are set to be optimized. Alg 2 's average global optimum result across these 30 copies plus the standard deviation is taken as the algorithm's measure of fitness during self-tuning. The outcome of evaluating the fitness of Alg 2 , i.e., its final global optimum result, is used as follows: if it is better than what was recorded for the global optimum in Alg 1 , the parameters used by Alg 1 are updated to those of Alg 2 before moving on to the next iteration. After self-tuning, the optimal parameters are applied to the algorithm again with 30 separate runs of the out-of-sample function problems. The out-of-sample optimization uses 5 individuals and lasts for 1000 iterations without allowance for early termination. The mean results of the out-of-sample global best logs are recorded for plotting and the final mean and standard deviation values are recorded for tabulation.

Self-Tuning
Self-tuning is a form of bootstrapping, where the algorithm attempts to improve its own parameters based on its relative improvement in optimizing a problem [19]. Granted, it is inefficient to self-tune on the problem of interest, as this will likely result in having found the global optimum several times over. Therefore, it is preferred to self-tune using a simpler approximation of the out-of-sample problem(s) as an in-sample training step. This allows the algorithm parameters to be generally optimized for any problem similar to the in-sample function. For self-tuning to be effective, the in-sample problem must be a close approximation of the out-of-sample function set. In the case where the out-of-sample set has only one function, the in-sample function must at least be sufficiently simplified to make the additional evaluation steps worthwhile. In contrast to some alternative self-tuning methods, an external approach does not require specialized modifications or rules to adjust the parameters on the fly [20,21].
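The accept-if-better loop of external self-tuning can be sketched as follows. In the article, the candidate parameters are proposed by the optimizer itself (Alg 1 tuning Alg 2); here a uniform random proposal in (0, 1) stands in as a deliberately simplified surrogate, so this is an illustration of the update rule only:

```python
import random

def self_tune(run_in_sample, n_params, iterations=100, rng=random):
    """Keep a best parameter set; each iteration proposes parameters and
    keeps them only if the in-sample reward improves. run_in_sample(params)
    should return the mean-plus-std reward over the 30 in-sample runs."""
    best = [rng.random() for _ in range(n_params)]
    best_r = run_in_sample(best)
    for _ in range(iterations):
        cand = [rng.random() for _ in range(n_params)]  # surrogate proposal
        r = run_in_sample(cand)
        if r < best_r:                                  # minimization: lower is better
            best, best_r = cand, r
    return best, best_r
```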

Function Problems
The base problem functions used as benchmarks are: Elliptical (scale: 500), Ackley (scale: 32), Rosenbrock (scale: 7.5; offset: 1.5), Rastrigin (scale: 5.12), Drop-wave (scale: 5.12), Zero-sum (scale: 10), and Salomon (scale: 100) [22,23]. For 2 to 3 dimensions, most of these functions would be considered somewhat challenging, as they have many local minima. To add complexity to each problem, random offsets (up to 80% off center) and ill-conditioning (up to 10 times the scale), rotations, and partial-separability (30% overlap) were applied in steps as shown in Table 1 [24]. The dimensions shown in the table are for the 14D problem; however, these steps were similarly carried out for 7D and 21D problems. Ill-conditioning, rotation, and overlap effects were applied to all 7, 14, and 21 dimensions, respectively, while the 4 copies of each function had independently allocated dimensions. For the cases of overlap and rotation, some of these functions are partially dependent on the same inputs, further complicating the search process.
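The randomized problem construction above can be sketched for the offset and ill-conditioning steps (rotation and overlap are omitted here; the factory structure and names are ours, not the authors' code):

```python
import random

def make_problem(base_fn, dims, scale, max_offset=0.8, max_cond=10.0):
    """Each instance gets a random offset (up to 80% off center) and a
    random per-dimension ill-conditioning factor (up to 10x the scale),
    as described in the text."""
    offset = [random.uniform(-max_offset, max_offset) * scale for _ in range(dims)]
    cond = [random.uniform(1.0, max_cond) for _ in range(dims)]

    def problem(x):
        # Shift, then stretch each dimension before applying the base surface.
        z = [cond[d] * (x[d] - offset[d]) for d in range(dims)]
        return base_fn(z)

    return problem
```

Regenerating the problem 30 times, as done in the experiments, simply means calling the factory anew for each run.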

Results
The resulting optimal parameters after self-tuning on f 0 are shown in Appendix A (Tables A1-A3). For each algorithm, the final optimal parameters tended to be found well before reaching the limit of 100 iterations. In all cases, DPSO showed a significant improvement over PSO, which was consistently in last place. The resulting fitness in Table A4, the normalized fitness and per-problem rank in Table 2, and the plots shown in Figures A1, A4, A7 and A10 suggest that DPSO is capable of outperforming DE and GA in overall fitness by a margin of approximately 20% when using five particles on seven-dimensional problems. For 14 dimensions, Tables 3 and A5, as well as Figures A2, A5, A8, and A11, show that, although DPSO easily bests the GA results and places third or fourth in only 6 of the 24 problem functions, it underperforms w.r.t. DE by a margin of 17%. The rank factor is the average normalized fitness divided by the smallest average normalized fitness. The normalized fitness values in Tables 2-4 are the fitness results scaled for ease of comparison.
Increasing the dimensions to 21, DPSO's overall fitness is notably better, with respect to Tables 4 and A6, than that of the second-ranked DE by a margin of more than 58%. The improvements shown in Figures A3, A6, A9, and A12 also suggest that DPSO's ability to make gradual progress was not severely hindered by the small ratio of particles to dimensions. Given that the standard deviation is relatively small compared to the mean fitness, the algorithms' final rankings are expected to be reliable for the applied out-of-sample functions and chosen parameters. The results of this experiment show that DPSO and DE can be optimized through self-tuning given a sufficient number of iterations and that they both demonstrate satisfactory results. The tuned parameters allow these algorithms to remain largely effective when the target problem deviates from the in-sample function; i.e., when they are able to self-tune on an approximation of the out-of-sample problems, they can be expected to perform well. Given that the rules were applied equally to all algorithms, the primary cause for improvement in DPSO over PSO is the use of dimension-wise updates. A likely reason why a number of fitness results are relatively large is that ill-conditioning scales up the dimensional range, making it harder to find the global optimum. Regardless, every algorithm was able to demonstrate some improvement in each problem; e.g., ending with a fitness of 10^32 is a significant improvement when starting at 10^44. It is interesting that the ellipse problems had the worst performance results despite the relative simplicity of their surfaces, likely attributable to the exponential nature of each axis. PSO and DPSO have momentum factors that can cause them to overshoot or orbit the minima, but this problem does not exist for DE and GA.
The likely factor that makes some of these problems more difficult for DE and GA is that the population size, which they rely on for diversity, is relatively small compared to the number of dimensions. As such, they are entirely reliant on warping and mutation to increase the variety of candidate solutions. Without randomly generating new genomes, they can only attempt to improve on the dimensions for which there is a sufficient variety of overall fit individuals to experiment on. From Figures A1-A12, it is apparent that most algorithms tend to converge or approach convergence within the first 500 iterations, with relatively small improvements thereafter. The last improvement in the global optimum is given at x-axis value 1.0 in the plots (shown in Tables A4-A6), but data with next to no improvement at the end were removed to improve clarity where possible. The sudden improvement given by DPSO may be due to the fact that the dimension-wise activation of attractors is similar in principle to the crossover method found in GA and DE.
A separate test without logging was conducted on the elliptical function ( f 3 ) to analyze the computational costs of these algorithms (given in Table 5). As expected, DPSO is slightly more demanding than PSO by approximately 300 bytes (4.7%) for seven dimensions, further diminishing to 3.6% and 3.2% for 14 and 21 dimensions, respectively. Compared to GA and DE, both DPSO and PSO are notably more memory demanding.
In terms of execution time, DPSO requires a few milliseconds more per iteration than GA and DE. The time lapse was determined by the average execution time to complete one iteration over 200 iterations and 21 dimensions. Given that the time complexities for GA and DE are similar and smaller than for PSO and DPSO, it is likely that some areas of code are not fully optimized even though the base code was the same and care was taken to minimize deviations from the base. In exchange for a slightly larger processing time relative to the other algorithms, DPSO was able to reduce the rate at which its ability to converge worsened when the disparity between population and search-space dimensionality increased.
There are several components of the DPSO algorithm that contribute to its execution time. Their individual contributions can be measured by recording the corresponding time lapse. The time required for PSO to calculate momentum, followed by the time lapse for the local and global attraction forces, can be compared to DPSO's momentum plus noise injection and dimension-wise attraction forces. The difference in calculation time for momentum (<1 µs) versus momentum plus noise injection (36 µs), and for the regular attraction (7 µs) versus the dimension-wise attraction (116 µs), is larger than expected. A notable portion of this increase can likely be attributed to DPSO relying on a for-loop to perform its per-dimension calculations, while PSO is able to exploit the optimizations given by the numpy library. Regardless, the velocity update steps for DPSO are expected to take more time given that the attractors are decided on a per-dimension basis.

Conclusions and Future Work
This article examines the recently introduced dimension-wise particle swarm optimization (DPSO) algorithm and compares its performance with other commonly used metaheuristic optimization systems. Specifically, it compares the statistical mean of the global fitness values for DPSO, PSO, GA, and DE in a two-step process: in-sample tuning and out-of-sample evaluation. The optimal parameters for each algorithm were selected by applying self-tuning while evaluating 30 independent runs of a generic in-sample problem that approximated the set of all functions used for evaluation. To evaluate performance, each algorithm was tested on 24 separate problems, and the mean and standard deviation were obtained from 30 separate runs of each for 7, 14, and 21 dimensions. The obtained results show that DPSO performs better than DE, PSO, and GA when the population is sparse w.r.t. the number of dimensions to be explored.
In future work, it may be worth incorporating other methods, such as adaptive PSO, to reduce the number of parameters that must be tuned. It may also be interesting to investigate other options for rules on warping particles, as well as setting the population size as one of the adjustable parameters. To better grasp the benefits of DPSO, it would also be worth comparing this algorithm with more state-of-the-art variants of PSO, such as dual gbest PSO, without regard for the computational costs. Given that there is no conflict between momentum degradation methods, such as the one found in continuous PSO, and the changes made in DPSO, integrating them into DPSO may allow for further improvements in fitness with minimal additional computational cost [9].

Conflicts of Interest:
The authors declare no conflict of interest.

Symbols
The following symbols are used in this manuscript:
p t: Position of a particle at time t
v t: Velocity of a particle at time t
r: Location fitness on the problem surface at a given point/time
N: The total number of particles/individuals used
D: The total number of dimensions in the search space
i: An arbitrary particle/individual of the current iteration
t: Time step/iteration/generation
local: Denotes local attractor information
global: Denotes global attractor information
rand(0, 1): A uniformly random number within the range [0, 1)
c n: An algorithm parameter designated for tuning
b n: Binary decider for attractors and crossover