A Two-Stage Differential Evolution Algorithm with Mutation Strategy Combination

Abstract: For most differential evolution (DE) variants, premature convergence remains a challenge. The main reason is that exploration and exploitation are highly coupled in existing works. To address this problem, we present a novel DE variant that symmetrically decouples exploration and exploitation during the optimization process. In the algorithm, the whole population is divided into two symmetrical subpopulations by ascending order of fitness during each iteration; moreover, we divide the algorithm into two symmetrical stages according to the number of function evaluations (FEs). On one hand, we introduce a mutation strategy, DE/current/1, which rarely appears in the literature. It can keep sufficient population diversity and fully explore the solution space, but its convergence speed gradually slows as iteration continues. To give full play to its advantages and avoid its disadvantages, we propose a heterogeneous two-stage double-subpopulation (HTSDS) mechanism. Four mutation strategies (including DE/current/1 and its modified version) with distinct search behaviors are assigned to the superior and inferior subpopulations in the two stages, which helps manage exploration and exploitation simultaneously and independently in different components. On the other hand, an adaptive two-stage partition (ATSP) strategy is proposed, which adjusts the stage partition parameter according to the complexity of the problem. Hence, a two-stage differential evolution algorithm with mutation strategy combination (TS-MSCDE) is proposed. Numerical experiments were conducted on CEC2017, CEC2020, and four real-world optimization problems from CEC2011. The results show that when computing resources are sufficient, the algorithm is competitive, especially for complex multimodal problems.


Introduction
Proposed by Storn and Price in 1995 [1], the differential evolution (DE) algorithm is a stochastic optimization algorithm that simulates biological evolution in nature. It encodes solutions as real-valued vectors and searches for optima over a continuous space. Compared with other swarm-intelligence-based optimization algorithms, DE retains a global search strategy and guides population evolution through mutation operations based on difference vectors and crossover operations based on probabilistic selection. In addition to outstanding performance on benchmark functions, DE is widely used in problems such as pattern recognition [2], image processing [3], artificial neural networks [4], electronic communication engineering [5], and bioinformatics [6].
In DE, the original vector X selected to undergo mutation is called the target vector; the vector V obtained through the difference operation is called the donor (or mutant) vector, which is essentially a transitional vector; and the new vector U generated by crossover between the target and donor vectors is called the trial vector. The algorithm chooses whichever of X and U has the better fitness value to enter the next generation. Only a few control parameters must be set: the population size NP, scale factor F, and crossover rate CR. The procedure can be summarized as population initialization, mutation, crossover, and selection.
Mutation is crucial to the performance of the algorithm, and much work has been done on mutation strategies, leading to different exploration and exploitation features. We follow the "DE/x/y/z" form proposed by Storn et al. [1] to denote mutation strategies, where x represents the selection method of the target vector, y is the number of difference vectors, and z is the crossover method (generally binomial, so z is often omitted) [1].
Mutation can be treated as a form of local search, where the base vector acts as the center and the difference vector determines the range. Mutation strategies can be roughly categorized as DE/best/*, DE/rand/*, or DE/current-to-*/*. DE/best/* is based on the optimal individual of the population, which provides guidance toward the optimal region for local search, so it has considerable convergence speed. However, because all individuals share the same search center, it can diminish population diversity and cause premature convergence. DE/rand/* randomly chooses the base vector from the whole population, and the resulting variety of base vectors helps maintain diversity and improve convergence reliability. However, randomly selected base vectors provide no effective guidance information, so search efficiency is relatively low. Moreover, when the population is mostly concentrated in a local optimal region, the influence of selection probability can cause premature convergence. DE/current-to-*/* can be regarded as using a linear combination of the target vector and another vector as the search center. Its performance is relatively balanced, and exploration and exploitation are performed simultaneously; however, promoting one inevitably suppresses the other. Generally, no mutation strategy outperforms the others on all problems. An appropriate methodology is therefore to combine different mutation strategies based on individual fitness values in different periods of the iteration. We therefore propose a two-stage differential evolution algorithm with mutation strategy combination (TS-MSCDE).
First, we introduce a mutation strategy, DE/current/1, which has rarely appeared in the literature, and compare it with DE/best/1, DE/rand/1, and DE/current-to-best/1. DE/current/1 adopts the target vector as the base vector, so it has a strong capability to explore the solution space, which can avoid local optima to the maximum extent. However, its convergence speed and search efficiency are relatively low in the middle and later stages.
Second, in the evolution process of DE, individuals often have different tendencies toward exploration and exploitation. Superior individuals are usually located near a local or global optimum and tend to perform a local search to find better solutions, while inferior individuals are likely far from the optimal region and tend to explore globally to maintain population diversity. To exploit this, we propose a heterogeneous two-stage double-subpopulation (HTSDS) mechanism. DE/current/1 and its modified version are assigned to the superior and inferior subpopulations, respectively, in the first stage for exploration, while DE/best/1 and a modified DE/rand/1 are used in the second stage.
Finally, as mentioned above, our stage division is designed to explore in the early stage and exploit in the later stage. However, owing to the varying complexity of optimization problems, the two-stage division should be adaptive. For example, the algorithm can enter the second stage as soon as possible on unimodal problems, while it should explore more on more complex problems. Therefore, an adaptive two-stage partition (ATSP) strategy is designed to obtain the complexity of the optimization problem after population initialization, so as to adjust the stage proportion parameter.
To verify and analyze the effectiveness of our proposed algorithm, we performed experiments and comparisons on the CEC2017 benchmark set [7] to verify the impact of HTSDS and ATSP, followed by experiments with advanced algorithms on the CEC2020 benchmark set [8]. Results indicate that TS-MSCDE is highly competitive when computing resources are sufficient, especially for some difficult and complex multimodal problems. Four real-world optimization problems from CEC2011 [9] are selected to further prove the above conclusion.
The rest of this paper is organized as follows: Section 2 introduces the canonical DE algorithm, including its typical mutation, crossover, and selection operators. Section 3 reviews related work. The proposed HTSDS, ATSP, and complete TS-MSCDE procedures are introduced in Section 4. The effectiveness of the proposed mechanism and algorithm is discussed in Section 5 based on the computational results, and comparisons are drawn with other state-of-the-art algorithms. Conclusions are made, and future work discussed, in Section 6.

Differential Evolution
In this section, we introduce the basic differential evolution algorithm, including the well-known mutation strategy DE/rand/1 [1] and other widely used mutation strategies.

Initialization
An initial random population P consists of NP individuals, each represented by X_i^t = (x_{i,1}^t, x_{i,2}^t, ..., x_{i,D}^t), i = 1, 2, ..., NP, where t = 0, 1, 2, ..., T_max is the iteration number and D is the number of dimensions of the solution space. In DE, uniformly distributed random numbers are used to generate initial solutions:

x_{i,j}^0 = x_{j,min} + rand(0, 1) · (x_{j,max} − x_{j,min}),

where x_{j,max} and x_{j,min} are the maximum and minimum boundary values, respectively, on the corresponding jth dimension.
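The uniform initialization above can be sketched as follows; the function name and per-dimension bound handling are illustrative assumptions, not the paper's code:

```python
import numpy as np

def initialize_population(NP, D, x_min, x_max, seed=None):
    """Uniform initialization: x0[i,j] = x_min[j] + rand(0,1) * (x_max[j] - x_min[j])."""
    rng = np.random.default_rng(seed)
    # accept scalar or per-dimension bounds
    x_min = np.broadcast_to(np.asarray(x_min, dtype=float), (D,))
    x_max = np.broadcast_to(np.asarray(x_max, dtype=float), (D,))
    return x_min + rng.random((NP, D)) * (x_max - x_min)
```
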

Mutation
At iteration t, for each target vector X_i^t, a donor vector V_i^t is generated by DE/rand/1:

V_i^t = X_{r1}^t + F · (X_{r2}^t − X_{r3}^t). (1)

Other widely used mutation strategies include

DE/best/1: V_i^t = X_{best}^t + F · (X_{r1}^t − X_{r2}^t), (2)
DE/best/2: V_i^t = X_{best}^t + F · (X_{r1}^t − X_{r2}^t) + F · (X_{r3}^t − X_{r4}^t), (3)
DE/rand/2: V_i^t = X_{r1}^t + F · (X_{r2}^t − X_{r3}^t) + F · (X_{r4}^t − X_{r5}^t), (4)
DE/current-to-best/1: V_i^t = X_i^t + F · (X_{best}^t − X_i^t) + F · (X_{r1}^t − X_{r2}^t), (5)

where r1, r2, r3, r4, and r5 are mutually different integers randomly generated from the range {1, 2, ..., NP}, all different from i (r1 ≠ r2 ≠ r3 ≠ i). X_{best}^t is the individual vector with the best fitness value in the population at iteration t.
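The three base strategies discussed most in this paper can be sketched as a single dispatch function; the function name and strategy labels are our own illustrative choices:

```python
import numpy as np

def mutate(P, i, best, F, strategy, rng):
    """Generate a donor vector for target index i.  P is the (NP, D) population."""
    NP = len(P)
    # draw three distinct indices, all different from i
    r1, r2, r3 = rng.choice([j for j in range(NP) if j != i], size=3, replace=False)
    if strategy == "rand/1":             # V = X_r1 + F*(X_r2 - X_r3)
        return P[r1] + F * (P[r2] - P[r3])
    if strategy == "best/1":             # V = X_best + F*(X_r1 - X_r2)
        return P[best] + F * (P[r1] - P[r2])
    if strategy == "current-to-best/1":  # V = X_i + F*(X_best - X_i) + F*(X_r1 - X_r2)
        return P[i] + F * (P[best] - P[i]) + F * (P[r1] - P[r2])
    raise ValueError(strategy)
```
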

Crossover
We use binomial crossover, in which the target vector X_i^t and donor vector V_i^t exchange elements according to the following rule:

u_{i,j}^t = v_{i,j}^t if rand(0, 1) ≤ CR or j = j_rand; otherwise u_{i,j}^t = x_{i,j}^t, (7)

where the crossover rate CR ∈ [0, 1] controls how many elements are inherited from the donor vector, and j_rand is a uniformly distributed random integer in [1, D] that ensures at least one element of the trial vector is inherited from the donor vector.
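A minimal sketch of binomial crossover, with the guaranteed donor index drawn per trial vector (the helper name is ours):

```python
import numpy as np

def binomial_crossover(x, v, CR, rng):
    """Trial vector: inherit v[j] when rand <= CR or j == jrand, else keep x[j]."""
    D = len(x)
    jrand = rng.integers(D)          # guarantees at least one donor element
    mask = rng.random(D) <= CR
    mask[jrand] = True
    return np.where(mask, v, x)
```
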

Selection
The greedy selection strategy is generally used in DE. The trial vector U_i^t is assigned to X_i^{t+1} when its fitness value is equal to or better than that of X_i^t, and otherwise X_i^t is retained:

X_i^{t+1} = U_i^t if f(U_i^t) ≤ f(X_i^t); otherwise X_i^{t+1} = X_i^t, (8)

where f(·) is the fitness function (to be minimized).
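Putting initialization, DE/rand/1 mutation, binomial crossover, and greedy selection together gives the canonical loop. This is a sketch; the parameter values below are common defaults, not prescriptions from this paper:

```python
import numpy as np

def de_rand_1(fitness, D, NP=30, F=0.5, CR=0.9, max_iters=200,
              x_min=-5.0, x_max=5.0, seed=0):
    """Canonical DE/rand/1/bin for minimization."""
    rng = np.random.default_rng(seed)
    P = x_min + rng.random((NP, D)) * (x_max - x_min)   # uniform init
    f = np.array([fitness(x) for x in P])
    for _ in range(max_iters):
        for i in range(NP):
            # mutation: V = X_r1 + F*(X_r2 - X_r3)
            r1, r2, r3 = rng.choice([j for j in range(NP) if j != i],
                                    size=3, replace=False)
            v = P[r1] + F * (P[r2] - P[r3])
            # binomial crossover with one guaranteed donor element
            jrand = rng.integers(D)
            mask = rng.random(D) <= CR
            mask[jrand] = True
            u = np.where(mask, v, P[i])
            # greedy selection
            fu = fitness(u)
            if fu <= f[i]:
                P[i], f[i] = u, fu
    return P[f.argmin()], float(f.min())

sphere = lambda x: float(np.sum(x * x))
best_x, best_f = de_rand_1(sphere, D=5)
```
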

Related Work
The performance of canonical DE generally depends on the mutation strategies, crossover strategies, and associated control parameters.
Parameter setting methods can be classified as constant, random, and adaptive (including self-adaptive). Storn and Price [1] provided general guidance for parameter control: a promising range for NP is between 5D and 10D, 0.5 is an appropriate value for F, and 0.1 is a first choice for CR in most situations. However, Ronkkonen et al. [10] concluded that F = 0.9 is a good tradeoff between convergence rate and robustness, and that the setting of CR depends on the nature of the problem, with a value between 0.9 and 1 suitable for non-separable and multimodal problems and between 0 and 0.2 for separable problems. A DE variant with orthogonal crossover was proposed [11] with fixed parameters F = 0.9, CR = 0.9, and NP = D. Different conclusions have thus been drawn about constant parameter settings; no single set of parameters can be applied to all problems, and improving algorithm performance requires more flexible setting methods.
To avoid manual adjustment of parameters, many researchers have proposed random parameter mechanisms. In CoDE [12], each individual randomly selects a pair of parameter settings from a pool of three pairs. DE-APC [13] randomly selects the F and CR values of an individual from two preset constant sets, F_set and CR_set. Das et al. [14] proposed two ways to set F in DE: random and time-varying. In the random method, F is assigned a random real number from (0.5, 1); in the time-varying method, F linearly decreases over a given interval. In another method [15], the control parameters for each individual are adaptively selected from a set of 12 settings, with selection probabilities depending on the corresponding success rates. Instead of a single F value per individual, PVADE [16] uses a scaling factor vector F_m calculated from the population's diversity measure at each dimension level. In SaDE [17], the value of F is randomly selected for each individual of the current population from N(0.5, 0.3).
Another class of approaches focuses on adaptive parameter mechanisms, in which the control parameters are adjusted dynamically according to feedback from the search process or the evolution operation. In fuzzy adaptive DE (FADE), Liu and Lampinen [18] took the individual and relative fitness function values of successive generations as inputs and used fuzzy logic controllers to adjust the parameters adaptively. Brest et al. [19] proposed jDE, which encodes parameters into each individual and adjusts them through evolution. In terms of solution quality, jDE is superior to basic DE and competitive with other advanced algorithms. In JADE [20], based on the historical success information of parameters, F is generated by a Cauchy distribution and CR is sampled from a normal distribution for each individual in each generation. SHADE [21,22] is an improved version of JADE that updates F and CR with a success-history-based mechanism: instead of sampling F and CR from gradually adapted probability distributions, it maintains historical memory archives M_F and M_CR, which store sets of F and CR values that have performed well in the recent past. Zhang et al. [23] made the first attempt to model the tuning of structural hyper-parameters as a reinforcement learning problem, and applied it to tune the structural hyper-parameters of JADE, jSO [24], and other leading algorithms.
Researchers have developed many mutation strategies [25,26]. Some have good exploration ability and are suitable for global search, while others have good exploitation ability and are good at local search. Fan and Lampinen [27] proposed a triangular mutation strategy, considered a local search operator, and designed the TDE algorithm by combining it with DE/rand/1. Zhang and Sanderson introduced JADE [20], which improves optimization performance through a new mutation strategy, DE/current-to-pbest/1. Experimental results showed that JADE has convergence performance competitive with classical DE. Tanabe and Fukunaga proposed success-history-based adaptive differential evolution (SHADE) [28] based on DE/current-to-pbest/1, which ranked third among 21 algorithms at the IEEE CEC2013 competition [13]. Tanabe and Fukunaga then extended SHADE with linear population size reduction (L-SHADE) [28], which placed first in the IEEE CEC2014 competition [29]. An improved p-best mutation, MDE_pBX, was proposed to overcome premature convergence and stagnation in classical DE [30]. This mutation strategy is similar to DE/current-to-pbest/1, selecting the optimal vector from a dynamic group of q% of the population. Zhang and Yuen [31] proposed a direction-based mutation operator for DE. The idea is to construct a pool of difference vectors, recording them when fitness improves, and use the pool to guide the search of the next generation. The directional mutation operator can be applied to any DE mutation strategy.
In [32], a method is proposed that allocates an appropriate position in the mutation strategy to each individual by comprehensively considering the individual's fitness and diversity contribution, thereby achieving a good balance between early exploration and later exploitation. On the basis of the classic DE/rand/2 and DE/best/2, Li et al. [33] proposed two new variant mutation strategies, DE/e-rand/2 and DE/e-best/2, respectively. An elite guidance mechanism randomly selects individuals from the elite group as the base vector and difference vector, so as to provide clearer guidance for individual mutation without losing randomness, and a double-mutation cooperation mechanism balances global and local search.
Many problems are black-box optimization problems, in which the specific expression of the optimization target and its gradient information are unknown. Therefore, we cannot use the characteristics of the optimization target itself to obtain the global optimal solution [34]. However, we can still mine useful information to improve the performance of the algorithm. Wright [35] applied fitness landscape theory to the dynamics of biological evolution; a fitness landscape can reveal the relationship between a search space and fitness values through landscape information. Many fitness landscape measures have been proposed to understand the characteristics of optimization problems [36,37]. Auto-correlation and correlation length [38] are often used to measure the ruggedness of a fitness landscape. The fitness distance correlation coefficient (FDC) measures the difficulty of a problem [39] through the correlation between fitness values and distances to the optimum in the search domain. Lunacek et al. [40] introduced the dispersion metric (DM) to predict the presence of funnels in the fitness landscape by measuring the average distance between high-quality solutions. Malan and Engelbrecht used entropy [41] to describe fitness landscape characteristics of continuous optimization problems.
The study of the fitness landscape is important in evolutionary computing, helping researchers select an appropriate algorithm or operator according to the characteristics of a problem [42]. The combination of DE and fitness landscapes has been much discussed. One method detects the fitness landscape, determining whether a problem is unimodal or multimodal by generating several detection points on the line between the optimal and central solutions of a population, and uses different base vector selection rules and F values according to the landscape modality [43]. Li et al. [44] proposed a fitness-landscape-based hybrid DE, FLDE, in which the mutation strategy is selected by extracting local fitness landscape features of each generation of the population and combining them with a unimodal or multimodal probability distribution. In [45], fitness landscape information is used to dynamically measure the search ability of different mutation strategies, taking their historical performance into account, and to select the most appropriate strategy during evolution.
It can be seen that fitness landscape information is usually helpful to judge the complexity of optimization problems, and its inclusion can improve the performance of a DE algorithm.

TS-MSCDE
In this section, we discuss the characteristics of DE/current/1, introduce HTSDS and ATSP, and describe the implementation of TS-MSCDE.

DE/current/1
Numerous mutation strategies have appeared since DE was proposed, most derived from DE/best/1, DE/rand/1, and DE/current-to-best/1 [25]. As can be seen from Equations (1), (2) and (5), these three strategies all use a difference vector to provide disturbance, and differ in how the base vector is chosen. Among the three, DE/best/1 is usually thought to have the fastest convergence speed and the best search performance on simple unimodal problems. However, its population diversity and exploration ability may deteriorate and even be completely lost within very few iterations, leading to stagnation or premature convergence. By contrast, DE/current-to-best/1 is more inclined to balance global exploration and local exploitation, but it has poor robustness. DE/rand/1 has a strong global search capability and is effective for complex multimodal problems. However, as shown in Figure 1a, although the population is relatively evenly distributed in the solution space (vectors of different colors represent different optimal regions), the local optimal region is relatively large compared with the global optimal region. Hence, DE/rand/1 has a high probability of choosing a blue or yellow vector as the base vector X_{r1}^t, a trend that is exacerbated as iteration progresses. Therefore, the whole population has a high probability of falling into a local optimum.
We introduce a mutation strategy that avoids falling into a local optimum due to base vector selection:

DE/current/1: V_i^t = X_i^t + F · (X_{r1}^t − X_{r2}^t),

where r1 and r2 are mutually different integers randomly generated in the range {1, 2, ..., NP}, with r1 ≠ r2 ≠ i. As shown in Figure 1b, the base vector of DE/current/1 is always the target vector itself, which is equivalent to a local search by each target vector centered on itself. Therefore, it can preserve population diversity to a great extent, fully explore the solution space, and prevent the population from falling into local optima on complex multimodal problems. However, since the convergence of this mutation strategy depends entirely on the difference vector, it can converge slowly or even stagnate when the difference vector cannot provide effective direction information. DE/current/1-related algorithms are rarely applied to numerical optimization problems, perhaps because its disadvantages are no less obvious than its advantages, as mentioned above. Nevertheless, its characteristics also provide an opportunity to decouple exploration and exploitation from another perspective, and reasonable decoupling can improve algorithm performance to some extent [33,46].
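A minimal sketch of DE/current/1 for a single target index (the function name is ours):

```python
import numpy as np

def de_current_1(P, i, F, rng):
    """DE/current/1: V_i = X_i + F*(X_r1 - X_r2); the base vector is the target itself."""
    NP = len(P)
    r1, r2 = rng.choice([j for j in range(NP) if j != i], size=2, replace=False)
    return P[i] + F * (P[r1] - P[r2])
```

Because the base vector is always the current individual, the search stays centered on each target, which is exactly the diversity-preserving behavior described above.
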

HTSDS Mechanism
Because it is necessary to consider the decoupling of exploration and exploitation at different stages of the algorithm, we divide it into two stages according to the number of function evaluations (FEs). The extreme properties of DE/current/1 described above are undoubtedly well suited to the first stage, to locate possible global optimal regions. Otherwise, individuals would move closer together as iteration proceeds, diversity would gradually weaken, and the population would no longer be able to jump out of a local optimal region. However, it would be inefficient to use DE/current/1 as an undifferentiated, directionless search for all individuals in the exploration stage, and in the exploitation stage rapid convergence is needed to refine the solution within the possible global optimal regions. Therefore, we add a heterogeneous two-subpopulation mechanism. Figure 2a shows that in the early stage of evolution, individuals are uniformly scattered in the solution space, and inferior individuals are often far from optimal regions. The distance between inferior individuals is often large, so their difference vectors are relatively large, which suits exploring the solution space with a big step size. Superior individuals, however, may be located in different optimal regions, and the difference vectors formed by them have relatively small step sizes, which suits local exploitation of each optimal region. According to this analysis, to further improve the efficiency of DE/current/1 without affecting its exploration performance, we sort the population in ascending order of fitness during each iteration; with p ∈ [0, 1], the superior set S contains the top p · NP individuals and the inferior set I contains the remaining individuals. DE/current/1 is then modified to

V_i^t = X_i^t + F · (X_{S1}^t − X_{S2}^t), i ∈ S,
V_i^t = X_i^t + F · (X_{I1}^t − X_{P2}^t), i ∈ I,

where S1, S2 ∈ S with S1 ≠ S2 ≠ i, and I1 ∈ I, P2 ∈ P with I1 ≠ P2 ≠ i.
It can be seen that the difference vector of the superior population is selected only from itself, while the inferior population selects the starting vector of its difference vector from the whole population and the end vector from the inferior population. From the above formulation, the individuals of the superior and inferior populations complete the mutation together, and information sharing between the two subpopulations depends only on the population division at each iteration; it is not carried out through difference-vector selection over the whole population, as in most of the literature [47]. This has two advantages. First, the superior population can speed up the search and information sharing in the promising region without being disturbed by inferior individuals, while the inferior population can focus on exploring the solution space and finding potential solutions. Second, when the superior population stagnates or falls into a local optimum, it can restore diversity and jump out of it through the population division at each iteration. This subtle difference has a significant impact on the performance of the algorithm.
Referring to Figure 2b, when the algorithm enters the exploitation stage, the previous discussion still holds, but as iteration goes on, the population gradually converges to the optimal region and the gap among individuals gradually narrows. Although DE/current/1 can keep searching at this point, its convergence efficiency is too low without the guidance of the global best individual, so we adopt other mutation strategies to accelerate convergence:

V_i^t = X_{best}^t + F · (X_{S1}^t − X_{S2}^t), i ∈ S,
V_i^t = X_{P3}^t + F · (X_{I1}^t − X_{I2}^t), i ∈ I,

where S1, S2 ∈ S with S1 ≠ S2 ≠ i, and I1, I2 ∈ I, P3 ∈ P with I1 ≠ I2 ≠ P3. The superior population adopts DE/best/1 because most superior individuals are concentrated in the possible optimal region, i.e., the region where the superior individuals are located shows the features of a unimodal function in the later stages; DE/best/1 can therefore converge faster and refine the solution. In addition, a modified DE/rand/1 is used for the inferior individuals, which balances convergence and diversity maintenance to improve the stability of the algorithm. The heterogeneous two-stage double-subpopulation (HTSDS) mechanism is further illustrated in Figure 3. Through HTSDS, the algorithm implements a symmetric framework over both the population and the evolution process: appropriate mutation strategies are assigned to the superior and inferior subpopulations in the two stages of the iteration, so exploration and exploitation are symmetrically decoupled across different individuals in different stages. At the same time, because of the characteristics of DE/current/1, the algorithm has strong exploration ability in the early stage, which is particularly important for complex multimodal problems, although this reduces the convergence rate to some extent.
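Both HTSDS stages can be sketched as a single mutation routine. The subpopulation split and strategy assignment follow the description in the text; the exact index-selection details are an assumption of ours where the paper's equations were not fully recoverable:

```python
import numpy as np

def htsds_mutation(P, f, p, F, stage1, rng):
    """Sketch of HTSDS mutation for the whole population (both stages)."""
    NP = len(P)
    order = np.argsort(f)                      # ascending fitness: better first
    n_sup = min(max(int(p * NP), 3), NP - 3)   # keep both subpopulations usable
    S, I = order[:n_sup], order[n_sup:]
    sup, best = set(S.tolist()), order[0]
    V = np.empty_like(P)
    for i in range(NP):
        if i in sup:
            s1, s2 = rng.choice([j for j in S if j != i], size=2, replace=False)
            # stage 1: DE/current/1 within S; stage 2: DE/best/1 within S
            V[i] = (P[i] if stage1 else P[best]) + F * (P[s1] - P[s2])
        elif stage1:
            # modified DE/current/1: end vector from I, start vector from P
            i1 = rng.choice([j for j in I if j != i])
            p2 = rng.choice([j for j in range(NP) if j not in (i, i1)])
            V[i] = P[i] + F * (P[i1] - P[p2])
        else:
            # modified DE/rand/1: base from P, difference vector from I
            i1, i2 = rng.choice([j for j in I if j != i], size=2, replace=False)
            p3 = rng.choice([j for j in range(NP) if j not in (i1, i2)])
            V[i] = P[p3] + F * (P[i1] - P[i2])
    return V
```
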

ATSP Strategy
In HTSDS, we divide the algorithm into two stages according to FEs. The question now is how to set a threshold ps ∈ (0, 1) so that ps * Max_FEs can be used for a partition. Problems have varying complexity. For simple unimodal problems, ps should be small so that the algorithm can converge as soon as possible, while ps should be increased for complex multimodal problems to improve the exploration proportion and avoid prematurely falling into a local optimum. Therefore, to further improve the performance of the algorithm on different problems, we should try to obtain an abstract mathematical representation of a problem's complexity.
The study of the fitness landscape [35] is important in evolutionary computing; its purpose is to explain the behavior of evolutionary algorithms in solving optimization problems. Among the many fitness landscape measures, the fitness distance correlation coefficient (FDC) [39] has been used by many evolutionary algorithms for its simplicity and efficiency, and has significantly improved algorithm performance. FDC-based DE algorithms have incorporated it into mutation strategy selection [48,49] to improve performance on different optimization problems. We use FDC to design a simple but efficient adaptive two-stage partition (ATSP) strategy that solves the stage partition problem of HTSDS.
The FDC method measures the correlation between fitness values and distances to the optimal solution in the search space. Ideally, it requires a representative set of fitness landscape samples and information about the global optimum. FDC was originally defined through the joint distribution of fitness values and distances [39]:

r_FD = (1/n) Σ_{i=1}^{n} (f_i − f̄)(d_i − d̄) / (s_F · s_D), (12)

where r_FD ∈ [−1, 1], f̄ and d̄ are the means of the fitness values and distances of the n samples, and s_F and s_D are the standard deviations of the fitness values and distances, respectively. The distance function is d_i = d(x_i, x_g). The global optimum x_g is generally unknown, so it is replaced by the best candidate solution, x_g = argmin_{x ∈ χ} f(x), where χ is the set of fitness landscape samples. Many algorithms adopt the random walk method [49] to obtain the sample set. With no additional computational cost, we instead use a Latin hypercube design (LHD) [50] during initialization, so the initial population covers the solution space uniformly and can serve as the fitness landscape sample set. The initialization process can be expressed as

x_{i,j}^0 = x_{j,min} + lhd(0, 1) · (x_{j,max} − x_{j,min}), (13)

where i = 1, 2, ..., NP, j = 1, 2, ..., D, and lhd denotes an LHD function that generates stratified random samples. To illustrate how FDC characterizes problems, we choose 10 optimization problems from CEC2020 [8], whose FDC values in 5, 10, 15, and 20 dimensions, calculated from Equations (12) and (13), are shown in Figure 4. The FDC value is generally high for a simple unimodal function such as function 1, and low for complex multimodal functions. As the dimension increases, the complexity of the optimization problems increases and the corresponding FDC gradually decreases.
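A sketch of LHD sampling and the FDC estimate; we use the common uniform-stratum LHD variant for illustration, and the best sample stands in for the unknown global optimum:

```python
import numpy as np

def lhd(NP, D, x_min, x_max, rng):
    """Simple Latin hypercube design: one sample per stratum on each dimension."""
    strata = np.stack([rng.permutation(NP) for _ in range(D)], axis=1)
    u = (strata + rng.random((NP, D))) / NP   # jittered stratum centers in [0, 1)
    return x_min + u * (x_max - x_min)

def fdc(X, f):
    """Fitness-distance correlation over a sample set X with fitness values f."""
    f = np.asarray(f, dtype=float)
    xg = X[f.argmin()]                        # best candidate as proxy optimum
    d = np.linalg.norm(X - xg, axis=1)        # distance of each sample to it
    cf, cd = f - f.mean(), d - d.mean()
    return float(np.sum(cf * cd) / (len(f) * cf.std() * cd.std()))
```

On a unimodal problem such as the sphere function, fitness and distance to the best sample move together, so the estimate is close to 1.
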
From Equation (12), when the fitness value decreases with the distance to the global optimum (for a minimization problem), the FDC value should ideally be close to 1. Having chosen FDC as a measure of the complexity of the optimization problem, the question becomes how to incorporate it properly into the algorithm. We introduce the concept of correlation distance,

d_FD = (1 − r_FD) / 2, (14)

so that d_FD ∈ [0, 1]. When d_FD = 0, the optimization problem presents an obvious convex feature to some extent, and it becomes increasingly complex as d_FD increases. So

ps = ps_lower + min((ps_upper − ps_lower) · d_FD, ps_limit), (15)

where ps_limit is set to 0.5 so that ps will not be larger than 1. Through ATSP, we establish a simple linear relationship between the complexity of the optimization problem and the stage division of the algorithm. The algorithm can enter the exploitation stage earlier for simple unimodal problems; otherwise, the exploration stage is appropriately extended to find the global optimal solution as far as possible.
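The ATSP mapping can be sketched in a few lines; the `ps_lower` and `ps_upper` defaults here are illustrative assumptions, not the paper's settings, and the correlation distance is taken as d_FD = (1 − r_FD)/2:

```python
def atsp_ps(r_fd, ps_lower=0.2, ps_upper=0.8, ps_limit=0.5):
    """Map the FDC value r_fd to the stage-partition threshold ps."""
    d_fd = (1.0 - r_fd) / 2.0                       # correlation distance in [0, 1]
    return ps_lower + min((ps_upper - ps_lower) * d_fd, ps_limit)
```

A highly correlated (likely unimodal) problem yields a small ps, handing most of the budget to the exploitation stage; a poorly correlated problem extends exploration up to the cap.
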

Parameter Self-Adaptation
The successful performance of the DE algorithm largely depends on the selection of the scaling factor F and crossover rate CR [1], which play a crucial role in its effectiveness, efficiency, and robustness. The mainstream method is to adapt the parameters to different problems while iterating. A parameter-adaptive method based on historical memory and fitness-improvement weights has shown good results [20,21,28], but it may increase the risk of the population falling into a local optimum [22], and the correlation between F and CR does not receive enough attention. Therefore, we adopt another parameter-adaptive method [51].
The adaptive process starts with a parameter setting pool, PARM_POOL, and a probability pool, PR_POOL. We select [(0.1, 0.2), (0.5, 0.9), (1.0, 0.1), (1.0, 0.9)] as the initial (F, CR) parameter pool, and PR_POOL contains the selection probability pr for each pair of parameters, so each individual selects a parameter pair according to pr. The adaptive process starts from 0.1 · Max_FEs and adjusts the selection probabilities based on the performance of each pair of parameters:

ω_pr = Σ_{i=1}^{n} (f(X_i^old) − f(X_i^new)), (16)

where ω_pr is the sum of fitness improvements corresponding to each pair of parameters, n is the number of individuals using that pair, and X_i^new and X_i^old are the post- and pre-evolutionary individuals, respectively. From this, we obtain the improvement rate corresponding to each pair of parameters,

∆_pr = max((ω_pr / n) / Σ_k (ω_k / n_k), 0.05), (17)

where 0.05 is the minimum probability required for each pair of parameters. The corresponding PR_POOL is then obtained from ∆_pr as

pr ← (1 − c) · pr + c · ∆_pr, (18)

where c = 0.05 is the learning rate, which is used to learn from the performance of each group of parameters in the previous generation, so as to better guide parameter adaptation.
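A hedged sketch of the PR_POOL update following the description in the text; the exact formulas in the paper may differ, so the normalization and floor below are our assumptions:

```python
import numpy as np

def update_pr_pool(pr, improvements, counts, c=0.05, p_min=0.05):
    """Update selection probabilities for (F, CR) pairs from their performance.

    improvements[k]: summed fitness improvement of parameter pair k;
    counts[k]: number of individuals that used pair k this generation.
    """
    rate = np.asarray(improvements, dtype=float) / np.maximum(counts, 1)
    if rate.sum() > 0:
        delta = rate / rate.sum()             # relative improvement rate
    else:
        delta = np.full(len(pr), 1.0 / len(pr))
    delta = np.maximum(delta, p_min)          # minimum probability per pair
    new_pr = (1 - c) * np.asarray(pr, dtype=float) + c * delta
    return new_pr / new_pr.sum()              # renormalize to a distribution
```
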

Complete Procedure of the Proposed TS-MSCDE
Based on the HTSDS and ATSP described above, we now present the complete steps of TS-MSCDE.
(1) Initialization: The initialization of the proposed algorithm differs from classical DE: it uses LHD [49] instead of random population initialization, and then calculates the stage-division threshold ps according to ATSP. (2) Mutation: According to the preset threshold p and the calculated threshold ps, the population is mutated by HTSDS. (3) Crossover: We use the same binomial crossover as classic DE. (4) Selection: The same greedy selection strategy as classic DE is adopted; that is, individuals with lower fitness values enter the next iteration.
The pseudo-code of TS-MSCDE is expressed in Algorithm 1; its key steps are:

5. Initialize a population of NP individuals P^t = {X^t_1, . . . , X^t_NP}, with X^t_i = (x^t_{i,1}, x^t_{i,2}, . . . , x^t_{i,D}), where each individual is generated by (13);
16. Reindex the individuals of the current population in ascending order of their fitness values; the superior subpopulation (S) consists of the top p * NP individuals and the inferior subpopulation (I) consists of the remaining individuals;
17. FOR i = 1 to NP
18.   Generate F and CR from PARM_POOL by probability pool PR_POOL;
24. END FOR
26. FOR i = 1 to NP
27.   Generate trial vectors U^t_i by Equation (7);
28. END FOR
30. FOR i = 1 to NP
32. END FOR
34. Recalculate the fitness value improvement rate for PR_POOL by Equations (16) and (17);
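The control flow of Algorithm 1 can be illustrated with a simplified, runnable sketch. It keeps the two-stage, two-subpopulation structure (sort-based split, exploratory DE/current/1, binomial crossover, greedy selection) but substitutes random initialization for LHD, fixed F and CR for the parameter pools, and DE/best/1 as a stand-in for the exploitative strategies, so it illustrates the structure rather than the paper's exact method:

```python
import random

def ts_mscde_sketch(f, bounds, NP=40, p=0.5, ps=0.5, max_fes=20_000,
                    F=0.5, CR=0.9):
    """Structural sketch of Algorithm 1: initialisation, fitness-sorted
    subpopulation split, stage- and subpopulation-dependent mutation,
    binomial crossover, and greedy selection."""
    D = len(bounds)
    pop = [[random.uniform(lo, hi) for lo, hi in bounds] for _ in range(NP)]
    fit = [f(x) for x in pop]
    fes = NP
    while fes < max_fes:
        order = sorted(range(NP), key=lambda i: fit[i])
        superior = set(order[: int(p * NP)])   # top p*NP by fitness
        stage1 = fes < ps * max_fes            # exploration stage?
        best = pop[order[0]]
        for i in range(NP):
            r1, r2 = random.sample([j for j in range(NP) if j != i], 2)
            if stage1 or i not in superior:
                # DE/current/1: explore around the individual itself
                donor = [pop[i][d] + F * (pop[r1][d] - pop[r2][d])
                         for d in range(D)]
            else:
                # DE/best/1 stands in for the exploitative strategies
                donor = [best[d] + F * (pop[r1][d] - pop[r2][d])
                         for d in range(D)]
            jr = random.randrange(D)           # binomial crossover
            trial = [donor[d] if (random.random() < CR or d == jr)
                     else pop[i][d] for d in range(D)]
            ft = f(trial)
            fes += 1
            if ft <= fit[i]:                   # greedy selection
                pop[i], fit[i] = trial, ft
    return min(fit)
```

On a smooth test function such as the sphere, the exploratory first stage maintains diversity while the exploitative second stage contracts the superior subpopulation toward the best solution found.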

Complexity Analysis
The complexity of classical DE is O(T_max * NP * D), where T_max is the maximum number of iterations, NP is the population size, and D is the problem dimension. Compared with classical DE, the extra computation of TS-MSCDE is mainly reflected in HTSDS, ATSP, and the PR_POOL calculation.
In HTSDS (see lines 17 to 27 of Algorithm 1), the main operations are sorting the population to separate the superior and inferior subpopulations and then selecting the corresponding mutation strategy. The complexity of the sorting operator is O(NP log NP), so the complexity added by HTSDS is O(NP log NP) per iteration.
The ATSP (see lines 1 to 8 of Algorithm 1) mainly consists of LHD initialization and FDC calculation. The complexity corresponding to LHD is O(NP * D), and the complexity corresponding to FDC is O(NP * D), so the complexity increased by ATSP is O(NP * D).
The parameter adaptation procedure (see lines 36 to 37 of Algorithm 1) mainly consists of recalculating the fitness value improvement rate, which adds O(NP) complexity.
Considering the complexity of classical DE, the complexity of the whole TS-MSCDE is O(NP * D + T_max * (NP * D + NP log NP + NP)) = O(T_max * NP * (D + log NP)), so the proposed algorithm retains computational efficiency comparable to that of classical DE.

Experiments and Comparisons
In this section, the computational results of the proposed algorithm are discussed along with other state-of-the-art algorithms. First, the characteristics of DE/best/1, DE/rand/1, DE/current − to − best/1, DE/current/1, and HTSDS were compared on the CEC2017 benchmark set [7]. Second, we validated the relevant effects of ATSP. Finally, an overall performance comparison between TS-MSCDE and advanced DE variants was conducted on the CEC2020 benchmark set [8]. All simulations were performed on a machine with a 3.4 GHz Intel Core i7 CPU and 16 GB of RAM.

Validation and Discussion of Classical Mutation Strategies and HTSDS
Although the characteristics of DE/best/1, DE/rand/1, and DE/current − to − best/1 have been described [25], comparison with DE/current/1 is rarely mentioned. Therefore, we compare the convergence rates of the four mutation strategies and the proposed HTSDS on typical optimization problems.
Convergence curves are shown in Figure 5. As expected, DE/best/1 achieved the fastest convergence on the four optimization functions, which is particularly important for the unimodal function F1 (Figure 5a), but it fell into a local optimum earliest on the multimodal function F7 (Figure 5b). DE/current/1 converged relatively slowly, especially once the difference vectors became small as iteration continued. It should be pointed out that HTSDS focuses on full exploration in the early stage, with convergence slower than or equal to DE/current/1, while in the later stage it focuses on accelerating convergence, and its convergence speed improved significantly. In Figure 5d, for F25 (whose global optimum is widely acknowledged to be difficult to find), the HTSDS curve dropped to the lowest among all mutation strategies, meaning a better solution was found within a limited number of iterations.

We further discuss the characteristics of HTSDS and the other four mutation strategies, using the same settings. Because some complex problems are difficult to optimize, we add a new comparison algorithm, HTSDS*, where * denotes Max_FEs = 200,000 * D. To reduce errors due to randomness, each algorithm was executed independently for 51 runs, with the results for D = 10 shown in Table A1, which reports the mean (Mean) and standard deviation (Std.Dev) of the error value f(X) − f(X*), where f(X*) is the global optimal value. The table also gives the rank of each algorithm's mean value on each test function. The "+", "−", and "=" signs at the end of the table indicate that HTSDS* is respectively better than, worse than, or equal to the compared algorithm.
It is not difficult to see from Table A1 that, when D = 10, most mutation strategies could find several of the global optima of F1-F4. Owing to its slow convergence in the late stage, DE/current/1 could not find the global optimum of any function. From F5 onward, DE/best/1 ranked relatively low among the six algorithms because it was easily trapped in local optima. Although DE/current − to − best/1 achieved better results on functions F1-F3, it performed poorly on F30 due to its poor robustness, while DE/rand/1, which converged similarly, performed in a more balanced way across all functions. Most importantly, without additional computing resources, HTSDS showed only average performance overall but ranked relatively high on F21-F30. With added computing resources, HTSDS* achieved outstanding performance, ranking first on 29 functions; compared with each of the other five algorithms, it was also superior on more than 20 functions. It can be concluded that HTSDS realizes its full potential when computing resources are sufficient.

Verifying Validity of ATSP
We saw in Section 5.1 that HTSDS accelerated late-stage convergence compared with DE/current/1. However, the stage partition parameter ps = 0.5 may be too large for unimodal functions and too small for some multimodal functions. Therefore, we use the previous algorithm settings to further examine the effectiveness of ATSP.
On the simple functions F1-F4, all algorithms accurately found the global optimum in every run, and HTSDS+ATSP-3 required fewer avgFEs than HTSDS-3 in all cases; that is, it converged faster. On the complex functions F5-F30, HTSDS+ATSP-3 did not deteriorate because of ATSP; instead, avgFEs increased, further improving solution accuracy on functions such as F19 and F22. From the comparison results in Table A2, HTSDS+ATSP-3 achieved better results than HTSDS-3 on 13 functions with the same computing resources. Moreover, as computing resources shrink, the effectiveness of ATSP diminishes in terms of overall performance across all functions. We can conclude that ATSP adapts better than fixed parameters when computing resources are relatively sufficient.

Comparison against State-of-the-Art Algorithms
It can be inferred from the previous experiments that the mutation operator adopted by TS-MSCDE requires relatively sufficient computing resources. To verify whether TS-MSCDE performs comparably to other advanced algorithms under such conditions, we introduce the CEC2020 benchmark set [8]. Unlike sets such as CEC2017, CEC2020 simplifies the test set, keeping the 10 most representative optimization functions. Moreover, in previous CEC competitions, the maximum allowed number of function evaluations (Max_FEs) did not scale exponentially with dimension, although problem complexity does. To address this disparity, CEC2020 significantly increases the Max_FEs of the 10 scalable benchmark problems beyond their prior contest limits, with the goal of determining the extent to which this extra budget translates into improved solution accuracy.
We set D = 5 with Max_FEs = 50,000; D = 10 with Max_FEs = 1,000,000; D = 15 with Max_FEs = 3,000,000; and D = 20 with Max_FEs = 10,000,000. Several currently popular DE variants, including algorithms ranked relatively high in the CEC2020 competition, were chosen for comparison with TS-MSCDE, and each algorithm was executed independently for 30 runs. These algorithms are: (1) SHADE: an efficient DE variant using the DE/current − to − pbest/1 mutation strategy; SHADE has been widely studied and used for comparison [21]. (2) LSHADE: based on SHADE, it adopts a linear population-size reduction mechanism, which significantly improves performance; this is a widely used DE variant [28]. (3) j2020: this algorithm uses a crowding mechanism and a strategy to select the mutant vector from two subpopulations; it won third place in the CEC2020 numerical optimization competition [52]. (4) AGSK: this new algorithm, derived from human knowledge sharing, won second place in the CEC2020 numerical optimization competition [51]. (5) IMODE: a hybrid algorithm with three differential mutation strategies and linear sequence programming; it was awarded first place in the CEC2020 numerical optimization competition [53].

Table A3 lists the performance of the six algorithms on F1-F10 when D = 5. TS-MSCDE found the global optimum on F1, F5, F6, F7, and F8; IMODE additionally found the global optimum on F9; and the other four algorithms found fewer global optima. The global optimum of F10 is generally acknowledged to be difficult to find, yet TS-MSCDE achieved outstanding performance on it. Overall, TS-MSCDE ranked first on six optimization functions, IMODE ranked first on eight functions, and the other four algorithms lagged behind. Table A4 shows the results when D = 10. Unlike at D = 5, as the dimension increases, the ability of all algorithms to find the global optimum stably decreases.
Nevertheless, TS-MSCDE achieved first place on F1, F3, F8, and F10, while IMODE achieved first place on F1, F4, F6, F7, and F9. It is worth mentioning that j2020 achieved first place on F1, F2, and F5. The performance of the other three algorithms was relatively mediocre, and their rankings on each function were lower. Table A5 shows the results when D = 15. All six algorithms could still find the global optimum of F1 stably in 30 runs. TS-MSCDE ranked first on F3, F9, and F10 and still led on average by a large margin, while IMODE ranked first among all algorithms on F4, F7, and F8. Generally speaking, the performance of TS-MSCDE and IMODE was close, and the other four algorithms lagged behind. Table A6 shows the results when D = 20. The complexity of F9 and F10 multiplies at this dimension, and experiments showed that many algorithms easily fall into local optima on these two functions. TS-MSCDE still achieved outstanding results on F10 and matched IMODE on F9. TS-MSCDE had a certain probability of finding the global optimum on F3 and F8. From the final results, TS-MSCDE and IMODE each ranked first on five optimization functions, but IMODE ranked slightly higher than TS-MSCDE overall.
From Tables A3-A6, it can be seen that TS-MSCDE and IMODE performed closely to each other on CEC2020, with results exceeding those of the other four algorithms. In addition to numerical comparisons, we compare solution quality statistically. In Table 1, the Wilcoxon signed-rank test (significance level α = 0.05; R+ is the sum of ranks for the problems on which the first algorithm outperformed the second, and R− is the sum of ranks for the opposite [54]) is used to compare the proposed algorithm against its competitors. From D = 5 to D = 20, the p-value obtained by TS-MSCDE in five rounds of comparison was less than 0.05, which indicates a significant difference between TS-MSCDE and the other algorithms. We also find that TS-MSCDE won 16 rounds of comparison in terms of R+ and R−, which again reflects that TS-MSCDE achieved more promising results. The same conclusion can be drawn from Tables A3-A6. In Table 2, the Friedman test, a widely used nonparametric test in the EA community, validates the performance of the algorithms based on the mean value of 30 independent runs. The p-values from 5D to 20D calculated by the Friedman test are all less than 0.05, so it can be concluded that there are significant differences in performance among the compared algorithms on the corresponding dimensions. From 5D to 15D, TS-MSCDE ranked first: it was significantly better than AGSK, j2020, LSHADE, and SHADE, and slightly better than IMODE. On 20D, TS-MSCDE dropped to third place, behind IMODE and AGSK. Therefore, in the overall ranking, TS-MSCDE ranked first with a slight advantage, while IMODE ranked second; the difference between the two was small, and both significantly exceeded the other four algorithms.
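The R+/R− columns of Table 1 can be reproduced from per-problem error values. The following stdlib sketch implements the standard signed-rank bookkeeping (zero differences dropped, tied absolute differences sharing the average rank); it is a generic illustration of the test, not the exact tooling used in the paper:

```python
def signed_rank_sums(a, b):
    """R+/R- rank sums of the Wilcoxon signed-rank test: R+ sums the ranks
    of problems where algorithm `a` beat `b` (smaller error), R- the
    opposite.  Zero differences are dropped; ties share the average rank."""
    diffs = sorted((x - y for x, y in zip(a, b) if x != y), key=abs)
    ranks = {}
    i = 0
    while i < len(diffs):
        j = i
        while j < len(diffs) and abs(diffs[j]) == abs(diffs[i]):
            j += 1                      # group of tied |d| values
        avg = (i + 1 + j) / 2           # average of ranks i+1 .. j
        for k in range(i, j):
            ranks[k] = avg
        i = j
    r_plus = sum(ranks[k] for k, d in enumerate(diffs) if d < 0)
    r_minus = sum(ranks[k] for k, d in enumerate(diffs) if d > 0)
    return r_plus, r_minus
```

The larger of R+ and R− identifies the winner of a comparison round; the p-value is then read from the signed-rank distribution of min(R+, R−) at α = 0.05.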

Sensitivity Analysis of Population Partition Parameter p
In the above experiments, we verified the proposed HTSDS, ATSP, and TS-MSCDE on the benchmark functions. In HTSDS, we set the population partition parameter p = 0.5 to divide the population into two subpopulations. In this section, we further examine whether TS-MSCDE is sensitive to changes in p.
First, we set p as a sequence from 0.1 to 0.9 with step size 0.1. For intuitive statistics, we introduced a cumulative rank normalization formula (Equation (19)), where D = 5, 10, 15, 20, f indexes the functions in CEC2020, and MR^D_p(f) is the rank of the mean value of TS-MSCDE with a given p in dimension D. From Equation (19), the rank performance of the algorithm under different p values in different dimensions is obtained as Figure 6. From the figure, we can see that the rank performance for p from 0.3 to 0.7 is relatively stable on 5D, and performance is poor outside that range. The performance of TS-MSCDE fluctuates slightly between 10D and 20D, but the general trend is that once p exceeds 0.5, performance gradually deteriorates, while below 0.5 performance varies. We can analyze the reasons behind this. When p is small, the inferior subpopulation is larger; in connection with the description in Section 4.2, most individuals then maintain population diversity while only a few focus on local search, which may cause rapid convergence and frequent stagnation of the superior subpopulation, and thus unstable optimization results and slow convergence. When p is large, the inferior subpopulation is small, which makes it difficult to maintain population diversity and to find the region containing the global optimum, so the performance of the algorithm gradually deteriorates as p increases. In summary, considering both the optimization accuracy and the convergence speed of the algorithm, p is set to 0.5.
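The aggregation behind Figure 6 can be sketched as follows. Since the exact form of Equation (19) is not recoverable from this excerpt, the code below shows one plausible reading: for each candidate p, sum the mean-value ranks MR over all dimensions and functions, then rescale the sums to [0, 1], with 0 denoting the best aggregate rank:

```python
def normalised_cumulative_rank(ranks_by_p):
    """One plausible form of Equation (19).  ranks_by_p maps each candidate
    p to the list of mean-value ranks MR^D_p(f) over all dimensions D and
    functions f; the min-max rescaling here is an assumption."""
    totals = {p: sum(rs) for p, rs in ranks_by_p.items()}
    lo, hi = min(totals.values()), max(totals.values())
    span = (hi - lo) or 1.0
    return {p: (t - lo) / span for p, t in totals.items()}
```

A p value whose normalized score stays near 0 across dimensions is the most robust choice, which is the criterion used to settle on p = 0.5.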

TS-MSCDE on Real-World Optimization Problems
In the previous sections, we applied TS-MSCDE to CEC2020 to show that the algorithm can achieve excellent results with sufficient computing resources. To further extend the practical implications of this research, four real-world optimization problems from the CEC2011 test suite [9] are selected to assess the characteristics of TS-MSCDE. The first real-world problem (F1) is parameter estimation for frequency-modulated sound waves, Problem no. 1 in CEC2011. The second (F2) is spread spectrum radar polyphase code design, Problem no. 7 in CEC2011. For the third (F3), we select the 12th problem in CEC2011, the so-called "Messenger" spacecraft trajectory optimization problem. Finally, the fourth (F4) is the 13th problem in CEC2011, which defines the "Cassini 2" spacecraft trajectory optimization problem. To ensure the effectiveness of TS-MSCDE, we set Max_FEs = 6 * 10,000, 20 * 1,000,000, 26 * 1,000,000, and 22 * 1,000,000 for F1-F4, respectively, based on their problem dimensions; the other settings are consistent with Section 5.3.
As shown in Table 3, TS-MSCDE achieves excellent results on all four real-world optimization problems. It is worth mentioning that the four problems are typical complex multimodal problems on which most algorithms struggle to obtain satisfactory solutions.

Conclusions and Future Work
There are numerous complex optimization problems in the real world for which it is difficult to obtain satisfactory solutions. When computing resources are sufficient, the first choice is to make full use of them to improve solution quality. In this paper, we proposed a two-stage differential evolution algorithm with mutation strategy combination (TS-MSCDE) focused on this problem.
TS-MSCDE contains two search stages adaptively partitioned by ATSP and assigns different mutation strategies to superior and inferior individuals through HTSDS, so as to realize the symmetric decoupling of exploration and exploitation. Experiments were conducted on the CEC2017, CEC2020, and CEC2011 benchmark sets. On CEC2017, HTSDS+ATSP achieved outstanding performance with sufficient computing resources. Given the curse of dimensionality, computing resources should grow exponentially with dimension. Therefore, we selected the classic DE variants SHADE and LSHADE together with IMODE, j2020, and AGSK (the top three algorithms on CEC2020) for comparative experiments, using the Friedman and Wilcoxon tests to analyze the performance of the algorithms. Further, four real-world optimization problems from CEC2011 were selected to verify the efficiency of TS-MSCDE. The results show that TS-MSCDE can tackle extremely difficult problems when computing resources are sufficient.
While TS-MSCDE exhibits a number of desirable characteristics, more study is required. Although HTSDS and ATSP greatly accelerate the convergence of TS-MSCDE, the problem of slow convergence has not been completely solved; it still occurs on some complex optimization problems as the dimension increases. Future work will therefore focus on further improving the performance of TS-MSCDE under limited computing resources. Moreover, more application scenarios are worth exploring to demonstrate the practical significance of TS-MSCDE.

Institutional Review Board Statement: Not applicable.

Informed Consent Statement: Not applicable.
Data Availability Statement: Data sharing is not applicable to this article.

Conflicts of Interest:
We certify that we have no affiliations with or involvement in any organization or entity with any financial interest or non-financial interest in the subject matter or materials discussed in this manuscript.