Formation Control with Connectivity Assurance for Missile Swarms by a Natural Co-Evolutionary Strategy

: Formation control is one of the most concerning topics within the realm of swarm intelligence. This paper presents a metaheuristic approach that leverages a natural co-evolutionary strategy to solve the formation control problem for a swarm of missiles. The missile swarm is modeled by a second-order system with a heterogeneous reference target, and the exponential of the resultant error is accumulated to be the objective function such that the swarm converges to optimal equilibrium states satisfying speciﬁc formation requirements. Focusing on the issue of the local optimum and unstable evolution, we incorporate a novel model-based policy constraint and a population adaptation strategy that signiﬁcantly alleviates the performance degradation of the existing natural co-evolutionary strategy in terms of slow training and instability of convergence. With application of the Molloy–Reed criterion in the ﬁeld of network communication, we developed an adaptive topology method that assures connectivity under node failure, and its effectiveness is validated theoretically and experimentally. The experimental results demonstrate that the accuracy of formation ﬂight achieved by this method is competitive with that of conventional control methods and is much more adaptable. More signiﬁcantly, we show that it is feasible to treat the generic formation control problem as an optimal control problem for ﬁnding a Nash equilibrium strategy and solving it through iterative learning.


Introduction
Intelligent control of swarm systems has been widely used in tracking, rescue and delivery [1,2].Through the interactive cooperation of multiple agents, the swarm can exhibit complex intelligent behaviors, such as cohesion, separation and alignment, known as the Reynolds rules [3].Among the various research directions, the formation control problem has been widely studied on a variety of system models for its theoretical and practical values.
In this paper, we focus on the formation control of missile swarms.Similar to common formation control problems, formation control for a missile swarm covers formation initialization, contraction, expansion and reconfiguration [4], corresponding with formation producing and formation tracking problems [5].Based on the sensing capability and interaction level, the current dominant research [6] classifies formation control methods into position-, displacement-and distance-based control.The main difference between these control methods is the ability to sense relative or absolute state information.Thus, we propose classifying angle-based control as a particular case of displacement-based control, in which the relative distance constraint from the relative position is removed.In this paper, we focus on the displacement-based control framework not only for its simplicity and stability but also for its vast realistic application value [7].
In practice, when dealing with large-scale swarm systems, control methods that the system can adopt are usually subjected to limited communication bandwidth, communication quality and other interference issues.Hence, more flexible communication means that tolerance to communication failures is necessary for achieving robust information transmission.However, traditional communication methods based on a fixed communication topology have difficulty coping with unexpected situations and providing continuous and reliable communication.For developing more robust intelligent swarm communication algorithms, two newly developed adaptive communication methods in the field of network control system are known as ad hoc-based networks and cluster-based networks [8].A recent study utilized the received signal strength instead of localization facilities to measure individual distances [9].Another approach under the same ad-hoc framework [10] developed a more robust intelligent swarm communication algorithm based on the Molloy-Reed criterion and grey prediction, improving the robustness of the drone swarm network and the reliability of data transmission.Such an algorithm is further employed in the flying ad-hoc network (FANET), which is a distributed and self-organizing communication framework.Considering that distance is usually closely related to communication quality, and the perception of distance does not require additional communication bandwidth, it is highly feasible to treat distance as a major factor in configuring the communication topology.In this paper, we develop an adaptive topology communication method based on minimizing the communication distance and overcome the problem of head failure in the cluster-based communication method to ensure connection stability for formation control.
Formation control methods based on traditional control theory and dynamic systems can provide more robust control for single control objectives and motion patterns [11], but it is hard to achieve high-precision coordination through conventional control methods with multiple objectives or in environments that are too complex to be modeled.That is why intelligent control approaches have been introduced as more flexible alternatives for operating in unstructured or dynamic environments surrounded by multiple uncertainties.
A number of metaheuristic algorithms have been shown to cope well with multiobjective, complex constrained optimization problems and have been widely used in all aspects of formation control, including the formation of optimal configurations and motion planning.A comprehensive survey on the development of such algorithms for aircraft motion planning problems is presented in [12].In [13], Liu investigated a particle swarm optimization (PSO)-based algorithm in the optimal design of missile formation configuration.In [14], Seung-Mok proposed a cooperative co-evolution PSO algorithm based on the traditional PSO-based model predictive control, which optimizes in a distributed way and improves the speed and performance of the original algorithm.Another population-based metaheuristic algorithm, known as the genetic algorithm (GA), was employed to evolve the positioning strategy of formation-controlled multi-robot systems [15].A gradient descentbased reinforcement learning method utilizing an actor-critic framework was proposed for optimal consensus control of multi-agent systems.Existing research on consensus control provides another perspective on solving the formation problem, since the general formation control problem is a special kind of consensus problem which requires that certain errors are maintained between the states of neighboring robots rather than identical states.Although existing metaheuristic-based algorithms have been applied to formation control in searches for single-step optimal solutions in real time, these methods are usually slow when running and do not have the ability to migrate and learn.In contrast, neural networkbased controllers can be trained to achieve the same performance through iterative learning and can be easily deployed to compact mobile units with low computational performance and cost, although the majority of the computational cost is spent in the training stage.
There is a dearth of existing research using neural network controllers, despite the fact that they are commonly used under the reinforcement learning (RL) paradigms [16] and trained to solve specific control or decision-making problems.In [17], Liu developed a self-organizing map (SOM)-based neural network approach for motion planning in the formation control problem of a multiple autonomous underwater vehicle (AUV) system, where the SOM is an unsupervised neural network serving specific purposes, such as competitive learning and task assignment [18].This is an iterative algorithm used to search for the most energy-efficient motion path for the agent.Since it takes all individuals' states as input and the target path as the output, it can be regarded as a centralized method which is not desirable for the practical deployment requirement.
The general idea of the heuristic algorithm using a neural network controller is similar to other algorithms, such as PSO and the ant colony optimization (ACO) algorithm, which are designed to find the optimal solution of the cost function under certain constraints by iterative searching.The essential difference between the former as well as other algorithms is that the former optimizes the neural network weights, and the final optimized controller can be deployed directly.In contrast, the latter optimizes directly in the solution space in a single step.Both methods have their advantages and disadvantages.Since the latter does not need pretraining, it relies too much on single-step computation, and it is difficult to guarantee the speed requirement for real-time operation.Therefore, a neural network controller, as an adaptive and learnable controller, is considered a promising means of intelligent control in the future [19].Evolutionary strategy can effectively and stably optimize the neural network structure and the network weights [20,21], thus benefiting any possible application scenario that requires observation of large amounts of historical data.
Apart from acting as a controller to generate control commands, neural networks are widely used in control systems for sensing, decision making, trajectory planning and many other purposes.In [22], a modified Grossberg neural network (GNN) was used to generate the shortest path to avoid obstacles and reach a target point.Similar research on optimal path or trajectory design is also available in [23,24].Lan [25] adopted the radial basis function to estimate the system disturbance, which led to enhanced robustness and adaptability of swarm formation control based on the artificial potential field method.Additional research that implements adaptive neural networks to estimate the uncertain and nonlinear dynamics of the system can be found in [26][27][28].Furthermore, in [29], Lan used reinforcement learning theory to train a neural network controller that could be applied to the distributed control of swarm systems in an unknown dynamic environment, where the agents in the swarm can perform basic intelligent behaviors such as tracking and obstacle avoidance.Likewise, extensive research has shown that neural networks have certain robustness in many scenarios without being inferior to traditional methods and are relatively more flexible and applicable.
The foremost motivation of this paper is to develop a metaheuristic evolutionary computational approach to solve the formation control problem for multi-agent systems (MASs) while exploring the usage of neural network (NN) controllers.We use a recently proposed natural co-evolutionary strategy (NCES) algorithm [30] to realize such a vision and apply it to the control of a second-order multi-missile system.The contributions of this paper are manifold.First, we incorporate a policy constraint approach to enhance the stability of the algorithm in order to optimize the objective function and find the Nash equilibrium strategy.The proposed algorithm is more flexible for adapting to varying control objectives compared with conventional approaches [4,[31][32][33].Then, an adaptive topology scheme is designed to address the common node failure problem in formation control, and this method can achieve stable communication connections at a relatively low communication cost.Finally, a stable population adaptation method is proposed to further improve the performance of the algorithm and mitigate the local optimum issue.Emulation experiments show that the proposed formation control algorithm can handle tasks such as formation maintenance, reference trajectory tracking and formation switching with high accuracy in the presence of obstacles, and it is immune to disturbances such as randomized initial positions and node failure.The implementation code (https: //github.com/GEYOUR/Swarm-Missile-Formation-Controlaccessed on 9 September 2022) of the parallelized algorithm as well as the experimental platform are available online to encourage further research on the constrained swarm system.The remainder of this work is constructed as follows.In Section 2, the nonlinear multi-missile system is modeled, and the displacement-based formation control problem is formulated, including the specification of formation patterns.In Section 3, the distributed NCES-based formation control algorithm is proposed based on a neural network controller.In Section 4, numerical experiments are conducted for a variety of scenarios, and the results are presented.Finally, the conclusions and recommendations are presented in Section 5.

System Modeling of a Swarm of Cruise Missiles
This paper focuses on formation control in a two-dimensional space (i.e., the OXY planar space of the inertial coordinate frame).As shown in Figure 1, the swarm consists of multiple missiles, and each missile can be treated as a point mass.In the simplified dynamics model, we do not consider aerodynamic factors or the effect of the missile's own inertia tensor.Considering a total of N missiles, the ith missile is denoted by M i , while V mi , α mi , x mi and y mi are used to represent the speed, heading angle and coordinates of M i in the global coordinate frame.Let x i = [x mi , y mi , α mi ] T represent the measurement vector of the missile, which is not necessarily required hereinafter, and x r = [x r , y r , α r ] T represent the measurement vector of the reference target.For brevity, let xi = [x T i , V mi ] T be the state vector with speed inserted.Here, a vi and a li are defined as the acceleration commands along the direction of the velocity and perpendicular to the direction of the velocity, respectively.As shown in the partial enlargement of the actuation decomposition, the two acceleration commands are independent and perpendicular to each other.The second-order system dynamic of the ith missile can be expressed as where u i = [a vi , a li ] T is the input for M i and where I n and 0 m,n denote the identity and zero matrices.Note that the system in Equation ( 1) is similar to the system discussed in [34], which is often called the unicycle-like vehicle.
In this paper, the missile is a nonholonomic system with controllable speed that is capable of maneuvering with limited actuation and dynamic constraints.We assume that under a certain communication relationship, the relative state z ji = x j − x i can be sensed by missile i, where j represents the neighboring missile.In addition, we make the following assumptions about the system: Assumption 1.We assume that the reference target or leader is a first-order model driven by an unknown changeable speed and angular velocity commands, which means that its vertical acceleration is not available, but its lateral acceleration is subject to the same constraints as the follower missiles.Hence, the heterogeneous system may bring in more complexity for mathematical analysis and stability issues.
Assumption 2. It is assumed that the missile controller applies the cascade control [35] structure such that its altitude is automatically controlled by the inner control loop, which is a stable closed loop.We focus on the design for the missiles' thrust and lateral acceleration of the outer control loop.This divide-and-conquer approach is widely used in many studies [32,36] to simplify the problem in order to verify the effectiveness of the control method, and it is also valid for three-dimensional dynamic models.

Formation Control under Displacement-Based Framework
Let us denote the coordinate of the ith node of the formation by λ pi .The formation pattern is defined by λ = {λ pi = [λ xi , λ yi ] T : i = 1, 2, . . ., N}, which is the set of Cartesian coordinates of the origin of each node that is expressed in the coordinate system such that the center of the formation is the origin, which should satisfy (3) In the displacement-based framework, each missile has to align its own coordinate system with the global coordinate system and be able to sense the relative positions and orientations of its neighbors, whereas its position in the global coordinate system is not necessarily required.Let N i denote the set of neighboring missiles of the ith missile.Suppose that the reference target should be kept in the formation center.Then, the tracking error of the ith missile is where e i ∈ R 3 and If the target state is available to the ith missile, then It is worth noting that the first term represents the formation maintenance error, while the second term represents the target tracking error when it is able to obtain the target information.However, the error vector is defined in the global coordinate system.In order to standardize the effect of an individual missile's heading angles on the error vector to the relative coordinate system that is aligned with the missile's orientation, we define the rotated error vector as e ri = R 3 (−α i )e i , where R 3 (−α) ∈ R 3×3 is the three-dimensional rotation matrix defined as Additionally, there exists e ri = e ix e iy e iα T , where the subscript r represents the rotational transformation.Considering the system in Equation ( 1) and the discretization of the system control loop, we can obtain the following error dynamics: where Here, L i is the number of missiles belonging to the set of neighbors N i , and τ is the simulation step size, where α ji = α j − α i and αi = α r − α i .Furthermore, u i = [a vi , a li ] T and u r = [v r , w r ] T denote the vector of the control commands of missile i and the reference target, respectively.Generally, the objective at time t can be defined by a function of the error vector: where is the symmetric positive definite matrix.K C balances the shape and consistent direction of movement of the formation.The optimal design of the formation controller u i can be formulated as a nonlinear optimization problem: This is subjected to in which T denotes the total flight time and u min , u max , V min and V max represent the system constraints.
In this paper, we define two formation patterns-the regular polygon and the straight line formation-which are shown in Figure 2 and denote each formation pattern by λ P (β,l f ) and λ L (β,l f ) , respectively.In this figure, each dot represents a node of the formation, and the nodes are connected to each other under the constraint relationship to form specific formation shapes.l f and β are the parameters that control the size and rotation of the formation, respectively, and we have In addition, R 2 (β) is a two-dimensional rotation matrix similar to that in Equation ( 5), which is Examples of a variety of formation patterns can be found in [1,37,38], but specific geometric definitions are represented only in this paper.
From the above definition, it can be intuitively observed that the straight line formation is symmetric with respect to the origin of the coordinate system that defines it such that the sum of its x and y coordinates is zero, which satisfies Equation (3).Similarly, when the number of nodes of a regular polygon is even, we can also get this conclusion.Therefore, we only need to prove that when the number of nodes is odd, it still satisfies this prerequisite as follows.
Suppose that the regular polygon formation defined in Equation ( 11) consists of N ∈ {2n + 1 : n ∈ Z} nodes.The Cartesian coordinates of each node in the formation are denoted by P i = [λ xi , λ yi ] T such that the sum of all coordinates is Suppose node 1 is located at the y axis such that ξ x = 0.After rotating the formation by ρ, we can obtain the following sum of coordinates by multiplication with the rotation matrix: The angle between the vector from the coordinate origin to node k and the corresponding direction of the Y axis is 2(k − 1)π/N.If node k is rotated to be on the y coordinate axis, since the formation is again symmetric about the Y axis, the summation of the x coordinates is Since N is an odd number, if k is not equal to 1 or (N + 1)/2, then 2(k − 1)π/N is not equal to 0 or π, and sin(2(k − 1)π/N) is not constantly equal to 0, and therefore, ξ y = 0. Thus, Equation ( 14) is a zero matrix for an arbitrary ρ value, and Equation (3) holds.It is shown that many other formation patterns can be considered as variants of the above two formation patterns, such as row, column and square patterns [1].Moreover, as discussed in the following section, this formation paradigm can also generate some asymmetric patterns, such as wedge and crescent patterns [25], by deleting nodes appropriately.

Natural Co-Evolutionary Strategy for MASs
With limited sensing capability, an MAS can be modeled as a multi-agent partially observable Markov decision process (POMDP), which is an extension based on the Markov decision process (MDP).Such a process can be expressed as with S and S representing the system state before and after transitions, respectively, and A u = [u 1 , . . ., u n ] being the action vector for all agents in the system.P(•) is the transition probability function.It is said to be a deterministic problem if P(•) is either one or zero; otherwise, it is a stochastic problem.Considering the simplified system model neglecting uncertainties, the formation control problem in general can be solved using optimization algorithms under a multiagent POMDP.Assuming that the fitness function measuring the performance of agent i is f i (θ i , θ N i ), in which θ i is its controller (policy) parameter, and θ N i = {θ j : j ∈ N i } is the set of parameters for its neighboring agents, then the objective of optimal control is to find the optimal control strategy θ * i , i ∈ (1, . . ., N) such that where F(•) denotes the overall performance of all agents in the MAS.Such a solution is often referred to as the Nash equilibrium strategy, meaning that no further improvement can be made to individual solutions without deteriorating the performance.It has been found to be very difficult to find a Nash equilibrium strategy for the nonlinear continuous agent system whose cost function is coupled with neighboring states while satisfying the system constraints in Equation ( 10) [14,39].For the formation control problem, which requires sufficient cooperation among neighboring agents, F(•) can be regarded as a simple summation of all individuals f i (•), and the formation control problem can be viewed as a constrained dynamic optimization problem with the objective of minimizing the total cost of the formation error.Thus, the Nash equilibrium strategy can be obtained using a co-evolutionary algorithm that is designed for evolving simultaneously to reach the overall optimum fitness.In a previous work [30], we improved the natural evolutionary strategy (NES) and proposed an NCES algorithm that seeks global optimality for the constrained multi-objective optimization problem in multi-agent systems.In brief, the NCES algorithm is a bio-inspired, population-based algorithm capable of optimizing high-dimensional parameters, such as neural network weights, toward the direction of higher fitness.The NCES algorithm usually proceeds as follows.First, the parameters θ i for i ∈ (1, . . ., N) are initialized, and the optimization objective is determined for a specific control problem.The fitness function f (•) which is embedded within the system model is thus obtained.In the second step, iterative optimization is performed, and new perturbations i are sampled m times at the beginning of each iteration step, which obey a probability distribution p(•) to obtain the perturbed population θ i = θ i + i corresponding to each agent.Then, their fitness values are evaluated in the system in a distributed way so that each population obtains the following corresponding gradient information: The parameters are also updated in a gradient ascent manner: Finally, the second step is looped until convergence or the Nash equilibrium strategy is found.Compared with the conventional NES algorithm, the NCES algorithm provides a more accurate estimation of gradients in the presence of multiple interactive agents.More details of the algorithm can be found in [30].

Distributed Co-Evolutionary Strategy Optimizing a Neural Network Controller
To investigate the applicability of neural network controllers within the framework of formation control problems, this paper proposes adopting a multi-layer perceptron (MLP) neural network controller with a cooperative NES-based training approach.The MLP NN has a single hidden layer of 16 nodes, and its schematic is depicted in Figure 3.The weighting matrices for each layer are represented by W 1i and W 2i .φ(•) and ψ(•) are the activation functions for the hidden layer and output layer.Specifically, φ(•) is the sigmoid function, and ψ(•) is selected as the hyperbolic tangent function (Tanh) in order to impose restrictions on the output and satisfy the system constraints.Another advantage of the neural network controller is that saturated control can be achieved by a reasonable choice of activation function without restricting the solution space.With z = [α mi , V mi , e ri ] T denoting the input, the output of the neural network controller is Note that the dashed network input TS in Figure 3 represents the transform signal, which is included in the input only when needed, as described in Section 4. The same network configuration will be used in the following experiments for all agents.Applying the previously mentioned NCES algorithm to train this controller, it is first determined that the parameters of the controller are θ i = [W 1i , W 2i ].The fitness function is the objective function in Equation (8) (i.e., f (θ i , θ N i ) = J i ), which is nonlinearly coupled with the states of neighboring agents and can only be evaluated through the interaction feedback within the system.In order to apply the NCES algorithm to the training of the missile formation controller in this paper, the plain NCES algorithm is insufficient to guarantee the global stability and the convergence speed of the algorithm.To further improve the performance of the algorithm and perform specific optimization for the formation problem, we propose a series of supplementary techniques to effectively enhance the convergence speed and accuracy of the algorithm.

Population Adaptation Technique
In [30], we pointed out the importance of the learning rate for the accuracy of the algorithm and proposed an algorithm for learning rate adaptation.However, adjusting the learning rate often consumes a lot of time.In order to ensure the speed of the algorithm, a novel population size adaptation algorithm is used to adjust the evolutionary process adaptatively.Based on the previous works [40,41], we can learn that the trend of the gradient is related to the complexity of the objective function.Usually, for a more complex optimization region, such as in a multi-modal or noised function, the estimated gradient is gentle, while in a flatter region, such as a spherical space, the gradient is relatively steep.To estimate the accuracy of the estimated natural gradient under the movement of the parameter distribution, the evolution path ρ θ is introduced to detect the resistance in the evolutionary process.The population size is adapted based on the length of the evolutionary path following empirical common sense that a larger population size will lead to higher accuracy for the estimated gradient.The weight matrix W 1i , W 2i of agent i is considered its parameter vector θ i , for which θ i ∈ R s and s is the total number of weights.The evolution path in iteration t is calculated by accumulating the square of the Mahalanobis distance of the parameter movement of all agents as Note that Σ is the covariance of the probability distribution from which the new populations are sampled, since the evolution path should not depend on the parameterization of the probability distribution.The population size η p (t) is then adjusted according to the evolutionary path as follows: where β is a constant factor that determines the growth rate of the population size and η min p and η max p are the minimum and maximum population size which are sent to the clip function clip(•) in case of undesirable adapted values, respectively.Note that since the total optimization complexity tends to increase as evolution progresses, we adopted a nondecreasing strategy to adjust the population size to ensure stability.The initial population size is set to be η p (0) = 10 + 5 ln(s), referring to the default set-up in [42], which should be a good candidate, and the boundaries are determined as Note that s is the number of parameters of one agent.The above configuration was found to be appropriate according to the experimental results.Through empirical observation, we came to an intuitive conclusion that the increment of the population size does not necessarily lead to an improvement in evolutionary quality and sometimes even leads to difficulty in convergence or falling into the local optimum.This is probably because a large population size will greatly average the contributions of individuals and thus reduce the exploratory nature of each individual, so it is wise to adjust the size appropriately rather than just thinking that the larger, the better.

Cluster-Based Adaptive Topology
The inter-agent connectivity in the field of multi-agent system has been primarily modeled and characterized by means of graph theory [6,43], which will be utilized to identify the observable neighbors of the missiles in the formation.In this subsection, we review some of the basics of graph theory.Suppose there are n agents in the MAS (agents can be represented by nodes), and a graphs G is defined as (V, ), where V = (v 1 , v 2 , . . ., v n ) represents the set of nodes and ⊆ V × V represents the set of edges composed of directed connections of different nodes.The neighbor set of the node i is defined as N i = {j ∈ V : (i, j) ∈ }, and j ∈ V is said to be connected to i ∈ V in a directed way if (i, j) ∈ and (j, i) / ∈ , while j is said to be connected to i in an undirected way if both (i, j) and (j, i) belong to .A directed path of G is a series of adjacent edges of the form (v 1 , v 2 ), (v 2 , v 3 ), . . ., (v i , v j ), (v j , v k ), and a graph is said to have a spanning tree if there exists a directed path from one node to any other nodes, and a graph is said to be connected in a directed (undirected) manner if there is a directed (undirected) path between any pair of distinct nodes.We use the adjacency matrix A = [a ij ] ∈ R |V |×|V | to represent the above node connectivity, in which The associated Laplacian matrix is L = D − A. D = diag(a i ) is the diagonal matrix of vertex degrees where a i is the degree of vertex i, and a i = ∑ n j=1 a ij .The Laplacian matrix L is a symmetric semi-positive definite matrix.
One of the foremost goals of designing a communication method is to ensure highly reliable and low-latency communication for all nodes in the swarm.The communication quality of the missile swarm is easily affected by various conditions, especially when performing in a hostile environment.In other words, the swarm of missiles is constantly subject to threats from the enemy's air defense system during flight, and some missiles may get intercepted and lose communication with the rest of the swarm.Therefore, dynamic communication approaches need to ensure that when some nodes fail, the swarm can still maintain effective communication to complete a mission.Based on the previous work in network topology and swarm communication, we developed a novel adaptive cluster-based network which adaptively reconfigures the communication topology to achieve robust and fault-tolerant communication.Under the guidance of the well-known Molloy-Reed criterion [44], which is where k denotes the average degree of an arbitrary node which can be considered somehow as the node degrees a i , it can be inferred that each node in the communication network should at least be connected to two other nodes.For displacement-based formation, the formation is persistent only if there exists at least a spanning tree in the communication topology.When establishing a communication connection, the formation error tends to grow as the communication distance between nodes increases, so we favor the connection method which minimizes the length of the communication chains [37].With the minimal constraint of satisfying the above conditions, we propose setting one node (usually node 1) of the network as the cluster head while the other nodes choose another communication node according to the inter-node distances that are derived from the formation definition.
To achieve this, we define the inter-agent distance matrix as where d ij is the Euclidean distance between nodes that can be calculated from the definition of the formation in Section 2 and D g is a symmetric positive definite matrix.Given that node h is selected as the cluster head, the element of the adjacency matrix is determined as where, γ * = arg min γ d iγ , γ ∈ N, 1 ≤ γ ≤ n and γ = h.In case there are multiple γ that minimize d iγ , the one with the highest order is taken.In the proposed adaptive topology, the cluster head undertakes the task of broadcasting its own state to the other nodes in the network, or the follower nodes need to follow the cluster head.In either case, the cluster head is not constrained by nodes other than the reference target.It is also important to note that only the cluster head has access to the reference target information, which greatly reduces the communication or detection burden of the follower nodes.
Using the above communication configuration method for a network with five nodes as an example, the obtained communication topology is shown in Figure 4a, and its adjacent matrix is Node 1 is the cluster head by default.When node 1 fails, node 2 successively inherits the head position as shown in Figure 4b, and the adjacent matrix becomes As a comparison, the traditional leader-follower topology in the case of leader node failure can be seen in Figure 4c.Due to the lack of adjacent links, effective connections among the nodes cannot be reformed.Using the proposed adaptive communication method, network connectedness can be ensured.The proof is that graph G is uniformly connected since, for any t ≥ t 0 , there exists a node h ∈ V such that h is the root of a spanning tree [45].

Model-Based Constrained Policy
The NCES algorithm explores the optimal policy that maximizes fitness by continuously interacting with the environment.Consequently, the NCES algorithm and the reinforcement learning algorithm are prone to fail to achieve global convergence of the control policy, in many cases due to excessive cumulative time or inherent defects in the design of the objective function.In much of the previous literature, control policies were allowed to freely explore regions of the environment under system constraints.In recent years, the idea of constrained policy has gradually emerged [46], which states that control policies should be executed under constraints that can be imposed either by human-developed rules or by the feedback state of the system.Constrained policy, also known as safety exploration [47], has been increasingly applied to the simulation and training of realistic robotic controllers by which not only can personal safety be ensured during the training process, but global convergence can also be accelerated to some extent.Based on the above facts, we propose a model-based constrained policy.The nonlinear error dynamics of the second-order system were obtained in (6), which were a function of the system input.To apply this method, we assume that the control input of the communication recipient can be obtained by the agent.Thus, at time t, the predicted formation error at the next step can be calculated as follows: where τ is the time step and ė ri is the derivative of the resultant error as described in Equation ( 6).The error deviation e ri (t) ∈ R 3 is which is derived from the absolute values of the two error vectors.Then, the aggregate matrix is defined as E(t) = [. . .e ri (t) . . .] ∈ R 3×n for i ∈ {1, 2, . . ., n}.A termination indicator ST is assigned to the system so that if it is a nonzero value, the system terminates and restarts the training algorithm, and it is defined as where δ s is the threshold that measures the maximum error increment the algorithm can tolerate.It satisfies δ s < (2l + 1)τV max , where l is the number of neighbors for the agent with the maximum number of neighbors in the MAS.With a reasonable choice of δ s , the algorithm is expected to exclude control strategies that deviate too much from the desired trajectory at an early stage.Experimental evidence shows that without such a constrained policy, the convergence speed and global optimality of the algorithm will be degraded.Applying the above adaptation techniques, the pseudo-code of the proposed algorithm used to train the missile formation controllers is shown in Algorithms 1 and 2.
Since the network topology adaptation and constrained policy strategies are embedded in the process of fitness evaluation, they are not exhibited in the pseudo-code.

Simulation and Result Analysis
In this section, numerical experiments are conducted to demonstrate the effectiveness of the proposed formation control algorithm.In order to simulate the actual physical environment, the construction of the experimental scenario and the modeling of the multi-agent system were carried out in PyBullet [48].We simulated the operation of the missile swarm in three-dimensional space at a fixed altitude, and for convenience, the trajectories were plotted as two-dimensional planar graphs.Some complex aerodynamic parameters, such as air resistance, were removed, while collision detection was preserved.The simulation time step τ was set to be 0.1 for the following experiments.

Basic Formation Control
First, the basic formation control tasks were implemented to examine the validity of the algorithm under basic situations.A swarm of five missiles entails tracking the reference target moving along a diagonal or spiral trajectory while keeping the formation geometry.The reference target or virtual leader is set to be at the center of the formation in order to achieve error-free formation control.The objective is to achieve zero tracking error as well as zero formation maintenance error for the whole time (i.e., t |e ri (t)|dt → 0 3 for i ∈ {1, . . ., N}).The hyperparameters of the algorithm are listed in Table 1, and it is noted that this parameter setting applied to all of the following experiments.Additionally, the system constraints are shown in Table 2.The missiles, as well as the reference target, were subjected to saturated control inputs and limited states.For the linear trajectory, the target was driven by a constant control input, while for the spiral trajectory, the control inputs of the target were v r = 0.65 − 0.01t (km/s) and ω r = 0.1 + 0.01t (rad/s).For convenience, we used |e ri | to represent the resultant error of each agent.The trajectories of the two situations are shown in Figures 5 and 6, and the corresponding analytical results are presented in Figures 7 and 8.It can be observed that the convergent resultant error was maintained in a small interval (within 0.05) in both cases, and the results were of comparable if not better accuracy than the comparison results in [33], which kept the position error within 50 m.In addition, good synchronization of the speed and heading angles among the missiles was achieved, although the neighboring speed information was not provided.

Moving into Formation
Moving into formation is, however, different from cases where the formation is in the ideal geometry in the initial state.The missiles were separated from the reference target, with their positions initialized randomly in an area 4 km wide and 3 km long.The missiles need to first move to the designated formation shape and then track the reference target in a consistent motion direction, and in this case, the target moved along the y axis with a constant speed of 0.5 km. Figure 9 shows the motion trajectory of the swarm formation, and Figure 10 shows the resultant error during the flight.The results indicate that the formation was able to adjust to the desired formation shape rapidly (mostly within 10 s) and achieved high accuracy in tracking and maintaining the formation in the case of random initial distribution.

Switching Formations
Furthermore, we discuss the case in which the missile swarm is supposed to switch among formation patterns in order to avoid obstacles and go through narrow spaces.To achieve this transformation, an additional input node TS was appended to the input layer of the policy network as depicted in Figure 3.It was assumed that whenever the obstacle was detected by the leader missile or the literal cluster head, the missile then sendt a signal to all connected missiles to perform a formation switch of λ P → λ L .Although formation geometries are predefined by the controller, the shapes can be controlled by parameters l f and β, which can be changed during flight.
In this scenario, two walls were placed in the trajectory of the formation flight to create narrow spaces, and the missiles needed to pass through the obstacles by changing the shape or size of the formation.We implemented an event-based formation-switching strategy, in which the nodes in the formation were able to detect obstacles within a certain range d c and send a formation-switching signal to all other nodes if an obstacle were to be detected.Similarly, after crossing the obstacle and reaching a safe distance, a formation recovery signal would be sent to restore the original formation.
To pass through the obstacles, we considered changing both the formation pattern and formation size, and the time-varying definitions of the formation in both cases were as follows.
The resulting trajectories are shown in Figures 11 and 12.It can be observed that when obstacles were detected, the swarm could swiftly adjust by changing the formation pattern or size and then recover the formation quickly after going through the narrow space:

Formation Control under Node Failure
An experiment was conducted in this scenario to verify the effectiveness of the proposed algorithm under node failure, where a swarm with six nodes was designed to pursue the reference target in a regular polygon formation and the reference target moved in a sinusoidal fashion.At t = 20 s during pursuit, cluster head node 1 and cluster member node 4 suffered attacks and would disconnect from the other nodes in the swarm, and the remaining nodes of the swarm needed to maintain their original formation and complete the formation task.
From the results in Figures 13 and 14, it can be observed that the swarm selected node 2 as the cluster head to reorganize the communication topology after node failure and successfully maintained the original formation shape after a short fluctuation.Finally, the robustness of the proposed formation control algorithm against node failure was validated.To investigate the effect of policy constraints on the control performance, we compared the results of the NCES-based formation control method that imposed policy constraints with the one without policy constraints.In some cases, such as node failure and switching formation, there was a certain chance that the algorithm would not converge.Moreover, in all cases, the convergence rate was generally improved by more than 20% with policy constraint rather than without it.Therefore, policy constraint is essential for the training of the NCES-based neural network controller.

Conclusions
This paper proposes a novel distributed NCES-based formation control algorithm for a second-order multi-missile system using neural networks.The algorithm minimizes the formation shape error and tracking error by training the optimal network controller through iterative learning, and it was combined with a policy constraint approach to enhance the stability of the algorithm.Additionally, we designed an adaptive topology scheme for the node failure situation which can achieve stable connections at a low communication cost.We also proposed a stable population adaptation method based on the evolution path, which further improved the performance of the algorithm and alleviated local optimum issues.The numerical experiments demonstrated that the proposed formation control algorithm is capable of accomplishing tasks, such as formation maintenance, tracking of reference trajectories, formation transformation and obstacle avoidance.This indicates that the proposed algorithm has a high level of accuracy and robustness to cope with situations such as random initial positions and node failures.This paper also discussed the parametric definition of the formation geometry as a background supplement to the field.
Due to the characteristics of constrained co-evolution, the algorithm is expected to be applied to predict the amount of available renewable energy by environmental indicators, such as gas emissions or wind, estimate the water absorption by crops in controlled agriculture and even provide new ways to solve quantum many-body problems [49][50][51].Future works are recommended to investigate online evolutionary algorithms to solve the problem of unknown or stochastic system models and to apply the proposed algorithm to real-world experiments.

Figure 2 .
Figure 2. Two types of formation patterns.

Figure 4 .
(a) All nodes are valid (b) Node 1 failed (c) Node 1 failed Schematic diagram of topology network with five nodes: (a,b) the adaptive topology networks and (c) the traditional leader-follower topology.

Figure 5 .
Figure 5. Trajectories of basic formation control along linear trajectory.

Figure 6 .
Figure 6.Trajectories of basic formation control along spiral trajectory.

Figure 7 .
Figure 7. Analytical results of the linear trajectory case.

Figure 8 .
Figure 8. Analytical results of the spiral trajectory case.

Figure 9 .
Figure 9. Trajectories in case of moving into formation.

Figure 10 .
Figure 10.Resultant error curves for case of moving into formation.

Figure 11 .
Figure 11.Trajectories for case of switching formation type.

Figure 12 .
Figure 12.Trajectories for case of switching formation size.

Figure 13 .
Figure 13.Trajectories of the node failure case.

Figure 14 .
Figure 14.Resultant error curves of the node failure case.

Algorithm 1
The distributed NCES based formation control algorithm.Input: agent number N, population size η p , standard deviation σ, rotation angle β, evolution path ρ θ , number of parameters m, iteration t

Table 1 .
Hyperparameters of the proposed algorithm.