Neural Surrogate-Enhanced Metaheuristic Optimization for Distributed Quadrotor Swarm Control

Li, Jinze; Wen, Zeling; Ning, Zhaoke

doi:10.3390/s26113398

Open AccessArticle

Neural Surrogate-Enhanced Metaheuristic Optimization for Distributed Quadrotor Swarm Control

by

Jinze Li

¹,

Zeling Wen

¹

and

Zhaoke Ning

^1,2,*

¹

School of Aeronautics and Astronautics, Sichuan University, Chengdu 610207, China

²

Multi-Source Information Intelligent Fusion Key Laboratory of Sichuan Province, Chengdu 610207, China

^*

Author to whom correspondence should be addressed.

Sensors 2026, 26(11), 3398; https://doi.org/10.3390/s26113398

Submission received: 17 April 2026 / Revised: 14 May 2026 / Accepted: 18 May 2026 / Published: 27 May 2026

(This article belongs to the Topic Advanced Methods in Unmanned Aerial Vehicle Control, Navigation, and Safety)

Download

Browse Figures

Versions Notes

Abstract

Real-time cooperative control of quadrotor swarms in cluttered environments requires balancing formation maintenance, obstacle avoidance, inter-UAV safety, and per-step computational cost. This paper proposes a multilayer perceptron (MLP) surrogate for high-level objective-weight selection in a modified multi-objective pigeon-inspired optimization (modified MPIO) distributed controller. The proposed MLP surrogate learns the state-to-weight mapping of the online search and directly predicts the two-dimensional objective-weight vector, while the original flocking, gap-based obstacle-avoidance, and command generation rules are retained unchanged. The surrogate is trained from teacher-generated weight labels using randomized scenes, DAgger-based state aggregation, and risk-weighted supervision. On a fixed closed-loop benchmark, the proposed controller increases the true collision free rate from 48.00% to 86.89% and the safe success rate from 38.67% to 74.22% relative to modified MPIO, while reducing the mean per-step decision latency for the whole swarm from 8494.70 ms to 0.92 ms. The improvement is most pronounced in safety-related and runtime metrics, while the formation-related gain is comparatively modest. Ablation results show that the final benchmark performance is not explained by DAgger or risk weighting alone, and that the medium-sized surrogate provides the best safety-latency tradeoff among the tested network architectures. A qualitative AirSim case study further indicates that the same high-level surrogate controller can be executed in a higher-fidelity asynchronous multirotor simulator.

Keywords:

UAV swarms; multi-objective optimization; neural surrogate; distributed control; real-time decision making

1. Introduction

Autonomous path planning for single UAVs has been studied extensively; the central problem is to generate safe and dynamically feasible trajectories from available map or sensor information while reaching mission objectives [1]. Compared with the single-UAV case, multi-UAV planning introduces additional coupling among agents: multiple vehicles must reach mission goals while maintaining coordination, inter-agent safety, and robustness under sensing and communication constraints [2,3,4,5]. Therefore, the resulting problem is a coupled multi-agent control problem in cluttered and time-varying environments rather than a single-route generation problem.

Representative structured coordination lines include flocking and consensus rules [6,7,8], velocity–obstacle methods [9,10], and safety certificate-based collision avoidance [11]. Optimization-based and heuristic or metaheuristic methods remain attractive for cooperative UAV control because they can accommodate nonconvex objectives, coupled constraints, and competing mission requirements without relying on a fully tractable analytical model [12,13,14,15]. Representative multi-objective UAV studies have shown that these methods can produce feasible cooperative behaviors in complex scenes [16,17]. Their main limitation for deployment is computational; when optimization must be executed online at every control step, the search cost grows quickly with swarm size, obstacle density, and constraint coupling, making strict real-time execution difficult [3,14].

Recent studies in adjacent domains involving dynamic control also reflect the broader use of learning and iterative optimization to improve closed-loop control under practical constraints, including model-free Q-learning-based fault-tolerant control for batch processes [18] and infinite-horizon iterative learning control under actuator failures [19]. Although these works address different plants and task structures from decentralized quadrotor swarm obstacle avoidance, they further motivate the general need to reduce online decision burden while preserving closed-loop control performance.

When repeated online optimization becomes the main bottleneck, learning-based methods become a practical way to reduce online computation. Related approximation routes include explicit MPC [20], learning-based MPC [21], and neural approximations of optimization-based controllers [22,23]. However, end-to-end replacement also weakens interpretability and explicit safety semantics in safety-critical multi-UAV control [21,24]. In closed-loop control, distribution shift is another key issue. Behavior-cloning policies can degrade when deployment states deviate from the training distribution. DAgger addresses this by aggregating expert labels on learner-visited states [25,26]. These considerations make full policy replacement less attractive in our setting, and instead point to a more limited surrogate design that preserves the existing control structure while replacing only the heaviest online decision module.

Therefore, this paper adopts a module-level neural-surrogate design. We retain the inherited distributed control structure and replace only the online modified MPIO search over a two-dimensional objective-weight vector. This weight vector regulates the tradeoff between flocking/formation maintenance and obstacle-avoidance behavior at each control step. Therefore, the neural surrogate does not output direct UAV control commands; it only predicts this intermediate decision variable, which is then used by the unchanged downstream control rules. Because the learned variable is low-dimensional and interpretable, it can be supervised by teacher-generated weight labels while preserving the surrounding control semantics. The surrogate is trained with randomized scenes, DAgger-style state aggregation, and risk-weighted supervision, then evaluated on a retained synchronous closed-loop benchmark. AirSim is used only as a qualitative case study for higher-fidelity asynchronous multirotor execution [26,27]. Our main contributions are as follows:

(1): We address the deployment-time bottleneck in an inherited modified MPIO swarm controller by replacing its two-dimensional online weight selection module with a lightweight neural surrogate, leaving the existing flocking and gap-based obstacle-avoidance rules unchanged.
(2): We train the surrogate with scene randomization, DAgger, and risk-weighted supervision, allowing it to better handle learner-visited states and place greater emphasis on safety-related samples during training.
(3): We evaluate the method on a fixed synchronous closed-loop benchmark as well as a qualitative AirSim case study. On the benchmark, the surrogate improves the true collision-free rate and safe success rate while greatly reducing whole-swarm per-step decision latency and eliminating step overruns under the current implementation. In AirSim, the same high-level controller remains executable in an asynchronous multirotor control loop, providing qualitative evidence of migration feasibility.
(4): We release the source code, final trained model, and merged evaluation results to support reproducibility and further comparison (https://github.com/cliche71/quadrotor-swarm-neural-surrogate.git, accessed on 17 May 2026).

2. System Modeling and Problem Formulation

2.1. UAV Model

At the platform level, each agent is a quadrotor Unmanned Aerial Vehicle (UAV). Because the eventual deployment target is a quadrotor swarm rather than an abstract point robot, we briefly state a standard six-degree-of-freedom (6-DOF) rigid-body model to define the vehicle variables and execution layer, following Sabatino’s thesis [28] and the survey by Zhang et al. [29]. Let

{O_{W}}

denote the inertial frame and

{O_{B}}

the body-fixed frame. The UAV position is described by

p_{W}

= {[x, y, z]}^{⊤}

and its attitude is represented by the Euler angles

(ϕ, θ, ψ)

, where

(x, y, z)

are the inertial coordinates of the center of mass and

(ϕ, θ, ψ)

are the roll, pitch, and yaw angles, respectively.

The world-frame translational kinematics are

{\dot{p}}_{W} = v_{W}, v_{W} = {[\dot{x}, \dot{y}, \dot{z}]}^{⊤} .

The body-frame linear and angular velocities are

v_{B} = {[u, v, w]}^{⊤}, ω_{B} = {[p, q, r]}^{⊤} .

The mass is denoted by m, the inertia matrix by

J = diag (J_{x}, J_{y}, J_{z})

, the total thrust along the body

z_{B}

-axis by T, and the body torques by

τ_{B}

= {[τ_{ϕ}, τ_{θ}, τ_{ψ}]}^{⊤}

. Using the standard Euler-angle attitude representation, the full 6-DOF quadrotor model can be written as follows [28,29]:

\{\begin{matrix} \ddot{x} & = \frac{T}{m} (cos ϕ sin θ cos ψ + sin ϕ sin ψ), \\ \ddot{y} & = \frac{T}{m} (cos ϕ sin θ sin ψ - sin ϕ cos ψ), \\ \ddot{z} & = \frac{T}{m} cos ϕ cos θ - g, \\ \dot{ϕ} & = p + q sin ϕ tan θ + r cos ϕ tan θ, \\ \dot{θ} & = q cos ϕ - r sin ϕ, \\ \dot{ψ} & = q \frac{sin ϕ}{cos θ} + r \frac{cos ϕ}{cos θ}, \\ \dot{p} & = \frac{1}{J_{x}} (τ_{ϕ} + (J_{y} - J_{z}) q r), \\ \dot{q} & = \frac{1}{J_{y}} (τ_{θ} + (J_{z} - J_{x}) p r), \\ \dot{r} & = \frac{1}{J_{z}} (τ_{ψ} + (J_{x} - J_{y}) p q) . \end{matrix}

(1)

However, the present contribution is not a new low-level attitude controller; rather, the optimization and learning developments of this paper operate at the high-level swarm-decision layer. For the planar task considered in this work, the horizontal motion of UAV i is described by

p_{i} = {[x_{i}, y_{i}]}^{⊤},

where

x_{i}

and

y_{i}

denote the planar coordinates. Under the adopted heading-angle definition, the planar kinematics are

{\dot{p}}_{i} = [\begin{matrix} {\dot{x}}_{i} \\ {\dot{y}}_{i} \end{matrix}] = v_{i} [\begin{matrix} cos ψ_{i} \\ sin ψ_{i} \end{matrix}],

where

v_{i} = ∥ v_{i} ∥

is the planar speed and

ψ_{i}

is the heading angle. Figure 1 illustrates the 6-DOF quadrotor coordinate frames and the planar task abstraction used in this work.

2.2. Swarm Flocking Objectives

We consider a swarm of N UAVs indexed by

i \in {1, \dots, N}

. The horizontal position and velocity of UAV i are denoted by

p_{i} = {[x_{i}, y_{i}]}^{⊤} \in R^{2}

and

v_{i} = {[v_{x, i}, v_{y, i}]}^{⊤} \in R^{2}

, respectively. Each UAV interacts only with neighbors inside a limited communication radius

R_{comm}

, leading to the neighbor set

N_{i} ≜ \{j \neq i | ∥ p_{j} - p_{i} ∥ \leq R_{comm}\} .

(2)

Following classical flocking models and distributed coordination principles [7,30], the present model adopts a local-rule view in which spacing regulation, velocity alignment, and short-range repulsion jointly shape the collective motion of the swarm. Under this view, flocking is not treated as a purely emergent effect, but as a structured high-level control prior written explicitly in terms of neighborhood interactions and local safety-related forces. In the current controller semantics, three horizontal control primitives are computed for UAV i: a spacing (cohesion–separation) term

{\dot{v}}_{i}^{space}

, a velocity alignment term

{\dot{v}}_{i}^{align}

, and a short-range collision repulsion term

{\dot{v}}_{i}^{coll}

.

The weighted spacing contribution used in the flocking channel drives

d_{i j}

towards a desired spacing radius

R_{des}

:

{\dot{v}}_{i}^{space} = w_{i 1} K_{f} \sum_{j \in N_{i}} (1 - {(\frac{R_{des}}{d_{i j}})}^{2}) Δ p_{i j},

(3)

where

K_{f}

is the spacing gain. When

d_{i j} >

R_{des}

, the coefficient in parentheses is positive and

{\dot{v}}_{i}^{space}

attracts UAV i towards its neighbors, whereas for

d_{i j} <

R_{des}

it becomes negative and induces repulsion;

Δ p_{i j} = p_{j} - p_{i}

and

d_{i j} = ∥ Δ p_{i j} ∥

denote the relative position and distance from agent i to j.

The weighted alignment contribution reduces velocity disagreement among neighbors and promotes locally coherent motion with alignment gain

K_{a}

:

{\dot{v}}_{i}^{align} = w_{i 1} K_{a} \sum_{j \in N_{i}} (v_{j} - v_{i}) .

(4)

To prevent close-proximity contact under dense interactions, we further introduce a strong repulsive term when another UAV enters a safety radius

R_{\lim}^{1}

around agent i:

{\dot{v}}_{i}^{coll} = - K_{c} \sum_{j \in N_{i} : d_{i j} \leq R_{\lim}^{1}} {(\frac{1}{d_{i j}} - \frac{1}{R_{\lim}^{1}})}^{2} \frac{Δ p_{i j}}{d_{i j}},

(5)

where

K_{c}

is the collision avoidance gain. This term grows rapidly as the inter-UAV distance decreases, and as such provides a hard local safety buffer.

The horizontal flocking contribution used later in the controller is then given by

{\dot{v}}_{i}^{flock} = {\dot{v}}_{i}^{space} + {\dot{v}}_{i}^{align} + {\dot{v}}_{i}^{coll} .

(6)

2.3. Obstacles and Gap-Based Avoidance Model

We model obstacles on the flight plane at altitude

h_{e}

as inflated disks with center

o_{k} = {[x_{k}, y_{k}]}^{⊤}

and effective safety radius

R_{\lim, k}^{2}

, indexed by

k \in K_{obs}

. For static cylindrical obstacles, each disk is given by the cylinder–plane intersection. For moving spherical obstacles, the disk center is updated at each control step from the obstacle state on that plane, with the local avoidance decision being recomputed from the refreshed obstacle positions. Accordingly, the planar obstacle region considered by the local avoidance module is

O_{k} = \{p \in R^{2} | ∥ p - o_{k} ∥ \leq R_{\lim, k}^{2}\} .

(7)

The local perception geometry is illustrated in Figure 2.

Each UAV is equipped with a forward-looking depth sensor with sensing range

R_{sense}

and field of view

[- Θ_{FOV}, Θ_{FOV}]

. Let

v_{e} = {[v_{e, x}, v_{e, y}]}^{⊤}

denote the desired planar cruise velocity. The current local heading

ψ_{i}

is taken from the current horizontal velocity direction; when the velocity is near zero, the UAV falls back on the direction of

v_{e}

.

In the local angular frame centered at

ψ_{i}

, each detected obstacle induces a blocked angular interval, and the complement of the union of all blocked intervals defines the free-gap set:

G_{i} = [- Θ_{FOV}, Θ_{FOV}] ∖ ⋃_{k} [α_{k}^{L}, α_{k}^{R}]

(8)

where

α_{k}^{L}

and

α_{k}^{R}

denote the left and right angular boundaries of the blocked interval induced by obstacle k in the local angular frame centered at

ψ_{i}

.

Thus, local obstacle avoidance is converted into selecting a feasible deviation angle inside the free gaps, following the general idea of gap-based obstacle avoidance [31]. We distinguish interior gaps from boundary gaps adjacent to

\pm Θ_{FOV}

.

Within each remaining gap, a small set of candidate deviation angles is sampled, optionally including

α = 0

when the reference direction lies inside the gap. Candidates are discarded if they are too close to the field-of-view boundary, provide insufficient forward progress, or fail to maintain enough clearance. For each feasible candidate

α

, we evaluate its normalized clearance

{\tilde{d}}_{i} (α)

, the normalized width

{\tilde{w}}_{i} (α)

of its parent gap, the forward progress term

p_{i} (α) = cos α

, and the normalized turning cost

t_{i} (α) = | α | / Θ_{FOV}

. A binary indicator

b_{i} (α) \in {0, 1}

marks whether the candidate belongs to a boundary gap.

The local avoidance direction is selected by the lightweight feasible gap score:

S_{i} (α) = k_{clear} {\tilde{d}}_{i} (α) + k_{width} {\tilde{w}}_{i} (α) + k_{prog} p_{i} (α) - k_{turn} t_{i} (α) - k_{edge} b_{i} (α) t_{i} (α)

(9)

which favors large clearance, wide gaps, strong forward progress, and modest turning while mildly penalizing boundary gap solutions. The selected deviation angle and obstacle-guided heading are

α_{i}^{⋆} = arg max_{α} S_{i} (α), ψ_{i}^{o} = ψ_{i} + α_{i}^{⋆} .

(10)

Finally, the corresponding obstacle-guided planar velocity is

v_{i}^{o} = w_{i 2} ∥ v_{e} ∥ [\begin{matrix} cos ψ_{i}^{o} \\ sin ψ_{i}^{o} \end{matrix}] .

(11)

3. Multi-Objective Optimization Formulation and Online Modified MPIO Solver

At each control step, we optimize neither a full trajectory nor low-level actuation directly; instead, UAV i searches for a two-dimensional objective-weight vector

w_{i} \in {[0, 1]}^{2}

that balances the flocking and obstacle avoidance components. In the modified MPIO solver, each pigeon represents one such candidate weight vector. For a given candidate, the controller computes the corresponding blended flocking–avoidance action and the resulting next-step planar state under the current local state. The objective functions introduced below are then used to evaluate the candidate solution represented by that pigeon, i.e., to assess how suitable the induced next-step state is for the controller update of UAV i. Therefore, this section formulates the resulting constrained per-step multi-objective problem and describes the modified MPIO teacher used to solve it online, then shows how the selected weight generates the control command and updates the UAV state.

Specifically, for UAV i, we define the two-dimensional weight vector

w_{i} = {[w_{i 1}, w_{i 2}]}^{⊤} \in {[0, 1]}^{2} .

(12)

3.1. Objective Optimization and Feasible Pareto Selection

The first soft objective smoothly shifts from cruise-velocity matching far from obstacles to preservation of forward progress near obstacles:

j_{1, i} (w_{i}) = β_{i} (∥ v_{e} ∥ - v_{i}^{⊤} \frac{v_{e}}{∥ v_{e} ∥}) + (1 - β_{i}) ∥ v_{e} - v_{i} ∥

(13)

where

β_{i} \in [0, 1]

is a clipped coefficient for clearance scheduling determined by the current minimum clearance to the inflated obstacle boundary, with the inflation defined by the hard safety margin

R_{\lim}^{2}

. As the clearance decreases,

β_{i}

increases towards 1, whereas in open space it decreases towards 0.

The second soft objective measures flocking quality and local velocity agreement over the communicated neighborhood. This objective combines the quality of formation geometry and the alignment with neighbor velocity, as follows:

\begin{matrix} j_{2, i} (w_{i}) & = \sum_{\begin{matrix} j \in N_{i} \\ ∥ p_{j} - p_{i} ∥ \leq R_{comm} \end{matrix}} (∥R_{des} - d_{i j}∥ + ∥v_{i} - v_{j}∥) . \end{matrix}

(14)

The hard constraints are used as a feasibility screen. The hard constraint for obstacles checks whether the distance between the UAV and the obstacle center is smaller than the obstacle safety radius:

j_{3, i} (w_{i}) = \{\begin{matrix} 1, & \exists k \in K_{obs} such that ∥p_{i} - o_{k}∥ \leq R_{\lim, k}^{2}, \\ 0, & otherwise . \end{matrix}

(15)

The inter-UAV hard constraint checks whether the minimum distance to any neighbor is smaller than the inter-UAV safety radius:

j_{4, i} (w_{i}) = \{\begin{matrix} 1, & min_{j \in N_{i}} ∥ p_{j} - p_{i} ∥ < R_{\lim}^{1}, \\ 0, & otherwise . \end{matrix}

(16)

Only candidates that satisfy both hard constraints are retained for further comparison.

The resulting decision is treated as a constrained bi-objective problem in the Pareto sense [12,13]. Therefore, the per-step decision problem for a single UAV can be written as

min (j_{1, i} (w_{i}), j_{2, i} (w_{i})) s . t . j_{3, i} (w_{i}) = 0, j_{4, i} (w_{i}) = 0 .

(17)

3.2. Online Modified MPIO Teacher

The modified MPIO solver acts as an online teacher over the weight space. At each step, it evaluates candidate weights, removes infeasible ones using the two hard constraints, performs Pareto ranking on the feasible set with respect to

(j_{1, i}, j_{2, i})

, and updates the population through leader–follower refinement. The final output is the feasible weight minimizing

j_{2, i}

on the last Pareto front; if this front is empty, the solver falls back to the best feasible weight in the elite archive A, and otherwise to a conservative default weight. Algorithm 1 summarizes the teacher-side per-step weight-selection procedure, and Table 1 lists the corresponding modified-MPIO hyperparameters.

Let

F_{1}^{(k)}

denote the Pareto first front at iteration k. The landmark center is defined as

X_{center}^{(k)} = \frac{1}{| F_{1}^{(k)} |} \sum_{X \in F_{1}^{(k)}} X,

(18)

where

X

denotes a candidate position in the weight space.

A representative form of the implemented leader update combines map-and-compass decay with attraction to an elite representative and to the current landmark center:

\begin{matrix} V_{i}^{(k)} = e^{- R k} V_{i}^{(k - 1)} + r_{1} f_{t} (1 - \frac{log k}{log K_{max}}) (X_{g}^{(k - 1)} - X_{i}^{(k - 1)}) \\ + r_{2} f_{t} (\frac{log k}{log K_{max}}) (X_{center}^{(k - 1)} - X_{i}^{(k - 1)}), \end{matrix}

(19)

X_{i}^{(k)} = Π_{[X_{L}, X_{U}]} (X_{i}^{(k - 1)} + Π_{[V_{L}, V_{U}]} (V_{i}^{(k)})),

(20)

where

X_{g}

is the current elite representative, R is the map-and-compass decay factor,

f_{t}

is the transition factor, and

Π

denotes the boundary projection operator. The random coefficients

r_{1}

and

r_{2}

are independently drawn from

U (0, 1)

.

For ordinary followers, the “modified” component lies in the hierarchical learning rule: lower-ranked individuals do not directly follow the global best, but instead learn a randomly selected dimension from a better-ranked individual. A representative update form is

X_{i}^{(k)} (d^{*}) = X_{j}^{(k - 1)} (d^{*}) + e r, N_{j}^{o} < N_{i}^{o} .

(21)

In this expression,

d^{*}

is the randomly selected dimension, j is a learning target ranked above i, e is the learning error, and

r \sim U (- 1, 1)

is a scalar random perturbation. This mechanism improves information transfer within the population while preserving high-quality Pareto structure and preventing all individuals from collapsing too quickly toward the same local preference.

Algorithm 1 Online Modified MPIO Teacher for Per-step Weight Selection

Require:: Local state of UAV i, neighbor states, params $P, K_{d}, K_{max}$
Ensure:: Teacher weight $w_{i}^{⋆}$
1:: Initialize pop. $P = {X_{q}, V_{q}}_{q = 1}^{P}$ in ${[0, 1]}^{2}$ and inject previous valid weight.
2:: Evaluate costs for all pigeons. Mark infeasible candidates ( $j_{3, i} = 1$ or $j_{4, i} = 1$ ) to exclude them from Pareto ranking; init elite archive $A \leftarrow \emptyset$ .
3:: for $k = 1, \dots, K_{max}$ do
4:: Pareto-sort the feasible subset of $P$ , compute landmark center $X_{center}^{(k)}$ via Equation (18).
5:: Update $A \leftarrow ArchiveUpdate (A, F_{1}^{(k)})$ and select global guidance $X_{g}^{(k)}$ from $F_{1}^{(k)}$ .
6:: for each pigeon $q \in P$ do
7:: Store $X_{q}^{old}$ .
8:: Update: If q is a leader, update via Equations (19) and (20); else perform hierarchical follower update (Equation (21)) or random walk.
9:: Evaluate & Rollback: Project to bounds and evaluate costs. Mark if infeasible. If dominated by its previous state, roll back to $X_{q}^{old}$ and restore costs.
10:: end for
11:: Remove the worst $K_{d}$ pigeons based on Pareto rank and crowding distance to reduce $| P |$ .
12:: end for
13:: Extract the final feasible Pareto front set $S_{1}$ from $P$ .
14:: return $w_{i}^{⋆} \in S_{1}$ minimizing $j_{2, i}$ , else best feasible in A, else default $[0.2, 0.8]$ .

3.3. From Weight Selection to Closed-Loop State Update

The selected weight is not an endpoint of the optimization stage; it is the high-level control decision used at the current step. In the current implementation, this decision is not applied through a direct planar Euler update. Instead, the controller retains a quadrotor-compatible control-state separation. The planar input is then constructed as

{\bar{u}}_{x y, i} = [\begin{matrix} {\bar{u}}_{x, i} \\ {\bar{u}}_{y, i} \end{matrix}] = [\begin{matrix} {\dot{v}}_{i 1}^{flock} + (v_{i 1}^{o} - v_{i 1}) \\ {\dot{v}}_{i 2}^{flock} + (v_{i 2}^{o} - v_{i 2}) \end{matrix}],

(22)

and the corresponding autopilot-style planar intermediate references are

\{\begin{matrix} {\bar{v}}_{x y, i} & = ∥ v_{x y, i} ∥ + t_{v} ({\bar{u}}_{x, i} cos ψ_{i} + {\bar{u}}_{y, i} sin ψ_{i}), \\ {\bar{ψ}}_{i} & = ψ_{i} + \frac{t_{ψ}}{{\bar{v}}_{i}} (- {\bar{u}}_{x, i} sin ψ_{i} + {\bar{u}}_{y, i} cos ψ_{i}) . \end{matrix}

(23)

The vehicle state is then propagated over one control period by the corresponding channel dynamics, followed by discrete state advancement of

(x_{i}, y_{i}, v_{x y, i}, ψ_{i})

over the current control period. Teacher-side candidate evaluation uses this mapping to predict the one-step consequence of each candidate weight, whereas the executed closed-loop controller uses the same high-level command semantics to advance the actual system state after a final weight has been selected.

Accordingly, modified MPIO is used to search the control weight vector online for the per-step multi-objective problem. However, because this search is performed under a fixed budget, the feasible front may become sparse under dense interactions or in narrow feasible regions. This motivates replacing the online search with a neural surrogate, while keeping the control execution and state update unchanged.

4. Neural Surrogate Learning

To replace the online weight search, we train a neural surrogate that maps local feature vectors to the corresponding weight decisions produced by the online modified MPIO controller. The training set combines base rollouts generated by this controller with DAgger-style relabeling on states visited by the current student. The overall pipeline is illustrated in Figure 3 and the corresponding online control flow used during deployment is summarized in Figure 4.

4.1. Base Data Collection and DAgger Relabeling

The training data are constructed by rolling out the online modified MPIO controller in randomized scenes and recording local feature–weight pairs [32]. In this context, the online modified MPIO controller is treated as the teacher, while the neural surrogate is treated as the student. Each sample consists of a 41 dimensional local feature vector and the corresponding two-dimensional weight label. However, successful teacher rollouts under-represent crowded and safe states, especially in narrow passages and densely coupled local interactions. Therefore, we further adopt a DAgger-style data aggregation strategy [26] in which the current student is rolled out closed-loop in randomized scenes and the visited states are converted into local features before being relabeled by the teacher, producing additional feature–weight pairs. The final training set is the union of the base data and the relabeled data.

Formally, let the base imitation dataset be

D_{base} = {(x_{n}, w_{n}^{⋆})}_{n = 1}^{N_{base}}, w_{n}^{⋆} = π_{T} (x_{n}),

(24)

where

x_{n} \in R^{41}

denotes the local feature vector defined in Table 2 and

w_{n}^{⋆} \in {[0, 1]}^{2}

denotes the two-dimensional weight label provided by the online modified MPIO teacher. Let the teacher-relabeled data induced by the current student rollout be

D_{dag} = {(x_{m}^{†}, w_{m}^{†})}_{m = 1}^{N_{dag}}, w_{m}^{†} = π_{T} (x_{m}^{†}),

(25)

where

x_{m}^{†}

denotes a state actually visited by the student in closed-loop and then relabeled by the teacher. The final mixed dataset is defined as

D_{mix} = D_{base} \cup D_{dag} .

(26)

Following this procedure, the retained mixed-dataset artifact used for training adds 70 DAgger-style episodes on top of the 70 retained base episodes, yielding 140 episodes and 53,747 supervised samples.

4.2. Surrogate Network and Risk-Weighted Training Objective

We use a lightweight multilayer perceptron as the neural surrogate to directly learn the mapping from local state features to the teacher weight vector. Let the input feature be

x

. Before training, each feature dimension is standardized as

\tilde{x} = \frac{x - μ}{σ},

(27)

where

μ

and

σ

respectively denote the feature-wise mean and standard deviation estimated from the training set. The surrogate policy is denoted by

\hat{w} = f_{θ} (\tilde{x})

(28)

and implemented as the lightweight MLP

41 \to 128 \to 64 \to 2 .

(29)

Each training sample pairs a standardized 41-dimensional local feature vector with the corresponding teacher-selected objective-weight vector. The surrogate is implemented as an MLP with ReLU activations in the hidden layers and a two-neuron sigmoid output layer, which constrains the predicted weights to

{[0, 1]}^{2}

. The network output is not a velocity, acceleration, or low-level UAV command; instead, it replaces the online modified MPIO weight search result and specifies the relative emphasis between the preserved flocking/formation component and the preserved gap-based obstacle avoidance component. The resulting weights are then used by the unchanged downstream command generation and state update rules. Neighbor and obstacle features are constructed in an ego-centered local frame before being fed to the MLP. Neighbor entries are ordered by increasing Euclidean distance to the ego UAV, and obstacle entries are ordered by increasing distance from the ego UAV to the obstacle center. The input keeps only the nearest three neighbors and nearest three obstacles; if fewer than three entries are available, zero padding is used to keep the 41-dimensional input fixed. Apart from this top-K selection and padding, the raw geometric, velocity, radius, and clearance features are not manually clipped, min–max normalized, or hard-bounded before training. Instead, each input dimension is standardized using the training-set mean and standard deviation, as provided in Equation (27). The sigmoid bound is applied only to the two-dimensional output weight vector, not to the input features. Auxiliary vertical quantities such as

λ_{i}

and

h_{i}

are retained only to provide context for altitude regulation towards

h_{e}

and the current operating condition; the surrogate prediction itself remains associated with the planar high-level weight-selection decision.

Training uses a risk-weighted SmoothL1 loss

L = \frac{1}{B} \sum_{b = 1}^{B} ω_{b} ℓ_{SmoothL 1} ({\hat{w}}_{b}, w_{b}^{⋆}),

(30)

where B is the mini-batch size,

w_{b}^{⋆}

is the teacher weight label, and

{\hat{w}}_{b}

is the surrogate output. The factor

ω_{b}

is a per-sample training weight used to emphasize safety-critical states. It is increased for samples with small neighbor or obstacle safety margins, and is further amplified for samples drawn from episodes containing inter-UAV or obstacle collisions.

Model selection uses validation loss with early stopping. The offline test MAE is 0.1929 and is reported only as an indicator of teacher-weight fitting quality, whereas the main evidence remains closed-loop safety and real-time performance.

5. Results and Discussion

5.1. Experimental Protocol

All controllers are evaluated on the same retained scene instances using the same simulator step size (

Δ t

= 0.5 s), episode horizon of 59.5 s, and metric definitions. The closed-loop benchmark uses synchronous simulation. At each control step, the simulator waits for the controller output before advancing to the next step. The retained mixed dataset used for training contains 140 episodes, which are split at the episode level into 80% training, 10% validation, and 10% test subsets. The base dataset and DAgger dataset use random seeds 2027 and 3010, respectively, whereas the benchmark evaluation uses seeds 2028, 2029, and 2030. This retained benchmark contains 450 closed-loop episodes in total, and is used as a controlled and reproducible evaluation set for method comparison rather than as a comprehensive robustness study over broad random-seed coverage.

All simulation, dataset generation, and closed-loop evaluation code was implemented in Python 3.9. The experiments were conducted on a local laptop equipped with a 13th Gen Intel Core i7-13700H CPU and 16 GB RAM. Neural network training used an NVIDIA GeForce RTX 4060 Laptop GPU with 8 GB VRAM. Closed-loop latency profiling was performed under the same hardware setting.

The reported metrics are true collision-free rate, safe success rate, formation pass rate, whole-swarm step compute time, and step–overrun ratio. The true collision-free rate requires that no inter-UAV hard boundary violation and no obstacle hard boundary violation occur at any step. The safe success rate is stricter; in addition to true collision-free execution, it also requires no violation of the inflated obstacle boundary with a margin of 0.5 m. The formation pass rate counts the episodes for which the closed-loop formation error remains within the same prescribed benchmark thresholds throughout the episode; the mean formation error must not exceed 3.0 m, the maximum formation error must not exceed 10.0 m, and the fraction of over-limit steps must not exceed 0.20. The whole-swarm step compute time is defined as the sum of all local decision times within one simulation step. Finally, the step–overrun ratio is defined consistently with this whole-swarm latency, i.e., as the fraction of simulation steps for which the whole-swarm decision time exceeds the control step size

Δ t

= 0.5 s. Therefore, the reported latency is used only as a profiling metric for the current implementation, and does not alter the benchmark state update.

Table 3 summarizes the closed-loop control settings and gap-selection coefficients used in the benchmark, including the task ranges, vehicle dynamics, swarm-interaction parameters, safety radii, sensing range, and gap-selection weights. Table 4 lists the surrogate training, hardware, and risk-weighting settings, including the optimizer hyperparameters, computing platform, and per-sample training-weight coefficients.

5.2. Closed-Loop Simulation Results

Overall Quantitative Results.

Table 5 summarizes the overall closed-loop comparison. Under the synchronous closed-loop benchmark, the neural surrogate outperforms both modified MPIO and base MPIO on the primary safety metrics while reducing both whole-swarm step compute time and step–overrun ratio by a large margin. Relative to modified MPIO, the neural surrogate improves the true collision-free rate from 48.00% to 86.89%, raises the safe success rate from 38.67% to 74.22%, and increases the formation pass rate from 30.44% to 34.44%. At the same time, the mean whole-swarm step compute time decreases from 8494.70 ms to 0.92 ms, and the step–overrun ratio drops from 98.75% to 0.00%. The base-MPIO baseline is weaker still across the same overall comparison, with a 98.93% step–overrun ratio. Taken together, these results on the retained frozen benchmark used in this study show that replacing the online modified MPIO weight search with the neural surrogate mainly improves the reported safety and real-time metrics, especially collision avoidance and step-level computation time, while the formation pass gain is comparatively modest. Under the current implementation, this also makes the controller compatible with the

Δ t

= 0.5 s step budget.

Additional Seed Sensitivity Check.

To further examine seed sensitivity, we evaluated the final neural surrogate on three additional evaluation seeds not included in the retained benchmark: 1000, 2000, and 3000. We used the same three scene families and 50 episodes per scene, resulting in 450 additional closed-loop episodes. As shown in Table 6, the additional seeds produced a true collision-free rate of

87.78 \pm 0.38 %

, a safe success rate of

74.67 \pm 4.16 %

, and a formation pass rate of

34.00 \pm 1.15 %

, which are close to the retained benchmark results over seeds 2028, 2029, and 2030. These results suggest that the final surrogate behavior is not limited to the original three evaluation seeds. Nevertheless, this additional check is reported as a seed sensitivity analysis rather than as a comprehensive robustness validation over broad random seed coverage.

Cross-Scene Consistency.

The overall advantage is not driven by a single favorable scene family. As shown in Figure 5, Figure 6 and Figure 7, the neural surrogate generally maintains stronger safety performance across the benchmark scene families, particularly in true collision-free rate and safe success rate, while its step compute time remains separated from both optimization-based baselines by several orders of magnitude. The formation pass rate does not show the same level of improvement as the safety and latency metrics, but remains broadly comparable across the tested scene families. Thus, the surrogate not only appears better after averaging over all cases, its main advantages in safety and online computation also remain visible as the environment becomes more cluttered and as the swarm size increases from

N = 3

to

N = 9

.

Failure Mode Breakdown.

To clarify where the collision-free improvement comes from, Table 7 decomposes the hard-collision statistics into inter-UAV collisions and obstacle hard collisions. The main difference appears on the inter-UAV side. The modified MPIO shows a 44.00% neighbor collision rate but only an 8.00% obstacle hard-collision rate; by contrast, the neural surrogate reduces the neighbor collision rate to 5.56% while keeping a comparable obstacle hard-collision rate of 7.56%. Therefore, the increase in true collision-free rate from 48.00% to 86.89% is driven primarily by improved inter-UAV safety rather than by a large change in obstacle hard-collision performance. This table explains the main source of the true collision-free improvement. The remaining gap between true collision-free and safe success is governed by the stricter inflated boundary margin criterion defined in Section 5.1, which is not decomposed in Table 7.

One notable observation in Table 5 is that the neural surrogate outperforms its online teacher even under the synchronous benchmark. This does not mean that a learned student generally outperforms an exact expert, because the comparison here is not against an exact per-step optimum but against a finite-budget online approximate teacher. In the current implementation, the teacher must complete feasibility screening and Pareto search within the fixed budget

K_{max} = 20

, so difficult local states can lead to sparse feasible fronts and unstable final selections. Therefore, we interpret the student-over-teacher gap as an empirical consequence of replacing a budget-limited online solver with a smoother learned approximation on the tested closed-loop state distribution.

Training and Model Ablations.

Table 8 reports ablations on the training protocol and surrogate capacity under the same retained frozen benchmark of 450 episodes. Scene randomization is kept fixed across variants because removing it would alter the underlying teacher rollout and dataset generation distribution as opposed to isolating a local training component. The “No DAgger” variant uses only the retained base teacher trajectories, the “Uniform Loss” variant uses the mixed dataset but removes the per-sample risk weights, and the variants with different model sizes keep the same risk-weighted training pipeline while changing only the hidden layer widths.

Together with the preceding comparison against the online modified MPIO teacher, these ablations indicate that the benchmark-level gain mainly comes from replacing the repeated online weight search with a learned surrogate, rather than from any single training refinement alone. Removing DAgger changes the true collision-free rate by only

- 0.45

percentage points relative to the full model, while replacing the risk-weighted objective with a uniform loss changes it by

- 0.22

percentage points. These small changes do not support treating either component as the dominant source of the final gain. Risk weighting has a clearer effect on the stricter safe success metric, increasing it from

72.44 %

to

74.22 %

relative to uniform loss, even though the full model has a higher offline MAE. This mismatch further supports treating offline teacher-weight fitting error as an auxiliary diagnostic rather than as the main evidence of closed-loop controller quality. The strongest ablation effect comes from surrogate capacity, with both the smaller and larger MLPs reducing collision-free and safe success rates and increasing inter-UAV collision rates relative to the default 128–64 model. Among the tested architectures, the medium-sized surrogate provides the best closed-loop safety tradeoff.

5.3. AirSim Case Study

Figure 8 presents an AirSim case study of the neural surrogate controller. This case study examines whether the surrogate-based high-level controller remains executable after migration from the point-mass benchmark to a higher-fidelity asynchronous multirotor simulator. As such, it serves as qualitative evidence of the migration feasibility and real-time plausibility of the learned controller.

The neural surrogate controller remains executable in AirSim and maintains stable closed-loop behavior in representative cluttered scenes. Figure 9 further shows the complete top-view trajectories in the three representative AirSim scenes, providing a clearer view of the overall path shapes, obstacle circumvention patterns, and trajectory continuity of the deployed controller. Together, these qualitative results show that the surrogate is not only effective on the frozen point-mass benchmark but is also suitable for deployment-oriented execution in a simulator with asynchronous updates and multirotor command interfaces. This evidence is qualitative, whereas the main quantitative comparison among controllers is provided by the frozen synchronous benchmark in Section 5.2.

We do not use the AirSim case study as a quantitative baseline comparison against base MPIO or modified MPIO. Under the current integration, the online optimization baselines require substantially longer per-step decision times than the neural surrogate, and would need additional simplifications or reduced search settings to run stably through the same asynchronous multirotor interface. Therefore, the fair controller comparison is kept in the retained synchronous benchmark, where all methods use the same scene instances and metric definitions. The AirSim figures are reported only to show that the learned high-level surrogate controller can be executed through a higher-fidelity multirotor simulation interface.

6. Conclusions

This paper develops a neural surrogate replacement for the online weight selection component within an inherited modified MPIO-based swarm controller rather than a new low-level quadrotor controller or a full end-to-end swarm-control architecture. The learned surrogate preserves the surrounding high-level control structure; this allows it to avoid repeated online optimization during deployment, which can make the teacher difficult to deploy under strict runtime budgets. On the closed-loop benchmark, the neural surrogate improves the true collision-free rate and safe success rate relative to both modified MPIO and base MPIO, modestly improves the formation pass rate, and reduces the whole-swarm step compute time and step–overrun ratio by a large margin. A breakdown of the safety results shows that the gain in true collision-free rate is driven mainly by better inter-UAV safety. The ablation results further indicate that the benchmark-level improvement is not dominated by any single training refinement alone; instead, the clearest effect comes from replacing the finite-budget online modified MPIO search with a learned medium-sized surrogate, while DAgger and risk-weighted supervision mainly support state coverage and safety emphasis during training. A qualitative AirSim case study further suggests that the learned high-level controller remains executable after migration to a higher-fidelity asynchronous multirotor simulator. A remaining limitation is that the present manuscript does not provide a directly comparable real-time deployment baseline for modified MPIO under the same interface as the neural surrogate. Our preliminary AirSim attempts indicate that stable execution of modified MPIO would require substantially reduced search settings and simplified scenarios under the current integration. In addition, the learned controller has not yet been validated on real hardware. The present results should be interpreted as evidence on a retained closed-loop benchmark rather than as a comprehensive robustness validation across broad random-seed coverage. Extending the evaluation along these directions is an important next step for strengthening both the deployment conclusions and the mechanism-level interpretation of this work.

Author Contributions

Conceptualization, Z.W. and Z.N.; methodology, J.L., Z.W. and Z.N.; software, J.L.; validation, J.L. and Z.W.; formal analysis, J.L.; investigation, J.L.; resources, Z.N.; data curation, J.L.; writing—original draft preparation, J.L.; writing—review and editing, J.L., Z.W. and Z.N.; visualization, J.L.; supervision, Z.N.; project administration, Z.N. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding. The APC was funded by the authors.

Data Availability Statement

The datasets generated and analyzed during the current study are not publicly archived. They are available from the first author upon reasonable request. The source code supporting the implementation is available in the public repository cited in the manuscript.

Conflicts of Interest

The authors declare no conflicts of interest.

References

Zhou, X.; Yi, Z.; Liu, Y.; Huang, K.; Huang, H. Survey on Path and View Planning for UAVs. Virtual Real. Intell. Hardw. 2020, 2, 56–69. [Google Scholar] [CrossRef]
Chung, S.J.; Paranjape, A.A.; Dames, P.; Shen, S.; Kumar, V. A Survey on Aerial Swarm Robotics. IEEE Trans. Robot. 2018, 34, 837–855. [Google Scholar] [CrossRef]
Rahman, M.; Sarkar, N.I.; Lutui, R. A Survey on Multi-UAV Path Planning: Classification, Algorithms, Open Research Problems, and Future Directions. Drones 2025, 9, 263. [Google Scholar] [CrossRef]
Alqudsi, Y.; Makaraci, M. UAV Swarms: Research, Challenges, and Future Directions. J. Eng. Appl. Sci. 2025, 72, 12. [Google Scholar] [CrossRef]
Arshid, K.; Krayani, A.; Marcenaro, L.; Gomez, D.M.; Regazzoni, C. Toward Autonomous UAV Swarm Navigation: A Review of Trajectory Design Paradigms. Sensors 2025, 25, 5877. [Google Scholar] [CrossRef]
Reynolds, C.W. Flocks, Herds, and Schools: A Distributed Behavioral Model. Comput. Graph. 1987, 21, 25–34. [Google Scholar] [CrossRef]
Olfati-Saber, R. Flocking for Multi-Agent Dynamic Systems: Algorithms and Theory. IEEE Trans. Autom. Control 2006, 51, 401–420. [Google Scholar] [CrossRef]
Cucker, F.; Smale, S. Emergent Behavior in Flocks. IEEE Trans. Autom. Control 2007, 52, 852–862. [Google Scholar] [CrossRef]
Fiorini, P.; Shiller, Z. Motion Planning in Dynamic Environments Using Velocity Obstacles. Int. J. Robot. Res. 1998, 17, 760–772. [Google Scholar] [CrossRef]
van den Berg, J.; Guy, S.J.; Lin, M.; Manocha, D. Reciprocal n-Body Collision Avoidance. In Robotics Research; Pradalier, C., Siegwart, R., Hirzinger, G., Eds.; Springer: Berlin/Heidelberg, Germany, 2011; pp. 3–19. [Google Scholar] [CrossRef]
Wang, L.; Ames, A.D.; Egerstedt, M. Safety Barrier Certificates for Collisions-Free Multirobot Systems. IEEE Trans. Robot. 2017, 33, 661–674. [Google Scholar] [CrossRef]
Miettinen, K. Nonlinear Multiobjective Optimization; International Series in Operations Research & Management Science; Kluwer Academic Publishers: Boston, MA, USA, 1998; Volume 12. [Google Scholar] [CrossRef]
Deb, K.; Pratap, A.; Agarwal, S.; Meyarivan, T. A Fast and Elitist Multiobjective Genetic Algorithm: NSGA-II. IEEE Trans. Evol. Comput. 2002, 6, 182–197. [Google Scholar] [CrossRef]
Tang, J.; Duan, H.; Lao, S. Swarm intelligence algorithms for multiple unmanned aerial vehicles collaboration: A comprehensive review. Artif. Intell. Rev. 2023, 56, 4295–4327. [Google Scholar] [CrossRef]
Poudel, S.; Arafat, M.Y.; Moh, S. Bio-Inspired Optimization-Based Path Planning Algorithms in Unmanned Aerial Vehicles: A Survey. Sensors 2023, 23, 3051. [Google Scholar] [CrossRef]
Qiu, H.; Duan, H. A Multi-Objective Pigeon-Inspired Optimization Approach to UAV Distributed Flocking among Obstacles. Inf. Sci. 2020, 509, 515–529. [Google Scholar] [CrossRef]
Ruan, W.Y.; Duan, H.B. Multi-UAV Obstacle Avoidance Control via Multi-Objective Social Learning Pigeon-Inspired Optimization. Front. Inf. Technol. Electron. Eng. 2020, 21, 740–748. [Google Scholar] [CrossRef]
Shi, H.; Gao, W.; Jiang, X.; Su, C.; Li, P. Two-dimensional model-free Q-learning-based output feedback fault-tolerant control for batch processes. Comput. Chem. Eng. 2024, 182, 108583. [Google Scholar] [CrossRef]
Hu, C.; Bai, J.; Zou, H. Two-dimensional iterative learning control under infinite horizon optimization for batch processes with partial actuator failures. Can. J. Chem. Eng. 2026, early view. [Google Scholar] [CrossRef]
Alessio, A.; Bemporad, A. A Survey on Explicit Model Predictive Control. In Nonlinear Model Predictive Control: Towards New Challenging Applications; Magni, L., Raimondo, D.M., Allgöwer, F., Eds.; Springer: Berlin/Heidelberg, Germany, 2009; pp. 345–369. [Google Scholar] [CrossRef]
Hewing, L.; Wabersich, K.P.; Menner, M.; Zeilinger, M.N. Learning-Based Model Predictive Control: Toward Safe Learning in Control. Annu. Rev. Control Robot. Auton. Syst. 2020, 3, 269–296. [Google Scholar] [CrossRef]
Gonzalez, C.; Asadi, H.; Kooijman, L.; Lim, C.P. Neural Networks for Fast Optimisation in Model Predictive Control: A Review. arXiv 2023, arXiv:2309.02668. [Google Scholar]
Khodaverdian, A.; Gohil, D.; Christofides, P.D. Neural Network Implementation of Model Predictive Control with Stability Guarantees. Digit. Chem. Eng. 2025, 16, 100262. [Google Scholar] [CrossRef]
Puente-Castro, A.; Rivero, D.; Pazos, A.; Fernandez-Blanco, E. A Review of Artificial Intelligence Applied to Path Planning in UAV Swarms. Neural Comput. Appl. 2022, 34, 153–170. [Google Scholar] [CrossRef]
Argall, B.D.; Chernova, S.; Veloso, M.; Browning, B. A Survey of Robot Learning from Demonstration. Robot. Auton. Syst. 2009, 57, 469–483. [Google Scholar] [CrossRef]
Ross, S.; Gordon, G.; Bagnell, D. A Reduction of Imitation Learning and Structured Prediction to No-Regret Online Learning. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics; Proceedings of Machine Learning Research; PMLR: Fort Lauderdale, FL, USA, 2011; Volume 15, pp. 627–635. [Google Scholar]
Shah, S.; Dey, D.; Lovett, C.; Kapoor, A. AirSim: High-Fidelity Visual and Physical Simulation for Autonomous Vehicles. In Field and Service Robotics; Springer International Publishing: Cham, Switzerland, 2018; pp. 621–635. [Google Scholar] [CrossRef]
Sabatino, F. Quadrotor Control: Modeling, Nonlinear Control Design, and Simulation. Master’s Thesis, KTH Royal Institute of Technology, Automatic Control, Stockholm, Sweden, 2015. [Google Scholar]
Zhang, X.; Li, X.; Wang, K.; Lu, Y. A Survey of Modelling and Identification of Quadrotor Robot. Abstr. Appl. Anal. 2014, 2014, 320526. [Google Scholar] [CrossRef]
Vicsek, T.; Czirók, A.; Ben-Jacob, E.; Cohen, I.; Shochet, O. Novel Type of Phase Transition in a System of Self-Driven Particles. Phys. Rev. Lett. 1995, 75, 1226–1229. [Google Scholar] [CrossRef]
Sezer, V.; Gokasan, M. A Novel Obstacle Avoidance Algorithm: “Follow the Gap Method”. Robot. Auton. Syst. 2012, 60, 1123–1134. [Google Scholar] [CrossRef]
Tobin, J.; Fong, R.; Ray, A.; Schneider, J.; Zaremba, W.; Abbeel, P. Domain Randomization for Transferring Deep Neural Networks from Simulation to the Real World. In Proceedings of the 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS); IEEE: Piscataway, NJ, USA, 2017; pp. 23–30. [Google Scholar] [CrossRef]
Qiu, H.; Duan, H. Multi-objective pigeon-inspired optimization for brushless direct current motor parameter design. Sci. China Technol. Sci. 2015, 58, 1915–1923. [Google Scholar] [CrossRef]

Figure 1. Quadrotor model and planar task abstraction.

Figure 2. Local perception model for obstacles and gap-based avoidance.

Figure 3. Teacher–student learning framework for surrogate training.

Figure 4. Online control flow of the proposed surrogate-based controller.

Figure 5. Bucketed closed-loop metrics in the snake corridor scene.

Figure 6. Bucketed closed-loop metrics in the forest scene.

Figure 7. Bucketed closed-loop metrics in the forest and dynamic obstacles scene.

Figure 8. AirSim deployment results for the snake corridor, forest, and dynamic obstacle avoidance scenes. Colored trajectory lines denote different UAVs in the swarm, and the inset shows a zoomed-in view of the local trajectory segment.

Figure 9. Top-view trajectories of the deployed neural surrogate controller in the three representative AirSim scenes. Colored trajectory lines denote different UAVs in the swarm.

Table 1. Online modified MPIO teacher hyperparameters.

Symbol	Value	Notes
P	58	Population size
$K_{max}$	20	Max iterations per step
$K_{d}$	2	Removed per iteration
$A_{max}$	50	Elite archive size
R	0.3	Leader decay factor
$f_{t}$	3.0	Exploration-convergence factor
$p_{l}$	0.9	General-leader ratio
$e_{learn}$	0.01	Follower-learning perturbation
$s_{l}$	2	Follower-learning repetitions
$e_{rand}$	0.01	Fallback random-walk amplitude
$X_{L}$	${[0, 0]}^{⊤}$	Lower bound of the 2-D weight position
$X_{U}$	${[1, 1]}^{⊤}$	Upper bound of the 2-D weight position
$V_{L}$	${[- 0.2, - 0.2]}^{⊤}$	Lower bound of the 2-D pigeon velocity
$V_{U}$	${[0.2, 0.2]}^{⊤}$	Upper bound of the 2-D pigeon velocity
$Δ t$	0.5 s	Step size
$T_{epi}$	59.5 s	Episode horizon

Table 2. Definition of the 41-dimensional surrogate input vector.

Indices	Dim.	Content
1–6	6	Ego planar velocity $(v_{x, i}, v_{y, i})$ , speed $∥ v_{i} ∥$ , heading $ψ_{i}$ , and vertical-state information given by altitude rate $λ_{i}$ and altitude $h_{i}$ .
7–8	2	Desired cruise velocity $(v_{e, x}, v_{e, y})$
9–23	$3 \times 5$	Up to three nearest neighbors: relative position, relative velocity, and distance $(Δ x_{i j}, Δ y_{i j}, Δ v_{x, i j}, Δ v_{y, i j}, d_{i j})$ , with $d_{i j} = \sqrt{Δ x_{i j}^{2} + Δ y_{i j}^{2}}$
24–41	$3 \times 6$	Up to three nearest obstacles: relative center position, radius, planar velocity, and clearance $(Δ x_{i o}, Δ y_{i o}, r_{o}, v_{o, x}, v_{o, y}, c_{i o})$ , with $c_{i o} = d_{i o} - r_{o}$

Table 3. Closed-loop control settings and gap selection coefficients. Ranges denote episode-level perturbations.

Symbol	Value	Notes
$t_{v}$	$0.8$	Speed time constant
$t_{ψ}$	$0.6$	Heading time constant
$v_{x y}$	$[4.0, 12.0]$	Speed range (m/s)
$n_{max}$	$3.0$	Max lateral overload (g)
N	$[3, 9]$	UAVs per episode
$h_{e}$	$[40, 70]$	Target altitude range (m)
$v_{e}$	$(v_{e}, 0), v_{e} \in [6, 10]$	Cruise speed range (m/s)
$R_{des}$	$[6, 10]$	Target spacing range (m)
$R_{comm}$	40	Neighbor radius (m)
$R_{\lim}^{1}$	$2.0$	UAV safety radius (m)
$R_{sense}$	$40.0$	Obstacle sensing radius (m)
$K_{f}$	$0.25$	Spacing gain
$K_{a}$	$0.1$	Velocity-alignment gain
$K_{c}$	$1.0 \times 10^{5}$	Inter-UAV collision-repulsion gain
$R_{\lim}^{2}$	$10.0$	Inflated obstacle safety radius (m)
$Θ_{FOV}$	$π / 2$	Gap-selection half FOV
$k_{clear}$	$3.0$	Clearance weight in $S_{i} (α)$
$k_{width}$	$1.0$	Gap-width weight in $S_{i} (α)$
$k_{prog}$	$0.6$	Progress weight in $S_{i} (α)$
$k_{turn}$	$0.2$	Turn-cost weight in $S_{i} (α)$
$k_{edge}$	$0.3$	Boundary penalty in $S_{i} (α)$

Table 4. Surrogate training, hardware, and risk weighting settings.

Item	Value	Notes
Batch size	512	Mini-batch size
Learning rate	$10^{- 3}$	Initial learning rate
Weight decay	$10^{- 4}$	Weight-decay coefficient
Max epochs	60	Max training epochs
Patience	10	Early-stop patience
CPU	13th Gen Intel Core i7-13700H	14 cores, 20 threads, 2.40 GHz
GPU	NVIDIA GeForce RTX 4060 Laptop GPU	8 GB VRAM
$m_{nbr}^{ref}$	$0.5$	Reference neighbor margin in $ω_{b}$
$m_{obs}^{ref}$	$1.0$	Reference obstacle margin in $ω_{b}$
$w_{nbr}$	$5.0$	Neighbor-margin weight in $ω_{b}$
$w_{obs}$	$4.0$	Obstacle-margin weight in $ω_{b}$
$c_{nbr}$	$3.0$	Neighbor-collision multiplier in $ω_{b}$
$c_{obs}$	$2.5$	Obstacle-collision multiplier in $ω_{b}$

Table 5. Closed-loop comparison on the retained frozen benchmark.

Method	True Collision-Free (%) ↑	Safe Success (%) ↑	Formation Pass (%) ↑	Step Compute Time (ms) ↓	Overrun Ratio (%) ↓
Base MPIO [33]	$41.11 \pm 10.38$	$33.11 \pm 6.94$	$26.44 \pm 2.52$	26,772.83 ± 1396.42	$98.93$
Modified MPIO [16]	$48.00 \pm 7.33$	$38.67 \pm 4.37$	$30.44 \pm 2.69$	$8494.70 \pm 273.23$	$98.75$
Ours	$86.89 \pm 0.77$	$74.22 \pm 2.78$	$34.44 \pm 1.39$	$0.92 \pm 0.10$	$0.00$

Note: Upward and downward arrows indicate whether larger or smaller values are preferred, respectively; bold values mark the best result in each metric column.

Table 6. Additional seed sensitivity check for the final neural surrogate.

Evaluation Seed	Episodes	True CF (%)	Safe Success (%)	Formation Pass (%)	Neighbor Coll. (%)	Obstacle Hard Coll. (%)	Step Time (ms)	Overrun (%)
2028, 2029, 2030	450	$86.89 \pm 0.77$	$74.22 \pm 2.78$	$34.44 \pm 1.39$	$5.56 \pm 1.92$	$7.56 \pm 1.68$	$0.92 \pm 0.10$	$0.00 \pm 0.00$
1000, 2000, 3000	450	$87.78 \pm 0.38$	$74.67 \pm 4.16$	$34.00 \pm 1.15$	$5.56 \pm 1.39$	$6.67 \pm 1.33$	$0.76 \pm 0.03$	$0.00 \pm 0.00$
1000	150	$88.00$	$78.00$	$34.67$	$4.00$	$8.00$	$0.74$	$0.00$
2000	150	$88.00$	$70.00$	$32.67$	$6.67$	$5.33$	$0.80$	$0.00$
3000	150	$87.33$	$76.00$	$34.67$	$6.00$	$6.67$	$0.76$	$0.00$

Table 7. Breakdown of collision types on the retained frozen benchmark.

Method	Neighbor Collision (%) ↓	Obstacle Hard Collision (%) ↓	True Collision-Free (%) ↑
Base MPIO [33]	$48.22$	$10.67$	$41.11$
Modified MPIO [16]	$44.00$	$8.00$	$48.00$
Ours	$5.56$	$7.56$	$86.89$

Note: Upward and downward arrows indicate whether larger or smaller values are preferred, respectively; bold values mark the best result in each metric column.

Table 8. Training and model ablations on the frozen benchmark.

Variant	Setting	True CF (%)	Safe Success (%)	Formation Pass (%)	Neighbor Coll. (%)	Obstacle Hard Coll. (%)	Step Time (ms)	MAE
Full	DAgger + risk, 128–64	$86.89$	$74.22$	$34.44$	$5.56$	$7.56$	$0.923$	$0.1929$
No DAgger	Base data + risk, 128–64	$86.44$	$75.33$	$34.67$	$6.22$	$7.33$	$0.867$	$0.1996$
Uniform loss	DAgger + uniform, 128–64	$86.67$	$72.44$	$34.67$	$6.22$	$7.11$	$0.846$	$0.1869$
Small MLP	DAgger + risk, 64–32	$76.22$	$64.44$	$32.00$	$14.89$	$8.89$	$0.809$	$0.1872$
Large MLP	DAgger + risk, 256–128	$77.78$	$69.11$	$31.33$	$16.00$	$6.22$	$1.678$	$0.1881$

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Li, J.; Wen, Z.; Ning, Z. Neural Surrogate-Enhanced Metaheuristic Optimization for Distributed Quadrotor Swarm Control. Sensors 2026, 26, 3398. https://doi.org/10.3390/s26113398

AMA Style

Li J, Wen Z, Ning Z. Neural Surrogate-Enhanced Metaheuristic Optimization for Distributed Quadrotor Swarm Control. Sensors. 2026; 26(11):3398. https://doi.org/10.3390/s26113398

Chicago/Turabian Style

Li, Jinze, Zeling Wen, and Zhaoke Ning. 2026. "Neural Surrogate-Enhanced Metaheuristic Optimization for Distributed Quadrotor Swarm Control" Sensors 26, no. 11: 3398. https://doi.org/10.3390/s26113398

APA Style

Li, J., Wen, Z., & Ning, Z. (2026). Neural Surrogate-Enhanced Metaheuristic Optimization for Distributed Quadrotor Swarm Control. Sensors, 26(11), 3398. https://doi.org/10.3390/s26113398

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

Neural Surrogate-Enhanced Metaheuristic Optimization for Distributed Quadrotor Swarm Control

Abstract

1. Introduction

2. System Modeling and Problem Formulation

2.1. UAV Model

2.2. Swarm Flocking Objectives

2.3. Obstacles and Gap-Based Avoidance Model

3. Multi-Objective Optimization Formulation and Online Modified MPIO Solver

3.1. Objective Optimization and Feasible Pareto Selection

3.2. Online Modified MPIO Teacher

3.3. From Weight Selection to Closed-Loop State Update

4. Neural Surrogate Learning

4.1. Base Data Collection and DAgger Relabeling

4.2. Surrogate Network and Risk-Weighted Training Objective

5. Results and Discussion

5.1. Experimental Protocol

5.2. Closed-Loop Simulation Results

5.3. AirSim Case Study

6. Conclusions

Author Contributions

Funding

Data Availability Statement

Conflicts of Interest

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI