Article

A Feature-Aware Elite Imitation MARL for Multi-UAV Trajectory Optimization in Mountain Terrain Detection

1
The School of Information Science and Technology, The University of Tokyo, Tokyo 113-8654, Japan
2
Research Institute of Petroleum Exploration and Development, PetroChina, Beijing 100083, China
*
Author to whom correspondence should be addressed.
Drones 2025, 9(9), 645; https://doi.org/10.3390/drones9090645
Submission received: 5 August 2025 / Revised: 7 September 2025 / Accepted: 11 September 2025 / Published: 15 September 2025

Abstract

With the advancement of UAV trajectory planning and sensing technologies, unmanned aerial vehicles (UAVs) are now capable of performing high-performance ground detection and search tasks. Mountainous regions, due to their complex terrain, have long been a focal point in the field of remote sensing. Effective UAV search tasks in such areas must consider not only horizontal coverage but also variations in detection range and angle caused by changes in elevation. Conventional algorithms typically require complete prior knowledge of the environment for trajectory optimization and often depend on scenario-specific policy models, limiting their generalizability. To address these challenges, this paper proposes a Feature-Aware Elite Imitation Multi-Agent Reinforcement Learning (FA-EIMARL) algorithm that leverages partial terrain information to construct a feature extraction network. This approach enables batch training across diverse terrains without the need for full environmental maps. In addition, an elite imitation mechanism has been proposed for convergence acceleration and task performance enhancement. Simulation results demonstrate that the proposed method achieves superior reward performance, convergence rate, and computational efficiency while maintaining strong adaptability to varying terrains.

1. Introduction

With the development of infrared detection [1], action cameras [2], and beyond 5G communication technologies [3], unmanned aerial vehicles (UAVs) have been widely applied in the field of remote sensing [4]. Owing to their mobility, flexibility, and high temporal responsiveness, UAVs are capable of conducting aerial reconnaissance in mountainous or rugged terrain that ground personnel or vehicles cannot easily access [5]. Equipped with thermal infrared (IR) sensors, UAVs can perform surface mapping and accurately detect objects with temperatures higher than the surrounding environment [6]. Meanwhile, UAVs can be integrated with communication modules, enabling multiple UAVs to share data and synchronize task progress in real time [7]. In addition, range-finding modules such as Light Detection and Ranging (LiDAR) equip UAVs with the ability to measure terrain elevation [8]. These multifunctional capabilities make UAVs suitable for a wide range of applications, including remote sensing, forest fire monitoring, and nighttime search and rescue operations [9,10,11].
In this work, we envision a multi-UAV collaborative system in which UAVs perform search and detection tasks in mountainous regions. UAVs depart from different positions and aim to achieve the highest coverage rate in the shortest time possible. During the flight tasks, each UAV can adjust its speed and flight direction [12]. Regarding the complexity of terrains, UAVs should focus on flight altitude to balance the field of view (FOV) and detection accuracy [13]. Due to limited battery capacities, the UAVs should complete their task before their battery runs out. UAVs are equipped with communication modules to broadcast coverage and location information to avoid collision and duplicate search [14].
In mountainous search, monitoring, or rescue missions, which are typically urgent in nature, it is essential to develop high-precision trajectory optimization models for UAVs [15,16,17]. Moreover, given the limited computational and storage capabilities of UAVs, lightweight and generalizable task models capable of adapting to terrains with varying steepness and elevations are desired. Under these circumstances, integrating environmental features into the model construction and optimization process is particularly valuable. The problem of UAV trajectory design has been extensively studied. Researchers have proposed various trajectory optimization approaches, such as heuristic-based approaches [18], hybrid optimization [19], convex optimization [20], and deep reinforcement learning (DRL)-based methods [21]. However, there are still limitations in these existing studies. First, heuristic and hybrid optimization methods struggle to adapt to multi-dimensional velocity directions and flexible velocity limits. In addition, convex optimization is not suitable for trajectory problems with a multifaceted search space. On the other hand, the aforementioned methods only enable trajectory design when all environmental parameters are known, and cooperation among multiple UAVs remains insufficient. Although DRL can address complex uncertain optimization problems [22], the existing models are designed to serve specific tasks in homogeneous scenarios, lacking the generalization ability to transfer to new environments.
In real-world environments, the joint trajectory design for UAVs is highly complex [23], which leads to a series of challenges [24]. From the UAV control perspective, UAVs must avoid collision with other UAVs [25] and maintain their cruising altitude for the balance of FOV and detection precision [26]. In addition, limited battery capacity and UAV dynamics need to be considered when making the joint decisions [27]. UAVs can establish inter-UAV wireless communication links to facilitate the exchange of positioning and environmental coverage records [28]. To address the limitations of existing UAV trajectory optimization methods in complex and dynamic environments, this work proposes a generalized Multi-Agent Reinforcement Learning (MARL) framework based on actor–critic networks. The term generalized refers to the capability of the proposed system to accommodate diverse trajectory constraints and to enable trained models to be deployed across multiple mountainous regions with varying altitudes and terrain complexities for detection tasks.
To jointly optimize key performance indicators—namely, coverage rate, energy consumption, and detection accuracy—we enhance standard deterministic MARL algorithms by introducing an elite imitation (EI) strategy. In this mechanism, the multi-agent system periodically conducts performance evaluations and selects the best-performing agent (elite). A portion of this elite agent’s parameters are then softly integrated into the policies of other agents. To prevent overfitting and encourage continuous exploration, the imitation frequency decreases gradually, and the parameter-sharing ratio is progressively reduced. Given the cooperative nature of the UAVs and the equivalence of their tasks, this strategy accelerates early-stage policy convergence and reduces the negative impact of poor initialization.
Building upon the EI-enhanced MARL framework, we further propose a Feature-Aware Elite Imitation Multi-Agent Reinforcement Learning (FA-EIMARL) algorithm that incorporates environmental information to improve adaptability in varying task scenarios. In FA-EIMARL, UAVs perform random sampling of ground elevation data and store it in a structured point array format. These arrays, labeled with scenario-specific attributes, are input into a classification pipeline composed of a Point Array Network (PAN) for multi-dimensional feature extraction, followed by a classification module. During training, multiple terrain scenarios are sampled at each time step. Extracted features are then fused with the UAVs’ state representations to support policy updates. This design enables FA-EIMARL to generalize effectively across terrain profiles with different altitudes and structural complexities.
In addition, evaluating the quality of IR-based mapping by calculating the geometric relationship between UAVs and the terrain they observe often requires extensive computation, especially when assessing the accuracy of terrain data acquisition in mountainous regions. To improve computational efficiency, we design a lightweight detection quality prediction network (DQPN) that estimates the reliability of mapping results based on partial ground elevation information and UAV altitude. Unreliable regions are subsequently flagged for re-sensing, thereby reducing unnecessary computation while maintaining overall mapping accuracy.
Our key contributions are as follows:
  • We formulate the joint trajectory optimization system that simultaneously accounts for detection accuracy, UAV energy efficiency, and collision avoidance.
  • We propose a generalized MARL framework incorporating the EI mechanism and the DQPN structure, which together enhance training stability, accelerate convergence, and reduce inference latency in detection tasks.
  • We propose a novel FA-EIMARL algorithm that can rapidly adapt to dynamic task spaces based on the extracted environmental features. Experimental results demonstrate that this algorithm outperforms baselines.
The remainder of this paper is organized as follows. Section 2 reviews the related works. Section 3 introduces the exemplary scenario and methodology. Section 4 presents the system model. Section 5 proposes two novel algorithms. Section 6 presents the experiment. Finally, Section 7 concludes this work.

2. Related Works

Trajectory optimization plays a crucial role in UAV research. Most conventional studies focus on addressing the problem in a fixed and predefined task space. For example, the authors of [29] optimized trajectories for multi-UAV systems to cover a fixed 3D space cooperatively and avoid obstacles. In [30], the main objective of path planning is to visit some predetermined checkpoints in the operational area for the maximum target coverage rate. In [31], a multi-UAV coverage path planning approach is introduced for collaborative 3D space coverage in predetermined large-scale search tasks. In [32], UAV trajectory planning is employed to generate remote sensing image acquisition paths within predefined urban task areas. In [33], the selected task scenarios can be represented by the LiDAR data collected by UAVs. In [34], preset terrain simulation data are used for UAV path design. Only a limited number of studies consider UAVs performing tasks in dynamic task spaces. For instance, [35] focuses on UAV trajectory design to cope with dynamic task scenarios. In [36], a novel approach was proposed to optimize UAV navigation in dynamic environments. However, these algorithms still face significant challenges when applied to real-world environments. On the one hand, their high computational complexity makes them unsuitable for time-sensitive applications. On the other hand, they typically rely more on enhancing algorithmic fault tolerance rather than task-specific adaptability. Building upon the aforementioned works, this study aims to extend the field by developing generalized solutions for diverse UAV task spaces.
Current research on UAV trajectory optimization primarily encompasses heuristic algorithms [37], hybrid optimization [38], convex optimization [39], and deep reinforcement learning (DRL) [40]. In [41], convex optimization was applied to address UAV placement and resource allocation. In [42], a hybrid Particle Swarm Optimization (PSO) approach was proposed to enhance UAV path planning efficiency. In [43], an Ant Colony Optimization (ACO) algorithm combined with A-star search was introduced to optimize UAV flight trajectories.
Compared to traditional heuristic- and optimization-based algorithms, DRL provides distinct advantages, including adaptability to dynamic environments, scalability to high-dimensional state and action spaces, support for online learning, and reduced computational overhead in multi-UAV decision-making. Several DRL variants have been successfully applied in UAV trajectory planning tasks. In [44], Twin Delayed Deep Deterministic Policy Gradient (TD3) was utilized to optimize UAV paths, demonstrating strong convergence properties and robustness in task performance. In [45], Deep Deterministic Policy Gradient (DDPG)—a continuous control algorithm—was shown to deliver lower computational complexity and faster decision-making compared to other DRL methods. In [46], Soft Actor–Critic (SAC) was adopted to maximize the communication sum-rate, with results indicating superior performance over Deep Q-Network (DQN), which is limited to discrete action spaces. Additionally, [47] applied Proximal Policy Optimization (PPO) for UAV obstacle avoidance, proving its feasibility and generalizability across various environments.
Building upon conventional DRL frameworks, several enhanced DRL variants have been developed to improve either service performance or task generalization. In [48], a Graph Neural Network (GNN) was integrated with DRL for multi-UAV trajectory planning in smart agriculture. By modeling the relationships between UAVs and IoT devices as graphs, the algorithm enables UAVs to extract useful spatial and temporal features in dynamic environments. In [49], a GNN-based MARL was proposed for UAV trajectory design. However, GNN-based RL has several limitations. First, GNN-based RL requires continuously acquiring partial information from the environment, which typically relies on IoT or other specialized devices for sensing and transmission. In the context of the UAV mountain rescue and exploration tasks studied in this work, UAVs cannot easily obtain such information, making GNN-based RL unsuitable. Second, the GNN architecture relies on convolutional layers, which significantly increase the computational and memory burden. Maintaining high-performance operations under these conditions is impractical for lightweight UAVs tasked with rapid search. Finally, GNN-based RL uses experience to extract features from the environment to guide training and prediction. In scenarios with multiple predefined conditions and a known exploration area, this approach offers no clear advantage. In [50], Soft Actor–Critic (SAC) and Twin Delayed DDPG (TD3) were combined with transfer learning to support UAV-assisted integrated sensing and communication in emergency rescue scenarios. The proposed method adapts effectively to situations where certain base stations (BSs) are out of service. Nonetheless, the learned policies are difficult to generalize to unrelated tasks such as payload delivery. In [51], a meta-RL framework was employed to optimize UAV service quality in wireless sensor networks. In [52], meta-RL was also applied for UAV energy distribution and trajectory design. Although meta-learning provides adaptability across multiple tasks, it presents several limitations in our scenario. Meta-learning enables UAVs to perform diverse tasks by fine-tuning network parameters. However, it struggles to balance the inherent trade-off between task-specific performance and generalization, resulting in potentially large variations in effectiveness across tasks. Additionally, achieving multi-task capability requires learning distinct strategies for each task, which entails an extra network for parameter adjustment. This increases computational complexity and may reduce overall task accuracy. In contrast, our scenario demands UAVs to adopt a single strategy capable of handling multiple predefined environments with a lightweight model while maintaining high performance. In terms of computational speed, model size, and task requirements, meta-learning deviates substantially from these needs.
In addition, to achieve more stable training performance, imitation learning has been applied in the RL field. In [53], a multi-virtual-agent imitation learning method was proposed to learn the dispatch policy under different power supply interruption periods. In [54], an imitation learning agent was proposed for a variety of learning approaches. In [55], a pre-training imitation model was applied to improve the learning performance of autonomous vehicles. However, the aforementioned methods merely apply the concept of agent imitation, using already established expert knowledge to achieve rapid learning in a specific environment and obtain better learning outcomes. These methods lack generality and are highly dependent on the constructed scenarios and the prior knowledge available.
Overall, existing DRL and imitation methods demonstrate satisfactory performance in homogeneous environments, and some have been extended to handle multiple task types. However, real-world UAV mission scenarios are inherently complex and demand adaptability to diverse and dynamic environmental conditions. To address these challenges, this paper proposes a novel imitation learning-based DRL framework named FA-EIMARL, which integrates a feature-aware model with a DQPN architecture. Compared to current DRL methods, the FA module extracts features from the scenarios, enabling the trained model to adapt to complex known task spaces and improve the total task reward. Compared to existing imitation learning modules, the EI module can adaptively select elite individuals and adjust the degree of learning according to the training process, thereby improving the model's learning quality and convergence speed in complex environments. Moreover, this method does not rely on specific scenarios or prior knowledge. The DQPN significantly reduces the UAVs' inference time while maintaining computational accuracy, thereby ensuring high task efficiency. The proposed approach offers improved generalization, enhanced reward performance, and faster convergence, enabling effective deployment in varied and uncertain UAV task environments.

3. System Overview

This section demonstrates an exemplary scenario and introduces the employed methodology.

3.1. Exemplary Scenario

Figure 1 demonstrates an exemplary scenario located in a mountainous region, where a collaborative multi-UAV system is applied to perform the detection task. The system consists of $n_{\mathrm{UAV}}$ UAVs, and the set of UAVs is denoted as $\mathbf{UAV} = \{ \mathrm{UAV}_1, \mathrm{UAV}_2, \ldots, \mathrm{UAV}_{n_{\mathrm{UAV}}} \}$. The position of UAV $i$ at time $t$ can be represented as $u(i,t) = \{ u_x(i,t), u_y(i,t), u_z(i,t) \}$, where $u_x(i,t)$, $u_y(i,t)$, and $u_z(i,t)$ represent the x, y, and z axis coordinates in the three-dimensional (3D) spatial coordinate system, respectively.
The choice of UAV flight altitude has a significant impact on both the sensing area and the quality of detection. Especially in complex terrain environments, optimizing altitude control becomes vital for achieving effective and accurate sensing performance. If the UAVs perform tasks at a low altitude, the limited FOV restricts the coverage area, leading to reduced detection efficiency and quality. Conversely, elevated UAV altitude increases the distance to the target area, which may degrade detection accuracy. Therefore, to balance detection precision and coverage, each UAV $i$ should maintain its cruising altitude $u_z(i,t)$ within a certain range $[u_{z\min}, u_{z\max}]$, where $t$ represents the current task time. During the flying tasks, each UAV $i$ departs from the starting point $u_{\mathrm{start}}(i)$ and explores mountainous task areas. We denote the start point as $u_{\mathrm{start}}(i) = u(i,0)$. To facilitate the statistical analysis and recording of the exploration status, we partition the mission area into detection plots $\mathbf{DP}$ of length and width $d_{DP}$, and establish a visitation map $M_V(t)$. The set of plots can be represented as $\mathbf{DP} = \{ DP_{0,0}, DP_{0,1}, \ldots, DP_{n_x, n_y} \}$, where $n_x = (u_{x\max} - u_{x\min}) / d_{DP}$ and $n_y = (u_{y\max} - u_{y\min}) / d_{DP}$. The UAV performs reconnaissance over the mountainous terrain at an appropriate altitude and inspects the ground plots $DP(u_x, u_y)$ corresponding to its position periodically every $t_d$ units of time. The UAV estimates the mapping results based on its own altitude, the ground elevation, and the terrain, and updates $M_V(t)$ according to the detection quality. Once $M_V(t)$ exceeds $M_{\mathrm{th}}$, we consider the detection to be sufficiently comprehensive, and the task can be terminated early. Here, $M_{\mathrm{th}}$ represents the detection threshold.
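To make this plot bookkeeping concrete, the following minimal Python sketch (not the authors' code; all function and parameter names, such as build_visitation_map, are placeholders) partitions the mission area into $d_{DP}$-sized plots, records the best detection quality observed for each plot, and tests early termination against $M_{\mathrm{th}}$.

```python
import numpy as np

# Minimal sketch of the visitation-map bookkeeping described above; placeholder names.
def build_visitation_map(u_x_min, u_x_max, u_y_min, u_y_max, d_DP):
    n_x = int(np.ceil((u_x_max - u_x_min) / d_DP))
    n_y = int(np.ceil((u_y_max - u_y_min) / d_DP))
    return np.zeros((n_x, n_y))              # all plots start unvisited (quality 0)

def update_visitation_map(M_V, u_x, u_y, u_x_min, u_y_min, d_DP, quality):
    # Map the UAV's horizontal position to the plot it currently inspects and
    # keep the best detection quality achieved for that plot so far.
    i = int((u_x - u_x_min) // d_DP)
    j = int((u_y - u_y_min) // d_DP)
    M_V[i, j] = max(M_V[i, j], quality)
    return M_V

def task_finished(M_V, M_th, Q_th=0.5):
    # Early-termination test: fraction of plots sensed above Q_th exceeds M_th.
    return (M_V >= Q_th).mean() >= M_th

M_V = build_visitation_map(0.0, 3000.0, 0.0, 3000.0, d_DP=100.0)
M_V = update_visitation_map(M_V, 450.0, 820.0, 0.0, 0.0, 100.0, quality=0.9)
print(task_finished(M_V, M_th=0.95))
```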
Also, we set a maximum task time $T_{\mathrm{end}}$ to ensure that the multi-UAV collaborative system can complete the detection task within $T_{\mathrm{end}}$. In addition, UAVs should fly within a 3D task space, which can be represented as $\mathbb{R}^3: \{ x \in [u_{x\min}, u_{x\max}],\ y \in [u_{y\min}, u_{y\max}],\ z - EL(x,y) \in [u_{z\min}, u_{z\max}] \}$, where $EL(x,y)$ denotes the ground elevation at coordinates $(x,y)$. The flying speed of UAV $i$ can be represented as $v(i,t) = [v_x(i,t), v_y(i,t), v_z(i,t)]$, where $v_x(i,t)$, $v_y(i,t)$, and $v_z(i,t)$ represent the velocities in the x, y, and z directions; the values of $v_x(i,t)$ and $v_y(i,t)$ are constrained within the range $[0, v_{\mathrm{horizon}}]$, while the value of $v_z(i,t)$ is constrained within the range $[0, v_{\mathrm{vertical}}]$. In addition, with the limited onboard battery, each UAV $i$ should complete its task before the battery runs out. It is worth noting that, to ensure flight safety and task efficiency, each UAV predicts the movements of others based on shared positional information. When UAVs are close, they actively avoid overlapping task areas to prevent collisions and reduce redundant detections, thereby improving overall system performance.
In this work, we utilize 5G New Radio (NR) Vehicle-to-Everything (V2X) as the communication module. Specifically, the UAVs connect with each other and exchange information via 5G NR Vehicle-to-Vehicle (V2V) communication.
Above all, the joint path planning of the multi-UAV system in mountainous search scenarios faces several challenges. First, the complex and uncertain terrain requires dynamic trajectory adjustments to maximize area coverage. Second, UAVs must collaboratively plan their paths to maximize the coverage rate on the visitation map M V ( t ) within the limited mission time T end , while ensuring collision avoidance among UAVs. Finally, flight altitude significantly affects the FOV and detection quality: lower altitudes limit coverage due to a narrow FOV, whereas higher altitudes may reduce detection accuracy. Therefore, the path planning strategy must balance these factors by maintaining each UAV’s altitude within the allowable range [ u z min , u z max ] to achieve maximal coverage in minimal time.

3.2. Methodology

This work adopts a model-based, data-driven hybrid methodology. Specifically, it integrates two approaches: (i) model-based approaches, which involve the energy, communication, and detection models of the UAV system for environment construction and constraint design, and (ii) data-driven approaches, which utilize data analysis and machine learning to extract environmental features and support decision-making. By combining these approaches, the proposed method aims to enhance model robustness and accuracy through the joint exploitation of theoretical knowledge and empirical data.
As Figure 2 shows, the data and system model units are separated in this framework. In the data unit, diverse terrain data is utilized to generate an elevation map dataset for training the feature extraction model and for providing environment samples for UAV detection tasks. In the system model unit, the energy consumption, communication, and UAV detection models are incorporated and connected with the environment for policy updating.
With the aforementioned data and system models as inputs, the core component of the proposed framework is a feature-aware MARL module, with which UAVs interact with the environment and learn trajectory planning policies. The proposed FA-EIMARL algorithm is detailed in Section 5. In a multi-agent Markov Decision Process (MDP), the key elements include agents, observations, actions, and rewards. To improve performance and generalization, we incorporate a feature extraction module into the MARL framework. In addition, an elite imitation strategy is introduced to enhance convergence speed and training stability. To further accelerate the computation of detection quality during environment interactions, we employ a DQPN architecture as a surrogate model for real-time prediction, avoiding costly explicit evaluations.
The following subsections provide a detailed explanation of these components.
As described earlier, the environment in this work is a collaborative multi-UAV system, where each UAV functions as an autonomous agent. An observation represents the state information perceived by a UAV from the environment, including its own position, destination coordinates, task progress indicators, and the current status of the visitation map. Observations define the agent's state space. The action space consists of velocity components along the x, y, and z axes, allowing UAVs to adjust their trajectories in 3D space. The reward function provides feedback from the environment to guide learning. It captures multiple factors such as flight altitude, energy consumption, detection quality, task progress, and penalties for boundary violations and collisions. The feature module processes data through a feature extraction network to obtain scenario-relevant representations that enhance policy learning and generalization. The UAVs learn policies that map observations and extracted features to actions in order to maximize cumulative long-term rewards. During each iteration of the FA-EIMARL algorithm, agents select actions based on their current observations, extracted features, and learned policies. They then interact with the environment, receive corresponding rewards, and observe the resulting states, enabling continuous policy updates.

4. System Model

This section presents the energy consumption model, the communication model, and the formal problem formulation.

4.1. Energy Consumption

We assume that the total energy consumption includes the computational, communication, and flight energy consumption. The cumulative energy consumption E ( i , t ) from task start to time t can be represented as
E ( i , t ) = E cmp ( i , t ) + E comm ( i , t ) + E fly ( i , t ) ,
where E cmp ( i , t ) , E comm ( i , t ) , and E fly ( i , t ) represent the cumulative computation, communication, and flight energy consumption of UAV i , respectively.
The UAV’s cumulative computational energy consumption E cmp ( i , t ) can be represented as
$E_{\mathrm{cmp}}(i,t) = \int_0^t P_{\mathrm{cmp}} \, dt,$
$P_{\mathrm{cmp}} = P_{\mathrm{static}} + P_{\mathrm{dynamic}},$
$P_{\mathrm{dynamic}} \approx C \cdot V^2 \cdot f \cdot \alpha,$
where $P_{\mathrm{cmp}}$ represents the UAV computational power, $P_{\mathrm{static}}$ represents the static power of the computing module, $P_{\mathrm{dynamic}}$ represents the dynamic power of the computing module, $C$ represents the load capacitance, $V$ represents the voltage, $f$ represents the clock frequency, and $\alpha$ represents the activity factor.
Similarly, the UAV's cumulative communication energy consumption $E_{\mathrm{comm}}(i,t)$ can be represented as
$E_{\mathrm{comm}}(i,t) = \int_0^t P_{\mathrm{comm}} \, dt,$
$P_{\mathrm{comm}} = P_{\mathrm{TX}} + P_{\mathrm{RX}} = P_{\mathrm{TX}} \left( 1 + \frac{G_t G_r}{L} \right),$
where the constant $P_{\mathrm{comm}}$ represents the UAV communication power, $P_{\mathrm{TX}}$ represents the UAV transmission power, $P_{\mathrm{RX}}$ represents the UAV receiving power, $G_t$ represents the transmission antenna gain, $G_r$ represents the receiving antenna gain, and $L$ represents the system average path loss.
The UAV’s cumulative flight energy consumption E fly ( i , t ) can be represented as
$E_{\mathrm{fly}}(i,t) = \int_0^t P_{\mathrm{fly}}(i,t) \, dt,$
where P fly ( i , t ) denotes the flight power of UAV i at time t, which depends on the UAV’s structural parameters and flying speed. The detailed derivation is grounded in References [56,57], and it can be expressed as
$P_{\mathrm{fly}}(i,t) = \begin{cases} \dfrac{(m_{\mathrm{UAV}}\, g)^{3/2}}{\sqrt{2 \rho_{\mathrm{air}} A_{\mathrm{UAV}}}\, \eta}, & |v(i,t)| < v_{\mathrm{th}}, \\[1.5ex] c_1 |v(i,t)|^2 + \dfrac{c_2}{g^2} \dfrac{|a(i,t)|^2}{|v(i,t)|} + \dfrac{2 c_2}{g} \dfrac{a_z(i,t)}{|v(i,t)|} + c_2 \dfrac{v_x^2(i,t) + v_y^2(i,t)}{|v(i,t)|^3} + m_{\mathrm{UAV}}\, g\, |v(i,t)| + \dfrac{1}{2} m_{\mathrm{UAV}} |a(i,t)|^2, & |v(i,t)| \ge v_{\mathrm{th}}, \end{cases}$
where m UAV denotes the mass of the UAVs, g is the gravitational acceleration, ρ air denotes the air density, A UAV is the frontal area of the UAV, η is the mechanical efficiency, c 1 and c 2 represent the drag and lift coefficients, respectively, and v ( i , t ) denotes the velocity vector of UAV i at time t. Components v x ( i , t ) and v y ( i , t ) represent the UAV’s velocity in the x and y directions, respectively. Component a ( i , t ) represents the acceleration of UAV i , and a z ( i , t ) represents the z axis direction of the acceleration of UAV i . A speed threshold v th is defined to distinguish between flying and hovering states, which differ significantly in terms of energy consumption. Parameters c 1 , c 2 , and A UAV are given by
$c_1 = \frac{1}{2} \rho_{\mathrm{air}} \cdot A_{\mathrm{UAV}} \cdot C_d,$
$c_2 = \frac{m_{\mathrm{UAV}}^2}{\eta \cdot \rho_{\mathrm{air}} \cdot n_{\mathrm{prp}} \cdot \pi R_{\mathrm{prp}}^2},$
$A_{\mathrm{UAV}} = A_{\mathrm{surf}} + n_{\mathrm{prp}} \pi R_{\mathrm{prp}}^2,$
where C d denotes the air viscosity coefficient, n prp denotes the number of UAV propellers, R prp denotes the radius of the UAV propellers, and A surf denotes the UAV surface area.
The remaining battery B ( i , t ) can be presented as
$B(i,t) = B - E(i,t),$
where B represents the battery capacity.
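As a rough illustration of the energy bookkeeping in Equations (1)–(12), the sketch below accumulates computation, communication, and flight energy per discrete time step. The hover branch follows the first case of Equation (8); the moving-flight branch is deliberately simplified (hover power plus a drag term) rather than the full second case, and every numeric value is an assumption for demonstration only.

```python
import numpy as np

# Hedged sketch of the per-step energy bookkeeping; simplified flight model.
g, rho_air = 9.81, 1.225                   # gravity [m/s^2], air density [kg/m^3]
m_uav, A_uav, eta, C_d = 4.0, 0.4, 0.7, 0.3
c1 = 0.5 * rho_air * A_uav * C_d           # drag coefficient term (Eq. (9))
P_cmp, P_comm = 8.0, 2.0                   # computation / communication power [W]
v_th, t_step = 0.5, 1.0                    # hover/flight speed threshold, step [s]

def hover_power():
    # (m_UAV * g)^{3/2} / (eta * sqrt(2 * rho_air * A_UAV))
    return (m_uav * g) ** 1.5 / (eta * np.sqrt(2.0 * rho_air * A_uav))

def flight_power(v):
    speed = np.linalg.norm(v)
    if speed < v_th:
        return hover_power()
    return hover_power() + c1 * speed ** 2  # simplified moving-flight stand-in

def step_energy(v):
    # Discrete increment of E(i,t) = E_cmp + E_comm + E_fly over one time step
    return (P_cmp + P_comm + flight_power(np.asarray(v, dtype=float))) * t_step

B_capacity = 5.0e5                                        # battery capacity B [J]
remaining = B_capacity - step_energy([3.0, 1.0, 0.2])     # B(i,t) = B - E(i,t)
print(remaining)
```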

4.2. Communication

We assume that the communication among UAVs follows the 5G NR V2X standard as the underlying communication module.
The communication protocol consists of the following procedures:
  • Each UAV periodically broadcasts its current geographic position and the updated visitation map to nearby UAVs according to a predefined broadcast interval t d in order to improve mission efficiency and avoid collisions.
  • UAV i continuously transmits the detection data to the remote mountain base stations. Upon successful reception, the base stations return an acknowledgment message to UAV i . If no acknowledgment is received within a certain timeout period t f , UAV i initiates retransmission.
  • If retransmission attempts fail n f times and no acknowledgment is received, UAV i sends a request to the nearest UAV j . Once UAV j confirms the request, a communication link is established between the two UAVs. The data is then relayed to UAV j , which attempts to forward the information to the base station.
As for the channel model, we first consider the path loss PL. Due to the large coverage area of the sensing region, the communication environment is highly complex, involving both Line-of-Sight (LoS) and Non-Line-of-Sight (NLoS) communication modes. Following the 3GPP standard [58], PL can be represented as
$PL = \begin{cases} 28.0 + 22 \log_{10}(d_{\mathrm{TS}}) + 20 \log_{10}(f_c), & \mathrm{LoS}, \\ -17.5 + 20 \log_{10}\!\left(\dfrac{40 \pi f_c}{3}\right) + \left[ 46 - 7 \log_{10}\!\big(EL(u_x, u_y)\big) \right] \log_{10}(d_{\mathrm{TS}}), & \mathrm{NLoS}, \end{cases}$
where f c represents the carrier frequency of the transmitters, and d TS represents the distances between transmitters and receivers.
The received signal strength at the UAV or base station can be expressed as
$S = P_{\mathrm{comm}} - PL = P_{\mathrm{TX}} \left( 1 + \frac{G_t G_r}{L} \right) - PL.$
To evaluate the quality of the received signal, we use the signal-to-noise ratio (SNR) as the metric, which is defined as
$\mathrm{SNR} = \frac{S}{P_{\mathrm{thermal}}},$
where P thermal represents the thermal noise, which can be denoted as
P thermal = k B × T K × B w ,
where k B represents Boltzmann constant, T K represents temperature in Kelvin, and B w represents bandwidth.
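The following sketch illustrates a link-budget-style evaluation of the channel model: it implements the LoS/NLoS path-loss forms quoted above and derives an SNR from thermal noise as in Equations (15) and (16), but works entirely in dB as a simplification. Transmit power, antenna gains, carrier frequency, and bandwidth values are assumptions.

```python
import numpy as np

# Hedged link-budget sketch; all numeric parameters are illustrative assumptions.
k_B = 1.380649e-23                    # Boltzmann constant [J/K]

def path_loss_db(d_ts, f_c_ghz, elevation_m, los=True):
    if los:
        return 28.0 + 22.0 * np.log10(d_ts) + 20.0 * np.log10(f_c_ghz)
    return (-17.5 + 20.0 * np.log10(40.0 * np.pi * f_c_ghz / 3.0)
            + (46.0 - 7.0 * np.log10(elevation_m)) * np.log10(d_ts))

def snr_db(p_tx_dbm, g_t_db, g_r_db, d_ts, f_c_ghz, elevation_m,
           bandwidth_hz, temperature_k=290.0, los=True):
    pl = path_loss_db(d_ts, f_c_ghz, elevation_m, los)
    received_dbm = p_tx_dbm + g_t_db + g_r_db - pl        # received signal strength
    noise_dbm = 10.0 * np.log10(k_B * temperature_k * bandwidth_hz * 1e3)
    return received_dbm - noise_dbm                        # SNR = S / P_thermal (in dB)

# Example: 23 dBm transmitter, 2 GHz carrier, 500 m link over 1200 m terrain
print(snr_db(23, 3, 3, 500.0, 2.0, 1200.0, 20e6, los=False))
```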

4.3. UAV Detection

To facilitate computation and assessment of UAV sensing quality, the detection plot is divided into $n_{\mathrm{grid}} \times n_{\mathrm{grid}}$ grids, as shown in Figure 3, which can be represented as $\mathbf{G} = \{ G_{0,0}, G_{0,1}, \ldots, G_{n_{\mathrm{grid}}, n_{\mathrm{grid}}} \}$.
The sensing quality of each grid is determined based on the UAV's distance and viewing angle relative to the grid. The overall sensing quality of the plot is then obtained by aggregating the evaluation of all grids. Considering the mountainous terrain, the center height of each grid cell varies. Therefore, accurate distance estimation between the UAV and each grid cell requires LiDAR-based elevation sensing. Assume that the distance between grid $G_{l,m}$ and the task UAV can be denoted as $d_G(l,m)$. We define a distance threshold $d_{\mathrm{th}}$. If the actual average distance $d_G(l,m)$ exceeds $d_{\mathrm{th}}$ (i.e., $d_G(l,m) > d_{\mathrm{th}}$), the sensing quality is considered poor; otherwise, it is deemed acceptable.
Meanwhile, as illustrated in Figure 4, the UAV sensing task must also consider the FOV. Specifically, we define a sensing angle threshold θ f . If the angle between the UAV’s Line-of-Sight and the normal direction θ g ( l , m , t ) of the grid G l , m at time t exceeds θ f , the sensing quality is considered poor. The detection quality of each grid G l , m can be denoted as Q l , m , t . In this context, the detection quality Q overall ( i , j , t ) of the whole detection plot DP i , j at time t can be represented as
$Q_{\mathrm{overall}}(i,j,t) = \frac{\sum_{l=0}^{n_{\mathrm{grid}}} \sum_{m=0}^{n_{\mathrm{grid}}} Q(l,m,t)}{n_{\mathrm{grid}}^2}.$
We define a sensing quality threshold Q th . If the quality metric Q overall ( i , j ) does not exceed this threshold (i.e., Q overall ( i , j , t ) < Q th ), the sensing is considered unsuccessful.
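A minimal sketch of the per-plot quality aggregation in Equation (17) is given below: each grid cell receives a 0/1 score from the distance and viewing-angle checks, and the plot score is the mean over all cells. The scoring rule, the flat-cell normal approximation, and the threshold values are assumptions.

```python
import numpy as np

# Sketch of the grid-level quality checks and the plot-level average (Eq. (17)).
def grid_quality(uav_pos, cell_centers, d_th=120.0, theta_f_deg=60.0):
    """cell_centers: (n_grid, n_grid, 3) array of grid-cell center coordinates."""
    diff = uav_pos[None, None, :] - cell_centers           # UAV-to-cell vectors
    dist = np.linalg.norm(diff, axis=-1)
    # Angle between the line of sight and the vertical, used as a stand-in for
    # the angle to the cell normal on locally flat cells.
    cos_angle = np.clip(diff[..., 2] / np.maximum(dist, 1e-6), -1.0, 1.0)
    angle_deg = np.degrees(np.arccos(cos_angle))
    return ((dist <= d_th) & (angle_deg <= theta_f_deg)).astype(float)

def plot_quality(uav_pos, cell_centers):
    q = grid_quality(np.asarray(uav_pos, dtype=float),
                     np.asarray(cell_centers, dtype=float))
    return float(q.mean())                                  # Q_overall of Eq. (17)

# Example: 5x5 grid of cells at 1000 m elevation, UAV hovering 150 m above it
xs, ys = np.meshgrid(np.arange(5) * 20.0, np.arange(5) * 20.0, indexing="ij")
cells = np.stack([xs, ys, np.full_like(xs, 1000.0)], axis=-1)
print(plot_quality([40.0, 40.0, 1150.0], cells))
```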

4.4. Problem Formulation

In this work, the overall performance of the multi-UAV collaborative system is evaluated by jointly considering detection coverage effectiveness, appropriate flight altitude, energy consumption, and collision avoidance. To simplify computation, the time domain is discretized into uniform time steps of duration t step , with the discrete index denoted by k.
The detection coverage rate M V ( t ) at time t can be represented as
$M_V(t) = \sum_{i=0}^{n_x} \sum_{j=0}^{n_y} Q_{\mathrm{overall}}(i,j,t).$
We assume that the incremental detection coverage rate contributed by UAV i at each time step k can be represented as m V ( i , k ) . The final detection coverage rate M V ( K end ) can be expressed as
$M_V(K_{\mathrm{end}}) = \sum_{i=0}^{n_x} \sum_{j=0}^{n_y} Q_{\mathrm{overall}}(i,j,K_{\mathrm{end}}) = \sum_{i=0}^{n_{\mathrm{UAV}}} \sum_{k=0}^{K_{\mathrm{end}}} m_V(i,k),$
$K_{\mathrm{end}} = \frac{T_{\mathrm{end}}}{t_{\mathrm{step}}},$
where $K_{\mathrm{end}}$ represents the discretized task completion time.
Once M V ( k ) exceeds M th , we consider the detection to be sufficiently comprehensive, and the task can be terminated early.
If at time step k, UAV i visits a detection plot that has already been successfully sensed, the action is considered redundant and incurs a penalty PR . The total number of redundant visits N redundant ( i ) made by UAV i during the entire detection process can be defined as
$N_{\mathrm{redundant}}(i) = \sum_{k=0}^{K_{\mathrm{end}}} \mathbb{I}\big( DP_i(k) \in \mathbf{DP}_{\mathrm{visited}} \big),$
where I ( · ) represents the indicator function, DP i ( k ) represents the DP visited by UAV i at time step k, and DP visited represents the set of visited DPs. In this context, the total penalty PR redundant caused by redundant visits during the entire detection process can be expressed as
$PR_{\mathrm{redundant}} = PR \cdot \sum_{i=0}^{n_{\mathrm{UAV}}} N_{\mathrm{redundant}}(i).$
To describe the performance of UAV i in maintaining flight altitude, the parameter H d ( i ) is defined as
$H_d(i) = \sum_{k=0}^{K_{\mathrm{end}}-1} \big| h_d(i,k+1) - h_d(i,k) \big|,$
$h_d(i,t) = \Big| u_z(i,t) - \frac{u_{z\min} + u_{z\max}}{2} \Big|,$
where $h_d(i,t)$ represents the height difference between the altitude of UAV $i$ at time $t$ and the midpoint of the allowed flight altitude range.
The penalty PH height of flying at an inappropriate height can be expressed as
$PH_{\mathrm{height}} = PH \cdot \sum_{i=0}^{n_{\mathrm{UAV}}} H_d(i),$
where PH represents the penalty parameter.
To prevent UAVs from flying out of the task space, a penalty parameter PO is set. The total number of out-of-bounds events $N_{\mathrm{flyout}}(i)$ of UAV $i$ can be represented as
$N_{\mathrm{flyout}}(i) = \sum_{k=0}^{K_{\mathrm{end}}-1} \mathbb{I}\big( u(i,k) \notin \mathbb{R}^3 \big),$
The total penalty PO flyout caused by flying out of the task space throughout the detection process can be expressed as
$PO_{\mathrm{flyout}} = PO \cdot \sum_{i=0}^{n_{\mathrm{UAV}}} N_{\mathrm{flyout}}(i),$
To prevent UAV collisions and to improve efficiency by avoiding the clustering of multiple UAVs in the same region, a distance function d UAV ( i , t ) is defined to represent the distance between UAV i and its nearest neighboring UAV at time t. A penalty coefficient PN is defined to evaluate the duration and severity of danger triggered by a UAV. The penalty PN near for UAV i can then be expressed as
$PN_{\mathrm{near}} = PN \cdot \sum_{k=1}^{K_{\mathrm{end}}} \mathbb{I}\big( d_{\mathrm{UAV}}(i,k) < d_{\mathrm{safety}} \big) \big( d_{\mathrm{UAV}}(i,k-1) - d_{\mathrm{UAV}}(i,k) \big),$
To further prevent collisions between UAVs, a penalty parameter PC is set. The total number of collisions $N_{\mathrm{collision}}(i)$ involving UAV $i$ can be represented as
$N_{\mathrm{collision}}(i) = \sum_{k=0}^{K_{\mathrm{end}}-1} \mathbb{I}\big( u(i,k) = u(j,k) \mid j \in [0, n_{\mathrm{UAV}}],\ j \neq i \big),$
The total penalty PC collision caused by UAV collisions throughout the detection process can be expressed as
$PC_{\mathrm{collision}} = PC \cdot \sum_{i=0}^{n_{\mathrm{UAV}}} N_{\mathrm{collision}}(i),$
In addition, the total energy consumption E consumption of the multi-UAV collaborative system can be expressed as
E consumption = i = 0 n UAV 0 t E ( i , t ) d t = i = 0 n UAV k = 0 K end E ( i , k ) .
The objective of our optimization problem is to jointly maximize the coverage rate and minimize the energy consumption, while penalizing flight at inappropriate heights, flying out of the task space, flying too close to other UAVs, and collisions. We take the trajectory planning variable $v(i,k)$ as our decision variable. The optimization problem is formulated as follows:
$\mathbf{Pro}: \max_{v(i,k)} \ \alpha_1 \cdot M_V(K_{\mathrm{end}}) - PR_{\mathrm{redundant}} - PH_{\mathrm{height}} - PO_{\mathrm{flyout}} - PN_{\mathrm{near}} - PC_{\mathrm{collision}} - \alpha_2 \cdot E_{\mathrm{consumption}},$ (32a)
s.t. $u(i,k+1) = u(i,k) + v(i,k), \quad k \in [0, K_{\mathrm{end}}],$ (32b)
$0 \le v_x(i,k) \le v_{\mathrm{horizon}},$ (32c)
$0 \le v_y(i,k) \le v_{\mathrm{horizon}},$ (32d)
$0 \le v_z(i,k) \le v_{\mathrm{vertical}},$ (32e)
$u(i,0) = u_{\mathrm{start}}(i),$ (32f)
$u(i,k) \in \mathbb{R}^3: \{ x \in [u_{x\min}, u_{x\max}],\ y \in [u_{y\min}, u_{y\max}],\ z - EL(x,y) \in [u_{z\min}, u_{z\max}] \},$ (32g)
$B(i,k) \ge 0,$ (32h)
where parameters α 1 and α 2 represent the weights assigned to different factors affecting task performance. Equation (32b) describes the relationship between the UAV’s position and velocity. Equations (32c), (32d), and (32e) ensure that UAV velocities remain within their maximum limits. Equation (32f) specifies the UAVs’ starting positions. Equation (32g) enforces that UAVs stay within the task space boundaries. Finally, Equation (32h) ensures that the UAVs do not run out of battery.
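For illustration, the sketch below shows one way constraints (32b)–(32h) could be enforced on a candidate velocity command during rollout: the velocity is clipped to its bounds, the position is propagated, and the task-space and battery constraints are checked. All bound values and the elevation_fn helper are placeholders, not the settings used in the paper.

```python
import numpy as np

# Hedged sketch of enforcing constraints (32b)-(32h) on a candidate action.
V_HORIZON, V_VERTICAL = 15.0, 5.0            # (32c)-(32e) velocity limits [m/s]
U_MIN = np.array([0.0, 0.0, 100.0])          # task-space bounds for x, y and
U_MAX = np.array([3000.0, 3000.0, 300.0])    # relative altitude z - EL(x, y)

def step_position(u, v, elevation_fn, battery):
    v = np.clip(v, 0.0, [V_HORIZON, V_HORIZON, V_VERTICAL])   # (32c)-(32e)
    u_next = u + v                                             # (32b)
    rel_alt = u_next[2] - elevation_fn(u_next[0], u_next[1])
    in_space = (U_MIN[0] <= u_next[0] <= U_MAX[0]
                and U_MIN[1] <= u_next[1] <= U_MAX[1]
                and U_MIN[2] <= rel_alt <= U_MAX[2])           # (32g)
    has_power = battery > 0.0                                  # (32h)
    return u_next, in_space and has_power

# Example with a flat 1000 m terrain stand-in
u_next, feasible = step_position(np.array([100.0, 100.0, 1150.0]),
                                 np.array([8.0, 3.0, 1.0]),
                                 lambda x, y: 1000.0, battery=2.0e5)
print(u_next, feasible)
```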

5. Feature-Aware Elite Imitation Multi-Agent Reinforcement Learning

This section introduces the proposed DQPN architecture and FA-EIMARL algorithm.

5.1. Multi-Agent Markov Decision Process

To solve Equations (32a)–(32h) through MARL, a multi-agent Markov Decision Process (MDP) is designed as follows:
  • Observation of Each Agent  $o_{i,k}$: In the considered scenario, the observation space should provide information about the UAVs, including their current positions, the nearest UAV position and the distance to it, the detection status of the DPs near the UAV, the remaining power, and the system coverage rate. The observation space can be represented as $o_{i,k} = \{ u(i,k), u(j_n,k), d_{\mathrm{UAV}}(i,k), [DP(i,k)], B(i,k), M_V(k) \}$, where $j_n$ represents the nearest UAV to UAV $i$, and $[DP(i,k)]$ represents the detection status of the DPs near UAV $i$.
  • Action of Each Agent  a i , k : The action space, which describes the speed in the x, y, and z direction of UAV i , can be expressed as a i , k = { v x ( i , k ) , v y ( i , k ) , v z ( i , k ) } .
  • Feature of the Environment  f i , k : The feature space describes the scenario features agent i obtains, which is introduced in Section 5.4.
  • Reward of Each Agent  $r_{i,k}$: The reward serves as the environmental feedback received by agent $i$ when observing $o_{i,k}$, extracting features $f_{i,k}$, and executing action $a_{i,k}$. Similar to the objective function Pro in Equation (32a), the reward $r_{i,k}$ guides UAV $i$ to improve detection coverage, maintain a suitable flight altitude, reduce energy consumption, and avoid collisions. Following Equations (18) to (31), the reward can be defined as (an illustrative sketch of this reward composition is given after this list)
    $r_{i,k} = \alpha_1 \cdot \big[ m_V(i,k) - m_V(i,k-1) \big] - PR \cdot \mathbb{I}\big( DP_i(k) \in \mathbf{DP}_{\mathrm{visited}} \big) - PH \cdot H_d(i) - PO \cdot \mathbb{I}\big( u(i,k) \notin \mathbb{R}^3 \big) - PN \cdot \mathbb{I}\big( d_{\mathrm{UAV}}(i,k) < d_{\mathrm{safety}} \big) \big( d_{\mathrm{UAV}}(i,k-1) - d_{\mathrm{UAV}}(i,k) \big) - PC \cdot \mathbb{I}\big( u(i,k) = u(j,k) \mid j \in [0, n_{\mathrm{UAV}}],\ j \neq i \big) - \alpha_2 \cdot E(i,k).$
  • Set of Actions  A k :  A k denotes the set of actions available to all agents at time step k, which can be formulated as
    A k = { a 1 , k , a 2 , k , . . . , a n UAV , k } ,
  • The Return of Each Agent  R i : The objective of agent i’s policy is to maximize the expected cumulative reward, which is defined as
    $R_i = \sum_{k=0}^{K_{\mathrm{end}}} \gamma^{k-1} r_{i,k},$
    where $\gamma \in [0,1]$ denotes the discount factor that balances immediate and future rewards. Since the term $\gamma^{k-1}$ decreases with increasing time step $k$, rewards obtained later in the episode are discounted more heavily; choosing $\gamma$ close to 1 therefore preserves the influence of long-term rewards in the return.
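The sketch below assembles the per-step reward $r_{i,k}$ described above as a weighted coverage gain minus the penalty terms. The coefficient values and helper inputs are placeholders rather than the settings used in the experiments.

```python
# Hedged sketch of the per-step reward composition; all coefficients are assumptions.
def step_reward(cov_gain, redundant, height_dev, out_of_bounds,
                near_gap, collided, energy,
                alpha1=10.0, alpha2=1e-4,
                PR=1.0, PH=0.5, PO=5.0, PN=2.0, PC=20.0):
    r = alpha1 * cov_gain                    # incremental detection coverage
    r -= PR * float(redundant)               # re-visiting an already sensed plot
    r -= PH * height_dev                     # deviation from the cruise-altitude band
    r -= PO * float(out_of_bounds)           # leaving the 3D task space
    r -= PN * max(near_gap, 0.0)             # closing in below the safety distance
    r -= PC * float(collided)                # collision with another UAV
    r -= alpha2 * energy                     # per-step energy consumption
    return r

# Example: small coverage gain, safe flight, moderate energy use
print(step_reward(cov_gain=0.02, redundant=False, height_dev=3.0,
                  out_of_bounds=False, near_gap=0.0, collided=False, energy=250.0))
```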

5.2. Detection Quality Prediction Network

To simplify computation and reduce UAV inference time, we design a DQPN to predict the detection quality of UAVs. The architecture of the DQPN is illustrated in Figure 5. The DQPN consists of convolutional layers and fully connected layers.
We generated a dataset S p using the real terrain information and computed the real detection value. During the training process, we input a batch size n batch of sample matrix S p m , which consists of state vectors S p v for different samples of DPs. The state vector is composed of the UAV’s altitude difference relative to the lowest grid cell G lowest , the horizontal offset between the UAV i and a reference grid cell G 0 , 0 , and the altitude differences between G lowest and the four corner grids, as well as the center grid at time t. It can be expressed as
$S_p^v = \big\{ [u_z(i,t) - EL(G_{\mathrm{lowest}})],\ [u_x(i,t) - x(G_{0,0})],\ [u_y(i,t) - y(G_{0,0})],\ [EL(G_{0,0}) - EL(G_{\mathrm{lowest}})],\ [EL(G_{0,n_{\mathrm{grid}}}) - EL(G_{\mathrm{lowest}})],\ [EL(G_{n_{\mathrm{grid}},0}) - EL(G_{\mathrm{lowest}})],\ [EL(G_{n_{\mathrm{grid}},n_{\mathrm{grid}}}) - EL(G_{\mathrm{lowest}})],\ [EL(G_{\frac{n_{\mathrm{grid}}}{2},\frac{n_{\mathrm{grid}}}{2}}) - EL(G_{\mathrm{lowest}})] \big\}.$
We constructed a sample set $S_p^m$ using various DP data points and computed the corresponding real ground detection quality values according to Equation (17). The network outputs the predicted values $V_p = \{ v_p^1, v_p^2, \ldots, v_p^{n_{\mathrm{batch}}} \}$, where $v_p^i$ represents the prediction for the $i$th sample. The system's loss function can thus be defined as
$J_{\mathrm{DQPN}} = \| V_p - V_r \|^2,$
where $V_r$ represents the vector of real detection quality values.
During actual UAV task execution, the estimated value $v_p$ is obtained directly by feeding the DP information, denoted as $s_p^v$, into the DQPN module.
We trained the DQPN N DQPN ( S p m | S p v ) module by minimizing this loss function. Assuming the maximum epoch is λ max , the detailed training procedure is described as in Algorithm 1:
Algorithm 1 Detection Quality Prediction Model Generation
1: Input: The terrain dataset $S_p$.
2: Initialize the DQPN module $\mathcal{N}_{\mathrm{DQPN}}(S_p^m \mid S_p^v)$.
3: for epoch $\lambda = 1, 2, \ldots, \lambda_{\max}$ do
4:   Extract a batch of size $n_{\mathrm{batch}}$, denoted $S_p^m$, from $S_p$.
5:   Compute the real detection quality values $V_r$ of $S_p^m$.
6:   Input $S_p^m$ into $\mathcal{N}_{\mathrm{DQPN}}(S_p^m \mid S_p^v)$ to generate $V_p$.
7:   Compute the loss as in Equation (37).
8:   Update $\mathcal{N}_{\mathrm{DQPN}}(S_p^m \mid S_p^v)$.
9: end for
10: return The detection quality prediction model $\mathcal{N}_{\mathrm{DQPN}}(S_p^v)$.
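A minimal PyTorch sketch of Algorithm 1 is given below. For brevity it uses only fully connected layers over the 8-dimensional state vector $S_p^v$ (the paper's DQPN also contains convolutional layers), and the layer sizes, optimizer settings, and synthetic data are assumptions.

```python
import torch
import torch.nn as nn

# Hedged sketch of DQPN pre-training (Algorithm 1); fully connected layers only.
class DQPN(nn.Module):
    def __init__(self, in_dim=8, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1))                      # predicted detection quality

    def forward(self, x):
        return self.net(x).squeeze(-1)

def train_dqpn(samples, targets, epochs=50, batch_size=256, lr=1e-3):
    """samples: (N, 8) state vectors S_p^v; targets: (N,) qualities from Eq. (17)."""
    model = DQPN(in_dim=samples.shape[1])
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.MSELoss()                             # corresponds to Eq. (37)
    for _ in range(epochs):
        perm = torch.randperm(samples.shape[0])
        for start in range(0, samples.shape[0], batch_size):
            idx = perm[start:start + batch_size]
            loss = loss_fn(model(samples[idx]), targets[idx])
            opt.zero_grad()
            loss.backward()
            opt.step()
    return model

# Synthetic demonstration data; replace with terrain-derived samples and the
# real detection qualities computed via Eq. (17).
X, y = torch.randn(4096, 8), torch.rand(4096)
dqpn = train_dqpn(X, y, epochs=5)
```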

5.3. Elite Imitation Mechanism

To accelerate the training process and enable agents to learn from high-performing individuals, we propose an elite imitation module. This mechanism is designed to help agents escape prolonged periods of low-reward exploration by imitating elite behavior.
We introduce a mimicry cycle factor, denoted by $\delta$. During certain episodes determined by this factor, each agent $i$ does not update its policy network $\pi_i(\theta)$ directly. Instead, it generates an action sequence $A_i = \{ a_{i,0}, a_{i,1}, \ldots, a_{i,K_{\mathrm{end}}} \}$ and interacts with the environment to obtain a corresponding reward vector $R_i = \{ r_{i,0}, r_{i,1}, \ldots, r_{i,K_{\mathrm{end}}} \}$.
The elite policy is then identified based on the mean $\mu_i$ and variance $\sigma_i^2$ of the rewards in $R_i$. The evaluation metric can be defined as
$RE_i = \beta_1 \mu_i + \beta_2 \sigma_i^2.$
Specifically, we select the agent $i^\ast$ with the best-performing reward profile as the elite individual.
Other agents update their policy networks by soft copying from the elite policy using a soft update parameter $\vartheta$. The update rule is formulated as
$\theta_i = (1 - \vartheta)\, \theta_i + \vartheta\, \theta_{i^\ast},$
where $\theta_{i^\ast}$ and $\theta_i$ denote the parameters of the elite and regular agents' policy networks, respectively.
Furthermore, to maintain a certain level of exploration while benefiting from the elite imitation strategy, we gradually adjust the influence of the imitation process. Specifically, we gradually decrease $\vartheta$ and increase $\delta$ from their initial values $\vartheta_0$ and $\delta_0$. The detailed procedure of this algorithm is shown in Algorithm 2.
Algorithm 2 Elite Imitation Mechanism
1: Initialize the policy network $\pi_i(\theta)$ for each agent $i$.
2: Initialize $\vartheta = \vartheta_0$ and $\delta = \delta_0$.
3: for episode $\lambda = 1, 2, \ldots, \lambda_{\max}$ do
4:   if $\lambda \bmod \delta = 0$ then
5:     for agent $i = 1, 2, \ldots, n_{\mathrm{UAV}}$ do
6:       Generate $A_i$ from $\pi_i(\theta)$.
7:       Obtain the reward vector $R_i$ from the interaction between agent $i$ and the environment.
8:     end for
9:     Compute the reward evaluation metric $RE_i$ as in Equation (38) and select the elite agent with the maximum value.
10:    Update each $\pi_i(\theta)$ as in Equation (39).
11:    Update the parameters $\vartheta \leftarrow \frac{1}{2}\vartheta$ and $\delta \leftarrow 2\delta$.
12:  end if
13: end for
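The following PyTorch sketch illustrates one elite-imitation round: agents are scored with the metric of Equation (38), the elite actor's parameters are softly blended into all other actors as in Equation (39), and the imitation strength/period schedule of Algorithm 2 is then applied. The network shape and the β weights are assumptions.

```python
import torch
import torch.nn as nn

# Hedged sketch of one elite-imitation round (Eqs. (38)-(39), Algorithm 2).
def reward_score(rewards, beta1=1.0, beta2=0.1):
    r = torch.as_tensor(rewards, dtype=torch.float32)
    return beta1 * r.mean() + beta2 * r.var()          # RE_i of Eq. (38)

def elite_imitation(actors, episode_rewards, theta_mix):
    scores = [reward_score(r) for r in episode_rewards]
    elite = actors[int(torch.stack(scores).argmax())]
    for actor in actors:
        if actor is elite:
            continue
        with torch.no_grad():
            for p, p_elite in zip(actor.parameters(), elite.parameters()):
                p.mul_(1.0 - theta_mix).add_(theta_mix * p_elite)   # Eq. (39)

# Toy usage: four identical-architecture actors, one imitation round, then the
# decay schedule (halve the mixing ratio, double the imitation period).
actors = [nn.Sequential(nn.Linear(12, 64), nn.ReLU(), nn.Linear(64, 3))
          for _ in range(4)]
rewards = [torch.randn(200) + i for i in range(4)]      # last agent scores best here
theta_mix, delta_period = 0.5, 10
elite_imitation(actors, rewards, theta_mix)
theta_mix, delta_period = theta_mix / 2.0, delta_period * 2
```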

5.4. Feature-Aware MARL

To handle the continuous nature of UAV actions, the proposed FA-MARL adopts an actor–critic framework. With the integration of environmental features $f_{i,k}$, the traditional actor–critic structure is extended by incorporating $f_{i,k}$ into the input of the critic network. Consequently, the critic estimates the action–state–feature value, denoted as $V(o_{i,k}, a_{i,k}, f_{i,k})$.
Let $A_B$, $S_B$, $R_B$, $F_B$, and $S_B^{\prime}$ denote the actions, states, rewards, environmental features, and next states in a sampled batch of experience, respectively. The policy (i.e., actor network) for agent $i$ is represented by $\pi_i(S_B, F_B)$.
To improve training stability and performance, two critic networks are employed, $Q_1(S_B, A_B, F_B)$ and $Q_2(S_B, A_B, F_B)$, along with their corresponding target critic networks, $Q_1^{\prime}(S_B, A_B, F_B)$ and $Q_2^{\prime}(S_B, A_B, F_B)$, which periodically copy parameters from their respective primary critic networks.
The enhanced actor–critic learning process for each agent i can then be implemented through the following steps:
  • Initialize the update parameter for the actor network $\theta$, the update parameters for the critic networks $\phi_1$ and $\phi_2$, and the update parameters for the target critic networks $\phi_1^{\prime}$ and $\phi_2^{\prime}$. Initialize the replay buffer $\mathcal{D}$.
  • During an episode $\lambda$, let agent $i$ interact with the environment and store the observation $o_{i,k}$, action $a_{i,k}$, reward $r_{i,k}$, and feature $f_{i,k}$ in the replay buffer $\mathcal{D}$.
  • Randomly sample one batch of data A B , S B , R B , F B , and S B from D .
  • Use the target critic networks and the current policy to compute the target value. Specifically, the target Q-values Y Q are calculated using the minimum of the two target critics, following the Clipped Double Q-learning strategy to mitigate overestimation bias. The formulation is as follows:
    $Y_Q = R_B + \gamma \min\big( Q_1^{\prime}(S_B^{\prime}, A_B^{\prime}, F_B),\ Q_2^{\prime}(S_B^{\prime}, A_B^{\prime}, F_B) \big),$ where $A_B^{\prime} = \pi_i(S_B^{\prime}, F_B)$ denotes the actions produced by the current policy at the next states.
  • Minimize the mean squared error (MSE) between the predicted Q-values and the target Q-values for the critic update. We update the primary and secondary critic networks by minimizing the loss function J Q i ( ϕ ) :
    $J_{Q_i}(\phi_i) = \mathbb{E}\Big[ \big( Q_i(S_B, A_B, F_B) - Y_Q \big)^2 \Big], \quad i \in \{1, 2\},$
    where E [ · ] denotes the mathematical expectation.
  • Update the actor network by maximizing the estimated Q-value from critic Q 1 ( · ) . The loss function J π ( θ ) is expressed as
    $\nabla_{\theta} J_{\pi}(\theta) = \mathbb{E}\Big[ \nabla_{a} Q_1(S_B, A_B, F_B) \big|_{A_B = \pi_i(S_B, F_B)} \cdot \nabla_{\theta} \pi_i(\theta) \Big].$
  • Softly update the target critic networks to ensure training stability. Instead of directly copying the weights, the target networks are updated using a weighted average of the current and previous target weights:
    $\phi_i^{\prime} = \xi \phi_i + (1 - \xi) \phi_i^{\prime}, \quad i \in \{1, 2\},$
    where ξ is the soft update parameter.
  • Repeat steps 2–7 in all training iterations until the policy converges.
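A compact PyTorch sketch of one feature-aware update round (Equations (40)–(43)) follows: both critics take (state, action, feature) inputs, the target uses the clipped double-Q minimum, the actor ascends $Q_1$, and the target critics are soft-updated. Dimensions, learning rates, $\gamma$, and $\xi$ are assumptions.

```python
import copy
import torch
import torch.nn as nn

# Hedged sketch of the feature-aware actor-critic update (Eqs. (40)-(43)).
S_DIM, A_DIM, F_DIM, GAMMA, XI = 12, 3, 8, 0.99, 0.01

def mlp(in_dim, out_dim):
    return nn.Sequential(nn.Linear(in_dim, 128), nn.ReLU(), nn.Linear(128, out_dim))

actor = mlp(S_DIM + F_DIM, A_DIM)
critics = [mlp(S_DIM + A_DIM + F_DIM, 1) for _ in range(2)]
targets = [copy.deepcopy(c) for c in critics]
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-4)
critic_opts = [torch.optim.Adam(c.parameters(), lr=1e-3) for c in critics]

def update(batch):
    s, a, r, f, s_next = batch                     # sampled from the replay buffer
    with torch.no_grad():
        a_next = actor(torch.cat([s_next, f], dim=-1))
        q_next = torch.min(*[t(torch.cat([s_next, a_next, f], dim=-1)) for t in targets])
        y_q = r.unsqueeze(-1) + GAMMA * q_next     # Eq. (40), clipped double Q target
    for critic, opt in zip(critics, critic_opts):  # Eq. (41), MSE critic loss
        loss = ((critic(torch.cat([s, a, f], dim=-1)) - y_q) ** 2).mean()
        opt.zero_grad(); loss.backward(); opt.step()
    a_pi = actor(torch.cat([s, f], dim=-1))        # Eq. (42): ascend Q1
    actor_loss = -critics[0](torch.cat([s, a_pi, f], dim=-1)).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()
    with torch.no_grad():                          # Eq. (43): soft target update
        for tgt, src in zip(targets, critics):
            for p_t, p_s in zip(tgt.parameters(), src.parameters()):
                p_t.mul_(1.0 - XI).add_(XI * p_s)

batch = (torch.randn(64, S_DIM), torch.randn(64, A_DIM), torch.randn(64),
         torch.randn(64, F_DIM), torch.randn(64, S_DIM))
update(batch)
```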
We now proceed to detail the specific steps of the proposed FA-MARL algorithm, which utilizes a pre-trained feature extraction network (FEN) $\mathcal{N}_{\mathrm{FEN}}(\tilde{S}, \tilde{X}, \tilde{K})$ to generate environmental features, where $\tilde{S}$, $\tilde{X}$, and $\tilde{K}$ denote a batch of size $n_{\mathrm{batch}}$ of task spaces, dynamic grid positions detected by the UAVs, and discrete time steps, respectively, with $\tilde{S} = \{ \tilde{s}_0, \tilde{s}_1, \ldots, \tilde{s}_{n_{\mathrm{batch}}} \}$, $\tilde{X} = \{ \tilde{x}_0, \tilde{x}_1, \ldots, \tilde{x}_{n_{\mathrm{batch}}} \}$, and $\tilde{K} = \{ \tilde{k}_0, \tilde{k}_1, \ldots, \tilde{k}_{n_{\mathrm{batch}}} \}$.
To pre-train the FEN, we first construct a dataset $\mathcal{M}$ composed of point array samples $\{ \tilde{s}, \tilde{x}, \tilde{k} \}$. Each sample includes elevation data from $n_{\mathrm{sample}}$ sampled ground points. The label of each sample can be represented as $\{ \tilde{s}, \tilde{x} \}$. Figure 6 shows the structure of the feature extraction model pre-training process. First, the input triplet $\{ \tilde{S}, \tilde{X}, \tilde{K} \}$ is passed through the feature extraction layers to generate environmental features $f_{i,k}$ for each agent $i$. The output of this module, denoted as $F(\tilde{S}, \tilde{X}, \tilde{K})$, is subsequently fed into a series of sequential layers to produce a vectorized label $\tilde{L}(\tilde{S}, \tilde{X}, \tilde{K})$.
This predicted label vector is then transformed into an estimated label $\{ \hat{\tilde{S}}, \hat{\tilde{X}} \}$. Assuming the true label vector is $\{ \tilde{S}, \tilde{X} \}$ and the feature extraction network is parameterized by $\omega$ as $\mathcal{N}_{\mathrm{FEN}}(\tilde{S}, \tilde{X}, \tilde{K})$, the corresponding loss function $J_F(\omega)$ is defined as
$J_F(\omega) = \big\| \{ \hat{\tilde{S}}, \hat{\tilde{X}} \} - \{ \tilde{S}, \tilde{X} \} \big\|^2,$
where $\| \cdot \|$ denotes the Euclidean distance.
In the proposed FA-MARL framework, the network N FEN ( S ˜ , X ˜ , K ˜ ) is pre-trained for λ max epochs to ensure reliable environmental feature extraction. After that, { S ˜ , X ˜ , K ˜ } is sampled as batch input data. In each episode λ , agents perform detection tasks within random task spaces s ˜ , with corresponding dynamic ground observations x ˜ at time k ˜ .
The output N FEN ( s ˜ , x ˜ , k ˜ ) is used as the feature component f ( i , k ) of agent i, which is computed as
f ( i , k ) = β F · N FEN ( s ˜ , x ˜ , k ˜ ) ,
where β F is a feature normalization coefficient. The complete design of the proposed FA-EIMARL algorithm is detailed in Algorithm 3.
Algorithm 3 Feature-Aware Elite Imitation Multi-Agent Reinforcement Learning
1: Input: $\pi(\cdot)$, $Q_1(\cdot)$, $Q_2(\cdot)$, $Q_1^{\prime}(\cdot)$, $Q_2^{\prime}(\cdot)$, $\mathcal{N}_{\mathrm{FEN}}(\cdot)$, and $\mathcal{N}_{\mathrm{DQPN}}(\cdot)$.
2: Initialize $\theta$, $\phi_1$, $\phi_2$, $\phi_1^{\prime}$, $\phi_2^{\prime}$, and $\omega$.
3: Pre-train $\mathcal{N}_{\mathrm{DQPN}}(\cdot)$ as in Algorithm 1.
4: for epoch $\lambda = 1, 2, \ldots, \lambda_{\max}$ do
5:   Sample $\{\tilde{S}, \tilde{X}, \tilde{K}\}$.
6:   Update $\mathcal{N}_{\mathrm{FEN}}(\cdot)$ as in Equation (44).
7: end for
8: Initialize the replay buffer $\mathcal{D}$.
9: for episode $\lambda = 1, 2, \ldots, \lambda_{\max}$ do
10:   Update the elite imitation strategy as in Algorithm 2.
11:   Sample a set $\tilde{S}^{\ast}$ of $n_{ts}$ task spaces from $\tilde{S}$.
12:   for scenario $\tilde{s} \in \tilde{S}^{\ast}$ do
13:     for agent $i = 1, 2, \ldots, n_{\mathrm{UAV}}$ do
14:       Reset $o_{i,0}$, $a_{i,0}$, $r_{i,0}$, and $f_{i,0}$.
15:     end for
16:     for time step $k = 1, 2, \ldots, k_{\max}$ do
17:       Calculate the observed environmental feature as in Equation (45).
18:       Update $o_{i,k}$, $a_{i,k}$, $r_{i,k}$, and $f_{i,k}$.
19:       Update $Q_1(\cdot)$ and $Q_2(\cdot)$ as in Equation (41).
20:       Update $\pi(\cdot)$ as in Equation (42).
21:       Update $Q_1^{\prime}(\cdot)$ and $Q_2^{\prime}(\cdot)$ as in Equation (43).
22:       if $M_V(k) \geq M_{\mathrm{th}}$ (task end) then
23:         Break
24:       end if
25:     end for
26:   end for
27: end for
28: return Policy $\pi(\cdot)$.
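To make the feature-aware component more tangible, the sketch below shows one possible pre-training loop for the FEN: a point-wise encoder pools sampled elevation points into a feature vector, a label head predicts the scenario label for the loss of Equation (44), and the pooled feature is scaled by $\beta_F$ as in Equation (45). The encoder architecture, pooling choice, and all dimensions are assumptions, not the authors' design.

```python
import torch
import torch.nn as nn

# Hedged PyTorch sketch of FEN pre-training (Eq. (44)) and feature scaling (Eq. (45)).
class FEN(nn.Module):
    def __init__(self, point_dim=4, feat_dim=32, label_dim=2):
        super().__init__()
        # point_dim: (x, y, elevation, time-step index) per sampled ground point
        self.point_mlp = nn.Sequential(nn.Linear(point_dim, 64), nn.ReLU(),
                                       nn.Linear(64, feat_dim))
        self.label_head = nn.Sequential(nn.Linear(feat_dim, 64), nn.ReLU(),
                                        nn.Linear(64, label_dim))

    def forward(self, points):
        # points: (batch, n_sample, point_dim) point arrays
        feat = self.point_mlp(points).max(dim=1).values   # permutation-invariant pooling
        return feat, self.label_head(feat)

fen = FEN()
opt = torch.optim.Adam(fen.parameters(), lr=1e-3)
points = torch.randn(16, 128, 4)        # 16 scenarios, 128 sampled points each
labels = torch.randn(16, 2)             # vectorized scenario labels
for _ in range(3):                       # pre-training loop minimizing Eq. (44)
    _, pred = fen(points)
    loss = ((pred - labels) ** 2).sum(dim=1).mean()
    opt.zero_grad(); loss.backward(); opt.step()

beta_F = 0.1                             # feature normalization coefficient (Eq. (45))
with torch.no_grad():
    f_ik, _ = fen(points[:1])
    f_ik = beta_F * f_ik                 # feature component fed to the policy
```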

6. Simulation Results

This section illustrates the experimental setup and simulation results.

6.1. Experimental Setup

We analyzed mountainous terrain data from OpenTopography [59] and constructed 10 UAV task spaces based on the Fujisan (Mount Fuji) dataset, with coordinates ranging from longitude 138.6° to 138.8° and latitude 35.3° to 35.4°, featuring varying elevations and steepness levels to simulate diverse environmental conditions for flight and sensing tasks. Table 1 and Table 2, respectively, summarize the environmental parameters and hyperparameters used in this work.
Based on the Fujisan mountain terrain data and the communication parameters in Table 1, and according to Equation (14), the communication SNR between UAVs mainly depends on the path loss PL . According to 3GPP Standards [60,61], the extreme path loss at kilometer-scale distances in the Fujisan mountain terrain is approximately 130 dB, indicating that the SNR of task UAVs is far greater than 13 dB, which can be regarded as indicative of good communication quality.
We evaluate the proposed FA-EIMARL algorithm from four key perspectives: training performance, scalability and complexity analysis, ablation study, and 3D visualization. The following performance metrics are defined:
  • Reward (M1): Measures the average reward received per agent during one episode.
  • Detection Coverage Rate (M2): Measures the overall detection quality across an episode.
  • Task Completion Time (M3): Measures the overall detection mission time across an episode.
  • Convergence Time (M4): Measures the number of iterations required for the algorithm to converge.
  • Stability and Generalization (M5): Measures post-convergence reward fluctuations to assess robustness.
In addition, we evaluate the inference time, defined as the duration required for agents to make decisions during execution. The following metrics are used to assess inference efficiency:
  • Scalability (M6): Measures how inference time per time step changes with the number of UAVs.
  • Policy Efficiency (M7): Measures the inference time of UAVs during the detection process.

6.2. Training Performance

In the regular experiments, we set $n_{\mathrm{UAV}} = 4$. Since non-RL algorithms such as heuristics struggle to cope with UAV trajectory optimization tasks that involve continuous action and state spaces, we compare the proposed FA-EIMARL only with DRL algorithms. We selected continuous-action-space MARL algorithms as baselines: MASAC, MATD3, MAPPO, and MADDPG. These algorithms are introduced in Section 2.
Evaluation of M1 (Reward): As shown in Figure 7, the proposed FA-EIMARL achieves significantly higher average rewards compared to all DRL baselines. MAPPO demonstrates the poorest performance, likely due to its on-policy learning strategy, which struggles in highly dynamic environments. In contrast, MADDPG, MATD3, and MASAC attain similar reward levels, outperforming MAPPO. These off-policy methods benefit from experience replay mechanisms that improve resource efficiency. However, they still face challenges in adapting to dynamically changing task spaces, resulting in lower overall performance compared to FA-EIMARL.
Evaluation of M2 (Detection Coverage Rate): As illustrated in Figure 8, FA-EIMARL achieves the task objectives more efficiently than all competing DRL methods. MAPPO again yields the worst results. MADDPG and MATD3 outperform MASAC, both surpassing MAPPO, indicating that the off-policy approaches better support spatial exploration, though still not as effectively as the proposed method.
Evaluation of M3 (Task Completion Time): As illustrated in Table 3, compared with the other DRL algorithms, FA-EIMARL requires a shorter task completion time and can accomplish the search mission more quickly and efficiently. MATD3 and MADDPG demonstrate better task time performance than MASAC, while MAPPO fails to complete the task within the specified time.
Evaluation of M4 (Convergence Time): As shown in Figure 7, all baseline methods, except MAPPO, converge within 2000 episodes. Among them, FA-EIMARL exhibits the fastest convergence. This improvement is attributed to two key mechanisms: the elite imitation mechanism (Section 5.3) and the feature-aware extraction method (Section 5.4). The former allows agents to mimic expert-level trajectories during training, effectively narrowing the exploration space and accelerating policy optimization. By selecting and learning from elite individuals, the agent reduces the early-stage exploration cost, adjusting its parameters based on already acquired experience. Furthermore, the learning rate is gradually slowed to maintain sufficient freedom for exploration, ensuring a balanced trade-off between guided learning and autonomous discovery. The latter provides additional environmental features that help the critic more accurately evaluate the value of state–action pairs. This enables the agent to make more informed and reasonable assessments of its actions even during the early stages of training, thereby guiding the convergence direction more effectively and improving overall learning efficiency. Together, these mechanisms significantly enhance sample efficiency and lead to faster convergence compared to other DRL baselines.
Evaluation of M5 (Stability and Generalization): Moreover, the reward curves of the proposed FA-EIMARL are significantly smoother and more stable than those of all baseline methods. Compared to MADDPG and MATD3, FA-EIMARL exhibits higher stability and reduced variance across episodes. It also shows more consistent performance than MASAC and outperforms MAPPO by a large margin in both stability and generalization. This improvement is primarily driven by two core components: the elite imitation mechanism and the feature-aware extraction module. The elite imitation mechanism enables agents to imitate high-performing individuals, which not only reduces unnecessary exploration costs but also helps prevent convergence to suboptimal or biased policies. Meanwhile, the feature-aware extraction method allows UAVs to extract generalized environmental representations from diverse task spaces. This capability empowers agents to dynamically adapt their policies to different scenarios, thereby maintaining robust performance under varying environmental conditions.
Therefore, it can be concluded that the proposed FA-EIMARL achieves superior performance in M1 compared with all other DRL baselines (MADDPG, MATD3, MASAC, MAPPO). Moreover, it demonstrates clear advantages in the M2 and M3 metrics, outperforming state-of-the-art DRL approaches in detection coverage and task completion time, while also converging faster (M4).

6.3. Scalability and Complexity Analysis

In this experiment, we first vary n UAV from 4 to 15 and observe the trend of inference time in order to analyze the computational complexity. We then discuss the impact of the DQPN on the performance of FA-EIMARL. Finally, we set n UAV = 4 and compare the inference time of the proposed FA-EIMARL with that of the baseline algorithms introduced in Section 6.2.
Evaluation of M6 (Scalability): Table 4 illustrates the relationship between the inference time of FA-EIMARL and the number of agents. The inference time remains relatively stable as the number of agents increases. This is because the information required for each UAV to perform its task does not grow with the number of agents, and the use of the DQPN further limits the additional computational complexity. Table 5 illustrates the relationship between the total task rewards and the number of agents. The task reward remains at a relatively high level overall but decreases slightly as the number of agents increases. This is primarily because, with more UAVs in the scenario, the likelihood of UAVs approaching each other and entering the safety distance increases. Moreover, since the size of the exploration space remains fixed, a larger number of UAVs makes it more challenging to avoid previously explored areas, which in turn reduces the task reward. In general, increasing the number of agents does not exert a significant impact on the performance of the FA-EIMARL algorithm.
As shown in Table 6, the DQPN module accelerates the calculation of DP detection quality by roughly three orders of magnitude compared with direct computation.
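The speed-up reported in Table 6 stems from replacing a per-cell numerical evaluation of DP detection quality with a single forward pass of a small network. The sketch below contrasts the two approaches at the interface level; the placeholder quality function, the input dimension, and the hidden width are illustrative assumptions and deliberately much simpler than the exact geometric computation used in the paper.

```python
import torch
import torch.nn as nn

def detection_quality_direct(uav_position, dp_cells):
    """Stand-in for the exact computation: iterate over the n_grid x n_grid
    sub-cells of one DP and aggregate a per-cell quality term. The formula
    here is a placeholder; only the cost pattern (per-cell work) matters."""
    dists = torch.linalg.norm(dp_cells - uav_position, dim=-1)
    return torch.clamp(1.0 - dists / dists.max(), 0.0, 1.0).mean()

class DQPN(nn.Module):
    """A lightweight MLP mapping the UAV/DP relative geometry (e.g., relative
    position and viewing angles) to a predicted detection quality in [0, 1]."""
    def __init__(self, in_dim=6, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1), nn.Sigmoid(),
        )

    def forward(self, x):
        return self.net(x)

# Offline, the DQPN would be regressed against detection_quality_direct on
# sampled geometries (cf. the loss curve in Figure 10); online, one forward
# pass then replaces the per-cell loop, which explains the gap in Table 6.
```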
As shown in Table 7, the EI process performs elite selection and parameter replication based on the UAVs’ experience. Consequently, the number of parameters involved is proportional to the neuron parameters of the lightweight network. The resulting overhead is negligible compared with gradient computations during training and the inference time of UAVs, and thus has an insignificant impact on the overall time complexity of the training process.
As shown in Figure 9, the overhead of the FA module does not increase linearly with the number of sampling points; instead, its growth rate gradually decreases, and the overall overhead remains at a low level. Compared with the UAVs’ inference time, it is negligible.
The loss curve obtained while training the DQPN is shown in Figure 10, from which it can be seen that the DQPN achieves a high level of prediction reliability.
Evaluation of M7 (Policy Efficiency): For each algorithm, we take its minimum duration to achieve stable rewards as the basis for measuring inference time. As depicted in Table 8, FA-EIMARL introduces a slight increase in inference overhead due to its more complex network structure, the FEN. However, since the FEN is a lightweight convolutional network, the impact remains limited. Meanwhile, the core acceleration component, the DQPN, significantly reduces the time required to predict UAV detection success rates without sacrificing precision. This reduction in inference time gives FA-EIMARL a substantial advantage over the other algorithms.
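Since the FEN is described as a lightweight convolutional network over the structured terrain elevation array, the following sketch of one plausible architecture indicates the scale of the added cost; the layer widths, the output feature dimension, and the 20 x 20 input resolution (chosen here to match the DP grid in Table 1) are illustrative assumptions rather than the exact design.

```python
import torch
import torch.nn as nn

class FeatureExtractionNetwork(nn.Module):
    """Illustrative lightweight CNN: encodes a normalized elevation patch
    (1 x 20 x 20) into a compact feature vector for the critic's input."""
    def __init__(self, feature_dim=32):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 8, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(8, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(4),   # -> 16 x 4 x 4 regardless of input size
            nn.Flatten(),
            nn.Linear(16 * 4 * 4, feature_dim),
        )

    def forward(self, elevation_patch):
        return self.encoder(elevation_patch)

# Example: a batch of four normalized elevation patches, one per UAV.
features = FeatureExtractionNetwork()(torch.rand(4, 1, 20, 20))
print(features.shape)  # torch.Size([4, 32])
```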
In summary, there exists a trade-off between the richness of information that networks can extract and their computational complexity in deep reinforcement learning. The proposed FA-EIMARL algorithm, with the support of the DQPN module, achieves stable and superior inference time performance compared to other DRL methods.

6.4. Ablation Study

In this experiment, we set n UAV = 4 and conducted an ablation study to evaluate the effectiveness of the proposed feature-aware (FA) and elite imitation (EI) mechanisms. As shown in Figure 11, the reward performance noticeably degrades when either the FA or EI module is removed. In particular, EIMARL, which lacks any feature extraction module, yields the lowest reward, highlighting the crucial role of the FA mechanism in guiding task perception.
Moreover, both EIMARL and FAMARL, which exclude the FA and EI modules, respectively, exhibit significant fluctuations in their reward curves. This observation underscores the importance of both modules in enhancing training stability and convergence robustness. The EI mechanism reduces exploration overhead by guiding agents with prior expert behaviors, while the FA module improves the agents' adaptability to varying task spaces through more informative environmental representations.

6.5. Three-Dimensional Visualization

In this experiment, we set n UAV = 4 and present the visualization of UAV trajectories in 3D scenes. We simulate different mountainous areas as mentioned in Section 6.1, with different mountain heights and steepness levels. The visualization results for 10 scenarios are available at https://github.com/usainzhou0301/FA-EIMARL (accessed on 4 August 2025).
This section presents only exemplary visualization results of trajectory optimization for a multi-UAV cooperative system in one mountainous area. As shown in Figure 12 and Figure 13, the multi-UAV system covers the exploration space at a high rate with low trajectory overlap. Each UAV settles into its own working region and, to some extent, fills the coverage gaps left by the other UAVs. In addition, the UAVs consistently maintain an appropriate flight altitude and adjust smoothly to changes in terrain elevation.
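Visualizations like Figure 12 and Figure 13 can be reproduced from the released trajectory logs with a few lines of plotting code. The sketch below assumes the terrain is available as an elevation grid and each UAV trajectory as an array of 3D positions; the synthetic terrain and random-walk trajectories here are placeholders standing in for the actual logged data.

```python
import numpy as np
import matplotlib.pyplot as plt

# Placeholder inputs: an elevation grid over the 2000 m x 2000 m task space and
# one (T, 3) position array per UAV; in practice these come from the logs.
x = np.linspace(0, 2000, 200)
y = np.linspace(0, 2000, 200)
X, Y = np.meshgrid(x, y)
Z = 300 * np.exp(-((X - 1000) ** 2 + (Y - 1000) ** 2) / 5e5)  # synthetic hill
rng = np.random.default_rng(0)
trajectories = [np.cumsum(rng.normal(size=(500, 3)) * [4, 4, 0.5], axis=0)
                + [400 * (k + 1), 400 * (k + 1), 350] for k in range(4)]

fig = plt.figure(figsize=(11, 4))

# Top-down view with the elevation heatmap as background (cf. Figure 12).
ax1 = fig.add_subplot(1, 2, 1)
ax1.pcolormesh(X, Y, Z, cmap="terrain")
for k, traj in enumerate(trajectories):
    ax1.plot(traj[:, 0], traj[:, 1], label=f"UAV {k + 1}")
ax1.set_xlabel("x (m)"); ax1.set_ylabel("y (m)"); ax1.legend()

# Three-dimensional view of terrain and flight paths (cf. Figure 13).
ax2 = fig.add_subplot(1, 2, 2, projection="3d")
ax2.plot_surface(X, Y, Z, cmap="terrain", alpha=0.6)
for traj in trajectories:
    ax2.plot(traj[:, 0], traj[:, 1], traj[:, 2])
ax2.set_xlabel("x (m)"); ax2.set_ylabel("y (m)"); ax2.set_zlabel("z (m)")

plt.tight_layout()
plt.show()
```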

7. Conclusions

In this paper, we present a comprehensive study on trajectory optimization for multi-UAV collaborative systems operating in complex mountainous environments. The UAVs perform search and detection missions, aiming to maximize coverage while minimizing task completion time. To address the challenges posed by complex terrain, varying flight altitudes, limited battery capacity, and collision avoidance, we developed a model-based, data-driven methodology that jointly optimizes detection coverage, flight altitude, energy consumption, and collision avoidance.
To realistically reproduce the UAV flight process, we constructed UAV dynamics, communication, and detection models based on the existing literature and real UAV parameters. We formulated the problem and defined constraints for key aspects such as detection coverage, flight altitude, energy consumption, and collision avoidance.
To enhance training efficiency and stability, we introduce an EI mechanism, with which the top-performing agent periodically shares its parameters with others in a controlled manner, accelerating policy convergence while maintaining sufficient exploration. Building on this, we propose the FA-EIMARL algorithm, which integrates environmental features extracted from structured point arrays of terrain elevation. This enables the UAVs to adapt rapidly to dynamic and diverse task environments. Additionally, we designed a lightweight DQPN to efficiently estimate the reliability of terrain mapping, reducing unnecessary computation while ensuring mapping accuracy.
Experimental results demonstrate that the proposed FA-EIMARL significantly outperforms baseline methods, achieving shorter task completion times, higher coverage rates, and better adaptability in scenarios with varying terrain complexities.
Looking forward, we plan to further improve environmental feature extraction techniques to better capture terrain characteristics and enhance UAV decision-making. We also aim to increase scenario complexity, enabling multi-UAV systems to handle a broader range of tasks under dynamic conditions, thereby further improving their robustness, generalization, and operational efficiency in real-world search and rescue missions.

Author Contributions

Conceptualization, Q.Z., Y.T. and M.T.; methodology, Q.Z.; software, Q.Z. and Q.S.; validation, Q.Z. and Y.T.; formal analysis, Y.T. and Q.S.; investigation, Q.Z. and Y.T.; resources, Q.Z. and M.T.; data curation, Q.Z.; writing—original draft preparation, Q.Z.; writing—review and editing, Q.Z., Y.T., Q.S. and M.T.; visualization, Q.Z.; supervision, M.T.; project administration, M.T.; funding acquisition, M.T. and Q.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This work is supported by the JST ASPIRE (JPMJAP2325) and JST SPRING GX (JPMJSP2108).

Data Availability Statement

All experimental results and data are available at https://github.com/usainzhou0301/FA-EIMARL (accessed on 4 August 2025).

Conflicts of Interest

The authors declare no conflicts of interest.

Figure 1. The envisioned scenario of the multi-UAV collaborative system, in which UAVs are applied to perform detection in a mountainous region.
Figure 2. Flowchart of the model-based data-driven methodology.
Figure 3. Illustration of UAV sensing and the detection plot divided into grids.
Figure 4. Illustration of the UAV’s FOV, demonstrating how the viewing angle and relative position between the UAV and the target ground area affect the sensing coverage and quality.
Figure 5. The structure of the detection quality prediction network.
Figure 6. The structure of the feature extraction model pre-training process.
Figure 7. Reward performance comparison with DRL algorithms.
Figure 8. Detection coverage quality performance comparison with DRL algorithms.
Figure 9. The computational overhead of the FA module varies with the number of sampling points.
Figure 10. Loss curve of training DQPN.
Figure 11. Reward comparison in ablation study.
Figure 12. A top-down view of multi-UAV collaborative system trajectory visualization with a mountainous terrain elevation heatmap as the background.
Figure 13. Three-dimensional view of multi-UAV collaborative system trajectory visualization.
Table 1. Environmental parameters for the UAV task.

Symbol | Definition | Value | Unit
u_x^min | Task Space Left Edge | 0 | m
u_x^max | Task Space Right Edge | 2000 | m
u_y^min | Task Space Back Edge | 0 | m
u_y^max | Task Space Front Edge | 2000 | m
u_z^min | Task Space Bottom Edge | 50 | m
u_z^max | Task Space Top Edge | 100 | m
T_end | Maximum Mission Time | 2000 | s
t_d | Detection Cycle | 10 | s
n_x | Number of DPs in x Direction | 20 | -
n_y | Number of DPs in y Direction | 20 | -
v_horizon | Velocity Limitation in the Horizontal Direction | 10 | m/s
v_vertical | Velocity Limitation in the Vertical Direction | 5 | m/s
K_end | Discretized End Time | 2000 | -
m_UAV | UAV Mass | 0.2 | kg
g | Gravitational Acceleration | 9.8 | m/s^2
ρ_air | Air Density | 1.225 | kg/m^3
C_d | Drag Coefficient | 0.5 | -
A_surf | UAV Fuselage Area | 0.01 | m^2
n_prp | Propeller Number | 4 | -
R_prp | Propeller Radius | 0.1 | m
η | Mechanical Efficiency | 0.8 | -
P_TX | UAV Transmit Power | 17 | dBm
G_t | Transmit Antenna Gain | 2 | dBi
G_r | Receive Antenna Gain | 2 | dBi
L | System Average Path Loss | 2 | -
f_c | UAV Communication Frequency | 3.5 | GHz
k_B | Boltzmann Constant | 1.38 × 10^-23 | J/K
T_K | Temperature | 298 | K
c | Speed of Light | 3 × 10^8 | m/s
B* | Battery Capacity | 311,040 | J
P_static | Static Power | 4 | W
V | Voltage | 5 | V
f* | Clock Frequency | 200 × 10^6 | Hz
α* | Activity Factor | 0.5 | -
C* | Load Capacitance | 6.4 | nF
n_grid | Grid Number in Each Direction per DP | 100 | -
M_th | Coverage Rate Threshold | 0.85 | -
d_th | Distance Threshold | 150 | m
θ_f | FOV Threshold | 45 | °
t_step | Uniform Time Step | 1 | s
n_UAV | UAV Number | 4, 6, 10, 12, 15 | -
Table 2. Hyperparameters for reinforcement learning.

Symbol | Definition | Value
P_R | Penalty Parameter | 5
P_H | Penalty Parameter | 0.02
P_O | Penalty Parameter | 50
P_N | Penalty Parameter | 0.01
P_C | Penalty Parameter | 50
α_1 | Weight Parameter | 20
α_2 | Weight Parameter | 0.001
γ | Discount Factor | 0.99
r_a | Learning Rate for Actor | 10^-5
r_c | Learning Rate for Critic | 10^-4
n_batch | DQPN Training Batch Size | 128
n_batch | FEN Training Batch Size | 128
n_batch* | MARL Training Batch Size | 128
D | Replay Buffer Size | 2^16
λ_max | Maximum Training Epoch of DQPN | 100
λ_max | Maximum Training Epoch of FEN | 1000
λ_max* | Maximum Training Episode of MARL | 10,000
ϑ* | Parameter Initial Value | 1
δ* | Parameter Initial Value | 10
λ_max | Maximum Training Epoch | 100
ξ | Soft Update Parameter | 0.01
β_F | Feature Normalization Coefficient | 1
S̃ | Scenario Number | 10
X̃ | Position Number | 80
Table 3. The average task completion time comparison with DRL algorithms.

Algorithm | Average Task Completion Time (s)
MADDPG | 1842
MASAC | 1927
MATD3 | 1839
MAPPO | >2000
FA-EIMARL | 1236
Table 4. The inference time changes as the agent number increases.

Agent Number | Inference Time
n_UAV = 4 | 34.71 ms
n_UAV = 6 | 35.36 ms
n_UAV = 10 | 37.28 ms
n_UAV = 12 | 37.42 ms
n_UAV = 15 | 39.61 ms
Table 5. The total task rewards change as the agent number increases.

Agent Number | Task Rewards
n_UAV = 4 | 4541.68
n_UAV = 6 | 4356.57
n_UAV = 10 | 4069.32
n_UAV = 12 | 3932.35
n_UAV = 15 | 3614.98
Table 6. Improvement in time performance brought by the DQPN.

Approach | Operation Time
DQPN Prediction | 0.4442 ms
Direct Computation | 525.9 ms
Table 7. Function parameters of the elite imitation module.

Average Overhead | Computational Parameters | Total Extra Overhead
85.09 ms | 2300 | 872.2 ms
Table 8. The performance comparison of inference time with baseline algorithms.

Algorithm | Inference Time
MADDPG | 162.39 ms
MASAC | 320.21 ms
MATD3 | 172.74 ms
MAPPO | 59.79 ms
FA-EIMARL | 34.76 ms
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
