Optimizing Spatial State Representation in Reinforcement Learning for Coverage Path Planning in UAV Search Missions

Yuan, Hu; Yan, Shengkai; Liu, Zhuzhi; Wang, Suli; Wang, Qiang; Chen, Gaocheng

doi:10.3390/drones10060442

Open AccessArticle

Optimizing Spatial State Representation in Reinforcement Learning for Coverage Path Planning in UAV Search Missions

by

Hu Yuan

¹

,

Shengkai Yan

¹

,

Zhuzhi Liu

¹,

Suli Wang

¹,

Qiang Wang

¹

and

Gaocheng Chen

^1,2,*

¹

College of Equipment Management and UAV Engineering, Air Force Engineering University, Xi’an 710051, China

²

Unit 94106 of PLA, Xi’an 710613, China

^*

Author to whom correspondence should be addressed.

Drones 2026, 10(6), 442; https://doi.org/10.3390/drones10060442 (registering DOI)

Submission received: 30 April 2026 / Revised: 28 May 2026 / Accepted: 2 June 2026 / Published: 5 June 2026

Download

Browse Figures

Versions Notes

Highlights

What are the main findings?

Proposed a novel DQN-A* hybrid algorithm with a multi-stage obstacle-identification strategy for unmanned aerial vehicle (UAV) coverage path planning (CPP).
Established a theoretical framework that derives a recommended range for UAV positional identifiers based on Z-score feature normalization.

What are the implications of the main findings?

The dual-driven reward mechanism and A* integration enhance dynamic obstacle avoidance, target detection probability, and overall search efficiency in unknown environments.
The derived range accelerates model convergence in grid-based CPP tasks driven by deep reinforcement learning.

Abstract

To enhance path planning efficiency in unmanned aerial vehicle (UAV) search missions in complex environments, this paper proposes a coverage path planning (CPP) algorithm for a UAV that integrates the deep Q-network (DQN) with the A* algorithm (DQN-A*). In the proposed DQN-A* algorithm, a dual-driven reward mechanism is established, comprising a probability-weighted reward and a step-dependent reward, steering the UAV toward high-probability regions. Furthermore, to handle previously unknown obstacles in real time, the algorithm employs a multi-stage obstacle-identification strategy, enabling the UAV to improve coverage of traversable cells by dynamically adjusting its local path when newly detected obstacles are encountered. A theoretical analysis derives a principled recommended range for the UAV positional identifier based on statistical feature analysis; this range is then validated through extensive simulations. Additionally, Hamiltonian path pre-training is introduced to accelerate convergence. Comparative simulations demonstrate that the proposed DQN-A* algorithm achieves higher area-coverage and target-detection probabilities than benchmark algorithms in environments with unknown obstacles, offering valuable insights for positional encoding in deep reinforcement learning (DRL)-based robotic coverage problems.

Keywords:

coverage path planning (CPP); deep reinforcement learning (DRL); search missions; unmanned aerial vehicle (UAV)

1. Introduction

The increasing frequency of emergencies, including missing person incidents and natural disasters, highlights the critical need for efficient search missions. Traditional ground-based search methods, constrained by terrain complexity, weather conditions, and other limitations, often suffer from inefficiencies, including delayed access to critical search areas and even risks to the safety of rescue personnel [1,2]. These challenges motivate the development of more capable aerial search solutions. In parallel with advances in autonomous and intelligent technologies, unmanned aerial vehicles (UAVs) have been widely utilized in search missions due to their agility and operational flexibility [3,4].

In UAV-based search missions, the design of search paths critically impacts operational efficiency. Existing UAV path planning strategies generally fall into two categories. One is point-to-point planning, which is optimized for routes between fixed departure and target locations with objectives such as shortest path [5], minimal traversal time [6], and lowest operational cost [7]. Such point-to-point heuristic and sampling-based methods, notably the A* and Rapidly-exploring Random Tree star (RRT*) algorithms, have demonstrated robust obstacle-avoidance and trajectory optimization capabilities across various autonomous systems, including autonomous ground vehicles [8] and space robotics [9]. The other is coverage path planning (CPP), which focuses on the systematic exploration of designated search areas [10], prioritizing objectives like minimizing the revisit rate [11] and reducing total search duration [12]. This study focuses on UAV-based CPP for search missions.

In traditional CPP, common approaches include zigzag [13,14] and spiral patterns [15,16]. While straightforward to implement, these methods are largely restricted to regular, obstacle-free environments and struggle to adapt to irregular or cluttered terrains. Random traversal methods are capable of navigating moderately complex environments but suffer from a high revisit rate.

With advancements in heuristic algorithms, techniques such as genetic algorithms [17] and ant colony optimization [14] have been integrated into CPP. Practically, these methods first decompose irregular environments into grid-like cells, then frame the traversal sequence as a traveling salesman problem (TSP) for heuristic optimization, improving planning efficiency. However, they remain unsuitable for dynamic environments and are susceptible to local optima in highly complex scenarios.

The advancement of reinforcement learning (RL) technology offers a novel solution for CPP in dynamic environments. By enabling agents to interact iteratively with their surroundings, RL inherently adapts to real-time environmental changes. For instance, Tan et al. [18] combined Q-learning with neural networks to minimize path revisitation in CPP. Ai et al. [19] proposed a probability-distribution-based Q-learning approach for maritime search missions, prioritizing high-probability zones while reducing directional turns. However, Q-learning struggles with large state spaces. Deep reinforcement learning (DRL) combines RL with deep neural networks, which model complex state–action relationships and thereby address the scalability limitation of tabular Q-learning [20]. Dong et al. [21] developed a DRL-based framework for multi-UAV cooperative coverage in obstacle-free areas. Hu et al. [22] designed a distributed DRL strategy for collaborative UAV teams, dynamically minimizing overlaps through inter-agent coordination.

Collectively, DRL-based CPP algorithms demonstrate superior adaptability to dynamic environmental changes. However, current research leveraging RL for CPP suffers from critical gaps: First, existing studies rarely address unknown obstacle avoidance. Second, the lack of a clear distinction between the UAV’s position and covered/uncovered areas often leads to slow or failed model convergence. Third, prioritization of high-probability search zones is seldom incorporated in CPP for search missions. To bridge these gaps, this paper proposes a novel CPP algorithm: DQN-A*. The algorithm prioritizes high-probability regions while avoiding unknown obstacles; moreover, we investigate how the UAV positional identifier affects convergence behavior. The contributions of this paper are as follows:

Optimization of spatial state representation. This paper derives a principled range for the UAV positional identifier under Z-score-normalized state encoding and validates its effect on convergence through simulation experiments.
Hybrid CPP architecture with a multi-stage obstacle-identification strategy. Integrating the global decision-making capability of DQN for environmental dynamics with the local optimal path search efficiency of the A* algorithm, we construct the DQN-A* algorithm. This hybrid framework employs a multi-stage obstacle-identification strategy to enable unknown obstacle avoidance and accomplishes CPP.
Probability–time dual-driven reward mechanism. A novel reward system integrates probability-weighted reward (based on spatial target likelihood distributions) and step-dependent reward (penalizing prolonged search durations). This dual mechanism prioritizes exploration of high-probability zones while effectively minimizing path redundancy, thereby enhancing search efficiency.

2. Related Work

CPP plays a critical role in search missions, and it can be generally classified into offline and online paradigms. Offline and online approaches are mainly distinguished by whether the path is computed once before the mission (offline) or dynamically replanned during the mission depending on the dynamic environment (online).

2.1. Offline CPP

Offline CPP methods are applicable to static environments with fully known information. Among these, zigzag and spiral trajectories serve as two fundamental strategies, particularly effective in geometrically regular search areas. Lumelsky et al. [13] employed zigzag patterns for terrain mapping tasks, while Wang et al. [14] enhanced agricultural coverage efficiency by incorporating turning radius constraints for harvest robots. Hai et al. [16] integrated spiral algorithms with knowledge reasoning to reduce the path revisit rate. The wavefront algorithm, operating via a grid-based framework, assigns incremental numerical values to cells in wave-like propagation, enabling agents to traverse areas by following ascending or descending gradients [23]. For complex environments, region decomposition is commonly prioritized. Sun et al. [24] proposed trapezoidal decomposition to partition reconnaissance zones into trapezoidal sub-regions, thereby optimizing UAV paths and base station deployment. Although computationally efficient, this method generates excessive sub-regions and struggles with non-polygonal obstacles. To address these limitations, boustrophedon decomposition merges adjacent trapezoids to minimize redundant coverage paths [25], while Morse decomposition resolves non-polygonal obstacle partitioning through critical point analysis [26]. Spanning tree methods, which construct grid-based trees with the starting point as the root node and uncovered areas as leaves, offer rapid path generation in highly complex environments [27]. Heuristic algorithms have opened new avenues for solving intricate CPP problems. These methods typically decompose the environment into sub-regions and optimize the visitation sequence, effectively transforming the CPP problem into a TSP. For instance, Tang et al. [17] decomposed port areas using Google Maps data and employed genetic algorithms to sequence sub-region access, reducing UAV turning maneuvers and improving coverage efficiency. Jia et al. [28] optimized multi-UAV multi-task coverage under flight time constraints using greedy-ant colony hybrid algorithms. Ahmed et al. [29] developed a particle swarm optimization framework to generate collision-free 3D trajectories for distributed UAV fleets, balancing energy consumption, flight risks, mobility, and surveillance priorities.

2.2. Online CPP

In real-world search scenarios, environmental information is often dynamic and unknown. Online CPP provides a solution for such conditions, focusing on environments with incomplete prior knowledge. UAVs typically rely on sensors to probe the environment and feed real-time perceptual data into decision-making models, enabling CPP under uncertainty. Current online CPP methodologies fall into two main categories: bio-inspired neural-network approaches and reinforcement-learning-based approaches.

Inspired by biological neural networks, Yang et al. [30] pioneered the application of neural networks to robotic path planning. Yang et al. [31] refined this model by representing sub-regions as neurons interconnected through synaptic links. In this framework, excitatory inputs originate from uncovered areas, while inhibitory inputs stem from environmental obstacles. By computing dynamic changes in neuronal activity values, they proposed a neural-network-driven CPP method to construct coverage paths. Tang et al. [32] further optimized neuronal activity calculation rules, enabling prioritized coverage of critical sub-regions. Huo et al. [33] proposed a CPP algorithm based on improved biologically inspired neural networks in spray painting, and the generated spraying paths are coherent and orderly with a low revisit rate. However, these methods exhibit drawbacks: during stagnation phases, they require waiting for the decay of neuronal activity values, and they often suffer from a high path revisit rate. RL trains agents to interact dynamically with environments. Agents receive rewards for optimal actions and penalties for suboptimal decisions, eventually learning to achieve globally optimal policies. RL holds distinct advantages in online CPP scenarios with minimal prior environmental knowledge. For instance, Tan et al. [18] integrated RL with neural networks, employing RL to adjust directional choices at dynamically changing states, thereby reducing path redundancy. However, their Q-learning framework struggled with high-dimensional state spaces. Hu et al. [22] developed a DRL-based distributed multi-UAV cooperative coverage method. While effective for collaborative pathfinding, their approach lacked an explicit position identifier for UAVs, leading to slow convergence rates and failure to converge in highly complex environments.

Existing studies demonstrate that DRL-based CPP algorithms excel at addressing dynamically evolving environmental information in search missions. However, these methods remain prone to stagnation or deadlock when encountering unknown obstacles and face challenges such as slow convergence rates or even failure to converge.

3. Problem Formulation

This study focuses on the scenario of UAV-assisted ground search missions. To simplify the problem formulation, the search area is discretized into a grid-based environment [34,35,36]. The UAV’s detection field, determined by its flight altitude and viewing angle, is modeled as a dynamic window moving over the grid. The 3D visual scanning process is projected onto a 2D plane, transforming the UAV’s path planning into a grid traversal problem. The UAV is required to fully cover all traversable grid cells (i.e., non-obstacle regions) during its mission. The core challenge lies in designing a robust CPP algorithm that enables the UAV to: (1) achieve complete coverage of the designated search area, (2) minimize redundant revisits to previously scanned cells, and (3) intelligently avoid unknown obstacles in real time.

3.1. Detection Field of the UAV

In this study, the UAV is equipped with a visible-light-based payload camera (illustrated in Figure 1), where

θ

denotes the field-of-view (FOV) angle of the camera, h denotes the flight altitude of the UAV above ground level, and r denotes the resulting radius of the circular ground footprint of the camera. The UAV maintains a constant flight altitude h above ground level and a fixed heading (yaw angle), so that the camera’s principal axis remains aligned with the world frame throughout the search. Small variations in pitch and roll induced by translational accelerations are assumed to be compensated by the onboard low-level flight controller, so that the projected detection field can be modeled as a stable circular region centered on the UAV’s ground-projected position. The radius r of this circular detection footprint is given by

r = h tan \frac{θ}{2} .

(1)

3.2. Search Area Discretization

To ensure full coverage of the search area by the UAV, the search area is discretized into grid cells selected as circle-inscribed squares within the UAV detection field, as shown in Figure 2. The grid side length is defined as

d = \sqrt{2} r .

(2)

For a search area of length L and width W, let

⌈ x ⌉

denote the ceiling function, i.e.,

⌈ x ⌉

is the smallest integer not less than x. To guarantee complete coverage of the search-area boundaries even when L and W are not integer multiples of d, the numbers of grid cells along the two axes are computed as

m = ⌈L / d⌉,

(3)

n = ⌈W / d⌉ .

(4)

This yields

m d \geq L

and

n d \geq W

, so the resulting

m \times n

grid strictly contains the original rectangular search area.

3.3. Search Scenario Description

This study focuses on a typical two-dimensional UAV search scenario, as depicted in Figure 3. In this scenario, the search area is discretized into grid cells. A grid cell is considered fully covered once the UAV reaches its center. For clarity in visualization, previously covered areas are labeled “1”; uncovered areas remain labeled “0”; and the UAV’s current cell is marked by the identifier “x”. The selection of x and its influence on model convergence are discussed in Section 5.2. The mission is deemed complete when the UAV traverses all grid cells, ensuring comprehensive coverage of the search area.

4. Model Construction

In Section 3, concerning the search path planning problem description, continuous path planning is discretized into dynamic autonomous path decision-making. The state updates of this system conform to a Markov Decision Process (MDP). Therefore, we employ an MDP model to characterize the autonomous decision-making process of the UAV for dynamic search path planning, and utilize reinforcement learning to solve for policies. An MDP is primarily described using a five-tuple

(\begin{matrix} S, A, P, R, γ \end{matrix})

to define the system’s state transition process, where S represents the state space, A denotes the action space, P is the state transition function, R signifies the reward function, and

γ

is the discount factor.

While DQN enables UAVs to interact with changing environments, learn obstacle avoidance strategies, and achieve full coverage of target areas, standalone DQN algorithms face risks of premature convergence or stagnation in highly complex scenarios. To mitigate these limitations, this study proposes a hybrid path planning model that synergizes the A* algorithm with DQN. The integrated framework leverages the global decision-making capability of DQN for environmental coverage sequencing and the local path search efficiency of the A* algorithm for real-time unknown obstacle avoidance.

4.1. Action Space

In this model, the UAV operates at a fixed altitude and is restricted to movement in four adjacent directions (forward, backward, left, and right) relative to its current grid cell, as illustrated in Figure 4. The action space, denoted as set

A = {1, 2, 3, 4}

, dynamically adjusts based on the UAV’s position: when near the search area boundary, actions leading to boundary violations are excluded to confine the UAV within valid regions. To balance exploration and exploitation, an

ε

-greedy strategy [37,38] is employed as

a = \{\begin{matrix} arg \max_{a} Q (s, a; θ), & 1 - ε \\ random (A), & ε, \end{matrix}

(5)

where the greedy factor

ε

controls the probability of selecting a random action versus the action maximizing the Q-value. Initially, higher

ε

values promote exploratory behavior to avoid local optima, while

ε

decays progressively during training to prioritize Q-value-driven actions and accelerate model convergence.

The four-directional action space (up/down/left/right) is chosen for three reasons: (i) it matches the axis-aligned grid discretization, ensuring that each action moves the UAV to exactly one adjacent cell; (ii) it minimizes the output dimensionality of the Q-network, facilitating faster convergence on small-to-medium grids; and (iii) in the coverage setting, diagonal movement does not improve worst-case coverage count on a square grid because each cell must still be visited individually. We note explicitly that this 4-direction choice is a deliberate trade-off favoring action-space compactness and is not a claim of trajectory-length optimality; the corresponding 8-direction variant is investigated in Section 5.3.5.

4.2. State Space

In the DQN framework for UAV CPP, the input layer of the neural network corresponds to the current state, and the output layer determines the optimal action for that state.

Taking the search area depicted in Figure 2 as an example, a target region of size

W \times L

is converted into a state space represented as an

m \times n

matrix

S

using (3) and (4). As formalized in (6), these identifiers explicitly capture both the search progress (via 0/1 flags) and the UAV’s real-time localization (via x), enabling efficient exploration strategies within the DQN framework.

S = (\begin{matrix} s_{11} & \dots & s_{1 n} \\ ⋮ & ⋱ & ⋮ \\ s_{m 1} & \dots & s_{m n} \end{matrix}) .

(6)

Each element

s_{i j}

in the state matrix takes one of three values:

s_{i j} = 0

(uncovered),

s_{i j} = 1

(covered), and

s_{i j} = x

(UAV position). To facilitate stable convergence in DRL, input features are typically normalized. We analyze the feature distinctness under Z-score standardization:

\hat{v} = \frac{v - μ}{σ}

(7)

where

μ

and

σ

are the mean and standard deviation of the current state.

Consider a grid map of size N. At a typical intermediate search stage, let k denote the number of covered cells. The state vector contains one x, k ones, and

N - k - 1

zeros.

The mean (

μ

) of the grid state matrix is calculated as

μ = \frac{1}{N} \sum_{i = 1}^{m} \sum_{j = 1}^{n} s_{i j} = \frac{x \cdot 1 + 1 \cdot k + 0 \cdot (N - k - 1)}{N} = \frac{x + k}{N} .

(8)

Thus,

{(E [s])}^{2}

is given by

{(E [s])}^{2} = {(\frac{x + k}{N})}^{2} .

(9)

E [s^{2}]

can be calculated using the following equation

E [s^{2}] = \frac{1 \cdot x^{2} + k \cdot 1^{2} + (N - k - 1) \cdot 0^{2}}{N} = \frac{x^{2} + k}{N} .

(10)

Based on the definition of variance (

σ^{2}

)

σ^{2} = E [s^{2}] - {(E [s])}^{2} .

(11)

Substituting (9) and (10) into (11) yields the variance as

σ^{2} = \frac{x^{2} + k}{N} - {(\frac{x + k}{N})}^{2} .

(12)

To isolate the dominant contribution of x, expand the expression and rearrange the numerator term

N^{2} σ^{2} = (N - 1) x^{2} + k (N - k) - 2 k x .

(13)

When the value of x is salient relative to the background (

x ≫ \sqrt{N} / 2

) and the coverage ratio is moderate (

k \approx N / 2

), a critical cancellation effect is observed:

Mean Shift Effect (

- 2 k x

): As k increases, the global mean

μ

shifts upward. This reduces the deviation of the outlier x from the mean, thereby exerting a negative contribution to the variance.

Background Noise Effect (

+ k (N - k)

): The mixture of 0 and 1 in the background introduces variance akin to a Bernoulli distribution, exerting a positive contribution to the total variance. These two terms counteract each other in magnitude, causing the total variance to be dominated primarily by the outlier x. Consequently, the variance can be approximated as

N^{2} σ^{2} \approx (N - 1) x^{2} \approx N x^{2} .

(14)

This yields a simplified closed-form solution for the standard deviation

σ \approx \frac{x}{\sqrt{N}} .

(15)

Note that this closed-form approximation is intended as a tractable design heuristic valid in the salient-outlier regime

x ≫ \sqrt{N} / 2

; the recommended range below should therefore be read as a principled operating interval rather than a strict analytical optimum, and is verified empirically in Section 5.2. We define two antagonistic objectives for the optimal x:

Maximizing UAV Salience $S_{U A V}$ (Lower Bound of x): The UAV position must be a statistical outlier to ensure the UAV position is a statistically distinguishable feature within the input to the neural network. We require the Z-score difference between the UAV and the background (value 1) to exceed a significance threshold $δ_{salience}$ (typically $\geq 3.0$ for robust outlier detection):

$S_{U A V} = \hat{x} - \hat{1} = \frac{x - 1}{σ} \approx \sqrt{N} (1 - \frac{1}{x}) \geq 3.0 .$

(16)

Consequently, the lower bound of x is obtained as

$x \geq \frac{\sqrt{N}}{\sqrt{N} - 3} .$

(17)

Note that for large N the lower bound approaches unity and therefore lies outside the salient-outlier regime in which $σ \approx x / \sqrt{N}$ holds; it should thus be read as a conservative constraint preventing x from degenerating to the background value rather than as a tight analytical bound. The upper bound, by contrast, is obtained well within the validity regime of the approximation.
Preserving Background Contrast $C_{b g}$ (Upper Bound of x): The network must distinguish between uncovered (0) and covered (1) areas. If x is too large, $σ$ increases, compressing the normalized difference between 0 and 1 below the gradient sensitivity threshold $ϵ$ (empirically $\approx 0.5$ ):

$C_{b g} = | \hat{1} - \hat{0} | = \frac{1}{σ} \approx \frac{\sqrt{N}}{x} \geq 0.5 .$

(18)

Consequently, the upper bound of x is obtained as

$x \leq 2 \sqrt{N} .$

(19)

Combining these constraints, the theoretical optimal interval for the UAV position identifier is derived as

x \in [\frac{\sqrt{N}}{\sqrt{N} - 3}, 2 \sqrt{N}] .

(20)

This interval is well-defined only for

\sqrt{N} > 3

(i.e.,

N > 9

); when

N \leq 9

the salience constraint of Equation (16) admits no finite x, and the lower bound should instead be replaced by the background-value floor

x > 1

. All grid sizes evaluated in this paper (

N \in {25, 36, 64, 100}

) satisfy

N > 9

.

The elements in the matrix

Ω

defined by (21) correspond to those in matrix

S

, representing the probability of target presence at each grid cell. Equation (22) acts as a normalization constraint, ensuring the target-existence probabilities sum to 1 over all grid cells.

Ω = (\begin{matrix} ω_{11} & \dots & ω_{1 n} \\ ⋮ & ⋱ & ⋮ \\ ω_{m 1} & \dots & ω_{m n} \end{matrix}),

(21)

\sum_{i = 1}^{m} \sum_{j = 1}^{n} ω_{i j} = 1 .

(22)

4.3. Reward Design

To encourage the UAV to autonomously explore uncovered areas, a corresponding reward-penalty mechanism is established. During model training, positive rewards are designed to incentivize the UAV to explore uncovered areas, while negative rewards penalize redundant exploration of already covered regions. The two components of the proposed reward function have direct precedent in the literature: the probability-weighted positive term follows the standard practice in Bayesian and probabilistic search of weighting cell rewards by the prior probability of target presence [19,39]; the revisit penalty follows the same revisit-penalization principle used by [18]; and the step-dependent denominator acts as an elapsed-effort discount, consistent with recent step-cost reward shaping designs adopted in DRL-based UAV CPP to encourage efficient traversal under operational constraints [40]. The core dual-driven reward function is formulated as

R_{core} = \{\begin{matrix} \frac{ω_{i j} λ_{1}}{K + C}, & s_{i j} = 0, \\ λ_{2}, & s_{i j} = 1, \end{matrix}

(23)

where

λ_{1}

and

λ_{2}

denote the reward coefficient and penalty coefficient, respectively, C is the per-episode step count during the UAV’s search process (re-initialized to zero at the beginning of each episode), and K is a constant step factor introduced to mitigate drastic fluctuations in the per-step reward when C is small. The term

s_{i j} = 0

indicates that the UAV enters an uncovered grid cell, granting a probability-proportional reward; conversely,

s_{i j} = 1

signifies entry into a previously covered grid cell, incurring a constant penalty. The reward value for an uncovered grid cell is positively correlated with the target existence probability

ω_{i j}

and inversely correlated with the per-episode step count C. The probability-weighted rewards incentivize the UAV to explore areas with higher likelihoods of containing targets, while the decaying reward structure progressively reduces exploration incentives as steps accumulate, thereby encouraging the UAV to expedite task completion. This dual mechanism balances focused exploration of priority zones with efficient coverage of the entire search space, enhancing mission effectiveness in uncertain or time-constrained environments.

For implementation completeness, the reward signal R actually used during training augments the dual-driven term

R_{core}

with an energy-aware shaping term and four standard navigation-shaping terms commonly adopted in grid-based DRL navigation [40]:

\begin{matrix} R = & R_{core} + R_{energy} + r_{oob} 1 [out - of - bounds] + r_{obs} 1 [obstacle collision] \\ + R_{done} 1 [full coverage] + R_{timeout} 1 [step - budget exceeded] . \end{matrix}

(24)

The energy-aware term

R_{energy}

provides a point-mass-compatible surrogate for the dominant components of rotary-wing energy expenditure: it charges a flat per-step propulsion cost

α_{flight} > 0

on every transition and an additional cost

α_{turn} > 0

whenever the current action differs from the previous one (with no turn cost charged on the first step of an episode). The turn-cost component is a discrete surrogate for the turning-radius penalty of a real rotary-wing platform and biases the policy towards straighter, lower-energy trajectories, while setting

α_{flight} = α_{turn} = 0

recovers the energy-agnostic reward. The remaining four navigation-shaping terms act only on boundary or terminal events:

r_{oob}

and

r_{obs}

are bounded fail-safe penalties for out-of-bounds actions and obstacle collisions;

R_{done}

is a completion bonus that grows with the remaining step budget, rewarding faster full-coverage episodes; and

R_{timeout}

is a graded penalty proportional to the uncovered fraction when the per-episode step budget

C_{max}

is exhausted. Here,

R_{done}

scales with the base bonus

β_{done}

and the remaining step budget, while

R_{timeout}

scales with the coefficient

β_{tmo}

and the uncovered fraction; all coefficients, together with the per-episode step budget

C_{max}

, are listed in Table 1.

4.4. Model Parameter Update

Given the high-dimensional state space encountered during UAV search operations, traditional tabular Q-learning becomes computationally intractable for representing such extensive state spaces. Consequently, following the standard DQN formulation introduced by [38] and the textbook treatment of value-based RL in [37], a deep neural network

Q (s_{t}, a_{t}; θ)

is employed to approximate the autonomous decision-making process of the UAV. The network takes the state space S as input and outputs a one-dimensional vector whose dimensionality matches that of the action space A. Each element in this vector represents the Q-value (action value) corresponding to an action. Throughout this section, the subscript t indexes the discrete decision step of the MDP; that is,

s_{t}

denotes the state observed by the agent at step t,

a_{t}

the action selected at that step,

r_{t}

the immediate reward, and

θ

collects the trainable parameters of the Q-network [37]. The Bellman equation for the action-value function is

Q_{π} (s_{t}, a_{t}; θ) = \sum_{s_{t + 1}} P (s_{t + 1} | s_{t}, a_{t}) (R_{t} + γ V_{s_{t + 1}}),

(25)

where

γ

is the discount factor and

V_{s_{t + 1}}

denotes the state-value function.

V_{s_{t + 1}}

is defined as

V_{s_{t + 1}} = \sum_{a_{t + 1}} π (a_{t + 1} | s_{t + 1}) Q_{π} (s_{t + 1}, a_{t + 1}) .

(26)

Substituting (26) into (25) yields

\begin{matrix} Q_{π} (s_{t}, a_{t}; θ) = \sum_{s_{t + 1}} P (s_{t + 1} | s_{t}, a_{t}) \\ \times [R_{t} + γ \sum_{a_{t + 1}} π (a_{t + 1} | s_{t + 1}) Q_{π} (s_{t + 1}, a_{t + 1})] . \end{matrix}

(27)

Maximizing over (27) yields an expression independent of the policy

π

as

Q (s_{t}, a_{t}; θ) = R_{t} + γ max_{a_{t + 1}} Q (s_{t + 1}, a_{t + 1}; θ) .

(28)

Using a one-step temporal-difference backup, the sampled DQN target is given by

Q (s_{t}, a_{t}; θ) = r_{t} + γ \cdot max_{a_{t + 1}} Q (s_{t + 1}, a_{t + 1}; θ) .

(29)

The temporal difference (TD) algorithm is employed to compute

Q (s_{t}, a_{t}; θ)

. To stabilize the bootstrapped target and reduce oscillations during training, a target network

Q (s_{t + 1}, a_{t + 1}; θ^{-})

that is architecturally identical to the Q-network is constructed to calculate the TD target as

y_{t} = r_{t} + γ \cdot max_{a_{t + 1}} Q (s_{t + 1}, a_{t + 1}; θ^{-}) .

(30)

The TD error between the estimated and target Q-values is calculated by

δ_{t} = Q (s_{t}, a_{t}; θ) - y_{t} .

(31)

The difference between the estimated Q-value and the target Q-value is quantified using the Mean Squared Error (MSE)

L (θ) = \frac{1}{2} {[Q (s_{t}, a_{t}; θ) - y_{t}]}^{2} .

(32)

Taking the gradient with respect to

θ

yields

\nabla_{θ} L (θ) = [Q (s_{t}, a_{t}; θ) - y_{t}] \cdot \nabla_{θ} Q (s_{t}, a_{t}; θ) .

(33)

The parameter

θ

is updated via gradient descent as

θ \leftarrow θ - η \cdot \nabla_{θ} L (θ),

(34)

where

η

is the learning rate.

To mitigate the impact of parameter updates on Q-value estimation errors, the parameters of the Q-network are updated at each iteration during training. The target network parameters

θ^{-}

are periodically synchronized with the Q-network parameters

θ

after a fixed number of steps (e.g., soft update or delayed hard update).

4.5. Integrated Framework of DQN and A* Algorithm

To enable autonomous CPP for UAVs in environments with insufficient or absent prior information, we propose DQN-A*, a hierarchical control architecture combining the DQN and A* algorithms. In this hierarchical framework, the DQN acts as a high-level global planner responsible for determining the coverage sequence based on the current grid state, while the A* algorithm serves as a low-level local controller specifically triggered for short-range obstacle avoidance. Note that the input grid state directly determines the output action. Because the model is trained with limited prior environmental knowledge and without initial consideration of obstacles, prematurely assigning a “1” to a detected obstacle would disrupt the UAV’s action selection. To mitigate this, uncovered areas (including potential obstacles) are uniformly initialized as “0”. The obstacle marker is set to the UAV’s positional spatial encoding value “x” only when the UAV detects an obstacle and its next planned position overlaps with the obstacle. The updated grid state is then fed into the Q-network to determine the UAV’s subsequent position. Using the A* algorithm, the planner dynamically generates a collision-avoidance grid path and upgrades the obstacle identifier to “1”. The refined grid state is then re-input into the Q-network for the next action iteration, based on the “0-x-1” multi-stage obstacle-identification strategy. This iterative process of Q-network actions and A*-guided obstacle avoidance achieves unknown obstacle avoidance and complete coverage of the target area.

The rationale for using A* as the local obstacle-avoidance module is as follows. A* is invoked only within a small local window: from the UAV’s current cell

P_{1}

to the next DQN-recommended cell

P_{2}

, on a partial sub-grid containing only the newly discovered obstacle and the surrounding already-explored cells. The branching factor and search depth are therefore bounded by a small constant, so the worst-case number of node expansions in A* is likewise bounded in this setting. In return, A* provides two properties that simpler local planners (e.g., greedy best-first search, potential fields, Bug-family algorithms) do not: it is complete and optimal on the local grid under an admissible heuristic, and its heuristic-guided expansion gives it lower per-call cost than uninformed Dijkstra on the same local grid. The heuristic used here is the Manhattan distance

h (n) = | n_{x} - P_{2, x} | + | n_{y} - P_{2, y} |

, which is admissible under 4-connected unit-cost movement, ensuring that the returned local detour is the shortest collision-free path from

P_{1}

to

P_{2}

on the local sub-grid. The algorithmic procedures are detailed in Algorithm 1, in which the local A* invocation is expanded into an explicit sub-procedure, and the flow chart is depicted in Figure 5.

Algorithm 1 DQN-A* CPP Algorithm

1:: Input: Discretized search grid, set of confirmed obstacle cells $O$ , UAV initial position $P_{0}$
2:: Initialize all grid cells as “0” (uncovered)
3:: while there exists at least one grid cell labeled “0” do
4:: Input current grid state $S$ to Q-network
5:: Select action a maximizing Q-value using the $ε$ -greedy policy of Equation (5)
6:: if the UAV’s next planned position does not overlap with a detected obstacle then
7:: Execute action a and move to the new cell
8:: else
9:: Save current position $P_{1}$
10:: Mark the obstacle cell with the temporary UAV identifier x (multi-stage rule)
11:: Re-evaluate state $S^{'}$ to obtain new target cell $P_{2}$
12:: ▹ Local A* sub-procedure
13:: Define the local sub-grid as the bounding box of ${P_{1}, P_{2}} \cup O$ , padded by one cell
14:: Define the obstacle set as the cells in $O$ within the sub-grid
15:: Define $g (n)$ as the number of grid steps from $P_{1}$ to n on non-obstacle cells
16:: Define $h (n) = | n_{x} - P_{2, x} | + | n_{y} - P_{2, y} |$ (Manhattan distance, admissible under 4-connected motion)
17:: Run standard $A^{*}$ open/closed-list expansion with $f (n) = g (n) + h (n)$ , breaking ties by lower $h (n)$
18:: Retrieve the returned shortest collision-free path $P_{1} \to P_{2}$ on the local sub-grid
19:: Mark every cell along the returned path as covered in the global state matrix $S$
20:: Add the obstacle cell to $O$ and permanently relabel it as “1”
21:: end if
22:: Update the grid state and store the transition in the replay buffer
23:: end while
24:: Output: Complete coverage trajectory

4.6. Hamiltonian Path Pre-Training

To accelerate the convergence of the DQN model, we introduce Hamiltonian path pre-training as expert prior knowledge. A Hamiltonian path visits every node in a graph exactly once, which naturally aligns with the objective of CPP.

4.6.1. Path Construction

Hamiltonian paths are constructed using a weight-guided backtracking algorithm (Algorithm 2). Starting from a randomly selected edge cell, the algorithm recursively explores unvisited neighboring cells in decreasing order of target existence probability

ω_{i j}

. Among cells with equal probability, the exploration order is randomized to ensure path diversity across multiple runs. This weight-guided strategy biases the generated paths toward high-probability regions, producing expert demonstrations that are consistent with the search mission’s priority objective.

Algorithm 2 Weight-Guided Hamiltonian Path Construction.

Require:: Grid size $m \times n$ , probability matrix $Ω$ , start position $({row}_{0}, {col}_{0})$
Ensure:: Hamiltonian path $τ$ or failure
1:: $τ \leftarrow [({row}_{0}, {col}_{0})]$ ; $visited [{row}_{0}] [{col}_{0}] \leftarrow True$
2:: function S_OLVE( $r, c, count$ )
3:: if $count = m \times n$ then return True
4:: end if
5:: $moves \leftarrow$ unvisited neighbors of $(r, c)$ , sorted by $ω$ in descending order
6:: Randomly shuffle moves sharing equal $ω$ values
7:: for each $(n r, n c) \in moves$ do
8:: Append $(n r, n c)$ to $τ$ ; mark $(n r, n c)$ as visited
9:: if S_OLVE( $n r, n c, count + 1$ ) then return True
10:: end if
11:: Remove $(n r, n c)$ from $τ$ ; unmark $(n r, n c)$
12:: end for
13:: return False
14:: end function

It is important to note that computing an exact Hamiltonian path is inherently an NP-hard problem. While the weight-guided backtracking algorithm (Algorithm 2) operates efficiently on the low-dimensional topological grids evaluated in this study (e.g.,

N = 25

to

N = 100

), its

O (k^{N})

time complexity renders it susceptible to the curse of dimensionality in massive, high-resolution search grids. However, because this path generation is strictly an offline pre-training step, the computational overhead does not impact the UAV’s real-time online operational efficiency. Furthermore, for scaling to significantly larger environments, this exact backtracking method can be seamlessly substituted with polynomial-time approximation heuristics, such as the Spanning Tree Coverage (STC) algorithm or sub-region decomposition strategies. These alternatives can provide the necessary expert demonstrations to warm-start the network without incurring exponential computational costs.

4.6.2. Replay Buffer Pre-Filling

Using Algorithm 2, P Hamiltonian paths are generated from distinct randomly selected edge starting positions. Each path

τ^{(p)} = {c_{1}^{(p)}, c_{2}^{(p)}, \dots, c_{N}^{(p)}}

is then executed step-by-step in the grid environment. At each step t along the path, the environment produces a complete transition tuple

(s_{t}^{(p)}, a_{t}^{(p)}, r_{t}^{(p)}, s_{t + 1}^{(p)}, {done}_{t}^{(p)})

, where the reward

r_{t}^{(p)}

is computed using the reward mechanism defined in (24). All

P \times (N - 1)

expert transitions are stored in the replay buffer

D

before DQN training begins:

D_{0} = ⋃_{p = 1}^{P} {\{(s_{t}^{(p)}, a_{t}^{(p)}, r_{t}^{(p)}, s_{t + 1}^{(p)}, {done}_{t}^{(p)})\}}_{t = 1}^{N - 1} .

(35)

During the subsequent DQN training phase, the agent samples mini-batches from

D

, which initially consists entirely of expert transitions. As the agent accumulates new exploratory experiences, the proportion of expert data naturally decreases because the replay buffer operates in a first-in-first-out (FIFO) manner with fixed capacity (

| D | = 100, 000

). This mechanism is analogous to Deep Q-learning from Demonstrations (DQfD), but without explicit priority weighting for expert versus agent-generated transitions.

5. Computer Simulations

5.1. Parameter Settings

To validate model performance, we selected a

350 m \times 350 m

area as the search region, with the UAV operating at a fixed altitude. The UAV’s FOV radius was set to 50 m, and its flight speed was configured as 5 m/s. Based on Equations (3) and (4), the search area was decomposed into a

5 \times 5

grid. A Q-network was constructed with a 25-dimensional input layer, a 256-dimensional hidden layer, and a 4-dimensional output layer. Key simulation parameters are listed in Table 1. All experiments reported in Section 5 use the energy-aware reward of Equation (24) with

(α_{flight}, α_{turn}) = (0.1, 0.2)

.

For our experimental setting (

N = 25

), according to Equation (20), the value of the UAV position identifier

x \in [2.5, 10]

. In Section 5.2, we conduct a sensitivity analysis based on this range, with the values appropriately extended for validation.

Methodology for hyperparameter selection. The hyperparameter values in Table 1 were determined by a combination of (i) values inherited from established DQN practice [38], (ii) a small grid search on the

5 \times 5

grid using held-out validation seeds, and (iii) the sensitivity analyses for the positional identifier x (Section 5.2) and the learning rate

η

(see the learning-rate sensitivity analysis below).

Network architecture. We selected the network architecture via an empirical sweep on the

5 \times 5

grid (Table 2): hidden-layer widths in

{64, 128, 256, 512}

at fixed depth, and depths in

{1, 2, 3}

at fixed width 256, each over 10 random seeds under otherwise identical hyperparameters. Coverage increases monotonically with width—from

81.7 %

(capacity-limited, 64 units) and a high-variance

89.3 %

(128 units) to

97.7 %

and

99.4 %

for the 256- and 512-unit networks. Although the 512-unit network is marginally best, we adopt the 256-unit network as the default: it trails the 512-unit result by about 1.7 percentage points while using substantially fewer parameters (Section 5.3.4) and incurring lower per-step inference cost—a trade-off dictated by the limited on-board compute and energy budget of a small rotary-wing UAV and by the one-forward-pass-per-cell execution in the physics-engine validation of Section 5.8. Increasing depth did not help but reduced coverage (

97.7 % \to 96.2 % \to 92.3 %

), consistent with the modest state dimensionality (

N = 25

to 100), for which a single moderately wide layer already suffices while remaining easier to optimize. We therefore adopt a single 256-unit hidden layer.

The target probability distribution in the search area adopts a discretized bivariate Gaussian distribution scheme [39,41]. To align with the discretized search grid, the probability distribution is discretized accordingly, with quantitative settings detailed in Figure 6. Specifically, the centroid region exhibits the highest target existence probability, while probabilities decay following a Gaussian profile toward peripheral areas. Technical note: the probabilistic values in this context serve to validate the bivariate Gaussian structural characteristics; a generalization to multimodal distributions is investigated in Section 5.7.

5.2. Analysis of UAV Position Identifier Impacts on Training Effectiveness

The RL framework drives autonomous decision-making through interactive reward mechanisms. However, the neural network’s ability to extract state features heavily depends on the spatial state representation. To empirically validate the theoretical bounds of the UAV position identifier x derived from the Z-score normalization analysis (Section 4.2), we conducted comparative computer simulations using different discrete values of x. To ensure statistical reliability, each configuration was evaluated across 10 distinct random seeds.

According to Equation (20), the theoretical optimal interval for a

5 \times 5

grid (

N = 25

) is

x \in [2.5, 10]

. Figure 7 illustrates the training episode returns across three distinct regimes of x, consistent with our theoretical derivations:

Insufficient Salience Regime ( $x < 2.5$ ): As shown in Figure 7a, when x takes values of 1 or 2, the model exhibits sluggish convergence and pronounced variance. Theoretically, this occurs because x falls below the lower bound, causing the normalized UAV salience ( $S_{U A V}$ ) to be indistinguishable from the background covered regions (labeled as “1”).
Optimal Balance Regime ( $2.5 \leq x \leq 10$ ): As depicted in Figure 7b, marker values within the theoretical bounds (e.g., $x \in {6, 7, 8, 9, 10}$ ) demonstrate rapid, stable, and nearly identical convergence trajectories. In this regime, the identifier is large enough to highlight the UAV’s position without excessively inflating the global variance $σ^{2}$ .
Background Contrast Degradation Regime ( $x > 10$ ): Figure 7c reveals a gradual decline in convergence speed as x increases beyond the upper bound (e.g., $x = 20, 30$ ). An excessively large x notably inflates the state variance $σ^{2}$ , compressing the normalized difference between uncovered (“0”) and covered (“1”) areas below the network’s gradient sensitivity threshold ( $C_{b g} < 0.5$ ), thereby degrading the agent’s spatial awareness.

To further validate the scalability and generalizability of these theoretical bounds, we repeated the simulations on a larger

6 \times 6

grid (

N = 36

). According to Equation (20), the theoretically optimal interval scales dynamically to

x \in [2, 12]

. As evidenced in Figure 8, the empirical performance exactly mirrors the predicted trend:

x = 1

remains suboptimal (Figure 8a), the fastest convergence plateaus within the

x \in [6, 12]

range (Figure 8b), and performance steadily deteriorates for values exceeding 12, such as 20 or 30 (Figure 8c).

These results support the Z-score feature normalization analysis. By balancing UAV salience against background contrast, the derived range provides a principled starting point that substantially reduces the trial-and-error tuning typically associated with state encoding in DRL. For all subsequent simulations, the UAV position marker was fixed to a representative value

x = 8

well inside the derived optimal interval

[2.5, 10]

.

Learning-Rate Sensitivity Analysis

To examine the sensitivity of DQN-A* to the choice of learning rate, we re-trained the model on the

5 \times 5

grid under three fixed learning rates

η \in {5 \times 10^{- 4}, 1 \times 10^{- 4}, 1 \times 10^{- 5}}

and two adaptive schedules (cosine annealing, step decay), each over 10 random seeds with all other hyperparameters held identical to Table 1; results are reported in Table 3. An overly small fixed learning rate stalls training:

η = 1 \times 10^{- 4}

reaches only

82.7 %

coverage, and

η = 1 \times 10^{- 5}

collapses to

66.3 %

. The largest fixed rate performs best (

97.7 %

at

η = 5 \times 10^{- 4}

), and neither adaptive schedule improves on it (

96.1 %

cosine,

93.9 %

step decay). We therefore adopt

η = 5 \times 10^{- 4}

as the default.

5.3. Ablation Study

In search problems, detection probability and coverage ratio constitute critical metrics for evaluating search paths (assuming 100% accuracy of the UAV’s onboard sensors, where grid cell target existence probability equates to detection probability). This section conducts an ablation study to evaluate the impact of individual components within the proposed model as shown in Table 4. Specifically, it investigates the influence of the Hamiltonian path on model training efficiency, the effect of the dual-driven reward mechanism on detection probability, and the contribution of the integrated A* algorithm for unknown obstacle avoidance. In addition, a multi-channel state representation and an 8-direction action space are also investigated.

5.3.1. Effect of Hamiltonian Path Pre-Training

The Hamiltonian path accelerates model convergence. A Hamiltonian path traverses every node in a graph exactly once. In the ablation algorithm w/oHp, the influence of the Hamiltonian path was removed during model training. As shown in Figure 9, both the DQN-A* algorithm and the w/oHp algorithm achieved convergence. However, DQN-A* converged faster than w/oHp. Therefore, the Hamiltonian path aligns with the requirements of CPP algorithms and can serve as expert prior knowledge to enhance the model’s convergence rate.

5.3.2. Effect of the Dual-Driven Reward Mechanism

The dual-driven reward mechanism prioritizes UAV exploration of high-probability target regions. In the ablation algorithm w/oDdrm, the dual-driven reward mechanism was removed from the DQN-A* algorithm. As illustrated in Figure 10, during search steps 5–22, the DQN-A* algorithm demonstrated enhanced detection probability compared to the w/oDdrm approach. This accelerated convergence is attributed to the dual-driven reward mechanism, which enables DQN-A* to prioritize searching regions with higher probabilities.

5.3.3. Effect of the A* Local Planner

The integration of the A* algorithm notably enhances the UAV’s adaptability to unknown obstacles. The ablation algorithm w/oA* employs only the DQN algorithm for path planning during UAV search operations. In contrast, the DQN-A* model integrates both DQN and A* algorithms. Figure 11 shows the coverage ratio performance of both algorithms in environments containing unknown obstacles. DQN-A* achieved a higher mean coverage ratio of 97%, compared to 88% for w/oA*. Furthermore, DQN-A* outperformed w/oA* in both the mean coverage ratio and its standard error. While the DQN algorithm is a single-agent approach capable of autonomous decision-making, its integration with the A* algorithm notably enhances its dynamic adaptability to unknown environments.

5.3.4. Single-Channel Versus Multi-Channel State Representation

This section examines whether a multi-channel formulation would be preferable to the single-channel scalar representation adopted in Section 4.2. Two considerations motivate our choice. First, the target deployment platform is a small rotary-wing UAV whose on-board compute and energy budgets are tightly constrained, and a multi-channel representation is normally paired with a convolutional front-end that is markedly heavier than the single-channel MLP baseline (Table 5). Second, the single-channel design is the natural carrier of the “0-x-1” multi-stage obstacle-identification rule of Section 4.5, in which the scalar dimension carries three graded semantic states; replicating the same behavior inside a binary obstacle channel would reintroduce graded values and partially negate the channel-separation rationale.

To verify that the single-channel choice incurs no performance cost, we implemented a multi-channel variant in which the state is a

3 \times m \times n

tensor (channel 0: coverage; channel 1: ego-position; channel 2: obstacles). Two variants were evaluated alongside the single-channel baseline (

x = 8

): Multi-channel–CNN, processed by a 2-layer convolutional front-end (16 and 32 channels,

3 \times 3

kernels, ReLU, no pooling) followed by a 64-unit FC head; and Multi-channel–MLP, with the tensor flattened and processed by the same 256-unit MLP as the baseline, which isolates the effect of channel separation from that of the CNN inductive bias. All variants share an identical training budget (Table 1), Hamiltonian-path pre-training, and dual-driven reward, and were evaluated over 10 random seeds on the

5 \times 5

grid.

As shown in Figure 12, the three learning curves are indistinguishable, converging to within one standard error of one another. Channel separation thus introduces no representational advantage that the policy can exploit beyond what is already provided by the scalar identifier x, and the deployment considerations above—smaller footprint and direct compatibility with the “0-x-1” obstacle-handling mechanism—become the deciding factors in favor of the single-channel formulation.

5.3.5. Ablation on Action Space Dimensions (4-Direction vs. 8-Direction)

To rigorously address the geometric nature of grid-based CPP, we conducted a quantitative ablation study comparing the strictly constrained orthogonal action space (4-direction: up, down, left, right) against an expanded continuous-like discrete space (8-direction: including diagonal hops).

As summarized in Table 6, while an 8-direction space is traditionally optimal for point-to-point navigation, it introduces counterproductive dynamics in grid-based coverage tasks. Because the grid resolution strictly aligns with the UAV’s downward-facing sensor footprint, diagonal movements frequently bypass adjacent orthogonal cells. Consequently, the agent is forced to execute sharp, redundant backtracking maneuvers to cover these omitted regions.

The empirical data demonstrate that as the environment scales (e.g., the

8 \times 8

grid), expanding the action space to eight directions paradoxically degrades the coverage ratio while significantly increasing the average number of turns compared with the tightly constrained 4-direction planner.

It should be noted that our proposed discrete orthogonal formulation operates strictly at the abstraction level of a high-level global topological planner. In practical deployments, these topological waypoints are intended to be translated into smooth, continuous kinematic trajectories—accounting for continuous yaw control and turning radii—by low-level flight controllers such as Model Predictive Control (MPC).

5.4. Comparison with Baselines

We compare DQN-A* against (i) the offline boustrophedon planner and the vanilla DQN baseline, and (ii) two additional value-based baselines: Double DQN (DDQN), which decouples action selection from target-value estimation to mitigate overestimation, and Dueling DQN, which factorizes the action-value function into a state-value stream and an advantage stream.

Figure 13 reports three performance metrics on the obstacle-free

5 \times 5

grid:

C_{100}

(steps required for 100% area coverage),

P_{d, 50 %}

(steps required for 50% detection probability), and

P_{d, 90 %}

(steps required for 90% detection probability). Since environmental information is fully known a priori, the boustrophedon algorithm achieves the smallest

C_{100}

value (25 time steps). DQN-A* and the DRL baselines all achieve full coverage; DQN-A* prioritizes high-probability regions and reaches

P_{d, 50 %}

in the fewest steps among the DRL methods.

When unknown obstacles are present in the search environment, the superior dynamic adaptability of the DQN-A* algorithm becomes evident, as illustrated in Figure 14. Under these conditions, the boustrophedon algorithm, which is designed for static offline planning, achieves only 38% detection probability and 41% coverage ratio. In contrast, the proposed DQN-A* algorithm demonstrates exceptional environmental adaptability, attaining 93% detection probability and 97% coverage ratio while exhibiting the most stable performance among the three compared algorithms.

To further validate the algorithmic robustness and evaluate the underlying Q-value estimator, we expanded our comparative analysis to include two representative variants from the value-based family: Double DQN (DDQN) and Dueling DQN. It is worth noting that Proximal Policy Optimization (PPO) and other actor-critic methods are outside the scope of the present comparison because the proposed framework is built around replay-buffer pre-filling and value-based off-policy learning. A systematic comparison with actor-critic methods is left for future work.

Interestingly, as illustrated in Figure 15, the results suggest that, in this particular coverage task, standard DQN can converge faster than DDQN and Dueling DQN under the same reward design and pre-training protocol. Through a mechanistic analysis, one possible explanation lies in the unique spatial characteristics of the CPP task and their synergy with our hybrid architecture:

Overestimation as an Exploration Bonus: In standard RL tasks, the inherent overestimation bias of the vanilla DQN is typically considered a theoretical flaw. However, in an unvisited-region-driven search task, this bias inadvertently functions as an optimistic initialization. It provides a strong intrinsic exploration bonus, driving the UAV to rapidly explore unvisited high-probability regions. Conversely, the accurate and conservative value estimations of DDQN inadvertently suppress this initial exploratory momentum.
Action Sensitivity in Grid Coverage: Dueling DQN is uniquely designed to decouple state-value and advantage estimations, excelling in environments with numerous “action-irrelevant” states. Nevertheless, in the dense grid-based CPP task, every discrete action strictly dictates the coverage progress, leaving virtually no action-irrelevant states. Consequently, the dual-stream architecture of Dueling DQN introduces optimization overhead during the early training phase rather than accelerating it.
Synergy with the Decoupled A* Architecture: Because the decoupled A* algorithm serves as a robust, real-time safety net guaranteeing collision-free obstacle avoidance, the global DRL planner is inherently shielded from local deadlock failures. Consequently, leveraging the highly exploratory vanilla DQN as the global planner, structurally supported by the strict A* algorithm, achieves an optimal empirical balance between learning efficiency and operational robustness.

Ultimately, this extended comparative analysis not only justifies the selection of the vanilla DQN but also highlights the modularity of the proposed framework. It demonstrates that the primary methodological contributions of this study—including the probability–time dual-driven reward mechanism, the derivation of the optimal spatial state encoding, and the macro-micro hybrid architecture—are functionally decoupled from the specific network architecture. While the framework can flexibly accommodate various advanced off-policy RL algorithms, the specific pairing of DQN and A* is empirically the most effective in our experiments.

5.5. Generalization Analysis

In real-world search missions, UAVs may deploy from diverse locations within the area. To validate the model’s adaptability to varied takeoff positions, we conducted computer simulations with three distinct starting points, as illustrated in Figure 16. Results demonstrate that the model consistently achieves complete CPP across all initial positions, effectively prioritizing high-probability target zones while minimizing redundant path revisits. This confirms the algorithm’s operational robustness under practical deployment scenarios.

During UAV search missions, unknown obstacles (e.g., dynamic impediments or weather-induced no-fly zones) may disrupt planned trajectories. To evaluate the DQN-A* algorithm’s obstacle-avoidance capability, we simulated four scenarios with abruptly introduced obstacles at the edge, corner, intermediate, and central regions (Figure 17). The algorithm successfully circumvented obstacles in all cases while completing coverage of non-obstacle areas, demonstrating robust capability under spatial constraints. Notably, in Scenarios I and II, the UAV prioritized exploration of high-probability target zones during obstacle avoidance, achieving non-redundant search paths. In Scenarios III and IV, where obstacles occupied critical path nodes, the UAV retained prioritization of high-likelihood regions and ensured full non-obstacle coverage, but redundantly explored adjacent areas, resulting in partially redundant path segments.

The proposed DQN-A* algorithm successfully completed full-area coverage tasks across all three tested initial deployment positions (Figure 16). This capability holds critical operational significance for large-scale search domains or scenarios requiring dynamic adjustment of coverage granularity. Specifically, after completing a task in one zone, the UAV autonomously transitions to the next without needing to return to a predefined starting coordinate. As illustrated in Figure 18, the agent seamlessly progresses from Zone I to Zone II after a full-coverage search.

5.6. Scalability Analysis Across Grid Sizes

To evaluate the scalability of the proposed DQN-A* algorithm beyond the

5 \times 5

baseline, additional computer simulations were conducted on

8 \times 8

(

N = 64

) and

10 \times 10

(

N = 100

) grids under identical hyperparameter settings. The results are summarized in Table 7.

As shown in Table 7, DQN-A* attains a mean coverage ratio of

97.7 %

on the

5 \times 5

grid, which decreases gradually with grid size to

91.0 %

on the

10 \times 10

grid. This monotonic, graceful degradation confirms that the proposed algorithm scales effectively across the evaluated grid sizes, with the residual uncovered fraction growing only modestly as the state space expands.

5.7. Generalization to Multimodal Probability Distributions

The reward function in Equation (23) couples the per-cell probability

ω_{i j}

to the immediate reward but does not assume any particular parametric form of the distribution

Ω

over cells;

Ω

enters the algorithm only through its tabulated grid values, which satisfy the normalization constraint of Equation (22). Consequently, the algorithm can in principle accept any non-negative, normalized probability map, including multimodal ones (e.g., several plausible target locations), mixtures of Gaussians, uniform-with-hotspots maps, and empirically learned distributions.

To demonstrate this empirically, we re-trained and re-evaluated DQN-A* under unimodal, bimodal, and trimodal target distributions on the

5 \times 5

grid over 10 random seeds each; the results are reported in Table 8. Across all three distributions the algorithm attains a high coverage ratio above

95 %

(

97.7 %

unimodal,

95.7 %

bimodal,

96.5 %

trimodal), with only a modest reduction relative to the unimodal case and a correspondingly small increase in mean path length, as expected when target mass is spread across spatially separated peaks. No distribution causes the algorithm to fail or to systematically omit a region, confirming that the method generalizes across distributional forms without any change to the reward or the network.

A natural concern with multiple local maxima is whether the agent might fixate on the single highest-probability region and systematically overlook secondary peaks (a “greedy lock-in”). The dual-driven reward mechanism structurally mitigates this risk through two complementary effects: (a) Efficiency pressure. The step-dependent denominator

(K + C)

progressively discounts all per-step rewards as the episode lengthens, creating a global incentive to complete coverage efficiently rather than dwelling in any single region. (b) Revisit penalty. Once a high-probability region has been fully covered, re-entering any of its cells incurs the constant penalty

λ_{2} < 0

, making it strictly disadvantageous to remain in or oscillate around the dominant peak; the agent is thus structurally pushed toward the positive rewards still available at secondary peaks. The comparable coverage achieved under the bimodal and trimodal maps in Table 8 is consistent with both mechanisms.

5.8. High-Fidelity Physics-Engine Validation

To bridge the gap between the grid-based model and the physical constraints of real flight, this subsection reports a qualitative feasibility demonstration in which the trained DQN-A* policy is executed in a closed loop within NVIDIA Isaac Lab (v2.2.0), a GPU-accelerated simulator built on the PhysX rigid-body engine. The agent is modeled with the hardware parameters of the Bitcraze Crazyflie, and unknown obstacles are instantiated as cylindrical mesh primitives that the UAV must physically circumvent.

The validation adopts an event-triggered hierarchical architecture that mirrors the DQN-A* design of Section 4.5. The trained Q-network acts as the high-level planner: whenever the quadrotor enters a new grid cell, its continuous position is mapped back to a discrete cell index, the state matrix

S

is updated with the

0 / x / 1

identifiers, and the Q-network performs one forward pass to select the next action and target waypoint. The cascade Proportional–Integral–Derivative (PID) controller acts as the low-level tracker, converting each discrete waypoint into continuous thrust and torque commands so that turning radius and acceleration profiles emerge from the simulated dynamics rather than being prescribed. When the distance to a confirmed obstacle falls below a fixed threshold, the multi-stage obstacle-identification rule triggers the local A* sub-procedure to generate a collision-free detour waypoint.

The closed-loop system was evaluated on the coverage scenarios of Section 5, including obstacle-free deployments from different initial positions and deployments with an unknown obstacle at the edge, corner, intermediate, and central cells. In all tested scenarios the quadrotor took off, executed the DQN-A*-generated coverage sequence under PID tracking, circumvented the physical obstacle when present, and completed the traversal of all reachable cells before landing. A representative closed-loop trajectory is shown in Figure 19, and the full set of flights is provided as Supplementary Video S1.

6. Conclusions

This study proposes the DQN-A* algorithm for UAV CPP in search missions. The algorithm adapts to diverse initial UAV positions, avoids unknown obstacles, and prioritizes high-probability target region exploration. Computer simulations validated the effectiveness of the proposed framework, with the key findings summarized as follows:

This study systematically investigates the impact of UAV positional labeling in grid-based maps on model training effectiveness through both theoretical derivation and empirical validation. This finding provides valuable guidance for designing DRL-based path planning algorithms in robot coverage tasks across grid-mapped environments.
The proposed DQN-A* couples a DQN global planner with an A* local controller through a multi-stage “0-x-1” obstacle-identification rule, enabling reliable avoidance of previously unknown obstacles while preserving complete coverage—a capability absent from the standalone DQN baseline.
A novel Probability–Time Dual-Driven Reward Mechanism is designed within the DQN-A* algorithm. This mechanism prioritizes exploration of high-probability regions, thus achieving higher target detection probability than benchmark algorithms.

Finally, it is important to acknowledge the limitations of the current study. In this paper, the UAV’s action space is abstracted as discrete two-dimensional grid movements. In practical deployments, these discrete grid waypoints must be translated into continuous dynamic trajectories—accounting for turning radii, acceleration, and deceleration—by the UAV’s low-level flight controllers, such as PID controllers or MPC.

Future Work

Three follow-up directions are identified. First, a continuous-action formulation solved with Twin Delayed DDPG (TD3) or Soft Actor-Critic (SAC) would replace the discrete 4-direction action space and natively model turning radius and acceleration constraints. Second, dynamic obstacles would be handled via time-decaying confidence markers with event-triggered global re-planning. Third, an extended baseline comparison against Rainbow DQN—directly compatible with the Hamiltonian-path pre-filling of Section 4.6—and, in the longer term, PPO and SAC, is a natural follow-up to Section 5.4.

Supplementary Materials

The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/drones10060442/s1, Supplementary Video S1: Closed-loop execution of the trained policy in NVIDIA Isaac Lab.

Author Contributions

Conceptualization, H.Y. and G.C.; methodology, H.Y.; software, S.Y.; validation, S.Y. and S.W.; formal analysis, S.Y.; investigation, Q.W.; resources, Q.W.; data curation, S.W. and Z.L.; writing—original draft preparation, H.Y.; writing—review and editing, G.C.; visualization, Z.L.; supervision, G.C.; project administration, G.C.; funding acquisition, Q.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the China Postdoctoral Science Foundation, grant number 2025M774450.

Data Availability Statement

The original contributions presented in this study are included in the article material. Further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:

UAV	Unmanned Aerial Vehicle
CPP	Coverage Path Planning
DRL	Deep Reinforcement Learning
DQN	Deep Q-Network
MDP	Markov Decision Process
RL	Reinforcement Learning
RRT*	Rapidly-exploring Random Tree star
TSP	Traveling Salesman Problem
PPO	Proximal Policy Optimization
DDPG	Deep Deterministic Policy Gradient
DDQN	Double DQN
TD3	Twin Delayed DDPG
SAC	Soft Actor-Critic
PID	Proportional–Integral–Derivative
MPC	Model Predictive Control
FOV	Field of View

References

Lidbetter, T. Search and rescue in the face of uncertain threats. Eur. J. Oper. Res. 2020, 285, 1153–1160. [Google Scholar] [CrossRef]
Xu, S.; Zhou, Z.; Li, J.; Wang, L.; Zhang, X.; Gao, H. Communication-Constrained UAVs’ Coverage Search Method in Uncertain Scenarios. IEEE Sens. J. 2024, 24, 17092–17101. [Google Scholar] [CrossRef]
Khosravi, M.; Arora, R.; Enayati, S.; Pishro-Nik, H. A Search and Detection Autonomous Drone System: From Design to Implementation. IEEE Trans. Autom. Sci. Eng. 2025, 22, 3485–3501. [Google Scholar] [CrossRef]
Martinez-Alpiste, I.; Golcarenarenji, G.; Wang, Q.; Alcaraz-Calero, J.M. Search and rescue operation using UAVs: A case study. Expert Syst. Appl. 2021, 178, 114937. [Google Scholar] [CrossRef]
Qi, M.; Zheng, H. Temporal-Constrained DDPG-Based Path Planning for UAV in Dynamic Environment. IEEE Sens. J. 2026, 26, 4301–4312. [Google Scholar] [CrossRef]
Wang, L.; Kan, J.; Guo, J.; Wang, C. 3D Path Planning for the Ground Robot with Improved Ant Colony Optimization. Sensors 2019, 19, 815. [Google Scholar] [CrossRef] [PubMed]
Sangeetha, V.; Krishankumar, R.; Ravichandran, K.S.; Kar, S. Energy-efficient green ant colony optimization for path planning in dynamic 3D environments. Soft Comput. 2021, 25, 4749–4769. [Google Scholar] [CrossRef]
Meng, Q.; Qian, C.; Sun, Z.Y.; Zhao, S. Autonomous parking method based on improved A* algorithm and model predictive control. Nonlinear Dyn. 2025, 113, 6839–6862. [Google Scholar] [CrossRef]
Chen, T.; Wang, Y.; Wen, H.; Kang, J. Autonomous assembly of multiple flexible spacecraft using RRT* algorithm and input shaping technique. Nonlinear Dyn. 2023, 111, 11223–11241. [Google Scholar] [CrossRef]
Lee, T.K.; Baek, S.H.; Choi, Y.H.; Oh, S.Y. Smooth coverage path planning and control of mobile robots based on high-resolution grid map representation. Robot. Auton. Syst. 2011, 59, 801–812. [Google Scholar] [CrossRef]
Bao, H.; Wang, Y.; Zhu, H.; Wang, D. Area Complete Coverage Path Planning for Offshore Seabed Organisms Fishing Autonomous Underwater Vehicle Based on Improved Whale Optimization Algorithm. IEEE Sens. J. 2024, 24, 12887–12903. [Google Scholar] [CrossRef]
Han, G.; Zhou, Z.; Zhang, T.; Wang, H.; Liu, L.; Peng, Y.; Guizani, M. Ant-Colony-Based Complete-Coverage Path-Planning Algorithm for Underwater Gliders in Ocean Areas with Thermoclines. IEEE Trans. Veh. Technol. 2020, 69, 8959–8971. [Google Scholar] [CrossRef]
Lumelsky, V.J.; Mukhopadhyay, S.; Sun, K. Dynamic path planning in sensor-based terrain acquisition. IEEE Trans. Robot. Autom. 1990, 6, 462–472. [Google Scholar] [CrossRef]
Wang, L.; Wang, Z.; Liu, M.; Ying, Z.; Xu, N.; Meng, Q. Full Coverage Path Planning Methods of Harvesting Robot with Multi-Objective Constraints. J. Intell. Robot. Syst. 2022, 106, 17. [Google Scholar] [CrossRef]
Gabriely, Y.; Rimon, E. Spiral-STC: An on-line coverage algorithm of grid environments by a mobile robot. In Proceedings of the Proceedings 2002 IEEE International Conference on Robotics and Automation, Washington, DC, USA, 11–15 May 2002; pp. 954–960. [Google Scholar]
Hai, V.P.; Asadi, F.; Abut, N.; Kandilli, I. Hybrid Spiral STC-Hedge Algebras Model in Knowledge Reasonings for Robot Coverage Path Planning and Its Applications. Appl. Sci. 2019, 9, 1909. [Google Scholar] [CrossRef]
Tang, G.; Tang, C.; Zhou, H.; Claramunt, C.; Men, S. R-DFS: A Coverage Path Planning Approach Based on Region Optimal Decomposition. Remote Sens. 2021, 13, 1525. [Google Scholar] [CrossRef]
Tan, X.; Han, L.; Gong, H.; Wu, Q. Biologically Inspired Complete Coverage Path Planning Algorithm Based on Q-Learning. Sensors 2023, 23, 4647. [Google Scholar] [CrossRef]
Ai, B.; Jia, M.; Xu, H.; Xu, J.; Wen, Z.; Li, B.; Zhang, D. Coverage path planning for maritime search and rescue using reinforcement learning. Ocean Eng. 2021, 241, 110098. [Google Scholar] [CrossRef]
Lv, L.; Zhang, S.; Ding, D.; Wang, Y. Path Planning via an Improved DQN-Based Learning Policy. IEEE Access 2019, 7, 67319–67330. [Google Scholar] [CrossRef]
Dong, J.; Li, X.; Liu, Y. Multi-quadrotor Cooperative Area Coverage Mission Planning Based on DQN. In Proceedings of the 2020 5th International Conference on Advanced Robotics and Mechatronics (ICARM), Shenzhen, China, 18–21 December 2020; pp. 672–677. [Google Scholar]
Hu, W.; Yu, Y.; Liu, S.; She, C.; Guo, L.; Vucetic, B.; Li, Y. Multi-UAV Coverage Path Planning: A Distributed Online Cooperation Method. IEEE Trans. Veh. Technol. 2023, 72, 11727–11740. [Google Scholar] [CrossRef]
Nam, L.H.; Huang, L.; Li, X.J.; Xu, J.F. An approach for coverage path planning for UAVs. In Proceedings of the 2016 IEEE 14th International Workshop on Advanced Motion Control (AMC), Auckland, New Zealand, 22–24 April 2016; pp. 411–416. [Google Scholar]
Sun, W.; Luo, Z.; Huang, K.; Shi, J. Joint Deployment and Coverage Path Planning for Capsule Airports with Multiple Drones. Drones 2023, 7, 457. [Google Scholar] [CrossRef]
Nasirian, B.; Mehrandezh, M.; Janabi-Sharifi, F. Efficient Coverage Path Planning for Mobile Disinfecting Robots Using Graph-Based Representation of Environment. Front. Robot. AI 2021, 8, 624333. [Google Scholar] [CrossRef] [PubMed]
Cai, C.; Chen, J.; Yan, Q.; Liu, F. A Multi-Robot Coverage Path Planning Method for Maritime Search and Rescue Using Multiple AUVs. Remote Sens. 2022, 15, 93. [Google Scholar] [CrossRef]
Feng, L.; Katupitiya, J. UAV-based persistent full area coverage with dynamic priorities. Robot. Auton. Syst. 2022, 157, 104244. [Google Scholar] [CrossRef]
Jia, Y.; Zhou, S.; Zeng, Q.; Li, C.; Chen, D.; Zhang, K.; Liu, L.; Chen, Z. The UAV Path Coverage Algorithm Based on the Greedy Strategy and Ant Colony Optimization. Electronics 2022, 11, 2667. [Google Scholar] [CrossRef]
Ahmed, N.; Pawase, C.J.; Chang, K. Distributed 3-D Path Planning for Multi-UAVs with Full Area Surveillance Based on Particle Swarm Optimization. Appl. Sci. 2021, 11, 3417. [Google Scholar] [CrossRef]
Yang, S.X.; Meng, M. An efficient neural network approach to dynamic robot motion planning. Neural Netw. 2000, 13, 143–148. [Google Scholar] [CrossRef] [PubMed]
Yang, S.X.; Luo, C. A Neural Network Approach to Complete Coverage Path Planning. IEEE Trans. Syst. Man Cybern. B 2004, 34, 718–724. [Google Scholar] [CrossRef]
Tang, F. Coverage path planning of unmanned surface vehicle based on improved biological inspired neural network. Ocean. Eng. 2023, 278, 114354. [Google Scholar] [CrossRef]
Huo, L.; Liu, Y.; Chen, Z.; Yang, Y.; Yan, X.; Xia, H.; Sun, Q. Complete Coverage Path Planning Algorithm Based on Improved Biologically Inspired Neural Networks in Spray Painting. IEEE Robot. Autom. Lett. 2025, 10, 5697–5704. [Google Scholar] [CrossRef]
Huang, Y.; Wang, Y.; Li, Z.; Zhang, H.; Zhang, C. A Hierarchical Multi Robot Coverage Strategy for Large Maps with Reinforcement Learning and Dense Segmented Siamese Network. IEEE Robot. Autom. Lett. 2025, 10, 444–451. [Google Scholar] [CrossRef]
Luo, Q.; Luan, T.H.; Shi, W.; Fan, P. Deep Reinforcement Learning Based Computation Offloading and Trajectory Planning for Multi-UAV Cooperative Target Search. IEEE J. Sel. Areas Commun. 2023, 41, 504–520. [Google Scholar] [CrossRef]
Yi, L.; Hayat, A.A.; Wan, A.Y.S.; Le, A.V.; Tang, Q.R.; Elara, M.R. Complete coverage path planning for omnidirectional self-reconfigurable cleaning robot using aGBNN. IEEE Trans. Autom. Sci. Eng. 2025, 23, 2212–2230. [Google Scholar] [CrossRef]
Sutton, R.S.; Barto, A.G. Reinforcement Learning: An Introduction, 2nd ed.; MIT Press Cambridge: Cambridge, MA, USA, 2018; Volume 1, p. 25. [Google Scholar]
Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A.A.; Veness, J.; Bellemare, M.G.; Graves, A.; Riedmiller, M.; Fidjeland, A.K.; Ostrovski, G.; et al. Human-Level Control through Deep Reinforcement Learning. Nature 2015, 518, 529–533. [Google Scholar] [CrossRef] [PubMed]
Krzysiak, R.; Butail, S. Information-Based Control of Robots in Search-and-Rescue Missions with Human Prior Knowledge. IEEE Trans. Hum.-Mach. Syst. 2022, 52, 52–63. [Google Scholar] [CrossRef]
Ni, J.; Gu, Y.; Gu, Y.; Zhao, Y.; Shi, P. UAV Coverage Path Planning with Limited Battery Energy Based on Improved Deep Double Q-network. Int. J. Control Autom. Syst. 2024, 22, 2591–2601. [Google Scholar] [CrossRef]
Qi, P. Algorithm design and Simulation of optimal maritime search scheme. In First International Conference on Information Sciences, Machinery, Materials and Energy; Atlantis Press: Dordrecht, The Netherlands, 2015; pp. 1772–1775. [Google Scholar]

Figure 1. Unmanned aerial vehicle (UAV) detection field.

Figure 2. Grid discretization of the search area.

Figure 3. Search mission scenario.

Figure 4. UAV action space.

Figure 5. The flow chart of the DQN-A* algorithm.

Figure 6. Target probability distribution.

Figure 7.

N = 25

. Comparison study showing episode return of different UAV position markers over 10 seeds. The solid lines are the average of all runs, and the shaded area is the standard error of the mean. (a) UAV position marker 1–6. (b) UAV position marker 6–10. (c) UAV position marker 10–30.

Figure 7.

N = 25

. Comparison study showing episode return of different UAV position markers over 10 seeds. The solid lines are the average of all runs, and the shaded area is the standard error of the mean. (a) UAV position marker 1–6. (b) UAV position marker 6–10. (c) UAV position marker 10–30.

Figure 8.

N = 36

. Comparison study showing episode return of different UAV position markers over 10 seeds. The solid lines are the average of all runs, and the shaded area is the standard error of the mean. (a) UAV position marker 1–6. (b) UAV position marker 6–12. (c) UAV position marker 12–30.

Figure 8.

N = 36

. Comparison study showing episode return of different UAV position markers over 10 seeds. The solid lines are the average of all runs, and the shaded area is the standard error of the mean. (a) UAV position marker 1–6. (b) UAV position marker 6–12. (c) UAV position marker 12–30.

Figure 9. Ablation study showing episode return of DQN-A* and w/oHp over 10 seeds. The solid lines are the average of all runs, and the shaded area is the standard error of the mean.

Figure 10. Ablation study showing detection probability of DQN-A* and w/oDdrm.

Figure 11. Ablation study showing coverage ratio of DQN-A* and w/oA* over 10 seeds. The solid lines are the average of all runs, and the error bars are the standard error of the mean.

Figure 12. Episode return of the three state-representation variants on the

5 \times 5

grid over 10 seeds. Solid lines are the cross-seed mean; shaded bands are the standard error of the mean.

Figure 12. Episode return of the three state-representation variants on the

5 \times 5

grid over 10 seeds. Solid lines are the cross-seed mean; shaded bands are the standard error of the mean.

Figure 13. Comparison study in static environment showing coverage ratio and detection probability of DQN-A*, DQN, and boustrophedon.

Figure 14. Comparison study in environments with unknown obstacles showing (a) coverage ratio and (b) detection probability of DQN-A*, DQN, and boustrophedon over 10 seeds. The solid lines are the average of all runs, and the shaded area is the standard error of the mean.

Figure 15. Episode return of the proposed framework utilizing different RL algorithms (DQN, DDQN, and Dueling DQN).

Figure 16. Complete coverage search trajectories validation across three distinct initial positions.

Figure 17. Coverage paths under unknown obstacle scenarios.

Figure 18. UAV inter-zone search transition.

Figure 19. Closed-loop execution of the trained DQN-A* policy in NVIDIA Isaac Lab. The high-level Q-network selects one action per traversed grid cell; the low-level cascade PID controller produces the continuous trajectory. The red curve denotes the UAV trajectory and the gray cylinder denotes the obstacle.

Table 1. Model parameters.

Parameters	Value
Total gradient updates	17,000
Parallel environments	1024
Discount factor ( $γ$ )	0.95
Learning rate ( $η$ )	$5 \times 10^{- 4}$
Replay buffer size	100,000
Batch size	512
$ε_{start}$	1.0
$ε_{end}$	0.05
$ε$ linear-decay updates	15,000
Target network sync period (updates)	500
Reward scaling factor ( $λ_{1}$ )	500
Penalty factor ( $λ_{2}$ )	$- 0.5$
Step count factor (K)	10
Max episode step budget ( $C_{max}$ )	200
Out-of-bounds penalty ( $r_{oob}$ )	$- 1.5$
Obstacle-collision penalty ( $r_{obs}$ )	$- 3.0$
Completion bonus base ( $β_{done}$ )	50
Timeout penalty coefficient ( $β_{tmo}$ )	20
Flight energy coefficient ( $α_{flight}$ )	$0.1$
Turn energy coefficient ( $α_{turn}$ )	$0.2$

Table 2. Network-architecture sweep on the

5 \times 5

grid: coverage ratio (mean ± standard deviation over 10 random seeds) and mean number of turns. The width sweep fixes depth at one hidden layer; the depth sweep fixes width at 256 units. The adopted default is shown in bold.

Table 2. Network-architecture sweep on the

5 \times 5

grid: coverage ratio (mean ± standard deviation over 10 random seeds) and mean number of turns. The width sweep fixes depth at one hidden layer; the depth sweep fixes width at 256 units. The adopted default is shown in bold.

Configuration	Coverage Ratio (%)	Mean Turns
Width sweep (depth $= 1$ )
Width 64	$81.7 \pm 4.4$	$32.6$
Width 128	$89.3 \pm 6.8$	$20.6$
Width 256 (default)	$97.7 \pm 0.6$	$11.7$
Width 512	$99.4 \pm 0.7$	$12.0$
Depth sweep (width $= 256$ )
Depth 1 (default)	$97.7 \pm 0.6$	$11.7$
Depth 2	$96.2 \pm 2.9$	$15.6$
Depth 3	$92.3 \pm 1.0$	$12.9$

Table 3. Learning-rate sensitivity on the

5 \times 5

grid: coverage ratio (mean ± standard deviation over 10 random seeds) and mean episode length (steps to termination, capped at

C_{max} = 200

) for three fixed learning rates and two adaptive schedules. The adopted default is shown in bold.

Table 3. Learning-rate sensitivity on the

5 \times 5

grid: coverage ratio (mean ± standard deviation over 10 random seeds) and mean episode length (steps to termination, capped at

C_{max} = 200

) for three fixed learning rates and two adaptive schedules. The adopted default is shown in bold.

Learning Rate/Schedule	Coverage Ratio (%)	Mean Episode Length (Steps)
$η = 5 \times 10^{- 4}$ (default)	$97.7 \pm 0.6$	$25.2$
$η = 1 \times 10^{- 4}$	$82.7 \pm 3.0$	$78.1$
$η = 1 \times 10^{- 5}$	$66.3 \pm 9.2$	$130.0$
Cosine annealing	$96.1 \pm 2.5$	$25.7$
Step decay	$93.9 \pm 1.6$	$29.7$

Table 4. Ablation experiments.

Algorithms	Hp	Ddrm	A*
DQN-A*	√	√	√
w/oHp	×	√	√
w/oDdrm	√	×	√
w/oA*	√	√	×

Hp: Hamiltonian-path pre-training; Ddrm: dual-driven reward mechanism; √: component included; ×: component removed.

Table 5. Trainable-parameter count of the policy network for the three state-representation variants on the

5 \times 5

grid.

Table 5. Trainable-parameter count of the policy network for the three state-representation variants on the

5 \times 5

grid.

Variant	Architecture	Parameters	Relative Size
Single-channel ( $x = 8$ )	MLP (256-unit hidden)	7684	$1.00 \times$
Multi-channel–MLP	MLP (256-unit hidden)	$20, 484$	$2.67 \times$
Multi-channel–CNN	Conv(16)→Conv(32)→FC(64)	$56, 612$	$7.37 \times$

Table 6. Quantitative ablation of action space dimensions on coverage efficiency (mean over 10 random seeds).

Grid Size	Action Space	Coverage Ratio (%)	Mean Turns
$5 \times 5$	4-Direction (Ours)	97.7	11.7
$5 \times 5$	8-Direction	98.5	18.8
$8 \times 8$	4-Direction (Ours)	95.6	53.9
$8 \times 8$	8-Direction	92.8	65.4

Table 7. Coverage ratio of DQN-A* across different grid sizes (mean ± standard deviation over 10 random seeds).

Grid Size	N	Coverage Ratio (%)
$5 \times 5$	25	$97.7 \pm 0.6$
$8 \times 8$	64	$95.6 \pm 2.8$
$10 \times 10$	100	$91.0 \pm 3.6$

Table 8. Generalization across target probability distributions on the

5 \times 5

grid: coverage ratio (mean ± standard deviation over 10 random seeds) and mean path length.

Table 8. Generalization across target probability distributions on the

5 \times 5

grid: coverage ratio (mean ± standard deviation over 10 random seeds) and mean path length.

Distribution	Coverage Ratio (%)	Mean Steps
Unimodal	$97.7 \pm 0.6$	$25.2$
Bimodal	$95.7 \pm 2.0$	$26.3$
Trimodal	$96.5 \pm 0.8$	$26.8$

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Yuan, H.; Yan, S.; Liu, Z.; Wang, S.; Wang, Q.; Chen, G. Optimizing Spatial State Representation in Reinforcement Learning for Coverage Path Planning in UAV Search Missions. Drones 2026, 10, 442. https://doi.org/10.3390/drones10060442

AMA Style

Yuan H, Yan S, Liu Z, Wang S, Wang Q, Chen G. Optimizing Spatial State Representation in Reinforcement Learning for Coverage Path Planning in UAV Search Missions. Drones. 2026; 10(6):442. https://doi.org/10.3390/drones10060442

Chicago/Turabian Style

Yuan, Hu, Shengkai Yan, Zhuzhi Liu, Suli Wang, Qiang Wang, and Gaocheng Chen. 2026. "Optimizing Spatial State Representation in Reinforcement Learning for Coverage Path Planning in UAV Search Missions" Drones 10, no. 6: 442. https://doi.org/10.3390/drones10060442

APA Style

Yuan, H., Yan, S., Liu, Z., Wang, S., Wang, Q., & Chen, G. (2026). Optimizing Spatial State Representation in Reinforcement Learning for Coverage Path Planning in UAV Search Missions. Drones, 10(6), 442. https://doi.org/10.3390/drones10060442

Article Menu

Optimizing Spatial State Representation in Reinforcement Learning for Coverage Path Planning in UAV Search Missions

Highlights

Abstract

1. Introduction

2. Related Work

2.1. Offline CPP

2.2. Online CPP

3. Problem Formulation

3.1. Detection Field of the UAV

3.2. Search Area Discretization

3.3. Search Scenario Description

4. Model Construction

4.1. Action Space

4.2. State Space

4.3. Reward Design

4.4. Model Parameter Update

4.5. Integrated Framework of DQN and A* Algorithm

4.6. Hamiltonian Path Pre-Training

4.6.1. Path Construction

4.6.2. Replay Buffer Pre-Filling

5. Computer Simulations

5.1. Parameter Settings

5.2. Analysis of UAV Position Identifier Impacts on Training Effectiveness

Learning-Rate Sensitivity Analysis

5.3. Ablation Study

5.3.1. Effect of Hamiltonian Path Pre-Training

5.3.2. Effect of the Dual-Driven Reward Mechanism

5.3.3. Effect of the A* Local Planner

5.3.4. Single-Channel Versus Multi-Channel State Representation

5.3.5. Ablation on Action Space Dimensions (4-Direction vs. 8-Direction)

5.4. Comparison with Baselines

5.5. Generalization Analysis

5.6. Scalability Analysis Across Grid Sizes

5.7. Generalization to Multimodal Probability Distributions

5.8. High-Fidelity Physics-Engine Validation

6. Conclusions

Future Work

Supplementary Materials

Author Contributions

Funding

Data Availability Statement

Conflicts of Interest

Abbreviations

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI