1. Introduction
In recent years, with the rapid development of unmanned systems and mobile robotics [1,2,3,4,5], coverage path planning (CPP) of multiple unmanned aerial vehicles (UAVs) has been widely applied in various real-world scenarios, such as environmental monitoring [6,7], agricultural inspection [8,9], security patrolling [10,11], and disaster search and rescue [12,13]. The core objective of CPP is to plan one or multiple trajectories in environments with obstacles and spatial constraints, enabling agents to efficiently and completely traverse all reachable areas within the task region, thereby maximizing spatial coverage while minimizing path redundancy and resource consumption.
In practical applications, traditional CPP methods, such as Boustrophedon coverage [14], spiral path planning [15], and graph-based Spanning Tree Coverage (STC) [16], are structurally simple and easy to deploy. However, they often encounter significant challenges in complex environments, including low coverage efficiency, severe path redundancy, poor adaptability to obstacles, and limited scalability to multi-UAV cooperative tasks. To address these issues, various techniques, such as heuristic search [17], graph partitioning [18], clustering algorithms [19], and bio-inspired methods [20,21], have been presented in recent years. Improved ant colony optimization [22,23], genetic algorithms [24,25], fuzzy logic approaches [26], and K-Means clustering [27] are among the methods that have shown improvements in path quality. Nevertheless, these approaches still face difficulties in policy learning and adaptive decision-making in high-dimensional and dynamic environments.
With the rapid advances in Deep Reinforcement Learning (DRL), its applications to CPP problems have demonstrated great potential [28,29,30,31]. Reinforcement learning optimizes policies through interaction with the environment, making it well-suited for complex, dynamic systems that are difficult to model explicitly. The introduction of the Dueling Deep Q-Network (Dueling DQN) architecture has further enhanced the stability of value function estimation and improved policy convergence [32]. Consequently, adaptive coverage strategy learning based on reinforcement learning has become a research hotspot in recent years.
On the other hand, the incorporation of multi-agent systems (MAS) has significantly expanded the application boundaries of CPP [33,34,35]. Compared to single-agent systems, multi-agent systems offer higher spatial parallelism and task execution efficiency when dealing with large-scale environments. However, multi-UAV cooperative CPP tasks pose two critical challenges: reasonable task area partitioning and path conflict management. Traditional methods such as K-Means clustering [27] or regular grid division often fail to ensure region connectivity and load balancing, leading to area overlaps and inter-agent conflicts. To overcome these issues, this paper introduces the Divide Areas based on Robots' initial Positions (DARP) algorithm [18], which dynamically partitions the task space based on UAVs' starting positions and obstacle layouts, ensuring that each UAV is assigned a reasonable and connected task region, thereby significantly reducing path redundancy.
In summary, this paper proposes a novel multi-UAV coverage path planning method that combines DARP-based area partitioning with an improved Dueling DQN reinforcement learning framework. The main contributions are as follows: (1) A map preprocessing mechanism based on Depth-First Search (DFS) is introduced to eliminate completely unreachable areas, enhancing policy learning efficiency. (2) The DARP algorithm is utilized to achieve dynamic multi-UAV task partitioning, ensuring load balancing and minimizing inter-agent path conflicts. (3) An enhanced Dueling DQN network is constructed by incorporating action encoding and prioritized experience replay, improving policy generalization and training stability. Finally, comprehensive experimental evaluations are conducted on several benchmark maps and the results demonstrate that the proposed method significantly outperforms traditional approaches in terms of coverage rate, redundancy rate, and path efficiency.
3. Proposed Method
To realize the task defined in
Section 2, an integrated method is proposed in this paper, including three main parts. First, a map preprocessing method is proposed to remove unreachable areas. Then, a regional division and task allocation module is presented to improve the efficiency of the collaborative operation. At last, a path planning module based on reinforcement learning is proposed to realize the final task.
3.1. Map Preprocessing Method
In real-world environments, due to the irregular distribution of obstacles, there may exist free regions that are completely enclosed by obstacles. Although these regions are labeled as "traversable" in the grid map, they are actually inaccessible from the map boundaries. If not properly handled, such regions would severely impair the training efficiency of reinforcement learning algorithms.
To address this issue, this paper designs a map preprocessing module based on Depth-First Search (DFS) to automatically identify and relabel these inaccessible but non-obstacle regions. The core idea is as follows: (1) Starting from all traversable cells along the map boundaries, perform DFS exploration. (2) Mark all unvisited free cells during the search as “inaccessible”. (3) Treat these cells as obstacles during the reinforcement learning process to avoid invalid path exploration.
Notably, this DFS serves exclusively as a static, offline environment-cleaning tool for one-time topological screening during initialization, rather than for path generation or dynamic decision-making—distinguishing it from the subsequent Dueling DQN-based CPP. In addition, this DFS-based preprocessing is a coarse-grained method that does not require a highly detailed map; it can operate effectively on low-resolution maps or satellite imagery, simplifying data acquisition and preprocessing requirements.
The overall process is illustrated in Algorithm 1, and the effect of this preprocessing is shown in
Figure 1.
Algorithm 1 Preprocessing unreachable areas via Depth-First Search (DFS) strategy.
- Require: Grid map M of size H × W, where M[x][y] = 1 indicates an obstacle and 0 a free cell
- Ensure: Updated M in which unreachable free cells are marked as 3
- 1: visited ← ∅ ▹ Initialize a set to record reachable cells
- 2: function DFS(x, y)
- 3: if x < 0 or x ≥ H or y < 0 or y ≥ W then
- 4: return ▹ Return if outside the map boundaries
- 5: end if
- 6: if (x, y) ∈ visited or M[x][y] = 1 then
- 7: return ▹ Skip already visited or obstacle cells
- 8: end if
- 9: Add (x, y) to visited
- 10: for (dx, dy) ∈ {(−1, 0), (1, 0), (0, −1), (0, 1)} do
- 11: DFS(x + dx, y + dy) ▹ Recursively explore four neighbors
- 12: end for
- 13: end function
- 14: for x ← 0 to H − 1 do
- 15: for y ← 0 to W − 1 do
- 16: if x = 0 or x = H − 1 or y = 0 or y = W − 1 then
- 17: if M[x][y] = 0 then
- 18: DFS(x, y) ▹ Start DFS from each free border cell
- 19: end if
- 20: end if
- 21: end for
- 22: end for
- 23: for x ← 0 to H − 1 do
- 24: for y ← 0 to W − 1 do
- 25: if M[x][y] = 0 and (x, y) ∉ visited then
- 26: M[x][y] ← 3 ▹ Mark unreachable free cells as 3
- 27: end if
- 28: end for
- 29: end for
DFS is launched from every free border cell, recursively exploring four connected neighbors while skipping obstacles and already visited cells. After all reachable areas are discovered, any remaining free cell not in the visited set is marked as 3, indicating it is unreachable. This method can be efficiently executed during the map initialization phase, offering advantages such as zero learning cost and high parallelizability. Compared with traditional flood-fill algorithms, the proposed approach provides a simpler search process and better integration with reinforcement learning environments.
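For illustration, a minimal Python sketch of this preprocessing step is given below. It assumes the grid is stored as a 2D NumPy array in which 0 denotes a free cell and 1 an obstacle; the function name, the iterative (stack-based) formulation, and the array layout are illustrative assumptions rather than the exact implementation used in this work.

```python
import numpy as np

def mark_unreachable_cells(grid: np.ndarray) -> np.ndarray:
    """Relabel free cells (0) that cannot be reached from the map border as 3.

    Assumes 0 = free and 1 = obstacle; returns a modified copy of the grid.
    """
    h, w = grid.shape
    visited = np.zeros((h, w), dtype=bool)

    # Seed the search with every free cell on the map border.
    stack = [(x, y)
             for x in range(h) for y in range(w)
             if (x in (0, h - 1) or y in (0, w - 1)) and grid[x, y] == 0]

    # Iterative DFS over the 4-connected neighborhood (avoids recursion limits).
    while stack:
        x, y = stack.pop()
        if not (0 <= x < h and 0 <= y < w):
            continue                      # outside the map boundaries
        if visited[x, y] or grid[x, y] == 1:
            continue                      # already visited or an obstacle
        visited[x, y] = True
        stack.extend([(x - 1, y), (x + 1, y), (x, y - 1), (x, y + 1)])

    # Any free cell never reached from the border is marked as unreachable (3).
    result = grid.copy()
    result[(grid == 0) & (~visited)] = 3
    return result
```

In the reinforcement learning environment, cells labeled 3 can then be treated exactly like obstacles during state construction and action masking.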
3.2. Area Division via DARP
In multi-UAV coverage path planning tasks, avoiding conflicts between UAVs and ensuring balanced workload distribution are key challenges. To address this issue, we adopt the Divide Areas based on Robots' initial Positions (DARP) algorithm, first introduced by Kapoutsis et al. [18], to partition the traversable space into disjoint and connected regions, each assigned to a single UAV.
DARP guarantees that (1) each region is connected; (2) each UAV’s initial position is inside its assigned region; (3) the number of cells per region is approximately equal; (4) there are no overlaps among regions.
Specifically, let $\mathcal{F}$ denote the set of free cells of the map. The goal is to divide the task space into $N$ responsibility areas $A_1, A_2, \ldots, A_N$, satisfying the following:
$$\bigcup_{i=1}^{N} A_i = \mathcal{F}, \qquad A_i \cap A_j = \emptyset \quad \forall i \neq j.$$
Let $f^{*} = |\mathcal{F}|/N$ be the ideal number of cells per UAV. The optimization goal of DARP is to minimize the region assignment error
$$E = \sum_{i=1}^{N} \left( |A_i| - f^{*} \right)^{2},$$
where $|A_i|$ represents the area (number of cells) assigned to the $i$-th UAV.
The algorithm iteratively updates cell assignments based on cost matrices and connectivity constraints. The process is outlined in Algorithm 2.
Algorithm 2 DARP algorithm for multi-UAV area partitioning.
- Require: Grid map M, initial positions p_1, …, p_N of the N UAVs
- Ensure: Subregions A_1, …, A_N satisfying load balance and connectivity
- 1: Initialize cost matrices E_i(x, y) ← dist((x, y), p_i), i = 1, …, N ▹ Distance-based initial cost
- 2: Assign initial labels A(x, y) ← argmin_i E_i(x, y) ▹ Initial partitioning
- 3: repeat
- 4: for each agent i ← 1 to N do
- 5: Compute the current subregion size k_i = |A_i|
- 6: Adjust the scaling coefficient m_i according to the deviation of k_i from the ideal size f*
- 7: Update E_i ← m_i · E_i ▹ Load balancing
- 8: end for
- 9: for all cells (x, y) do
- 10: A(x, y) ← argmin_i E_i(x, y) ▹ Reassign cells
- 11: end for
- 12: Enforce connectivity in each A_i ▹ Ensure connected subregions
- 13: Penalize disconnected components ▹ Guide iterative correction
- 14: until convergence or max iterations
- 15: return A_1, …, A_N
The adopted DARP algorithm partitions the free cells of a grid map among multiple UAVs while ensuring connectivity and balanced workload. Each UAV maintains a cost matrix based on distance to its initial position, and cells are initially assigned to the UAV with the lowest cost. Iteratively, each UAV's subregion size is evaluated, and the cost matrices are adjusted to penalize over- or under-loaded regions. Cells are then reassigned according to the updated costs, and connectivity of each subregion is enforced, with disconnected components penalized. This process repeats until convergence or a maximum number of iterations is reached, resulting in contiguous, balanced subregions for all UAVs. DARP significantly outperforms traditional partitioning methods, as summarized in
Table 1.
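As a rough illustration of this iterative balancing loop, the following Python sketch scales the distance-based cost matrices according to load imbalance and reassigns cells accordingly. It is a simplified sketch rather than the full DARP implementation of [18]: connectivity enforcement is omitted, and the scaling rule, learning rate, and function name are assumptions.

```python
import numpy as np

def darp_partition_sketch(free_mask: np.ndarray, starts: list[tuple[int, int]],
                          lr: float = 0.01, max_iter: int = 500) -> np.ndarray:
    """Simplified DARP-style partitioning.

    Returns an assignment map of shape (H, W) holding the index of the UAV
    responsible for each free cell (-1 for non-free cells).
    """
    h, w = free_mask.shape
    n = len(starts)
    ys, xs = np.meshgrid(np.arange(w), np.arange(h))   # xs: row indices, ys: column indices

    # Distance-based initial cost matrix for each UAV (shape: n x H x W).
    base_cost = np.stack([np.hypot(xs - r, ys - c) for r, c in starts])
    scale = np.ones(n)
    target = free_mask.sum() / n                       # ideal number of cells per UAV

    assign = np.full((h, w), -1, dtype=int)
    for _ in range(max_iter):
        cost = base_cost * scale[:, None, None]
        assign[:] = -1
        assign[free_mask] = np.argmin(cost[:, free_mask], axis=0)   # reassign free cells

        # Evaluate each UAV's load and adjust its cost scale to penalize imbalance.
        sizes = np.array([(assign == i).sum() for i in range(n)])
        if np.all(np.abs(sizes - target) <= 1):
            break                                       # approximately balanced
        scale *= 1.0 + lr * (sizes - target) / target
    return assign
```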
An example of DARP-based area division is shown in
Figure 2, where each color indicates the area assigned to a specific agent. DARP ensures balanced and connected subregions under various obstacle distributions, providing a solid foundation for reinforcement learning-based path planning.
3.3. Reinforcement Learning Path Planning Module
To generate efficient coverage paths within each UAV’s assigned region, we formulate the path planning problem as a Partially Observable Markov Decision Process (POMDP) and design a reinforcement learning (RL) framework based on an improved Dueling Deep Q-Network (Dueling DQN). The enhancements include action encoding and prioritized experience replay (PER) to improve learning stability and convergence.
3.3.1. POMDP Formulation
Due to the limited field of view (FoV) and communication constraints, each UAV can only access partial observations of the environment. Thus, the coverage problem is naturally formulated as a POMDP, which is defined by the tuple
$$\langle \mathcal{S}, \mathcal{A}, \mathcal{O}, T, R, \Omega, \gamma \rangle,$$
where $\mathcal{S}$ denotes the global state space representing the full environment status, $\mathcal{A}$ is the discrete action space available to each UAV, and $\mathcal{O}$ represents the observation space derived from local partial maps and inter-UAV communication. The function $T(s' \mid s, a)$ defines the transition probability from state $s$ to $s'$ under action $a$, while $R(s, a)$ specifies the reward received for executing action $a$ in state $s$. The observation function $\Omega(o \mid s)$ describes the probability of observing $o$ given the true state $s$, and $\gamma \in [0, 1)$ is the discount factor that balances immediate and future rewards.
Therefore, the local state $s_t^i$ of UAV $i$ at time $t$ is defined in this paper as
$$s_t^i = \left( M_t^i, \; C_t^i, \; p_t^i \right),$$
which consists of three components: (1) $M_t^i$ is the local target history map, encoding the past observed locations of targets within the UAV's field of view; (2) $C_t^i$ is the environmental coverage map, recording which areas in the local observation range have already been visited or covered; (3) $p_t^i$ represents the current position of the UAV on the map.
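A minimal sketch of how such a local state could be assembled as network input is shown below, assuming the three components are stacked as channels of a tensor; the channel layout and names are illustrative.

```python
import numpy as np

def build_local_state(target_history: np.ndarray,
                      coverage_map: np.ndarray,
                      position: tuple[int, int]) -> np.ndarray:
    """Stack the three state components into a (3, H, W) tensor for the policy network."""
    pos_map = np.zeros_like(coverage_map, dtype=np.float32)
    pos_map[position] = 1.0               # one-hot encoding of the UAV position
    return np.stack([target_history.astype(np.float32),
                     coverage_map.astype(np.float32),
                     pos_map])
```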
The action $a_t^i \in \mathcal{A}$ denotes the decision made by UAV $i$ at time $t$, and the discrete action space is defined as
$$\mathcal{A} = \{\text{up}, \; \text{down}, \; \text{left}, \; \text{right}\}.$$
At each time step, a UAV selects one of these four discrete actions to control its movement direction within its assigned subregion. Hovering is not considered, since each movement step is assumed to consume energy and should contribute to expanding coverage.
3.3.2. Action Coding Mechanisms
To enhance the semantic expressiveness of the policy network with respect to actions, this study introduces an action encoding mechanism in the advantage estimation process. A directional encoding strategy is adopted, where each action is represented as a unit displacement vector in the 2D grid space, defined as
$$e(\text{up}) = (0, 1), \quad e(\text{down}) = (0, -1), \quad e(\text{left}) = (-1, 0), \quad e(\text{right}) = (1, 0).$$
This explicit geometric representation provides not only clear spatial semantics but also captures the behavioral characteristics of each action on the map, thereby embedding prior knowledge about action relationships directly into the learning process.
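As a small illustration, the encoding can be stored as a simple lookup table; the specific coordinate convention below is an assumption.

```python
import numpy as np

# Each discrete action is mapped to a unit displacement vector in the 2D grid.
ACTION_ENCODING = {
    "up":    np.array([0.0,  1.0], dtype=np.float32),
    "down":  np.array([0.0, -1.0], dtype=np.float32),
    "left":  np.array([-1.0, 0.0], dtype=np.float32),
    "right": np.array([1.0,  0.0], dtype=np.float32),
}
```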
In conventional Deep Q-Networks (DQN), the Q-network maps the state to a set of Q-values, each corresponding to one discrete action:
$$Q_{\theta} : \mathcal{S} \rightarrow \mathbb{R}^{|\mathcal{A}|},$$
where $|\mathcal{A}|$ is the number of discrete actions.
However, a common challenge arises when the number of discrete actions becomes excessively large: the learning process of Deep Q-Networks (DQN) tends to slow down considerably. To address this issue, we propose a structural modification to the conventional Q-network. Our redesigned network no longer outputs the Q-values of all the possible actions in a single forward pass, but instead takes a set of encoded actions as input and returns their corresponding Q-values. This approach allows prior knowledge regarding the relational structure between actions, such as geometric similarities in their encoding, to be explicitly incorporated into the model. As a result, the sample efficiency is significantly improved, and the Q-values of under-sampled actions can be better generalized via their similarity to frequently observed actions.
3.3.3. Improved Dueling DQN Structure
To enhance the accuracy and convergence speed of policy evaluation in reinforcement learning, this paper proposes a structural improvement to the standard Dueling DQN architecture by introducing an action-conditioned advantage estimation mechanism.
In the standard Dueling DQN, the Q-value function is decomposed into two components: a state-value stream $V(s)$ estimating the value of being in state $s$, and an advantage stream $A(s, a)$ estimating the relative benefit of taking action $a$ in state $s$. The final Q-value is calculated as
$$Q(s, a) = V(s) + \left( A(s, a) - \frac{1}{|\mathcal{A}|} \sum_{a' \in \mathcal{A}} A(s, a') \right),$$
where $\mathcal{A}$ corresponds to the four discrete directional actions: up, down, left, and right.
In the standard implementation, the advantage stream shares a common state representation and outputs Q-values for all actions simultaneously. However, this structure lacks explicit modeling of action semantics, thereby limiting policy generalization in spatially structured environments.
To address this issue, we propose an action-conditioned Dueling DQN modification, where each action is individually processed with its semantic encoding. The enhanced computation proceeds as follows:
- (1) The input state $s$ is encoded through a convolutional encoder into a high-dimensional feature vector $\phi(s)$.
- (2) Each action $a$ is encoded as a 2D directional vector $e(a)$.
- (3) The state feature $\phi(s)$ is concatenated with each action encoding $e(a)$, forming a combined representation $[\phi(s); e(a)]$, which is then input to the fully connected advantage stream to estimate $A(s, a)$.

The value stream $V(s)$ remains unchanged and operates only on $\phi(s)$. The Q-value is finally computed using the dueling aggregation formula given above, maintaining the original Dueling DQN aggregation form.
This action-conditioned enhancement explicitly injects directional semantics into the learning process, enabling more accurate advantage estimation and better policy generalization in grid-structured coverage environments. The structure of the improved Dueling DQN is shown in
Figure 3.
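A condensed PyTorch sketch of this action-conditioned dueling architecture is given below. Layer sizes and module names are illustrative assumptions and differ from the exact network in Figure 3, but the aggregation follows the dueling form above.

```python
import torch
import torch.nn as nn

class ActionConditionedDuelingDQN(nn.Module):
    """Dueling DQN whose advantage stream is conditioned on a 2D action encoding."""

    def __init__(self, in_channels: int = 3, n_actions: int = 4, action_dim: int = 2):
        super().__init__()
        # Convolutional state encoder producing the feature vector phi(s).
        self.encoder = nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.value = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 1))
        # Advantage stream receives [phi(s); e(a)] for one action at a time.
        self.advantage = nn.Sequential(nn.Linear(64 + action_dim, 128), nn.ReLU(),
                                       nn.Linear(128, 1))
        self.n_actions = n_actions

    def forward(self, state: torch.Tensor, action_encodings: torch.Tensor) -> torch.Tensor:
        """state: (B, C, H, W); action_encodings: (n_actions, action_dim).
        Returns Q-values of shape (B, n_actions)."""
        phi = self.encoder(state)                                   # (B, 64)
        v = self.value(phi)                                         # (B, 1)
        # Pair every state feature with every action encoding.
        b = phi.size(0)
        phi_rep = phi.unsqueeze(1).expand(-1, self.n_actions, -1)   # (B, A, 64)
        enc_rep = action_encodings.unsqueeze(0).expand(b, -1, -1)   # (B, A, action_dim)
        adv = self.advantage(torch.cat([phi_rep, enc_rep], dim=-1)).squeeze(-1)  # (B, A)
        # Dueling aggregation: Q = V + (A - mean_a A).
        return v + adv - adv.mean(dim=1, keepdim=True)
```

At inference time, the four directional encodings are stacked into a (4, 2) tensor and passed together with a batch of states, so the network evaluates all actions through the shared advantage stream.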
3.3.4. Reward Function Design
To guide UAVs toward more efficient coverage behaviors and reduce unnecessary actions, a hierarchical reward function is designed in this paper, which consists of a base reward and a task-specific reward:
$$r_t = r_t^{\text{base}} + r_t^{\text{task}}.$$
In this study, the base reward $r_t^{\text{base}}$ is defined as
$$r_t^{\text{base}} = \begin{cases} +2.0, & \text{if a previously uncovered cell is covered at step } t, \\ -0.5, & \text{if a covered cell is revisited while uncovered neighbors remain}, \\ -0.05, & \text{if a covered cell is revisited and its neighborhood is fully covered}, \\ -2.0, & \text{if the action leads to an obstacle or out-of-bound cell}, \end{cases}$$
with an additional movement cost of $-0.05$ per step and a terminal reward of $+20$ when $\mathcal{C}_t = M$, where $\mathcal{C}_t$ denotes the set of grid cells covered at time step $t$, $M$ represents the set of all traversable grid cells, and $\mathcal{C}_{t-1}$ represents the set of cells the UAV had reached by the previous time step.
The task-specific reward $r_t^{\text{task}}$ comprises a boundary (frontier) exploration reward, granted when the UAV moves onto a coverage-frontier cell, and a milestone reward, granted when the cumulative coverage passes predefined progress thresholds. The coverage frontier refers to grid cells located at the boundary of the UAV's responsible region or adjacent to obstacles. These cells are of strategic importance because moving toward them often leads to the discovery of new areas or ensures complete and efficient coverage in complex environments.
The reward magnitudes were systematically determined to reflect the hierarchical objectives of the coverage task—complete coverage, path efficiency, and collision avoidance—and were validated through sensitivity analysis: (1) Positive reinforcement for first-time coverage (+2.0) was set sufficiently higher than the per-step cost (−0.05) to ensure that discovering new cells remains the primary driver of agent behavior. (2) Penalty for invalid transitions (−2.0) provides a strong negative signal to prevent collisions and out-of-bound actions, critical for safe UAV operation. (3) Movement cost (−0.05) imposes a modest penalty that discourages unnecessary steps while allowing adequate exploration. (4) Revisit penalties (−0.5/−0.05) distinguish between revisits when unexplored neighbors exist versus when the neighborhood is already fully covered, encouraging UAVs to seek new territory whenever possible. (5) Terminal reward (+20) strongly incentivizes the global objective of complete coverage.
This hierarchical reward design, together with the above justification and sensitivity validation, effectively balances exploration and task completion, prevents repetitive local behaviors, and significantly improves overall coverage efficiency.
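A compact sketch of how these magnitudes could be combined in the environment's step function is shown below. The frontier and milestone bonuses are left as named constants with assumed default values, since their exact magnitudes are not specified above.

```python
def compute_reward(newly_covered: bool, invalid_move: bool, revisit: bool,
                   has_unexplored_neighbor: bool, on_frontier: bool,
                   milestone_reached: bool, coverage_complete: bool,
                   frontier_bonus: float = 0.5, milestone_bonus: float = 1.0) -> float:
    """Hierarchical reward combining the base and task-specific terms.

    frontier_bonus and milestone_bonus are assumed values, not taken from the paper.
    """
    reward = -0.05                                    # per-step movement cost
    if invalid_move:
        reward += -2.0                                # collision / out-of-bound penalty
    elif newly_covered:
        reward += 2.0                                 # first-time coverage
    elif revisit:
        reward += -0.5 if has_unexplored_neighbor else -0.05  # graded revisit penalty

    # Task-specific shaping terms (boundary/frontier and milestone rewards).
    if on_frontier:
        reward += frontier_bonus
    if milestone_reached:
        reward += milestone_bonus
    if coverage_complete:
        reward += 20.0                                # terminal reward for full coverage
    return reward
```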
3.3.5. Priority Experience Replay Mechanism
To further improve learning efficiency, this study incorporates the prioritized experience replay (PER) mechanism. The core principle of PER is to assign sampling priority to each transition based on its Temporal-Difference (TD) error, which reflects the agent’s learning potential from that sample.
Specifically, the probability $P(i)$ of sampling a transition $i$ from the buffer is proportional to its TD error magnitude $|\delta_i|$, defined as
$$P(i) = \frac{p_i^{\alpha}}{\sum_{k} p_k^{\alpha}}, \qquad p_i = |\delta_i| + \epsilon,$$
where $\alpha \geq 0$ is a hyperparameter that controls the degree of prioritization and $\epsilon$ is a small positive constant that prevents zero sampling probability. When $\alpha = 0$, PER reduces to uniform sampling.
To compensate for the bias introduced by prioritized sampling, importance-sampling (IS) weights are introduced in the loss function:
$$\mathcal{L}(\theta) = \mathbb{E}_{i \sim P}\left[ w_i \, \delta_i^{2} \right],$$
where the IS weight $w_j$ is computed as
$$w_j = \left( \frac{1}{N_{\text{buffer}}} \cdot \frac{1}{P(j)} \right)^{\beta},$$
where $P(j)$ is the sampling probability of transition $j$, $N_{\text{buffer}}$ is the number of transitions in the replay buffer, and $\beta$ controls the degree of importance correction, which is typically annealed from a small initial value to 1 during training.
This integration of PER improves sample efficiency and enables the UAV to focus more effectively on transitions that contribute significantly to learning progress.
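The following sketch shows the standard proportional-prioritization computation in this scheme, using a simple array-based buffer rather than the sum-tree structure usually employed for efficiency; default hyperparameter values are illustrative.

```python
import numpy as np

def sample_prioritized(td_errors: np.ndarray, batch_size: int,
                       alpha: float = 0.6, beta: float = 0.4, eps: float = 1e-6):
    """Return sampled indices and importance-sampling weights from |TD error| priorities."""
    priorities = (np.abs(td_errors) + eps) ** alpha
    probs = priorities / priorities.sum()               # P(i) proportional to |delta_i|^alpha
    idx = np.random.choice(len(td_errors), size=batch_size, p=probs)
    weights = (len(td_errors) * probs[idx]) ** (-beta)   # importance-sampling correction
    weights /= weights.max()                             # normalize for stability
    return idx, weights
```

During training, the weighted squared TD errors are averaged to form the loss, and beta is annealed toward 1 as described above.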
Remark 1. As we know, end-to-end multi-agent reinforcement learning often suffers from high training complexity and path conflicts in large environments. To overcome these issues, we propose a hierarchical “DARP + local RL” framework, where DARP first divides the global workspace into connected and balanced subregions, and each UAV then learns an independent DQN-based policy within its assigned region. This integration reduces inter-UAV conflicts, accelerates policy convergence, and improves training stability by localizing the learning space and enabling parallel policy training.
4. Experiments and Evaluations
To validate the effectiveness of the proposed multi-UAV coverage path planning method based on DARP-based region partitioning and the improved Dueling DQN strategy, a series of simulation experiments were conducted. The system performance was evaluated from multiple perspectives, including coverage rate and path redundancy.
4.1. Experimental Settings and Parameter Configuration
The experiments in this study were conducted in a custom-designed 2D grid map environment, which includes multiple obstacles, no-fly zones, and inaccessible areas. The DARP algorithm was employed to partition the entire task area into several non-overlapping sub-regions, which were then assigned to multiple UAVs to perform local coverage tasks independently.
All experiments were implemented in Python 3.10 using PyTorch 2.5.1, with the hardware featuring an Intel Core i5-12600KF CPU (Intel Corporation, Santa Clara, CA, USA) and an NVIDIA GeForce RTX 3070 Ti GPU (NVIDIA Corporation, Santa Clara, CA, USA). Each UAV's path planning policy was trained separately using the proposed improved Dueling DQN network. The parameter settings used in the training are listed in
Table 2.
In each experiment, the environment size was set to . UAV starting locations were randomized within non-obstacle and non-no-fly-zone cells to ensure each episode began with a valid initial state. A unified random seed (seed = 42) was used for all randomization processes (obstacle placement, UAV initialization, network weight initialization) to eliminate random variability and enable result reproduction.
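For reference, the unified seeding described above corresponds to fixing all random number generators at the start of each run, for example as follows; the helper name is illustrative.

```python
import random
import numpy as np
import torch

def set_global_seed(seed: int = 42) -> None:
    """Fix all sources of randomness (obstacle layout, UAV initialization, network weights)."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
```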
4.2. Experimental Results and Visual Analysis
The coverage path trajectories produced by different path planning strategies in the same obstacle map environment are visualized in
Figure 4. The results in
Figure 4 show the following: (a) Improved DQN method: the trajectory sweeps uniformly from the edge toward the center, effectively covering every area with almost no crossings or overlaps, reflecting good global exploration ability and coverage efficiency. (b) Initial DQN: the trajectory contains a large number of local loops and frequent revisits, especially near obstacles, indicating that the policy tends to fall into local optima. (c) Boustrophedon method: the path follows a regular "Z"-shaped pattern but lacks adaptation to the complex obstacle layout, leaving some areas uncovered. (d) Inner spiral coverage method: the path is compact but exhibits significant blind spots around irregular obstacles and map edges. (e) Spanning Tree Coverage (STC) method: the paths are markedly redundant, with many invalid repetitions, especially at region edges and branch nodes, resulting in high overlap and path waste.
To evaluate the adaptability of the improved DQN method under varied environmental complexities,
Figure 5 visualizes its coverage trajectories in maps with different sizes and obstacle densities.
From the overall performance in
Figure 4 and
Figure 5, the improved DQN not only covers a wider area but also produces smoother and more reasonably distributed paths, significantly outperforming both the traditional methods and the unimproved network.
4.3. Analysis of Experimental Results
The performance comparisons of UAVs when participating in tasks in a
map based on different methods are listed in
Table 3. In this study, all experiments were independently conducted 10 times, and the final results are presented as the mean values.
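A sketch of how these indicators can be computed from an executed trajectory is given below, assuming the coverage ratio is the fraction of traversable cells visited and the repetition ratio is the fraction of steps that land on an already-visited cell; these definitions, the function name, and the data structures are assumptions for illustration.

```python
def evaluate_trajectory(path: list[tuple[int, int]],
                        traversable_cells: set[tuple[int, int]]):
    """Compute coverage ratio, repetition ratio, and total steps for one coverage run."""
    visited: set[tuple[int, int]] = set()
    repeated_steps = 0
    for cell in path:
        if cell in visited:
            repeated_steps += 1           # step onto an already-covered cell
        visited.add(cell)
    coverage_ratio = len(visited & traversable_cells) / len(traversable_cells)
    repetition_ratio = repeated_steps / max(len(path), 1)
    return coverage_ratio, repetition_ratio, len(path)
```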
As can be seen from
Table 3, the improved DQN method proposed in this paper achieves significant advantages in the three core indicators. The coverage ratio reaches 0.99, which is about 11% higher than that of the initial DQN (0.88), indicating that the method completes the coverage of the mission area more thoroughly. At the same time, the repetition ratio drops from 0.32 to 0.04, a decrease of 87.5%, significantly reducing duplicate visits along the path and improving resource efficiency. In terms of task execution efficiency, the total number of path steps decreases from 478 to 361, a reduction of about 24.5%, reflecting a more streamlined path and more efficient task completion.
Furthermore, in terms of training efficiency, the improved DQN required approximately 3 h 57 min to reach convergence, compared to 4 h 11 min for the initial DQN, which demonstrates that the proposed method effectively accelerates the training process through its enhanced network structure and learning strategy. This acceleration reduces the overall computational cost while maintaining high performance across all evaluation metrics. Moreover, after training, the proposed model can operate in real time during the execution phase, further verifying its practicality and applicability in UAV coverage tasks that require rapid and adaptive decision-making.
Figure 6 shows the trend of coverage ratio and reward during the training process, which further verifies the advantages of the proposed method in terms of policy convergence speed and performance stability. The results in
Figure 6 show that the improved DQN exhibits a faster learning speed at the early stage of training, and its coverage-rate and reward curves are significantly higher than those of the initial DQN, indicating that the policy evolves toward efficient coverage from the beginning. In the middle and late stages of training, the improved DQN curves become stable and approach the optimal value with little fluctuation, indicating that the network has effectively converged and possesses better generalization ability and stability. In contrast, the coverage rate of the initial DQN increases slowly and its reward fluctuates strongly over a long period, suggesting that the policy gets stuck in local optima and learning is unstable. These trends show that the proposed method not only achieves better final performance but also makes the training process more efficient and reliable, further enhancing its practicability.
The underlying reasons for the above significant performance improvements are as follows: (1) The improved Dueling DQN network structure greatly improves the accuracy of state-value evaluation by introducing action semantics and an independent advantage estimation path, enabling agents to plan paths more reasonably and avoid repeated revisits and local stagnation. (2) The action encoding mechanism transforms the original discrete actions into spatial direction vectors and injects spatial semantics into the high-dimensional state representation, giving the policy network stronger direction recognition and decision-making capabilities and reducing policy blindness and boundary oscillation. (3) The prioritized experience replay (PER) mechanism guides the network toward the most strategically valuable transitions by preferentially learning from samples with high TD error, which significantly accelerates training, reduces invalid exploration, and directly explains the rapid rise of the reward curve in
Figure 6. (4) Map preprocessing (DFS) and DARP task division work together to construct a cleaner and more explicit training input state, eliminating the interference caused by unreachable areas and achieving regional connectivity and task-balanced distribution, creating ideal environmental conditions for each agent's policy training.
4.4. Ablation Experiment
4.4.1. Structural Module Ablation
This section studies the influence of five key modules in the proposed architecture by removing them one at a time: (i) map preprocessing (DFS), (ii) region partitioning (DARP), (iii) prioritized experience replay (PER), (iv) action encoding, and (v) Dueling Architecture Improvement. The experimental results are summarized in
Table 4.
The results clearly show that each component contributes significantly to the system’s effectiveness: (1) Removing DFS preprocessing leads to the most severe performance drop. Coverage ratio declines to 0.91, and repetition ratio triples to 0.12. This is because unreachable regions interfere with learning and cause the agent to waste actions exploring invalid paths. (2) Without DARP, region assignments become imbalanced and disconnected, resulting in increased path overlap and coverage inefficiency. (3) Disabling PER reduces learning efficiency. Although the final coverage ratio remains acceptable (0.95), the repetition ratio rises, and more training steps are required, indicating slower convergence and less efficient exploration. (4) Excluding action encoding impairs the UAV’s understanding of directional semantics. This leads to suboptimal decision-making, especially in spatially constrained environments, causing higher redundancy (0.06) and longer paths. (5) Using the standard Dueling DQN instead of the improved architecture results in a less accurate estimation of Q-values due to insufficient modeling of action context. This degrades policy quality, with a repetition ratio of 0.08 and a longer execution path (368 steps).
4.4.2. Reward Function Ablation
To assess the impact of the task-specific reward $r_t^{\text{task}}$, this article further tests the UAV's performance after removing its two components: the boundary reward and the milestone reward. The results are shown in
Table 5 and
Figure 7.
Table 5 summarizes the final task performance of the UAV under different reward settings. When the boundary reward is removed, the repetition ratio increases significantly (0.04 → 0.09), indicating that the UAV tends to hover around central regions and neglect less accessible border areas. Similarly, the absence of the milestone reward reduces the UAV’s motivation to make long-range progress, which results in a higher repetition ratio (0.07) and an inability to reduce total steps. The removal of both components leads to a noticeable degradation in coverage (down to 0.93) and the worst overall efficiency.
Figure 7 illustrates the learning curves of coverage ratio and cumulative reward throughout training under different reward settings. The full reward setting shows the fastest growth in both metrics, reaching convergence earlier and maintaining stability over time. In contrast, curves under ablated reward settings either grow more slowly, plateau at lower levels, or exhibit larger fluctuations, suggesting weaker learning signals and delayed convergence.
Combining
Table 5 and
Figure 7, we can find that (1) the boundary exploration reward effectively reduces “detour” behavior and improves the coverage quality at the edges and (2) the coverage progress milestone reward has a significant positive effect on breaking sparse rewards and accelerating training convergence.
4.5. Experiment with Different Numbers of UAVs Under a Large Environment
To further test the scalability of the proposed method with varying numbers of UAVs, experiments were conducted within the
grid environment. The results are shown in
Table 6.
Table 6 shows that as the number of UAVs increases, the average number of steps required per UAV decreases, while the coverage ratio remains consistently high (above 0.97), and the repetition ratio steadily decreases. This indicates that the system achieves better parallelism and division of labor with more UAVs, while effectively avoiding redundant coverage and path conflicts.
Figure 8 visualizes the coverage paths of seven UAVs under the DARP-based region division and improved DQN strategy. The paths are clearly distributed in distinct subregions without overlap, demonstrating that the region division is effective and the policy ensures cooperative, non-conflicting coverage.
5. Conclusions
This paper focuses on the multi-UAV complete coverage path planning problem and proposes a cooperative coverage method that combines DARP regional partitioning with an improved Dueling DQN reinforcement learning structure. The aim is to improve path efficiency and task completion in large-scale obstacle environments. In this study, a map preprocessing module is proposed, which significantly reduces the ineffective interference in the state space and policy learning. Then, the dynamic task allocation mechanism is presented, to perform load-balanced regional partitioning of the map, making multi-UAV task collaboration more efficient and reasonable. In addition, an improved reinforcement learning structure is designed to improve policy accuracy and training convergence. The experimental results show that the proposed method performs excellently in static complex maps and also demonstrates good scalability and modularity, making it suitable for a broader range of multi-UAV task scenarios.
Although the proposed method has achieved promising results in simulations, there are still limitations in practical engineering applications. Some parameters, such as the reward values and their coefficients, were tuned for the simulation environment, and their adaptability to real, complex scenarios has not been verified. In addition, key dynamic factors unique to unmanned aerial vehicles (UAVs), including battery constraints, three-dimensional navigation, weak communication, wind interference, and sensor noise, have not been considered. Moreover, the current model simplifies vehicle motion by assuming equal cost for all directional movements, whereas in real UAV operations such maneuvers typically incur higher energy and time costs. Future work will incorporate strafing and turning penalties and other dynamic constraints to enhance the physical realism and practical applicability of the proposed method.