1. Introduction
In recent years, with the rapid development of unmanned systems and mobile robotics [1,2,3,4,5], coverage path planning (CPP) of multiple unmanned aerial vehicles (UAVs) has been widely applied in various real-world scenarios, such as environmental monitoring [6,7], agricultural inspection [8,9], security patrolling [10,11], and disaster search and rescue [12,13]. The core objective of CPP is to plan one or multiple trajectories in environments with obstacles and spatial constraints, enabling agents to efficiently and completely traverse all reachable areas within the task region, thereby maximizing spatial coverage while minimizing path redundancy and resource consumption.
In practical applications, traditional CPP methods, such as Boustrophedon coverage [14], spiral path planning [15], and graph-based Spanning Tree Coverage (STC) [16], are structurally simple and easy to deploy. However, they often encounter significant challenges in complex environments, including low coverage efficiency, severe path redundancy, poor adaptability to obstacles, and limited scalability to multi-UAV cooperative tasks. To address these issues, various techniques, such as heuristic search [17], graph partitioning [18], clustering algorithms [19], and bio-inspired methods [20,21], have been presented in recent years. Improved ant colony optimization [22,23], genetic algorithms [24,25], fuzzy logic approaches [26], and K-Means clustering [27] are among the methods that have shown improvements in path quality. Nevertheless, these approaches still face difficulties in policy learning and adaptive decision-making in high-dimensional and dynamic environments.
With the rapid advances in Deep Reinforcement Learning (DRL), its applications to CPP problems have demonstrated great potential [28,29,30,31]. Reinforcement learning optimizes policies through interaction with the environment, making it well-suited for complex, dynamic systems that are difficult to model explicitly. The introduction of the Dueling Deep Q-Network (Dueling DQN) architecture has further enhanced the stability of value function estimation and improved policy convergence [32]. Consequently, adaptive coverage strategy learning based on reinforcement learning has become a research hotspot in recent years.
On the other hand, the incorporation of multi-agent systems (MAS) has significantly expanded the application boundaries of CPP [33,34,35]. Compared to single-agent systems, multi-agent systems offer higher spatial parallelism and task execution efficiency when dealing with large-scale environments. However, multi-UAV cooperative CPP tasks pose two critical challenges: reasonable task area partitioning and path conflict management. Traditional methods such as K-Means clustering [27] or regular grid division often fail to ensure region connectivity and load balancing, leading to area overlaps and inter-agent conflicts. To overcome these issues, this paper introduces the Divide Areas based on Robots' initial Positions (DARP) algorithm [18], which dynamically partitions the task space based on UAVs' starting positions and obstacle layouts, ensuring that each UAV is assigned a reasonable and connected task region, thereby significantly reducing path redundancy.
In summary, this paper proposes a novel multi-UAV coverage path planning method that combines DARP-based area partitioning with an improved Dueling DQN reinforcement learning framework. The main contributions are as follows: (1) A map preprocessing mechanism based on Depth-First Search (DFS) is introduced to eliminate completely unreachable areas, enhancing policy learning efficiency. (2) The DARP algorithm is utilized to achieve dynamic multi-UAV task partitioning, ensuring load balancing and minimizing inter-agent path conflicts. (3) An enhanced Dueling DQN network is constructed by incorporating action encoding and prioritized experience replay, improving policy generalization and training stability. Finally, comprehensive experimental evaluations are conducted on several benchmark maps and the results demonstrate that the proposed method significantly outperforms traditional approaches in terms of coverage rate, redundancy rate, and path efficiency.
3. Proposed Method
To realize the task defined in
Section 2, an integrated method is proposed in this paper, including three main parts. First, a map preprocessing method is proposed to remove unreachable areas. Then, a regional division and task allocation module is presented to improve the efficiency of the collaborative operation. At last, a path planning module based on reinforcement learning is proposed to realize the final task.
3.1. Map Preprocessing Method
In real-world environments, due to the irregular distribution of obstacles, there may exist free regions that are completely enclosed by obstacles. Although these regions are labeled as "traversable" in the grid map, they are actually inaccessible from the map boundaries. If not properly handled, such regions would severely impair the training efficiency of reinforcement learning algorithms.
To address this issue, this paper designs a map preprocessing module based on Depth-First Search (DFS) to automatically identify and relabel these inaccessible but non-obstacle regions. The core idea is as follows: (1) Starting from all traversable cells along the map boundaries, perform DFS exploration. (2) Mark all unvisited free cells during the search as “inaccessible”. (3) Treat these cells as obstacles during the reinforcement learning process to avoid invalid path exploration.
Notably, this DFS serves exclusively as a static, offline environment-cleaning tool for one-time topological screening during initialization, rather than for path generation or dynamic decision-making—distinguishing it from the subsequent Dueling DQN-based CPP. In addition, this DFS-based preprocessing is a coarse-grained method that does not require a highly detailed map; it can operate effectively on low-resolution maps or satellite imagery, simplifying data acquisition and preprocessing requirements.
The overall process is illustrated in Algorithm 1, and the effect of this preprocessing is shown in
Figure 1.
Algorithm 1 Preprocessing unreachable areas via Depth-First Search (DFS) strategy.
- Require: Grid map M of size H × W, where M[x][y] = 1 indicates an obstacle and 0 a free cell
- Ensure: Updated M in which unreachable free cells are marked as 3
- 1: visited ← ∅ ▹ Initialize a set to record reachable cells
- 2: function DFS(x, y)
- 3: if x < 0 or x ≥ H or y < 0 or y ≥ W then
- 4: return ▹ Return if outside the map boundaries
- 5: end if
- 6: if (x, y) ∈ visited or M[x][y] = 1 then
- 7: return ▹ Skip already visited or obstacle cells
- 8: end if
- 9: Add (x, y) to visited
- 10: for (dx, dy) ∈ {(−1, 0), (1, 0), (0, −1), (0, 1)} do
- 11: DFS(x + dx, y + dy) ▹ Recursively explore four neighbors
- 12: end for
- 13: end function
- 14: for x ← 0 to H − 1 do
- 15: for y ← 0 to W − 1 do
- 16: if x = 0 or x = H − 1 or y = 0 or y = W − 1 then
- 17: if M[x][y] = 0 then
- 18: DFS(x, y) ▹ Start DFS from each free border cell
- 19: end if
- 20: end if
- 21: end for
- 22: end for
- 23: for x ← 0 to H − 1 do
- 24: for y ← 0 to W − 1 do
- 25: if M[x][y] = 0 and (x, y) ∉ visited then
- 26: M[x][y] ← 3 ▹ Mark unreachable free cells as 3
- 27: end if
- 28: end for
- 29: end for
DFS is launched from every free border cell, recursively exploring four connected neighbors while skipping obstacles and already visited cells. After all reachable areas are discovered, any remaining free cell not in the visited set is marked as 3, indicating it is unreachable. This method can be efficiently executed during the map initialization phase, offering advantages such as zero learning cost and high parallelizability. Compared with traditional flood-fill algorithms, the proposed approach provides a simpler search process and better integration with reinforcement learning environments.
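For illustration, a minimal Python sketch of this preprocessing step is given below. It assumes the grid is stored as a 2D NumPy array in which 0 denotes a free cell and 1 an obstacle; the function name, the iterative (stack-based) formulation, and the array layout are illustrative assumptions rather than the exact implementation used in this work.

```python
import numpy as np

def mark_unreachable_cells(grid: np.ndarray) -> np.ndarray:
    """Relabel free cells (0) that cannot be reached from the map border as 3.

    Assumes 0 = free and 1 = obstacle; returns a modified copy of the grid.
    """
    h, w = grid.shape
    visited = np.zeros((h, w), dtype=bool)

    # Seed the search with every free cell on the map border.
    stack = [(x, y)
             for x in range(h) for y in range(w)
             if (x in (0, h - 1) or y in (0, w - 1)) and grid[x, y] == 0]

    # Iterative DFS over the 4-connected neighborhood (avoids recursion limits).
    while stack:
        x, y = stack.pop()
        if not (0 <= x < h and 0 <= y < w):
            continue                      # outside the map boundaries
        if visited[x, y] or grid[x, y] == 1:
            continue                      # already visited or an obstacle
        visited[x, y] = True
        stack.extend([(x - 1, y), (x + 1, y), (x, y - 1), (x, y + 1)])

    # Any free cell never reached from the border is marked as unreachable (3).
    result = grid.copy()
    result[(grid == 0) & (~visited)] = 3
    return result
```

In the reinforcement learning environment, cells labeled 3 can then be treated exactly like obstacles during state construction and action masking.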
3.2. Area Division via DARP
In multi-UAV coverage path planning tasks, avoiding conflicts between UAVs and ensuring balanced workload distribution are key challenges. To address this issue, we adopt the Divide Areas based on Robots' initial Positions (DARP) algorithm, first introduced by Kapoutsis et al. [18], to partition the traversable space into disjoint and connected regions, each assigned to a single UAV.
DARP guarantees that (1) each region is connected; (2) each UAV’s initial position is inside its assigned region; (3) the number of cells per region is approximately equal; (4) there are no overlaps among regions.
Specifically, let $\mathcal{F}$ denote the set of free cells of the map. The goal is to divide the task space into $N$ responsibility areas $A_1, A_2, \ldots, A_N$, satisfying the following:
$$\bigcup_{i=1}^{N} A_i = \mathcal{F}, \qquad A_i \cap A_j = \emptyset \quad \forall i \neq j.$$
Let $f^{*} = |\mathcal{F}|/N$ be the ideal number of cells per UAV. The optimization goal of DARP is to minimize the region assignment error
$$E = \sum_{i=1}^{N} \left( |A_i| - f^{*} \right)^{2},$$
where $|A_i|$ represents the area (number of cells) assigned to the $i$-th UAV.
The algorithm iteratively updates cell assignments based on cost matrices and connectivity constraints. The process is outlined in Algorithm 2.
Algorithm 2 DARP algorithm for multi-UAV area partitioning.
- Require: Grid map M, initial positions p_1, …, p_N of the N UAVs
- Ensure: Subregions A_1, …, A_N satisfying load balance and connectivity
- 1: Initialize cost matrices E_i(x, y) ← dist((x, y), p_i), i = 1, …, N ▹ Distance-based initial cost
- 2: Assign initial labels A(x, y) ← argmin_i E_i(x, y) ▹ Initial partitioning
- 3: repeat
- 4: for each agent i ← 1 to N do
- 5: Compute the current subregion size k_i = |A_i|
- 6: Adjust the scaling coefficient m_i according to the deviation of k_i from the ideal size f*
- 7: Update E_i ← m_i · E_i ▹ Load balancing
- 8: end for
- 9: for all cells (x, y) do
- 10: A(x, y) ← argmin_i E_i(x, y) ▹ Reassign cells
- 11: end for
- 12: Enforce connectivity in each A_i ▹ Ensure connected subregions
- 13: Penalize disconnected components ▹ Guide iterative correction
- 14: until convergence or max iterations
- 15: return A_1, …, A_N
The adopted DARP algorithm partitions the free cells of a grid map among multiple UAVs while ensuring connectivity and balanced workload. Each UAV maintains a cost matrix based on distance to its initial position, and cells are initially assigned to the UAV with the lowest cost. Iteratively, each UAV's subregion size is evaluated, and the cost matrices are adjusted to penalize over- or under-loaded regions. Cells are then reassigned according to the updated costs, and connectivity of each subregion is enforced, with disconnected components penalized. This process repeats until convergence or a maximum number of iterations is reached, resulting in contiguous, balanced subregions for all UAVs. DARP significantly outperforms traditional partitioning methods, as summarized in
Table 1.
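As a rough illustration of this iterative balancing loop, the following Python sketch scales the distance-based cost matrices according to load imbalance and reassigns cells accordingly. It is a simplified sketch rather than the full DARP implementation of [18]: connectivity enforcement is omitted, and the scaling rule, learning rate, and function name are assumptions.

```python
import numpy as np

def darp_partition_sketch(free_mask: np.ndarray, starts: list[tuple[int, int]],
                          lr: float = 0.01, max_iter: int = 500) -> np.ndarray:
    """Simplified DARP-style partitioning.

    Returns an assignment map of shape (H, W) holding the index of the UAV
    responsible for each free cell (-1 for non-free cells).
    """
    h, w = free_mask.shape
    n = len(starts)
    ys, xs = np.meshgrid(np.arange(w), np.arange(h))   # xs: row indices, ys: column indices

    # Distance-based initial cost matrix for each UAV (shape: n x H x W).
    base_cost = np.stack([np.hypot(xs - r, ys - c) for r, c in starts])
    scale = np.ones(n)
    target = free_mask.sum() / n                       # ideal number of cells per UAV

    assign = np.full((h, w), -1, dtype=int)
    for _ in range(max_iter):
        cost = base_cost * scale[:, None, None]
        assign[:] = -1
        assign[free_mask] = np.argmin(cost[:, free_mask], axis=0)   # reassign free cells

        # Evaluate each UAV's load and adjust its cost scale to penalize imbalance.
        sizes = np.array([(assign == i).sum() for i in range(n)])
        if np.all(np.abs(sizes - target) <= 1):
            break                                       # approximately balanced
        scale *= 1.0 + lr * (sizes - target) / target
    return assign
```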
An example of DARP-based area division is shown in
Figure 2, where each color indicates the area assigned to a specific agent. DARP ensures balanced and connected subregions under various obstacle distributions, providing a solid foundation for reinforcement learning-based path planning.
3.3. Reinforcement Learning Path Planning Module
To generate efficient coverage paths within each UAV’s assigned region, we formulate the path planning problem as a Partially Observable Markov Decision Process (POMDP) and design a reinforcement learning (RL) framework based on an improved Dueling Deep Q-Network (Dueling DQN). The enhancements include action encoding and prioritized experience replay (PER) to improve learning stability and convergence.
3.3.1. POMDP Formulation
Due to the limited field of view (FoV) and communication constraints, each UAV can only access partial observations of the environment. Thus, the coverage problem is naturally formulated as a POMDP, which is defined by the tuple
$$\langle \mathcal{S}, \mathcal{A}, \mathcal{O}, T, R, \Omega, \gamma \rangle,$$
where $\mathcal{S}$ denotes the global state space representing the full environment status, $\mathcal{A}$ is the discrete action space available to each UAV, and $\mathcal{O}$ represents the observation space derived from local partial maps and inter-UAV communication. The function $T(s' \mid s, a)$ defines the transition probability from state $s$ to $s'$ under action $a$, while $R(s, a)$ specifies the reward received for executing action $a$ in state $s$. The observation function $\Omega(o \mid s)$ describes the probability of observing $o$ given the true state $s$, and $\gamma \in [0, 1)$ is the discount factor that balances immediate and future rewards.
Therefore, the local state $s_t^i$ of UAV $i$ at time $t$ is defined in this paper as
$$s_t^i = \left( M_t^i, \; C_t^i, \; p_t^i \right),$$
which consists of three components: (1) $M_t^i$ is the local target history map, encoding the past observed locations of targets within the UAV's field of view; (2) $C_t^i$ is the environmental coverage map, recording which areas in the local observation range have already been visited or covered; (3) $p_t^i$ represents the current position of the UAV on the map.
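A minimal sketch of how such a local state could be assembled as network input is shown below, assuming the three components are stacked as channels of a tensor; the channel layout and names are illustrative.

```python
import numpy as np

def build_local_state(target_history: np.ndarray,
                      coverage_map: np.ndarray,
                      position: tuple[int, int]) -> np.ndarray:
    """Stack the three state components into a (3, H, W) tensor for the policy network."""
    pos_map = np.zeros_like(coverage_map, dtype=np.float32)
    pos_map[position] = 1.0               # one-hot encoding of the UAV position
    return np.stack([target_history.astype(np.float32),
                     coverage_map.astype(np.float32),
                     pos_map])
```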
The action $a_t^i \in \mathcal{A}$ denotes the decision made by UAV $i$ at time $t$, and the discrete action space is defined as
$$\mathcal{A} = \{\text{up}, \; \text{down}, \; \text{left}, \; \text{right}\}.$$
At each time step, a UAV selects one of these four discrete actions to control its movement direction within its assigned subregion. Hovering is not considered, since each movement step is assumed to consume energy and should contribute to expanding coverage.
3.3.2. Action Coding Mechanisms
To enhance the semantic expressiveness of the policy network with respect to actions, this study introduces an action encoding mechanism in the advantage estimation process. A directional encoding strategy is adopted, where each action is represented as a unit displacement vector in the 2D grid space, defined as
$$e(\text{up}) = (0, 1), \quad e(\text{down}) = (0, -1), \quad e(\text{left}) = (-1, 0), \quad e(\text{right}) = (1, 0).$$
This explicit geometric representation provides not only clear spatial semantics but also captures the behavioral characteristics of each action on the map, thereby embedding prior knowledge about action relationships directly into the learning process.
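As a small illustration, the encoding can be stored as a simple lookup table; the specific coordinate convention below is an assumption.

```python
import numpy as np

# Each discrete action is mapped to a unit displacement vector in the 2D grid.
ACTION_ENCODING = {
    "up":    np.array([0.0,  1.0], dtype=np.float32),
    "down":  np.array([0.0, -1.0], dtype=np.float32),
    "left":  np.array([-1.0, 0.0], dtype=np.float32),
    "right": np.array([1.0,  0.0], dtype=np.float32),
}
```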
In conventional Deep Q-Networks (DQN), the Q-network maps the state to a set of Q-values, each corresponding to one discrete action:
$$Q_{\theta} : \mathcal{S} \rightarrow \mathbb{R}^{|\mathcal{A}|},$$
where $|\mathcal{A}|$ is the number of discrete actions.
However, a common challenge arises when the number of discrete actions becomes excessively large: the learning process of Deep Q-Networks (DQN) tends to slow down considerably. To address this issue, we propose a structural modification to the conventional Q-network. Our redesigned network no longer outputs the Q-values of all the possible actions in a single forward pass, but instead takes a set of encoded actions as input and returns their corresponding Q-values. This approach allows prior knowledge regarding the relational structure between actions, such as geometric similarities in their encoding, to be explicitly incorporated into the model. As a result, the sample efficiency is significantly improved, and the Q-values of under-sampled actions can be better generalized via their similarity to frequently observed actions.
3.3.3. Improved Dueling DQN Structure
To enhance the accuracy and convergence speed of policy evaluation in reinforcement learning, this paper proposes a structural improvement to the standard Dueling DQN architecture by introducing an action-conditioned advantage estimation mechanism.
In the standard Dueling DQN, the Q-value function is decomposed into two components: a state-value stream $V(s)$ estimating the value of being in state $s$, and an advantage stream $A(s, a)$ estimating the relative benefit of taking action $a$ in state $s$. The final Q-value is calculated as
$$Q(s, a) = V(s) + \left( A(s, a) - \frac{1}{|\mathcal{A}|} \sum_{a' \in \mathcal{A}} A(s, a') \right),$$
where $\mathcal{A}$ corresponds to the four discrete directional actions: up, down, left, and right.
In the standard implementation, the advantage stream shares a common state representation and outputs Q-values for all actions simultaneously. However, this structure lacks explicit modeling of action semantics, thereby limiting policy generalization in spatially structured environments.
To address this issue, we propose an action-conditioned Dueling DQN modification, where each action is individually processed with its semantic encoding. The enhanced computation proceeds as follows:
- (1) The input state $s$ is encoded through a convolutional encoder into a high-dimensional feature vector $\phi(s)$.
- (2) Each action $a$ is encoded as a 2D directional vector $e(a)$.
- (3) The state feature $\phi(s)$ is concatenated with each action encoding $e(a)$, forming a combined representation $[\phi(s); e(a)]$, which is then input to the fully connected advantage stream to estimate $A(s, a)$.

The value stream $V(s)$ remains unchanged and operates only on $\phi(s)$. The Q-value is finally computed using the dueling aggregation formula given above, maintaining the original Dueling DQN aggregation form.
This action-conditioned enhancement explicitly injects directional semantics into the learning process, enabling more accurate advantage estimation and better policy generalization in grid-structured coverage environments. The structure of the improved Dueling DQN is shown in
Figure 3.
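A condensed PyTorch sketch of this action-conditioned dueling architecture is given below. Layer sizes and module names are illustrative assumptions and differ from the exact network in Figure 3, but the aggregation follows the dueling form above.

```python
import torch
import torch.nn as nn

class ActionConditionedDuelingDQN(nn.Module):
    """Dueling DQN whose advantage stream is conditioned on a 2D action encoding."""

    def __init__(self, in_channels: int = 3, n_actions: int = 4, action_dim: int = 2):
        super().__init__()
        # Convolutional state encoder producing the feature vector phi(s).
        self.encoder = nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.value = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 1))
        # Advantage stream receives [phi(s); e(a)] for one action at a time.
        self.advantage = nn.Sequential(nn.Linear(64 + action_dim, 128), nn.ReLU(),
                                       nn.Linear(128, 1))
        self.n_actions = n_actions

    def forward(self, state: torch.Tensor, action_encodings: torch.Tensor) -> torch.Tensor:
        """state: (B, C, H, W); action_encodings: (n_actions, action_dim).
        Returns Q-values of shape (B, n_actions)."""
        phi = self.encoder(state)                                   # (B, 64)
        v = self.value(phi)                                         # (B, 1)
        # Pair every state feature with every action encoding.
        b = phi.size(0)
        phi_rep = phi.unsqueeze(1).expand(-1, self.n_actions, -1)   # (B, A, 64)
        enc_rep = action_encodings.unsqueeze(0).expand(b, -1, -1)   # (B, A, action_dim)
        adv = self.advantage(torch.cat([phi_rep, enc_rep], dim=-1)).squeeze(-1)  # (B, A)
        # Dueling aggregation: Q = V + (A - mean_a A).
        return v + adv - adv.mean(dim=1, keepdim=True)
```

At inference time, the four directional encodings are stacked into a (4, 2) tensor and passed together with a batch of states, so the network evaluates all actions through the shared advantage stream.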
3.3.4. Reward Function Design
To guide UAVs toward more efficient coverage behaviors and reduce unnecessary actions, a hierarchical reward function is designed in this paper, which consists of a base reward and a task-specific reward:
$$r_t = r_t^{\text{base}} + r_t^{\text{task}}.$$
In this study, the base reward $r_t^{\text{base}}$ is defined as
$$r_t^{\text{base}} = \begin{cases} +2.0, & \text{if a previously uncovered cell is covered at step } t, \\ -0.5, & \text{if a covered cell is revisited while uncovered neighbors remain}, \\ -0.05, & \text{if a covered cell is revisited and its neighborhood is fully covered}, \\ -2.0, & \text{if the action leads to an obstacle or out-of-bound cell}, \end{cases}$$
with an additional movement cost of $-0.05$ per step and a terminal reward of $+20$ when $\mathcal{C}_t = M$, where $\mathcal{C}_t$ denotes the set of grid cells covered at time step $t$, $M$ represents the set of all traversable grid cells, and $\mathcal{C}_{t-1}$ represents the set of cells the UAV had reached by the previous time step.
The task-specific reward $r_t^{\text{task}}$ comprises a boundary (frontier) exploration reward, granted when the UAV moves onto a coverage-frontier cell, and a milestone reward, granted when the cumulative coverage passes predefined progress thresholds. The coverage frontier refers to grid cells located at the boundary of the UAV's responsible region or adjacent to obstacles. These cells are of strategic importance because moving toward them often leads to the discovery of new areas or ensures complete and efficient coverage in complex environments.
The reward magnitudes were systematically determined to reflect the hierarchical objectives of the coverage task—complete coverage, path efficiency, and collision avoidance—and were validated through sensitivity analysis: (1) Positive reinforcement for first-time coverage (+2.0) was set sufficiently higher than the per-step cost (−0.05) to ensure that discovering new cells remains the primary driver of agent behavior. (2) Penalty for invalid transitions (−2.0) provides a strong negative signal to prevent collisions and out-of-bound actions, critical for safe UAV operation. (3) Movement cost (−0.05) imposes a modest penalty that discourages unnecessary steps while allowing adequate exploration. (4) Revisit penalties (−0.5/−0.05) distinguish between revisits when unexplored neighbors exist versus when the neighborhood is already fully covered, encouraging UAVs to seek new territory whenever possible. (5) Terminal reward (+20) strongly incentivizes the global objective of complete coverage.
This hierarchical reward design, together with the above justification and sensitivity validation, effectively balances exploration and task completion, prevents repetitive local behaviors, and significantly improves overall coverage efficiency.
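A compact sketch of how these magnitudes could be combined in the environment's step function is shown below. The frontier and milestone bonuses are left as named constants with assumed default values, since their exact magnitudes are not specified above.

```python
def compute_reward(newly_covered: bool, invalid_move: bool, revisit: bool,
                   has_unexplored_neighbor: bool, on_frontier: bool,
                   milestone_reached: bool, coverage_complete: bool,
                   frontier_bonus: float = 0.5, milestone_bonus: float = 1.0) -> float:
    """Hierarchical reward combining the base and task-specific terms.

    frontier_bonus and milestone_bonus are assumed values, not taken from the paper.
    """
    reward = -0.05                                    # per-step movement cost
    if invalid_move:
        reward += -2.0                                # collision / out-of-bound penalty
    elif newly_covered:
        reward += 2.0                                 # first-time coverage
    elif revisit:
        reward += -0.5 if has_unexplored_neighbor else -0.05  # graded revisit penalty

    # Task-specific shaping terms (boundary/frontier and milestone rewards).
    if on_frontier:
        reward += frontier_bonus
    if milestone_reached:
        reward += milestone_bonus
    if coverage_complete:
        reward += 20.0                                # terminal reward for full coverage
    return reward
```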
3.3.5. Priority Experience Replay Mechanism
To further improve learning efficiency, this study incorporates the prioritized experience replay (PER) mechanism. The core principle of PER is to assign sampling priority to each transition based on its Temporal-Difference (TD) error, which reflects the agent’s learning potential from that sample.
Specifically, the probability $P(i)$ of sampling a transition $i$ from the buffer is proportional to its TD error magnitude $|\delta_i|$, defined as
$$P(i) = \frac{p_i^{\alpha}}{\sum_{k} p_k^{\alpha}}, \qquad p_i = |\delta_i| + \epsilon,$$
where $\alpha \geq 0$ is a hyperparameter that controls the degree of prioritization and $\epsilon$ is a small positive constant that prevents zero sampling probability. When $\alpha = 0$, PER reduces to uniform sampling.
To compensate for the bias introduced by prioritized sampling, importance-sampling (IS) weights are introduced in the loss function:
$$\mathcal{L}(\theta) = \mathbb{E}_{i \sim P}\left[ w_i \, \delta_i^{2} \right],$$
where the IS weight $w_j$ is computed as
$$w_j = \left( \frac{1}{N_{\text{buffer}}} \cdot \frac{1}{P(j)} \right)^{\beta},$$
where $P(j)$ is the sampling probability of transition $j$, $N_{\text{buffer}}$ is the number of transitions in the replay buffer, and $\beta$ controls the degree of importance correction, which is typically annealed from a small initial value to 1 during training.
This integration of PER improves sample efficiency and enables the UAV to focus more effectively on transitions that contribute significantly to learning progress.
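The following sketch shows the standard proportional-prioritization computation in this scheme, using a simple array-based buffer rather than the sum-tree structure usually employed for efficiency; default hyperparameter values are illustrative.

```python
import numpy as np

def sample_prioritized(td_errors: np.ndarray, batch_size: int,
                       alpha: float = 0.6, beta: float = 0.4, eps: float = 1e-6):
    """Return sampled indices and importance-sampling weights from |TD error| priorities."""
    priorities = (np.abs(td_errors) + eps) ** alpha
    probs = priorities / priorities.sum()               # P(i) proportional to |delta_i|^alpha
    idx = np.random.choice(len(td_errors), size=batch_size, p=probs)
    weights = (len(td_errors) * probs[idx]) ** (-beta)   # importance-sampling correction
    weights /= weights.max()                             # normalize for stability
    return idx, weights
```

During training, the weighted squared TD errors are averaged to form the loss, and beta is annealed toward 1 as described above.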
Remark 1. As we know, end-to-end multi-agent reinforcement learning often suffers from high training complexity and path conflicts in large environments. To overcome these issues, we propose a hierarchical “DARP + local RL” framework, where DARP first divides the global workspace into connected and balanced subregions, and each UAV then learns an independent DQN-based policy within its assigned region. This integration reduces inter-UAV conflicts, accelerates policy convergence, and improves training stability by localizing the learning space and enabling parallel policy training.
4. Experiments and Evaluations
To validate the effectiveness of the proposed multi-UAV coverage path planning method based on DARP-based region partitioning and the improved Dueling DQN strategy, a series of simulation experiments were conducted. The system performance was evaluated from multiple perspectives, including coverage rate and path redundancy.
4.1. Experimental Settings and Parameter Configuration
The experiments in this study were conducted in a custom-designed 2D grid map environment, which includes multiple obstacles, no-fly zones, and inaccessible areas. The DARP algorithm was employed to partition the entire task area into several non-overlapping sub-regions, which were then assigned to multiple UAVs to perform local coverage tasks independently.
All experiments were implemented in Python 3.10 using PyTorch 2.5.1, with the hardware featuring an Intel Core i5-12600KF CPU (Intel Corporation, Santa Clara, CA, USA) and an NVIDIA GeForce RTX 3070 Ti GPU (NVIDIA Corporation, Santa Clara, CA, USA). Each UAV's path planning policy was trained separately using the proposed improved Dueling DQN network. The parameter settings used in the training are listed in
Table 2.
In each experiment, the environment size was set to . UAV starting locations were randomized within non-obstacle and non-no-fly-zone cells to ensure each episode began with a valid initial state. A unified random seed (seed = 42) was used for all randomization processes (obstacle placement, UAV initialization, network weight initialization) to eliminate random variability and enable result reproduction.
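For reference, the unified seeding described above corresponds to fixing all random number generators at the start of each run, for example as follows; the helper name is illustrative.

```python
import random
import numpy as np
import torch

def set_global_seed(seed: int = 42) -> None:
    """Fix all sources of randomness (obstacle layout, UAV initialization, network weights)."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
```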
4.2. Experimental Results and Visual Analysis
The coverage path trajectories produced by different path planning strategies in the same obstacle map environment are visualized in
Figure 4. The results in
Figure 4 show the following: (a) Improved DQN method: the trajectory sweeps uniformly from the edge toward the center, effectively covering every area with almost no crossings or overlaps, reflecting good global exploration ability and coverage efficiency. (b) Initial DQN: the trajectory contains a large number of local loops and frequent revisits, especially near obstacles, indicating that the policy tends to fall into local optima. (c) Boustrophedon method: the path follows a regular "Z"-shaped pattern but lacks adaptation to the complex obstacle layout, leaving some areas uncovered. (d) Inner spiral coverage method: the path is compact but exhibits significant blind spots around irregular obstacles and map edges. (e) Spanning Tree Coverage (STC) method: the paths are markedly redundant, with many invalid repetitions, especially at region edges and branch nodes, resulting in high overlap and path waste.
To evaluate the adaptability of the improved DQN method under varied environmental complexities,
Figure 5 visualizes its coverage trajectories in maps with different sizes and obstacle densities.
From the overall performance in
Figure 4 and
Figure 5, the improved DQN not only covers a wider area but also produces smoother and more reasonably distributed paths, significantly outperforming both the traditional methods and the unimproved network.
4.3. Analysis of Experimental Results
The performance comparisons of UAVs when participating in tasks in a
map based on different methods are listed in
Table 3. In this study, all experiments were independently conducted 10 times, and the final results are presented as the mean values.
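A sketch of how these indicators can be computed from an executed trajectory is given below, assuming the coverage ratio is the fraction of traversable cells visited and the repetition ratio is the fraction of steps that land on an already-visited cell; these definitions, the function name, and the data structures are assumptions for illustration.

```python
def evaluate_trajectory(path: list[tuple[int, int]],
                        traversable_cells: set[tuple[int, int]]):
    """Compute coverage ratio, repetition ratio, and total steps for one coverage run."""
    visited: set[tuple[int, int]] = set()
    repeated_steps = 0
    for cell in path:
        if cell in visited:
            repeated_steps += 1           # step onto an already-covered cell
        visited.add(cell)
    coverage_ratio = len(visited & traversable_cells) / len(traversable_cells)
    repetition_ratio = repeated_steps / max(len(path), 1)
    return coverage_ratio, repetition_ratio, len(path)
```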
As can be seen from
Table 3, the improved DQN method proposed in this paper achieves significant advantages in the three core indicators. The coverage ratio reaches 0.99, which is about 11% higher than that of the initial DQN (0.88), indicating that the method completes the coverage of the mission area more thoroughly. At the same time, the repetition ratio drops from 0.32 to 0.04, a decrease of 87.5%, significantly reducing duplicate visits along the path and improving resource efficiency. In terms of task execution efficiency, the total number of path steps decreases from 478 to 361, a reduction of about 24.5%, reflecting a more streamlined path and more efficient task completion.
Furthermore, in terms of training efficiency, the improved DQN required approximately 3 h 57 min to reach convergence, compared to 4 h 11 min for the initial DQN, which demonstrates that the proposed method effectively accelerates the training process through its enhanced network structure and learning strategy. This acceleration reduces the overall computational cost while maintaining high performance across all evaluation metrics. Moreover, after training, the proposed model can operate in real time during the execution phase, further verifying its practicality and applicability in UAV coverage tasks that require rapid and adaptive decision-making.
Figure 6 shows the trend of coverage ratio and reward during the training process, which further verifies the advantages of the proposed method in terms of policy convergence speed and performance stability. The results in
Figure 6 show that the improved DQN exhibits a faster learning speed at the early stage of training, and its coverage-rate and reward curves are significantly higher than those of the initial DQN, indicating that the policy evolves toward efficient coverage from the beginning. In the middle and late stages of training, the improved DQN curves become stable and approach the optimal value with little fluctuation, indicating that the network has effectively converged and possesses better generalization ability and stability. In contrast, the coverage rate of the initial DQN increases slowly and its reward fluctuates strongly over a long period, suggesting that the policy gets stuck in local optima and learning is unstable. These trends show that the proposed method not only achieves better final performance but also makes the training process more efficient and reliable, further enhancing its practicability.
The underlying reasons for the above significant performance improvements are as follows: (1) The improved Dueling DQN network structure greatly improves the accuracy of state-value evaluation by introducing action semantics and an independent advantage estimation path, enabling agents to plan paths more reasonably and avoid repeated revisits and local stagnation. (2) The action encoding mechanism transforms the original discrete actions into spatial direction vectors and injects spatial semantics into the high-dimensional state representation, giving the policy network stronger direction recognition and decision-making capabilities and reducing policy blindness and boundary oscillation. (3) The prioritized experience replay (PER) mechanism guides the network toward the most strategically valuable transitions by preferentially learning from samples with high TD error, which significantly accelerates training, reduces invalid exploration, and directly explains the rapid rise of the reward curve in
Figure 6. (4) Map preprocessing (DFS) and DARP task division work together to construct a cleaner and more explicit training input state, eliminating the interference caused by unreachable areas and achieving regional connectivity and task-balanced distribution, creating ideal environmental conditions for each agent's policy training.
4.4. Ablation Experiment
4.4.1. Structural Module Ablation
This section studies the influence of five key modules in the proposed architecture by removing them one at a time: (i) map preprocessing (DFS), (ii) region partitioning (DARP), (iii) prioritized experience replay (PER), (iv) action encoding, and (v) Dueling Architecture Improvement. The experimental results are summarized in
Table 4.
The results clearly show that each component contributes significantly to the system’s effectiveness: (1) Removing DFS preprocessing leads to the most severe performance drop. Coverage ratio declines to 0.91, and repetition ratio triples to 0.12. This is because unreachable regions interfere with learning and cause the agent to waste actions exploring invalid paths. (2) Without DARP, region assignments become imbalanced and disconnected, resulting in increased path overlap and coverage inefficiency. (3) Disabling PER reduces learning efficiency. Although the final coverage ratio remains acceptable (0.95), the repetition ratio rises, and more training steps are required, indicating slower convergence and less efficient exploration. (4) Excluding action encoding impairs the UAV’s understanding of directional semantics. This leads to suboptimal decision-making, especially in spatially constrained environments, causing higher redundancy (0.06) and longer paths. (5) Using the standard Dueling DQN instead of the improved architecture results in a less accurate estimation of Q-values due to insufficient modeling of action context. This degrades policy quality, with a repetition ratio of 0.08 and a longer execution path (368 steps).
4.4.2. Reward Function Ablation
To assess the impact of the task-specific reward $r_t^{\text{task}}$, this article further tests the UAV's performance after removing its two components: the boundary reward and the milestone reward. The results are shown in
Table 5 and
Figure 7.
Table 5 summarizes the final task performance of the UAV under different reward settings. When the boundary reward is removed, the repetition ratio increases significantly (0.04 → 0.09), indicating that the UAV tends to hover around central regions and neglect less accessible border areas. Similarly, the absence of the milestone reward reduces the UAV’s motivation to make long-range progress, which results in a higher repetition ratio (0.07) and an inability to reduce total steps. The removal of both components leads to a noticeable degradation in coverage (down to 0.93) and the worst overall efficiency.
Figure 7 illustrates the learning curves of coverage ratio and cumulative reward throughout training under different reward settings. The full reward setting shows the fastest growth in both metrics, reaching convergence earlier and maintaining stability over time. In contrast, curves under ablated reward settings either grow more slowly, plateau at lower levels, or exhibit larger fluctuations, suggesting weaker learning signals and delayed convergence.
Combining
Table 5 and
Figure 7, we can find that (1) the boundary exploration reward effectively reduces “detour” behavior and improves the coverage quality at the edges and (2) the coverage progress milestone reward has a significant positive effect on breaking sparse rewards and accelerating training convergence.
4.5. Experiment with Different Numbers of UAVs Under a Large Environment
To further test the scalability of the proposed method with varying numbers of UAVs, experiments were conducted within the
grid environment. The results are shown in
Table 6.
Table 6 shows that as the number of UAVs increases, the average number of steps required per UAV decreases, while the coverage ratio remains consistently high (above 0.97), and the repetition ratio steadily decreases. This indicates that the system achieves better parallelism and division of labor with more UAVs, while effectively avoiding redundant coverage and path conflicts.
Figure 8 visualizes the coverage paths of seven UAVs under the DARP-based region division and improved DQN strategy. The paths are clearly distributed in distinct subregions without overlap, demonstrating that the region division is effective and the policy ensures cooperative, non-conflicting coverage.
5. Conclusions
This paper focuses on the multi-UAV complete coverage path planning problem and proposes a cooperative coverage method that combines DARP regional partitioning with an improved Dueling DQN reinforcement learning structure. The aim is to improve path efficiency and task completion in large-scale obstacle environments. In this study, a map preprocessing module is proposed, which significantly reduces the ineffective interference in the state space and policy learning. Then, the dynamic task allocation mechanism is presented, to perform load-balanced regional partitioning of the map, making multi-UAV task collaboration more efficient and reasonable. In addition, an improved reinforcement learning structure is designed to improve policy accuracy and training convergence. The experimental results show that the proposed method performs excellently in static complex maps and also demonstrates good scalability and modularity, making it suitable for a broader range of multi-UAV task scenarios.
Although the proposed method has achieved promising results in simulations, there are still limitations in practical engineering applications. Some parameters, such as the reward values and their coefficients, were tuned for the simulation environment, and their adaptability to real, complex scenarios has not been verified. In addition, key dynamic factors unique to unmanned aerial vehicles (UAVs), including battery constraints, three-dimensional navigation, weak communication, wind interference, and sensor noise, have not been considered. Moreover, the current model simplifies vehicle motion by assuming equal cost for all directional movements, whereas in real UAV operations such maneuvers typically incur higher energy and time costs. Future work will incorporate strafing and turning penalties and other dynamic constraints to enhance the physical realism and practical applicability of the proposed method.